= scRUBYt! Changelog
== 0.2.7
=== 15th April, 2007
=changes:
* [NEW] download pattern: download the file pointed to by the parent pattern
* [NEW] Checking checkboxes
* [NEW] Basic authentication support
* [NEW] Default values for missing elements
* [NEW] Possibility to resolve relative paths against a custom url
* [NEW] First simple version of to_csv and to_hash
* [NEW] Complete rewrite of the exporting system (Credit: Neelance)
* [NEW] First version of smart regular expressions: they are constructed
  from examples, just like ordinary patterns are (Credit: Neelance)
* [NEW] Possibility to click the n-th link
* [FIX] Clicking on links using scRUBYt's advanced example lookup
* [NEW] Forcing writing the text of non-leaf nodes with :write_text => true
  (see the sketch after this list)
* [NEW] Possibility to set a custom user agent; the default user agent is
  now Microsoft IE6
* [FIX] Fixed crawling to detail pages in case of leaving the original
  site (Credit: Michael Mazour)
* [FIX] Fixed the '//' problem - if the relative url contained two
  slashes, fetching failed
* [FIX] scRUBYt assumed that documents have a list of nested elements
  (Credit: Rick Bradley)
* [FIX] Crawling to detail pages also works if the parent pattern is
  a string pattern
* [FIX] Shortcut url fixed again
* [FIX] Regexp pattern fixed in case its parent was a string
* [FIX] Refactored the core classes; lots of bugfixes and stabilization
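A minimal sketch of the :write_text option mentioned above. The URL and
pattern names are invented for illustration, and the output call just
follows the shape of the usual extractor examples - treat it as an
assumption, not canonical API:

  require 'rubygems'
  require 'scrubyt'

  book_data = Scrubyt::Extractor.define do
    fetch 'http://www.example.com/books.html'     # hypothetical page

    # :write_text => true forces the text of this non-leaf pattern
    # to be written to the output alongside its children
    book 'Example Book Title', :write_text => true do
      author 'Example Author'
    end
  end

  book_data.to_xml.write($stdout, 1)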
== 0.2.6
=== 22nd March, 2007
The mission of this release was to add even more powerful features,
like crawling to detail pages or compound example specification,
as well as fixing the most frequently popping-up bugs. Scraping
of concrete sites is more and more frequently the driver of new
features and bugfixes, which in my opinion means that the
framework is beginning to make sense: from a shiny toy which
looks cool and everybody wants to play with, it is moving
towards a tool which you reach for if you seriously want
to scrape a site.
The new stuff in this release is 99% scraping related - if
you are looking for new features in the navigation part,
the next version will probably be for you, where I will
concentrate more on adding new widgets and possibilities
to the navigation process. FireWatir integration is very
close, too - perhaps the very next release will already
support FireWatir navigation!
=changes:
* [NEW] Automatically crawling to and extracting from detail pages
* [NEW] Compound example specification: so far the example of a pattern had to be a string.
  Now it can be a hash as well, like {:contains => /\d\d-\d/, :begins_with => 'Telephone'}
  (see the sketch after this list)
* [NEW] More sophisticated example specification: it is possible to use a regexp as well,
  and you need not specify the whole content of the node (though that is still possible,
  of course) - nodes that contain the string or match the regexp will be returned, too
* [NEW] Possibility to force writing text in case of non-leaf nodes
* [NEW] Crawling to the next page now possible via image links as well
* [NEW] Possibility to define examples for any pattern (before it did not make sense for ancestors)
* [NEW] Implementation of crawling to the next page with different methods
* [NEW] Heuristics: if something ends with _url, it is a shortcut for:
    some_url 'href', :type => :attribute
  (also shown in the sketch after this list)
* [FIX] Crawling to the next page (the broken google example): if the next
  link text is not an <a>, traverse down until the <a> is found; if it is
  still not found, traverse up until it is found
* [FIX] Crawling to next pages does not break if the next link is greyed out
  (or is otherwise present but has no href attribute) (Credit: Robert Au)
* [FIX] DRY-ed next link lookup - it should be much more robust now as it uses the 'standard' example lookup
* [NEW] Correct exporting of detail page extractors
* [NEW] Added more powerful XPath regexp (Credit: Karol Hosiawa)
* [NEW] New examples for the new features
* [FIX] Tons of bugfixes, new blackbox and unit tests, refactoring and stabilization
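A sketch of the compound example specification and the _url shortcut
together. The site, the pattern names and the example strings are all
made up; only the hash example syntax and the _url expansion come from
the entries above:

  require 'rubygems'
  require 'scrubyt'

  contact_data = Scrubyt::Extractor.define do
    fetch 'http://www.example.com/contacts.html'   # hypothetical URL

    contact 'John Smith' do                        # made-up example
      # compound example: the node must contain a phone-like string
      # AND begin with 'Telephone'
      phone :contains => /\d\d-\d/, :begins_with => 'Telephone'
      # ends with _url, so per the heuristic above this is a shortcut for:
      #   homepage_url 'href', :type => :attribute
      homepage_url
    end
  end

  contact_data.to_xml.write($stdout, 1)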
== 0.2.3
=== 20th February, 2007
Thanks to the feedback from all of you, I managed to find a lot of bugs as well as write up a nice feature request list. The bugs are mostly fixed, some shiny new features have been added, and stability was also improved by adding new tests and thoroughly refactoring the whole codebase.
The new features make this release much more powerful than the previous one. Sites requiring login, submitting forms with a button click, filling text areas, dealing with variable-size results, smart handling of attribute lookup, https, custom proxy settings and tons of bugfixes make this release capable of doing much, much more than was possible in 0.2.0.
I have also added some shiny new examples - scraping reddit and del.icio.us, rubyforge login, and automatic wordpress commenting, for example.
=changes:
* [FIX] Cookies (and other stuff) are now taken into consideration
* [NEW] select_indices feature. Example:
    table do
      (row '1').select_indices(:last)
    end
  This will select only the last row; it is also possible to specify a
  Range, an array of indices, or other constants like :first, :every_odd
  etc. More to come in the future!
* [FIX] digg.com next page problem fixed
* [FIX] Fetching of https sites
* [FIX] Next page works incorrectly when given an absolute path
* [FIX] Fixing exporting if the pattern parameters are parenthesized
* [NEW] Possibility to submit forms by clicking a button
* [NEW] Added new unit test suite: pattern_test
* [NEW] Possibility to set a proxy for fetching the input document
* [NEW] Added possibility to choose an option from a selection list (Credit: Zaheed Haque)
* [FIX] Image pattern example lookup fix
* [NEW] Possibility to prefilter the document before passing it to Hpricot (Credit: Demitrious Kelly)
* [FIX] corrected gem dependencies (Credit: Tim Fletcher)
* [FIX] remove duplicates only if there are multiple examples present
* [NEW] new examples: wordpress comment (Credit: Zaheed Haque), rubyforge login, del.icio.us, reddit and more (a login-style navigation sketch follows this list)
* [FIX] if there is no scraper defined, exit with a message rather than raise an exception
* [NEW] smart handling of attribute lookup: try to look up the attribute in the parent, but if it is not there, traverse up until it is found (this is useful e.g. if an image is inside a span and the span is inside an <a>)
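The login-related items above belong to the navigation part of the DSL.
A minimal sketch, assuming a hypothetical login page and field names -
fetch / fill_textfield / submit are the steps the bundled login examples
revolve around:

  require 'rubygems'
  require 'scrubyt'

  login_data = Scrubyt::Extractor.define do
    fetch          'http://www.example.com/login'  # hypothetical URL
    fill_textfield 'username', 'my_user'           # hypothetical field names
    fill_textfield 'password', 'my_password'
    submit

    # scraping patterns for the page behind the login would follow here
    greeting 'Welcome, my_user!'
  end

  login_data.to_xml.write($stdout, 1)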
== 0.2.0
=== 30th January, 2007
The first ever public release, 0.2.0 is out! I would say the feature set is impressive, though the reliability still needs to be improved, and the whole thing needs to be tested, tested and tested thoroughly. This is not yet a release which you just pull out of the box and which works under any circumstances - however, the major bugs are fixed and the whole thing is in a good-enough(TM) state, I guess.
=changes:
* better form detection heuristics
* report message if there are absolutely no results
* lots of bugfixes
* fixed amazon_data.books[0].item[0].title[0] style output access
and implemented it correctly in case of crawling as well
* fixed: /body/div/h3 was not detected as an XPath
* fixed a crawling problem (improved url joining heuristics)
* fixed blackbox test runner - no more platform dependent code
* fixed exporting bug: swapped exported XPaths in the case of no example present
* fixed exporting bug: capturing \W (non-word character) after the
  pattern name; this way we can distinguish pattern names where one
  name is a substring of the other
* Evaluation stops if the example was not found - but not in the case
of next page link lookup
* google_data[0].link[0].url[0] style result lookup now works in the
case of more documents, too
* tons of other bugfixes
* overall stability fixes
* more blackbox tests
* more examples
== 0.1.9
=== 28th January, 2007
This is a preview release before the first real public release, 0.2.0. Basically everything planned for 0.2.0 is in; now a testing phase (with light bugfixing :-)) will follow, then 0.2.0 will be released.
=changes:
* Possibility to specify multiple examples (hence a pattern can have multiple filters)
* Enhanced heuristics for example text detection
* First version of algorithm to remove dupes resulting from multiple examples
* empty XML leaf nodes are not written
* new examples
* TONS of bugfixes
== 0.1
=== 15th January, 2007
First pre-alpha (non-public) release
This release was made more for myself (to try and test rubyforge, gems, etc.) than for the community at this point.
It has a fairly nice set of features, but it still needs a lot of testing and stabilizing before it will be really usable.
* Navigation:
* fetching pages
* clicking links
* filling input fields
* submitting forms
* automatically passing the document to the scraping
* both files and http:// support
* automatic crawling
* Scraping:
* Fairly powerful DSL to describe the full scraping process (sketched below)
* Automatic navigation with WWW::Mechanize
* Automatic scraping through examples with Hpricot
* automatic recursive scraping through the next button
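To give a feel for the DSL described above, here is a sketch of the full
flow - navigation, example-driven scraping, then crawling through the
next link. It follows the shape of the classic google example; the exact
option names (e.g. :limit) are assumptions rather than guaranteed API:

  require 'rubygems'
  require 'scrubyt'

  google_data = Scrubyt::Extractor.define do
    # navigation: fetch the page, fill in the search box, submit the form
    fetch          'http://www.google.com/ncr'
    fill_textfield 'q', 'ruby'
    submit

    # scraping: patterns are defined by example strings; child patterns
    # are looked up relative to their parent
    result 'Ruby Programming Language' do
      url 'href', :type => :attribute
    end

    # automatic recursive scraping through the next button
    next_page 'Next', :limit => 3
  end

  google_data.to_xml.write($stdout, 1)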