I needed to do some screen scraping using Ruby, so after a few Google searches I came across this post, which looked very promising. After a quick scan through the comments, scRUBYt looked even better.
Stumbling block #1: it took me a while to get it installed. It turns out the current version (0.3.4) is hard-wired to use RubyInline 3.6.3, but I had RubyInline 3.7.0 installed. This post gives the details on working around the issue; otherwise you will get an error something like: Gem::Exception: can’t activate RubyInline (= 3.6.3), already activated RubyInline-3.7.0].
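To see why the newer RubyInline triggers that error, here is a minimal sketch using RubyGems' own version classes. It only illustrates how an exact "= 3.6.3" requirement is evaluated; it is not the workaround from the linked post, and the variable names are mine.

```ruby
require 'rubygems'

# The exact-version requirement quoted in the error message above.
requirement = Gem::Requirement.new('= 3.6.3')

installed = Gem::Version.new('3.7.0') # the version I had installed
expected  = Gem::Version.new('3.6.3') # the version scRUBYt 0.3.4 asks for

# An "=" requirement matches only that exact version, so a newer
# release does not satisfy it even though it is "greater".
puts "3.7.0 satisfies '= 3.6.3'? #{requirement.satisfied_by?(installed)}"
puts "3.6.3 satisfies '= 3.6.3'? #{requirement.satisfied_by?(expected)}"
```

Once a different version of the gem is already activated, RubyGems cannot satisfy the exact requirement as well, which is presumably what the exception is reporting.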
Stumbling block #2: trying the simple example for scraping Google here, I got the following error: can’t convert Hash into String. So I tried commenting out some of the code, limiting it to just hitting the submit button. The result: no error, but no output either. After looking around on the scRUBYt forums, it turns out scRUBYt logging is not on by default, so I turned it on with Scrubyt.logger = Scrubyt::Logger.new and ran the code again: [ERROR] No extractor defined, exiting… OK, so I un-commented the code and ran it again now that I had logging on. Same error. I was obviously doing something wrong. I tried a couple of other code samples and got more fun error messages, such as: The error occurred while evaluating nil.example_type.
Well, it turns out I had gotten ahead of myself by calling this code from inside a Rake task. After moving the code into a regular Ruby file and running it with ruby instead of rake, everything works just great! I did not look into why the code fails from inside the Rake task, but I am guessing there is some conflict between the libraries. An exercise for another day…
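If the scraper still needs to be triggered from Rake, one option (a sketch I have not tried with scRUBYt itself) is to have the task launch the script in a fresh Ruby process, so that whatever Rake has already loaded cannot interfere. The helper name run_in_fresh_ruby is hypothetical, and the inline -e script merely stands in for the real scraper file.

```ruby
require 'open3'
require 'rbconfig'

# Runs the given arguments with the current Ruby executable in a new
# process and returns [combined stdout/stderr, success flag].
def run_in_fresh_ruby(*args)
  output, status = Open3.capture2e(RbConfig.ruby, *args)
  [output, status.success?]
end

# Stand-in for the real scraper script; in a Rake task you would pass
# the path of the standalone Ruby file instead of an inline -e script.
output, ok = run_in_fresh_ruby('-e', 'puts "scraper ran"')
puts output
puts "exit status ok: #{ok}"
```

The child process starts with a clean set of loaded libraries, which matches the observation that the code works when run directly with ruby.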
I am using scRUBYt, and below is my code to scrape Google.
require 'rubygems'
require 'scrubyt'

# Turn on logging (it is off by default)
Scrubyt.logger = Scrubyt::Logger.new

google_data = Scrubyt::Extractor.define do
  # Perform the action(s)
  fetch 'http://www.google.com/'
  fill_textfield 'q', 'ruby'
  submit

  # Construct the wrapper
  link "//div[3]/div/ol/li" do
    head "/h3[@class='r']"
    des "/div[@class='s']"
  end
  next_page "Next", :limit => 2
end
This will output something like this:
# Ruby Programming Language
# A dynamic, interpreted, open source programming language with a focus on simplicity and productivity. Site includes news, downloads, documentation, …www.ruby-lang.org/ - 12k - Cached - Similar pagesDownloadsDocumentationin Twenty MinutesWhat's RubyDownload RubyLibrariesAbout RubySecurityMore results from ruby-lang.org »
# Ruby (programming language) - Wikipedia, the free encyclopedia
# Ruby is a dynamic, reflective, general purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. …en.wikipedia.org/wiki/Ruby_(programming_language) - 118k - Cached - Similar pages
Since the div with class 's' contains both text and child nodes, I am getting all of the div's own text as well as the text of its child nodes.
How can I filter this out (I don't want the child nodes' text)? Can anybody help with this? What procedure should I follow?
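One possible approach, sketched here with Ruby's REXML library rather than scRUBYt (I am not sure which scRUBYt option, if any, does this directly), is to keep only the text nodes that are direct children of the div and skip the text inside nested elements. The markup below is a simplified, made-up stand-in for a Google result, and own_text is a hypothetical helper name.

```ruby
require 'rexml/document'

# Simplified, well-formed stand-in for one result's description block.
SNIPPET = <<~XML
  <div class="s">A dynamic, open source programming language. <span class="a">www.ruby-lang.org/ - 12k</span> <a href="http://example.com/cache">Cached</a></div>
XML

# Returns only the element's own text: the text nodes that are direct
# children, ignoring any text nested inside child elements.
def own_text(element)
  element.texts.map(&:value).join.strip
end

doc = REXML::Document.new(SNIPPET)
div = REXML::XPath.first(doc, "//div[@class='s']")

puts own_text(div)
```

The same idea can be expressed in XPath by selecting text() children of the div instead of the div itself. Real Google markup is rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.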