A DSL that wraps around Nokogiri and is used by stretched.io for web scraping.
Add this line to your application's Gemfile:
gem 'buzzsaw'
And then execute:
$ bundle
Or install it yourself as:
$ gem install buzzsaw
This gem is what stretched.io uses for its DSL -- both the JSON-based one and the scripting one. You can use it independently, though.
Most of the time when I'm scraping the web, I just want to find the first
bit of matching text at a matching xpath. That's why find_by_xpath is the workhorse
of this query DSL.
This method takes the following arguments:
- xpath: The XPath query string of the nodes that you want to search for a given pattern. This argument is mandatory.
- match: A regex that the text of the XPath node should match.
- capture: A regex that pulls only the matching text out of the matched string and returns it.
- pattern: If the pattern argument is present, then match = capture = pattern.
- label: If this is present, then any positive match will return the string supplied by this argument.
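To make the interaction between these arguments concrete, here is a minimal plain-Ruby sketch that applies the same rules to an array of strings standing in for the text of the matched nodes. The `first_match` helper is hypothetical and is not part of Buzzsaw; it only mirrors the behavior described in the list above, and it assumes that a label takes precedence over captured text.

```ruby
# Hypothetical helper, NOT part of Buzzsaw: applies the match/capture/
# pattern/label rules described above to a plain array of node texts.
def first_match(texts, match: nil, capture: nil, pattern: nil, label: nil)
  # `pattern` is shorthand for setting both `match` and `capture`.
  match = capture = pattern if pattern

  # Take the first text that satisfies `match` (or simply the first text).
  text = texts.find { |t| match.nil? || t =~ match }
  return nil unless text

  # Assumption: a label replaces whatever text was found.
  return label if label

  # With `capture`, return only the captured portion; otherwise the whole text.
  capture ? text[capture] : text
end

items = ["Status: In-stock", "UPC: 00110012232", "SKU: ITEM-2", "Price: $12.99"]

first_match(items, pattern: /\$[0-9]+\.[0-9]+/)                    # => "$12.99"
first_match(items, match: /\$[0-9]+\.[0-9]+/)                      # => "Price: $12.99"
first_match(items, match: /\$/, capture: /[0-9]+\.[0-9]+/)         # => "12.99"
first_match(items, pattern: /Status: In-stock/, label: "in_stock") # => "in_stock"
```

The sketch leaves out the XPath step entirely; in Buzzsaw the candidate texts come from the nodes selected by the xpath argument.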
Here's a look at how find_by_xpath works in practice.
Let's say that you want to extract the price of product2 from the following bit of HTML in products.html:
<div id="product1-details">
<ul>
<li>Status: In-stock</li>
<li>UPC: 00110012232</li>
<li>Price: $12.99</li>
</ul>
</div>
<div id="product2-details">
<ul>
<li>Status: In-stock</li>
<li>UPC: 00110012232</li>
<li>SKU: ITEM-2</li>
<li>Price: $12.99</li>
</ul>
</div>
You might use find_by_xpath as follows:
source = File.read("products.html")
buzz = Buzzsaw::Document.new(source, format: :html)
buzz.find_by_xpath(
xpath: '//div[@id="product2-details"]//li',
pattern: /\$[0-9]+\.[0-9]+/
)
#=> "$12.99"
If for whatever reason you wanted that entire price node, you could do:
buzz.find_by_xpath(
xpath: '//div[@id="product2-details"]//li',
match: /\$[0-9]+\.[0-9]+/
)
#=> "Price: $12.99"
Now let's say that you only want "12.99", without the dollar sign. You could do that as follows:
buzz.find_by_xpath(
xpath: '//div[@id="product2-details"]//li',
match: /\$[0-9]+\.[0-9]+/,
capture: /[0-9]+\.[0-9]+/
)
#=> "12.99"
Sometimes you might want to return a specific bit of text if you find a match on a page.
This can be done with the label argument.
For instance, what if we want the find_by_xpath function to return the token
in_stock when it finds that the item is in stock? We'd do that as follows:
buzz.find_by_xpath(
xpath: '//div[@id="product2-details"]//li',
pattern: /Status: In-stock/,
label: 'in_stock'
)
#=> in_stock
These examples are contrived, but you get the idea.
Consider the list of product details above. Let's say that I want
to capture and store those details as a human-readable string. If I have a Nokogiri::HTML::Document called
doc with the above HTML in it, then look at the following:
doc.xpath("//div[@id='product2-details']//li").text
#=> Status: In-stockUPC: 00110012232SKU: ITEM-2Price: $12.99
All of the nodes are crammed together, but it would be nice if I could insert
a space in between them. That's one place where collect_by_xpath helps.
buzz.collect_by_xpath(
xpath: "//div[@id='product2-details']//li",
join: ' '
)
#=> Status: In-stock UPC: 00110012232 SKU: ITEM-2 Price: $12.99
The collect_by_xpath function finds all of the matching nodes and concatenates
their text, using the character(s) supplied by the optional join argument as a delimiter.
This method also takes the same match, capture, and pattern arguments
as find_by_xpath, and they do the same thing. You can use the match argument to
collect only matching nodes, and the capture argument to filter the final string.
Finally, this function also takes the label argument.
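The collecting behavior described above can be sketched in plain Ruby as well. As before, the `collect_texts` helper is hypothetical and is not part of Buzzsaw; it works on an array of strings in place of node text, and it assumes that a missing join means no delimiter and that a label replaces the collected string.

```ruby
# Hypothetical helper, NOT part of Buzzsaw: keeps the texts that satisfy
# `match`, joins them, then applies `capture` to the joined string.
def collect_texts(texts, match: nil, capture: nil, pattern: nil, join: nil, label: nil)
  # `pattern` is shorthand for setting both `match` and `capture`.
  match = capture = pattern if pattern

  # Keep only the texts that satisfy `match` (or all of them).
  selected = texts.select { |t| match.nil? || t =~ match }
  return nil if selected.empty?

  # Assumption: a label replaces the collected string.
  return label if label

  # Assumption: no `join` argument means the texts are concatenated directly.
  joined = selected.join(join.to_s)
  capture ? joined[capture] : joined
end

items = ["Status: In-stock", "UPC: 00110012232", "SKU: ITEM-2", "Price: $12.99"]

collect_texts(items, join: " ")
# => "Status: In-stock UPC: 00110012232 SKU: ITEM-2 Price: $12.99"
collect_texts(items, match: /\A(UPC|SKU):/, join: ", ")
# => "UPC: 00110012232, SKU: ITEM-2"
```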
The find_in_table method is useful for pulling text out of tables, one of the most annoying
jobs in web scraping. It takes the following arguments:
- row: Either a regex for matching a row, or an integer row index. This argument is mandatory.
- column: Either a regex for matching a column, or an integer column index.
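The idea of addressing a cell by either a regex or an integer index can be sketched in plain Ruby. This is only an illustration under stated assumptions, not Buzzsaw's implementation: the `cell_at` helper is hypothetical, the table is a made-up array of rows, a regex row is assumed to select the first row containing a matching cell, and a regex column is assumed to be matched against the cells of the first (header) row. The text above does not say how find_in_table resolves a regex column, so treat that part as a guess.

```ruby
# Hypothetical helper, NOT part of Buzzsaw: looks up one cell in a table
# represented as an array of rows, each row an array of cell strings.
def cell_at(rows, row:, column: nil)
  # Integer row: use it as an index. Regexp row: first row with a matching cell.
  cells = row.is_a?(Integer) ? rows[row] : rows.find { |r| r.any? { |c| c =~ row } }
  return nil unless cells
  return cells if column.nil?

  # Integer column: use it as an index.
  # Regexp column (assumption): match it against the header row's cells.
  index = column.is_a?(Integer) ? column : rows.first.index { |h| h =~ column }
  index && cells[index]
end

# Made-up sample data for illustration only.
table = [
  ["Item",   "Status",   "Price"],
  ["ITEM-1", "In-stock", "$9.99"],
  ["ITEM-2", "In-stock", "$12.99"]
]

cell_at(table, row: /ITEM-2/, column: /Price/) # => "$12.99"
cell_at(table, row: 1, column: 0)              # => "ITEM-1"
```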
- Fork it ( https://github.com/jonstokes/Buzzsaw/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request