Raspar is an HTML parsing library that converts HTML pages into Ruby objects by defining a map of ‘css’ or ‘xpath’ selectors. This gem can also manage parsers for multiple websites.
The sample output looks something like this:

    {
      product: [
        <Raspar::Result:0x007ffc91e4d640 @attrs => { :name=>"Test1", :price=>"10"}, @domain => "example.com", @name => :product>
        # ...
        # ...
      ]
    }
Why Raspar?
For almost every website that we parse, we need to customise the code that parses the page and converts the data into our own format. While doing this, we are likely to face some of the following problems in parsing HTML.
- An HTML page may contain a single item or multiple items with the same CSS selectors.
- We need to collect different types of data from a single page, for example products, offers, comments, etc.
- Within a single website, the HTML structure can differ from page to page. For example, on one page the product name could have the CSS selector .product-name while on another page of the same website it may have the CSS selector .pname
- Sometimes we want to collect particular attributes as an array. For example, in the product section, we may want the various product features that are defined in li tags as an array.
- Some attributes are common to all pages. For example, the product comparison page has the same name and description, but other attributes differ from page to page.
Raspar helps to solve all these problems!
Usage
You can define a parser as shown below. In this example, we are parsing a currency exchange rate website and fetching the country, the currency and its code.
    class CurrencyCodeParser
      include Raspar

      domain 'www.exchange-rate.com'

      collection :currency_code, 'table.currency-codes tr' do
        attr :country, 'td.country'
        attr :currency, 'td.currency'
        attr :code, 'td.code'
      end
    end
We first include the Raspar module and register the domain that is going to be parsed. Then we plan the parsing strategy. On this page, there is a currency code for each country, so using the collection method we can collect the values for every currency code row. We can also define multiple collections. For example, on a page containing products and brands, we can define two collections, one for products and another for brands.
NOTE: You can also set multiple CSS selectors in order of priority. In the example below, if the .country element is not found, then .nation will be checked and returned.
attr :country, '.country, .nation'
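The priority rule can be pictured with a small plain-Ruby sketch. It does not use Raspar or a real HTML parser: the `first_match` helper is hypothetical, the hash stands in for "text found under this selector" on one page, and Raspar's internals may well differ.

```ruby
# Hypothetical helper: try each comma-separated selector in order and return
# the first non-empty text. `lookup` is a stand-in for querying a real page.
def first_match(selectors, lookup)
  selectors.split(',').map(&:strip).each do |selector|
    text = lookup[selector]
    return text unless text.nil? || text.empty?
  end
  nil
end

# Made-up pages: one has both elements, the other only has .nation
PAGE_WITH_COUNTRY = { '.country' => 'Japan', '.nation' => 'Nippon' }
PAGE_WITH_NATION  = { '.nation' => 'India' }

puts first_match('.country, .nation', PAGE_WITH_COUNTRY) # .country wins
puts first_match('.country, .nation', PAGE_WITH_NATION)  # falls back to .nation
```

The point of the sketch is only that the first selector that yields a value wins, and later selectors act as fallbacks.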
collection: This takes two arguments and a block of code: the collection name, the HTML selector and the block in which the attributes are defined. In the example above, ‘table.currency-codes tr’ is a selector that contains all three attributes: country, currency and code. So, the parser collects all ‘table.currency-codes tr’ elements and builds a result object from each one using the selectors defined for its attributes.
attr: This takes two mandatory arguments: the name and the HTML selector. It can take an optional third options argument that helps in formatting the value or getting a particular property of the HTML element. Potential options are :prop and :eval. If no options are defined, attr returns the text value of the HTML element.
:prop: This will return the value of the named property of the selected element. In the case below, we want the src property of the img tag.
attr :image, '.lg_photo img', prop: 'src'
:eval: This evaluates the HTML element and processes it. This can be a Proc or a method name (i.e. a symbol). Remember, the method or Proc must take two arguments: the element’s text and the element itself. For example,
attr :address, '.address', eval: Proc.new{|text, ele| text.split(':').last}
or
    attr :address, '.address', eval: :parse_address

    def parse_address(text, ele)
      text.split(':').last
    end
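To see what such a callback does to a value, here is a stand-alone sketch that calls the same logic directly, outside Raspar. The sample string is made up, and `nil` is passed for the element because these callbacks ignore it. The `strip` in the method form is an addition, not part of the original example.

```ruby
# The same Proc as in the example above, stored in a constant so it can be
# called directly. The second argument (the element) is unused here.
LAST_SEGMENT = Proc.new { |text, _ele| text.split(':').last }

# Method form of the same callback. `strip` is added here to drop the
# whitespace that remains after splitting on the colon.
def parse_address(text, _ele)
  text.split(':').last.strip
end

p LAST_SEGMENT.call('Address: 12 Main Street', nil) # => " 12 Main Street"
p parse_address('Address: 12 Main Street', nil)     # => "12 Main Street"
```

Note the leading space in the first result: splitting on the colon alone keeps whatever whitespace followed it.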
If we need the attribute as an array, we can simply do the following:
attr :specifications, '.specs li', as: :array
NOTE: If attr is defined outside any collection block, it is considered a common attribute and will be included in all collections!
The parsing logic
Here is an example of parsing a particular page in the domain we have specified. In the example below, Raspar will automatically load the parser registered for the domain, in our case CurrencyCodeParser. We don’t need to specify this in our code. The advantage is that we can customise or add new parsers at will, as long as we specify the right domain!
    require 'rest-client'
    require 'raspar'

    url = 'http://www.exchange-rate.com/currency-list.html'

    # Fetch the HTML page using RestClient
    html = RestClient.get(url).to_str

    Raspar.parse(url, html).each { |c| p c }
This will get us the following result:
    {
      currency_code: [
        #<Raspar::Result:0x007ffc91e4d640 @attrs={:country=>"USA", :currency=>"USD", :code =>"$"}>,
        #<Raspar::Result:0x007ffc91e57be0 @attrs={:country=>"Japan", :currency=>"¥", :code =>"JPY"}>,
        ...
        ...
      ]
    }
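Since the result is a hash keyed by collection name, it can be flattened into plain hashes for storage or serialisation. The sketch below is only an illustration: `Result` is a hypothetical stand-in for Raspar::Result, the `attrs` reader is an assumption based on the `@attrs` instance variable visible in the output above, and `rows` is a made-up helper. Check the gem's actual accessors before relying on this.

```ruby
# Hypothetical stand-in for Raspar::Result; it only mimics the @attrs hash
# seen in the output above.
Result = Struct.new(:attrs)

# Sample data copied from the output above.
SAMPLE = {
  currency_code: [
    Result.new({ country: 'USA', currency: 'USD', code: '$' }),
    Result.new({ country: 'Japan', currency: '¥', code: 'JPY' })
  ]
}

# Hypothetical helper: one plain hash per result, tagged with its collection.
def rows(parsed)
  parsed.flat_map do |collection, results|
    results.map { |result| { collection: collection }.merge(result.attrs) }
  end
end

rows(SAMPLE).each { |row| p row }
```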
Alternate ways to create parsers
There are other ways to define a parser.
By passing a block
Here we don’t need to define a separate class; just an anonymous parser!
    Raspar.add('www.exchange-rate.com') do
      collection :currency_code, 'table.currency-codes tr' do
        attr :country, 'td.country'
        attr :currency, 'td.currency'
        attr :code, 'td.code'
      end
    end
By passing a hash
This can be helpful if we have a pre-defined selector map configured in the code or saved in our database, or if we want to add a map dynamically, for example from a JSON file or a web service.
    domain = 'http://www.leguide.com'

    selector_map = {
      collections: {
        product: {
          select: '.offers_list li',
          attrs: {
            image: { select: 'img', prop: 'src' },
            price: { select: '.price .euro.gopt', eval: :parse_price }
          }
        }
      }
    }
In the selector map above, we have defined a :parse_price method. Here is how we can add it to Raspar. We can also define more data processing helpers in the ParserHelper module as shown below.
    module ParserHelper
      def parse_price(val, ele)
        val.gsub(/,/, '.').to_f
      end
    end

    Raspar.add(domain, selector_map, ParserHelper)
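If the selector map lives in a JSON file or a database column, it has to be turned back into the symbol-keyed hash shown above. Here is a minimal sketch using Ruby's standard `json` library. The `load_selector_map` helper is hypothetical, and converting the `eval` value from a string to a symbol is an assumption about what Raspar expects, based on the `eval: :parse_price` form used in the map above.

```ruby
require 'json'

# The same selector map as above, as it might be stored in a JSON file.
SELECTOR_JSON = <<~JSON
  {
    "collections": {
      "product": {
        "select": ".offers_list li",
        "attrs": {
          "image": { "select": "img", "prop": "src" },
          "price": { "select": ".price .euro.gopt", "eval": "parse_price" }
        }
      }
    }
  }
JSON

# Hypothetical helper: parse the JSON with symbol keys, then turn each
# "eval" string into a symbol so it can name a helper method.
def load_selector_map(json)
  map = JSON.parse(json, symbolize_names: true)
  map[:collections].each_value do |collection|
    collection[:attrs].each_value do |options|
      options[:eval] = options[:eval].to_sym if options[:eval].is_a?(String)
    end
  end
  map
end

selector_map = load_selector_map(SELECTOR_JSON)
p selector_map[:collections][:product][:attrs][:price]
```

The resulting hash could then be passed to Raspar.add together with the domain and the helper module, as in the example above.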
This gem is available on RubyGems and on GitHub: Raspar
You can check out various examples too.
Go forth and parse!