Raspar – Build an HTML parser in 5 minutes

Raspar is an HTML parsing library that parses HTML pages and converts them to Ruby objects using a map of CSS or XPath selectors that you define. The gem can also manage parsers for multiple websites.

The sample output looks something like this:

{ product: [ 
    <Raspar::Result:0x007ffc91e4d640 @attrs => { :name=>"Test1", :price=>"10"}, 
    @domain => "example.com", @name => :product> 
    # ... 
    # ... 
    ] 
}

Why Raspar?

For almost every website that we parse, we need to customise the code that parses the page and converts the data into our own format. While doing this, we typically run into some of the following problems when parsing HTML.

  • An HTML page may contain a single item or multiple items with the same CSS selectors.
  • We need to collect different types of data from a single page. For example, products, offers, comments etc.
  • For a single website, the HTML structure can differ from page to page. For example, on one page the product name could have the CSS selector .product-name, while on another page of the same website it may have the CSS selector .pname.
  • Sometimes we want to collect particular attributes as an array. For example, in the product section, we may want the various product features that are defined in li tags as an array.
  • Some attributes are shared. For example, a product comparison page has the same name and description throughout, while the other attributes differ.

Raspar helps to solve all these problems!

Usage

You can define a parser as shown below. In this example, we are parsing a currency exchange rate website and fetching the country, the currency and its code.

class CurrencyCodeParser
  include Raspar
  domain 'www.exchange-rate.com'

  collection :currency_code, 'table.currency-codes tr' do
    attr :country, 'td.country'
    attr :currency, 'td.currency'
    attr :code, 'td.code'
  end
end

We first include the Raspar module and register the domain that is going to be parsed. Then we plan the parsing strategy. On this page, there is one table row of currency data for each country, so with the collection method we can collect the values row by row. We can also define multiple collections. For example, on a page containing products and brands, we can define two collections, one for products and another for brands.

NOTE: You can also set multiple CSS selectors in order of priority. In the example below, if the .country element is not found, then .nation will be checked and returned.

attr :country, '.country, .nation'
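One way to picture the priority rule is to split the selector list on commas and return the text for the first selector that matches. The sketch below is not Raspar's implementation: first_match is a hypothetical helper, and a plain hash stands in for a parsed HTML row so that the example runs without any gem.

```ruby
# Hypothetical illustration of selector priority; not Raspar's internals.
# `row` stands in for a parsed HTML row: selector => text found there.
def first_match(row, selectors)
  selectors.split(',').map(&:strip).each do |selector|
    text = row[selector]
    # Skip selectors that matched nothing (nil) or only empty text.
    return text unless text.nil? || text.empty?
  end
  nil
end

row_with_country = { '.country' => 'Japan', '.nation' => 'Nippon' }
row_with_nation  = { '.nation' => 'India' }

first_match(row_with_country, '.country, .nation') # => "Japan"
first_match(row_with_nation,  '.country, .nation') # => "India"
```

The first selector wins whenever it matches, and the later ones act only as fallbacks.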

collection: This takes two arguments and a block of code: the collection name, the HTML selector and the block in which the attributes are defined. In the example above, ‘table.currency-codes tr’ is a selector for the element that contains all three attributes: country, currency and code. So, the parser collects all ‘table.currency-codes tr’ elements and builds a result object for each one using the selectors defined for its attributes.

attr: This takes two mandatory arguments: the name and the HTML selector. It can take an optional third argument, a hash of options that helps in formatting the value or getting a particular property of the HTML element. Options include :prop and :eval. If no options are given, attr returns the text value of the HTML element.

:prop: This returns the value of the named property of the selected element. In the case below, we want the src property of the img tag.

attr :image, '.lg_photo img', prop: 'src'

:eval: This evaluates the HTML element and processes it. It can be a Proc or a method name (i.e. a symbol). Remember, the method or Proc must take two arguments: the element’s text and the element itself. For example,

attr :address, '.address', eval: Proc.new{|text, ele| text.split(':').last}

or

attr :address, '.address', eval: :parse_address

def parse_address(text, ele)
  text.split(':').last
end
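To make the two forms of :eval concrete, here is a minimal sketch of how a Proc or a Symbol could be dispatched. It is only an illustration under assumptions: apply_eval and AddressHelpers are hypothetical names, and Raspar's own internals may work differently.

```ruby
# Hypothetical dispatcher for an :eval option; not Raspar's actual code.
module AddressHelpers
  # Same two-argument shape as above: the text and the element.
  def self.parse_address(text, _ele)
    text.split(':').last.strip
  end
end

def apply_eval(evaluator, text, ele, helpers)
  case evaluator
  when Proc   then evaluator.call(text, ele)                  # inline Proc
  when Symbol then helpers.public_send(evaluator, text, ele)  # named method
  else text                                                   # no :eval given
  end
end

raw = 'Address: 12 Main Street'
strip_label = proc { |text, _ele| text.split(':').last.strip }

apply_eval(strip_label, raw, nil, AddressHelpers)    # => "12 Main Street"
apply_eval(:parse_address, raw, nil, AddressHelpers) # => "12 Main Street"
```

Both forms receive the same pair of arguments, which is why a Proc and a named method can be used interchangeably.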

If we need the attribute as an array, we can simply do the following:

attr :specifications, '.specs li', as: :array

NOTE: If attr is defined outside any collection block, it is considered a common attribute and will be included in all collections!
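The common-attribute rule can be pictured with a tiny stand-in DSL. The sketch below only records selectors and does not parse any HTML. TinySelectorDsl, ComparisonPageParser and selector_map are hypothetical names invented for this illustration, so it should not be read as Raspar's source.

```ruby
# Hypothetical mini-DSL that mimics the shape of `collection` and `attr`.
module TinySelectorDsl
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    def common_attrs
      @common_attrs ||= {}
    end

    def collections
      @collections ||= {}
    end

    # Outside a collection block, attr is stored as a common attribute.
    def attr(name, selector, options = {})
      target = @current_collection || common_attrs
      target[name] = options.merge(select: selector)
    end

    def collection(name, selector)
      @current_collection = {}
      yield
      collections[name] = { select: selector, attrs: @current_collection }
    ensure
      @current_collection = nil
    end

    # Common attributes are merged into every collection.
    def selector_map
      collections.transform_values do |c|
        c.merge(attrs: common_attrs.merge(c[:attrs]))
      end
    end
  end
end

class ComparisonPageParser
  include TinySelectorDsl

  attr :name, '.product-name' # common: defined outside any collection

  collection :offers, '.offers li' do
    attr :price, '.price'
  end

  collection :comments, '.comments li' do
    attr :author, '.author'
  end
end

ComparisonPageParser.selector_map[:offers][:attrs].keys   # => [:name, :price]
ComparisonPageParser.selector_map[:comments][:attrs].keys # => [:name, :author]
```

Merging at lookup time means a common attribute applies to every collection regardless of where in the class it was declared.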

The parsing logic

Here is an example of parsing a particular page in the domain we have specified. In the example below, Raspar automatically loads the parser for the URL’s domain, in our case CurrencyCodeParser. We don’t need to name the parser in our code. The advantage is that we can customise or add new parsers at will, as long as we specify the right domain!

require 'rest-client'

url = 'http://www.exchange-rate.com/currency-list.html'

# Use RestClient to fetch the HTML page
html = RestClient.get(url).to_str

Raspar.parse(url, html).each { |c| p c }

This will get us the following result:

{
  currency_code: [
    #<Raspar::Result:0x007ffc91e4d640
     @attrs={:country=>"USA", :currency=>"USD", :code =>"$"}>,
    #<Raspar::Result:0x007ffc91e57be0
     @attrs={:country=>"Japan", :currency=>"¥; ", :code =>"JPY"}>,
   ...
   ...
 ]
}
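The automatic parser selection described above can be pictured as a lookup keyed by host name. This is a minimal sketch under that assumption, using only the standard uri library. ParserRegistry and its methods are hypothetical and are not part of Raspar's API.

```ruby
require 'uri'

# Hypothetical registry keyed by host name; not Raspar's actual code.
class ParserRegistry
  def initialize
    @parsers = {}
  end

  def register(domain, parser)
    @parsers[normalize(domain)] = parser
  end

  def lookup(url)
    @parsers[normalize(url)]
  end

  private

  # Accepts either a bare host ('www.example.com') or a full URL.
  def normalize(domain_or_url)
    host = URI.parse(domain_or_url).host || domain_or_url
    host.downcase
  end
end

registry = ParserRegistry.new
registry.register('www.exchange-rate.com', :currency_code_parser)

registry.lookup('http://www.exchange-rate.com/currency-list.html')
# => :currency_code_parser
```

Because the lookup depends only on the host, adding support for a new site means registering one more parser. The calling code stays unchanged.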

Alternate ways to create parsers

There are other ways to define a parser.

By passing a block

Here we don’t need to define a separate class; just an anonymous parser!

Raspar.add('www.exchange-rate.com') do
  collection :currency_code, 'table.currency-codes tr' do
    attr :country, 'td.country'
    attr :currency, 'td.currency'
    attr :code, 'td.code'
  end
end

By passing a hash

This can be helpful if we have a pre-defined selector map configured in the code or saved in our database, or if we want to add a map dynamically, e.g. from a JSON file or a web service.

domain = 'http://www.leguide.com'
selector_map = {
  collections: {
    product: {
      select: '.offers_list li',
      attrs: {
        image: { select: 'img', prop: 'src'},
        price: { select: '.price .euro.gopt', eval: :parse_price}
      }
    }
  }
}

In the selector map above, we have referenced a :parse_price method. Here is how we can add it to Raspar. We can also define more data processing helpers in the ParserHelper module as shown below.

module ParserHelper
  def parse_price(val, ele)
    val.gsub(/,/, '.').to_f
  end
end

Raspar.add(domain, selector_map, ParserHelper)
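If the selector map comes from JSON, it has to be turned back into the same shape as the hand-written hash above. The sketch below assumes that Raspar expects symbol keys and a Symbol for :eval, as in that hash. The JSON payload is a made-up example, and only the standard json library is used.

```ruby
require 'json'

# Hypothetical JSON payload, e.g. read from a file or a web service.
json = <<~JSON
  {
    "collections": {
      "product": {
        "select": ".offers_list li",
        "attrs": {
          "image": { "select": "img", "prop": "src" },
          "price": { "select": ".price .euro.gopt", "eval": "parse_price" }
        }
      }
    }
  }
JSON

# symbolize_names gives symbol keys, matching the hand-written map.
selector_map = JSON.parse(json, symbolize_names: true)

# JSON has no symbol type, so turn each "eval" method name into a Symbol.
selector_map[:collections].each_value do |collection|
  collection[:attrs].each_value do |attr|
    attr[:eval] = attr[:eval].to_sym if attr[:eval].is_a?(String)
  end
end

selector_map[:collections][:product][:attrs][:price][:eval] # => :parse_price
```

The resulting hash could then be passed to Raspar.add together with the helper module, in the same way as the hand-written map.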

This gem is available on RubyGems and on GitHub: Raspar
You can check out the various examples there too.

Go forth and parse!
