PDF to Plain Text processing using docsplit

As Rubyists, don’t we just love searching for gems to do our work for us 🙂 But that does not always work, does it? There are times when we don’t find a solution and need to fix things ourselves. Do we remember to contribute back to the community? Here’s one such story, along with some information about PDF to plain text parsing using pdf-reader and docsplit.

In one of our projects, we wanted to read data from a PDF and convert it to “plain text”. As expected, we started searching for a gem that could help us achieve this, and we quickly found pdf-reader. It worked as expected for pages in portrait orientation. However, when we tried pages in landscape orientation, it failed!

On reading the pdf-reader code, we found that it does not provide any option to parse a page in landscape orientation! When we passed a landscape page to pdf-reader, it converted the page into plain text, but the order of the data changed and sometimes we even lost data. We tried to fix this in pdf-reader, but it got complicated quickly and we set the attempt aside in light of our immediate need. We shall attempt it again soon.

Here’s what happened when we tried converting a PDF to “plain text” using pdf-reader.

After installing the gem, we tried the following code.

  require 'pdf-reader'
  reader = PDF::Reader.new(file_path)
  text = reader.pages[page_number].text  # pages is a zero-based Array

We got the following output for our sample text:

F\nD\no\nm\nd\no\nt\nm\no\na\nt\ne\nn\nt\ne\nl\np\na\nS

This was a setback, so we decided to look for an alternative. That’s when we came across docsplit, but at first we had the same problem with it. On reading the source code, we found that docsplit internally uses the pdftotext utility. pdftotext accepts various arguments, but docsplit did not provide a way to pass them through, so we contributed that. Sanjeev Jha has submitted this pull request.

Steps to convert PDF to plain text using Docsplit

After installing the gem, we have several options we can pass to Docsplit. We chose the “-raw” option.

  Docsplit.extract_text(file_path, {pdf_opts: '-raw',  
       pages: from_page_number..to_page_number, 
       output: 'tmp_text_file'})

where,

pdf_opts: Options passed through to pdftotext; they determine the format of the extracted text.
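
Since docsplit shells out to pdftotext, it can help to see roughly what such a command line looks like. The sketch below only builds the argument list and does not run anything. The helper name `pdftotext_command` and the file names are hypothetical, and this is not docsplit's actual implementation. It assumes the standard pdftotext flags `-f` and `-l` for the first and last page, and `-raw` or `-layout` for the output mode.

```ruby
# Hypothetical helper: builds an argument list for the pdftotext CLI.
# This is an illustration only, not docsplit's internal code.
def pdftotext_command(pdf_path, txt_path, mode: '-layout', pages: nil)
  cmd = ['pdftotext', mode]
  if pages
    # -f and -l select the first and last page to convert (1-based).
    cmd += ['-f', pages.first.to_s, '-l', pages.last.to_s]
  end
  cmd + [pdf_path, txt_path]
end

cmd = pdftotext_command('report.pdf', 'report.txt', mode: '-raw', pages: 1..3)
puts cmd.join(' ')
```

Passing the array to `system(*cmd)` would run the conversion, provided pdftotext is installed on the machine.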

Docsplit generated a text file for each page in the current directory (we can optionally specify an output directory for the text files). With the raw option, the first page of the PDF file was converted to a file whose contents looked like this:

Sample
text
in
vertical
format
for
demo
PDF.

But here was a problem: we wanted to do some processing on the text, and getting each word on a separate line would not help. Suppose we have the text

Employee name   abc xyz  a111

We know that the employee name and number are separated by “3 spaces”, and the text before these 3 spaces is the name of the employee. However, as shown above, the employee name can contain spaces too (“abc xyz” or “a b c”)! So the -raw option would not help us extract the name and number. We searched for other options and came across the “-layout” option.

  Docsplit.extract_text(file_path, {pdf_opts: '-layout',  
        pages: from_page_number..to_page_number, 
        output: 'tmp_text_file'})

Now, with the layout option, the first page of the PDF file was converted to a text file whose contents looked like this:

Sample text in vertical format for demo PDF.

This was exactly what we wanted, and we were able to complete our work properly! To ensure sanity, we migrated entirely from pdf-reader to docsplit.
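
With each record on a single line, the columns can be pulled apart by splitting on runs of spaces. This is a minimal sketch, assuming columns are separated by two or more spaces while a value such as a name contains only single spaces. The method name `split_columns` and the sample line are illustrative, not part of docsplit.

```ruby
# Hypothetical helper: splits one line of -layout output into columns.
# Assumes columns are separated by two or more spaces, while a value
# such as a name contains at most single spaces.
def split_columns(line)
  line.strip.split(/ {2,}/)
end

label, name, number = split_columns('Employee name   abc xyz  a111')
puts "#{label}: #{name} (#{number})"
```

If a real document uses a different column gap, the `{2,}` quantifier in the regular expression would need adjusting.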

Lesson Learnt

Before using any gem, check whether it fulfills all your requirements and, if possible, contribute back so that other people do not face the same problem.


2 Responses to PDF to Plain Text processing using docsplit

  1. Muhammad Junaid Ehsan says:

    How do we extract all pages? I mean, how can we specify all pages instead of a range?

    • shwetakale says:

      There are two approaches:
      1] If you want a single text file generated for the entire PDF, don’t pass the ‘pages’ option.
      For example: Docsplit.extract_text(filepath, {:pdf_opts => '-layout', output: 'tmp_text_file'})

      2] If you want a separate text file for each page in the PDF, you can get the total number of pages and pass it as a range.
      For example: Docsplit.extract_text(filepath, {:pdf_opts => '-layout', pages: 1..Docsplit.extract_info(filepath)[:length], output: 'tmp_text_file'})
