PDF to Plain Text processing using docsplit

As Rubyists, don’t we just love searching for gems to do our work for us :) But that does not always work, does it? There are times when we don’t find a solution and need to fix things ourselves. Do we remember to contribute back to the community? Here’s one such story, along with some information about PDF to plain text parsing using pdf-reader and docsplit.

In one of our projects, we wanted to read data from a PDF and convert it to “plain text”. As expected, we started searching for a gem that could help us achieve this and quickly found pdf-reader. It worked as expected with Portrait orientation. However, when we tried Landscape orientation, it failed!

Reading the pdf-reader code, we found that it provides no option to parse a page in landscape orientation. When we passed a “Landscape” page to pdf-reader, it did convert the page into plain text, but the order of the data changed and sometimes data was lost altogether. We tried to fix this in pdf-reader, but the change grew too complex for the time we had, so we shelved the attempt in light of our immediate need. We intend to revisit it soon.

Here’s what happened when we tried converting PDF to “plain text” using “pdf-reader”.

After installing the gem, we tried the following code.

  reader = PDF::Reader.new(file_path)
  text = reader.pages[page_number].text

We got the following output for the sample text:

F\nD\no\nm\nd\no\nt\nm\no\na\nt\ne\nn\nt\ne\nl\np\na\nS

This was a setback, so we looked for an alternative and came across docsplit. It had the same problem. Reading its source, we found that docsplit uses the pdftotext utility internally. pdftotext accepts arguments that change how text is extracted, but docsplit did not expose a way to pass them, so we contributed that: Sanjeev Jha has submitted this pull request.
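To make the pdftotext connection concrete, here is a minimal Ruby sketch of the kind of command line involved. It is only an illustration, not docsplit’s actual code: the helper name pdftotext_command and the file names are hypothetical. The flags themselves (-raw, -layout, -f for the first page, -l for the last page) are standard pdftotext options.

```ruby
require 'shellwords'

# Hypothetical helper (not part of docsplit): builds the argument list for
# extracting a single page with pdftotext in the given mode.
def pdftotext_command(pdf_path, txt_path, page:, mode: '-layout')
  ['pdftotext', mode, '-f', page.to_s, '-l', page.to_s, pdf_path, txt_path]
end

cmd = pdftotext_command('report.pdf', 'report_1.txt', page: 1)

# Print the shell-escaped command; pass the array to system(*cmd) to run it
# on a machine where pdftotext is installed.
puts Shellwords.join(cmd)
```

Building the command as an array rather than a string keeps file names with spaces safe when it is handed to system.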

Steps to convert PDF to plain text using Docsplit

After installing the gem, we have several options we can pass to Docsplit. We chose the “-raw” option.

  Docsplit.extract_text(file_path, {pdf_opts: '-raw',  
       pages: from_page_number..to_page_number, 
       output: 'tmp_text_file'})

where,

pdf_opts: The option handed to pdftotext, which decides the format of the extracted text.

Docsplit generated a text file for each page in the current directory (we can optionally specify an output directory for the text files). With the raw option, the first page of the PDF file was converted to a file with contents like this:

Sample
text
in
vertical
format
for
demo
PDF.

But here was a problem: we wanted to do some processing on the text, and getting each word on a separate line would not help. Suppose we have the text

Employee name   abc xyz   a111

We know that the employee name and number are separated by “3 spaces” and that the text before these 3 spaces is the name of the employee. However, as shown above, the employee name can contain spaces too (“abc xyz” or “a b c”)! So the -raw option, which puts every word on its own line, would not help us extract the name and number. We searched for other options and came across the ‘-layout’ option.

  Docsplit.extract_text(file_path, {pdf_opts: '-layout',  
        pages: from_page_number..to_page_number, 
        output: 'tmp_text_file'})

Now, with the layout option, the first page of the PDF file was converted to a text file with contents like this:

Sample text in vertical format for demo PDF.

This was exactly what we wanted, and we were able to complete our work properly! To keep things consistent, we migrated entirely from pdf-reader to docsplit.
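Once each record sits on one line, the “3 spaces” rule from the employee example can be applied directly. Here is a minimal sketch, assuming columns are separated by three or more spaces; split_columns is a hypothetical helper written for this illustration, not something docsplit provides.

```ruby
# Hypothetical helper: split one layout-preserved line into its columns.
# Columns are assumed to be separated by at least `min_gap` spaces, so the
# single spaces inside a name such as "abc xyz" stay intact.
def split_columns(line, min_gap: 3)
  line.strip.split(/ {#{min_gap},}/)
end

label, name, number = split_columns("Employee name   abc xyz   a111")
puts "#{label}: #{name} (#{number})"
```

The gap width is a parameter because the spacing that -layout produces depends on the PDF, so it is worth checking against real output first.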

Lesson Learnt

Before using any gem, check whether it fulfills all your requirements and, if possible, contribute back so that other people do not face the same problem.

Posted in Ruby

What makes rspec3 so interesting?

Gautam Rege:

Here is a list of things that rspec3 has in store for us.

Originally posted on rails learning:

Being an rspec fan, I have been waiting for quite some time for the final release of rspec3. It has finally been released and is ready for production use. Many interesting changes have been incorporated in rspec3, thanks to Myron, Andy, David and the other contributors. Here are a few of the changes that make testing more fun:

Changes in rspec-expectations:

Compound Expectations: Composing an expectation out of two or more expectations is called a compound expectation. Here is an example:

# In rspec3
RSpec.describe String do
  example "expect to be an instance of String and it should be equal to 'RUBY IS AWESOME'" do
    string = "RUBY IS AWESOME"

    expect(string).to be_a(String).and eql("RUBY IS AWESOME")
  end
end


Posted in General

Managing images within AWS S3, without re-processing

Gautam Rege:

When you want to move images from one S3 bucket to another – do you bang your head against the wall? Even if you write a program to do that, does it take ages? Here is a neat and quick way to manage your images in AWS S3 without re-processing them!

Originally posted on foorubypho:

It’s common to upload and retrieve images from Amazon S3. We also use existing ones to create modified versions. Recently, a client came up with a similar requirement: he needed to duplicate his data along with its images, so that users could then modify the cloned data as needed.

Consider an example of a car where,

  • car has several models/variants such as base, standard, superior etc
  • each variant contains some of the same features as the previous variant
  • engineers create a base variant first & then use the same data to customize a new variant & so on
  • so engineers get the previous variant’s data as is & can update it as needed.

This data contains a lot of high resolution images for “minor details” such as:

  • Exterior images as options available in body parts, bumpers, vinyls, graphics, rear bumpers, spoilers etc
  • Interior options available such as leather color, steering types, mounted controls…


Posted in General

Building web apps with Rails4 and AngularJS in 15 minutes

While learning AngularJS to make a single page app using Rails4, I found some good videos and blogs. However, I did not find a simple CRUD example that made the integration between Rails4 and AngularJS easy to understand. So in this tutorial post, I explain how to create basic CRUD operations using Rails4 and AngularJS.

The complete code is in my git repository on Github.

Create rails project


$ rails new rails4_crud_with_angularjs

Create User model


$ rails g model user

file db/migrate/[timestamp]_create_users.rb


class CreateUsers < ActiveRecord::Migration
 def change
   create_table :users do |t|
     t.string :first_name
     t.string :last_name
     t.string :email
     t.string :phone
     t.timestamps
   end
 end
end


$ rake db:migrate

app/models/user.rb


class User < ActiveRecord::Base
 validates :first_name, :last_name, :email, presence: true
end

Create Users controller


$ rails g controller users

Implement the CRUD actions in the users controller and have them send JSON responses. The code sample is here.

Add angular gem

In Gemfile add these two gems.


gem 'angularjs-rails'
gem 'angular-ui-bootstrap-rails' #for bootstrap UI


$ bundle install

Setup layout

Adding ng-app and ng-view indicates that we have an AngularJS app in the page.


<html ng-app="myapplication">
 <head>
   <title>Rails4CrudWithAngularjs</title>
   <%= stylesheet_link_tag 'application', media: 'all', 'data-turbolinks-track' => true %>
   <%= javascript_include_tag 'application', 'data-turbolinks-track' => true %>
   <%= csrf_meta_tags %>
 </head>
 <body>
   <div class="container" ng-view>
     <%= yield %>
   </div>
 </body>
</html>

Create an angular controller

First let’s create a directory for our controllers. You can name it whatever you want.

$ mkdir -p app/assets/javascripts/angular/controllers

Now create the users_controllers.js file. Here I have used the same naming convention as Rails.

// app/assets/javascripts/angular/controllers/users_controllers.js
var myApp = angular.module('myapplication', ['ngRoute', 'ngResource']);

‘myapplication’ is the ng-app name used in the layout.

Add Factory

A factory is one of Angular’s providers; you can learn more about it here. Here it interacts with the Rails server and processes the JSON response.

myApp.factory('Users', ['$resource', function($resource) {
  return $resource('/users.json', {}, {
    query: { method: 'GET', isArray: true },
    create: { method: 'POST' }
  });
}]);

myApp.factory('User', ['$resource', function($resource) {
  return $resource('/users/:id.json', {}, {
    show: { method: 'GET' },
    update: { method: 'PUT', params: { id: '@id' } },
    delete: { method: 'DELETE', params: { id: '@id' } }
  });
}]);

‘Users’ factory is used for getting the collection of users and creating users. ‘User’ factory is used to get the user details, update the user or delete the user.
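It can help to see where each factory request ends up on the Rails side. The sketch below is plain Ruby, not Rails routing code. It assumes the application declares the standard REST routes for users (for example with resources :users in config/routes.rb, which this post does not show), and the ROUTES table and resolve helper are hypothetical names used only for illustration.

```ruby
# Plain-Ruby illustration of the verb/path to controller-action mapping that
# standard REST routes for users would provide (assumed, see the text above).
ROUTES = [
  ['GET',    %r{\A/users\.json\z},       'users#index'],   # Users.query()
  ['POST',   %r{\A/users\.json\z},       'users#create'],  # Users.create()
  ['GET',    %r{\A/users/(\d+)\.json\z}, 'users#show'],    # User.show() / User.get()
  ['PUT',    %r{\A/users/(\d+)\.json\z}, 'users#update'],  # User.update()
  ['DELETE', %r{\A/users/(\d+)\.json\z}, 'users#destroy']  # User.delete()
].freeze

# Returns the matching action and the captured id (nil for collection routes),
# or nil when no route matches.
def resolve(verb, path)
  ROUTES.each do |route_verb, pattern, action|
    match = pattern.match(path)
    next unless route_verb == verb && match
    return { action: action, id: match[1] }
  end
  nil
end

p resolve('GET', '/users.json')
p resolve('DELETE', '/users/7.json')
```

Note that the factory action named delete maps to the Rails action named destroy.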

Add Routes

Angular routes are used for deep-linking URLs to controllers and views (HTML partials). It watches $location.url() and tries to map the path to an existing route definition.

myApp.config([
 '$routeProvider', '$locationProvider', function($routeProvider, $locationProvider) {
 $routeProvider.when('/users',{
    templateUrl: '/templates/users/index.html',
    controller: 'UserListCtr'
 });
 $routeProvider.when('/users/new', {
   templateUrl: '/templates/users/new.html',
   controller: 'UserAddCtr'
 });
 $routeProvider.when('/users/:id/edit', {
   templateUrl: '/templates/users/edit.html',
   controller: "UserUpdateCtr"
 });
 $routeProvider.otherwise({
   redirectTo: '/users'
 });
 }
]);

In the code above, I have added the controllers UserListCtr, UserAddCtr, UserUpdateCtr for listing users and to create and update users.

Add Angular templates

Now we need to add templates. I have stored them in public/templates.

If we create a file public/templates/users/index.html with some arbitrary content, we should be able to see it in the browser. Here is a sample template for users.

CRUD Actions

Now our setup is done and we are ready to implement the CRUD operations.

Index Action:

myApp.controller("UserListCtr", ['$scope', '$resource', 'Users', 'User', '$location', function($scope, $resource, Users, User, $location) {
  $scope.users = Users.query(); //it's getting user collection
}]);

The ‘UserListCtr’ controller lists the users. You can check index.html here. I am not explaining the index template because it is a straightforward Angular template; you can read more about it here.

Create Action:

myApp.controller("UserAddCtr", ['$scope', '$resource', 'Users', '$location', function($scope, $resource, Users, $location) {
  $scope.save = function () {
    if ($scope.userForm.$valid) {
      Users.create({user: $scope.user}, function() {
        $location.path('/');
      }, function(error) {
        console.log(error);
      });
    }
  };
}]);

The ‘UserAddCtr’ controller creates a user. You can check new.html here. Users.create() calls the create action of the Rails users controller; create() is the action we defined in the ‘Users’ factory.

Update Action:

myApp.controller("UserUpdateCtr", ['$scope', '$resource', 'User', '$location', '$routeParams', function($scope, $resource, User, $location, $routeParams) {
   $scope.user = User.get({id: $routeParams.id})
   $scope.update = function(){
     if ($scope.userForm.$valid){
       User.update($scope.user,function(){
         $location.path('/');
       }, function(error) {
         console.log(error)
      });
     }
   };
}]);

The ‘UserUpdateCtr’ controller updates the user. You can check edit.html here. User.update() calls the update action of the Rails users controller; update() is the action we defined in the ‘User’ factory.

Delete Action:

For deleting a user I am not creating a separate Angular controller. Instead, I add a deleteUser function to the ‘UserListCtr’ controller.


myApp.controller("UserListCtr", ['$scope', '$http', '$resource', 'Users', 'User', '$location', function($scope, $http, $resource, Users, User, $location) {

  $scope.users = Users.query();

  $scope.deleteUser = function (userId) {
    if (confirm("Are you sure you want to delete this user?")){
      User.delete({ id: userId }, function(){
        $scope.users = Users.query();   // after delete user get users collection.
        $location.path('/');
      });
    }
  };
}]);

User.delete() calls the destroy action of the Rails users controller; delete() is the action we defined in the ‘User’ factory.

Add a ‘Remove’ link in public/templates/users/index.html:


<a href="" ng-click="deleteUser(user.id)">Remove</a>

Remember that href should be blank. If you add href=”#”, it will call the default route of your application.

I hope this blog helps those who are getting started with Rails + AngularJS development.

Posted in Javascript, Ruby on Rails, Tutorials

The First-Ever Go Conference in India – GopherConIndia 2015

After the resounding success of GopherCon 2014 in Denver, CO, USA, the Go Language Community in India, together with the Innovation And Technology Trust (ITT), is bringing you the first-ever Go conference in India: GopherConIndia 2015, in Bengaluru (Bangalore) from 19-21 Feb. 2015. ITT is a non-profit organization established to organize and conduct technology conferences in India; its current portfolio includes RubyConf India, GopherCon India and DevOpsDays India.

The Indian Go programming community is growing at a dramatic pace. The number of companies utilizing Go, as part of their technology stack, continues to grow steadily.

GopherConIndia 2015


Planning and organizing an all-India conference is not an easy task, but a small team of dedicated volunteers (Ajey Gore, Gautam Rege, Karan Misra, Krishnaprasad Varma, Pravin Mishra, Santosh B Malleshappa, Sathish VJ, Satish Talim; this does not list all of the volunteers, without whom GopherConIndia would not be possible) is already on the job and good progress has been made. Special mention needs to be made of Brian Ketelsen, Erik St. Martin and Matt Aimonetti, who have been guiding and supporting the team of volunteers.

Why Bengaluru (Bangalore)?

Bengaluru, the Silicon Valley of India, is the ideal location for the first-ever Indian Go conference. It has been the host to all the global software companies for many years. It boasts some of the best world-class restaurants and great nightlife to round out your conference experience. No wonder, Bengaluru is termed as the Software and Party Hub of India.

The Venue

Hotel Royal Orchid is less than an hour away from the Kempegowda International Airport and 10 km from the Central Railway Station. The hotel’s vicinity offers a range of restaurants, electronics stores and shopping malls, giving every traveler an opportunity to enjoy the spirit of Bengaluru city. With all modern facilities, the hotel is a perfect venue for a conference like GopherConIndia.

Dates

GopherConIndia is a single-track event (fully in English) that you don’t want to miss and where everyone gets the opportunity to see the same talks. 20-21 Feb. 2015 are the main days of the conference. A paid Go workshop conducted by William Kennedy is planned for 19th Feb. 2015 followed by a pre-conference party for all the speakers and participants.

Estimated Audience

We expect around 300 participants at the conference.

Sponsors and Supporters

The Sponsors and Supporters of the conference realize that this is a great opportunity for them to reach a captive audience of early adopters. Their sponsorship helps keep GopherConIndia affordable and accessible to the widest possible audience.

Please do lend your support to GopherConIndia 2015.

Sponsors and Supporters are lining up! So far we have the following sponsors and supporters and many more are expected soon.

Gold Sponsors

Media Sponsor

Go Books / Screencasts

Student Sponsors

Live Blogging from 19-21 Feb. 2015

Speakers

The CFP is already open and we are getting a lot of awesome talk proposals.

Some of the international speakers who have confirmed their participation so far, are:

Student Scholarships

As part of GopherConIndia’s commitment to encourage students in India to excel in computing and technology, student scholarships are being offered. Female students will be given preference.

Connect with GopherConIndia

Connect with GopherConIndia to know the latest about the conference:

Be a part of this awesome Go conference and make it a big success.

Posted in Conferences, Go

API Throttling on Requests Per Minute

In my previous blog post I discussed API design and versioning. Now I am going to build a simple algorithm to restrict API access based on requests per minute using Redis.

Very often, as an API provider we need to control request traffic based on certain criteria, like account subscription, time interval or requests per day or per month. Redis provides a key expiry functionality based on TTL (time to live) and using this we can implement the requests per minute feature.

# ruby redis client expire method.
redis_client.expire(key, time_to_expire_in_secs)
# e.g.
redis_client.expire("1", 60)

Here is the implementation of the API request counting store.

  • The incr method increments the key and sets its expiry if the key was set for the first time.
  • The threshold? method increments the counter and checks the new value against the threshold.

class ApiRpmStore

  TIME_TO_EXPIRE = 60 # 1 min

  class << self
    attr_accessor :redis_client

    def init(config = {})
       self.redis_client = Redis.new(:url => "redis://#{config['host']}:#{config['port']}/#{config['database']}")
    end

    def incr(key)
      val = redis_client.incr(key)
      redis_client.expire(key, TIME_TO_EXPIRE) if val == 1
      val
    end

    # Returns true while the caller is still within the threshold.
    def threshold?(key, threshold_value = 0)
      incr(key) <= threshold_value
    end

  end

end

Test this using the console.

irb> ApiRpmStore.init({'host' => 'localhost', 'port' => 6379, 'database' => 0})

# threshold value is 1 for key 'user-1'
irb> ApiRpmStore.threshold?('user-1', 1) # return true

irb> ApiRpmStore.threshold?('user-1', 1) # return false
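The counting logic can also be tried without a Redis server. The sketch below replaces Redis with a small in-memory stand-in. FakeRedis, the injected clock and the allowed lambda are hypothetical names made up for this illustration, and the stand-in only mimics incr and expire. It assumes a request is allowed while the count is at or below the limit.

```ruby
# In-memory stand-in for the two Redis calls used by the store (incr, expire).
# FakeRedis is hypothetical and exists only for this illustration.
class FakeRedis
  def initialize(clock)
    @clock = clock       # callable returning the current time in seconds
    @values = {}
    @expires_at = {}
  end

  # Increments the counter, starting from 0 if the key is missing or expired.
  def incr(key)
    purge(key)
    @values[key] = @values.fetch(key, 0) + 1
  end

  # Schedules the key to disappear `seconds` from now.
  def expire(key, seconds)
    @expires_at[key] = @clock.call + seconds
    true
  end

  private

  def purge(key)
    deadline = @expires_at[key]
    return unless deadline && @clock.call >= deadline
    @values.delete(key)
    @expires_at.delete(key)
  end
end

now = 0
store = FakeRedis.new(-> { now })

# Fixed-window check: count the hit, start the 60 second window on the first hit.
allowed = lambda do |key, limit|
  count = store.incr(key)
  store.expire(key, 60) if count == 1
  count <= limit
end

first  = allowed.call('user-1', 2)  # first hit in the window
second = allowed.call('user-1', 2)  # second hit, still within the limit
third  = allowed.call('user-1', 2)  # third hit, over the limit
now = 61                            # move the clock past the window
fourth = allowed.call('user-1', 2)  # counter starts again
p [first, second, third, fourth]
```

Injecting the clock makes the expiry testable without sleeping for a minute.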

Now implement the before action methods in the controller in which we are going to validate requests per minute. If the requests per minute validation fails, we return a ‘429 Too Many Requests’ response with a helper URL, such as the subscription or license page.

class Api::ApiController < ActionController::Base

  private

  def authenticate
    authenticate_or_request_with_http_token do |token, options|
      @user = User.where(api_key: token).first
    end
  end

  def validate_rpm
    # request_per_min is the per-minute threshold for this user;
    # reject the request when threshold? reports the limit has been reached
    unless ApiRpmStore.threshold?(@user.id.to_s, @user.request_per_min)
      render json: {help: 'http://mysite.com/plans'}, status: :too_many_requests
      return false
    end
  end
end

This is the events controller whose API requests we are going to throttle. If you want to apply the rate limit to all API endpoints, add before_action :validate_rpm to your base API controller (in my case, ‘Api::ApiController’).

class Api::V1::EventsController < Api::ApiController
  before_action :authenticate
  before_action :validate_rpm

  respond_to :json

  def index
    @events = Event.all
    respond_with @events
  end
end

You can find the Rails application code sample on GitHub.

Posted in Ruby on Rails

Real Time notifications using slanger and sidekiq

Gautam Rege:

Did you want Real Time Notifications in your web app? Here is how to do it using Slanger and Sidekiq with a ready to use demo repository.

Originally posted on rails learning:

With the increasing expectations from web applications, everyone wants real time updates or real time notifications to improve the web portal’s user experience. Understandably, my project required real time notifications too. I successfully implemented them and deployed them to production. During development I came across some interesting gems and JavaScript libraries, so I thought it would be helpful to share my experience. To make it more useful, I have created a sample repository demonstrating real time notifications for your further reference.

The AIM: Notify the user with a reminder on specific user defined date and time using Web notifications.

I chose the slanger gem (recently updated by Jiren to make it compatible with Rails4). To make it work, I added pusher to the Rails app. As for slanger, instead of adding it to the Rails Gemfile, I created a sub directory called ‘slanger’ in my Rails project and added slanger there. Now we are done with…


Posted in General