ProTips PostGres 1: Looking through the Postgres Window

Content posted here with the permission of the author, who is currently employed at Josh Software. Original post available here.

Problem – 

We encountered a situation where we had to select only one record for each value of a column (let's say ‘user_id’), ordered by a different column (let's say ‘price’), in a certain table (let's say ‘purchases’). Doing this the active-record way was very inefficient. The active-record way required us to:

  1. Fetch all the user_ids from the purchases table
  2. Loop through all the user_ids and, for each user_id:
    1. Select one record ordered by price

This approach led us to query the database once per user, plus one query to select all the user_ids (the N + 1 query problem).
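
In plain SQL, the queries issued by that loop look roughly like the sketch below. It assumes the purchases table defined later in this post, and it assumes "one record ordered by price" means the highest-priced purchase. The literal ids 1 and 2 are placeholders for whatever the application loop would bind.

```sql
-- Query 1: fetch every distinct user_id.
SELECT DISTINCT user_id FROM purchases;

-- Queries 2 .. N+1: one query per user_id returned above.
-- The ids 1 and 2 stand in for the loop variable.
SELECT * FROM purchases WHERE user_id = 1 ORDER BY price DESC LIMIT 1;
SELECT * FROM purchases WHERE user_id = 2 ORDER BY price DESC LIMIT 1;
```

With N users this is N + 1 round trips to the database, which is what the rest of the post sets out to avoid.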

Solution – 

We struggled to find a better solution in active-record for this problem, and PostgreSQL's window functions came to the rescue.

To understand Postgres window functions, we must first understand the GROUP BY clause. We use GROUP BY to apply aggregate functions to subsets of a table's rows.

For example, say you have a table called “purchases” with the schema shown below.

  CREATE TABLE purchases (
    purchase_id serial PRIMARY KEY,
    product_id INT NOT NULL,
    price DECIMAL (11, 2),
    user_id INT NOT NULL
  );

Let's insert some data into this table.

 INSERT INTO purchases (product_id, price, user_id)
 VALUES
 (1, 10, 1),
 (2, 20, 1),
 (3, 30, 1),
 (4, 40, 2),
 (1, 10, 2),
 (5, 50, 2),
 (6, 30, 2);

 

To calculate the average of all the purchase prices, we use the AVG aggregate function.

  SELECT AVG (price) FROM purchases;

This returns the average price across all purchases, covering both users (ids 1 and 2).

  avg
 --------
  27.1428571428571429

 

Now suppose you want the average price of the items that each user has purchased. You would use the GROUP BY clause on user_id. The following statement returns the average for each user_id.

  SELECT user_id, AVG (price) FROM purchases GROUP BY user_id;

The above query returns the following:

 user_id          avg
 --------      --------
    1          20.0000000000000000
    2          32.5000000000000000

As you can see, the GROUP BY clause groups the rows of the table into subsets based on one or more columns, and the AVG aggregate function collapses each subset into a single row. The query therefore returns fewer rows than the table contains.
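
To make the "one row per group" effect concrete, here is a small sketch that computes several aggregates per user at once. It assumes the purchases table and the sample rows inserted above. The column aliases are illustrative names chosen here, not part of the original schema.

```sql
-- One output row per user_id, regardless of how many purchases each user has.
SELECT
  user_id,
  COUNT(*)   AS purchase_count,
  MIN(price) AS cheapest,
  MAX(price) AS priciest,
  AVG(price) AS average_price
FROM purchases
GROUP BY user_id
ORDER BY user_id;
```

With the sample data this yields two rows: user 1 with 3 purchases and user 2 with 4. The individual purchase rows are no longer visible in the result.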

Similarly, a window function also operates on a subset of the rows of a table, but it does not reduce the number of rows. It returns every row of the table, with an extra column (“avg”, if you applied the AVG function to the prices) holding the result computed over that row's subset.

In the query below, the AVG function works as a window function that operates on the set of rows specified by the OVER (PARTITION BY ...) clause.

The query returns the product_id, user_id and price of every row in “purchases”, along with the average price for that row's user.

  SELECT
  product_id,
  user_id,
  price,
  AVG (price) OVER ( PARTITION BY user_id )
  FROM
  purchases

The output

  product_id    user_id   price   avg
 ------------ ---------- ------- -----
      1           1       10.00   20.0000000000000000
      2           1       20.00   20.0000000000000000
      3           1       30.00   20.0000000000000000
      4           2       40.00   32.5000000000000000
      1           2       10.00   32.5000000000000000
      5           2       50.00   32.5000000000000000
      6           2       30.00   32.5000000000000000
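
Because the original rows survive, a windowed aggregate can be combined with per-row columns in the same expression. As a sketch, assuming the same purchases table (the alias diff_from_user_avg is a name chosen here for illustration), the query below shows how far each purchase is from its user's average:

```sql
-- Each row keeps its own price; the window supplies the per-user average.
SELECT
  product_id,
  user_id,
  price,
  price - AVG(price) OVER (PARTITION BY user_id) AS diff_from_user_avg
FROM purchases
ORDER BY user_id, price;
```

For user 1, whose average is 20, the 10.00 purchase comes out 10 below the average and the 30.00 purchase 10 above it. A plain GROUP BY query cannot express this without joining the aggregate back to the table.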

You can also combine other clauses with PARTITION BY inside OVER. For example, if you want the rows of each partition to be processed in a particular order, add an ORDER BY clause. You can also use built-in window functions like row_number(), rank(), etc.
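
An ORDER BY inside OVER does more than sort. For an aggregate used as a window function, the default frame then runs from the start of the partition up to the current row (and any rows that tie with it), so the aggregate becomes a running value. A minimal sketch, assuming the purchases table above, with running_total as an illustrative alias:

```sql
-- Running total of each user's spending, from cheapest purchase upward.
SELECT
  product_id,
  user_id,
  price,
  SUM(price) OVER (PARTITION BY user_id ORDER BY price) AS running_total
FROM purchases
ORDER BY user_id, price;
```

With the sample data, user 1's running totals are 10.00, 30.00 and 60.00, and user 2's are 10.00, 40.00, 80.00 and 130.00.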

These built-in window functions assign a number to each row based on its position within the ordered partition. The row_number() function assigns a sequential number, starting at 1, to the rows of each partition. So if you want a limited number of records for each user_id, you can filter on it.

The query below returns only one record for each user_id: the purchase with the highest price.

  SELECT * FROM
  (
    SELECT
    product_id,
    user_id,
    price,
    row_number() OVER ( PARTITION BY user_id ORDER BY price DESC )
    FROM
    purchases
  ) tmp
  WHERE tmp.row_number <= 1

In the above query, I have ordered each partition by decreasing price and then limited the number of rows per partition by filtering on the row_number() value. The filter has to sit in an outer query because window functions are not allowed in a WHERE clause. Below is the output.

  product_id   user_id   price   row_number
 ------------ --------- ------- ------------
      3           1       30.00       1
      5           2       50.00       1
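
The same pattern generalizes to "top N per user" by changing the limit. The sketch below, again assuming the purchases table above, keeps the two most expensive purchases per user. It gives the window column an explicit alias (rn, a name chosen here) rather than relying on the default column name.

```sql
-- Top 2 purchases per user by price.
SELECT product_id, user_id, price, rn
FROM (
  SELECT
    product_id,
    user_id,
    price,
    row_number() OVER (PARTITION BY user_id ORDER BY price DESC) AS rn
  FROM purchases
) ranked
WHERE rn <= 2
ORDER BY user_id, rn;
```

With the sample data this returns products 3 and 2 for user 1, and products 5 and 4 for user 2. Note that row_number() breaks ties arbitrarily; if two purchases share a price, add a second ORDER BY column to make the choice deterministic.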

Postgres provides many built-in window functions. Some of them are explained below.

RANK() function

The RANK() function assigns a rank to each row within an ordered partition. If two rows have the same value, RANK() assigns them the same rank and skips the next rank(s).

  SELECT
  product_id,
  user_id,
  price,
  RANK () OVER ( PARTITION BY user_id ORDER BY price )
  FROM purchases

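The output (note that it includes a second (1, 1, 10.00) row for user 1, which is not part of the INSERT statement above; it is there to show a tie)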

 product_id    user_id    price   rank
 ------------ ---------- ------- -----
    1            1        10.00    1
    1            1        10.00    1
    2            1        20.00    3
    3            1        30.00    4
    1            2        10.00    1
    6            2        30.00    2
    4            2        40.00    3
    5            2        50.00    4

 

DENSE_RANK() function

The DENSE_RANK() function also assigns a rank within an ordered partition, but the ranks are consecutive. In other words, rows with the same value receive the same rank and no ranks are skipped.

  SELECT
  product_id,
  user_id,
  price,
  DENSE_RANK () OVER ( PARTITION BY user_id ORDER BY price )
  FROM purchases

The output (as in the RANK() example, user 1 has two rows priced at 10.00)

  product_id   user_id    price   dense_rank
 ------------ ---------- -------  ----------
      1         1        10.00     1
      1         1        10.00     1
      2         1        20.00     2
      3         1        30.00     3
      1         2        10.00     1
      6         2        30.00     2
      4         2        40.00     3
      5         2        50.00     4
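
To compare the numbering functions side by side, here is a self-contained sketch that does not depend on the purchases table. It uses an inline VALUES list (named tied_purchases here, a hypothetical name) containing user 1's rows with the 10.00 price duplicated, matching the tie shown in the outputs above.

```sql
-- Inline sample data: two rows tie at 10.00.
WITH tied_purchases (product_id, user_id, price) AS (
  VALUES
    (1, 1, 10.00),
    (1, 1, 10.00),
    (2, 1, 20.00),
    (3, 1, 30.00)
)
SELECT
  product_id,
  user_id,
  price,
  row_number() OVER w AS rn,         -- 1, 2, 3, 4
  rank()       OVER w AS rnk,        -- 1, 1, 3, 4
  dense_rank() OVER w AS dense_rnk   -- 1, 1, 2, 3
FROM tied_purchases
WINDOW w AS (PARTITION BY user_id ORDER BY price)
ORDER BY price, rn;
```

The WINDOW clause names the window once so that all three functions share the same partitioning and ordering. row_number() always counts 1, 2, 3, 4; rank() leaves a gap after the tie; dense_rank() does not.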

Postgres provides several other window functions, including:

  1. FIRST_VALUE()
  2. LAST_VALUE()
  3. NTH_VALUE()
  4. PERCENT_RANK()
  5. LAG()
  6. LEAD()
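
As a brief illustration of a few of these, the sketch below runs against the purchases table from earlier in the post. The column aliases are names chosen here for readability.

```sql
SELECT
  product_id,
  user_id,
  price,
  -- Price of the previous / next row in the same partition (NULL at the edges).
  LAG(price)  OVER w AS previous_price,
  LEAD(price) OVER w AS next_price,
  -- First row of the ordered partition, i.e. the user's cheapest purchase.
  FIRST_VALUE(price) OVER w AS cheapest,
  -- LAST_VALUE needs an explicit frame: the default frame ends at the
  -- current row, so without it the "last" value is just the current price.
  LAST_VALUE(price) OVER (
    PARTITION BY user_id ORDER BY price
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS priciest
FROM purchases
WINDOW w AS (PARTITION BY user_id ORDER BY price)
ORDER BY user_id, price;
```

For each user, previous_price is NULL on the cheapest row and next_price is NULL on the most expensive one, since there is no neighbouring row within the partition.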

 

In the next post, I will share a performance analysis of Postgres window functions versus the active-record way for the same task.

 

About Sachin Shintre

Director, Josh Software