Efficient Evenly Distributed Sampling of Time Series Records in PostgreSQL

The Problem

I have been working on an application that, at it’s heart, stores a large amount of data that is organized primarily through the use of a foreign key and a timestamp field. The table’s own primary key is UUID based, combining the foreign key with a UUID for the individual record itself, and it has a single primary data field that utilizes a JSONB type since it can receive arbitrary data. The table sees frequent, regular inserts, and periodic deletions, with old data being thinned out over time, but for each foreign key, there may be tens of thousands of records distributed amongst hundreds of thousands, or millions of other records for other foreign keys.

ColumnType
iduuid
server_iduuid
datajsonb
created_attimestamp(6) without time zone
Basic Table Schema

This was all very simple, but when the time came to start writing the code that generates data graphs from this table, I encountered a puzzle.

How does one ensure that the API doesn’t return too much data? Too many data points just means sending more data than the user probably needs, and it results in the graphing tool having to work with more data than it wants for quick, responsive performance, as well.

And how does one, efficiently, find that data without running a query take takes a burdensome amount of time? In our case, the data is being returned to a React based front end by an API, and snappy application performance hinges on snappy API performance.

THe first solution

Early in the history of the application, I arrived at a solution. If I had a maximum cap on the number of data points to query, such as 500, I could query the total count of records which matched my query, and then a little integer division would give me an interval to use when querying.

Counting the data points is simple. It looks something like this:

sql = <<-ESQL
SELECT
  COUNT(*)
FROM
  telemetries
WHERE
  server_id = $1
  AND created_at BETWEEN $2 AND $3
  AND data ? 'load_avg'
ESQL

count = 0_64
DBH.using_connection do |conn|
  count = conn.query_one(sql, uuid, start_date, end_date, as: {Int64})
end

Once the count of records is determined, an interval can be calculated which will be used to query the sample of records.

i.e. if there are 5000 data points, and I want to sample 500 of them, then I need to query every 10th record. It looks something like this to find that interval:

row_modulo = count // limit
row_modulo = 1 if row_modulo == 0

Once one has an interval, there is a technique that can be used with Postgresql to select records on that interval. The row_number() is a window function that assigns a sequential number to each row in a result set. Once each record has a monotonically increasing sequential number assigned to it, that number can be used in a WHERE clause.

SELECT
  stuff,
  ROW_NUMBER()
    OVER (ORDER BY created_at ASC)
    AS row
FROM
  mytable
WHERE
  row % 10 = 0

This example would select, for every 10th record from mytable, the stuff field.

In the context of full, working code, assembling that query looked like this:

sql = <<-ESQL
SELECT
  t.*
FROM (
  SELECT
    data,
    created_at,
    row_number()
      OVER (ORDER BY created_at ASC)
      AS row
  FROM
    telemetries
  WHERE
    server_id = $1
    AND created_at BETWEEN $2 AND $3
    AND data ? 'load_avg'
) 
AS t
WHERE
  t.row %#{row_modulo} = 0
ESQL

This worked! It’s a viable general technique when you want to select every nth record from some result set, and you want to make the database do the work instead of your application. It’s also almost always faster and less resource intensive to do data management like this inside the database than it is to pull all of the data into your application and make it responsible for sorting through the data and pruning unneeded rows.

A Wrinkle: Counting Isn’t Cheap!

There are a couple of performance problems with this approach that become apparent when the table starts significantly growing.

First, pulling a count is not cheap. MySQL maintains a global record count for tables as part of it’s MyISAM data format. PostgreSQL, however, uses something called a multi-version concurrency control strategy with its tables, which essentially means that different views of a database may see different sets of rows. Thus there is no one single, simple count of records for it to fall back on. Thus, when you count records in a table in PostgreSQL, the database is required to actually walk through the data and count all of the visible records.

This is a relatively intense, and thus slow process.

If you simply want an estimate of the number of total rows in a table, there is a way to get that very cheaply:

SELECT
  reltuples::bigint
    AS estimated_count
FROM
  pg_class
WHERE
  relname = 'mytable'

This doesn’t work when you want to count only a subset of records, though, and this value is only an estimate. It is the estimate that the query planner uses, so it should generally always be within about 10% of the real value, but it is unlikely to ever match exactly unless the table size changes only rarely.

There are other counting strategies, but they all have tradeoffs or inherent inaccuracies, so for this use case, there is no getting around paying that up-front time and resource cost just to get a count of records to use when calculating the query interval that is needed.

The second expensive part of this technique is the use of row_number() in combination with a modulo (%) in the WHERE clause. This means that the database must traverse every possible record when running the query in order to figure out which ones satisfy the WHERE clause. So if there are 150000 records, but one only wants 500 of them, all 150000 will still be scanned.

These factors combine to make this approach brutally, unusably slow for queries that are intended to be ran ad hoc, and quickly, as part of an API driving a UI.

                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=105.19..2583.85 rows=1 width=387) (actual time=418.318..26002.490 rows=545 loops=1)
   Filter: ((t."row" % '49'::bigint) = 0)
   Rows Removed by Filter: 26198
   ->  WindowAgg  (cost=105.19..2582.55 rows=87 width=387) (actual time=210.259..25995.686 rows=26743 loops=1)
         ->  Bitmap Heap Scan on telemetries  (cost=105.19..2581.46 rows=87 width=379) (actual time=210.248..25959.166 rows=26743 loops=1)
               Recheck Cond: (data ? 'load_avg'::text)
               Filter: (server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'::uuid)
               Rows Removed by Filter: 178886
               Heap Blocks: exact=39489
               ->  Bitmap Index Scan on telemetries_data_idx  (cost=0.00..105.17 rows=689 width=0) (actual time=101.188..101.188 rows=205629 loops=1)
                     Index Cond: (data ? 'load_avg'::text)
 Planning Time: 1.860 ms
 Execution Time: 26006.389 ms

This is a real example of a query on a real database using the prior technique, and this example had the advantage that the index that it uses (a BTREE index across the data field, since in production we are limiting results to fields that have one specific type of data) was already warm and cached in the database’s working set when I ran this example, so this result was a best case for this technique, on this database. If that index were not available, or were not used, it would have been even slower given that this index filter rejected almost 180,000 rows. That’s too slow to be triggered directly via an API request, as the user will be waiting a half-minute for data to even begin to show up in their browser.

There has to be a better way

It turns out that Postgresql offers a high performance option to sample a random set of data in a table. There is a TABLESAMPLE clause that can be placed in the FROM section of a query that will sample a subset of a table.

SELECT
  data
FROM
  mytable
  TABLESAMPLE SYSTEM(5)

This would return a roughly random set of about 5% of mytable‘s rows. If one wants a specific number of rows, there is an extension that can provide that, tsm_system_rows.

SELECT
  data
FROM
  mytable
  TABLESAMPLE SYSTEM_ROWS(500)

This would return a random-ish set of 500 rows from the table. A WHERE clause can be used in a query that uses TABLESAMPLE in order to select only the rows of interest, but the TABLESAMPLE is applied before the WHERE clause, which makes this method unsuitable for my use case. As an example:

SELECT
  data,
  created_at
FROM
  telemetries
  TABLESAMPLE SYSTEM_ROWS(500)
WHERE
  server_id = $1

This would first select 500 random rows from the entire data set, and would then try to find records from that set which matched the WHERE clause. This would probably result in the query only returning a very small, and fairly unpredictable number of rows of data that is actually wanted. Also, because the records are random, there is no guarantee that they are evenly distributed through the data set. This might be fine if the data is being queried for statistical reasons, but it isn’t ideal when pulling data for graphs.

So while TABLESAMPLE can be a very fast way to select a random set of records over an entire table, it doesn’t work when we want a set of rows that is evenly distributed through the data set, but is only for a segment of the table’s total data, and for which we want to have some predictable control over the number of rows selected.

Other Meanderings

There are other solutions available when the problem to be solved is the random selection of table rows, but none of them are particularly useful for the selection of N or close-to N evenly distributed data points, and there is limited inspiration that can be found from them.

As a quick overview, the simplest approach is to just insert a comparison to a random number in the WHERE clause:

SELECT * FROM table WHERE RANDOM() < 0.1

This will select approximately 10% of the total rows, but it requires a full table scan. If one wants a specific number of random rows, one can do it with randomization in the ORDER BY clause. This is much slower, particularly on a large table, because the full table scan is followed by a sort of the entire table, but it does work:

SELECT * FROM table ORDER BY RANDOM() LIMIT 500

If the table is indexed with an incrementing integer primary key, and there aren’t many gaps in the keys, it is possible to perform a rather fast random selection of records through the use of a generated sequence of random numbers within the range of IDs.

SELECT
  *
FROM
  (
    SELECT
      DISTINCT 1 + trunc(random() * 650000)::integer AS id
    FROM
      GENERATE_SERIES(1, 550)
    )
JOIN telemetries
USING (id)
LIMIT 500;

If the telemetries table were indexed with an integer primary key, and it had 650000 records with few gaps (no chunks has been deleted), the above code would loop over a generated series of 550 integers, and would produce a distinct set of random integers in the range of 1 to 650000. You can think of that code inside of the FROM clause as a SQL version of a for loop that is generating an array of random numbers. Those numbers are then used as record IDs by the JOIN, which will return a final set limited to 500 records. We pull a few more than 500 in case there are some gaps, to hedge against not finding 500 records.

This technique is very fast, given the caveats already mentioned, and while it can not be applied to the case of a table that is indexed with a UUID, it occurred to me that the core concept in this implementation idea could be generalized to provide a very fast solution that works for time indexed data.

Finally, A Solution

The solution that I arrived at can return a time distributed sample of some subset of the data in a table in a handful of milliseconds. It leverages PostgreSQL’s ability to generate series data via a function, along with a little math inside of the API server when it builds the query, to create a query that, when indexed properly, will return evenly distributed data very quickly.

The idea hinges on the fact that all of the records carry a timestamp which preserves their position in the time series. So, if the database itself can generate a series of timestamps that is evenly distributed throughout the total range of the data, then a query can return a row for the interval between one timestamp and the next. This will result in a set of rows no larger than the maximum desired set, though it may be smaller if the intervals are just too small to find data within each, and Postgresql processes this query surprisingly quickly.

So, if we imagine that we want to pull 500 data points of a specific type from the database, for the period between 2020-09-01 00:00:00 and 2020-09-15 23:59:59, then the first step is to figure out what interval is needed to get 500 steps between those starting and ending timestamps. This is simple math:

points = 500
start = Time.parse(start_timestamp, ,"%Y-%m-%d %H:%M:%S", Time::Location::UTC)
finish = Time.parse(finish_timestamp, ,"%Y-%m-%d %H:%M:%S", Time::Location::UTC)
seconds_per_point = (finish - start).to_i / points

Once that interval is calculated, the query can be generated.

SQL = <<-ESQL
SELECT
  (
    SELECT
      t.data,
      t.created_at
    FROM
      telemetries t
    WHERE
      t.server_id = $1
      AND t.data ? 'load_avg'
      AND t.created_at >= series.target
      AND t.created_at <= (series.target + INTERVAL $2)
      LIMIT 1
  )
FROM
  (
    SELECT
      target
    FROM
      GENERATE_SERIES(
        $3,
        $4,
        $5'
      ) target
  ) series;
ESQL

interval = "#{seconds_per_point} seconds"
query_data = [] of Tuple(JSON::Any, Int64)
DBH.using_connection do |conn|
  conn.query_each(
    sql,
    uuid,
    interval,
    start_timestamp,
    finish_timestamp,
    interval
  ) do |rs|
    query_data << {
      rs.read(JSON::Any),
      rs.read(Time).to_unix_ms,
    }
  end
end

This would generate SQL that ends up being evaluated something like the code below.

SELECT
  (
    SELECT
      t.data,
      t.created_at
    FROM
      telemetries t
    WHERE
      server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'
      AND data ? 'load_avg'
      AND t.created_at >= series.target
      AND t.created_at <= (series.target + INTERVAL '2592 seconds')
      LIMIT 1
  )
FROM
  (
    SELECT
      target
    FROM
      GENERATE_SERIES(
        '2020-09-01 00:00:00'::timestamp,
        '2020-09-15 23:59:59'::timestamp,
        INTERVAL '2592 seconds'
      ) target
  ) series

This approach leverages the PostgreSQL GENERATE_SERIES() function, which takes a starting value, an ending value, and an interval, and generates a set of value from them.

There is a problem with this SQL, though. It will fail with an error:

ERROR:  subquery must return only one column

Postgresql provides a convenient mechanism to circumvent this limitation on subqueries, however. Postgresql lets one define custom types. Using this capability, a type can be defined which holds both a JSONB and a TIMESTAMP WITHOUT TIME ZONE entity, and this will count as a single data item for purposes of the restriction that was encountered above.

CREATE TYPE
  jsonb_x_timestamp AS
    (
      data JSONB,
      created_at TIMESTAMP WITHOUT TIME ZONE
    )

The query can then be rewritten to return a single instance of this custom type, which eliminates the error and lets it work as desired.

SELECT
  (
    SELECT
      (
        t.data,
        t.created_at
      )::jsonb_x_timestamp
    FROM
      telemetries t
    WHERE
      server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'
      AND data ? 'load_avg'
      AND t.created_at >= series.target
      AND t.created_at <= (series.target + INTERVAL '2592 seconds')
      LIMIT 1
  )
FROM
  (
    SELECT
      target
    FROM
      GENERATE_SERIES(
        '2020-09-01 00:00:00'::timestamp,
        '2020-09-15 23:59:59'::timestamp,
        INTERVAL '2592 seconds'
      ) target
  ) series

When this is evaluated against the same data set as the original row_number() based solution:

                                                                                                                                                                                                                          QUERY PLAN                                                         >
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Function Scan on generate_series target  (cost=0.00..570117.03 rows=1000 width=32) (actual time=0.075..7.527 rows=500 loops=1)
   SubPlan 1
     ->  Limit  (cost=23.52..570.11 rows=1 width=32) (actual time=0.015..0.015 rows=1 loops=500)
           ->  Bitmap Heap Scan on telemetries t  (cost=23.52..570.11 rows=1 width=32) (actual time=0.014..0.014 rows=1 loops=500)
                 Recheck Cond: ((server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'::uuid) AND (created_at >= target.target) AND (created_at <= (target.target + '00:43:12'::int>
                 Filter: (data ? 'load_avg'::text)
                 Heap Blocks: exact=500
                 ->  Bitmap Index Scan on telemetries_load_avg_idx  (cost=0.00..23.52 rows=142 width=0) (actual time=0.010..0.010 rows=48 loops=500)
                       Index Cond: ((server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'::uuid) AND (created_at >= target.target) AND (created_at <= (target.target + '00:43:12':>
 Planning Time: 0.130 ms
 Execution Time: 7.586 ms
(11 rows)
That is 3428X faster than the row_number() based original solution!

Truth in Advertising…

I will note that it isn’t actually quite that much faster on average. The database instance that I am conducting these tests with is a very small t3.micro Amazon RDS instance, so there ends up being some variability in the timings, with most results ranging from slightly faster than this example to about 9ms, and averaging close to 8ms. Also, if the index is not in the cache at the time that the query is executed, the performance of the query on first execution, in the above described test environment, drops to about 40ms for that first query.

In practice, for our application, this query ends up being only about 3250X faster than the original on average.

It is important to have your database tuned properly for your data set, and that your buffers are large enough to hold a reasonable working set. If the database engine has to pull very large amounts of data from storage, overall query performance will suffer simply as a consequence of waiting on IO operations.

A Few Final Tweaks

For the base query that lacks the (data ? 'load_avg'::text) filter, the execution time is further cut in half.

Planning Time: 0.113 ms
Execution Time: 3.736 ms

This is of limited utility in our use case, as we want to be able to specifically select the data that is being graphed, but it serves to illustrate just how fast the base technique can be.

This is far faster than I had hoped or expected when I started working through a better solution to this problem. Allowing the database to generate an interval, and then accepting the first record that the database finds in that interval generates a result set that is distributed evenly across the total time set very quickly, at the cost of not being guaranteed an absolutely precise period between the data points. In testing this experimentally, data points tend to be distributed around the time interval, but there are occasional pairs of points where one falls near the end of it’s interval, and the next near the beginning, or vice versa.

If a precise period between data points is desired, this technique can still deliver excellent results. Simply adding an ORDER BY t.created_at before the LIMIT clause guarantees that each data point is the first one to be found in it’s interval. i.e.

SELECT
  (
.
.
.
    ORDER BY
      t.created_at
    LIMIT 1
.
.
.
  ) series;

Without the ORDER BY, the data set ends up looking something like this:

 ("[""load_avg"", ""0.00""]","2020-09-01 00:08:29.943824")
 ("[""load_avg"", ""0.00""]","2020-09-01 01:11:30.024431")
 ("[""load_avg"", ""0.00""]","2020-09-01 01:45:29.944221")
 ("[""load_avg"", ""0.01""]","2020-09-01 02:30:29.903342")
 ("[""load_avg"", ""0.00""]","2020-09-01 03:28:29.915026")
 ("[""load_avg"", ""0.00""]","2020-09-01 04:00:30.006452")
 ("[""load_avg"", ""0.00""]","2020-09-01 04:28:29.926444")
 ("[""load_avg"", ""0.07""]","2020-09-01 05:02:30.383213")
 ("[""load_avg"", ""0.00""]","2020-09-01 05:47:29.911266")
 ("[""load_avg"", ""0.00""]","2020-09-01 06:29:30.010537")
 ("[""load_avg"", ""0.00""]","2020-09-01 07:12:29.945272")
 ("[""load_avg"", ""0.00""]","2020-09-01 07:55:30.13172")
 ("[""load_avg"", ""0.00""]","2020-09-01 08:38:30.210716")
 ("[""load_avg"", ""0.00""]","2020-09-01 09:22:30.21479")
 ("[""load_avg"", ""0.00""]","2020-09-01 10:26:30.202852")
 ("[""load_avg"", ""0.00""]","2020-09-01 10:48:30.357601")

Whereas with the ORDER BY, the data points that are selected, while having significant overlap with the previous set, are much more strictly separated by the interval for this particular query (2592 second is about 43.2 minutes).

 ("[""load_avg"", ""0.00""]","2020-09-01 00:00:30.189119")
 ("[""load_avg"", ""0.00""]","2020-09-01 00:43:30.303489")
 ("[""load_avg"", ""0.00""]","2020-09-01 01:26:30.369254")
 ("[""load_avg"", ""0.00""]","2020-09-01 02:10:30.01558")
 ("[""load_avg"", ""0.07""]","2020-09-01 02:53:30.084433")
 ("[""load_avg"", ""0.06""]","2020-09-01 03:36:29.956608")
 ("[""load_avg"", ""0.00""]","2020-09-01 04:19:30.034901")
 ("[""load_avg"", ""0.07""]","2020-09-01 05:02:30.383213")
 ("[""load_avg"", ""0.00""]","2020-09-01 05:46:30.351767")
 ("[""load_avg"", ""0.00""]","2020-09-01 06:29:30.010537")
 ("[""load_avg"", ""0.00""]","2020-09-01 07:12:29.945272")
 ("[""load_avg"", ""0.00""]","2020-09-01 07:55:30.13172")
 ("[""load_avg"", ""0.00""]","2020-09-01 08:38:30.210716")
 ("[""load_avg"", ""0.00""]","2020-09-01 09:22:30.21479")
 ("[""load_avg"", ""0.00""]","2020-09-01 10:05:30.354957")
 ("[""load_avg"", ""0.00""]","2020-09-01 10:48:30.357601")

There is a performance cost to utilizing ORDER BY, since it forces the database to sort the results in each of the intervals. The cost is not necessarily a cause for concern, however:

                                                                                                                QUERY PLAN                                                                           >
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Function Scan on generate_series target  (cost=0.00..523454.21 rows=1000 width=32) (actual time=0.124..32.060 rows=500 loops=1)
   SubPlan 1
     ->  Limit  (cost=523.44..523.44 rows=1 width=379) (actual time=0.064..0.064 rows=1 loops=500)
           ->  Sort  (cost=523.44..523.44 rows=1 width=379) (actual time=0.063..0.063 rows=1 loops=500)
                 Sort Key: t.created_at
                 Sort Method: top-N heapsort  Memory: 25kB
                 ->  Bitmap Heap Scan on telemetries t  (cost=19.03..523.43 rows=1 width=379) (actual time=0.015..0.056 rows=48 loops=500)
                       Recheck Cond: ((server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'::uuid) AND (created_at >= target.target) AND (created_at <= (target.target + '00:43:12'::interval)))
                       Filter: (data ? 'load_avg'::text)
                       Heap Blocks: exact=21375
                       ->  Bitmap Index Scan on telemetries_load_avg_idx  (cost=0.00..19.03 rows=131 width=0) (actual time=0.011..0.011 rows=48 loops=500)
                             Index Cond: ((server_id = 'a0dcc312-0623-af60-4dc0-238301cc9bf8'::uuid) AND (created_at >= target.target) AND (created_at <= (target.target + '00:43:12'::interval)) AND>
 Planning Time: 0.169 ms
 Execution Time: 32.130 ms
This is still 809X faster than the original solution, with completely comparable results!

The caveat mentioned earlier about database tuning especially applies with this version of the query. Since sorting is involved, it is much more sensitive to variance in the data that is in the working set. Even in the worst case, where none of the data is in the working set, though, this query is still several times faster than the row_number() based solution.

Final Thoughts

SQL, and PostgreSQL, are toolkits where there is almost always more than one way to do it, and where it is possible to get more deeply nuanced and refined as one puts more time into the desired outcome. There may be better techniques than this for selecting a time distributed sample of rows from a larger table, but this technique is vastly faster than the more naïve version, and it exactly delivers the results that our application requires, at a speed that provides a great user experience. At the same time, there are some things that might make it faster yet, particularly as the total data set grows. For example, since a query is always focused on a single server ID, partitioning by server_id might be a win.

In the end, though, liberal use of PostgreSQL’s EXPLAIN ANALYZE to understand what the database engine is doing as it executes queries, alongside experimentation and a lot of failures as well as small stepwise improvements as the solution was honed moved us from a solution that returned great data, but with an impractically long wait for those results, to a solution that still returns great results at a speed that was faster than I had hoped for when I started working on the reimplementation of this query.

2 thoughts on “Efficient Evenly Distributed Sampling of Time Series Records in PostgreSQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.