Our largest customer environments generate a lot of data, on the order of millions of rows per day in several tables. We wanted to partition our tables to improve performance while also making it easier to archive or delete old data. There was only one problem: we had to do it with zero downtime. Here’s how we did it.

The Problem: Too Much Data

Spec provides industry-leading Customer Journey Security, enabling realtime analysis and decision-making on data collected throughout the entirety of the customer journey, rather than just at hard-coded checkpoints. This makes our system extremely resilient to attempts to game it, but it also requires that we store a LOT of data.

For our largest customers, we write on the order of 2-5 million records per day. Our most active tables receive around 1 million records per day each. Because we often sit between our customers and their users, latency is a major concern. For our performance-critical pathways, we strive to keep our writes at 5 ms or faster.

As you might imagine, over time, this winds up being a lot of data. As of May of last year, we were starting to reach a point where our tables were getting very large, making them harder to query effectively, ballooning our storage costs, and making even the most trivial of database migrations rather fraught affairs.

However, there was a silver lining. For most of our write-heavy tables, our reads only ever cared about recent data. Of course, we do store historical data that is used as part of our realtime analysis, but that is extracted from the massive incoming firehose and stored in tables specially designed for it. We had also recently completed a project that used change data capture and outbox tables to exfiltrate incoming data to both S3 for long-term storage and an analytics database for flexible user-facing queries. As such, we were free to get rid of old data.

Of course, dropping old records is easier said than done at scale. We tried adding background jobs to roll through and delete old data, but these were either too CPU-heavy to avoid affecting the latency of other, more important queries, or they were unable to keep up with the rate of incoming data.
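As a rough illustration of the kind of job we mean, a batched delete along these lines (the table name, retention window, and batch size here are purely illustrative) either burns CPU or falls behind at this scale:

-- Illustrative batched cleanup: delete a small chunk of old rows per run.
-- Small batches keep locks and WAL spikes manageable, but at millions of
-- new rows per day this never catches up without hurting other queries.
DELETE FROM events
WHERE id IN (
    SELECT id
    FROM events
    WHERE created < now() - interval '30 days'
    ORDER BY created
    LIMIT 5000
);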

The Solution: Partitions

It was obvious from the beginning that the right way to solve our problems was to use Postgres’ native partitioning. If we could partition our tables on their time of record creation, we could use pg_partman to manage the creation of new tables and automatically drop old ones. The only issue was that we had ~5 TB of existing data, with our database growing by several GB per day, and we knew from past experience that trying to negotiate downtime with our customers could mean waiting for several months to actually deploy anything. This was also severely complicated by the fact that we would have to negotiate separate downtimes with each of our customers, since they all get their own databases. So, if we were going to switch to partitions, we had to do it in-place, with the engine running, without causing any hiccups or degradation of service. Easy, right?

The Plan

We spent a lot of time thinking about how to accomplish this goal, and the final plan wound up being relatively straightforward:

  1. Create partitioned shadow tables for each of the tables we wanted to partition, e.g. for events, create events_partitioned
  2. Set up pg_partman to manage creation of future partitions and dropping of ones beyond our desired retention time
  3. Set up a trigger on each of the old tables so that when any record is written to it, it is also written to the corresponding partitioned table
  4. Allow enough time to pass for the shadow tables to fill up with data
  5. Update all queries hitting the tables being partitioned to ensure that they included the partition keys
  6. Compare performance of all performance-sensitive queries against both old and new tables, updating indexes on new tables if needed
  7. In a single transaction, rename the old tables and create new views with the original table names pointing to the partitioned tables, e.g. rename events to events_old, and create the view events pointing to events_partitioned
  8. Profit?

That plan largely worked as written! However, there were some points of complexity and some nitty-gritty details that didn’t quite work out as planned. We’ll go into those details in the next section, but if you’re looking for a high-level overview, the steps above are essentially what we did, in the end rolling out to all of our customers with zero downtime and zero performance degradation.
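For the curious, here is a rough sketch of what steps 1 through 3 looked like in SQL. The schema is heavily simplified, the partition interval and retention are illustrative, and the create_parent call uses the pg_partman 4.x signature; treat this as a sketch rather than our exact migration.

-- Step 1: partitioned shadow table (simplified schema).
CREATE TABLE events_partitioned (
    id      uuid NOT NULL,
    payload jsonb,
    created timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created);

-- Step 2: let pg_partman create future partitions and expire old ones.
SELECT partman.create_parent(
    p_parent_table := 'public.events_partitioned',
    p_control      := 'created',
    p_type         := 'native',
    p_interval     := 'daily',
    p_premake      := 4
);

UPDATE partman.part_config
SET retention = '30 days',
    retention_keep_table = false
WHERE parent_table = 'public.events_partitioned';

-- Step 3: mirror every write to the old table into the shadow table.
-- (Any type conversions, e.g. text -> uuid, would happen here as well.)
CREATE OR REPLACE FUNCTION mirror_events_insert() RETURNS trigger AS $$
BEGIN
    INSERT INTO events_partitioned (id, payload, created)
    VALUES (NEW.id, NEW.payload, NEW.created);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER events_mirror
AFTER INSERT ON events
FOR EACH ROW EXECUTE FUNCTION mirror_events_insert();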

Why use views?

Why use views, rather than just renaming the partitioned tables to the names of the original tables? Well, in addition to partitioning, we were also doing some type updates, e.g. from text to UUID and from int to bigint. The views allowed us to do type conversions on INSERT and SELECT to maintain a consistent interface, which allowed us to then do a multi-step process of pointing writes to the backing tables, then swapping the types on the views, then swapping writes back to the views. In general, having the views as an interface layer has been really nice, so I recommend setting things up that way from the beginning for high-availability tables!
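As a concrete (and again heavily simplified) illustration of step 7, the cutover transaction looked roughly like this, with the text-to-UUID conversion standing in for the type changes described above:

BEGIN;

-- Move the old table out of the way.
ALTER TABLE events RENAME TO events_old;

-- Recreate the original name as a view over the partitioned table,
-- casting columns so the application-facing interface is unchanged.
CREATE VIEW events AS
SELECT id::text AS id,   -- backing table stores uuid; view still exposes text
       payload,
       created
FROM events_partitioned;

-- Writes through the view that need a conversion can be handled with an
-- INSTEAD OF INSERT trigger on the view (not shown here).

COMMIT;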

Complications

Plan Time and Prepared Statements

Whenever you are analyzing your postgres queries with EXPLAIN ANALYZE, it is important to look not only at execution time, but also plan time. The plan time is how long it takes the query planner to decide how to actually go about executing your query, while the execution time is how long it takes to execute the resulting plan. In general, plan time is amortized via the use of prepared statements, which are able to reuse an earlier plan. Assuming you are using some kind of database framework to issue your SQL queries, you are almost certainly using prepared statements under the hood, which means much of the time, you can ignore the impact of a slow plan time.

However, for performance-critical queries, it is vital to ensure that the plan time remains acceptable when executing your query as a prepared statement. You can do this by preparing your statement manually and then running EXPLAIN ANALYZE on the execution of that statement. For example:

PREPARE mystatement (uuid) AS 
SELECT * FROM events WHERE id = $1;

EXPLAIN ANALYZE EXECUTE mystatement(uuid_generate_v4());

Be sure to run EXPLAIN ANALYZE EXECUTE several times: by default, postgres plans the first five executions of a prepared statement with custom plans and only then considers switching to the cached generic plan, so a few runs are needed to see the performance of your query as it will actually be executed by your framework.

We found two issues with plan caching. First, the planner was much better at reusing cached plans when we reduced the variable parts of our queries as much as possible, and second, partitions don’t necessarily play nicely with the plan cache.

Reducing Variable Query Parts

As mentioned earlier, once we had our shadow tables in place, we had to go through every query that was hitting the original tables and update it to include the partition key in its filters. Including the partition key allows postgres to perform partition pruning, where it selects a subset of partitions of the parent table against which to execute the query. This is important! If you have 100 child partitions in a partitioned table, a query without pruning is going to have to execute against all 100 tables in order to return results.

Since all of our performance-sensitive queries on these tables were only looking at recent data, we were updating a lot of queries to look like:

SELECT * FROM events 
    WHERE id = $1 
    AND created BETWEEN now() - interval '7 days' AND now();

Note the inclusion of the upper bound of now()! When you are pre-creating partitions, it is important to include the upper bound so that postgres does not have to scan future tables. I have read that upcoming versions of postgres will do a better job of automatically excluding empty tables from the plan, but for now, at least, including this bound is essential.

With a query like the one above, though, we were seeing subpar planning times, even with prepared queries, often taking several milliseconds (which was unacceptable for our most performance-sensitive queries). Some experimentation showed that the query planner was much better at reusing plans when we removed as much inline calculation and variability from the query as possible. So, we updated queries like the above to something like:

SELECT * FROM events 
    WHERE id = $1 
    AND created BETWEEN $2 AND $3;

Passing in the time bounds as arguments significantly improved our plan times and the ability of the planner to reuse cached plans.
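You can verify the improvement with the same PREPARE / EXPLAIN ANALYZE EXECUTE technique from above; the statement name and bounds below are just for illustration:

PREPARE recent_events (uuid, timestamptz, timestamptz) AS
SELECT * FROM events
    WHERE id = $1
    AND created BETWEEN $2 AND $3;

-- Run this several times and watch the planning time settle once the
-- generic plan is cached and reused.
EXPLAIN ANALYZE EXECUTE recent_events(
    uuid_generate_v4(),
    now() - interval '7 days',
    now()
);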

The Plan Cache and Partitions

The other problem that we ran into was inconsistent plan cache performance when running queries against partitioned tables. The queries would use the plan cache effectively for some time, and then we would see significant degradation in planning time. Some investigation led us to this post by postgres developer Amit Langote: On performance of prepared statements with partitioning. In it, he explains that when postgres is evaluating whether to use a cached plan, it must acquire read locks on all of the relations the plan scans. For a query against a partitioned table, the generic plan does not know which partitions will need to be scanned at the time of plan cache evaluation, and so it must acquire read locks on ALL of the partitions, even though runtime partition pruning will remove most of them. This takes more time, obviously, but even worse, the planner will sometimes interpret the fact that it might need to scan every partition as meaning that it will be cheaper to use a custom plan than the generic, cached plan, which leads to it failing to use the plan cache at all. This was a particular problem for us, given that some of our more complex queries could take over 50 ms to plan!

There are a few ways to resolve this issue. One, recommended in the blog post as a workaround, is to set plan_cache_mode = force_generic_plan, so that postgres will not use custom plans even when it thinks the generic plan may be more expensive. However, this has the potential to affect other queries where postgres’ default logic is correct. In our case, we were able to resolve the issue by reducing the total number of child partitions on a given table, i.e. by pre-creating fewer future partitions. With fewer partitions to consider, the planner much more consistently judges the generic plan to be cheaper and reuses it.
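Both knobs are small changes. The sketch below shows the force_generic_plan workaround alongside the pg_partman premake reduction we went with; the premake value is illustrative and should be tuned to your own retention and query windows.

-- Workaround from the blog post: always reuse the generic, cached plan.
-- Can be set per session, per role, or globally in postgresql.conf.
SET plan_cache_mode = force_generic_plan;

-- Our approach: pre-create fewer future partitions so the generic plan
-- has fewer relations to lock and looks cheaper to the planner.
UPDATE partman.part_config
SET premake = 2
WHERE parent_table = 'public.events_partitioned';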

Performance of Shared Queries

In preparation for swapping over to our partitioned tables, we tried to update all queries hitting those tables to include partition keys. Mostly, as mentioned above, this was a matter of adding upper and lower time bounds to the query. This allowed us to update the queries such that they would still work on the original tables while also working efficiently against the partitioned tables at the point when we swapped over.

However, we had one query where this approach did not work! We were going from a query that looked like this (very much simplified):

SELECT * FROM events
    WHERE created_utc > now() - interval '7 days';

and adding what seemed like a trivial upper bound to the time filter:

SELECT * FROM events
    WHERE created_utc BETWEEN now() - interval '7 days' AND now();

Unfortunately, for whatever reason, the addition of the upper bound absolutely wrecked the performance of the query on the existing table, going from < 5 ms to hundreds of ms! No matter how many variants I tried, I could not get the planner to stick with the plan it had used without the upper bound. Meanwhile, failing to include the upper bound meant that the query had to scan empty future partitions, increasing the execution time from < 5 ms to up to 80 ms.

So, we were in a bit of a bind. We couldn’t write a query that would work both before and after our swap to partitioned tables, with the performance degradation of the query in either direction being unacceptable. In our deployment pipeline, our database is migrated prior to the applications being updated, so we had to be sure that the application would work efficiently in both states (i.e. we couldn’t just deploy the DB update with an update to the query to work well with the partitioned tables).

Our solution here was to add a small control table, to which we wrote a row with a key like use_partition_compatible_queries and a value of false. When we deployed the migration to swap to partitioned tables, we could flip this to true in the same transaction. Then, in the application layer, we could check the value of this row and use it to dynamically determine whether to issue the old query or the new one. Reading the flag was a very fast query, taking 1-2 ms at most, so the performance hit was substantially smaller than either sticking with the old query on the partitioned tables or trying to use the partition-friendly query on the old tables.
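A rough sketch of that control table and the flag flip we ran inside the cutover transaction (the table and column names here are illustrative):

CREATE TABLE migration_flags (
    name  text PRIMARY KEY,
    value boolean NOT NULL
);

INSERT INTO migration_flags (name, value)
VALUES ('use_partition_compatible_queries', false);

-- In the same transaction that renames the old tables and creates the
-- views, flip the flag so the application switches query paths.
UPDATE migration_flags
SET value = true
WHERE name = 'use_partition_compatible_queries';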

Once we verified the swap to partitioned tables worked as intended, we were able to do a quick code deploy to remove this dynamic path.

Partman Locks on Cleaning Up Old Partitions

After our shadow tables had been deployed for some time, they reached the point where partman’s nightly maintenance would start pruning old partitions. The first time this happened, I wound up being woken up by PagerDuty at 2 AM because new inserts had been blocked on the original table for several minutes. Just as I was getting my computer open, the issue resolved and writes were able to continue.

Digging around the next day, we came to the conclusion that it was a confluence of two factors. First, partman’s logic to clean up old partitions happens at the beginning of the maintenance procedure for a table. You can see this here. Second, in combination with literally any other issue causing the remainder of partition maintenance to take a long time, e.g. this, the partman process will hold an ACCESS EXCLUSIVE lock for the entire duration. Writing this blog post reminded me that I intended to open an issue about this in the partman repo, so I finally did.

In the meantime, we wrote our own logic in a background job that goes through each child partition, parsing the partition bounds from the output of the following query:

SELECT c.relname,
       n.nspname,
       pg_get_expr(c.relpartbound, c.oid) as bounds
FROM pg_partition_tree($1::text::regclass) p
     JOIN pg_catalog.pg_class c ON c.oid = p.relid
     JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE p.isleaf IS TRUE;

From there, we are able to check the age of each table and determine whether or not it should be dropped. When we drop, we first detach the partition (and commit) to ensure the lock on the parent table is held as briefly as possible, and then we drop the table after doing any needed maintenance or archival tasks.
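The detach-then-drop dance looks roughly like the following (the partition name is illustrative); the key point is committing the detach immediately so the exclusive lock on the parent is released before any slower archival work:

-- Step 1: detach and commit right away, so the ACCESS EXCLUSIVE lock on
-- the parent table is held only for the duration of the detach.
-- (On Postgres 14+, DETACH PARTITION ... CONCURRENTLY avoids it entirely.)
BEGIN;
ALTER TABLE events_partitioned DETACH PARTITION events_partitioned_p2024_01_01;
COMMIT;

-- Step 2: archive or verify as needed, then drop the standalone table.
DROP TABLE events_partitioned_p2024_01_01;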

Conclusion

Swapping to partitioned tables has been a huge benefit for us. Our performance is more consistent and easier to optimize, our databases require much less storage, and we’re able to perform migrations much more quickly and safely. Inserts are significantly faster, given that the indexes they are updating are much smaller, and our overall write latency across all DB operations has dropped substantially, along with CPU usage and IOPS.

Through the use of views as an intermediary layer, we were able to maintain full interface compatibility with our old tables, despite some type changes in the new tables. By carefully planning, preparing queries, and measuring against both the original tables and the shadowed partition tables, we were able to ensure that we maintained our existing high level of performance both pre- and post-swap. And, ultimately, we were able to exercise the swap with zero customer downtime!

There are many minor details that weren’t included in this writeup. If you’re curious or want to know more, feel free to email me at matthew@spec-trust.com.