Presented by:

Sai Srirampur

PeerDB Inc

I am the Co-founder and CEO of PeerDB. Prior to PeerDB, I worked at Microsoft, leading solutions engineering for all Postgres services on Azure. Before that, I was an early employee at Citus Data and saw it through the Microsoft acquisition. For the past 8 years, I have been an active member of the Postgres community, helping customers implement Postgres and Citus.

No video of the event yet, sorry!

pg_dump and pg_restore are reliable tools for backing up and restoring Postgres databases. They are widely used for various real-world use cases, including database migrations, disaster recovery, and more. They offer fine-grained control, allowing selection of specific database objects for backup and restore, a choice of dump formats (plain text, or compressed custom and directory formats), and parallel processing of multiple tables. Furthermore, they guarantee dumping and restoring a consistent snapshot of the database.
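For example, a directory-format dump can process several tables in parallel, and the restore can do the same. These are standard pg_dump and pg_restore invocations; the database names, table name, and path are placeholders:

    # dump sourcedb in directory format with 8 parallel workers,
    # restricted to the "orders" table; then restore with 8 workers
    pg_dump -Fd -j 8 -t orders -f /backups/sourcedb.dir sourcedb
    pg_restore -j 8 -d targetdb /backups/sourcedb.dir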

However, pg_dump and pg_restore only parallelize across tables: each individual table is dumped and restored by a single process. This significantly slows down dumps and restores of databases with a star schema (common in real-world apps), i.e. a few large tables and many small ones. For such databases with over 1TB of data, pg_dump and pg_restore can take multiple days, resulting in increased downtime during migrations and higher Recovery Time Objectives (RTOs) in DR scenarios.

In this talk, we'll discuss an idea to parallelize pg_dump and pg_restore at the individual table level while still ensuring consistency. It involves creating a Postgres snapshot, using that snapshot to logically partition a large table based on CTIDs, and then dumping/restoring each partition in parallel. As enhancements: a) each partition can be read from the source and written to the target simultaneously, enabling pg_dump | pg_restore, and b) by tracking which partitions have already been dumped and restored, pg_dump and pg_restore can resume from where they left off after a failure, instead of restarting from the beginning.
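To make the core idea concrete, here is a minimal sketch of snapshot-synchronized, CTID-range-parallel dumping. It is not the talk's actual implementation: it assumes Python with psycopg2, and the connection string, table name, and partition count are placeholders. A coordinator exports a snapshot with pg_export_snapshot() and keeps its transaction open; each worker imports that snapshot via SET TRANSACTION SNAPSHOT and COPYs out one heap block range of the table, so all workers see exactly the same data.

    import math
    from concurrent.futures import ThreadPoolExecutor

    import psycopg2

    DSN = "dbname=sourcedb"  # placeholder connection string
    TABLE = "big_table"      # placeholder table name (trusted, not user input)
    PARTITIONS = 8           # number of parallel workers

    def dump_partition(snapshot_id, lo, hi, part_no):
        """Dump rows whose ctid falls in heap block range [lo, hi)."""
        conn = psycopg2.connect(DSN)
        conn.autocommit = True  # we manage BEGIN/COMMIT explicitly
        try:
            with conn.cursor() as cur:
                # SET TRANSACTION SNAPSHOT must be the first statement of a
                # REPEATABLE READ transaction; all workers then share one snapshot.
                cur.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
                cur.execute("SET TRANSACTION SNAPSHOT %s", (snapshot_id,))
                copy_sql = (
                    f"COPY (SELECT * FROM {TABLE} "
                    f"WHERE ctid >= '({lo},0)'::tid AND ctid < '({hi},0)'::tid) "
                    f"TO STDOUT"
                )
                with open(f"{TABLE}.part{part_no}", "wb") as out:
                    cur.copy_expert(copy_sql, out)
                cur.execute("COMMIT")
        finally:
            conn.close()

    def main():
        # Coordinator: export a snapshot and hold the transaction open until
        # every worker finishes, which is what makes the dump consistent.
        coord = psycopg2.connect(DSN)
        coord.autocommit = True
        with coord.cursor() as cur:
            cur.execute("BEGIN ISOLATION LEVEL REPEATABLE READ")
            cur.execute("SELECT pg_export_snapshot()")
            snapshot_id = cur.fetchone()[0]
            # Partition by heap block number (the first component of a ctid).
            cur.execute(
                "SELECT pg_relation_size(%s) / current_setting('block_size')::int",
                (TABLE,),
            )
            nblocks = cur.fetchone()[0]
            step = max(1, math.ceil(nblocks / PARTITIONS))
            with ThreadPoolExecutor(max_workers=PARTITIONS) as pool:
                jobs = [
                    pool.submit(dump_partition, snapshot_id, lo,
                                min(lo + step, nblocks), n)
                    for n, lo in enumerate(range(0, nblocks, step))
                ]
                for job in jobs:
                    job.result()  # re-raise any worker failure
            cur.execute("COMMIT")
        coord.close()

    if __name__ == "__main__":
        main()

Piping each worker's COPY ... TO STDOUT into a matching COPY ... FROM STDIN on the target yields the streaming pg_dump | pg_restore variant, and persisting which block ranges have completed yields resumability. On Postgres 14 and later, the ctid predicates can be satisfied by TID range scans rather than full table scans.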

This idea has already been implemented in PeerDB, an open-source Postgres replication tool. The benefits have been significant, enabling the replication of large production databases with over a TB of data in just a few hours instead of days. We will share these insights and assess how they can be integrated into upstream Postgres!

Date: April 19, 2024, 16:00 PDT
Duration: 20 min
Room: San Pedro
Conference: Postgres Conference 2024
Language: English
Track: Dev
Difficulty: Intermediate