Homogeneous Migration of Large PostgreSQL Databases to Cloud SQL using Native Replication

Trushar Borse
Google Cloud - Community
8 min read · Aug 9, 2023


Collaborators: Trushar Borse, Somdyuti Paul

Background:

The first tool of choice for homogeneous PostgreSQL migrations to GCP Cloud SQL for PostgreSQL or AlloyDB for PostgreSQL should be Database Migration Service (DMS).

However, there are scenarios (usually in self-managed, customized Postgres instances) such as unsupported extensions or a large number of tables without a primary key that require looking beyond DMS, especially for larger databases (5 TB+). In those cases you can leverage either:

[a] Logical Replication based low downtime migration

OR

[b] GCP partner solutions such as Striim or Fivetran (which also use the logical replication capabilities of the Postgres engine under the hood) to achieve a minimal-downtime migration.

The remainder of this document focuses on [a].

1. Target (Cloud SQL PG) Pre-requisites

a. Create a Cloud SQL destination instance with the same configuration as the source. Consider using a larger shape until the end of cutover to expedite the process; refer to the parameters in the Appendix to tune the build process of the target.

b. Enable point-in-time recovery around the cutover window.

c. Enable HA only after the migration is complete; running without HA during the migration yields better backfill and CDC performance.

d. Add all relevant DB parameters; please refer to the Appendix below for both source and destination settings.

e. Create the Read replica if required.

f. Keep backups disabled on the Cloud SQL instance for the duration of hydration (backfill + CDC).

g. Configure a maintenance window for the Cloud SQL instance.

h. Enable all supported extensions that are used in the source PG.

i. Disable or drop the foreign key constraints on the destination databases.

Create an ALTER statement for each foreign key constraint:
ALTER TABLE <table_name> DROP CONSTRAINT <fk_constraint_name>;
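
To generate these statements for every foreign key in the database (and to capture the definitions needed to re-create the constraints after cutover), queries like the following can be run on the target; a sketch:

-- capture ADD CONSTRAINT statements for re-creation after cutover
select format('ALTER TABLE %I.%I ADD CONSTRAINT %I %s;', n.nspname, c.relname, con.conname, pg_get_constraintdef(con.oid))
from pg_constraint con
join pg_class c on c.oid = con.conrelid
join pg_namespace n on n.oid = c.relnamespace
where con.contype = 'f';

-- generate the DROP CONSTRAINT statements
select format('ALTER TABLE %I.%I DROP CONSTRAINT %I;', n.nspname, c.relname, con.conname)
from pg_constraint con
join pg_class c on c.oid = con.conrelid
join pg_namespace n on n.oid = c.relnamespace
where con.contype = 'f';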

j. Drop the secondary indexes from the target instance after taking the schema dump.
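
The index definitions can be captured from pg_indexes before dropping, so they can be recreated after cutover; a sketch (the pg_constraint filter is a heuristic to skip indexes that back PK/unique constraints, which cannot be dropped directly):

-- save CREATE INDEX statements for re-creation after cutover
select indexdef || ';'
from pg_indexes
where schemaname not in ('pg_catalog', 'information_schema');

-- generate DROP INDEX statements for secondary indexes
select format('DROP INDEX %I.%I;', schemaname, indexname)
from pg_indexes
where schemaname not in ('pg_catalog', 'information_schema')
and indexname not in (select conname from pg_constraint);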

2. Source Pre-requisites

  1. Perform the following prerequisite setup:

a. Set up logical decoding. Refer to the Appendix for the parameters to be set.
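
At a minimum, wal_level must be set to logical (this requires an instance restart; on managed services it is set via the instance's flags/parameter group instead):

ALTER SYSTEM SET wal_level = logical;
-- verify after restart
SHOW wal_level;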

b. Create a read-only user with replication permission.
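
A minimal sketch, assuming a self-managed source and an illustrative role name (on managed sources such as AWS RDS, replication is instead granted via the provider's role, e.g. rds_replication):

CREATE USER migration_user WITH LOGIN REPLICATION PASSWORD '<password>';
GRANT USAGE ON SCHEMA public TO migration_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO migration_user;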

2. Enable REPLICA IDENTITY FULL for tables without a primary key; the required ALTER statements can be generated as shown below.
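
A sketch that generates the statements for every table lacking a primary key (adjust the schema filter as needed):

select format('ALTER TABLE %I.%I REPLICA IDENTITY FULL;', n.nspname, c.relname)
from pg_class c
join pg_namespace n on n.oid = c.relnamespace
where c.relkind = 'r'
and n.nspname not in ('pg_catalog', 'information_schema')
and not exists (select 1 from pg_constraint con where con.conrelid = c.oid and con.contype = 'p');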

3. No DDL operations (creating or dropping tables or columns) must be executed during the initial load and CDC process.

4. To minimize the dump and restore time, the following activities can be performed on the source DB instance:

a. Clean up tables that do not require migration.

b. Run VACUUM on the database.

c. Scale up the compute of the source DB instance.

5. Collect the last WAL position.
SELECT pg_current_wal_lsn();

3. Initial Load

Dump and restore using pg_dump & pg_restore

[a] Take a dump of the source database

[b] Transfer it to a GCP GCE VM with the Postgres client installed on it

[c] Restore it to the target Cloud SQL PG instance using the pg_restore utility

NOTE: If there are only a few small tables without a PK, the initial load can be performed only for the tables with a PK. Tables without a PK can then be migrated in one go using a dump-and-restore approach during the cutover window (adding a bit to the downtime window). The benefit is not needing to enable REPLICA IDENTITY FULL for those non-PK tables, while still leveraging the benefits of pglogical replication with an initial backfill via the pg_dump and pg_restore approach. Process for the same is here.

Take a schema-only dump before initiating the full data dump:

pg_dump -h ipaddress -U <user_name> -d AllTables --schema-only -Fp -v -f /pgsql-backup/schema/<schema_name1>.sql

Copy the users (Run it as Postgres user):

/usr/bin/pg_dumpall -h ipaddress_primary -p port -U user --roles-only > /pgsql-backup/roles/read_roles.sql

NOTE: If the source Postgres instance is a managed service such as AWS RDS, Azure Database for PostgreSQL, or GCP Cloud SQL, use the two-step process below to capture users and roles.

Step 1. Export/import Roles without passwords

Pipe pg_dumpall into psql to export users and roles without passwords.

PGPASSWORD=***** pg_dumpall -h <<SOURCE_IP>> -U postgres -r --no-role-passwords | sed '/cloudsqladmin/d;/cloudsqlagent/d;/cloudsqliamserviceaccount/d;/cloudsqliamuser/d;/cloudsqlimportexport/d;/cloudsqlreplica/d;/cloudsqlsuperuser/d;s/NOSUPERUSER//g' | PGPASSWORD=******* psql -h <<TARGET_CLOUDSQL>> -U postgres

Step 2: Generate ALTER USER password statements and apply them on the target.

Run a query on the source that generates the ALTER USER commands (this requires SELECT privilege on the pg_shadow view) and pipe the output into psql connected to the target instance.

PGPASSWORD=***** psql -h <<SOURCE_IP>> -U postgres -t -c "select 'alter user '|| usename || ' with encrypted password '|| ''''|| passwd || ''''||';' from pg_shadow where usename not like 'cloudsql%'" | PGPASSWORD=***** psql -h <<TARGET_CLOUDSQL>> -U postgres

Restore the users (Run it as Postgres user):

/usr/bin/psql -h ipaddress_replica -U <user> -d <database> < /pgsql-backup/roles/read_roles.sql

i) Take a schema-wise backup with data:

pg_dump -h ipaddress -U <user_name> -d AllTables --schema=<schema_name1> -Fd -j 16 -v -f /pgsql-backup/<schema_name1> 2> /pgsql-backup/<schema_name1>-dump.log

pg_dump -h ipaddress -U <user_name> -d AllTables --schema=<schema_name2> -Fd -j 16 -v -f /pgsql-backup/<schema_name2> 2> /pgsql-backup/<schema_name2>-dump.log

ii) Restore the schemas into the target Cloud SQL instance:

pg_restore -h ipaddress -U postgres --dbname alltables --schema=<schema_name1> -j 12 -v "/pgsql-backup/<schema_name1>" 2> /pgsql-backup/<schema_name1>-restore.log

pg_restore -h ipaddress -U postgres --dbname alltables --schema=<schema_name2> -j 12 -v "/pgsql-backup/<schema_name2>" 2> /pgsql-backup/<schema_name2>-restore.log

4. CDC using Postgres Native Replication

Enable native logical replication to replicate the changes made at the source since the beginning of the initial dump.

Define an appropriate number of publishers and subscribers to carry out the sync operation.

To create a Consistent Backup

When a new replication slot is created using the streaming replication interface, a snapshot is exported; the same effect can be achieved manually with pg_export_snapshot().

You can then use that snapshot ID to dump/export the database; all delta changes made while the export runs are captured by the replication slot and subsequently applied on the replica.

This combination is faster than the plain copy+CDC of native logical replication (you can run the export in parallel, making use of the available vCPUs) and still ensures consistent data.

Example:-

Publisher side/Primary Instance

Session 1:

test=> begin;

BEGIN

test=*> select pg_export_snapshot(); <-- Here we manually obtain the snapshot ID, which would otherwise be generated internally by the replication slot creation command.

pg_export_snapshot
---------------------
00000007-0000000A-1

test=*> SELECT * FROM pg_create_logical_replication_slot('subscription1', 'pgoutput'); <-- This guarantees that changes are captured after snapshot 00000007-0000000A-1. (Note: a slot consumed by a native subscription must use the pgoutput plugin.)

test=*> create publication pub for all tables;

Use the snapshot to take a consistent database export:

pg_dump -h <hostname_primary> -U postgres -d <dbname> --format=d --file=test --snapshot=00000007-0000000A-1 -j 24

When the logical replication slot is created, it establishes the database snapshot at '00000007-0000000A-1', and you then use the same snapshot ID to run parallel exports. All delta changes since '00000007-0000000A-1' will be published on the replication slot subscription1 for your replica/subscriber to apply.
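
Alternatively, as noted earlier, creating the slot over the streaming replication interface exports a matching snapshot atomically, removing any window between the snapshot and the slot position. A sketch (the walsender syntax shown is the pre-v15 form; newer versions also accept CREATE_REPLICATION_SLOT ... (SNAPSHOT 'export')):

-- run on a connection opened with replication=database, e.g.
-- psql "host=<hostname_primary> dbname=test user=postgres replication=database"
CREATE_REPLICATION_SLOT subscription1 LOGICAL pgoutput EXPORT_SNAPSHOT;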

Subscription Side

Create the subscription. Use create_slot => false and slot_name = 'subscription1' (the name of the logical replication slot created on the primary). Also set copy_data => false, since the pre-existing data/initial load is done using pg_restore.

Once pg_dump completes, use pg_restore to load the initial data.
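
A sketch of the subscription DDL (connection values are illustrative). Creating the subscription disabled and enabling it only after pg_restore completes avoids applying changes against rows that have not been restored yet:

CREATE SUBSCRIPTION sub1
CONNECTION 'host=<hostname_primary> dbname=test user=postgres password=<password>'
PUBLICATION pub
WITH (create_slot = false, slot_name = 'subscription1', copy_data = false, enabled = false);

-- after pg_restore finishes
ALTER SUBSCRIPTION sub1 ENABLE;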

Appendix:

Important parameters for native replication performance. These are for REFERENCE purposes only; actual parameter settings depend on the shapes of the instances under consideration.

e.g. Source Postgres (16 vCPU, 128GB RAM)

max_worker_processes = 16
max_logical_replication_workers = 16
max_parallel_workers = 16
max_replication_slots = 30
max_sync_workers_per_subscription = 4
max_wal_senders = 32
shared_buffers = 30GB
wal_buffers = 16MB
work_mem = 64MB
max_wal_size = 10GB

Destination

e.g. 20 vCPU with 128GB RAM recommended for Cloud SQL PG

max_worker_processes = 20
max_logical_replication_workers = 20
max_parallel_workers = 18
max_replication_slots = 30
max_sync_workers_per_subscription = 6
max_wal_senders = 32
work_mem = 64MB

Other considerations:

  1. Create a separate publisher for the largest table(s).
  2. Create one publisher for tables with a PK.
  3. Create one publisher for tables without a PK (each of those tables must be altered with REPLICA IDENTITY FULL to capture the changes).
  4. Similarly, on the target DB (subscriber side), create as many subscriptions, mapped 1:1 to the publications (see the sketch after this list).
  5. Disable FK constraints on tables in the destination DB.
  6. Drop secondary indexes on tables in the destination DB.
  7. Choose a sufficiently large Cloud SQL shape to speed up native replication.
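
A sketch of the publication/subscription layout described above (table, publication, and connection names are illustrative):

-- source (publisher) side
CREATE PUBLICATION pub_large FOR TABLE myschema.big_table;
CREATE PUBLICATION pub_pk FOR TABLE myschema.t1, myschema.t2;
CREATE PUBLICATION pub_nopk FOR TABLE myschema.t3;
ALTER TABLE myschema.t3 REPLICA IDENTITY FULL;

-- target (subscriber) side: one subscription per publication
CREATE SUBSCRIPTION sub_large
CONNECTION 'host=<source_ip> dbname=alltables user=postgres password=<password>'
PUBLICATION pub_large
WITH (copy_data = false);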

Check Replication Status:

Source side:

select * from pg_stat_replication;

select * from pg_replication_slots;

select p.pid, p.wait_event, p.wait_event_type, p.state from pg_stat_activity p, pg_replication_slots s where p.pid = s.active_pid;

select pid,query from pg_stat_activity where pid in (28404, 28405);
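
To quantify how far each slot's consumer lags, the byte distance between the current WAL position and the slot's confirmed flush position can be computed; a sketch:

select slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as replication_lag
from pg_replication_slots;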

Target side:

select subname, subconninfo, subslotname from pg_subscription;

select * from pg_stat_subscription;

select pg_size_pretty(pg_database_size('<db_name>'));

Quick facts on Logical replication using pglogical and Native Logical replication

1. For physical replication (streaming WAL replication), pg_replication_slots.slot_type is 'physical'. For logical replication, pg_replication_slots.slot_type is 'logical', but pg_replication_slots.plugin will have different values based on the type of logical replication used.

2. pg_replication_slots.plugin is 'pgoutput' for native logical replication (the pub/sub model available from v10+). It is 'pglogical_output' when using the pglogical extension.

3. Both the AWS and GCP database migration services migrate data using the PostgreSQL pglogical extension under the hood. GCP creates a source representation instance with metadata; the target replica instance it creates is an external primary read replica.

4. Some other CDC Tools come with different logical replication plugins.

5. We cannot perform DMLs on PostgreSQL physical standbys/read replicas (created either manually on-premises or as read replicas in AWS/GCP/Azure).

We will get the following error: "cannot execute INSERT in a read-only transaction"

6. We can perform DMLs on Logical standbys, even on the external Primary Read Replica that DMS uses.

7. Logical standbys can be a source for Database Migration Services provided by AWS/GCP. The External Primary Read Replica that GCP DMS uses cannot be a source for another DMS Job.

8. The physical standby/read replicas created in AWS/GCP/Azure cannot act as a source for DMS Jobs.

9. pg_is_in_recovery() will return 'f' for logical replica/standby instances, and pg_is_wal_replay_paused() will return an error.

10. pg_is_in_recovery() will return 't' for physical standby instances, and pg_is_wal_replay_paused() will return 'f' if streaming replication is running without errors. Also, pg_stat_wal_receiver will return data.

11. Pglogical replicates sequences but native replication does not (a cutover sketch for synchronizing sequences appears at the end of this list).

pglogical.replication_set_add_sequence and pglogical.replication_set_add_all_sequences can be used to add sequences and synchronize sequence data.

12. Pglogical allows selective replication of table rows on either the publisher or subscriber side (row_filter), whereas native replication supports row filters on the publisher side from Postgres version 15 onwards.

13. Pglogical allows selective replication of table columns on the publisher side; native replication supports publisher-side column lists from version 15 onwards.

14. Pglogical allows gathering/merging data from multiple upstream servers, and pglogical.conflict_resolution can be used to resolve conflicts when a subscriber node is subscribed to multiple providers and/or local writes are allowed on the subscriber node.

15. Pglogical does not replicate UPDATEs and DELETEs for tables that lack primary keys; only INSERTs are replicated. Native logical replication can replicate tables without a primary key, provided REPLICA IDENTITY FULL is set.

16. Pglogical copies the schema at the subscriber node (using synchronize_structure of pglogical.create_subscription), but in native replication the database schema and DDL commands are not replicated automatically.

17. Pglogical offers the option to re-synchronize a single table that has fallen behind, using pglogical.alter_subscription_resynchronize_table; in native replication, the entire subscription needs to be refreshed.
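
Because native replication does not carry sequence values (see fact 11 above), they must be synchronized manually at cutover. A sketch that generates setval() calls on the source for replay on the target (pg_sequences is available from Postgres 10 onwards; assumes unquoted identifiers):

select format('SELECT setval(%L, %s, true);', schemaname || '.' || sequencename, last_value)
from pg_sequences
where last_value is not null;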
