Is it possible or advisable to stream/pipe pg_dump output to S3?
We load large datasets into our instance and the database is big, so we are trying to save local disk space (avoid temp space for the dump) and create the backup straight on S3.
We are running PostgreSQL 9.6.3 on Ubuntu 16.04.
pg_dump streaming directly to S3 seems to work fine. I have a 350 GB database and do not want to create additional temporary drives. You need to make sure that the multipart chunk size is big enough, otherwise you run into a 'too many segments' issue (S3 multipart uploads are limited to 10,000 parts). With the AWS CLI, the commands look roughly like the sketch below.
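A minimal sketch, assuming a placeholder bucket and database name and a 200 MB part size (a ~130 GB upload then stays well under the 10,000-part limit):

    # raise the multipart chunk size so a large streamed upload
    # stays under S3's 10,000-part limit
    aws configure set default.s3.multipart_chunksize 200MB

    # plain-format dump, compressed on the fly and streamed to S3
    pg_dump -Fp mydb | gzip | aws s3 cp - s3://my-backup-bucket/mydb.sql.gz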
With my database it took about 8 hours and the result was a 130 GB file in S3. Restoring then has to be done with psql, as pg_restore does not accept the plain SQL dumps that the command above creates. I could not use the directory format there, as it wants to create a directory, which (probably?) cannot be piped.
Finally, restore the same way, without saving an intermediate file. Note that I had to uncompress the data with zcat before feeding it to psql:
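Again a sketch with the same placeholder names:

    # stream the dump back from S3, uncompress it and feed it to psql
    aws s3 cp s3://my-backup-bucket/mydb.sql.gz - | zcat | psql mydb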
Restoring seems to take about the same time (~8 hours) as dumping; this probably depends on where your server runs and how big it is (AWS or somewhere else; mine is outside AWS).
You can use S3's multipart upload feature to stream the dump as it is being generated, but that is likely to be error-prone and less than reliable. A better approach is to create an ephemeral EBS volume, dump your database to it, and then upload the compressed backup to S3/Glacier if that is where it needs to go.
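A rough sketch of that approach, assuming the EBS volume is mounted at /mnt/backup and using placeholder names:

    # custom-format dump (compressed by default) onto the ephemeral EBS volume
    pg_dump -Fc mydb > /mnt/backup/mydb.dump

    # then ship the finished file to S3
    aws s3 cp /mnt/backup/mydb.dump s3://my-backup-bucket/mydb.dump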
If you want a backup for point-in-time recovery, running pg_basebackup to an EBS volume and archiving the WAL stream from the point after the backup means you can cut the time to recovery without keeping a full replica node. If your concern is availability, then setting up replication is the way to go, although you will still want backups. Replication is not backup: if someone drops a table on the origin it will be dropped on the replica, so you still need PITR or checkpoint backups.
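A minimal sketch of that setup, with placeholder paths and bucket names (the archive settings go in postgresql.conf):

    # base backup in compressed tar format onto the EBS volume
    pg_basebackup -D /mnt/backup/base -Ft -z

    # postgresql.conf: archive every completed WAL segment to S3
    archive_mode = on
    archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'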
No, it's not wise. Instead, set up actual replication, which PostgreSQL supports. I would use the subscriber model, but you can also ship the WAL log to S3 if you want, using archive_command. However, that's mostly unnecessary; I wouldn't consider it unless I had a more specialized use case.
I would upgrade to 10.1 and jump on Logical Replication with the subscriber model.
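A minimal sketch of a logical replication setup on PostgreSQL 10+, run through psql, with placeholder host, database, publication and subscription names:

    # on the publisher (origin) database
    psql -d mydb -c "CREATE PUBLICATION mypub FOR ALL TABLES;"

    # on the subscriber database
    psql -d mydb -c "CREATE SUBSCRIPTION mysub CONNECTION 'host=publisher.example.com dbname=mydb user=replicator' PUBLICATION mypub;"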