Is it possible or advisable to stream/pipe pg_dump output to S3?
We load large datasets into our instance and the database is big, so we are trying to save local disk space (avoid temp space for the dump) and create the backup straight on S3.
We are running PostgreSQL 9.6.3 on Ubuntu 16.04.
pg_dump streaming directly to S3 seems to work fine. I have a 350 GB database and do not want to create additional temporary drives. You need to make sure that the multipart chunk size is big enough, otherwise you run into a 'too many segments' issue (S3 multipart uploads are limited to 10,000 parts). With the AWS CLI, the commands look roughly like the sketch below.
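A minimal sketch, assuming a placeholder bucket and database name and a 200 MB part size (a ~130 GB upload then stays well under the 10,000-part limit):

    # raise the multipart chunk size so a large streamed upload
    # stays under S3's 10,000-part limit
    aws configure set default.s3.multipart_chunksize 200MB

    # plain-format dump, compressed on the fly and streamed to S3
    pg_dump -Fp mydb | gzip | aws s3 cp - s3://my-backup-bucket/mydb.sql.gz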
With my database it took about 8 hours and the result was a 130 GB file in S3. Restoring then has to be done with psql, as pg_restore does not accept the plain SQL dumps that the command above creates. I could not use the directory format there, as it wants to create a directory, which (probably?) cannot be piped.
Finally, restore the same way, without saving an intermediate file. Note that I had to uncompress the data with zcat before feeding it to psql:
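Again a sketch with the same placeholder names:

    # stream the dump back from S3, uncompress it and feed it to psql
    aws s3 cp s3://my-backup-bucket/mydb.sql.gz - | zcat | psql mydb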
Restoring seems to take about the same time (~8 hours) as dumping; this probably depends on where your server runs and how big it is (AWS or somewhere else; mine is outside AWS).
You can use S3's multipart upload feature to stream the dump as it is being generated, but that is likely to be error-prone and less than reliable. A better approach is to create an ephemeral EBS volume, dump your database to it, and then upload the compressed backup to S3/Glacier if that is where it needs to go.
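A rough sketch of that approach, assuming the EBS volume is mounted at /mnt/backup and using placeholder names:

    # custom-format dump (compressed by default) onto the ephemeral EBS volume
    pg_dump -Fc mydb > /mnt/backup/mydb.dump

    # then ship the finished file to S3
    aws s3 cp /mnt/backup/mydb.dump s3://my-backup-bucket/mydb.dump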
If you want a backup for point-in-time recovery, running pg_basebackup to an EBS volume and archiving the WAL stream from the point after the backup means you can cut the time to recovery without keeping a full replica node. If your concern is availability, then setting up replication is the way to go, although you will still want backups. Replication is not backup: if someone drops a table on the origin it will be dropped on the replica, so you still need PITR or checkpoint backups.
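A minimal sketch of that setup, with placeholder paths and bucket names (the archive settings go in postgresql.conf):

    # base backup in compressed tar format onto the EBS volume
    pg_basebackup -D /mnt/backup/base -Ft -z

    # postgresql.conf: archive every completed WAL segment to S3
    archive_mode = on
    archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'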
No, it's not wise. Instead, set up actual replication, which PostgreSQL supports. I would use the subscriber model, but you can also ship the WAL log to S3 if you want, using archive_command. However, that's mostly unnecessary; I wouldn't consider it unless I had a more specialized use case.
I would upgrade to 10.1 and jump on Logical Replication with the subscriber model.
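A minimal sketch of a logical replication setup on PostgreSQL 10+, run through psql, with placeholder host, database, publication and subscription names:

    # on the publisher (origin) database
    psql -d mydb -c "CREATE PUBLICATION mypub FOR ALL TABLES;"

    # on the subscriber database
    psql -d mydb -c "CREATE SUBSCRIPTION mysub CONNECTION 'host=publisher.example.com dbname=mydb user=replicator' PUBLICATION mypub;"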