Our Postgres 9.0 server occasionally locks up and takes our webapp down with it. Restarting Postgres fixes the problem.
Here's what I've been able to observe:
- First, usage of one CPU jumps to 100% for a few minutes
  - Disk operations drop to ~0 during this time
  - Database operations drop to 0 (blocks and tuples per sec)
  - Logs show during this time:
    - WARNING: worker took too long to start; cancelled
    - WARNING: worker took too long to start; cancelled
  - No queries in logs (only those over 200ms are logged)
  - No unusually long-running queries logged before or during
- Then the second CPU jumps to 100%
  - The number of postgres processes jumps from the usual 8-10 to ~20
  - Matched by a spike in Postgres blocks per second (about twice normal)
  - Logs show:
    - LOG: could not accept SSL connection: EOF detected
  - Queries are running but slow
- Restarting Postgres returns everything to normal
Setup:
Server: Amazon EC2 Large
Ubuntu 10.04.2 LTS
Postgres 9.0.3
Dedicated DB server
Does anyone have any idea what's causing this? Or any suggestions about what else I should be checking out?
Make sure you are not running out of memory and causing the disk to thrash.
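One rough way to check memory on the DB box is to read /proc/meminfo while the problem is happening. The sketch below is only an illustration: the function names are made up for this post, and treating MemFree + Buffers + Cached as "available" is an approximation (kernels of the Ubuntu 10.04 era do not report MemAvailable).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Key: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Return (approx_available_kb, swap_used_kb) from parsed values."""
    # Free memory plus buffers and page cache approximates what the
    # kernel could still hand out without swapping.
    available = (info.get("MemFree", 0)
                 + info.get("Buffers", 0)
                 + info.get("Cached", 0))
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return available, swap_used


def read_meminfo(path="/proc/meminfo"):
    """Read and summarize the live values on a Linux host."""
    with open(path) as f:
        return memory_summary(parse_meminfo(f.read()))
```

Calling read_meminfo() every few seconds during a lockup and watching whether swap use climbs while available memory shrinks should show whether memory pressure is involved.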
If you have plenty of free memory, go directly into PostgreSQL and look for an offending query.
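To look for an offending query, the pg_stat_activity view is the usual starting point. Here is a minimal sketch, assuming a 9.0 server (where the columns are still called procpid and current_query; they were renamed in 9.2) and any DB-API 2.0 connection such as one from psycopg2. The helper names are hypothetical, not part of any library.

```python
# Non-idle backends on PostgreSQL 9.0, oldest query first.
# '<IDLE> in transaction' sessions are kept on purpose, since they
# can hold locks that block other queries.
ACTIVE_QUERIES_SQL = """
SELECT procpid,
       usename,
       waiting,
       now() - query_start AS runtime,
       current_query
FROM pg_stat_activity
WHERE current_query <> '<IDLE>'
ORDER BY query_start
"""


def active_queries(conn):
    """Return rows for all non-idle backends.

    conn is assumed to be a DB-API 2.0 connection to the server.
    """
    cur = conn.cursor()
    try:
        cur.execute(ACTIVE_QUERIES_SQL)
        return cur.fetchall()
    finally:
        cur.close()


def format_report(rows):
    """Turn the rows into one readable line per backend."""
    lines = []
    for procpid, usename, waiting, runtime, query in rows:
        state = "waiting" if waiting else "running"
        lines.append("%s %s %s %s %s"
                     % (procpid, usename, state, runtime, query))
    return lines
```

Run it while the first CPU is pegged. A backend with a long runtime, or several backends with waiting set to true, would be the first things to look at.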