I have a PostgreSQL 13 instance that keeps crashing:
LOG: server process (PID 10722) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: COMMIT
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
I've updated /etc/postgresql/13/main/pg_ctl.conf to enable core dumps:
pg_ctl_options = '--core-files'
and restarted the postgresql service. Now core dumps seem to be allowed:
$ for f in `pgrep postgres`; do cat /proc/$f/limits | grep core; done
Max core file size unlimited unlimited bytes
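The kernel's core_pattern decides where the dump ends up and what it is called; given the core.postgres.12264 file name used below, on this machine it is presumably a core.%e.%p-style pattern landing in the cluster data directory. It can be checked with:
$ cat /proc/sys/kernel/core_pattern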
A gdb backtrace gives the following output:
$ gdb /usr/lib/postgresql/13/bin/postgres 13/main/core.postgres.12264
Program terminated with signal SIGSEGV, Segmentation fault.
#0 slot_deform_heap_tuple (natts=5, offp=0x557cc2e60720, tuple=<optimized out>, slot=0x557cc2e606d8) at ./build/../src/backend/executor/execTuples.c:930
930 ./build/../src/backend/executor/execTuples.c: No such file or directory.
(gdb) bt
#0 slot_deform_heap_tuple (natts=5, offp=0x557cc2e60720, tuple=<optimized out>, slot=0x557cc2e606d8) at ./build/../src/backend/executor/execTuples.c:930
#1 tts_buffer_heap_getsomeattrs (slot=0x557cc2e606d8, natts=5) at ./build/../src/backend/executor/execTuples.c:695
#2 0x0000557cc1d3998c in slot_getsomeattrs_int (slot=slot@entry=0x557cc2e606d8, attnum=5) at ./build/../src/backend/executor/execTuples.c:1912
#3 0x0000557cc1d28fba in slot_getsomeattrs (attnum=<optimized out>, slot=0x557cc2e606d8) at ./build/../src/include/executor/tuptable.h:344
#4 ExecInterpExpr (state=0x557cc2e620a8, econtext=0x557cc2ea1768, isnull=<optimized out>) at ./build/../src/backend/executor/execExprInterp.c:482
#5 0x0000557cc1d5548d in ExecEvalExprSwitchContext (isNull=0x7ffdd2599507, econtext=0x557cc2ea1768, state=0x557cc2e620a8) at ./build/../src/include/executor/executor.h:322
#6 ExecQual (econtext=0x557cc2ea1768, state=0x557cc2e620a8) at ./build/../src/include/executor/executor.h:391
#7 MJFillInner (node=0x557cc2ea1558) at ./build/../src/backend/executor/nodeMergejoin.c:494
#8 0x0000557cc1d55ce8 in ExecMergeJoin (pstate=0x557cc2ea1558) at ./build/../src/backend/executor/nodeMergejoin.c:1353
#9 0x0000557cc1d2cc83 in ExecProcNode (node=0x557cc2ea1558) at ./build/../src/include/executor/executor.h:248
#10 ExecutePlan (execute_once=<optimized out>, dest=0x557cc2e1a630, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x557cc2ea1558,
estate=0x557cc2ea12f8) at ./build/../src/backend/executor/execMain.c:1632
#11 standard_ExecutorRun (queryDesc=0x557cc2e1a5a0, direction=<optimized out>, count=0, execute_once=<optimized out>) at ./build/../src/backend/executor/execMain.c:350
#12 0x00007f0ec05ae09d in pgss_ExecutorRun (queryDesc=0x557cc2e1a5a0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1045
#13 0x0000557cc1cdbcd4 in PersistHoldablePortal (portal=portal@entry=0x557cc2d44b78) at ./build/../src/backend/commands/portalcmds.c:407
#14 0x0000557cc1ff95f9 in HoldPortal (portal=portal@entry=0x557cc2d44b78) at ./build/../src/backend/utils/mmgr/portalmem.c:642
#15 0x0000557cc1ff9e7d in PreCommit_Portals (isPrepare=isPrepare@entry=false) at ./build/../src/backend/utils/mmgr/portalmem.c:738
#16 0x0000557cc1c001c4 in CommitTransaction () at ./build/../src/backend/access/transam/xact.c:2087
#17 0x0000557cc1c015d5 in CommitTransactionCommand () at ./build/../src/backend/access/transam/xact.c:3085
#18 0x0000557cc1ea211d in finish_xact_command () at ./build/../src/backend/tcop/postgres.c:2662
#19 0x0000557cc1ea4703 in exec_simple_query (query_string=0x557cc2c9cd28 "COMMIT") at ./build/../src/backend/tcop/postgres.c:1264
#20 0x0000557cc1ea6143 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x557cc2cf6c68, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4339
#21 0x0000557cc1e25bcd in BackendRun (port=0x557cc2ce94d0, port=0x557cc2ce94d0) at ./build/../src/backend/postmaster/postmaster.c:4526
#22 BackendStartup (port=0x557cc2ce94d0) at ./build/../src/backend/postmaster/postmaster.c:4210
#23 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1739
#24 0x0000557cc1e26b41 in PostmasterMain (argc=5, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1412
#25 0x0000557cc1b70f4f in main (argc=5, argv=0x557cc2c96c30) at ./build/../src/backend/main/main.c:210
Adding log_statement = 'all' to /etc/postgresql/13/main/postgresql.conf doesn't really help, as the postmaster terminates all processes immediately and the query never gets written to the logs.
Here's the strace output after running COMMIT.
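strace was attached to the postmaster with -f so that the backend processes are followed as well; roughly (run as root, path being the Debian default data directory):
$ strace -f -p "$(head -1 /var/lib/postgresql/13/main/postmaster.pid)"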
[pid 20006] pwrite64(29, "COMMIT", 6, 15936) = 6
[pid 20006] pwrite64(29, "\0", 1, 15942) = 1
[pid 20006] close(29) = 0
[pid 20006] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x10} ---
[pid 20006] +++ killed by SIGSEGV (core dumped) +++
<... select resumed> ) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=20006, si_uid=108, si_status=SIGSEGV, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WNOHANG, NULL) = 20006
write(2, "2021-09-08 13:38:51.853 UTC [299"..., 198) = 198
write(2, "2021-09-08 13:38:51.853 UTC [299"..., 88) = 88
kill(19324, SIGQUIT) = 0
kill(-19324, SIGQUIT) = 0
kill(19331, SIGQUIT) = 0
kill(-19331, SIGQUIT) = 0
kill(19320, SIGQUIT) = 0
kill(-19320, SIGQUIT) = 0
kill(19319, SIGQUIT) = 0
kill(-19319, SIGQUIT) = 0
kill(19321, SIGQUIT) = 0
kill(-19321, SIGQUIT) = 0
kill(19322, SIGQUIT) = 0
kill(-19322, SIGQUIT) = 0
kill(19323, SIGQUIT) = 0
kill(-19323, SIGQUIT) = 0
wait4(-1, 0x7ffe90814374, WNOHANG, NULL) = 0
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
select(7, [5 6], NULL, NULL, {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19320, si_uid=108, si_status=2, si_utime=14, si_stime=3} ---
Is there a way to trace back the exact SQL query that was executed?
Firstly, install the debugging symbols for your distribution; for Debian-based distros:
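For example (the package name can differ between releases; on some setups it is postgresql-13-dbg instead):
$ sudo apt-get install postgresql-13-dbgsym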
Then open the core file in gdb, jump to a frame that contains a queryDesc variable (e.g. frame 12 in the backtrace above) and print that variable:
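For this core, frame 12 (pgss_ExecutorRun) carries queryDesc, so the session looks roughly like this (the pointer value is the one from the backtrace above; the $1 convenience-variable number may differ):
(gdb) frame 12
(gdb) print queryDesc
$1 = (QueryDesc *) 0x557cc2e1a5a0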
Now copy the pointer from the line above (everything after the equals sign) and dereference it using *:
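Roughly, continuing the session above (paste the copied value, cast included, behind the *):
(gdb) print *(QueryDesc *) 0x557cc2e1a5a0
gdb then dumps the QueryDesc members (operation, plannedstmt, sourceText, tupDesc, estate, ...), which is typically where the hint about the involved relations comes from.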
It doesn't give you the whole query, but it at least gives an idea of which table the query was executed on.
Based on the gdb output I've managed to isolate the clients that were executing such a query. I've tried running VACUUM FULL on the affected table, rebuilding the table and its indexes, switching to a replica, and copying the whole database using pg_dump. Nonetheless, the issue persisted on the database copies as well. Finally I've managed to isolate a minimal piece of SQL that reproduces the issue.
With code to reproduce it, that was enough to report a bug to the pgsql-bugs mailing list (there's also a web form). It turned out to be a bug in the re-execution of a plan that had already reached completion on an unstable cursor, which was included in PostgreSQL 13.4 and 12.8 (and possibly other versions).