Postgresql进程卡住无法退出原因和解决方法

2022-09-23 10:21:21 浏览数 (4)

前言

Postgresql进程卡住无法退出怎么办?例如以下场景:

  • kill 进程:无效
  • kill -2/-12 进程:无效
  • pg_terminate_backend(pid):无效

1 模拟卡住

代码语言:javascript复制
-- pg14下测试通过

cd `pg_config --libdir`/postgresql

cat << EOF > loop.c
#include "postgres.h"
#include "fmgr.h"
#include <unistd.h>
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(loop);
Datum loop(PG_FUNCTION_ARGS)
{
    /* an endless loop */
    while(1) 
    {
        // 注意这里没有加:CHECK_FOR_INTERRUPTS();
        sleep(2); 
    }
}
EOF

gcc -I /data01/bin/pg9000/include/postgresql/server -fPIC -shared -o loop.so loop.c

-- psql
CREATE FUNCTION loop() RETURNS void LANGUAGE c AS 'loop';

-- 卡住
postgres=# 
postgres=# select loop();

2 尝试解决失败的方法

psql执行ctrl c失败

代码语言:javascript复制
postgres=# select loop();
^CCancel request sent
^CCancel request sent

kill 失败/kill -2失败/kill -12失败

代码语言:javascript复制
$ kill 11699
$ kill -2 11699
$ kill -12 11699

pg_terminate_backend失败

代码语言:javascript复制
postgres=# select pg_cancel_backend(11699);
 pg_cancel_backend 
-------------------
 t
(1 row)

postgres=# select pg_terminate_backend(11699);
 pg_terminate_backend 
----------------------
 t
(1 row)

3 进程在干什么

stracp -p

代码语言:javascript复制
$ strace -p 11699
Process 11699 attached
restart_syscall(<... resuming interrupted call ...>) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [URG], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER|SA_RESTART|SA_NOCLDSTOP, 0x7fbbe4fcc630}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0
nanosleep({2, 0}, 
0x7fff60c0bb80)       = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [URG], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], SA_RESTORER|SA_RESTART|SA_NOCLDSTOP, 0x7fbbe4fcc630}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [URG], NULL, 8) = 0
nanosleep({2, 0}, 0x7fff60c0bb80)       = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [URG], 8) = 0

gstack

代码语言:javascript复制
$ gstack 11699
#0  0x00007fbbe45a68d0 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fbbe45a6784 in sleep () from /lib64/libc.so.6
#2  0x00007fbbe585e145 in loop () from /data01/bin/pg9000/lib/postgresql/loop.so
#3  0x0000000000728b0f in ExecInterpExpr (state=0x2bf0100, econtext=0x2befe00, isnull=0x7fff60c0c037) at execExprInterp.c:725
#4  0x000000000072a965 in ExecInterpExprStillValid (state=0x2bf0100, econtext=0x2befe00, isNull=0x7fff60c0c037) at execExprInterp.c:1824
#5  0x000000000077eec1 in ExecEvalExprSwitchContext (state=0x2bf0100, econtext=0x2befe00, isNull=0x7fff60c0c037) at ../../../src/include/executor/executor.h:339
#6  0x000000000077ef2a in ExecProject (projInfo=0x2bf00f8) at ../../../src/include/executor/executor.h:373
#7  0x000000000077f118 in ExecResult (pstate=0x2befce8) at nodeResult.c:136
#8  0x000000000073eefa in ExecProcNodeFirst (node=0x2befce8) at execProcnode.c:463
#9  0x00000000007338aa in ExecProcNode (node=0x2befce8) at ../../../src/include/executor/executor.h:257
#10 0x0000000000736130 in ExecutePlan (estate=0x2befab0, planstate=0x2befce8, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0x2bee790, execute_once=true) at execMain.c:1551
#11 0x0000000000733f15 in standard_ExecutorRun (queryDesc=0x2b2c850, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:361
#12 0x00007fbbe589ac5e in pgss_ExecutorRun (queryDesc=0x2b2c850, direction=ForwardScanDirection, count=0, execute_once=true) at pg_stat_statements.c:1003
#13 0x0000000000733d27 in ExecutorRun (queryDesc=0x2b2c850, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:303
#14 0x000000000097cd6d in PortalRunSelect (portal=0x2ba3f90, forward=true, count=0, dest=0x2bee790) at pquery.c:921
#15 0x000000000097ca2c in PortalRun (portal=0x2ba3f90, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2bee790, altdest=0x2bee790, qc=0x7fff60c0c4e0) at pquery.c:765
#16 0x00000000009766d3 in exec_simple_query (query_string=0x2b064f0 "select loop();") at postgres.c:1213
#17 0x000000000097abf1 in PostgresMain (argc=1, argv=0x7fff60c0c770, dbname=0x2b32c00 "postgres", username=0x2b32bd8 "mingjiegao") at postgres.c:4494
#18 0x00000000008b6de6 in BackendRun (port=0x2b2b720) at postmaster.c:4530
#19 0x00000000008b6765 in BackendStartup (port=0x2b2b720) at postmaster.c:4252
#20 0x00000000008b2bdd in ServerLoop () at postmaster.c:1745
#21 0x00000000008b24af in PostmasterMain (argc=1, argv=0x2b000e0) at postmaster.c:1417
#22 0x00000000007b4d2b in main (argc=1, argv=0x2b000e0) at main.c:209

4 原因&解决方案

4.1 原因

原因是PG当前堆栈没有进入信号相应处理函数,一般就是进入死等堆栈了:

  • 一般能cacnel的堆栈:超时后检查中断,然后继续sleep
代码语言:javascript复制
while
  sleep(timeout)
  CHECK_FOR_INTERRUPTS();
  • 死等堆栈
代码语言:javascript复制
while
  sleep()

4.2 解决方案

注意:千万不要kill -9,SIGKILL没有信号处理函数,OS会直接停掉进程;PG父进程发现子进程异常退出,会停掉所有进程,释放共享内存,在重新申请共享内存,拉起所有进程。效果就等于异常重启,启动时肯定会需要时间redo,可能造成几分钟的停止服务。(除非后果可以接受,否则不要kill -9)

执行kill -2或kill -12后或者执行pg_terminate_backend后,主动调用信号处理函数,让PG正常退出。

代码语言:javascript复制
$ gdb attach 11699
(gdb) p ProcessInterrupts()
[Inferior 1 (process 11699) exited with code 01]
The program being debugged exited while in a function called from GDB.
Evaluation of the expression containing the function
(ProcessInterrupts) will be abandoned.
(gdb) q

服务端效果

代码语言:javascript复制
postgres=# 
postgres=# select loop();
^CCancel request sent
^CCancel request sent

(GDB调用后)

FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=# q

服务端日志:server没有其他报错、server没有重启

代码语言:javascript复制
11699 [local] mingjiegao postgres 2022-08-19 09:41:43 UTC 42723STATEMENT:  CREATE FUNCTION loop() RETURNS void LANGUAGE c AS 'loop';
11699 [local] mingjiegao postgres 2022-08-19 09:42:24 UTC 00000LOG:  statement: select loop();
13083 [local] mingjiegao postgres 2022-08-19 09:45:39 UTC 00000LOG:  statement: select pg_cancel_backend(11699);
13083 [local] mingjiegao postgres 2022-08-19 09:45:47 UTC 00000LOG:  statement: select pg_terminate_backend(11699);
11699 [local] mingjiegao postgres 2022-08-19 09:52:44 UTC 57P01FATAL:  terminating connection due to administrator command
11699 [local] mingjiegao postgres 2022-08-19 09:52:44 UTC 57P01STATEMENT:  select loop();

1 人点赞