导读
腾讯云 云数据仓库 PostgreSQL(Cloud Data Warehouse PostgreSQL)(以下我们简称CDWPG)为企业提供简单、快速、经济高效的 PB 级云端数据仓库解决方案。云数据仓库兼容 Greenplum 开源数据仓库,是一种基于 MPP(大规模并行处理)架构的数仓服务。借助于该产品,可以使用丰富的 PostgreSQL 开源生态工具,实现对云数据仓库中海量数据的即席查询分析、ETL 处理及可视化探索;还可以借助云端数据无缝集成特性,轻松分析位于 COS、TencentDB、ES 等数据引擎上的 PB 级数据。
作者介绍
岳涛
腾讯云 CSIG 大数据中心产品架构师
多年分布式、高并发大数据系统的研发、系统架构设计经验,擅长主流大数据架构技术平台的落地和实施
目前专注于大数据架构相关组件的研究推广和最佳实践的沉淀,致力于帮助企业完成数字化转型
pg_stat_activity 简介
pg_stat_activity官方介绍:
One row per server process, showing information related to the current activity of that process, such as state and current query。
每一行都表示一个系统进程,显示与当前会话的活动进程的一些信息,比如当前回话的状态和查询等。
pg_stat_activity有哪些功能?
- 视图信息:通过视图信息,我们可以获取到所有请求相关的明细。通过这些信息,我们可以清楚准确地了解到当前数仓正在发生哪些事情;
- 分析场景:通过一些查询的组合,我们可以了解到有哪些异常的请求正在执行,并视情况采取行动;
- 排除故障:当CDWPG数仓存在使用异常的情况下,我们可以通过pg_stat_activity获取到异常的请求,并干预这些请求。
查看系统视图pg_stat_activity
```
postgres=> d pg_stat_activity
View "pg_catalog.pg_stat_activity"
Column | Type | Modifiers | Storage | Description
------------------ -------------------------- ----------- ---------- -------------
datid | oid | | plain |
datname | name | | plain |
pid | integer | | plain |
sess_id | integer | | plain |
usesysid | oid | | plain |
usename | name | | plain |
application_name | text | | extended |
client_addr | inet | | main |
client_hostname | text | | extended |
client_port | integer | | plain |
backend_start | timestamp with time zone | | plain |
xact_start | timestamp with time zone | | plain |
query_start | timestamp with time zone | | plain |
state_change | timestamp with time zone | | plain |
waiting | boolean | | plain |
state | text | | extended |
backend_xid | xid | | plain |
backend_xmin | xid | | plain |
query | text | | extended |
waiting_reason | text | | extended |
rsgid | oid | | plain |
rsgname | text | | extended |
rsgqueueduration | interval | | plain |
View definition:
SELECT s.datid,
d.datname,
s.pid,
s.sess_id,
s.usesysid,
u.rolname AS usename,
s.application_name,
s.client_addr,
s.client_hostname,
s.client_port,
s.backend_start,
s.xact_start,
s.query_start,
s.state_change,
s.waiting,
s.state,
s.backend_xid,
s.backend_xmin,
s.query,
s.waiting_reason,
s.rsgid,
s.rsgname,
s.rsgqueueduration
FROM pg_database d,
pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, waiting, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, cl
ient_port, backend_xid, backend_xmin, sess_id, waiting_reason, rsgid, rsgname, rsgqueueduration),
pg_authid u
WHERE s.datid = d.oid AND s.usesysid = u.oid;
```
字段详细信息:
字段 | 类型 | 描述 |
---|---|---|
datid | oid | 连接后端的数据库OID |
datname | name | 连接后端的数据库名称 |
pid | integer | 后端进程ID |
sess_id | integer | Session ID |
usesysid | oid | 登陆后端的用户OID |
usename | name | 登陆到该后端的用户名 |
application_name | text | 连接到后端的应用名 |
client_addr | inet | 连接到后端的客户端的IP地址。 如果此字段是null, 它表明通过服务器机器上UNIX套接字连接客户端或者这是内部进程如autovacuum |
client_hostname | text | 连接客户端的主机名,通过client_addr的反向DNS查找报告。 这个字段将只是非空的IP连接,并且仅仅当启动log_hostname的时候 |
client_port | integer | 客户端用于与后端通讯的TCP端口号,或者如果使用Unix套接字,则为-1 |
backend_start | timestamptz | 该过程开始的时间,比如当客户端连接服务器时 |
xact_start | timestamptz | 启动当前事务的时间,如果没有事务是活的,则为null。如果当前查询是 首个事务,则这列等同于query_start列 |
query_start | timestamptz | 开始当前活跃查询的时间, 或者如果state是非活跃的, 当开始最后查询时 |
state_change | timestampz | 上次状态改变的时间 |
waiting | boolean | 如果后端当前正等待锁则为真(t),否则为(f) |
state | text | 该后端当前总体状态。可能值是: ● 活跃的(active):后端正在执行一个查询。 ● 空闲的(idle):后端正在等待一个新的客户端命令。 ● 空闲事务(idle in transaction):后端在事务中,但是目前无法执行查询。 ● 被终止的空闲事务(idle in transaction (aborted)):这个情况类似于空闲事务,除了事务导致错误的一个语句之一。 ● 快速路径函数调用(fastpath function call):后端正在执行一个快速路径函数。 ● 禁用(disabled):如果后端禁用track_activities,则报告这个状态。 |
query | text | 该后端的最新查询文本。如果状态是活跃的, 此字段显示当前正在执行的查询。在所有其他情况中,这表明执行过去的查询 |
waiting_reason | text | 服务器进程等待的原因,可能值为: lock, replication, or resgroup |
rsgid | oid | 资源组 OID 或者 0 |
rsgname | text | 资源组名称 或者 unknown |
rsgqueueduration | interval | 查询请求在查询队列中排队的总时长 |
注意:启用资源组时。只有查询调度程序(QD)进程将具有rsgid和rsgname。其他服务器进程(例如查询执行器(QE)进程或会话连接进程)将具有rsgid的价值0和一个rsgname的价值未知。QE流程与调度QD流程由同一资源组管理。
获取视图信息
- 获取连接信息
通过该SQL可以查询出当前有哪些连接,以及简单的信息
```
postgres=> SELECT datname,
postgres-> usename,
postgres-> client_addr,
postgres-> client_port
postgres-> FROM pg_stat_activity;
datname | usename | client_addr | client_port
----------- -------------- ---------------- -------------
gpperfmon | gpadmincloud | |
postgres | cdwadmin | 101.68.144.103 | 40601
postgres | cdwadmin | 101.68.144.103 | 40123
postgres | cdwadmin | 101.68.144.103 | 41225
postgres | cdwadmin | 101.68.144.103 | 41240
postgres | cdwadmin | 101.68.144.103 | 41011
gpperfmon | gpadmincloud | |
postgres | cdwadmin | 10.0.0.6 | 56964
postgres | cdwadmin | 101.68.144.103 | 41326
postgres | cdwadmin | 101.68.144.103 | 39664
(10 rows)
```
- 获取执行SQL的信息
通过该SQL可以查询出当前的执行SQL信息
```
postgres=> SELECT datname,
postgres-> usename,
postgres-> query
postgres-> FROM pg_stat_activity;
datname | usename | query
----------- -------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------
gpperfmon | gpadmincloud | <insufficient privilege>
postgres | cdwadmin | SELECT d.datname, d.oid, pg_get_userbyid(d.datdba) AS owner, shobj_description(d.oid, 'pg_database') AS comment, t.spcname, d.datacl, d.datlastsysoid, d.encoding,
pg_encoding_to_char(d.encoding) AS encodingname FROM pg_database d LEFT JOIN pg_tablespace t ON d.dattablespace=t.oid
postgres | cdwadmin | SELECT c.oid, obj_description(c.oid), c.relhasoids AS hasoids, n.nspname AS schemaname, c.relname AS tablename, c.relkind, pg_get_userbyid(c.relowner) AS tableown
er, t.spcname AS "tablespace", c.relhasindex AS hasindexes, c.relhasrules AS hasrules, c.relhastriggers AS hastriggers, ft.ftoptions, fs.srvname, c.relacl, c.reltuples, ((SELECT count(*) FROM
pg_inherits WHERE inhparent = c.oid) > 0) AS inhtable, i2.nspname AS inhschemaname, i2.relname AS inhtablename, c.reloptions AS param, c.relpersistence AS unlogged FROM pg_class c LEFT JOIN
pg_namespace n ON n.oid = c.relnamespace LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace LEFT JOIN (pg_inherits i INNER JOIN pg_class c2 ON i.inhparent = c2.oid LEFT JOIN pg_namespace n2
ON n2.oid = c2.relnamespace) i2 ON i2.inhrelid = c.oid LEFT JOIN pg_foreign_table ft ON ft.ftrelid = c.oid LEFT JOIN pg_foreign_server fs ON ft.ftserver = fs.oid WHERE ((c.relkind = 'r'::"ch
ar") OR (c.relkind = 'f'::"char")) AND n.nspname = 'public'
postgres | cdwadmin | SELECT rolname, rolsuper, rolinherit, rolcreaterole, rolcreatedb, rolcatupdate, rolcanlogin, rolconnlimit, rolvaliduntil, rolconfig, oid , pg_catalog.shobj_descrip
tion(oid, 'pg_authid') AS comment FROM pg_roles
postgres | cdwadmin | SELECT c.conkey FROM pg_constraint c WHERE c.contype='p' AND c.conrelid = 24756
postgres | cdwadmin | SELECT t.relname, fns.nspname AS foreign_schema, f.relname AS foreign_table, c.conname, c.confkey, a.attname AS foreign_field, a.attnum FROM pg_constraint c LEFT J
OIN pg_namespace ns ON (c.connamespace = ns.oid) LEFT JOIN pg_class t ON (c.conrelid = t.oid) INNER JOIN pg_class f ON (c.confrelid = f.oid) LEFT JOIN pg_namespace fns ON (f.relnamespace = fn
s.oid) INNER JOIN pg_attribute a ON (a.attrelid = f.oid) WHERE a.attnum > 0 AND ns.nspname = 'public' AND t.relname = 'user_behavior' ORDER BY t.relname, c.conname, a.attnum
gpperfmon | gpadmincloud | <insufficient privilege>
postgres | cdwadmin | SELECT datname,usename,query FROM pg_stat_activity;
postgres | cdwadmin | SELECT c.conkey FROM pg_constraint c WHERE c.contype='p' AND c.conrelid = 24756
postgres | cdwadmin | SELECT COUNT(1) FROM user_behavior;
(10 rows)
```
上面一条SQL的查询结果包含了当前没有查询的空连接,如果想只查询当前正在运行的SQL,需要用下面这条SQL
```
postgres=> SELECT datname,
postgres-> usename,
postgres-> query
postgres-> FROM pg_stat_activity
postgres-> WHERE state != 'idle';
datname | usename | query
---------- ---------- -----------------------------------------------
postgres | cdwadmin | SELECT datname,
| | usename,
| | query
| | FROM pg_stat_activity
| | WHERE state != 'idle';
postgres | cdwadmin | SELECT * FROMr
| | (SELECT *,r
| | row_number() over(partition by hostr
| | ORDER BY age desc) agedr
| | FROM user_behavior) tr
| | WHERE aged = 1r
| | LIMIT 10;
(2 rows)
```
- 获取查询耗时比较久的查询
通过该SQL可以查询出当前耗时较长的执行SQL信息
```
postgres=> SELECT current_timestamp - query_start AS run_time,
postgres-> datname,
postgres-> usename,
postgres-> query
postgres-> FROM pg_stat_activity
postgres-> WHERE state != 'idle'
postgres-> ORDER BY 1 desc;
runtime | datname | usename | query
----------------- ---------- ---------- ----------------------------------------------------------------------------
00:00:11.070832 | postgres | cdwadmin | SELECT * FROMr
| | | (SELECT *,r
| | | row_number() over(partition by hostr
| | | ORDER BY age desc) agedr
| | | FROM user_behavior) tr
| | | WHERE aged = 1r
| | | LIMIT 10;
00:00:00 | postgres | cdwadmin | select current_timestamp - query_start as runtime, datname, usename, query
| | | from pg_stat_activity
| | | where state != 'idle'
| | | order by 1 desc;
(2 rows)
```
可以发现第一条SQL执行了11秒还没结束。
重要字段讲解
前面我们列举过字段详细信息:
waiting有两个值,分别为:假(f),真(t);
state有6种状态,这里我们介绍其中4种:
活跃的(active);
空闲的(idle);
空闲事务(idle in transaction);
被终止的空闲事务(idle in transaction (aborted))。
下面通过几个简单的例子,来讲解这两个相对重要的字段:waiting,state。
场景一:查看当前正在执行的请求
state
- active:请求正在执行中
waiting
- f:当前请求没有在等待
```
postgres=> SELECT datid,
postgres-> datname,
postgres-> usename,
postgres-> waiting,
postgres-> state,
postgres-> query
postgres-> FROM pg_stat_activity
postgres-> WHERE usename = 'cdwadmin' AND
postgres-> state = 'active';
datid | datname | usename | waiting | state | query
------- ---------- ---------- --------- -------- -----------------------------------
12810 | postgres | cdwadmin | f | active | SELECT datid,
| | | | | datname,
| | | | | usename,
| | | | | waiting,
| | | | | state,
| | | | | query
| | | | | FROM pg_stat_activity
| | | | | WHERE usename = 'cdwadmin' AND
| | | | | state = 'active';
12810 | postgres | cdwadmin | f | active | SELECT user, sex, age, COUNT(1)r
| | | | | FROM user_behaviorr
| | | | | GROUP BY user, sex, age;
(2 rows)
```
这里可以看到当前用户cdwadmin符合状态为active的SQL有两条,其中waiting的值为f(false)。
场景二:查看当前执行等待的请求
state
- active:请求正在执行中
waiting
- t:当前请求正在等待
```
postgres=> BEGIN;
BEGIN
postgres=> ALTER TABLE user_behavior ADD COLUMN region TEXT;
ALTER TABLE
postgres=>
```
这里我们用BEGIN语句使当前用户处在事务中,并发起了一个ALTER请求使产生表锁, 然后下面用另一个客户端再对该表发起查询 。
```
postgres=> SELECT * FROM user_behavior LIMIT 1;
```
这里发现查询请求处在没有响应的状态,一直卡在这里,这是因为增加字段会产生表锁,锁释放之前该表无法进行其他操作。然后下面再打开新的一个客户端获取执行信息。
```
postgres=> SELECT datid,
postgres-> datname,
postgres-> usename,
postgres-> waiting,
postgres-> state,
postgres-> query
postgres-> FROM pg_stat_activity
postgres-> WHERE usename = 'cdwadmin' AND
postgres-> waiting = 't';
datid | datname | usename | waiting | state | query
------- ---------- ---------- --------- -------- --------------------------------------
12810 | postgres | cdwadmin | t | active | SELECT * FROM user_behavior LIMIT 1;
(1 row)
```
这里可以看到当前用户cdwadmin符合waiting状态为t的SQL有一条,其中state为active。
场景三:查看当前没有请求的空连接
state
- idle:当前没有请求在执行的长连接,在等待请求
```
postgres=> SELECT datid,
postgres-> datname,
postgres-> usename,
postgres-> waiting,
postgres-> state,
postgres-> query
postgres-> FROM pg_stat_activity
postgres-> WHERE usename = 'cdwadmin' AND
postgres-> state = 'idle';
datid | datname | usename | waiting | state | query
------- ---------- ---------- --------- -------- -----------------------------------
12810 | postgres | cdwadmin | f | idle | SELECT user, sex, age, COUNT(1)r
| | | | | FROM user_behaviorr
| | | | | GROUP BY user, sex, age;
(1 rows)
```
这里可以看到当前用户cdwadmin符合状态为idle的SQL有一条,其中waiting的值为f(false)。
场景四:查看当前空闲的事务
state
- idle in transaction:空闲的事务
```
postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM
postgres-> (SELECT *,
postgres(> row_number() over(partition by host
postgres(> ORDER BY age desc) aged
postgres(> FROM user_behavior) t
postgres-> WHERE aged = 1
postgres-> LIMIT 10;
host | use | sex | age | behavior_id | aged
----------------- ---------- ----- ----- -------------------------------------- ------
192.168.100.107 | Gil | boy | 19 | b0893205-f626-49c9-82ab-c7ab6eef5d21 | 1
192.168.100.110 | Gore | boy | 19 | 9b5bae42-2d52-4dee-acf9-54ba60bb08a7 | 1
192.168.100.113 | Albin | boy | 19 | 8c710196-fc73-4878-a7b0-54b9d0adfc89 | 1
192.168.100.114 | Gil | boy | 19 | 74348395-c0b5-4a49-b798-19c733c4b664 | 1
192.168.100.117 | Denny | boy | 19 | 0103a06b-490d-4eab-bcc3-4598012ece13 | 1
192.168.100.121 | Hargrove | boy | 19 | a32d0a88-e0a5-4e15-95a0-0d54b92d9fbc | 1
192.168.100.123 | Felix | boy | 19 | 93fae52b-9f54-4b97-940d-04aea35be0d5 | 1
192.168.100.125 | Jordon | boy | 19 | 1b99dd74-1a0c-42dc-a119-8802c3173478 | 1
192.168.100.128 | Felix | boy | 19 | 220ee35b-88d1-464b-806d-07ff725402d9 | 1
192.168.100.13 | Gorman | boy | 19 | 82346ccd-6e07-42aa-9714-62f71a28c3c4 | 1
(10 rows)
```
这里我们用BEGIN语句使当前用户处在事务中,然后下面用另一个客户端获取执行信息。
```
postgres=> SELECT datid,
postgres-> datname,
postgres-> usename,
postgres-> waiting,
postgres-> state,
postgres-> query
postgres-> FROM pg_stat_activity
postgres-> WHERE usename = 'cdwadmin' AND
postgres-> state = 'idle in transaction';
datid | datname | usename | waiting | state | query
------- ---------- ---------- --------- --------------------- ---------------------------------------------
12810 | postgres | cdwadmin | f | idle in transaction | SELECT * FROM
| | | | | (SELECT *,
| | | | | row_number() over(partition by host
| | | | | ORDER BY age desc) aged
| | | | | FROM user_behavior) t
| | | | | WHERE aged = 1
| | | | | LIMIT 10;
(1 row)
```
这里可以看到当前用户cdwadmin符合状态为idle in transaction的SQL有一条,其中waiting的值为f(false)。
场景五:查看当前发生错误的空闲事务
state
- idle in transaction (aborted):由于发生错误而被终止的空闲事务
```
postgres=> BEGIN;
BEGIN
postgres=> SELECT * FROM
postgres-> (SELECT *,
postgres(> row_number() over(partition by host
postgres(> ORDER BY age desc) aged
postgres(> FROM user_behavior) t
postgres-> WHERE aged = 1
postgres-> LIMIT 10;
host | use | sex | age | behavior_id | aged
----------------- ---------- ----- ----- -------------------------------------- ------
192.168.100.100 | Tobias | boy | 19 | 0034dcf7-f5f3-42d2-a463-0d1708286b64 | 1
192.168.100.101 | Thorpe | boy | 19 | bb2bad2b-2e79-42d5-9df7-57746ecab218 | 1
192.168.100.102 | Titus | boy | 19 | 22722d52-fd41-426b-b844-e61be731fe03 | 1
192.168.100.103 | Webber | boy | 19 | 35115e19-84ae-411a-8a5c-584fbcccc030 | 1
192.168.100.104 | Swift | boy | 19 | bc7a1e8d-4f05-4434-add2-a323503065a4 | 1
192.168.100.105 | Rowell | boy | 19 | 5578bd6b-9b16-4c65-9519-05ae38686480 | 1
192.168.100.108 | Webber | boy | 19 | be97a453-05de-413d-b8ab-b7671559d60d | 1
192.168.100.119 | Sheridan | boy | 19 | 7841e7a2-27d0-44a1-ac5f-a6e8a3689587 | 1
192.168.100.12 | Russ | boy | 19 | 4780063f-a131-40b1-b3a7-e4deba865e7a | 1
192.168.100.120 | Gil | boy | 19 | 9ef60038-8fe0-46aa-a69a-97e498e08f98 | 1
(10 rows)
postgres=> SELEC * FROM user_behavior LIMIT 1;
ERROR: syntax error at or near "SELEC"
LINE 1: SELEC * FROM user_behavior LIMIT 1;
^
```
这里我们用BEGIN语句使当前用户处在事务中,并发起了一个错误的语法请求, 然后下面用另一个客户端获取执行信息 。
```
postgres=> SELECT datid,
datname,
usename,
waiting,
state,
query
FROM pg_stat_activity
WHERE usename = 'cdwadmin' AND
state = 'idle in transaction (aborted)';
datid | datname | usename | waiting | state | query
------- ---------- ---------- --------- ------------------------------- -------------------------------------
12810 | postgres | cdwadmin | f | idle in transaction (aborted) | SELEC * FROM user_behavior LIMIT 1;
(1 row)
```
这里可以看到当前用户cdwadmin符合状态为idle in transaction (aborted)的SQL有一条,其中waiting的值为f(false)。
任务分析进阶——诊断原因并修复
在上面的场景二中,我们发现了有正在等待的请求,下面我们来学习如何找出发生等待事件的根本原因。
由于Greenplum没有提供任何直接的方法来确认请求等待的原因,所以我们将引入另一张视图————pg_locks。
将pg_locks与pg_stat_activity结合使用,可以使我们获得很多与锁定相关的有用信息。
- 查看当前执行等待的请求以及等待的原因
```
postgres=> SELECT waiting.query AS waiting_query,
postgres-> waiting.pid AS waiting_pid,
postgres-> waiting.usename AS waiting_user,
postgres-> locking.query AS locking_query,
postgres-> locking.pid AS locking_pid,
postgres-> locking.usename AS locking_user,
postgres-> t.schemaname || '.' || t.relname AS tablename
postgres-> FROM pg_stat_activity waiting
postgres-> JOIN pg_locks l1 ON
postgres-> waiting.pid = l1.pid AND
postgres-> NOT l1.granted
postgres-> JOIN pg_locks l2 ON
postgres-> l1.relation = l2.relation AND
postgres-> l2.granted
postgres-> JOIN pg_stat_activity locking ON
postgres-> l2.pid = locking.pid
postgres-> JOIN pg_stat_user_tables t ON
postgres-> l1.relation = t.relid
postgres-> WHERE waiting.waiting = 't';
waiting_query | waiting_pid | waiting_user | locking_query | locking_pid | locking_user | tablename
-------------------------------------- ------------- -------------- --------------------------------------------------- ------------- -------------- ----------------------
SELECT * FROM user_behavior LIMIT 1; | 17674 | cdwadmin | ALTER TABLE user_behavior ADD COLUMN region TEXT; | 15126 | cdwadmin | public.user_behavior
(1 row)
```
这里可以看到当前用户cdwadmin有一条正在等待的请求,其对应锁定的原因也很明朗。在确认了阻塞的原因之后,可以使用pg_cancel_backend或者pg_terminate_backend的方式来进行修复。
- 使用pg_cancel_backend终止查询
```
postgres=> SELECT PG_CANCEL_BACKEND(17674);
pg_cancel_backend
-------------------
t
(1 row)
```
执行成功,等待的查询已被终止,其中17674是上面查询出来的waiting_pid。需要注意的是,这里只终止了查询,并没有释放其session。
- 使用pg_terminate_backend终止查询并释放连接
```
postgres=> SELECT PG_TERMINATE_BACKEND(17674);
pg_terminate_backend
----------------------
t
(1 row)
postgres=> SELECT PG_TERMINATE_BACKEND(17674);
WARNING: PID 17674 is not a PostgreSQL server process
pg_terminate_backend
----------------------
f
(1 row)
```
可以发现第二次执行pg_terminate_backend时返回了false,因为该连接已经被第一次执行释放掉了。
小结
简而言之,pg_stat_activity的主要作用是显示CDWPG中的当前活动。pg_stat_activity是一张系统统计视图,可以用于任务分析诊断。这是非常有用且功能强大的视图,其价值很难衡量。
因此,如果想知道在任意给定时刻CDWPG数据仓库中正在发生什么,那么pg_stat_activity是最好的选择,是CDWPG使用者们最好的帮手。
参考资料
[1] https://cloud.tencent.com/document/product/878
[2] https://docs.greenplum.org/6-8/common/gpdb-features.html