全程干货!腾云忆想CSIG 产品架构师分享CDWPG云数仓库管理小窍门

2021-08-31 11:16:23 浏览数 (2)

导读

腾讯云 云数据仓库 PostgreSQL(Cloud Data Warehouse PostgreSQL)(以下我们简称CDWPG)为企业提供简单、快速、经济高效的 PB 级云端数据仓库解决方案。云数据仓库兼容 Greenplum 开源数据仓库,是一种基于 MPP(大规模并行处理)架构的数仓服务。借助于该产品,可以使用丰富的 PostgreSQL 开源生态工具,实现对云数据仓库中海量数据的即席查询分析、ETL 处理及可视化探索;还可以借助云端数据无缝集成特性,轻松分析位于 COS、TencentDB、ES 等数据引擎上的 PB 级数据。

作者介绍

岳涛

腾讯云 CSIG 大数据中心产品架构师

多年分布式、高并发大数据系统的研发、系统架构设计经验,擅长主流大数据架构技术平台的落地和实施

目前专注于大数据架构相关组件的研究推广和最佳实践的沉淀,致力于帮助企业完成数字化转型

pg_stat_activity 简介

pg_stat_activity官方介绍:

One row per server process, showing information related to the current activity of that process, such as state and current query。

每一行都表示一个系统进程,显示与当前会话的活动进程的一些信息,比如当前回话的状态和查询等。

pg_stat_activity有哪些功能?

  • 视图信息:通过视图信息,我们可以获取到所有请求相关的明细。通过这些信息,我们可以清楚准确地了解到当前数仓正在发生哪些事情;
  • 分析场景:通过一些查询的组合,我们可以了解到有哪些异常的请求正在执行,并视情况采取行动;
  • 排除故障:当CDWPG数仓存在使用异常的情况下,我们可以通过pg_stat_activity获取到异常的请求,并干预这些请求。

查看系统视图pg_stat_activity

```

postgres=> d pg_stat_activity

View "pg_catalog.pg_stat_activity"

Column | Type | Modifiers | Storage | Description

------------------ -------------------------- ----------- ---------- -------------

datid | oid | | plain |

datname | name | | plain |

pid | integer | | plain |

sess_id | integer | | plain |

usesysid | oid | | plain |

usename | name | | plain |

application_name | text | | extended |

client_addr | inet | | main |

client_hostname | text | | extended |

client_port | integer | | plain |

backend_start | timestamp with time zone | | plain |

xact_start | timestamp with time zone | | plain |

query_start | timestamp with time zone | | plain |

state_change | timestamp with time zone | | plain |

waiting | boolean | | plain |

state | text | | extended |

backend_xid | xid | | plain |

backend_xmin | xid | | plain |

query | text | | extended |

waiting_reason | text | | extended |

rsgid | oid | | plain |

rsgname | text | | extended |

rsgqueueduration | interval | | plain |

View definition:

SELECT s.datid,

d.datname,

s.pid,

s.sess_id,

s.usesysid,

u.rolname AS usename,

s.application_name,

s.client_addr,

s.client_hostname,

s.client_port,

s.backend_start,

s.xact_start,

s.query_start,

s.state_change,

s.waiting,

s.state,

s.backend_xid,

s.backend_xmin,

s.query,

s.waiting_reason,

s.rsgid,

s.rsgname,

s.rsgqueueduration

FROM pg_database d,

pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, waiting, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, cl

ient_port, backend_xid, backend_xmin, sess_id, waiting_reason, rsgid, rsgname, rsgqueueduration),

pg_authid u

WHERE s.datid = d.oid AND s.usesysid = u.oid;

```

字段详细信息:

字段

类型

描述

datid

oid

连接后端的数据库OID

datname

name

连接后端的数据库名称

pid

integer

后端进程ID

sess_id

integer

Session ID

usesysid

oid

登陆后端的用户OID

usename

name

登陆到该后端的用户名

application_name

text

连接到后端的应用名

client_addr

inet

连接到后端的客户端的IP地址。 如果此字段是null, 它表明通过服务器机器上UNIX套接字连接客户端或者这是内部进程如autovacuum

client_hostname

text

连接客户端的主机名,通过client_addr的反向DNS查找报告。 这个字段将只是非空的IP连接,并且仅仅当启动log_hostname的时候

client_port

integer

客户端用于与后端通讯的TCP端口号,或者如果使用Unix套接字,则为-1

backend_start

timestamptz

该过程开始的时间,比如当客户端连接服务器时

xact_start

timestamptz

启动当前事务的时间,如果没有事务是活的,则为null。如果当前查询是 首个事务,则这列等同于query_start列

query_start

timestamptz

开始当前活跃查询的时间, 或者如果state是非活跃的, 当开始最后查询时

state_change

timestampz

上次状态改变的时间

waiting

boolean

如果后端当前正等待锁则为真(t),否则为(f)

state

text

该后端当前总体状态。可能值是: ● 活跃的(active):后端正在执行一个查询。 ● 空闲的(idle):后端正在等待一个新的客户端命令。 ● 空闲事务(idle in transaction):后端在事务中,但是目前无法执行查询。 ● 被终止的空闲事务(idle in transaction (aborted)):这个情况类似于空闲事务,除了事务导致错误的一个语句之一。 ● 快速路径函数调用(fastpath function call):后端正在执行一个快速路径函数。 ● 禁用(disabled):如果后端禁用track_activities,则报告这个状态。

query

text

该后端的最新查询文本。如果状态是活跃的, 此字段显示当前正在执行的查询。在所有其他情况中,这表明执行过去的查询

waiting_reason

text

服务器进程等待的原因,可能值为: lock, replication, or resgroup

rsgid

oid

资源组 OID 或者 0

rsgname

text

资源组名称 或者 unknown

rsgqueueduration

interval

查询请求在查询队列中排队的总时长

注意:启用资源组时。只有查询调度程序(QD)进程将具有rsgid和rsgname。其他服务器进程(例如查询执行器(QE)进程或会话连接进程)将具有rsgid的价值0和一个rsgname的价值未知。QE流程与调度QD流程由同一资源组管理。

获取视图信息

  • 获取连接信息

通过该SQL可以查询出当前有哪些连接,以及简单的信息

```

postgres=> SELECT datname,

postgres-> usename,

postgres-> client_addr,

postgres-> client_port

postgres-> FROM pg_stat_activity;

datname | usename | client_addr | client_port

----------- -------------- ---------------- -------------

gpperfmon | gpadmincloud | |

postgres | cdwadmin | 101.68.144.103 | 40601

postgres | cdwadmin | 101.68.144.103 | 40123

postgres | cdwadmin | 101.68.144.103 | 41225

postgres | cdwadmin | 101.68.144.103 | 41240

postgres | cdwadmin | 101.68.144.103 | 41011

gpperfmon | gpadmincloud | |

postgres | cdwadmin | 10.0.0.6 | 56964

postgres | cdwadmin | 101.68.144.103 | 41326

postgres | cdwadmin | 101.68.144.103 | 39664

(10 rows)

```

  • 获取执行SQL的信息

通过该SQL可以查询出当前的执行SQL信息

```

postgres=> SELECT datname,

postgres-> usename,

postgres-> query

postgres-> FROM pg_stat_activity;

datname | usename | query

----------- -------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------

gpperfmon | gpadmincloud | <insufficient privilege>

postgres | cdwadmin | SELECT d.datname, d.oid, pg_get_userbyid(d.datdba) AS owner, shobj_description(d.oid, 'pg_database') AS comment, t.spcname, d.datacl, d.datlastsysoid, d.encoding,

pg_encoding_to_char(d.encoding) AS encodingname FROM pg_database d LEFT JOIN pg_tablespace t ON d.dattablespace=t.oid

postgres | cdwadmin | SELECT c.oid, obj_description(c.oid), c.relhasoids AS hasoids, n.nspname AS schemaname, c.relname AS tablename, c.relkind, pg_get_userbyid(c.relowner) AS tableown

er, t.spcname AS "tablespace", c.relhasindex AS hasindexes, c.relhasrules AS hasrules, c.relhastriggers AS hastriggers, ft.ftoptions, fs.srvname, c.relacl, c.reltuples, ((SELECT count(*) FROM

pg_inherits WHERE inhparent = c.oid) > 0) AS inhtable, i2.nspname AS inhschemaname, i2.relname AS inhtablename, c.reloptions AS param, c.relpersistence AS unlogged FROM pg_class c LEFT JOIN

pg_namespace n ON n.oid = c.relnamespace LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace LEFT JOIN (pg_inherits i INNER JOIN pg_class c2 ON i.inhparent = c2.oid LEFT JOIN pg_namespace n2

ON n2.oid = c2.relnamespace) i2 ON i2.inhrelid = c.oid LEFT JOIN pg_foreign_table ft ON ft.ftrelid = c.oid LEFT JOIN pg_foreign_server fs ON ft.ftserver = fs.oid WHERE ((c.relkind = 'r'::"ch

ar") OR (c.relkind = 'f'::"char")) AND n.nspname = 'public'

postgres | cdwadmin | SELECT rolname, rolsuper, rolinherit, rolcreaterole, rolcreatedb, rolcatupdate, rolcanlogin, rolconnlimit, rolvaliduntil, rolconfig, oid , pg_catalog.shobj_descrip

tion(oid, 'pg_authid') AS comment FROM pg_roles

postgres | cdwadmin | SELECT c.conkey FROM pg_constraint c WHERE c.contype='p' AND c.conrelid = 24756

postgres | cdwadmin | SELECT t.relname, fns.nspname AS foreign_schema, f.relname AS foreign_table, c.conname, c.confkey, a.attname AS foreign_field, a.attnum FROM pg_constraint c LEFT J

OIN pg_namespace ns ON (c.connamespace = ns.oid) LEFT JOIN pg_class t ON (c.conrelid = t.oid) INNER JOIN pg_class f ON (c.confrelid = f.oid) LEFT JOIN pg_namespace fns ON (f.relnamespace = fn

s.oid) INNER JOIN pg_attribute a ON (a.attrelid = f.oid) WHERE a.attnum > 0 AND ns.nspname = 'public' AND t.relname = 'user_behavior' ORDER BY t.relname, c.conname, a.attnum

gpperfmon | gpadmincloud | <insufficient privilege>

postgres | cdwadmin | SELECT datname,usename,query FROM pg_stat_activity;

postgres | cdwadmin | SELECT c.conkey FROM pg_constraint c WHERE c.contype='p' AND c.conrelid = 24756

postgres | cdwadmin | SELECT COUNT(1) FROM user_behavior;

(10 rows)

```

上面一条SQL的查询结果包含了当前没有查询的空连接,如果想只查询当前正在运行的SQL,需要用下面这条SQL

```

postgres=> SELECT datname,

postgres-> usename,

postgres-> query

postgres-> FROM pg_stat_activity

postgres-> WHERE state != 'idle';

datname | usename | query

---------- ---------- -----------------------------------------------

postgres | cdwadmin | SELECT datname,

| | usename,

| | query

| | FROM pg_stat_activity

| | WHERE state != 'idle';

postgres | cdwadmin | SELECT * FROMr

| | (SELECT *,r

| | row_number() over(partition by hostr

| | ORDER BY age desc) agedr

| | FROM user_behavior) tr

| | WHERE aged = 1r

| | LIMIT 10;

(2 rows)

```

  • 获取查询耗时比较久的查询

通过该SQL可以查询出当前耗时较长的执行SQL信息

```

postgres=> SELECT current_timestamp - query_start AS run_time,

postgres-> datname,

postgres-> usename,

postgres-> query

postgres-> FROM pg_stat_activity

postgres-> WHERE state != 'idle'

postgres-> ORDER BY 1 desc;

runtime | datname | usename | query

----------------- ---------- ---------- ----------------------------------------------------------------------------

00:00:11.070832 | postgres | cdwadmin | SELECT * FROMr

| | | (SELECT *,r

| | | row_number() over(partition by hostr

| | | ORDER BY age desc) agedr

| | | FROM user_behavior) tr

| | | WHERE aged = 1r

| | | LIMIT 10;

00:00:00 | postgres | cdwadmin | select current_timestamp - query_start as runtime, datname, usename, query

| | | from pg_stat_activity

| | | where state != 'idle'

| | | order by 1 desc;

(2 rows)

```

可以发现第一条SQL执行了11秒还没结束。

重要字段讲解

前面我们列举过字段详细信息:

waiting有两个值,分别为:假(f),真(t);

state有6种状态,这里我们介绍其中4种:

活跃的(active);

空闲的(idle);

空闲事务(idle in transaction);

被终止的空闲事务(idle in transaction (aborted))。

下面通过几个简单的例子,来讲解这两个相对重要的字段:waiting,state。

场景一:查看当前正在执行的请求

state

- active:请求正在执行中

waiting

- f:当前请求没有在等待

```

postgres=> SELECT datid,

postgres-> datname,

postgres-> usename,

postgres-> waiting,

postgres-> state,

postgres-> query

postgres-> FROM pg_stat_activity

postgres-> WHERE usename = 'cdwadmin' AND

postgres-> state = 'active';

datid | datname | usename | waiting | state | query

------- ---------- ---------- --------- -------- -----------------------------------

12810 | postgres | cdwadmin | f | active | SELECT datid,

| | | | | datname,

| | | | | usename,

| | | | | waiting,

| | | | | state,

| | | | | query

| | | | | FROM pg_stat_activity

| | | | | WHERE usename = 'cdwadmin' AND

| | | | | state = 'active';

12810 | postgres | cdwadmin | f | active | SELECT user, sex, age, COUNT(1)r

| | | | | FROM user_behaviorr

| | | | | GROUP BY user, sex, age;

(2 rows)

```

这里可以看到当前用户cdwadmin符合状态为active的SQL有两条,其中waiting的值为f(false)。

场景二:查看当前执行等待的请求

state

- active:请求正在执行中

waiting

- t:当前请求正在等待

```

postgres=> BEGIN;

BEGIN

postgres=> ALTER TABLE user_behavior ADD COLUMN region TEXT;

ALTER TABLE

postgres=>

```

这里我们用BEGIN语句使当前用户处在事务中,并发起了一个ALTER请求使产生表锁, 然后下面用另一个客户端再对该表发起查询 。

```

postgres=> SELECT * FROM user_behavior LIMIT 1;

```

这里发现查询请求处在没有响应的状态,一直卡在这里,这是因为增加字段会产生表锁,锁释放之前该表无法进行其他操作。然后下面再打开新的一个客户端获取执行信息。

```

postgres=> SELECT datid,

postgres-> datname,

postgres-> usename,

postgres-> waiting,

postgres-> state,

postgres-> query

postgres-> FROM pg_stat_activity

postgres-> WHERE usename = 'cdwadmin' AND

postgres-> waiting = 't';

datid | datname | usename | waiting | state | query

------- ---------- ---------- --------- -------- --------------------------------------

12810 | postgres | cdwadmin | t | active | SELECT * FROM user_behavior LIMIT 1;

(1 row)

```

这里可以看到当前用户cdwadmin符合waiting状态为t的SQL有一条,其中state为active。

场景三:查看当前没有请求的空连接

state

- idle:当前没有请求在执行的长连接,在等待请求

```

postgres=> SELECT datid,

postgres-> datname,

postgres-> usename,

postgres-> waiting,

postgres-> state,

postgres-> query

postgres-> FROM pg_stat_activity

postgres-> WHERE usename = 'cdwadmin' AND

postgres-> state = 'idle';

datid | datname | usename | waiting | state | query

------- ---------- ---------- --------- -------- -----------------------------------

12810 | postgres | cdwadmin | f | idle | SELECT user, sex, age, COUNT(1)r

| | | | | FROM user_behaviorr

| | | | | GROUP BY user, sex, age;

(1 rows)

```

这里可以看到当前用户cdwadmin符合状态为idle的SQL有一条,其中waiting的值为f(false)。

场景四:查看当前空闲的事务

state

- idle in transaction:空闲的事务

```

postgres=> BEGIN;

BEGIN

postgres=> SELECT * FROM

postgres-> (SELECT *,

postgres(> row_number() over(partition by host

postgres(> ORDER BY age desc) aged

postgres(> FROM user_behavior) t

postgres-> WHERE aged = 1

postgres-> LIMIT 10;

host | use | sex | age | behavior_id | aged

----------------- ---------- ----- ----- -------------------------------------- ------

192.168.100.107 | Gil | boy | 19 | b0893205-f626-49c9-82ab-c7ab6eef5d21 | 1

192.168.100.110 | Gore | boy | 19 | 9b5bae42-2d52-4dee-acf9-54ba60bb08a7 | 1

192.168.100.113 | Albin | boy | 19 | 8c710196-fc73-4878-a7b0-54b9d0adfc89 | 1

192.168.100.114 | Gil | boy | 19 | 74348395-c0b5-4a49-b798-19c733c4b664 | 1

192.168.100.117 | Denny | boy | 19 | 0103a06b-490d-4eab-bcc3-4598012ece13 | 1

192.168.100.121 | Hargrove | boy | 19 | a32d0a88-e0a5-4e15-95a0-0d54b92d9fbc | 1

192.168.100.123 | Felix | boy | 19 | 93fae52b-9f54-4b97-940d-04aea35be0d5 | 1

192.168.100.125 | Jordon | boy | 19 | 1b99dd74-1a0c-42dc-a119-8802c3173478 | 1

192.168.100.128 | Felix | boy | 19 | 220ee35b-88d1-464b-806d-07ff725402d9 | 1

192.168.100.13 | Gorman | boy | 19 | 82346ccd-6e07-42aa-9714-62f71a28c3c4 | 1

(10 rows)

```

这里我们用BEGIN语句使当前用户处在事务中,然后下面用另一个客户端获取执行信息。

```

postgres=> SELECT datid,

postgres-> datname,

postgres-> usename,

postgres-> waiting,

postgres-> state,

postgres-> query

postgres-> FROM pg_stat_activity

postgres-> WHERE usename = 'cdwadmin' AND

postgres-> state = 'idle in transaction';

datid | datname | usename | waiting | state | query

------- ---------- ---------- --------- --------------------- ---------------------------------------------

12810 | postgres | cdwadmin | f | idle in transaction | SELECT * FROM

| | | | | (SELECT *,

| | | | | row_number() over(partition by host

| | | | | ORDER BY age desc) aged

| | | | | FROM user_behavior) t

| | | | | WHERE aged = 1

| | | | | LIMIT 10;

(1 row)

```

这里可以看到当前用户cdwadmin符合状态为idle in transaction的SQL有一条,其中waiting的值为f(false)。

场景五:查看当前发生错误的空闲事务

state

- idle in transaction (aborted):由于发生错误而被终止的空闲事务

```

postgres=> BEGIN;

BEGIN

postgres=> SELECT * FROM

postgres-> (SELECT *,

postgres(> row_number() over(partition by host

postgres(> ORDER BY age desc) aged

postgres(> FROM user_behavior) t

postgres-> WHERE aged = 1

postgres-> LIMIT 10;

host | use | sex | age | behavior_id | aged

----------------- ---------- ----- ----- -------------------------------------- ------

192.168.100.100 | Tobias | boy | 19 | 0034dcf7-f5f3-42d2-a463-0d1708286b64 | 1

192.168.100.101 | Thorpe | boy | 19 | bb2bad2b-2e79-42d5-9df7-57746ecab218 | 1

192.168.100.102 | Titus | boy | 19 | 22722d52-fd41-426b-b844-e61be731fe03 | 1

192.168.100.103 | Webber | boy | 19 | 35115e19-84ae-411a-8a5c-584fbcccc030 | 1

192.168.100.104 | Swift | boy | 19 | bc7a1e8d-4f05-4434-add2-a323503065a4 | 1

192.168.100.105 | Rowell | boy | 19 | 5578bd6b-9b16-4c65-9519-05ae38686480 | 1

192.168.100.108 | Webber | boy | 19 | be97a453-05de-413d-b8ab-b7671559d60d | 1

192.168.100.119 | Sheridan | boy | 19 | 7841e7a2-27d0-44a1-ac5f-a6e8a3689587 | 1

192.168.100.12 | Russ | boy | 19 | 4780063f-a131-40b1-b3a7-e4deba865e7a | 1

192.168.100.120 | Gil | boy | 19 | 9ef60038-8fe0-46aa-a69a-97e498e08f98 | 1

(10 rows)

postgres=> SELEC * FROM user_behavior LIMIT 1;

ERROR: syntax error at or near "SELEC"

LINE 1: SELEC * FROM user_behavior LIMIT 1;

^

```

这里我们用BEGIN语句使当前用户处在事务中,并发起了一个错误的语法请求, 然后下面用另一个客户端获取执行信息 。

```

postgres=> SELECT datid,

datname,

usename,

waiting,

state,

query

FROM pg_stat_activity

WHERE usename = 'cdwadmin' AND

state = 'idle in transaction (aborted)';

datid | datname | usename | waiting | state | query

------- ---------- ---------- --------- ------------------------------- -------------------------------------

12810 | postgres | cdwadmin | f | idle in transaction (aborted) | SELEC * FROM user_behavior LIMIT 1;

(1 row)

```

这里可以看到当前用户cdwadmin符合状态为idle in transaction (aborted)的SQL有一条,其中waiting的值为f(false)。

任务分析进阶——诊断原因并修复

在上面的场景二中,我们发现了有正在等待的请求,下面我们来学习如何找出发生等待事件的根本原因。

由于Greenplum没有提供任何直接的方法来确认请求等待的原因,所以我们将引入另一张视图————pg_locks。

将pg_locks与pg_stat_activity结合使用,可以使我们获得很多与锁定相关的有用信息。

  • 查看当前执行等待的请求以及等待的原因

```

postgres=> SELECT waiting.query AS waiting_query,

postgres-> waiting.pid AS waiting_pid,

postgres-> waiting.usename AS waiting_user,

postgres-> locking.query AS locking_query,

postgres-> locking.pid AS locking_pid,

postgres-> locking.usename AS locking_user,

postgres-> t.schemaname || '.' || t.relname AS tablename

postgres-> FROM pg_stat_activity waiting

postgres-> JOIN pg_locks l1 ON

postgres-> waiting.pid = l1.pid AND

postgres-> NOT l1.granted

postgres-> JOIN pg_locks l2 ON

postgres-> l1.relation = l2.relation AND

postgres-> l2.granted

postgres-> JOIN pg_stat_activity locking ON

postgres-> l2.pid = locking.pid

postgres-> JOIN pg_stat_user_tables t ON

postgres-> l1.relation = t.relid

postgres-> WHERE waiting.waiting = 't';

waiting_query | waiting_pid | waiting_user | locking_query | locking_pid | locking_user | tablename

-------------------------------------- ------------- -------------- --------------------------------------------------- ------------- -------------- ----------------------

SELECT * FROM user_behavior LIMIT 1; | 17674 | cdwadmin | ALTER TABLE user_behavior ADD COLUMN region TEXT; | 15126 | cdwadmin | public.user_behavior

(1 row)

```

这里可以看到当前用户cdwadmin有一条正在等待的请求,其对应锁定的原因也很明朗。在确认了阻塞的原因之后,可以使用pg_cancel_backend或者pg_terminate_backend的方式来进行修复。

  • 使用pg_cancel_backend终止查询

```

postgres=> SELECT PG_CANCEL_BACKEND(17674);

pg_cancel_backend

-------------------

t

(1 row)

```

执行成功,等待的查询已被终止,其中17674是上面查询出来的waiting_pid。需要注意的是,这里只终止了查询,并没有释放其session。

  • 使用pg_terminate_backend终止查询并释放连接

```

postgres=> SELECT PG_TERMINATE_BACKEND(17674);

pg_terminate_backend

----------------------

t

(1 row)

postgres=> SELECT PG_TERMINATE_BACKEND(17674);

WARNING: PID 17674 is not a PostgreSQL server process

pg_terminate_backend

----------------------

f

(1 row)

```

可以发现第二次执行pg_terminate_backend时返回了false,因为该连接已经被第一次执行释放掉了。

小结

简而言之,pg_stat_activity的主要作用是显示CDWPG中的当前活动。pg_stat_activity是一张系统统计视图,可以用于任务分析诊断。这是非常有用且功能强大的视图,其价值很难衡量。

因此,如果想知道在任意给定时刻CDWPG数据仓库中正在发生什么,那么pg_stat_activity是最好的选择,是CDWPG使用者们最好的帮手。

参考资料

[1] https://cloud.tencent.com/document/product/878

[2] https://docs.greenplum.org/6-8/common/gpdb-features.html

0 人点赞