MySQL PXC 集群死锁分析案例-2

2024-07-26 10:40:11 浏览数 (1)

接前文

二、节点日志分析108节点

■■ 以下日志显示,3:18分109节点出现问题(问题详见"109节点日志分析"),继而109节点被驱逐出去集群。

代码语言:shell复制
2024-07-15T03:18:34.562042Z 0 [Note] [MY-000000] [Galera] Member 1(pxc-cluster-node-2) initiates vote on 2a37cad4-f834-11ee-916e-da042b
ca7bc3:9338777,892e7c82cf62c56b:  Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase thi
s mysqld variable and try again, Error_code: 1197; Writing one row to the row-based binary log failed, Error_code: 1534;
2024-07-15T03:18:34.562391Z 0 [Note] [MY-000000] [Galera] Recomputed vote based on error codes: 1197, 1534. New vote d659be586fd355cc w
ill be used for further steps. Old Vote: 892e7c82cf62c56b
2024-07-15T03:18:34.562448Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
   0000000000000000:   1/3
   d659be586fd355cc:   1/3
Waiting for more votes.
2024-07-15T03:18:34.562568Z 15 [Note] [MY-000000] [Galera] Got vote request for seqno 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777
2024-07-15T03:18:34.564109Z 0 [Note] [MY-000000] [Galera] Member 0(pxc-cluster-node-3) responds to vote on 2a37cad4-f834-11ee-916e-da04
2bca7bc3:9338777,0000000000000000: Success
2024-07-15T03:18:34.564159Z 0 [Warning] [MY-000000] [Galera] Received bogus VOTE message: 9338777.0, from node 2a36a321-f834-11ee-bc44-
bb66609ee19b, expected > 9338860. Ignoring.
2024-07-15T03:18:34.564188Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
   0000000000000000:   1/3
   d659be586fd355cc:   1/3
Waiting for more votes.
2024-07-15T03:18:34.564212Z 0 [Note] [MY-000000] [Galera] Member 2(pxc-cluster-node-1) responds to vote on 2a37cad4-f834-11ee-916e-da04
2bca7bc3:9338777,0000000000000000: Success
2024-07-15T03:18:34.564229Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
   0000000000000000:   2/3
   d659be586fd355cc:   1/3
Winner: 0000000000000000
Winner: 0000000000000000
2024-07-15T03:18:34.564279Z 15 [Note] [MY-000000] [Galera] Vote 0 (success) on 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777 is consiste
nt with group. Continue.
2024-07-15T03:18:34.571056Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T03:18:34.571138Z 0 [Note] [MY-000000] [Galera] forgetting 601c1fce-8a87 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T03:18:34.572400Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T03:18:34.578145Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,10)
memb {
        99999999-9910,0
        99999999-9908,0
        }
joined {
        }
left {
        }
partitioned {
        601c1fce-8a87,0
        }
)
2024-07-15T03:18:34.578202Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T03:18:34.579951Z 0 [Note] [MY-000000] [Galera] forgetting 601c1fce-8a87 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T03:18:34.579977Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-07-15T03:18:34.580032Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T03:18:34.581317Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3
2024-07-15T03:18:34.582510Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3 from 0 (p
xc-cluster-node-3)
2024-07-15T03:18:34.582561Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3 from 1 (p
xc-cluster-node-1)
2024-07-15T03:18:34.582581Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 9,
        members    = 2/2 (primary/total),
        act_id     = 9338911,
        last_appl. = 9338766,
        protocols  = 2/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T03:18:34.582642Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-07-15T03:18:34.582782Z 13 [Note] [MY-000000] [Galera] ####### processing CC 9338912, local, ordered
2024-07-15T03:18:34.582833Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 9338911 upto current CC event 9338912 upto:9338911
2024-07-15T03:18:34.582853Z 13 [Note] [MY-000000] [Galera] Drain monitors from 9338911 up to 9338911
2024-07-15T03:18:34.582875Z 13 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T03:18:34.582890Z 13 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T03:18:34.582925Z 13 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T03:18:34.582951Z 13 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9338911 -> 9338912
2024-07-15T03:18:34.583030Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T03:18:34.586904Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9338912
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREOR
DERED, STREAMING, NBO
  final: no
  own_index: 1
  members(2):
        0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
        1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
=================================================
2024-07-15T03:18:34.587017Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-15T03:18:34.595121Z 13 [Note] [MY-000000] [Galera] Recording CC from group: 9338912
2024-07-15T03:18:34.595181Z 13 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9338767
2024-07-15T03:18:34.595204Z 13 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9337022

■■ 但随后的日志显示,108节点处于正常服务状态。

■■ 随后的日志显示,04:00有节点重启加入集群:

代码语言:shell复制
2024-07-15T04:00:07.309528Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') connection established
 to 99999999-9909 tcp://[9999:9999:9999:9999::6d]:4567
2024-07-15T04:00:07.573846Z 116353 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:00:07.812276Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T04:00:07.812392Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9909 at tcp://[9999:9999:9999:9999::6d]:4567 stable
2024-07-15T04:00:07.813733Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T04:00:07.815269Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,11)
memb {
        99999999-9910,0
        99999999-9908,0
        99999999-9909,0
        }
joined {
        }
left {
        }
partitioned {
        }
)
2024-07-15T04:00:07.815311Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T04:00:07.817183Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2024-07-15T04:00:07.817262Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T04:00:07.821664Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3
2024-07-15T04:00:07.822896Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 0 (px
c-cluster-node-3)
2024-07-15T04:00:07.822956Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 1 (px
c-cluster-node-1)
2024-07-15T04:00:08.310999Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 2 (px
c-cluster-node-2)
2024-07-15T04:00:08.311090Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 10,
        members    = 2/3 (primary/total),
        act_id     = 9339054,
        last_appl. = 9338766,
        protocols  = 2/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T04:00:08.311280Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2024-07-15T04:00:08.311566Z 13 [Note] [MY-000000] [Galera] ####### processing CC 9339055, local, ordered
2024-07-15T04:00:08.311730Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 9339054 upto current CC event 9339055 upto:9339054
2024-07-15T04:00:08.311758Z 13 [Note] [MY-000000] [Galera] Drain monitors from 9339054 up to 9339054
2024-07-15T04:00:08.311777Z 13 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T04:00:08.311791Z 13 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T04:00:08.311804Z 13 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T04:00:08.311820Z 13 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9339054 -> 9339055
2024-07-15T04:00:08.311910Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T04:00:08.315209Z 13 [Note] [MY-000000] [Galera] ================================================
View:
  id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9339055
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORD
ERED, STREAMING, NBO
  final: no
  own_index: 1
  members(3):
        0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
        1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
        2: b93005c2-425e-11ef-8b79-73a96136cc10, pxc-cluster-node-2
=================================================

■■ 但紧接着是较多的锁表等待,根据以往测试情况,每次冲突也是总有线程获胜,而同时必然有线程受害:

代码语言:shell复制
TRANSACTION 114209457, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT , undo log entries 3
MySQL thread id 16, OS thread handle 139734347618048, query id 6745542 Applying batch of row changes (update)
TRANSACTION 114208702, ACTIVE 1338 sec
, undo log entries 6
MySQL thread id 114856, OS thread handle 139691406636800, query id 6676148 2409:807c:5a06:1::104:1705 tm_xj wsrep: replicating and certi
fying write set
commit
2024-07-15T04:00:08.333081Z 16 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.333098Z 16 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2024-07-15T04:00:08.333111Z 16 [Note] [MY-000000] [WSREP] Winning thread:
   THD: 16, mode: high priority, state: exec, conflict: executing, seqno: 9339058
   SQL: (null)

2024-07-15T04:00:08.333122Z 16 [Note] [MY-000000] [WSREP] Victim thread:
   THD: 114856, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: commit

TRANSACTION 114209458, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT , undo log entries 3
MySQL thread id 12, OS thread handle 139734349731584, query id 6745554 Applying batch of row changes (update)
TRANSACTION 114208878, ACTIVE 994 sec
, undo log entries 10
MySQL thread id 115190, OS thread handle 139671641110272, query id 6694341 2409:807c:5a06:1::104:1702 tm_xj wsrep: replicating and certi
fying write set
commit
2024-07-15T04:00:08.334316Z 12 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.334343Z 12 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2024-07-15T04:00:08.334363Z 12 [Note] [MY-000000] [WSREP] Winning thread:
   THD: 12, mode: high priority, state: exec, conflict: executing, seqno: 9339060
   SQL: (null)

2024-07-15T04:00:08.334379Z 12 [Note] [MY-000000] [WSREP] Victim thread:
   THD: 115190, mode: local, state: exec, conflict: certifying, seqno: -1
   SQL: commit

连续发生了188次冲突

root@node1:0 /u01/mysqld.log# grep "CONFLICT DETECTED" mysqld.log.108|grep "2024-07-15T04:00"|wc -l

188

最后一次冲突如下:

代码语言:shell复制
2024-07-15T04:00:08.729167Z 15 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.729202Z 15 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:

2024-07-15T04:00:08.729221Z 15 [Note] [MY-000000] [WSREP] Winning thread:
   THD: 15, mode: high priority, state: exec, conflict: executing, seqno: 9339273
   SQL: (null)

2024-07-15T04:00:08.729236Z 15 [Note] [MY-000000] [WSREP] Victim thread:
   THD: 115577, mode: local, state: exec, conflict: must_replay, seqno: 9339387
   SQL: commit

■■ 如下日志显示,在这些冲突发生110秒后,108节点试图连接109节点:

代码语言:shell复制
2024-07-15T04:01:50.417379Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') turning message relay
requesting on, nonlive peers: tcp://[9999:9999:9999:9999::6d]:4567
2024-07-15T04:01:51.894449Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') reconnecting to b93005
c2-8b79 (tcp://[9999:9999:9999:9999::6d]:4567), attempt 0
2024-07-15T04:01:52.574837Z 116695 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:01:53.277070Z 116696 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:01:54.435874Z 116697 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is dep
recated and will be removed in a future release. Please use caching_sha2_password instead'

以下日志显示,3秒钟没能连接成功,怀疑超时,断定109节点非活动,该节点被驱逐

代码语言:shell复制
2024-07-15T04:01:54.847941Z 0 [Note] [MY-000000] [Galera] declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2024-07-15T04:01:54.848077Z 0 [Note] [MY-000000] [Galera] evs::proto(99999999-9908, OPERATIONAL, view_id(REG,99999999-9910,11)) suspecti
ng node: 99999999-9909
2024-07-15T04:01:54.848102Z 0 [Note] [MY-000000] [Galera] evs::proto(99999999-9908, OPERATIONAL, view_id(REG,99999999-9910,11)) suspecte
d node without join message, declaring inactive
2024-07-15T04:01:55.348302Z 0 [Note] [MY-000000] [Galera] declaring node with index 2 inactive (evs.inactive_timeout)

2024-07-15T04:01:55.851247Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T04:01:55.852703Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T04:01:55.858696Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,12)
memb {
        99999999-9910,0
        99999999-9908,0
        }
joined {
        }
left {
        }
partitioned {
        99999999-9909,0
        }
)
2024-07-15T04:01:55.858739Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T04:01:55.860637Z 0 [Note] [MY-000000] [Galera] forgetting 99999999-9909 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T04:01:55.860717Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-07-15T04:01:55.860857Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T04:01:55.860748Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') turning message relay 
requesting off

■■ 之后交换状态信息,获取到仲裁结果,确认了2个活动节点:108、110,并清理了109节点

代码语言:shell复制
2024-07-15T04:01:55.862455Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: f9e37a10-425e-11ef-86d1-67533f4bae37
2024-07-15T04:01:55.863749Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: f9e37a10-425e-11ef-86d1-67533f4bae37 from 0 (px
c-cluster-node-3)
2024-07-15T04:01:55.863809Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: f9e37a10-425e-11ef-86d1-67533f4bae37 from 1 (px
c-cluster-node-1)
2024-07-15T04:01:55.863837Z 0 [Note] [MY-000000] [Galera] Quorum results:
        version    = 6,
        component  = PRIMARY,
        conf_id    = 11,
        members    = 2/2 (primary/total),
        act_id     = 9339389,
        last_appl. = 9338766,
        protocols  = 2/11/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T04:01:55.864077Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-07-15T04:01:55.864321Z 14 [Note] [MY-000000] [Galera] ####### processing CC 9339390, local, ordered
2024-07-15T04:01:55.864384Z 14 [Note] [MY-000000] [Galera] Maybe drain monitors from 9339389 upto current CC event 9339390 upto:9339389
2024-07-15T04:01:55.864411Z 14 [Note] [MY-000000] [Galera] Drain monitors from 9339389 up to 9339389
2024-07-15T04:01:55.864436Z 14 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T04:01:55.864456Z 14 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T04:01:55.864475Z 14 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T04:01:55.864521Z 14 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9339389 -> 9339390
2024-07-15T04:01:55.864632Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T04:01:55.867920Z 14 [Note] [MY-000000] [Galera] ================================================
View:
  id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9339390
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORD
ERED, STREAMING, NBO
  final: no
  own_index: 1
  members(2):
        0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
        1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
=================================================
2024-07-15T04:01:55.868021Z 14 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-15T04:01:55.874762Z 14 [Note] [MY-000000] [Galera] Recording CC from group: 9339390
2024-07-15T04:01:55.874837Z 14 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9338767
2024-07-15T04:01:55.874866Z 14 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9337022
2024-07-15T04:01:58.396276Z 0 [Note] [MY-000000] [Galera]  cleaning up 99999999-9909 (tcp://[9999:9999:9999:9999::6d]:4567)

■■ 之后的日志显示,109节点再次加入未成,继续被驱逐出集群。

三、节点日志分析110节点

■■ 前面的日志同108节点

■■ 4:03的日志显示,109节点再次加入未成,继续被驱逐出集群。

四、总结

根据现场反馈的信息,109节点于11:18分被驱逐后,系统缓慢,业务阻断,此时数据库任务积压,SQL运行慢,大量的锁表,数据库连接数1100左右,大量的wrsp同步线程,杀掉变成killed状态后怀疑资源并未释放。此时即使停掉2个节点,仅保留1个节点,仍然无法提交事务,进而继续大量锁表,此时只能重启数据库解决问题。

总结以上分析过程,以下按照时间顺序客观描述一些症状、现象,以利于找到问题的根本原因。

1、11:16

节点109 第二次发生大事务,全局缓存页达到了 570MB,报错,导致节点109 11:18 停止服务,被驱逐出集群。

2、11:23

一线人员反馈登录时报服务异常,然后自动跳转回登陆页面,导致登录不上系统。项目经理收到问题反馈后开始排查系统问题。登录ELK系统,查看系统监控发现家客异步导出微服务延迟异常。进一步查看详情,发现数据库出现积压,SQL执行变慢。

3、11:47

其他系统侧人员电话联系反馈该系统接口调用失败,导致相关工单无法办理。

4、12时

系统恢复正常登陆后,其他侧反馈接口正常,但是仍有小部分接口出现异常情况,需多次提交,才能成功。

5、12:03

运维人员后台监控发现出入库异常,"Error updating database. Caus"

运维人员根据异常信息定位到有锁表进程并对锁表进程进行释放

6、12:35左右

出入库接口依然存在无法正常回单问题,经核查发现数据库进程状态虽然变成“kill”,但是并没有释放。然后立即将现场最新情况反馈给公司专家,寻求远程协助。

7、13:30

停掉了两个备库节点,只保留主数据库节点。然后滚动重启相关服务及接口。重启完后,验证系统登陆正常,观察相关接口也正常。

8、14:00

支撑群里面一线装维又反馈回单出现报错如下:

Could not commit JDBC transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTra Deadlock found when trying to getlock; try restarting transaction

经与公司专家紧急开会讨论,初步估计是PXC集群异常,虽然已经下线了两个备库节点,但是主节点还在做强一致性同步,导致事务无法提交,从而出现大量锁表的情况。

9、16:15

电话联系系统管理员XXX,申请重启数据库。于16:25完成数据库重启,然后滚动重启相关服务及接口。重启完后,验证系统登陆正常,系统接口也正常。

10、16:44

在支撑群中通知一线人员系统恢复正常,重新进行回单。

一线分析参考:

本次故障是一线异步导出全网数据时,后台会同时更新XX平台割接过来的XXX关系状态,导出数据量过大时,出现了大事务提交,导致备库节点异常,从而整个PXC集群各节点数据同步卡死,最终导致事务提交不了,出现锁表的情况。

0 人点赞