Continued from the previous part.
2. Node log analysis: node 108
■■ The following logs show that node 109 ran into trouble at 03:18 (see the "node 109 log analysis" section for details), after which node 109 was evicted from the cluster.
2024-07-15T03:18:34.562042Z 0 [Note] [MY-000000] [Galera] Member 1(pxc-cluster-node-2) initiates vote on 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777,892e7c82cf62c56b: Multi-statement transaction required more than 'max_binlog_cache_size' bytes of storage; increase this mysqld variable and try again, Error_code: 1197; Writing one row to the row-based binary log failed, Error_code: 1534;
2024-07-15T03:18:34.562391Z 0 [Note] [MY-000000] [Galera] Recomputed vote based on error codes: 1197, 1534. New vote d659be586fd355cc will be used for further steps. Old Vote: 892e7c82cf62c56b
2024-07-15T03:18:34.562448Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
0000000000000000: 1/3
d659be586fd355cc: 1/3
Waiting for more votes.
2024-07-15T03:18:34.562568Z 15 [Note] [MY-000000] [Galera] Got vote request for seqno 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777
2024-07-15T03:18:34.564109Z 0 [Note] [MY-000000] [Galera] Member 0(pxc-cluster-node-3) responds to vote on 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777,0000000000000000: Success
2024-07-15T03:18:34.564159Z 0 [Warning] [MY-000000] [Galera] Received bogus VOTE message: 9338777.0, from node 2a36a321-f834-11ee-bc44-bb66609ee19b, expected > 9338860. Ignoring.
2024-07-15T03:18:34.564188Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
0000000000000000: 1/3
d659be586fd355cc: 1/3
Waiting for more votes.
2024-07-15T03:18:34.564212Z 0 [Note] [MY-000000] [Galera] Member 2(pxc-cluster-node-1) responds to vote on 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777,0000000000000000: Success
2024-07-15T03:18:34.564229Z 0 [Note] [MY-000000] [Galera] Votes over 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777:
0000000000000000: 2/3
d659be586fd355cc: 1/3
Winner: 0000000000000000
2024-07-15T03:18:34.564279Z 15 [Note] [MY-000000] [Galera] Vote 0 (success) on 2a37cad4-f834-11ee-916e-da042bca7bc3:9338777 is consistent with group. Continue.
2024-07-15T03:18:34.571056Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T03:18:34.571138Z 0 [Note] [MY-000000] [Galera] forgetting 601c1fce-8a87 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T03:18:34.572400Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T03:18:34.578145Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,10)
memb {
99999999-9910,0
99999999-9908,0
}
joined {
}
left {
}
partitioned {
601c1fce-8a87,0
}
)
2024-07-15T03:18:34.578202Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T03:18:34.579951Z 0 [Note] [MY-000000] [Galera] forgetting 601c1fce-8a87 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T03:18:34.579977Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-07-15T03:18:34.580032Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T03:18:34.581317Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3
2024-07-15T03:18:34.582510Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3 from 0 (pxc-cluster-node-3)
2024-07-15T03:18:34.582561Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: eb67b0a0-4258-11ef-a082-2e388f64aaa3 from 1 (pxc-cluster-node-1)
2024-07-15T03:18:34.582581Z 0 [Note] [MY-000000] [Galera] Quorum results:
version = 6,
component = PRIMARY,
conf_id = 9,
members = 2/2 (primary/total),
act_id = 9338911,
last_appl. = 9338766,
protocols = 2/11/4 (gcs/repl/appl),
vote policy= 0,
group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T03:18:34.582642Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-07-15T03:18:34.582782Z 13 [Note] [MY-000000] [Galera] ####### processing CC 9338912, local, ordered
2024-07-15T03:18:34.582833Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 9338911 upto current CC event 9338912 upto:9338911
2024-07-15T03:18:34.582853Z 13 [Note] [MY-000000] [Galera] Drain monitors from 9338911 up to 9338911
2024-07-15T03:18:34.582875Z 13 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T03:18:34.582890Z 13 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T03:18:34.582925Z 13 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T03:18:34.582951Z 13 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9338911 -> 9338912
2024-07-15T03:18:34.583030Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T03:18:34.586904Z 13 [Note] [MY-000000] [Galera] ================================================
View:
id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9338912
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 1
members(2):
0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
=================================================
2024-07-15T03:18:34.587017Z 13 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-15T03:18:34.595121Z 13 [Note] [MY-000000] [Galera] Recording CC from group: 9338912
2024-07-15T03:18:34.595181Z 13 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9338767
2024-07-15T03:18:34.595204Z 13 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9337022
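The vote above was triggered by Error 1197 (a multi-statement transaction exceeding max_binlog_cache_size). As a minimal sketch, this is how the limit could be inspected and raised at runtime; the 4 GB value below is purely illustrative, not a recommendation for this system:

```shell
# Check the limit that the oversized transaction exceeded (Error 1197):
mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_binlog_cache_size'"

# Check whether transactions have been spilling the binlog cache to disk:
mysql -e "SHOW GLOBAL STATUS LIKE 'Binlog_cache_disk_use'"

# Raise the limit at runtime (illustrative value, 4 GB):
mysql -e "SET GLOBAL max_binlog_cache_size = 4294967296"
```

Note that raising the cap only hides the symptom; the large transaction itself still has to be replicated as a single write set.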
■■ The logs that follow, however, show node 108 remaining in normal service.
■■ Subsequent logs show a node restarting and rejoining the cluster at 04:00:
2024-07-15T04:00:07.309528Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') connection established to 99999999-9909 tcp://[9999:9999:9999:9999::6d]:4567
2024-07-15T04:00:07.573846Z 116353 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is deprecated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:00:07.812276Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T04:00:07.812392Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9909 at tcp://[9999:9999:9999:9999::6d]:4567 stable
2024-07-15T04:00:07.813733Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T04:00:07.815269Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,11)
memb {
99999999-9910,0
99999999-9908,0
99999999-9909,0
}
joined {
}
left {
}
partitioned {
}
)
2024-07-15T04:00:07.815311Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T04:00:07.817183Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
2024-07-15T04:00:07.817262Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T04:00:07.821664Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3
2024-07-15T04:00:07.822896Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 0 (pxc-cluster-node-3)
2024-07-15T04:00:07.822956Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 1 (pxc-cluster-node-1)
2024-07-15T04:00:08.310999Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: b97e1f85-425e-11ef-af8a-cac3db9396f3 from 2 (pxc-cluster-node-2)
2024-07-15T04:00:08.311090Z 0 [Note] [MY-000000] [Galera] Quorum results:
version = 6,
component = PRIMARY,
conf_id = 10,
members = 2/3 (primary/total),
act_id = 9339054,
last_appl. = 9338766,
protocols = 2/11/4 (gcs/repl/appl),
vote policy= 0,
group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T04:00:08.311280Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2024-07-15T04:00:08.311566Z 13 [Note] [MY-000000] [Galera] ####### processing CC 9339055, local, ordered
2024-07-15T04:00:08.311730Z 13 [Note] [MY-000000] [Galera] Maybe drain monitors from 9339054 upto current CC event 9339055 upto:9339054
2024-07-15T04:00:08.311758Z 13 [Note] [MY-000000] [Galera] Drain monitors from 9339054 up to 9339054
2024-07-15T04:00:08.311777Z 13 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T04:00:08.311791Z 13 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T04:00:08.311804Z 13 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T04:00:08.311820Z 13 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9339054 -> 9339055
2024-07-15T04:00:08.311910Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T04:00:08.315209Z 13 [Note] [MY-000000] [Galera] ================================================
View:
id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9339055
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 1
members(3):
0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
2: b93005c2-425e-11ef-8b79-73a96136cc10, pxc-cluster-node-2
=================================================
■■ But this was immediately followed by a large number of lock waits. Consistent with earlier tests, every conflict has a winning thread and, at the same time, inevitably a victim thread:
TRANSACTION 114209457, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT , undo log entries 3
MySQL thread id 16, OS thread handle 139734347618048, query id 6745542 Applying batch of row changes (update)
TRANSACTION 114208702, ACTIVE 1338 sec
, undo log entries 6
MySQL thread id 114856, OS thread handle 139691406636800, query id 6676148 2409:807c:5a06:1::104:1705 tm_xj wsrep: replicating and certifying write set
commit
2024-07-15T04:00:08.333081Z 16 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.333098Z 16 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
2024-07-15T04:00:08.333111Z 16 [Note] [MY-000000] [WSREP] Winning thread:
THD: 16, mode: high priority, state: exec, conflict: executing, seqno: 9339058
SQL: (null)
2024-07-15T04:00:08.333122Z 16 [Note] [MY-000000] [WSREP] Victim thread:
THD: 114856, mode: local, state: exec, conflict: certifying, seqno: -1
SQL: commit
TRANSACTION 114209458, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT , undo log entries 3
MySQL thread id 12, OS thread handle 139734349731584, query id 6745554 Applying batch of row changes (update)
TRANSACTION 114208878, ACTIVE 994 sec
, undo log entries 10
MySQL thread id 115190, OS thread handle 139671641110272, query id 6694341 2409:807c:5a06:1::104:1702 tm_xj wsrep: replicating and certifying write set
commit
2024-07-15T04:00:08.334316Z 12 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.334343Z 12 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
2024-07-15T04:00:08.334363Z 12 [Note] [MY-000000] [WSREP] Winning thread:
THD: 12, mode: high priority, state: exec, conflict: executing, seqno: 9339060
SQL: (null)
2024-07-15T04:00:08.334379Z 12 [Note] [MY-000000] [WSREP] Victim thread:
THD: 115190, mode: local, state: exec, conflict: certifying, seqno: -1
SQL: commit
There were 188 consecutive conflicts:
root@node1:0 /u01/mysqld.log# grep "CONFLICT DETECTED" mysqld.log.108|grep "2024-07-15T04:00"|wc -l
188
The last conflict was:
2024-07-15T04:00:08.729167Z 15 [Note] [MY-000000] [WSREP] --------- CONFLICT DETECTED --------
2024-07-15T04:00:08.729202Z 15 [Note] [MY-000000] [WSREP] cluster conflict due to high priority abort for threads:
2024-07-15T04:00:08.729221Z 15 [Note] [MY-000000] [WSREP] Winning thread:
THD: 15, mode: high priority, state: exec, conflict: executing, seqno: 9339273
SQL: (null)
2024-07-15T04:00:08.729236Z 15 [Note] [MY-000000] [WSREP] Victim thread:
THD: 115577, mode: local, state: exec, conflict: must_replay, seqno: 9339387
SQL: commit
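The grep count above can be extended to break the 188 victims down by conflict state (certifying vs must_replay). A minimal sketch on a hypothetical three-entry sample; the file path and sample contents are illustrative, and in practice the pipeline would be pointed at mysqld.log.108:

```shell
# Hypothetical sample in the conflict-log format shown above:
cat > /tmp/sample.log <<'EOF'
2024-07-15T04:00:08.333122Z 16 [Note] [MY-000000] [WSREP] Victim thread:
  THD: 114856, mode: local, state: exec, conflict: certifying, seqno: -1
2024-07-15T04:00:08.334379Z 12 [Note] [MY-000000] [WSREP] Victim thread:
  THD: 115190, mode: local, state: exec, conflict: certifying, seqno: -1
2024-07-15T04:00:08.729236Z 15 [Note] [MY-000000] [WSREP] Victim thread:
  THD: 115577, mode: local, state: exec, conflict: must_replay, seqno: 9339387
EOF

# Tally victims by conflict state:
grep -A1 "Victim thread" /tmp/sample.log \
  | grep -o "conflict: [a-z_]*" \
  | sort | uniq -c
```

Victims in the certifying state were aborted outright, while must_replay victims (like the last conflict below) were replayed after the winning write set applied.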
■■ The following log shows that 110 seconds after these conflicts, node 108 attempted to connect to node 109:
2024-07-15T04:01:50.417379Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') turning message relay requesting on, nonlive peers: tcp://[9999:9999:9999:9999::6d]:4567
2024-07-15T04:01:51.894449Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') reconnecting to b93005c2-8b79 (tcp://[9999:9999:9999:9999::6d]:4567), attempt 0
2024-07-15T04:01:52.574837Z 116695 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is deprecated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:01:53.277070Z 116696 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is deprecated and will be removed in a future release. Please use caching_sha2_password instead'
2024-07-15T04:01:54.435874Z 116697 [Warning] [MY-013360] [Server] Plugin mysql_native_password reported: ''mysql_native_password' is deprecated and will be removed in a future release. Please use caching_sha2_password instead'
The following log shows that the connection still had not succeeded after about 3 seconds; a timeout presumably fired, node 109 was judged inactive, and the node was evicted:
2024-07-15T04:01:54.847941Z 0 [Note] [MY-000000] [Galera] declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2024-07-15T04:01:54.848077Z 0 [Note] [MY-000000] [Galera] evs::proto(99999999-9908, OPERATIONAL, view_id(REG,99999999-9910,11)) suspecting node: 99999999-9909
2024-07-15T04:01:54.848102Z 0 [Note] [MY-000000] [Galera] evs::proto(99999999-9908, OPERATIONAL, view_id(REG,99999999-9910,11)) suspected node without join message, declaring inactive
2024-07-15T04:01:55.348302Z 0 [Note] [MY-000000] [Galera] declaring node with index 2 inactive (evs.inactive_timeout)
2024-07-15T04:01:55.851247Z 0 [Note] [MY-000000] [Galera] declaring 99999999-9910 at tcp://[9999:9999:9999:9999::6e]:4567 stable
2024-07-15T04:01:55.852703Z 0 [Note] [MY-000000] [Galera] Node 99999999-9910 state primary
2024-07-15T04:01:55.858696Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,99999999-9910,12)
memb {
99999999-9910,0
99999999-9908,0
}
joined {
}
left {
}
partitioned {
99999999-9909,0
}
)
2024-07-15T04:01:55.858739Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2024-07-15T04:01:55.860637Z 0 [Note] [MY-000000] [Galera] forgetting 99999999-9909 (tcp://[9999:9999:9999:9999::6d]:4567)
2024-07-15T04:01:55.860717Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2024-07-15T04:01:55.860857Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2024-07-15T04:01:55.860748Z 0 [Note] [MY-000000] [Galera] (99999999-9908, 'tcp://[9999:9999:9999:9999::6c]:4567') turning message relay requesting off
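The PT5S in the first line above is Galera's evs.suspect_timeout, and the follow-up eviction is governed by evs.inactive_timeout. As a hedged sketch, these provider options could be inspected like this (the option names are standard Galera settings; the PT10S/PT30S values in the comment are illustrative, not tuning advice):

```shell
# List the evs timeouts currently in effect:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G" \
  | tr ';' '\n' | grep -E 'evs\.(suspect|inactive)_timeout'

# To tolerate longer network stalls they can be raised in my.cnf, e.g.:
#   wsrep_provider_options="evs.suspect_timeout=PT10S;evs.inactive_timeout=PT30S"
# Note: setting this variable replaces the entire option string, so any other
# provider options in use must be kept on the same line.
```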
■■ The nodes then exchanged state messages and obtained the quorum result, confirming two active nodes (108 and 110) and cleaning up node 109:
2024-07-15T04:01:55.862455Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: sent state msg: f9e37a10-425e-11ef-86d1-67533f4bae37
2024-07-15T04:01:55.863749Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: f9e37a10-425e-11ef-86d1-67533f4bae37 from 0 (pxc-cluster-node-3)
2024-07-15T04:01:55.863809Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: got state msg: f9e37a10-425e-11ef-86d1-67533f4bae37 from 1 (pxc-cluster-node-1)
2024-07-15T04:01:55.863837Z 0 [Note] [MY-000000] [Galera] Quorum results:
version = 6,
component = PRIMARY,
conf_id = 11,
members = 2/2 (primary/total),
act_id = 9339389,
last_appl. = 9338766,
protocols = 2/11/4 (gcs/repl/appl),
vote policy= 0,
group UUID = 2a37cad4-f834-11ee-916e-da042bca7bc3
2024-07-15T04:01:55.864077Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [141, 141]
2024-07-15T04:01:55.864321Z 14 [Note] [MY-000000] [Galera] ####### processing CC 9339390, local, ordered
2024-07-15T04:01:55.864384Z 14 [Note] [MY-000000] [Galera] Maybe drain monitors from 9339389 upto current CC event 9339390 upto:9339389
2024-07-15T04:01:55.864411Z 14 [Note] [MY-000000] [Galera] Drain monitors from 9339389 up to 9339389
2024-07-15T04:01:55.864436Z 14 [Note] [MY-000000] [Galera] ####### My UUID: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f
2024-07-15T04:01:55.864456Z 14 [Note] [MY-000000] [Galera] Skipping cert index reset
2024-07-15T04:01:55.864475Z 14 [Note] [MY-000000] [Galera] REPL Protocols: 11 (6)
2024-07-15T04:01:55.864521Z 14 [Note] [MY-000000] [Galera] ####### Adjusting cert position: 9339389 -> 9339390
2024-07-15T04:01:55.864632Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2024-07-15T04:01:55.867920Z 14 [Note] [MY-000000] [Galera] ================================================
View:
id: 2a37cad4-f834-11ee-916e-da042bca7bc3:9339390
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 1
members(2):
0: 2a36a321-f834-11ee-bc44-bb66609ee19b, pxc-cluster-node-3
1: 6663bf31-40da-11ef-b7ff-dfd7e45ce66f, pxc-cluster-node-1
=================================================
2024-07-15T04:01:55.868021Z 14 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2024-07-15T04:01:55.874762Z 14 [Note] [MY-000000] [Galera] Recording CC from group: 9339390
2024-07-15T04:01:55.874837Z 14 [Note] [MY-000000] [Galera] Lowest cert index boundary for CC from group: 9338767
2024-07-15T04:01:55.874866Z 14 [Note] [MY-000000] [Galera] Min available from gcache for CC from group: 9337022
2024-07-15T04:01:58.396276Z 0 [Note] [MY-000000] [Galera] cleaning up 99999999-9909 (tcp://[9999:9999:9999:9999::6d]:4567)
■■ Later logs show that node 109 tried to rejoin but failed, and remained evicted from the cluster.
3. Node log analysis: node 110
■■ The earlier log entries are the same as node 108's.
■■ The 04:03 logs show that node 109 tried to rejoin but failed, and remained evicted from the cluster.
4. Summary
According to feedback from the field, after node 109 was evicted at 11:18 local time (03:18 UTC in the logs above), the system became sluggish and business was blocked: database tasks backed up, SQL ran slowly, tables were heavily locked, the connection count sat at around 1,100, and there were large numbers of wsrep replication threads. After those sessions were killed they moved to the Killed state, but the suspicion is that their resources were never released. Even after stopping two nodes and keeping only one, transactions still could not commit and table locking kept accumulating; at that point only a database restart could resolve the problem.
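The "killed but not released" symptom can be confirmed from the server side. A minimal sketch using standard MySQL/PXC names (the specific queries are illustrative, not the commands actually run during the incident):

```shell
# Sessions that were killed but are still rolling back show Command='Killed':
mysql -e "SELECT id, user, command, time, state
          FROM information_schema.processlist
          WHERE command = 'Killed' ORDER BY time DESC"

# Overall connection pressure and the wsrep apply backlog:
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue'"
```

A killed session holding InnoDB locks keeps them until its rollback completes, which matches the observation that releasing "Killed" sessions did not clear the lock waits.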
To summarise the analysis above, the following symptoms and observations are laid out objectively in chronological order, to help locate the root cause of the problem.
1. 11:16
Node 109 hit a large transaction for the second time; the global cache pages reached 570 MB and an error was raised, causing node 109 to stop serving at 11:18 and be evicted from the cluster.
2. 11:23
Front-line staff reported that logging in returned a service error and bounced them back to the login page, so they could not get into the system. On receiving the report, the project manager began troubleshooting. Logging into the ELK system, the monitoring showed abnormal latency in the home-customer async-export microservice; drilling into the details revealed a database backlog and slowing SQL execution.
3. 11:47
Staff on another system called to report that calls to this system's interfaces were failing, leaving related work orders unprocessable.
4. Around 12:00
After logins returned to normal, the other side reported the interfaces were working, but a small number still failed and had to be submitted several times before succeeding.
5. 12:03
Operations staff saw inbound/outbound warehouse anomalies in backend monitoring: "Error updating database. Caus"
Based on the error they located the lock-holding sessions and released them.
6. Around 12:35
The warehouse in/out interface still could not return receipts normally. Investigation found that although the database sessions had moved to the "Killed" state, they had not actually been released. The latest on-site situation was immediately escalated to the company's experts for remote assistance.
7. 13:30
The two replica nodes were stopped, keeping only the primary database node, and the related services and interfaces were rolling-restarted. Afterwards, logins verified as normal and the related interfaces also looked normal.
8. 14:00
Front-line field staff in the support group again reported receipt errors:
Could not commit JDBC transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTra Deadlock found when trying to get lock; try restarting transaction
After an emergency meeting with the company's experts, the preliminary assessment was a PXC cluster anomaly: although the two replica nodes had been taken offline, the primary was still performing strongly consistent synchronisation, so transactions could not commit, resulting in massive table locking.
9. 16:15
Phoned system administrator XXX to request a database restart. The restart completed at 16:25, followed by a rolling restart of the related services and interfaces. Afterwards, logins and the system interfaces verified as normal.
10. 16:44
Front-line staff were notified in the support group that the system had recovered and that receipts could be resubmitted.
Front-line analysis, for reference:
This incident happened while front-line staff were asynchronously exporting network-wide data; the backend simultaneously updates the XXX relationship status cut over from the XX platform. With too large an export volume, an oversized transaction was committed, which took down a replica node, stalled data synchronisation across all nodes of the PXC cluster, and ultimately left transactions unable to commit, producing table locks.
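Given that root cause, one hedged mitigation is to cap the Galera write-set size so a runaway export transaction fails fast on its own node instead of destabilising replication. A minimal sketch; the 1 GB value is illustrative only:

```shell
# Current cap on a single Galera write set:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size'"

# Lower it so oversized transactions are rejected locally (illustrative value):
mysql -e "SET GLOBAL wsrep_max_ws_size = 1073741824"
```

Splitting the export's status-update into smaller batches would address the same risk at the application layer.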