MySQL 8.0 MGR网络抖动一例

2022-04-01 19:29:23 浏览数 (1)

MySQL 8.0 MGR网络抖动怎么办?

今天中午,线上一个MySQL8.0的MGR失联了一阵,之前其实没有遇到过这个场景,觉得挺新鲜,就记录了下当时的状态。

01

背景

首先介绍下我们这套环境,这套环境是由4个MySQL 8.0.20节点组成的MGR集群(建议配置奇数个节点,这套环境比较特殊),配置的是MGR的多主环境。

线上某个业务反馈连接MGR有连接报错,报错内容如下:

exec insert into check_job failed: Error 1290: The MySQL server is running with the --super-read-only option so it cannot execute this statement

报错原因:业务向MGR中写入数据的时候,报错MGR开启了--super-read-only参数,写不进去。

正常运行的MGR是不会将节点设置为--super-read-only的,于是使用SQL查看了下MGR集群的状态,发现果然掉了一个节点。

正常节点

代码语言:javascript复制
11:55:29> select * from performance_schema.replication_group_members;
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
| group_replication_applier | 18142f04-95fe-11ec-aff9-246e96be2d00 | 10.xx.xx.184 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 32a10c3f-ce69-11eb-bd13-fa163e311dfa | 10.xxx.1.236 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 63a8b9e3-95f9-11ec-bfc3-fa163e06c313 | 10.xxx.1.129 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
3 rows in set (0.00 sec)

可以看到,正常节点无法和异常的节点通信,当前组里面只有3个节点了。

异常节点:

代码语言:javascript复制
select * from performance_schema.replication_group_members;
 --------------------------- -------------------------------------- ------------- ------------- -------------- ------------- ---------------- 
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
 --------------------------- -------------------------------------- ------------- ------------- -------------- ------------- ---------------- 
| group_replication_applier | ca12c7a7-95fd-11ec-9979-6c92bf982df2 | 10.xx.xx.21 |        5562 | ERROR        |             | 8.0.20         |
 --------------------------- -------------------------------------- ------------- ------------- -------------- ------------- ---------------- 
1 row in set (0.00 sec)

可以看到,异常的节点无法和其他3个正常节点通信,而且自己的状态是ERROR状态。

02

排查过程

首先查看坏掉节点的MySQL日志,报错如下,为了方便理解,我对日志做了一些注释:

代码语言:javascript复制
2022-03-30T08:35:12.689967 08:00 26 [Warning] [MY-010957] [Server] The replication timestamps have returned to normal values.
----------这部分日志说明它连接不上其他节点了----------------
2022-03-30T11:23:59.451419 08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.236:5562 has become unreachable.'
2022-03-30T11:23:59.466959 08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.17.37:5562 has become unreachable.'
2022-03-30T11:23:59.467008 08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.129:5562 has become unreachable.'
2022-03-30T11:23:59.467020 08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xx.xx.184:5562 has become unreachable.'

---------无法连接其他节点,停止所有更新,并提示可以使用参数group_replication_force_members重开一个组--------------
2022-03-30T11:23:59.467218 08:00 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This
server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to forc
e a new group membership.'

----------这部分日志说明它网络恢复了----------------
2022-03-30T11:24:02.141493 08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.236:5562 is reachable again.'
2022-03-30T11:24:03.219209 08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.17.37:5562 is reachable again.'
2022-03-30T11:24:03.245901 08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.129:5562 is reachable again.'
2022-03-30T11:24:03.245941 08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xx.xx.184:5562 is reachable again.'
2022-03-30T11:24:03.245956 08:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the gro
up. Regular operation is restored and transactions are unblocked.'

--------由于网络原因,它被从MGR复制组中踢出了,状态变更成ERROR-------
2022-03-30T11:24:06.907099 08:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing membe
r status to ERROR.'

--------一些事务将会回滚--------------
2022-03-30T11:24:06.922962 08:00 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and
will now rollback.'

--------MySQL被自动设置成read only模式-----------
2022-03-30T11:24:06.923035 08:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was de
tected.'

--------等待冲突检测,执行before_commit函数失败---------
2022-03-30T11:24:06.923078 08:00 312718417 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on
session 312718417'
2022-03-30T11:24:06.923081 08:00 397259854 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on
session 397259854'
2022-03-30T11:24:06.923165 08:00 312718417 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2022-03-30T11:24:06.923163 08:00 397259854 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed

-------由于自身状态ERROR无法加入组,提示修复错误或者重启MGR------
2022-03-30T11:24:08.587817 08:00 397259854 [ERROR] [MY-011601] [Repl] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is on ERROR
state. Check for errors and restart the plugin'

得了,那就按照提示,重启下MGR吧:

代码语言:javascript复制
12:00:38 >stop group_replication;
Query OK, 0 rows affected (1.01 sec)

12:00:46 >start group_replication;                                  
Query OK, 0 rows affected (3.36 sec)


12:00:52 >select * from performance_schema.replication_group_members;
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
| group_replication_applier | 18142f04-95fe-11ec-aff9-246e96be2d00 | 10.xx.xx.184 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 32a10c3f-ce69-11eb-bd13-fa163e311dfa | 10.xxx.1.236 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 63a8b9e3-95f9-11ec-bfc3-fa163e06c313 | 10.xxx.1.129 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | ca12c7a7-95fd-11ec-9979-6c92bf982df2 | 10.xx.xx.21  |        5562 | RECOVERING   | PRIMARY     | 8.0.20         |
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
4 rows in set (0.00 sec)

select * from performance_schema.replication_group_members;
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
| group_replication_applier | 18142f04-95fe-11ec-aff9-246e96be2d00 | 10.xx.xx.184 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 32a10c3f-ce69-11eb-bd13-fa163e311dfa | 10.xxx.1.236 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 63a8b9e3-95f9-11ec-bfc3-fa163e06c313 | 10.xxx.1.129 |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | ca12c7a7-95fd-11ec-9979-6c92bf982df2 | 10.xx.xx.21  |        5562 | ONLINE       | PRIMARY     | 8.0.20         |
 --------------------------- -------------------------------------- -------------- ------------- -------------- ------------- ---------------- 
4 rows in set (0.00 sec)

重启之后,可以看到,ERROR的状态变成了Recovering,说明节点正在执行恢复操作。不一会儿,就都变成Online状态了。

03

总结

MGR本身的自愈功能还是比较强的,多数派的协议保证了事务的强一致。

从运维经验上讲,一般情况下,MGR中比较棘手的问题就是认证冲突,报错内容形如:

ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.

目前遇到认证冲突这种问题,可以通过重试的办法来解决。

如果是MGR深度用户,提几个建议:

1、使用MySQL8.0 社区版本的MGR,最好8.0.22以后

2、最好使用MGR单主模式,冲突会少些。如果想要体验多主模式,又遇到了不可解决的问题,可以退而求其次,多主模式下,集中在一个节点进行写入,利用它的故障自愈。

3、建议使用GreatSQL 替代社区版本MGR,修复了官方MGR的很多Bug。项目链接:https://gitee.com/GreatSQL/GreatSQL-Doc

今天内容就到这里吧。

0 人点赞