How to Resolve a Redis Replication Storm

2023-11-24 14:07:00

As a DBA, I have run into Redis replication failures many times. This post covers what causes a Redis replication storm and how to handle one.

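Redis replication storm: when the write rate on the master is too high, the network between master and replica flaps, or the replica falls too far behind, the replication output buffer overflows, or the replication backlog buffer (a ring buffer in which new data overwrites the oldest data) wraps past the position the replica still needs. The replica then requests a full resynchronization again and again.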
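The ring-buffer behaviour described above can be sketched as a simple offset check. This is a simplified model, not the Redis source: the function name and the sample numbers are illustrative assumptions, and real Redis additionally requires the replication ID to match before it grants a partial resync.

```python
def can_partial_resync(master_offset: int, backlog_size: int, replica_offset: int) -> bool:
    """Return True if the bytes the replica is missing are still in the backlog.

    Simplified model: the backlog retains only the most recent
    `backlog_size` bytes of the replication stream.
    """
    # Oldest stream offset that has not yet been overwritten in the ring buffer.
    oldest_retained = max(0, master_offset - backlog_size)
    return oldest_retained <= replica_offset <= master_offset


if __name__ == "__main__":
    one_mib = 1024 * 1024
    # Replica is 500,000 bytes behind a 1 MiB backlog: partial resync is possible.
    print(can_partial_resync(10_000_000, one_mib, 9_500_000))
    # Replica is 8,000,000 bytes behind: its position was overwritten, full resync.
    print(can_partial_resync(10_000_000, one_mib, 2_000_000))
```

Once the second case occurs, every reconnect attempt falls back to a full resync, which is the loop this article calls a replication storm.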

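Case description: a brief network interruption in production caused the Redis replica to lose its connection and the master to block for a short time.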

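Lessons from this case:

1. Invest in system stability and avoid single points of failure.

2. A DBA should keep Redis data loss and service interruption to a minimum.

3. Keep the memory footprint of a Redis instance moderate (keys should have an expiration time).

4. Avoid big keys and high-complexity commands (disable keys *).

5. Avoid having large amounts of data expire at the same time.

6. Watch memory fragmentation, avoid cache pollution, and do not run old Redis versions.

7. Avoid high-frequency short-lived connections.

8. Both RDB and AOF persistence carry risk.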

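Root causes:

1. The network between the Redis master and replica flapped.

2. The master was taking too much write traffic, and the Redis instance held too much memory.

3. The replica was slow to process commands.

4. The replication output buffer and the replication backlog buffer were configured too small. (Once the master fills the replication backlog, it overwrites the oldest data in it. If the replica has not yet synchronized that data, it can no longer resume incrementally and keeps requesting a full resync.)

5. Frequent bgsave runs caused the master to block and respond slowly.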
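Cause 4 suggests sizing the backlog from the write rate and the longest disconnect it should survive. The sketch below applies that rule of thumb; the function names, the 2x safety factor, and the sample figures are assumptions for illustration, not values measured in this incident.

```python
def write_rate_bytes_per_sec(offset_start: int, offset_end: int, seconds: float) -> float:
    """Estimate the replication write rate from two master_repl_offset samples."""
    return (offset_end - offset_start) / seconds


def recommended_backlog_bytes(write_bytes_per_sec: float,
                              max_disconnect_sec: float,
                              safety_factor: float = 2.0) -> int:
    """Backlog size needed to cover a disconnect of the given length."""
    return int(write_bytes_per_sec * max_disconnect_sec * safety_factor)


if __name__ == "__main__":
    # Hypothetical samples taken 10 seconds apart with `info replication`.
    rate = write_rate_bytes_per_sec(1_000_000_000, 1_020_971_520, 10)
    # Hypothetical target: survive a 60-second network interruption.
    print(recommended_backlog_bytes(rate, 60))
```

The result is a value in bytes that could be passed to `config set repl-backlog-size`; the actual write rate should be sampled during peak traffic.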

Incident replay:

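Redis basic information:

redis_version:4.0.14
used_memory_human:9.35G (large; keeping a Redis instance under 6 GB is advisable, because it reduces the cost of generating, transferring, and reloading the RDB file during a failure)
Redis RDB dump file: 4.7G (every full resync copies the data from memory and writes it to disk)
maxmemory_human:20.00G
clients:300
qps:30k
Number of keys: 14 million
Architecture: 2 nodes, Redis Sentinel mode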


Error and recovery log:

99694:M 22 Nov 17:13:42.119 # Client id=11062111 addr=110.110.110.110:30640 fd=103 name= age=45 idle=45 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16342 oll=17797 omem=268450173 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
99694:M 22 Nov 17:13:42.171 # Connection with slave 110.110.110.110:6381 lost.  
51199:C 22 Nov 17:14:02.617 * DB saved on disk
51199:C 22 Nov 17:14:02.937 * RDB: 614 MB of memory used by copy-on-write
99694:M 22 Nov 17:14:03.419 * Background saving terminated with success
99694:M 22 Nov 17:14:03.684 * Slave 110.110.110.110:6381 asks for synchronization
99694:M 22 Nov 17:14:03.685 * Full resync requested by slave 110.110.110.110:6381
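99694:M 22 Nov 17:14:03.685 * Starting BGSAVE for SYNC with target: disk   (full resync: the master sends its entire dataset to the replica, which costs performance and time; if a partial resync had succeeded, no full resync would have been started)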
99694:M 22 Nov 17:14:04.050 * Background saving started by pid 51439
99694:M 22 Nov 17:14:49.287 # Client id=11062306 addr=110.110.110.110:20947 fd=102 name= age=46 idle=46 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=15286 oll=17791 omem=268441515 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.   (output buffer overflow)
99694:M 22 Nov 17:14:49.353 # Connection with slave 110.110.110.110:6381 lost.  (replication recovery failed)
51439:C 22 Nov 17:15:12.284 * DB saved on disk  
51439:C 22 Nov 17:15:12.518 * RDB: 662 MB of memory used by copy-on-write
99694:M 22 Nov 17:15:12.954 * Background saving terminated with success
99694:M 22 Nov 17:15:13.871 * Slave 110.110.110.110:6381 asks for synchronization
99694:M 22 Nov 17:15:13.871 * Full resync requested by slave 110.110.110.110:6381 
99694:M 22 Nov 17:15:13.871 * Starting BGSAVE for SYNC with target: disk  (full resync started again)
99694:M 22 Nov 17:15:14.247 * Background saving started by pid 51708
99694:M 22 Nov 17:15:50.081 # Client id=11062489 addr=110.110.110.110:16756 fd=76 name= age=37 idle=37 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=12999 oll=20735 omem=268441366 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.  (output buffer overflow)
99694:M 22 Nov 17:15:50.150 # Connection with slave 110.110.110.110:6381 lost.
51708:C 22 Nov 17:16:19.700 * DB saved on disk
51708:C 22 Nov 17:16:20.007 * RDB: 678 MB of memory used by copy-on-write
99694:M 22 Nov 17:16:20.479 * Slave 110.110.110.110:6381 asks for synchronization
99694:M 22 Nov 17:16:20.480 * Full resync requested by slave 110.110.110.110:6381
99694:M 22 Nov 17:16:20.480 * Can't attach the slave to the current BGSAVE. Waiting for next BGSAVE for SYNC
99694:M 22 Nov 17:16:20.521 * Background saving terminated with success
99694:M 22 Nov 17:16:20.521 * Starting BGSAVE for SYNC with target: disk
99694:M 22 Nov 17:16:20.884 * Background saving started by pid 52130
99694:M 22 Nov 17:17:07.143 # Client id=11062836 addr=110.110.110.110:16747 fd=67 name= age=47 idle=47 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=8908 oll=17770 omem=268444499 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
99694:M 22 Nov 17:17:07.187 # Connection with slave 110.110.110.110:6381 lost.
52130:C 22 Nov 17:17:26.378 * DB saved on disk
52130:C 22 Nov 17:17:26.616 * RDB: 594 MB of memory used by copy-on-write
99694:M 22 Nov 17:17:27.011 * Background saving terminated with success
99694:M 22 Nov 17:17:27.997 * Slave 110.110.110.110:6381 asks for synchronization
99694:M 22 Nov 17:17:27.997 * Full resync requested by slave 110.110.110.110:6381
99694:M 22 Nov 17:17:27.997 * Starting BGSAVE for SYNC with target: disk
99694:M 22 Nov 17:17:28.360 * Background saving started by pid 52270
52270:C 22 Nov 17:18:32.609 * DB saved on disk
52270:C 22 Nov 17:18:32.844 * RDB: 1255 MB of memory used by copy-on-write
99694:M 22 Nov 17:18:33.354 * Background saving terminated with success
99694:M 22 Nov 17:19:55.394 * Synchronization with slave 110.110.110.110:6381 succeeded              (incident resolved)
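A log like the one above can be summarized with a short script. The sketch below is an illustration only: the function and constant names are assumptions, and the sample lines are copied from the excerpt. It counts output-buffer disconnects and full resync requests and reports the largest `omem` value. The comparison against 256 MiB reflects the default replica hard limit in redis.conf (`slave 256mb 64mb 60`); the article does not state what the limit was set to before the incident, so treat that comparison as a plausible reading rather than a confirmed one.

```python
import re

# Default hard limit for replica output buffers in redis.conf: "slave 256mb 64mb 60".
DEFAULT_REPLICA_HARD_LIMIT = 256 * 1024 * 1024

# Matches the omem field on lines where a client was closed for exceeding its buffer limit.
OMEM_RE = re.compile(r"\bomem=(\d+)\b.*output buffer limits")


def summarize(log_lines):
    """Count buffer-limit disconnects and full resync requests in master log lines."""
    omems = []
    full_resyncs = 0
    for line in log_lines:
        match = OMEM_RE.search(line)
        if match:
            omems.append(int(match.group(1)))
        elif "Full resync requested" in line:
            full_resyncs += 1
    return {
        "buffer_closures": len(omems),
        "full_resyncs": full_resyncs,
        "max_omem": max(omems, default=0),
    }


SAMPLE = [
    "99694:M 22 Nov 17:13:42.119 # Client id=11062111 addr=110.110.110.110:30640 fd=103 name= age=45 idle=45 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16342 oll=17797 omem=268450173 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.",
    "99694:M 22 Nov 17:13:42.171 # Connection with slave 110.110.110.110:6381 lost.",
    "99694:M 22 Nov 17:14:03.685 * Full resync requested by slave 110.110.110.110:6381",
]

if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print(summary)
    print("above default hard limit:", summary["max_omem"] > DEFAULT_REPLICA_HARD_LIMIT)
```

Run over the full excerpt, the same function would count four buffer-limit disconnects and four full resync requests before the final successful synchronization.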

Background:

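RDB: one of Redis's persistence mechanisms, a memory snapshot. It records the state of the in-memory data at a single point in time and stores it on disk in a binary format, which is relatively expensive. (For replication, the master runs bgsave, generates a dump file, and transfers it to the replica. The replica flushes all of its data, loads the dump, and then applies the incremental data, including the writes that arrived during transfer and loading.)

AOF: Redis's other persistence mechanism, often disabled in production. Write operations are appended to a log on disk, which is also relatively expensive. Redis executes the command and writes the data to memory first, and only then records it in the AOF log.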

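copy-on-write: the forked child process shares memory pages with the parent, and a page is copied only when it is modified. This lets the memory snapshot proceed without blocking normal write requests.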

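DBA analysis and handling:

1. After receiving the alert that the replica was disconnected, the DBA checked the Redis log (shown above) and confirmed that replication had failed and that full resyncs were being started repeatedly.

2. Running info replication to check the replication status showed master_link_status:down on the replica (a healthy link reports up), and the replica's state on the master was not online.

3. The DBA changed client-output-buffer-limit, after which replication recovered.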

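# Key recovery step. Temporary only: "slave 0 0 0" removes the replica limit entirely, which is too permissive, so set a reasonable value once the incident is over.
config set client-output-buffer-limit "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"
# Tune client-output-buffer-limit according to the master's data volume, its write load, and its available memory.
# Temporary adjustment; set a reasonable value after recovery.
config set repl-backlog-size 8388608
# Temporary adjustment; 0 means no memory limit. This keeps bgsave from pushing memory past the configured 20 GB (Redis memory peaked at 19.6 GB during recovery). Set a reasonable value after recovery.
config set maxmemory 0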

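To stress this again: the parameters above were changed on the Redis master temporarily to recover from the incident. After recovery, set them to reasonable values based on how Redis behaves under the business workload.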
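To make the fields of client-output-buffer-limit easier to read, here is a small sketch that builds the value from per-class limits. Each class takes a hard limit, a soft limit, and a soft-limit duration in seconds, and 0 disables a limit. The helper names are assumptions, and the "after recovery" numbers are hypothetical placeholders, not a recommendation from the article.

```python
def class_limit(client_class: str, hard: int, soft: int, soft_seconds: int) -> str:
    """Render one '<class> <hard> <soft> <soft-seconds>' group (sizes in bytes)."""
    return f"{client_class} {hard} {soft} {soft_seconds}"


def cob_limit_value(normal, replica, pubsub) -> str:
    """Join the three client classes into one client-output-buffer-limit value."""
    groups = (("normal", normal), ("slave", replica), ("pubsub", pubsub))
    return " ".join(class_limit(name, *limits) for name, limits in groups)


MIB = 1024 * 1024

# The emergency value used during the incident: no limit for replicas.
EMERGENCY = cob_limit_value((0, 0, 0), (0, 0, 0), (32 * MIB, 8 * MIB, 60))

# Hypothetical bounded value for after recovery; the numbers are placeholders.
AFTER_RECOVERY = cob_limit_value((0, 0, 0), (512 * MIB, 128 * MIB, 60), (32 * MIB, 8 * MIB, 60))

if __name__ == "__main__":
    print(f'config set client-output-buffer-limit "{EMERGENCY}"')
    print(f'config set client-output-buffer-limit "{AFTER_RECOVERY}"')
```

Keeping a finite replica limit after recovery matters because an unlimited buffer lets a stalled replica consume master memory without bound.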

4. Check the replication status

110.110.110.111:6381> info replication  (run on the master)
# Replication
role:master
connected_slaves:1
slave0:ip=110.110.110.110,port=6381,state=online,offset=26647484467128,lag=1
master_replid:f721f0eef5cb609c586ac13d9f0b5cf69b56ba78
master_replid2:365a1b21960a7d0ea6bacd6f7346666c460fd4b8
master_repl_offset:26647489840742
second_repl_offset:535549748882

110.110.110.110:6381> info replication  (run on the replica)
# Replication
role:slave
master_host:110.110.110.111
master_port:6381
master_link_status:up   (replication recovered)
master_last_io_seconds_ago:0
master_sync_in_progress:0
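The master's output above also shows how far behind the replica is: subtract the replica's offset from master_repl_offset. The sketch below parses the text form of that output; the function names are assumptions, and the sample is copied from the master's output in this article.

```python
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from INFO output, skipping blanks and '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def replica_lag_bytes(master_info: dict) -> dict:
    """Map each replica's 'ip:port' to how many bytes it is behind the master."""
    master_offset = int(master_info["master_repl_offset"])
    lags = {}
    for key, value in master_info.items():
        # Replica entries are named slave0, slave1, ... with comma-separated attributes.
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(item.split("=", 1) for item in value.split(","))
            lags[f"{attrs['ip']}:{attrs['port']}"] = master_offset - int(attrs["offset"])
    return lags


MASTER_INFO = """# Replication
role:master
connected_slaves:1
slave0:ip=110.110.110.110,port=6381,state=online,offset=26647484467128,lag=1
master_repl_offset:26647489840742
"""

if __name__ == "__main__":
    print(replica_lag_bytes(parse_info(MASTER_INFO)))
```

For the output in this article the replica is 5,373,614 bytes behind, which is useful to compare against repl-backlog-size: a byte lag that approaches the backlog size means the next interruption is likely to force a full resync.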
