MongoDB 4.4: bugs related to read/write splitting and replica sets

2022-09-22 11:45:02

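【Background】

A MongoDB 4.4.4 cluster had been running stably for nearly half a year. To patch an operating-system security vulnerability we had to upgrade the OS, which meant shutting down each MongoDB instance, upgrading the system, and then rebooting the server. If the instance being shut down is the primary, it only needs a primary/secondary switchover first (using rs.stepDown() or changing member priorities). This should have been simple (we had done it many times on versions before 4.4), but we hit two bugs. The first affects read/write splitting on a sharded cluster. The second is that every instance crashed during a primary/secondary switchover (this was unexpected and is not triggered every time). To fix both bugs, run at least MongoDB 4.4.7. If you do not use read/write splitting, 4.4.6 is recommended (4.4.5 is not recommended).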
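The maintenance order described above (restart the secondaries first, then step the primary down before restarting it) can be sketched as a small planning helper. This is only an illustration, not the procedure used during the incident: the function name `plan_rolling_restart` and the member dictionaries, modelled on the `name` and `stateStr` fields of `rs.status()`, are assumptions.

```python
def plan_rolling_restart(members):
    """Return maintenance steps: secondaries first, primary last.

    `members` is a list of dicts with `name` and `stateStr` keys,
    shaped like the `members` array of rs.status() (an assumption
    made for this sketch).
    """
    secondaries = [m["name"] for m in members if m["stateStr"] == "SECONDARY"]
    primaries = [m["name"] for m in members if m["stateStr"] == "PRIMARY"]

    steps = [("restart", name) for name in secondaries]
    for name in primaries:
        # Hand over the primary role before taking the node down.
        steps.append(("stepDown", name))
        steps.append(("restart", name))
    return steps


# Hypothetical three-member replica set.
members = [
    {"name": "mongodbtest1.com:27018", "stateStr": "SECONDARY"},
    {"name": "mongodbtest2.com:27018", "stateStr": "SECONDARY"},
    {"name": "mongodbtest3.com:27018", "stateStr": "PRIMARY"},
]
for action, name in plan_rolling_restart(members):
    print(action, name)
```

The helper only produces an ordering. It does not talk to the cluster, and in practice each restarted member should be back in SECONDARY state before the next step runs.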

Read/write splitting bug (verified as fixed after upgrading to 4.4.8)

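【Trigger conditions】

  • MongoDB 4.4.0-4.4.6 sharded cluster
  • The connection URI uses

"maxStalenessSeconds=xxx" and "readPreference=secondary/secondaryPreferred/nearest"

  • The application's query reaches shard X (whether it is broadcast or targets a single shard)
  • A secondary in shard X goes down

When a read/write splitting setup meets all of the conditions above, queries fail with: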

MongoError: Encountered non-retryable error during query :: caused by :: Incompatible wire version
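The URI part of these conditions can be checked mechanically. The sketch below parses the connection string with the standard library only and reports whether it combines `maxStalenessSeconds` with one of the three read preference modes listed above. The function name `uri_matches_trigger` is an assumption, and the check says nothing about the server version or the state of the shards.

```python
from urllib.parse import urlsplit, parse_qs

# Read preference modes named in the trigger conditions (lower-cased).
SPLIT_READ_MODES = {"secondary", "secondarypreferred", "nearest"}


def uri_matches_trigger(uri):
    """True if the URI sets maxStalenessSeconds together with a
    readPreference of secondary, secondaryPreferred or nearest."""
    query = urlsplit(uri).query
    # Option names are compared case-insensitively in this sketch.
    options = {k.lower(): v[-1] for k, v in parse_qs(query).items()}
    mode = options.get("readpreference", "primary").lower()
    return "maxstalenessseconds" in options and mode in SPLIT_READ_MODES


uri = ("mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,"
       "mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred"
       "&maxStalenessSeconds=120")
print(uri_matches_trigger(uri))
```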

【Fix version】

https://jira.mongodb.org/browse/SERVER-57136

Fix Version/s:5.1.0, 4.4.7, 5.0.0-rc1

【Application connection string】

mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred&maxStalenessSeconds=120

【Application error】

Incompatible wire version' on server mongoprd1.com:31051. The full response is {"ok": 0.0, "errmsg": "Incompatible wire version", "code": 188, "codeName": "IncompatibleServerVersion", "operationTime": {"timestamp": {"t": 1628055478, "i": 7}}, "signature": {"hash": {"timestamp": {"t": 1628055478, "i": 7}}, "timestamp": {"t": 1628055478, "i": 7}}, "signature": {"hash": {"

【mongos log】(no similar error was found on the shards)

{"t":{"$date":"2021-08-04T14:14:31.992+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn535","msg":"Unable to establish remote cursors","attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":2}}

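【Reproducing the error with a Python program】

【Python program】

The mongos URL must set both readPreference=secondaryPreferred and maxStalenessSeconds, otherwise the error does not appear. On a non-sharded deployment the error does not occur at all.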

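from pymongo import MongoClient
import pprint

# Connect through mongos with read/write splitting enabled.
client = MongoClient('mongodb://admin:admin@mongodbtest.com:41051/?readPreference=secondaryPreferred&maxStalenessSeconds=90')
db = client.xiaoxu
coll = db.xiaoxu
i = 0
while i < 100000:
    doc = { 'no': 100 + i }
    # The write goes to the primary of the owning shard.
    pprint.pprint(coll.insert_one(doc))
    # The read is routed to a secondary; this is the call that fails.
    pprint.pprint(coll.find_one(doc))
    i += 1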

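【Checking which shard is the primary shard of db xiaoxu】

Note: the output below shows that the primary shard of the xiaoxu database is shard2. The goal is to simulate the effect of a secondary going down in that shard. At this point an instance going down in shard1 has no effect. For a sharded collection, where queries are broadcast, an instance going down in any shard has an effect.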

mongos> sh.status();                                                                                                                                                          
--- Sharding Status ---                                                                                                                                                       
  sharding version: {                                                                                                                                                         
        "_id" : 1,                                                                                                                                                            
        "minCompatibleVersion" : 5,                                                                                                                                           
        "currentVersion" : 6,                                                                                                                                                 
        "clusterId" : ObjectId("5fc608cefcbcbec36d4f785d")                                                                                                                    
  }                                                                                                                                                                           
  shards:                                                                                                                                                                     
        {  "_id" : "shard1",  "host" : "shard1/mongodbtest1.com:27017,mongodbtest2.com:27017,mongodbtest3.com:27017",  "state" : 1 }                                                                        
        {  "_id" : "shard2",  "host" : "shard2/mongodbtest1.com:27018,mongodbtest2.com:27018,mongodbtest3.com:27018",  "state" : 1 }                                                    
  active mongoses:                                                                                                                                                            
        "4.4.4" : 3                                                                                                                                                           
  autosplit:                                                                                                                                                                  
        Currently enabled: yes                                                                                                                                                
  balancer:                                                                                                                                                                   
        Currently enabled:  yes                                                                                                                                               
        Currently running:  no                                                                                                                                                                                                                                                   
        Migration Results for the last 24 hours:                                                                                                                              
                No recent migrations                                                                                                                                          
  databases:              
        ......                                                                                                                                                    
        {  "_id" : "xiaoxu",  "primary" : "shard2",  "partitioned" : false,  "version" : {  "uuid" : UUID("6593e368-e82a-4a8c-a184-ccf57b1773e9"),  "lastMod" : 1 } }
mongos>

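【Simulating a secondary going down in shard2 (either a crash or a clean shutdown works)】

Note: if a secondary in shard1 goes down instead, queries are not affected.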

mongod -f /data/mongodb/mongodb44/mongod27018/conf/shard2.conf --shutdown

【Verifying from any of the remaining nodes】

mongo 127.0.0.1:27018/admin -uadmin --eval "rs.status()" |egrep "name|stateStr"
Enter password:
"name" : "mongodbtest1.com:27018",
"stateStr" : "(not reachable/healthy)",
"name" : "mongodbtest2.com:27018",
"stateStr" : "SECONDARY",
"name" : "mongodbtest3.com:27018",
"stateStr" : "PRIMARY",

【Exception raised by the Python program】

pymongo.errors.OperationFailure: Encountered non-retryable error during query :: caused by :: Incompatible wire version, full error: {'ok': 0.0, 'errmsg': 'Encountered non-retryable error during query :: caused by :: Incompatible wire version', 'code': 188, 'codeName': 'IncompatibleServerVersion', 'operationTime': Timestamp(1628219161, 4), '$clusterTime': {'clusterTime': Timestamp(1628219162, 8), 'signature': {'hash': b'-D\xe2~\xe3\x14;\xe6bqa>\x14\xad\xf30<\xf7U\xd0', 'keyId': 6937532038958284801}}}
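An application that wants to recognise this failure can inspect the error document. PyMongo's `OperationFailure` exposes the server response as `details` and the numeric code as `code`; the sketch below applies the check to a plain dictionary so that it runs without a server. The function name `looks_like_server_57136` is an assumption. A match only means the error has the same signature as the one reported here, since a genuine version mismatch would produce the same code.

```python
INCOMPATIBLE_SERVER_VERSION = 188  # error code shown in the exception above


def looks_like_server_57136(details):
    """Check an error document for the signature reported in this article."""
    return (
        details.get("code") == INCOMPATIBLE_SERVER_VERSION
        and details.get("codeName") == "IncompatibleServerVersion"
        and "Incompatible wire version" in details.get("errmsg", "")
    )


# Fields copied from the full error above (cluster time omitted).
details = {
    "ok": 0.0,
    "errmsg": "Encountered non-retryable error during query :: caused by :: "
              "Incompatible wire version",
    "code": 188,
    "codeName": "IncompatibleServerVersion",
}
print(looks_like_server_57136(details))
```

In a real client the function would be called from an `except OperationFailure as exc:` block with `exc.details`, guarding against `details` being `None`.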

【Checking the mongos and shard logs (mongos logs the error, the shards do not)】

{"t":{"$date":"2021-08-06T11:06:02.048+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn564","msg":"Unable to establish remote cursors","attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":0}}

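【Upgrading the cluster to 4.4.8】

After the upgrade, a secondary going down no longer affects application queries. The new version fixes the bug with the change "Skip maxStaleness wire version check when server is down". If you cannot upgrade, you can work around the problem by turning off read/write splitting.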
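For the workaround, turning off read/write splitting amounts to removing the read preference options from the connection string so that reads go to the primary again. A minimal sketch using only the standard library follows. The function name `without_read_splitting` is an assumption, and dropping `readPreferenceTags` as well is an addition of this sketch, made because tag sets only make sense together with a non-primary read preference.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Options removed by this sketch (compared case-insensitively).
READ_SPLIT_OPTIONS = {"readpreference", "maxstalenessseconds", "readpreferencetags"}


def without_read_splitting(uri):
    """Return the URI with the read/write splitting options removed."""
    parts = urlsplit(uri)
    kept = [(key, value) for key, value in parse_qsl(parts.query)
            if key.lower() not in READ_SPLIT_OPTIONS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


uri = ("mongodb://admin:admin@mongodbtest.com:41051/"
       "?readPreference=secondaryPreferred&maxStalenessSeconds=90")
print(without_read_splitting(uri))
```

Note that `urlencode` percent-encodes the values of the remaining options, so a URI with unusual characters in other options should be checked by hand after the rewrite.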

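Bug: all instances crash during a primary/secondary switchover

【Trigger conditions】

After rs.stepDown() was run on the primary, a new primary was elected and started accepting writes, and then every member of the replica set crashed (we were not able to reproduce this). According to the Jira ticket, a change in replica set state can trigger the bug: adding a member, setting the compatibility version while upgrading to 4.4, a primary stepping down, a network partition, and similar events can all produce an Invariant failure.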

【Original primary】

The log entry "replSetStepDown command completed" shows that the step-down had finished before the node crashed.

{"t":{"

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"I", "c":"REPL", "id":2903000, "ctx":"conn208969","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":4,"senderTerm":3}}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}

{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

【New primary】

The log entry "Transition to primary complete; database writes are now permitted" shows that the new primary had been elected and was accepting writes before it crashed.

{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I", "c":"STORAGE", "id":20657, "ctx":"OplogApplier-0","msg":"IndexBuildsCoordinator::onStepUp - this node is stepping up to primary"}

{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I", "c":"REPL", "id":21331, "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted"}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"waitForMajority","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"waitForMajority","msg":"\n\n***aborting after invariant() failure\n\n"}

{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"waitForMajority","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:39:49.072+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}

【Third node】

{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}

{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}

{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
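Because MongoDB 4.4 writes one JSON document per log line, entries like the ones above can be pulled out of each node's log with a few lines of Python. The sketch below yields the timestamp, thread, failed expression, and source location of every fatal "Invariant failure" entry and skips lines that are truncated or not JSON. The function name `invariant_failures` is an assumption, and the sample lines are copied from the logs above with the UTC offset written as `+08:00`.

```python
import json


def invariant_failures(log_lines):
    """Yield (timestamp, thread, expression, file:line) for each
    fatal 'Invariant failure' entry in structured log lines."""
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        if entry.get("s") == "F" and entry.get("msg") == "Invariant failure":
            attr = entry.get("attr", {})
            yield (
                entry["t"]["$date"],
                entry.get("ctx"),
                attr.get("expr"),
                "{}:{}".format(attr.get("file"), attr.get("line")),
            )


sample = [
    '{"t":{"',  # truncated line, ignored
    '{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", '
    '"id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure",'
    '"attr":{"expr":"opCtx != nullptr && _opCtx == nullptr",'
    '"file":"src/mongo/db/client.cpp","line":126}}',
    '{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"I", "c":"REPL", '
    '"id":2903000, "ctx":"conn208969",'
    '"msg":"Restarting heartbeats after learning of a new primary",'
    '"attr":{"myPrimaryId":"none","senderAndPrimaryId":4,"senderTerm":3}}',
]
for failure in invariant_failures(sample):
    print(failure)
```

Running the same function over the log file of each member makes it easy to compare which threads hit the invariant on which node and in what order.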

【Corresponding bug and affected versions (fixed by upgrading to 4.4.5, but 4.4.5 itself is not recommended)】

https://jira.mongodb.org/browse/SERVER-53566

MongoDB version 4.4.5 is not recommended for production use due to a critical issue, WT-7426. The issue is fixed in version 4.4.6.
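The version guidance in this article (at least 4.4.7 with read/write splitting, 4.4.6 without it, and never 4.4.5) can be written down as a small check. This is a sketch of the article's recommendations only, not an official compatibility matrix. The function names are assumptions, and version strings are assumed to be plain `x.y.z` without suffixes such as `-rc1`.

```python
def parse_version(text):
    """Turn 'x.y.z' into a tuple of ints so versions compare correctly."""
    return tuple(int(part) for part in text.split("."))


def assess_44_version(version, uses_read_splitting):
    """Apply this article's recommendations to a 4.4.x version string."""
    v = parse_version(version)
    if v[:2] != (4, 4):
        return "outside the 4.4 series; not covered by this article"
    if v == (4, 4, 5):
        return "not recommended (WT-7426)"
    minimum = (4, 4, 7) if uses_read_splitting else (4, 4, 6)
    if v < minimum:
        return "upgrade to at least " + ".".join(map(str, minimum))
    return "ok"


for version in ("4.4.4", "4.4.5", "4.4.6", "4.4.8"):
    print(version, "->", assess_44_version(version, uses_read_splitting=True))
```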
