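【Background】
A MongoDB 4.4.4 cluster had been running stably for nearly six months. To patch operating system security vulnerabilities, the OS had to be upgraded, which meant shutting down each MongoDB instance, upgrading the system, and rebooting the server. If the instance being shut down is a primary, a primary/secondary switchover is all that is needed first (using rs.stepDown() or adjusting member priority). This should have been simple (it had been done many times on versions before 4.4), but this time it ran into two bugs. The first affects read/write splitting on a sharded cluster. The second caused every instance to crash during a primary/secondary switchover (this was unexpected, and it is not triggered every time). To fix both bugs, use at least MongoDB 4.4.7. If you do not use read/write splitting, 4.4.6 is the suggested version (4.4.5 is not recommended).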
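【Read/write splitting bug -- verified fixed after upgrading to 4.4.8】
【Trigger conditions】
- MongoDB 4.4.0-4.4.6 sharded cluster
- The URI uses "maxStalenessSeconds=xxx" and "readPreference=secondary/secondaryPreferred/nearest"
- The application query reaches shard X (whether it is a broadcast or targets a single shard)
- A secondary node in shard X goes down
When read/write splitting is in use and all of the above conditions hold, the application receives this error:
MongoError: Encountered non-retryable error during query :: caused by :: Incompatible wire version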
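The URI part of these conditions can be checked mechanically. The following is a minimal sketch using only the Python standard library; the helper name uri_uses_stale_read_splitting is hypothetical and not part of any driver, and it assumes that maxStalenessSeconds=-1 means no maximum is set.

```python
from urllib.parse import parse_qs

# Read preference modes listed in the trigger conditions (lower-cased).
SPLIT_MODES = {"secondary", "secondarypreferred", "nearest"}


def uri_uses_stale_read_splitting(uri: str) -> bool:
    """Return True if the URI combines maxStalenessSeconds with a
    secondary-capable readPreference (hypothetical helper)."""
    query = uri.partition("?")[2]
    # Option names are compared case-insensitively, so normalise the keys.
    opts = {key.lower(): values[-1] for key, values in parse_qs(query).items()}
    mode = opts.get("readpreference", "primary").lower()
    staleness = opts.get("maxstalenessseconds")
    # Assumption: -1 means "no maximum", so the option is effectively unset.
    has_staleness = staleness is not None and staleness != "-1"
    return has_staleness and mode in SPLIT_MODES


uri = ("mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,"
       "mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred"
       "&maxStalenessSeconds=120")
print(uri_uses_stale_read_splitting(uri))
```

A check like this only covers the connection string. The cluster version and whether a secondary is actually down still have to be verified separately.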
【Fixed versions】
https://jira.mongodb.org/browse/SERVER-57136
Fix Version/s:5.1.0, 4.4.7, 5.0.0-rc1
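To decide quickly whether a given 4.4 release falls in the affected range, a version comparison can be sketched as follows. The function names are hypothetical, and the sketch assumes plain "x.y.z" version strings (no "-rc" suffixes) and uses the 4.4.0-4.4.6 range stated in the trigger conditions.

```python
def parse_version(version: str) -> tuple:
    """Turn a plain 'x.y.z' string such as '4.4.4' into (4, 4, 4)."""
    return tuple(int(part) for part in version.split("."))


def affected_by_server_57136(version: str) -> bool:
    """True for the 4.4.0-4.4.6 range named in the trigger conditions."""
    return (4, 4, 0) <= parse_version(version) <= (4, 4, 6)


for v in ("4.4.4", "4.4.6", "4.4.7", "4.4.8"):
    print(v, affected_by_server_57136(v))
```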
【Application connection string】
mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred&maxStalenessSeconds=120
【Application error】
Incompatible wire version' on server mongoprd1.com:31051. The full response is {"ok": 0.0, "errmsg": "Incompatible wire version", "code": 188, "codeName": "IncompatibleServerVersion", "operationTime": {"timestamp": {"t": 1628055478, "i": 7}}, "signature": {"hash": {"timestamp": {"t": 1628055478, "i": 7}}, "timestamp": {"t": 1628055478, "i": 7}}, "signature": {"hash": {"
【mongos log】-- no similar error was found in the shard logs
{"t":{"$date":"2021-08-04T14:14:31.992+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn535","msg":"Unable to establish remote cursors","attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":2}}
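Because 4.4 writes its logs as one JSON document per line, entries like the one above can be picked out with a few lines of Python. This is a minimal sketch; the function name is hypothetical, and it assumes the error is always nested under attr.error as in this log entry.

```python
import json


def is_wire_version_error(log_line: str) -> bool:
    """Return True if a structured log line carries error code 188
    (IncompatibleServerVersion) under attr.error."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False  # truncated or non-JSON line
    if not isinstance(entry, dict):
        return False
    attr = entry.get("attr")
    error = attr.get("error") if isinstance(attr, dict) else None
    return isinstance(error, dict) and error.get("code") == 188


sample_line = (
    '{"t":{"$date":"2021-08-04T14:14:31.992+08:00"},"s":"I", "c":"QUERY", '
    '"id":4625501, "ctx":"conn535","msg":"Unable to establish remote cursors",'
    '"attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion",'
    '"errmsg":"Incompatible wire version"},"nRemotes":2}}'
)
print(is_wire_version_error(sample_line))
```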
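【Reproducing the error with a Python program】
【Python program】
The mongos URL must include both parameters, readPreference=secondaryPreferred and maxStalenessSeconds, otherwise the error does not occur. On a non-sharded cluster there is no error either.
from pymongo import MongoClient
import pprint

# Both readPreference and maxStalenessSeconds must be present to trigger the bug.
client = MongoClient('mongodb://admin:admin@mongodbtest.com:41051/?readPreference=secondaryPreferred&maxStalenessSeconds=90')
db = client.xiaoxu
coll = db.xiaoxu
i = 0
while i < 100000:
    doc = {'no': 100 + i}
    # Write through mongos, then read the document back with the secondary-preferred read preference.
    pprint.pprint(coll.insert_one(doc))
    pprint.pprint(coll.find_one(doc))
    i += 1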
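【Check which shard is the primary shard of database xiaoxu】
Note: the output below shows that the primary shard of database xiaoxu is shard2. The goal is to simulate the impact of a secondary going down in that shard. A node going down in shard1 has no effect here. For a sharded collection, where queries are broadcast, a node going down in any shard has an impact.
mongos> sh.status();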
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("5fc608cefcbcbec36d4f785d")
}
shards:
{ "_id" : "shard1", "host" : "shard1/mongodbtest1.com:27017,mongodbtest2.com:27017,mongodbtest3.com:27017", "state" : 1 }
{ "_id" : "shard2", "host" : "shard2/mongodbtest1.com:27018,mongodbtest2.com:27018,mongodbtest3.com:27018", "state" : 1 }
active mongoses:
"4.4.4" : 3
autosplit:
Currently enabled: yes
balancer:
Currently enabled: yes
Currently running: no
Migration Results for the last 24 hours:
No recent migrations
databases:
......
{ "_id" : "xiaoxu", "primary" : "shard2", "partitioned" : false, "version" : { "uuid" : UUID("6593e368-e82a-4a8c-a184-ccf57b1773e9"),
"lastMod" : 1 } }
mongos>
【Simulate any secondary of shard2 going down -- either a crash or a clean shutdown works】
Note: if a secondary of shard1 goes down instead, queries are not affected.
mongod -f /data/mongodb/mongodb44/mongod27018/conf/shard2.conf --shutdown
【Log in to any remaining node to verify】
mongo 127.0.0.1:27018/admin -uadmin --eval "rs.status()" |egrep "name|stateStr"
Enter password:
"name" : "mongodbtest1.com:27018",
"stateStr" : "(not reachable/healthy)",
"name" : "mongodbtest2.com:27018",
"stateStr" : "SECONDARY",
"name" : "mongodbtest3.com:27018",
"stateStr" : "PRIMARY",
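The filtered output above alternates "name" and "stateStr" lines, so it can be paired up into a small host-to-state map. The sketch below works on the pasted text only (it does not connect to MongoDB); member_states is a hypothetical helper, and it assumes each "name" line is followed by that member's "stateStr" line, as in this output.

```python
import re


def member_states(status_text: str) -> dict:
    """Pair each "name" value with the "stateStr" value that follows it."""
    names = re.findall(r'"name"\s*:\s*"([^"]+)"', status_text)
    states = re.findall(r'"stateStr"\s*:\s*"([^"]+)"', status_text)
    return dict(zip(names, states))


sample = '''
"name" : "mongodbtest1.com:27018",
"stateStr" : "(not reachable/healthy)",
"name" : "mongodbtest2.com:27018",
"stateStr" : "SECONDARY",
"name" : "mongodbtest3.com:27018",
"stateStr" : "PRIMARY",
'''
states = member_states(sample)
# Any member that is neither PRIMARY nor SECONDARY is treated as down here.
down = [host for host, state in states.items()
        if state not in ("PRIMARY", "SECONDARY")]
print(down)
```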
【Exception raised by the Python program】
pymongo.errors.OperationFailure: Encountered non-retryable error during query :: caused by :: Incompatible wire version,
full error: {'ok': 0.0, 'errmsg': 'Encountered non-retryable error during query :: caused by ::
Incompatible wire version', 'code': 188, 'codeName': 'IncompatibleServerVersion', 'operationTime': Timestamp(1628219161, 4),
'$clusterTime': {'clusterTime': Timestamp(1628219162, 8), 'signature': {'hash': b'-D\xe2~\xe3\x14;\xe6bqa>\x14\xad\xf30<\xf7U\xd0', 'keyId': 6937532038958284801}}}
【Check the mongos and shard logs -- mongos shows the error, the shards do not】
{"t":{"$date":"2021-08-06T11:06:02.048+08:00"},"s":"I", "c":"QUERY", "id":4625501, "ctx":"conn564","msg":"Unable to establish remote cursors",
"attr":{"error":{"code":188,"codeName":"IncompatibleServerVersion","errmsg":"Incompatible wire version"},"nRemotes":0}}
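【Upgrading the cluster to 4.4.8】
After the upgrade, a secondary going down no longer affects front-end queries. The new version fixes the bug with the change "Skip maxStaleness wire version check when server is down". If you cannot upgrade, you can work around the problem by disabling read/write splitting.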
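One way to apply the workaround is to strip the two options from the connection string so that reads go to the primary again. The following is a minimal standard-library sketch; without_read_splitting is a hypothetical helper, and it only removes readPreference and maxStalenessSeconds (other related options, such as read preference tags, are left untouched).

```python
from urllib.parse import parse_qsl, urlencode

# Options that enable the read/write splitting behaviour described above.
DROP = {"readpreference", "maxstalenessseconds"}


def without_read_splitting(uri: str) -> str:
    """Return the URI with readPreference and maxStalenessSeconds removed."""
    base, _, query = uri.partition("?")
    kept = [(key, value) for key, value in parse_qsl(query)
            if key.lower() not in DROP]
    # Re-attach the query string only if any option is left.
    return base + ("?" + urlencode(kept) if kept else "")


uri = ("mongodb://admin:***@mongoprd1.com:31051,mongoprd2.com:31051,"
       "mongoprd3.com:31051/xiaoxu?&readPreference=secondaryPreferred"
       "&maxStalenessSeconds=120")
print(without_read_splitting(uri))
```

With both options removed the driver falls back to its default read preference (primary), which avoids the trigger conditions at the cost of sending all reads to the primary.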
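【Bug: all instances crash during a primary/secondary switchover】
【Trigger conditions】
After rs.stepDown() was run on the primary, a new primary was elected and was accepting writes, and then every member of the replica set crashed (this could not be reproduced). According to the JIRA ticket, a change in replica set state can trigger this bug: for example, adding a member, setting the compatibility version when upgrading to 4.4, a primary stepping down, or a network partition can produce an "Invariant failure" error.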
【Original primary】
-- "replSetStepDown command completed": judging from this log entry, the node had finished stepping down from primary and then crashed
{"t":{"
{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}
{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"I", "c":"REPL", "id":2903000, "ctx":"conn208969","msg":"Restarting heartbeats after learning of a new primary","attr":{"myPrimaryId":"none","senderAndPrimaryId":4,"senderTerm":3}}
{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
【New primary】
-- "Transition to primary complete; database writes are now permitted": the new primary had been elected and was accepting writes, and then crashed
{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I", "c":"STORAGE", "id":20657, "ctx":"OplogApplier-0","msg":"IndexBuildsCoordinator::onStepUp - this node is stepping up to primary"}
{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I", "c":"REPL", "id":21331, "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted"}
{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"waitForMajority","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}
{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"waitForMajority","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"waitForMajority","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}
{"t":{"$date":"2021-07-28T15:39:49.072+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}
【Third node】
{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23079, "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}
{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"-", "id":23080, "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2021-07-28T15:40:19.466+08:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
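When going through the logs of several crashed members, it can help to count the fatal "Invariant failure" entries per thread context (ctx). The sketch below is a hypothetical helper built on the standard json module; the sample lines are abbreviated versions of the entries above (the attr field is omitted), and truncated lines are simply skipped.

```python
import json
from collections import Counter


def invariant_failures(lines):
    """Count severity "F" entries whose msg is "Invariant failure", per ctx."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if (isinstance(entry, dict) and entry.get("s") == "F"
                and entry.get("msg") == "Invariant failure"):
            counts[entry.get("ctx")] += 1
    return counts


sample = [
    '{"t":{"$date":"2021-07-28T15:39:49.069+08:00"},"s":"I","c":"REPL","id":21331,'
    '"ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted"}',
    '{"t":{"$date":"2021-07-28T15:39:49.070+08:00"},"s":"F","c":"-","id":23079,'
    '"ctx":"waitForMajority","msg":"Invariant failure"}',
    '{"t":{"$date":"2021-07-28T15:39:49.071+08:00"},"s":"F","c":"-","id":23079,'
    '"ctx":"TopologyVersionObserver","msg":"Invariant failure"}',
    '{"t":{"',  # truncated line, ignored by the parser
]
print(dict(invariant_failures(sample)))
```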
【Corresponding bug and affected versions -- fixed by upgrading to 4.4.5, but 4.4.5 is not recommended】
https://jira.mongodb.org/browse/SERVER-53566
MongoDB version 4.4.5 is not recommended for production use due to a critical issue, WT-7426. The issue is fixed in version 4.4.6.