Shard recovery reaches the maximum number of retries

Symptoms

Run GET /_cluster/allocation/explain to view the allocation details of the affected index's shards.
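Called without a body, the API explains an arbitrary unassigned shard. To inspect one specific shard, the request can carry a JSON body naming the index, shard number, and whether it is the primary. The following is a minimal Python sketch using only the standard library. The helper names, the base URL, and the assumption that the cluster needs no authentication are all illustrative and not part of the original article.

```python
import json
import urllib.request


def build_explain_request(base_url, index, shard, primary):
    """Build (but do not send) an allocation-explain request for one shard."""
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/allocation/explain",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # the API accepts both GET and POST; POST is simpler with a body
    )


def explain_allocation(base_url, index, shard, primary):
    """Send the request and return the parsed JSON response."""
    request = build_explain_request(base_url, index, shard, primary)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, `explain_allocation("http://localhost:9200", "myIndex", 5, True)` would ask about primary shard 5 of `myIndex`, assuming a node is reachable at that address.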

Failed to obtain the shard lock (failed to obtain in-memory shard lock)


		"deciders": [{
			"decider": "max_retry",
			"decision": "NO",
			"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-02-27T06:48:04.340Z], failed_attempts[5], failed_nodes[[iOKq3oMXReCl1EcdcM3OEQ]], delayed=false, details[failed shard on node [iOKq3oMXReCl1EcdcM3OEQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[myIndex][5]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
		}]

Circuit breaker tripped (Data too large)


		"deciders": [{
			"decider": "max_retry",
			"decision": "NO",
			"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-12-27T06:04:04.013Z], failed_attempts[5], delayed=false, details[failed shard on node [aHN6ZO4dSDOJPJCfdnGyAQ]: failed recovery, failure RecoveryFailedException[[.triggered_watches][0]: Recovery failed from {1618925137004890932}{ZsOg8qa1Qn6_NpWV3m-FVA}{q4nJNlZYQgujXDLmD6ol0g}{x.x.x.x}{x.x.x.x:9300}{ml.machine_memory=3929833472, rack=cvm_8_800003, xpack.installed=true, set=800003, ip=x.x.x.x, temperature=hot, ml.max_open_jobs=20, ml.enabled=true, region=8} into {1591263882001002432}{aHN6ZO4dSDOJPJCfdnGyAQ}{UoLH-_YfQcWuzWO8eVOe9w}{x.x.x.x}{x.x.x.x:28905}{ml.machine_memory=3929833472, rack=cvm_8_800003, xpack.installed=true, set=800003, ip=x.x.x.x, temperature=hot, ml.max_open_jobs=20, ml.enabled=true, region=8}]; nested: RemoteTransportException[[1618925137004890932][x.x.x.x:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [1435813322/1.3gb], which is larger than the limit of [1433862144/1.3gb], real usage: [1435810872/1.3gb], new bytes reserved: [2450/2.3kb]]; ], allocation_status[no_attempt]]]"
		}]

Disk full (No space left on device)


		"deciders": [{
			"decider": "max_retry",
			"decision": "NO",
			"explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-01T11:16:21.428Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[im_session_log][1]: Recovery failed from {1637914038007031232}{ITpemyeASn-4AKdB_hmmGA}{RLe4brDyR5CszH6WhUla2w}{x.x.x.x}{x.x.x.x:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=x.x.x.x} into {1637914038007031032}{xANHj5XtQreIJPuqbSSydg}{2nSl8dV6Tam0T61u2-CMsw}{x.x.x.x}{x.x.x.x:9300}{temperature=hot, rack=cvm_8_800007, set=800007, region=8, ip=x.x.x.x}]; nested: RemoteTransportException[[1637914038007031232][x.x.x.x:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [134] files with total size of [10gb]]; nested: RemoteTransportException[[1637914038007031032][x.x.x.x:9300][internal:index/shard/recovery/file_chunk]]; nested: IOException[No space left on device]; ], allocation_status[no_attempt]]]"
		}]

If a decider returns "max_retry", search its explanation for the three common keywords shown above to identify the cause.
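The keyword filtering can be sketched in Python as follows. The function and dictionary names are hypothetical, and the code walks the whole parsed response looking for decider objects, so it makes no assumption about where the "deciders" list sits in the JSON.

```python
# Keywords taken from the three explanation samples above.
CAUSE_KEYWORDS = {
    "shard_lock": "failed to obtain in-memory shard lock",
    "circuit_breaker": "Data too large",
    "disk_full": "No space left on device",
}


def iter_deciders(node):
    """Yield every decider object found anywhere in a parsed JSON value."""
    if isinstance(node, dict):
        if "decider" in node:
            yield node
        for value in node.values():
            yield from iter_deciders(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_deciders(item)


def classify_max_retry(response):
    """Return the set of causes named in max_retry decider explanations."""
    causes = set()
    for decider in iter_deciders(response):
        if decider.get("decider") != "max_retry":
            continue  # other deciders are not covered by this article
        explanation = decider.get("explanation", "")
        for cause, keyword in CAUSE_KEYWORDS.items():
            if keyword in explanation:
                causes.add(cause)
    return causes
```

An empty result means either that no max_retry decider blocked the allocation or that the failure has a cause other than the three listed here, in which case the explanation text needs to be read directly.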

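Shard lock failures and circuit breaker errors usually occur because a node has just joined the cluster or the cluster is under heavy load, which causes the allocation to fail. In these cases, manually trigger a retry of the shard allocation right away, or wait until the cluster load has dropped and trigger it then.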

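A full disk requires cleaning up historical data or expanding the disk capacity first. Once disk usage is below the low disk watermark, manually trigger a retry of the shard allocation.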
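A rough way to check the watermark condition before retrying is sketched below. It assumes the cluster still uses the default low watermark of 85% (`cluster.routing.allocation.disk.watermark.low`); if the setting has been changed, or is expressed as an absolute free-space value, pass a different threshold or adapt the check. The function names are hypothetical, and the figure reported by the operating system may differ slightly from the one Elasticsearch computes.

```python
import shutil

# Assumption: default low watermark of 85% disk usage.
DEFAULT_LOW_WATERMARK = 0.85


def is_below_low_watermark(used_bytes, total_bytes,
                           low_watermark=DEFAULT_LOW_WATERMARK):
    """Return True when the used fraction of the disk is under the watermark."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes < low_watermark


def path_below_low_watermark(path, low_watermark=DEFAULT_LOW_WATERMARK):
    """Check the filesystem that holds `path` (for example a node's data path)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return is_below_low_watermark(usage.used, usage.total, low_watermark)
```

Run `path_below_low_watermark` on the node that reported the error, with the directory where that node stores its data.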

Solution

Manually trigger a retry of the shard allocation


POST _cluster/reroute?retry_failed=true
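The same call can be issued from a script, for instance once monitoring shows that the cluster load has dropped. This is a minimal sketch with the Python standard library; the helper names and the base URL are hypothetical, and it assumes a cluster without authentication.

```python
import json
import urllib.request


def build_retry_request(base_url):
    """Build (but do not send) the reroute request that retries failed shards."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/reroute?retry_failed=true",
        method="POST",
    )


def retry_failed_allocations(base_url):
    """Send the request and report whether the cluster acknowledged it."""
    with urllib.request.urlopen(build_retry_request(base_url), timeout=60) as response:
        return json.load(response).get("acknowledged", False)
```

An acknowledged request only means the retry was accepted. If the underlying cause is still present, the shard can fail again, so it is worth running the allocation explain call once more afterwards.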

