A Review of ES Errors_ 字节宝

--------Logs----------

Checksum mismatch: the shard is not allowed to come online

Log error:

org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1wawfr3 actual=ux64oi (resource=name [_dem.fdt], length [2731835920], checksum [1wawfr3], writtenBy [8.7.0]) (resource=VerifyingIndexOutput(_dem.fdt))

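Analysis:

This is usually caused by a disk or system problem that corrupted a shard file, so the checksum ES verifies for that file no longer matches.

Solution:

1. Conservative repair, following the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-tool.html. If the cluster keeps rebalancing, change the value of cluster.routing.allocation.allow_rebalance to indices_primaries_active.

2. Repair with the node stopped: https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-tool.html

3. Repair without stopping the node: I tried the tool that ships with Lucene, mainly following this article: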

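https://mincong.io/cn/elasticsearch-corrupted-index/
I then went through the steps below. It feels like a bit of a detour; in theory, simply removing the stale shard copy should be enough.
1. Copy the data of the problematic shard out to a separate directory.
2. Check the segment named in the log error. The problematic segment is _dem, and the check does report an error:
/data/c_log/repository/jdk/kona11.0.9.1.b1/bin/java -cp lib/lucene-core-8.7.0.jar:lib/ohc-core-0.7.0.jar  -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /data1/containers/data_bak/index -segment _dem
3. Repair it. Note that -exorcise drops the corrupted segments, so the documents stored in them are lost:
/data/c_log/repository/jdk/kona11.0.9.1.b1/bin/java -cp lib/lucene-core-8.7.0.jar:lib/ohc-core-0.7.0.jar  -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /data1/containers/data_bak/index -exorcise
4. Close the index and replace the shard's index file directory with the repaired copy.
5. Reopen the index.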
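
Step 2 above relies on reading the segment name out of the exception by eye. As a minimal sketch, the same lookup can be scripted: the helper below pulls the file name and checksums out of a CorruptIndexException line and builds the CheckIndex argument list used in the steps above. The function names, the regular expression, and the assumption that the log line has exactly the shape quoted in this post are mine, not part of Lucene or ES.

```python
import re

# Matches the "checksum failed" line in the shape quoted in this post.
# Other Lucene versions may word the message differently (assumption).
CHECKSUM_RE = re.compile(
    r"checksum failed .*?expected=(?P<expected>\w+) actual=(?P<actual>\w+) "
    r"\(resource=name \[(?P<file>[^\]]+)\]"
)


def parse_checksum_error(line):
    """Return the file, segment and checksums from a log line, or None."""
    match = CHECKSUM_RE.search(line)
    if match is None:
        return None
    file_name = match.group("file")
    # Segment names look like "_dem"; the file adds an extension or suffix.
    segment = re.match(r"_[0-9a-z]+", file_name)
    return {
        "file": file_name,
        "segment": segment.group(0) if segment else None,
        "expected": match.group("expected"),
        "actual": match.group("actual"),
    }


def check_index_args(java_bin, classpath, index_dir, segment):
    """Build the argument list for the CheckIndex run on one segment."""
    return [
        java_bin,
        "-cp", classpath,
        "-ea:org.apache.lucene...",  # JVM flag: enable assertions for Lucene
        "org.apache.lucene.index.CheckIndex",
        index_dir,
        "-segment", segment,
    ]


if __name__ == "__main__":
    log_line = (
        "org.apache.lucene.index.CorruptIndexException: checksum failed "
        "(hardware problem?) : expected=1wawfr3 actual=ux64oi "
        "(resource=name [_dem.fdt], length [2731835920], "
        "checksum [1wawfr3], writtenBy [8.7.0])"
    )
    info = parse_checksum_error(log_line)
    print(info)
    print(" ".join(check_index_args(
        "java",
        "lib/lucene-core-8.7.0.jar:lib/ohc-core-0.7.0.jar",
        "/data1/containers/data_bak/index",
        info["segment"],
    )))
```

The argument list can be handed to subprocess.run as is. Run it against the copied directory from step 1, never against the live shard path.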

---------Explain API-----------

Disk full

Log error:

the node is above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=95%], using more disk space than the maximum allowed [95.0%], actual free: [4.05%]

Solution:

Expand the disk or delete data.

Shard document count exceeds the maximum limit

Log error:
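
To decide how much space has to be freed, it helps to turn the message into numbers. The sketch below is only an illustration: it parses the high watermark and the free percentage out of a message shaped like the one above and reports how far over the limit the node is. The function name and the regular expression are my own, and it assumes the watermark is configured as a percentage; a watermark set as an absolute size would not match.

```python
import re

# Matches the percentage form of the message quoted above (assumption:
# the watermark is configured as a percentage, not an absolute size).
WATERMARK_RE = re.compile(
    r"watermark\.high=(?P<high>[\d.]+)%\].*?actual free: \[(?P<free>[\d.]+)%\]"
)


def disk_usage_from_message(message):
    """Return watermark, used percentage and overshoot, or None."""
    match = WATERMARK_RE.search(message)
    if match is None:
        return None
    high = float(match.group("high"))
    free = float(match.group("free"))
    used = 100.0 - free
    return {
        "high_watermark_pct": high,
        "used_pct": used,
        # Positive means the node must shed at least this share of disk.
        "over_by_pct": used - high,
    }


if __name__ == "__main__":
    msg = (
        "the node is above the high watermark cluster setting "
        "[cluster.routing.allocation.disk.watermark.high=95%], using more "
        "disk space than the maximum allowed [95.0%], actual free: [4.05%]"
    )
    print(disk_usage_from_message(msg))
```

For the message in this post, free space is 4.05%, so usage is just under 96% against a 95% watermark; freeing roughly 1% of the disk would only barely clear it, so leave a wider margin.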

failure IllegalArgumentException[number of documents in the index cannot exceed 2147483519

Solution:

Write the data to a new index, and size its shards sensibly so that no single shard approaches the limit.
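
As a rough sizing aid for the new index, the sketch below estimates how many primary shards are needed to stay under the limit from the error message. The limit constant comes from the log above; the function name and the default target of 200,000,000 documents per shard are assumptions for illustration, so pick a target that fits your own data and hardware.

```python
# Per-shard document limit quoted in the error message above.
MAX_DOCS_PER_SHARD = 2_147_483_519


def min_primary_shards(expected_docs, target_docs_per_shard=200_000_000):
    """Estimate the primary shard count for an expected document total.

    target_docs_per_shard is a hypothetical planning value, not an ES
    setting; it must stay at or below MAX_DOCS_PER_SHARD.
    """
    if expected_docs < 0:
        raise ValueError("expected_docs must be non-negative")
    if not 0 < target_docs_per_shard <= MAX_DOCS_PER_SHARD:
        raise ValueError("target_docs_per_shard is out of range")
    # Integer ceiling division avoids float rounding on large counts.
    shards = -(-expected_docs // target_docs_per_shard)
    return max(1, shards)


if __name__ == "__main__":
    # Example: three billion documents expected in the new index.
    print(min_primary_shards(3_000_000_000))
```

The result is a lower bound on document count alone; shard size on disk and query load may call for more shards than this.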
