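One node crashed and could not restart automatically. Data was being imported through Logstash, and because the index receiving that day's data had 0 replicas, losing the node meant losing some of its shards, which left the cluster stuck in red status. After the node was lost, the import rate for that index dropped sharply. Testing showed that the cause was Logstash: its input stage runs in one thread, while filter and output share another thread, with a synchronous queue buffering data between them. If something goes wrong during output, the failed data is put back into the synchronous queue without limit. The queued data is then routed to shards and imported again; data routed to a lost shard fails again and goes back into the queue. The data therefore keeps cycling between the synchronous queue and the ES bulk requests, which slows down imports for the whole index.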
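
The loop described above can be illustrated with a toy model. The sketch below is not Logstash code: the `simulate` function, the queue capacity, the batch size, and the modulo routing are all assumptions made for illustration. It only shows how a bounded queue gradually fills up with events destined for unavailable shards when failed events are always re-queued.

```python
from collections import deque


def simulate(rounds, queue_capacity, batch_size, num_shards, dead_shards):
    """Toy model: return the number of events indexed in each round.

    Each round the input stage fills the free queue slots with new events,
    then the output stage takes one batch. Events routed to a dead shard
    are put back at the end of the queue (the unlimited retry).
    """
    queue = deque()
    next_id = 0
    indexed_per_round = []
    for _ in range(rounds):
        # Input stage: new events can only enter while the queue has room.
        while len(queue) < queue_capacity:
            queue.append(next_id)
            next_id += 1
        # Output stage: take one batch from the front of the queue.
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        indexed = 0
        for event_id in batch:
            shard = event_id % num_shards  # stand-in for ES document routing
            if shard in dead_shards:
                queue.append(event_id)  # failed action goes back into the queue
            else:
                indexed += 1
        indexed_per_round.append(indexed)
    return indexed_per_round


healthy = simulate(5, 16, 8, 8, set())
degraded = simulate(200, 16, 8, 8, {1, 5})
print(healthy)        # [8, 8, 8, 8, 8]
print(degraded[:4])   # [6, 6, 4, 5]
print(degraded[-1])   # 0
```

In this model the retried events never leave the queue, so they crowd out new events until nothing more can be indexed. The real slowdown observed in the test was less extreme, but the direction is the same.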
The results measured on the test machines were as follows:
1. Importing data normally:
xxx-20170925 1 p STARTED 24713 24.7mb xxx.7.67 node-xxx.7.67-performance_test
xxx-20170925 5 p STARTED 24256 33.7mb xxx.7.67 node-xxx.7.67-performance_test
xxx-20170925 2 p STARTED 24702 24.2mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 3 p STARTED 24626 24.2mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 7 p STARTED 24916 34.2mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 4 p STARTED 23970 38.2mb xxx.6.105 node-xxx.6.105-performance_test
xxx-20170925 6 p STARTED 24786 24mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 0 p STARTED 24824 34.4mb xxx.6.105 node-xxx.6.105-performance_test
2. Shutting down one node:
xxx-20170925 6 p STARTED 128179 110.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 1 p UNASSIGNED
xxx-20170925 4 p STARTED 128263 108.1mb xxx.6.105 node-xxx.6.105-performance_test
xxx-20170925 7 p STARTED 128593 109.3mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 2 p STARTED 128613 112.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 5 p UNASSIGNED
xxx-20170925 3 p STARTED 127969 115.6mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 0 p STARTED 128322 110.3mb xxx.6.105 node-xxx.6.105-performance_test
3. Checking the shards again after some time shows that the remaining shards are growing very slowly:
xxx-20170925 6 p STARTED 128436 111.1mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 5 p UNASSIGNED
xxx-20170925 3 p STARTED 128231 110.9mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 7 p STARTED 128814 109.6mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 1 p UNASSIGNED
xxx-20170925 2 p STARTED 128871 182.6mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 4 p STARTED 128502 108.5mb xxx.6.105 node-xxx.6.105-performance_test
xxx-20170925 0 p STARTED 128568 109.1mb xxx.6.105 node-xxx.6.105-performance_test
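
To put a number on how slow the growth was, the docs counts from steps 2 and 3 can be compared shard by shard. The following is a minimal Python sketch; the `parse_docs` helper is hypothetical, and the two snapshots are copied from the listings above with the ip and node columns trimmed. The source does not state how much time passed between the two snapshots, so only the difference is computed, not a rate.

```python
def parse_docs(snapshot):
    """Map shard number -> docs count for STARTED shards in shard listing text."""
    docs = {}
    for line in snapshot.strip().splitlines():
        parts = line.split()
        # Columns: index, shard, prirep, state, docs, store, ...
        if len(parts) >= 5 and parts[3] == "STARTED":
            docs[int(parts[1])] = int(parts[4])
    return docs


after_shutdown = """
xxx-20170925 6 p STARTED 128179 110.8mb
xxx-20170925 1 p UNASSIGNED
xxx-20170925 4 p STARTED 128263 108.1mb
xxx-20170925 7 p STARTED 128593 109.3mb
xxx-20170925 2 p STARTED 128613 112.8mb
xxx-20170925 5 p UNASSIGNED
xxx-20170925 3 p STARTED 127969 115.6mb
xxx-20170925 0 p STARTED 128322 110.3mb
"""

some_time_later = """
xxx-20170925 6 p STARTED 128436 111.1mb
xxx-20170925 5 p UNASSIGNED
xxx-20170925 3 p STARTED 128231 110.9mb
xxx-20170925 7 p STARTED 128814 109.6mb
xxx-20170925 1 p UNASSIGNED
xxx-20170925 2 p STARTED 128871 182.6mb
xxx-20170925 4 p STARTED 128502 108.5mb
xxx-20170925 0 p STARTED 128568 109.1mb
"""

before = parse_docs(after_shutdown)
after = parse_docs(some_time_later)
# Only shards that are STARTED in both snapshots can be compared.
deltas = {shard: after[shard] - before[shard] for shard in sorted(before) if shard in after}
print(deltas)                # {0: 246, 2: 258, 3: 262, 4: 239, 6: 257, 7: 221}
print(sum(deltas.values()))  # 1483
```

Each surviving shard gained only a few hundred documents between the two snapshots, compared with the roughly 100,000 documents per shard added between steps 1 and 2.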
The Logstash log is as follows:
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [15] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,784][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
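
To see which shards the retries are aimed at, the "retrying failed action" lines in a log like the one above can be tallied per shard. This is a rough sketch only: the regular expression is written against the exact line format shown here and is an assumption, and `count_retries` is a hypothetical helper, not part of Logstash.

```python
import re
from collections import Counter

# Matches the retry lines shown above and captures the index name and shard number.
RETRY_RE = re.compile(
    r'retrying failed action with response code: (\d+) '
    r'\(\{"type"=>"([^"]+)", "reason"=>"\[([^\]]+)\]\[(\d+)\]'
)


def count_retries(lines):
    """Count retried actions per (index, shard) pair."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            _code, _error_type, index, shard = match.groups()
            counts[(index, int(shard))] += 1
    return counts


# Two of the lines from the log above, copied verbatim.
sample = [
    '[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})',
    '[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [15] requests]"})',
]
print(sorted(count_retries(sample).items()))
# [(('xxx-20170925', 1), 1), (('xxx-20170925', 5), 1)]
```

In the log above, every retry targets shard 1 or shard 5, which are exactly the two UNASSIGNED shards.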
4. After the data recovered:
xxx-20170925 4 p STARTED 154764 125.3mb xxx.6.105 node-xxx.6.105-performance_test
xxx-20170925 5 p STARTED 157936 126.4mb xxx.7.67 node-xxx.7.67-performance_test
xxx-20170925 2 p STARTED 154945 138.9mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 7 p STARTED 155224 156.8mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 1 p STARTED 158080 124.8mb xxx.7.67 node-xxx.7.67-performance_test
xxx-20170925 3 p STARTED 154243 153.8mb xxx.7.81 node-xxx.7.81-performance_test
xxx-20170925 6 p STARTED 154909 146.9mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925 0 p STARTED 154681 127mb xxx.6.105 node-xxx.6.105-performance_test