Tencent Cloud Elasticsearch Cluster Operations: Common Commands Explained, Part 2 (Nodes)

2021-12-19 10:37:22

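In Tencent Cloud Elasticsearch Cluster Operations: Common Commands Explained, Part 1 (Cluster), we covered several common cluster-level operations commands in detail. In this part, we take the node perspective and walk through the commonly used node-related commands.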

Node-related commands

1. Viewing basic node information

GET /_nodes/<node_id>

This command returns all of a node's information in one call, including process, jvm, and plugins, as well as the node's roles, attributes, and settings.

To return only a specific section, such as process, use the following API:

GET /_nodes/<node_id>/process

Response:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "es-xxxx",
  "nodes" : {
    "fdaOV16OQPq6-AUVihmq4A" : {
      "name" : "1626803143000145632",
      "transport_address" : "xx.0.96.22:9300",
      "host" : "xx.0.96.22",
      "ip" : "xx.0.96.22",
      "version" : "7.10.1",
      "build_flavor" : "default",
      "build_type" : "tar",
      "build_hash" : "119c2106d7bd8d206aa3b65dc43c87b8aa590b2b",
      "roles" : [
        "master",
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {
        "ml.machine_memory" : "16478932992",
        "rack" : "cvm_4_200003",
        "xpack.installed" : "true",
        "set" : "200003",
        "transform.node" : "false",
        "ip" : "xx.20.58.221",
        "temperature" : "hot",
        "ml.max_open_jobs" : "20",
        "region" : "4"
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 24918,
        "mlockall" : false
      }
    }
  }
}

2. Viewing node statistics

GET /_nodes/stats
GET /_nodes/<node_id>/stats
GET /_nodes/stats/<metric>

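This API returns statistics for a wide range of node metrics: JVM, CPU, memory, and disk usage; the number of indices on the node; statistics for get requests, search requests, merge operations, refresh, and flush; cache statistics such as fielddata, query_cache, and request_cache; and segment and translog statistics. In short, it gives a comprehensive view of node-level metrics, which is very helpful when troubleshooting cluster problems. We have also learned that some large Tencent Cloud ES customers poll this API periodically and feed the results into their own monitoring systems for finer-grained cluster monitoring.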
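As a rough illustration of feeding these statistics into a monitoring system, the following Python sketch flattens a node stats response into dotted metric names per node. It is only a sketch under assumptions: the helper names flatten_metrics and node_metrics are hypothetical, the sample dictionary is a trimmed stand-in for a real response, and how the metrics are fetched and shipped to a monitoring backend is left out.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten_metrics(stats: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, float]]:
    """Yield (dotted_name, value) pairs for every numeric leaf in a nested dict."""
    for key, value in stats.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_metrics(value, name)
        elif isinstance(value, bool):
            continue  # booleans are ints in Python; skip them explicitly
        elif isinstance(value, (int, float)):
            yield name, value


def node_metrics(response: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Map each node name to its flattened numeric metrics."""
    result = {}
    for node in response.get("nodes", {}).values():
        # Keep only the nested metric sections (for example "jvm" or "indices");
        # identity fields and the "attributes" map are not metrics.
        sections = {k: v for k, v in node.items() if isinstance(v, dict) and k != "attributes"}
        result[node["name"]] = dict(flatten_metrics(sections))
    return result


# Trimmed sample in the shape of the responses shown in this article.
sample = {
    "nodes": {
        "fdaOV16OQPq6-AUVihmq4A": {
            "name": "1626803143000145632",
            "timestamp": 1639878427713,
            "attributes": {"temperature": "hot"},
            "jvm": {"mem": {"heap_used_percent": 2}, "threads": {"count": 41}},
        }
    }
}

metrics = node_metrics(sample)
print(metrics)
```

Flat names such as jvm.mem.heap_used_percent tend to map easily onto the metric naming of most monitoring systems, which is why the sketch flattens rather than forwarding the nested JSON.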

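By default, this API returns every statistic for the node. To view only some metrics or one specific group of metrics, specify them in the API path. For example, to view JVM usage on a particular node: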

GET /_nodes/1626803143000145632/stats/jvm

The response is as follows:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "es-xxx",
  "nodes" : {
    "fdaOV16OQPq6-AUVihmq4A" : {
      "timestamp" : 1639878427713,
      "name" : "1626803143000145632",
      "transport_address" : "xx.0.96.22:9300",
      "host" : "xx.0.96.22",
      "ip" : "xx.0.96.22:9300",
      "roles" : [
        "master",
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {
        "ml.machine_memory" : "16478932992",
        "rack" : "cvm_4_200003",
        "xpack.installed" : "true",
        "set" : "200003",
        "transform.node" : "false",
        "ip" : "xx.20.58.221",
        "temperature" : "hot",
        "ml.max_open_jobs" : "20",
        "region" : "4"
      },
      "jvm" : {
        "timestamp" : 1639878427713,
        "uptime_in_millis" : 13037976328,
        "mem" : {
          "heap_used_in_bytes" : 214161760,
          "heap_used_percent" : 2,
          "heap_committed_in_bytes" : 8555069440,
          "heap_max_in_bytes" : 8555069440,
          "non_heap_used_in_bytes" : 178981224,
          "non_heap_committed_in_bytes" : 195989504,
          "pools" : {
            "young" : {
              "used_in_bytes" : 104047328,
              "max_in_bytes" : 279183360,
              "peak_used_in_bytes" : 279183360,
              "peak_max_in_bytes" : 279183360
            },
            "survivor" : {
              "used_in_bytes" : 545616,
              "max_in_bytes" : 34865152,
              "peak_used_in_bytes" : 34865144,
              "peak_max_in_bytes" : 34865152
            },
            "old" : {
              "used_in_bytes" : 109568816,
              "max_in_bytes" : 8241020928,
              "peak_used_in_bytes" : 127845408,
              "peak_max_in_bytes" : 8241020928
            }
          }
        },
        "threads" : {
          "count" : 41,
          "peak_count" : 43
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 4718,
              "collection_time_in_millis" : 162825
            },
            "old" : {
              "collection_count" : 3,
              "collection_time_in_millis" : 359
            }
          }
        },
        "buffer_pools" : {
          "direct" : {
            "count" : 16,
            "used_in_bytes" : 4300808,
            "total_capacity_in_bytes" : 4300807
          },
          "mapped" : {
            "count" : 0,
            "used_in_bytes" : 0,
            "total_capacity_in_bytes" : 0
          }
        },
        "classes" : {
          "current_loaded_count" : 21255,
          "total_loaded_count" : 21361,
          "total_unloaded_count" : 106
        }
      }
    }
  }
}
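To show how the jvm section might be read in practice, here is a small Python sketch that derives a heap usage ratio and the average pause per garbage collection from it. The function name summarize_jvm is hypothetical, and the sample values are copied from the response above; which ratios are worth alerting on is a judgment call that this sketch does not settle.

```python
from typing import Any, Dict


def summarize_jvm(jvm: Dict[str, Any]) -> Dict[str, float]:
    """Derive a few ratios from the "jvm" section of a node stats entry."""
    mem = jvm["mem"]
    summary = {
        "heap_used_ratio": mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"],
    }
    for name, gc in jvm["gc"]["collectors"].items():
        count = gc["collection_count"]
        # Average pause per collection in milliseconds; 0.0 when no collection has run.
        summary[f"{name}_gc_avg_ms"] = gc["collection_time_in_millis"] / count if count else 0.0
    return summary


# Values copied from the response above.
jvm_sample = {
    "mem": {"heap_used_in_bytes": 214161760, "heap_max_in_bytes": 8555069440},
    "gc": {
        "collectors": {
            "young": {"collection_count": 4718, "collection_time_in_millis": 162825},
            "old": {"collection_count": 3, "collection_time_in_millis": 359},
        }
    },
}

summary = summarize_jvm(jvm_sample)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```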

We can also view the merge statistics for the indices on that node:

GET /_nodes/1626803143000145632/stats/indices/merge

The response is as follows:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "es-xxx",
  "nodes" : {
    "fdaOV16OQPq6-AUVihmq4A" : {
      "timestamp" : 1639878633590,
      "name" : "1626803143000145632",
      "transport_address" : "xx.0.96.22:9300",
      "host" : "xx.0.96.22",
      "ip" : "xx.0.96.22:9300",
      "roles" : [
        "master",
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {
        "ml.machine_memory" : "16478932992",
        "rack" : "cvm_4_200003",
        "xpack.installed" : "true",
        "set" : "200003",
        "transform.node" : "false",
        "ip" : "xx.20.58.221",
        "temperature" : "hot",
        "ml.max_open_jobs" : "20",
        "region" : "4"
      },
      "indices" : {
        "merges" : {
          "current" : 0,
          "current_docs" : 0,
          "current_size_in_bytes" : 0,
          "total" : 0,
          "total_time_in_millis" : 0,
          "total_docs" : 0,
          "total_size_in_bytes" : 0,
          "total_stopped_time_in_millis" : 0,
          "total_throttled_time_in_millis" : 0,
          "total_auto_throttle_in_bytes" : 0
        }
      }
    }
  }
}

Likewise, we can view segment and translog statistics for the indices on a node:

GET /_nodes/1626803143000145632/stats/indices/segments,translog

Response:

{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "es-xxxx",
  "nodes" : {
    "fdaOV16OQPq6-AUVihmq4A" : {
      "timestamp" : 1639878746911,
      "name" : "1626803143000145632",
      "transport_address" : "xx.0.96.22:9300",
      "host" : "xx.0.96.22",
      "ip" : "xx.0.96.22:9300",
      "roles" : [
        "master",
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {
        "ml.machine_memory" : "16478932992",
        "rack" : "cvm_4_200003",
        "xpack.installed" : "true",
        "set" : "200003",
        "transform.node" : "false",
        "ip" : "xx.20.58.221",
        "temperature" : "hot",
        "ml.max_open_jobs" : "20",
        "region" : "4"
      },
      "indices" : {
        "segments" : {
          "count" : 0,
          "memory_in_bytes" : 0,
          "terms_memory_in_bytes" : 0,
          "stored_fields_memory_in_bytes" : 0,
          "term_vectors_memory_in_bytes" : 0,
          "norms_memory_in_bytes" : 0,
          "points_memory_in_bytes" : 0,
          "doc_values_memory_in_bytes" : 0,
          "index_writer_memory_in_bytes" : 0,
          "version_map_memory_in_bytes" : 0,
          "fixed_bit_set_memory_in_bytes" : 0,
          "max_unsafe_auto_id_timestamp" : -9223372036854775808,
          "file_sizes" : { }
        },
        "translog" : {
          "operations" : 0,
          "size_in_bytes" : 0,
          "uncommitted_operations" : 0,
          "uncommitted_size_in_bytes" : 0,
          "earliest_last_modified_age" : 0
        }
      }
    }
  }
}

This API can also show the index storage information for each node:

GET /_nodes/stats/indices/store

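For more flexible usage, refer to the official documentation. Besides the GET _nodes/stats API, the official ES documentation describes another API that also returns basic node statistics: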

GET /_cat/nodes

Response:

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
xx.0.96.9            30          70   0    0.01    0.05     0.05 cdhilrstw -      1626803143000145432
xx.0.96.24            2          99   1    0.11    0.09     0.07 lmr       -      1626803143000145832
xx.0.96.49           14          99   1    0.87    0.24     0.12 lmr       *      1626803143000145732
xx.0.96.13           14          70   0    0.16    0.08     0.06 cdhilrstw -      1626803143000145532
xx.0.96.22            3          99   2    0.13    0.12     0.13 lmr       -      1626803143000145632
xx.0.96.20                                                       cdhilrstw -      1626803143000145332

When we analyze performance and troubleshoot issues such as high cluster memory usage for customers, the API we use most often is the following:

GET _cat/nodes?h=name,segments.memory,segments.index_writer_memory,heap.percent,fielddata.memory_size,query_cache.memory_size,request_cache.memory_size&v

Response:

name                segments.memory segments.index_writer_memory heap.percent fielddata.memory_size query_cache.memory_size
1626803143000145832              0b                           0b            3                    0b                      0b
1626803143000145532          15.9mb                         58mb           17                 3.7kb                   3.9kb
1626803143000145332                                                                                                        
1626803143000145432          15.7mb                       50.7mb           28                 3.4kb                  10.2kb
1626803143000145732              0b                           0b           13                    0b                      0b
1626803143000145632              0b                           0b            3                    0b                      0b

With this API we can see at a glance how memory is used on each node, which is usually enough to identify the root cause of high JVM memory usage.
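When there are many nodes, it can help to rank them by these memory columns instead of reading the table by eye. The Python sketch below parses the text output into rows, converts the human-readable sizes to bytes, and sorts nodes by the sum of the four memory columns. The helper names parse_size and parse_cat_table are hypothetical, the sample reuses a few rows from the response above, and the size suffixes are assumed to be 1024-based.

```python
from typing import Dict, List

# Binary multipliers, assuming the kb/mb/gb suffixes are 1024-based.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(text: str) -> int:
    """Convert a size such as '15.9mb' to bytes."""
    text = text.strip().lower()
    # Check longer suffixes first so that 'mb' is not mistaken for 'b'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * _UNITS[suffix])
    raise ValueError(f"unrecognized size: {text!r}")


def parse_cat_table(text: str) -> List[Dict[str, str]]:
    """Parse column-aligned _cat output (requested with ?v) into row dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        if len(cells) != len(header):
            continue  # skip a node that reported no values
        rows.append(dict(zip(header, cells)))
    return rows


sample = """\
name                segments.memory segments.index_writer_memory heap.percent fielddata.memory_size query_cache.memory_size
1626803143000145532          15.9mb                         58mb           17                 3.7kb                   3.9kb
1626803143000145332
1626803143000145432          15.7mb                       50.7mb           28                 3.4kb                  10.2kb
1626803143000145632              0b                           0b            3                    0b                      0b
"""

MEMORY_COLUMNS = [
    "segments.memory",
    "segments.index_writer_memory",
    "fielddata.memory_size",
    "query_cache.memory_size",
]

rows = parse_cat_table(sample)
ranked = sorted(
    ((row["name"], sum(parse_size(row[col]) for col in MEMORY_COLUMNS)) for row in rows),
    key=lambda item: item[1],
    reverse=True,
)
for name, total in ranked:
    print(name, total)
```

Rows with missing values are skipped rather than treated as zero, so a node that failed to report does not look like an idle node.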

3. Viewing node thread pool usage

GET /_cat/thread_pool

In day-to-day cluster operations, customers often report read or write rejections, as shown in Figure 1:

Figure 1. Query rejections on the cluster

When we check the logs, we can usually see the error shown in Figure 2, which indicates that the node's search queue is full.

Figure 2. Search queue full on a cluster node

For read and write rejections, we can check thread pool usage with the following API:

GET /_cat/thread_pool/search,write?v

Response:

node_name           name   active queue rejected
1626803143000145832 search      0     0        0
1626803143000145832 write       0     0        0
1626803143000145532 search      0     0        0
1626803143000145532 write       0     0        0
1626803143000145332 search      0     0        0
1626803143000145332 write       0     0        0
1626803143000145432 search      0     0        0
1626803143000145432 write       0     0        0
1626803143000145732 search      0     0        0
1626803143000145732 write       0     0        0
1626803143000145632 search      0     0        0
1626803143000145632 write       0     0        0

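If the queue and rejected values in the response are high, the node's read/write processing capacity is approaching its limit, and this should be evaluated together with CPU usage. In our experience, read and write rejections are usually caused by high CPU usage: the node cannot keep up with incoming requests, so the search or bulk queue fills up and requests are rejected. Read and write circuit breaking, by contrast, is usually caused by high JVM usage. Different symptoms therefore need to be analyzed against different metrics.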
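The check described above can be sketched in Python as follows: parse the thread pool output and flag any node and pool whose queue or rejected count exceeds a threshold. The helper names and the default thresholds are hypothetical and should be tuned per cluster, and the non-zero counts in the sample are invented for illustration (the real response above shows all zeros).

```python
from typing import Dict, List


def parse_thread_pool(cat_output: str) -> List[Dict[str, str]]:
    """Parse the column-aligned text returned by GET /_cat/thread_pool/...?v."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def find_pressure(
    rows: List[Dict[str, str]], queue_threshold: int = 100, rejected_threshold: int = 0
) -> List[Dict[str, str]]:
    """Return rows whose queue or rejected count exceeds the given thresholds."""
    return [
        row
        for row in rows
        if int(row["queue"]) > queue_threshold or int(row["rejected"]) > rejected_threshold
    ]


# Hypothetical output: node names follow the article, the counts are invented.
sample = """\
node_name           name   active queue rejected
1626803143000145532 search     13   950     1204
1626803143000145532 write       2     0        0
1626803143000145432 search      1     0        0
"""

flagged = find_pressure(parse_thread_pool(sample))
for row in flagged:
    print(row["node_name"], row["name"], row["queue"], row["rejected"])
```

A flagged row is only a starting point; as noted above, it should be read together with the node's CPU usage before drawing conclusions.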

4. Viewing node hot threads

GET /_nodes/hot_threads
GET /_nodes/<node_id>/hot_threads

Response:

::: {1626803143000145832}{vT4YRHWdRweoouLn2fGu0g}{Pq2mklvJTvCMPlhN7OY_KQ}{xx.0.96.24}{xx.0.96.24:9300}{lmr}{ml.machine_memory=16478932992, rack=cvm_4_200003, xpack.installed=true, set=200003, transform.node=false, ip=9.20.59.20, temperature=hot, ml.max_open_jobs=20, region=4}
   Hot threads at 2021-12-19T02:21:12.334Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {1626803143000145532}{CaQnhaYpQw6vbabGwKaPTw}{X_yKVAz9RHCOUwnMEXKLvg}{xx.0.96.13}{xx.0.96.13:9300}{cdhilrstw}{ml.machine_memory=50299387904, rack=cvm_4_200003, xpack.installed=true, set=200003, transform.node=true, ip=9.20.57.70, temperature=hot, ml.max_open_jobs=20, region=4}
   Hot threads at 2021-12-19T02:21:12.335Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {1626803143000145332}{yvtpqeypTke6aFxxwYSjjA}{ZQkJVy7zQGOOY5_hAP3z_w}{xx.0.96.20}{xx.0.96.20:9300}{cdhilrstw}{ml.machine_memory=50299125760, rack=cvm_4_200003, xpack.installed=true, set=200003, transform.node=true, ip=9.20.53.190, temperature=hot, ml.max_open_jobs=20, region=4}
   Hot threads at 2021-12-19T02:21:12.337Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {1626803143000145432}{hz6BqoupSuOUuWykrX5c2g}{VoANc_CJQWylc2rxT6NJTg}{xx.0.96.9}{xx.0.96.9:9300}{cdhilrstw}{ml.machine_memory=50299387904, rack=cvm_4_200003, xpack.installed=true, set=200003, transform.node=true, ip=9.20.56.176, temperature=hot, ml.max_open_jobs=20, region=4}
   Hot threads at 2021-12-19T02:21:12.334Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {1626803143000145732}{AexOqq25T7SRf6tzMYzO1Q}{t21NHAcyQwiJKEtV85Okzg}{xx.0.96.49}{xx.0.96.49:9300}{lmr}{ml.machine_memory=16478932992, rack=cvm_4_200003, xpack.installed=true, set=200003, transform.node=false, ip=9.20.58.203, temperature=hot, ml.max_open_jobs=20, region=4}
   Hot threads at 2021-12-19T02:21:12.335Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {1626803143000145632}{fdaOV16OQPq6-AUVihmq4A}{pdyApXIKQbivHe57M5UCIA}{xx.0.96.22}{xx.0.96.22:9300}{lmr}{ml.machine_memory=16478932992, rack=cvm_4_200003, xpack.installed=true, set=200003, transform.node=false, ip=9.20.58.221, temperature=hot, ml.max_open_jobs=20, region=4}
   Hot threads at 2021-12-19T02:21:12.335Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

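We sometimes run into sustained high CPU usage. One option is to capture a flame graph to analyze in detail why CPU usage is high. Another is to call this API on the node to see what the node is currently executing; for example, it may show that the node is frequently performing merge operations, tokenizing long text, or running snapshot backups.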
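Because the hot threads output is plain text covering every node, a small script can help pick out the nodes that actually report busy threads. The Python sketch below splits the output on the ::: node headers and keeps only nodes with thread detail lines. The function name split_hot_threads is hypothetical, and the sample is illustrative: the node names, addresses, and the single thread line are made up in the general shape of this output rather than taken from a real cluster.

```python
import re
from typing import Dict, List

# A node section starts with ":::" followed by the node name in braces.
_NODE_HEADER = re.compile(r"^::: \{([^}]*)\}")


def split_hot_threads(text: str) -> Dict[str, List[str]]:
    """Group hot_threads output by node name, keeping only thread detail lines."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        match = _NODE_HEADER.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            # Skip the per-node "Hot threads at ..." summary line.
            if not line.lstrip().startswith("Hot threads at"):
                sections[current].append(line.strip())
    return sections


# Illustrative sample; every value here is invented.
sample = """\
::: {node-a}{id1}{eph1}{10.0.0.1}{10.0.0.1:9300}{lmr}{temperature=hot}
   Hot threads at 2021-12-19T02:21:12.334Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {node-b}{id2}{eph2}{10.0.0.2}{10.0.0.2:9300}{cdhilrstw}{temperature=hot}
   Hot threads at 2021-12-19T02:21:12.335Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   78.5% (392.4ms out of 500ms) cpu usage by thread 'elasticsearch[node-b][write][T#3]'
"""

busy = {name: lines for name, lines in split_hot_threads(sample).items() if lines}
for name, lines in busy.items():
    print(name, lines[0])
```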

Summary of common node commands

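In this part we covered several node-level operations commands that we use in day-to-day cluster maintenance. They are summarized in the table below: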

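| Command | Description |
| --- | --- |
| GET /_nodes, GET /_nodes/<node_id>, GET /_cat/nodes | View basic information about cluster nodes, such as node attributes, roles, and settings |
| GET /_nodes/stats, GET /_nodes/<node_id>/stats, GET /_nodes/stats/<metric> | View node statistics, such as CPU, JVM, load, and disk usage, as well as statistics for node caches, segments, merge, translog, recovery, and index storage size |
| GET /_cat/thread_pool | View the usage of each thread pool on the nodes |
| GET /_nodes/hot_threads, GET /_nodes/<node_id>/hot_threads | View node hot threads to analyze CPU and other performance issues |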
