kafka exporter: Evaluation and Improvements

2022-02-28 20:16:19

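This write-up dates from July 2020. Revisiting it in 2021, the original author of kafka exporter now spends very little time on the project and many PRs remain unhandled. Fortunately the code is fairly clear, so maintaining a separate branch and fixing things yourself is not much of a problem.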

I. Overview of kafka exporter

1. Project URL

  • GitHub: https://github.com/danielqsj/kafka_exporter

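2. Project status

The project leads its competitors in watch, star, and fork counts, and its issues and pull requests are fairly active.

As of 2020-07-07:

  • v0.2.0 has been officially recommended by the Prometheus project
  • The latest official release is v1.2.0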

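3. Project advantages

kafka exporter collects metrics on brokers, topics, and consumer groups through the Kafka protocol (Kafka Protocol Specification).

  • Simple to use
  • Simple to configure
  • Easy to deploy, with Docker and k8s support
  • Efficient at runtime

Compared with the older approach of collecting metrics through the scripts bundled with Kafka, there is no JVM startup cost on every script invocation, so collection time drops from minutes to seconds, which makes it practical for monitoring large clusters.

  • Rich ecosystem
  • Integrates seamlessly with Prometheus and Grafana
  • Many open-source Grafana dashboard configurations are available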

4. Official Kafka project

KIP-575: build a Kafka-Exporter by Java

Kafka may ship an official Kafka-Exporter in the future.

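II. Implementation analysis

1. Architecture

kafka exporter relies heavily on open-source libraries, so it is capable yet very small, only 600 lines of code. The rough architecture is:

  • Kingpin: a Go command-line library that handles the arguments supplied by the user
  • sarama (core): a Go Kafka client that connects to the brokers to fetch metrics and metadata
  • kazoo: a Go ZooKeeper client that connects to Kafka's ZK ensemble, used mainly to compute lag for ZK-based consumer groups
  • promhttp: provides the Prometheus HTTP server from which Prometheus pulls metrics
  • Other components: help convert the metrics obtained by sarama and kazoo into the Prometheus data format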
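The division of labor above (a client gathers numbers, a thin layer turns them into the Prometheus text format, an HTTP handler serves them on pull) can be illustrated with a small standard-library Python sketch. This is not how kafka exporter itself is written (it is Go, built on sarama and promhttp); `render_metric`, `collect`, `MetricsHandler`, `serve`, the port, and the hard-coded value are all hypothetical and chosen only for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metric(name, help_text, samples, metric_type="gauge"):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort label names so the output is deterministic.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Stand-in for the Kafka client: returns a hard-coded example value."""
    return render_metric(
        "kafka_brokers", "Number of Brokers in the Kafka Cluster.", [({}, 3)]
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics (pull model)."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9308):
    """Blocks forever; the port is an assumed value for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` would start the endpoint; `collect()` alone returns the text that a scrape would receive.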

2. Metric details

The metric descriptions on GitHub lag somewhat behind, so descriptions of some newer metrics are added here.

2.1 Brokers

| Name | Exposed information |
| --- | --- |
| kafka_brokers | Number of Brokers in the Kafka Cluster |

Example output:
# HELP kafka_brokers Number of Brokers in the Kafka Cluster.
# TYPE kafka_brokers gauge
kafka_brokers 3

2.2 Topics

| Name | Exposed information |
| --- | --- |
| kafka_topic_partitions | Number of partitions for this Topic |
| kafka_topic_partition_current_offset | Current Offset of a Broker at Topic/Partition |
| kafka_topic_partition_oldest_offset | Oldest Offset of a Broker at Topic/Partition |
| kafka_topic_partition_in_sync_replica | Number of In-Sync Replicas for this Topic/Partition |
| kafka_topic_partition_leader | Leader Broker ID of this Topic/Partition |
| kafka_topic_partition_leader_is_preferred | 1 if Topic/Partition is using the Preferred Broker |
| kafka_topic_partition_replicas | Number of Replicas for this Topic/Partition |
| kafka_topic_partition_under_replicated_partition | 1 if Topic/Partition is under Replicated |

Example output:
# HELP kafka_topic_partitions Number of partitions for this Topic
# TYPE kafka_topic_partitions gauge
kafka_topic_partitions{topic="__consumer_offsets"} 50

# HELP kafka_topic_partition_current_offset Current Offset of a Broker at Topic/Partition
# TYPE kafka_topic_partition_current_offset gauge
kafka_topic_partition_current_offset{partition="0",topic="__consumer_offsets"} 0

# HELP kafka_topic_partition_oldest_offset Oldest Offset of a Broker at Topic/Partition
# TYPE kafka_topic_partition_oldest_offset gauge
kafka_topic_partition_oldest_offset{partition="0",topic="__consumer_offsets"} 0

# HELP kafka_topic_partition_in_sync_replica Number of In-Sync Replicas for this Topic/Partition
# TYPE kafka_topic_partition_in_sync_replica gauge
kafka_topic_partition_in_sync_replica{partition="0",topic="__consumer_offsets"} 3

# HELP kafka_topic_partition_leader Leader Broker ID of this Topic/Partition
# TYPE kafka_topic_partition_leader gauge
kafka_topic_partition_leader{partition="0",topic="__consumer_offsets"} 0

# HELP kafka_topic_partition_leader_is_preferred 1 if Topic/Partition is using the Preferred Broker
# TYPE kafka_topic_partition_leader_is_preferred gauge
kafka_topic_partition_leader_is_preferred{partition="0",topic="__consumer_offsets"} 1

# HELP kafka_topic_partition_replicas Number of Replicas for this Topic/Partition
# TYPE kafka_topic_partition_replicas gauge
kafka_topic_partition_replicas{partition="0",topic="__consumer_offsets"} 3

# HELP kafka_topic_partition_under_replicated_partition 1 if Topic/Partition is under Replicated
# TYPE kafka_topic_partition_under_replicated_partition gauge
kafka_topic_partition_under_replicated_partition{partition="0",topic="__consumer_offsets"} 0
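
Several of the topic gauges above are simple functions of partition metadata. The sketch below shows one plausible derivation, assuming the usual Kafka conventions that the preferred leader is the first broker in the replica list and that a partition is under-replicated when its in-sync replica set is smaller than its replica set. `PartitionMeta` and `partition_gauges` are hypothetical names for illustration, not part of kafka exporter.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PartitionMeta:
    """Metadata for one topic partition, as a Kafka client would report it."""
    leader: int          # broker ID of the current leader
    replicas: List[int]  # assigned replica broker IDs, preferred leader first
    isr: List[int]       # broker IDs currently in sync


def partition_gauges(meta: PartitionMeta) -> Dict[str, int]:
    """Derive the per-partition gauge values from the metadata."""
    # Assumption: the first assigned replica is the preferred leader.
    preferred = bool(meta.replicas) and meta.leader == meta.replicas[0]
    # Assumption: fewer in-sync replicas than assigned replicas means under-replicated.
    under_replicated = len(meta.isr) < len(meta.replicas)
    return {
        "kafka_topic_partition_leader": meta.leader,
        "kafka_topic_partition_replicas": len(meta.replicas),
        "kafka_topic_partition_in_sync_replica": len(meta.isr),
        "kafka_topic_partition_leader_is_preferred": int(preferred),
        "kafka_topic_partition_under_replicated_partition": int(under_replicated),
    }


# A fully replicated partition led by its preferred broker.
healthy = partition_gauges(PartitionMeta(leader=0, replicas=[0, 1, 2], isr=[0, 1, 2]))
# Broker 0 has dropped out: leadership moved to broker 1 and the ISR shrank.
degraded = partition_gauges(PartitionMeta(leader=1, replicas=[0, 1, 2], isr=[1, 2]))
```

The `healthy` case matches the sample output above (3 replicas, 3 in sync, preferred leader, not under-replicated), while `degraded` flips both flags.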

2.3 Consumer Groups

| Name | Exposed information |
| --- | --- |
| kafka_consumergroup_current_offset | Current Offset of a ConsumerGroup at Topic/Partition |
| kafka_consumergroup_lag | Current Approximate Lag of a ConsumerGroup at Topic/Partition (broker consumer) |
| kafka_consumergroupzookeeper_lag_zookeeper | Current Approximate Lag of a ConsumerGroup at Topic/Partition (zk consumer) |

Example output:
# HELP kafka_consumergroup_current_offset Current Offset of a ConsumerGroup at Topic/Partition
# TYPE kafka_consumergroup_current_offset gauge
kafka_consumergroup_current_offset{consumergroup="KMOffsetCache-kafka-manager-3806276532-ml44w",partition="0",topic="__consumer_offsets"} -1

# HELP kafka_consumergroup_lag Current Approximate Lag of a ConsumerGroup at Topic/Partition
# TYPE kafka_consumergroup_lag gauge
kafka_consumergroup_lag{consumergroup="KMOffsetCache-kafka-manager-3806276532-ml44w",partition="0",topic="__consumer_offsets"} 1
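
The lag gauge is described as approximate because it combines two values read at different moments: the partition's latest offset and the group's committed offset. A minimal sketch of that subtraction follows. Treating a committed offset of -1 (nothing committed yet) as "lag unknown" and clamping negative results to zero are assumptions of this sketch; the sample output above, where a committed offset of -1 against a partition offset of 0 yields a lag of 1, suggests the exporter version used here simply subtracts. `consumer_lag` and `group_lag` are hypothetical helpers.

```python
from typing import Dict, Optional


def consumer_lag(latest_offset: int, committed_offset: int) -> Optional[int]:
    """Lag of one consumer group on one partition, or None if unknown."""
    if committed_offset < 0:
        # Assumption of this sketch: -1 means the group has not committed yet.
        return None
    # The two offsets are sampled at different times, so the difference can
    # briefly go negative; clamp it to zero.
    return max(latest_offset - committed_offset, 0)


def group_lag(latest: Dict[int, int], committed: Dict[int, int]) -> Dict[int, Optional[int]]:
    """Per-partition lag for one consumer group on one topic."""
    return {
        partition: consumer_lag(latest[partition], offset)
        for partition, offset in committed.items()
        if partition in latest
    }


# Made-up offsets for three partitions of one topic.
lags = group_lag(latest={0: 120, 1: 80, 2: 15}, committed={0: 100, 1: 80, 2: -1})
```

Here partition 0 is 20 messages behind, partition 1 is caught up, and partition 2 has no committed offset.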

3. Grafana dashboards

Our Grafana dashboards are currently built as needed on top of these two templates:

  • https://grafana.com/grafana/dashboards/7589
  • https://grafana.com/grafana/dashboards/11285

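III. Problems and improvements

1. chroot support in the ZK connection string

The ZooKeeper ensembles behind our production Kafka clusters all use a chroot, for example host1:2181,host2:2181/kafka1, and a trial run showed that kafka exporter does not support this form. Reading the code, we found that kafka exporter does not use the kazoo ZK library quite correctly: calling NewKazooFromConnectionString instead of NewKazoo is enough to handle our setup. This fix has been submitted to the author as a PR.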
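
A chroot-aware connection string has to be split into a host list and a path suffix before it is handed to a ZooKeeper client. The sketch below shows that split in Python, for illustration only; it reflects the kind of parsing a connection-string constructor has to perform, but `parse_zk_connection_string` is a hypothetical function and not kazoo's code.

```python
from typing import List, Tuple


def parse_zk_connection_string(conn: str) -> Tuple[List[str], str]:
    """Split 'host1:2181,host2:2181/chroot' into (hosts, chroot).

    The chroot applies to the whole ensemble, so it appears once, after the
    last host. It is returned as '' when absent.
    """
    hosts_part, slash, chroot = conn.partition("/")
    hosts = [h.strip() for h in hosts_part.split(",") if h.strip()]
    if not hosts:
        raise ValueError(f"no ZooKeeper hosts in {conn!r}")
    # Normalise: leading slash, no trailing slash, '' if nothing follows the '/'.
    chroot = "/" + chroot.strip("/") if slash and chroot.strip("/") else ""
    return hosts, chroot


hosts, chroot = parse_zk_connection_string("host1:2181,host2:2181/kafka1")
```

With the connection string from the text, `hosts` holds the two `host:port` entries and `chroot` is `/kafka1`; treating the whole string as a plain host list would instead produce a bogus host named `host2:2181/kafka1`.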

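2. Collecting from multiple Kafka clusters

2.1 One process instance for multiple clusters?

A kafka exporter instance can currently collect from only one cluster, so the natural first idea was to modify the source so that one instance collects metrics from several Kafka clusters. In trials, however, we found that our clusters have an extremely large number of topics, partitions, and groups, close to the level of LinkedIn's busiest internal cluster. As a result, a Prometheus pull of a single cluster's metrics takes a long time (around 30 s, and longer with ZK lag collection enabled) and the payload is large (around 30,000 lines). If the metrics of several clusters were aggregated in one instance, the pressure on Prometheus would be heavy, so we dropped this approach.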

2.2 Labeling kafka exporter instances

After weighing the trade-offs, we decided to run one kafka exporter per cluster and to use the kafka.labels feature to attach a clusterId that distinguishes the metrics of different clusters. For example:

./kafka_exporter --kafka.server=broker-host:9092 --kafka.labels="clusterId=203" --use.consumelag.zookeeper --zookeeper.server="zk-host:2181/kafka1" --kafka.version="2.2.0"
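
To make the effect of the clusterId label concrete, the sketch below parses a comma-separated key=value string and merges the result into a sample's label set, which is roughly what constant labels do in a Prometheus client. The parsing rules (split on the first `=`, skip entries without one) and the names `parse_const_labels` and `with_const_labels` are assumptions for illustration, not kafka exporter's actual code.

```python
from typing import Dict


def parse_const_labels(spec: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, ignoring malformed entries."""
    labels: Dict[str, str] = {}
    for item in spec.split(","):
        key, sep, value = item.partition("=")
        if sep and key.strip():
            labels[key.strip()] = value.strip()
    return labels


def with_const_labels(name: str, labels: Dict[str, str], value: float,
                      const: Dict[str, str]) -> str:
    """Render one sample line with the constant labels merged in."""
    # Sketch choice: on a name clash the per-sample label wins.
    merged = {**const, **labels}
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(merged.items()))
    return f"{name}{{{pairs}}} {value}"


const = parse_const_labels("clusterId=203")
line = with_const_labels(
    "kafka_topic_partitions", {"topic": "__consumer_offsets"}, 50, const
)
```

Every series scraped from that instance then carries `clusterId="203"`, so queries and dashboards can filter or group by cluster even though each exporter only knows about its own.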

3. Deployment improvements

The exporter is packaged as a Docker image and deployed on k8s.
