This summary was written in July 2020. Revisiting it in 2021, I found that the original kafka exporter author now spends too little time on the project and many PRs go unhandled. Fortunately the code is quite clear, so maintaining an independent fork with my own fixes is not a big problem.
I. Overview of kafka exporter
1. Project repository
- GitHub: https://github.com/danielqsj/kafka_exporter
2. Project status
The project leads competing exporters in watches, stars, and forks, and its issues and pull requests are fairly active.
As of 2020-07-07:
- v0.2.0 was recommended by the official Prometheus project
- the latest official release is v1.2.0
3. Project strengths
kafka exporter collects metrics for Brokers, Topics, and Consumer Groups through the Kafka Protocol Specification.
- Easy to use
- Simple configuration
- Easy to deploy, with Docker and Kubernetes support
- Efficient at runtime
Compared with collecting metrics through Kafka's built-in scripts, there is no JVM startup cost on every run, so collection time drops from minutes to seconds, which makes monitoring large clusters practical.
- Rich ecosystem
- Integrates with Prometheus and Grafana out of the box
- Many open-source Grafana dashboard configurations are available
4. Official Kafka project
KIP-575: build a Kafka-Exporter by Java
Kafka may eventually ship an official Kafka-Exporter.
II. Implementation analysis
1. Architecture
kafka exporter builds on a handful of open-source libraries, so it packs its functionality into very little code, roughly 600 lines. The overall architecture:
- Kingpin: a Go command-line library that parses the user's flags
- sarama (core): a Kafka client written in Go; connects to the brokers to fetch metrics and metadata
- kazoo: a ZooKeeper client written in Go; connects to Kafka's ZooKeeper ensemble, mainly used to compute lag for ZooKeeper-based consumer groups
- promhttp: serves the Prometheus HTTP endpoint that Prometheus pulls metrics from
- other components: help convert the data fetched by sarama and kazoo into the Prometheus metric format
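The conversion in the last bullet boils down to emitting the Prometheus text exposition format shown in the samples later in this article. A minimal stdlib-only sketch of that formatting (the real exporter delegates this to client_golang and promhttp; `formatMetric` is an illustrative helper, not the exporter's API):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatMetric renders one gauge sample in the Prometheus text
// exposition format, e.g. `kafka_brokers 3` or
// `kafka_topic_partitions{topic="__consumer_offsets"} 50`.
func formatMetric(name string, labels map[string]string, value float64) string {
	if len(labels) == 0 {
		return fmt.Sprintf("%s %g", name, value)
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(formatMetric("kafka_brokers", nil, 3))
	fmt.Println(formatMetric("kafka_topic_partitions",
		map[string]string{"topic": "__consumer_offsets"}, 50))
}
```

Each sample is additionally preceded by `# HELP` and `# TYPE` comment lines, as in the raw output quoted below.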
2. Metrics in detail
The metric list on GitHub lags behind the code, so some newer metrics are documented here as well.
2.1 Brokers
Name | Exposed information |
---|---|
kafka_brokers | Number of Brokers in the Kafka Cluster |
# HELP kafka_brokers Number of Brokers in the Kafka Cluster.
# TYPE kafka_brokers gauge
kafka_brokers 3
2.2 Topics
Name | Exposed information |
---|---|
kafka_topic_partitions | Number of partitions for this Topic |
kafka_topic_partition_current_offset | Current Offset of a Broker at Topic/Partition |
kafka_topic_partition_oldest_offset | Oldest Offset of a Broker at Topic/Partition |
kafka_topic_partition_in_sync_replica | Number of In-Sync Replicas for this Topic/Partition |
kafka_topic_partition_leader | Leader Broker ID of this Topic/Partition |
kafka_topic_partition_leader_is_preferred | 1 if Topic/Partition is using the Preferred Broker |
kafka_topic_partition_replicas | Number of Replicas for this Topic/Partition |
kafka_topic_partition_under_replicated_partition | 1 if Topic/Partition is under Replicated |
# HELP kafka_topic_partitions Number of partitions for this Topic
# TYPE kafka_topic_partitions gauge
kafka_topic_partitions{topic="__consumer_offsets"} 50
# HELP kafka_topic_partition_current_offset Current Offset of a Broker at Topic/Partition
# TYPE kafka_topic_partition_current_offset gauge
kafka_topic_partition_current_offset{partition="0",topic="__consumer_offsets"} 0
# HELP kafka_topic_partition_oldest_offset Oldest Offset of a Broker at Topic/Partition
# TYPE kafka_topic_partition_oldest_offset gauge
kafka_topic_partition_oldest_offset{partition="0",topic="__consumer_offsets"} 0
# HELP kafka_topic_partition_in_sync_replica Number of In-Sync Replicas for this Topic/Partition
# TYPE kafka_topic_partition_in_sync_replica gauge
kafka_topic_partition_in_sync_replica{partition="0",topic="__consumer_offsets"} 3
# HELP kafka_topic_partition_leader Leader Broker ID of this Topic/Partition
# TYPE kafka_topic_partition_leader gauge
kafka_topic_partition_leader{partition="0",topic="__consumer_offsets"} 0
# HELP kafka_topic_partition_leader_is_preferred 1 if Topic/Partition is using the Preferred Broker
# TYPE kafka_topic_partition_leader_is_preferred gauge
kafka_topic_partition_leader_is_preferred{partition="0",topic="__consumer_offsets"} 1
# HELP kafka_topic_partition_replicas Number of Replicas for this Topic/Partition
# TYPE kafka_topic_partition_replicas gauge
kafka_topic_partition_replicas{partition="0",topic="__consumer_offsets"} 3
# HELP kafka_topic_partition_under_replicated_partition 1 if Topic/Partition is under Replicated
# TYPE kafka_topic_partition_under_replicated_partition gauge
kafka_topic_partition_under_replicated_partition{partition="0",topic="__consumer_offsets"} 0
2.3 Consumer Groups
Name | Exposed information |
---|---|
kafka_consumergroup_current_offset | Current Offset of a ConsumerGroup at Topic/Partition |
kafka_consumergroup_lag | Current Approximate Lag of a ConsumerGroup at Topic/Partition (broker consumer) |
kafka_consumergroup_lag_zookeeper | Current Approximate Lag of a ConsumerGroup at Topic/Partition (zk consumer) |
# HELP kafka_consumergroup_current_offset Current Offset of a ConsumerGroup at Topic/Partition
# TYPE kafka_consumergroup_current_offset gauge
kafka_consumergroup_current_offset{consumergroup="KMOffsetCache-kafka-manager-3806276532-ml44w",partition="0",topic="__consumer_offsets"} -1
# HELP kafka_consumergroup_lag Current Approximate Lag of a ConsumerGroup at Topic/Partition
# TYPE kafka_consumergroup_lag gauge
kafka_consumergroup_lag{consumergroup="KMOffsetCache-kafka-manager-3806276532-ml44w",partition="0",topic="__consumer_offsets"} 1
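The lag above is a derived value: for each partition the exporter subtracts the group's committed offset from the partition's newest (log-end) offset, and a committed offset of -1 (as in the current_offset sample above) means the group has no commit for that partition. A minimal sketch of that arithmetic, assuming uncommitted partitions report the full partition backlog (a policy choice for this sketch, not necessarily the exporter's exact behavior):

```go
package main

import "fmt"

// consumerLag returns the approximate lag of a consumer group on one
// partition. currentOffset is the partition's newest (log-end) offset;
// committedOffset is the group's committed offset, or -1 when the
// group has never committed for this partition.
func consumerLag(currentOffset, committedOffset int64) int64 {
	if committedOffset < 0 {
		// No commit yet: count the whole partition as backlog.
		return currentOffset
	}
	lag := currentOffset - committedOffset
	if lag < 0 {
		lag = 0 // the two offsets are fetched at slightly different times
	}
	return lag
}

func main() {
	fmt.Println(consumerLag(100, 75)) // 25 messages behind
	fmt.Println(consumerLag(100, -1)) // no commit yet
}
```

The "Approximate" in the metric description comes from exactly this: the two offsets cannot be read atomically, so the difference is a snapshot, not an exact figure.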
3. Grafana dashboards
Our current Grafana dashboards are built as needed on top of these two templates:
- https://grafana.com/grafana/dashboards/7589
- https://grafana.com/grafana/dashboards/11285
III. Problems and improvements
1. chroot support in the ZooKeeper connection string
All of our production Kafka clusters use ZooKeeper connection strings with a chroot, e.g. host1:2181,host2:2181/kafka1, and a trial run showed that kafka exporter does not support this form. Reading the code revealed that kafka exporter uses the kazoo ZooKeeper library the wrong way: switching from NewKazoo to NewKazooFromConnectionString is enough to cover our scenario. This fix has been submitted to the author as a PR.
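The difference between the two constructors comes down to splitting off the chroot suffix before dialing ZooKeeper. A hand-rolled sketch of that parsing (`splitChroot` is an illustrative helper, not kazoo's API):

```go
package main

import (
	"fmt"
	"strings"
)

// splitChroot separates a ZooKeeper connection string such as
// "host1:2181,host2:2181/kafka1" into the host list and the chroot
// path. This mirrors what NewKazooFromConnectionString has to handle,
// whereas NewKazoo accepts only a bare host list.
func splitChroot(connString string) (hosts []string, chroot string) {
	if i := strings.IndexByte(connString, '/'); i >= 0 {
		chroot = connString[i:] // e.g. "/kafka1"
		connString = connString[:i]
	}
	return strings.Split(connString, ","), chroot
}

func main() {
	hosts, chroot := splitChroot("host1:2181,host2:2181/kafka1")
	fmt.Println(hosts, chroot)
}
```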
2. Multi-cluster collection
2.1 One exporter instance for multiple clusters?
A kafka exporter instance can currently only collect from a single cluster, so the natural idea was to modify the source to scrape multiple Kafka clusters. In practice, however, our clusters have very large numbers of topics, partitions, and consumer groups, close to the scale of LinkedIn's busiest internal clusters, so a Prometheus pull of a single cluster already takes fairly long (around 30s, longer with ZooKeeper lag collection enabled) and the payload is large (around 30,000 lines). Funneling several clusters through one instance would put too much pressure on Prometheus, so we dropped this approach.
2.2 Labeling kafka exporter instances
After weighing the options we settled on one kafka exporter per cluster, using the kafka.labels feature to attach a clusterId label that distinguishes each cluster's metrics. Example invocation:
./kafka_exporter --kafka.server=broker-host:9092 --kafka.labels="clusterId=203" --use.consumelag.zookeeper --zookeeper.server="zk-host:2181/kafka1" --kafka.version="2.2.0"
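With --kafka.labels set this way, every sample the instance exposes should carry the extra label, along the lines of (illustrative output; the group and topic values are assumed):

```
kafka_brokers{clusterId="203"} 3
kafka_consumergroup_lag{clusterId="203",consumergroup="my-group",partition="0",topic="my-topic"} 1
```

In Prometheus queries and Grafana variables, clusterId then works as an ordinary label selector for picking out one cluster.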
3. Deployment improvements
Packaged as a Docker image and deployed on Kubernetes.
If you are interested, feel free to follow my WeChat official account~