1、SparkStreaming中使用Kafka的createDirectStream自己管理offset
在Spark Streaming中,目前官方推荐的方式是createDirectStream方式,但是这种方式就需要我们自己去管理offset。目前的资料大部分是通过scala来实现的,并且实现套路都是一样的,我自己根据scala的实现改成了Java的方式,后面又相应的实现。 Direct Approach 更符合Spark的思维。我们知道,RDD的概念是一个不变的,分区的数据集合。我们将kafka数据源包裹成了一个KafkaRDD,RDD里的partition 对应的数据源为kafka的partition。唯一的区别是数据在Kafka里而不是事先被放到Spark内存里。其实包括FileInputStream里也是把每个文件映射成一个RDD。
2、DirectKafkaInputDStream
Spark Streaming通过Direct Approach接收数据的入口自然是KafkaUtils.createDirectStream 了。在调用该方法时,会先创建
val kc = new KafkaCluster(kafkaParams)
KafkaCluster 这个类是真实负责和Kafka 交互的类,该类会获取Kafka的partition信息,接着会创建 DirectKafkaInputDStream,每个DirectKafkaInputDStream对应一个Topic。此时会获取每个Topic的每个Partition的offset。如果配置成smallest 则拿到最早的offset,否则拿最近的offset。
每个DirectKafkaInputDStream 也会持有一个KafkaCluster实例。
到了计算周期后,对应的DirectKafkaInputDStream .compute方法会被调用,此时做下面几个操作:
- 获取对应Kafka Partition的untilOffset。这样就确定过了需要获取数据的区间,同时也就知道了需要计算多少数据了
- 构建一个KafkaRDD实例。这里我们可以看到,每个计算周期里,DirectKafkaInputDStream 和 KafkaRDD 是一一对应的
- 将相关的offset信息报给InputInfoTracker
- 返回该RDD
3、KafkaRDD 的组成结构
KafkaRDD 包含 N(N=Kafka的partition数目)个 KafkaRDDPartition,每个KafkaRDDPartition 其实只是包含一些信息,譬如topic,offset等,真正如果想要拉数据, 是透过KafkaRDDIterator 来完成,一个KafkaRDDIterator对应一个 KafkaRDDPartition。
整个过程都是延时过程,也就是数据其实都在Kafka存着呢,直到有实际的Action被触发,才会有去kafka主动拉数据。
4、使用Java来管理offset
代码语言:javascript复制// 注意:一定要存在这个包下面
package org.apache.spark.streaming.kafka;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
import scala.collection.JavaConversions;
import scala.collection.mutable.ArrayBuffer;
import scala.util.Either;
import java.io.Serializable;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
public class JavaKafkaManager implements Serializable{
private scala.collection.immutable.Map<String, String> kafkaParams;
private KafkaCluster kafkaCluster;
public JavaKafkaManager(Map<String, String> kafkaParams) {
//TODO
this.kafkaParams = toScalaImmutableMap(kafkaParams);
kafkaCluster = new KafkaCluster(this.kafkaParams);
}
public JavaInputDStream<String> createDirectStream(
JavaStreamingContext jssc,
Map<String, String> kafkaParams,
Set<String> topics) throws SparkException {
String groupId = kafkaParams.get("group.id");
// 在zookeeper上读取offsets前先根据实际情况更新offsets
setOrUpdateOffsets(topics, groupId);
//从zookeeper上读取offset开始消费message
//TODO
scala.collection.immutable.Set<String> immutableTopics = JavaConversions.asScalaSet(topics).toSet();
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Set<TopicAndPartition>> partitionsE
= kafkaCluster.getPartitions(immutableTopics);
if (partitionsE.isLeft()){
throw new SparkException("get kafka partition failed: ${partitionsE.left.get}");
}
Either.RightProjection<ArrayBuffer<Throwable>, scala.collection.immutable.Set<TopicAndPartition>>
partitions = partitionsE.right();
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, Object>> consumerOffsetsE
= kafkaCluster.getConsumerOffsets(groupId, partitions.get());
if (consumerOffsetsE.isLeft()){
throw new SparkException("get kafka consumer offsets failed: ${consumerOffsetsE.left.get}");
}
scala.collection.immutable.Map<TopicAndPartition, Object>
consumerOffsetsTemp = consumerOffsetsE.right().get();
Map<TopicAndPartition, Object> consumerOffsets = JavaConversions.mapAsJavaMap(consumerOffsetsTemp);
Map<TopicAndPartition, Long> consumerOffsetsLong = new HashMap<TopicAndPartition, Long>();
for (TopicAndPartition key: consumerOffsets.keySet()){
consumerOffsetsLong.put(key, (Long)consumerOffsets.get(key));
}
JavaInputDStream<String> message = KafkaUtils.createDirectStream(
jssc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
String.class,
kafkaParams,
consumerOffsetsLong,
new Function<MessageAndMetadata<String, String>, String>() {
@Override
public String call(MessageAndMetadata<String, String> v) throws Exception {
return v.message();
}
});
return message;
}
/**
* 创建数据流前,根据实际消费情况更新消费offsets
* @param topics
* @param groupId
*/
private void setOrUpdateOffsets(Set<String> topics, String groupId) throws SparkException {
for (String topic: topics){
boolean hasConsumed = true;
HashSet<String> topicSet = new HashSet<>();
topicSet.add(topic);
scala.collection.immutable.Set<String> immutableTopic = JavaConversions.asScalaSet(topicSet).toSet();
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Set<TopicAndPartition>>
partitionsE = kafkaCluster.getPartitions(immutableTopic);
if (partitionsE.isLeft()){
throw new SparkException("get kafka partition failed: ${partitionsE.left.get}");
}
scala.collection.immutable.Set<TopicAndPartition> partitions = partitionsE.right().get();
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, Object>>
consumerOffsetsE = kafkaCluster.getConsumerOffsets(groupId, partitions);
if (consumerOffsetsE.isLeft()){
hasConsumed = false;
}
if (hasConsumed){// 消费过
/**
* 如果streaming程序执行的时候出现kafka.common.OffsetOutOfRangeException,
* 说明zk上保存的offsets已经过时了,即kafka的定时清理策略已经将包含该offsets的文件删除。
* 针对这种情况,只要判断一下zk上的consumerOffsets和earliestLeaderOffsets的大小,
* 如果consumerOffsets比earliestLeaderOffsets还小的话,说明consumerOffsets已过时,
* 这时把consumerOffsets更新为earliestLeaderOffsets
*/
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>>
earliestLeaderOffsetsE = kafkaCluster.getEarliestLeaderOffsets(partitions);
if (earliestLeaderOffsetsE.isLeft()){
throw new SparkException("get earliest leader offsets failed: ${earliestLeaderOffsetsE.left.get}");
}
scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>
earliestLeaderOffsets = earliestLeaderOffsetsE.right().get();
scala.collection.immutable.Map<TopicAndPartition, Object>
consumerOffsets = consumerOffsetsE.right().get();
// 可能只是存在部分分区consumerOffsets过时,所以只更新过时分区的consumerOffsets为earliestLeaderOffsets
HashMap<TopicAndPartition, Object> offsets = new HashMap<>();
Map<TopicAndPartition, Object>
topicAndPartitionObjectMap = JavaConversions.mapAsJavaMap(consumerOffsets);
for (TopicAndPartition key: topicAndPartitionObjectMap.keySet()){
Long n = (Long) topicAndPartitionObjectMap.get(key);
long earliestLeaderOffset = earliestLeaderOffsets.get(key).get().offset();
if (n < earliestLeaderOffset){
System.out.println("consumer group:"
groupId ",topic:"
key.topic() ",partition:" key.partition()
" offsets已经过时,更新为" earliestLeaderOffset);
offsets.put(key, earliestLeaderOffset);
}
}
if (!offsets.isEmpty()){
//TODO
scala.collection.immutable.Map<TopicAndPartition, Object>
topicAndPartitionLongMap = toScalaImmutableMap(offsets);
kafkaCluster.setConsumerOffsets(groupId, topicAndPartitionLongMap);
}
}else{// 没有消费过
String offsetReset = kafkaParams.get("auto.offset.reset").get().toLowerCase();
scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> leaderOffsets = null;
if ("smallest".equals(offsetReset)){
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>>
leaderOffsetsE = kafkaCluster.getEarliestLeaderOffsets(partitions);
if (leaderOffsetsE.isLeft()) {
throw new SparkException("get earliest leader offsets failed: ${leaderOffsetsE.left.get}");
}
leaderOffsets = leaderOffsetsE.right().get();
}else {
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>>
latestLeaderOffsetsE = kafkaCluster.getLatestLeaderOffsets(partitions);
if (latestLeaderOffsetsE.isLeft()){
throw new SparkException("get latest leader offsets failed: ${leaderOffsetsE.left.get}");
}
leaderOffsets = latestLeaderOffsetsE.right().get();
}
Map<TopicAndPartition, KafkaCluster.LeaderOffset>
topicAndPartitionLeaderOffsetMap = JavaConversions.mapAsJavaMap(leaderOffsets);
Map<TopicAndPartition, Object> offsets = new HashMap<>();
for (TopicAndPartition key: topicAndPartitionLeaderOffsetMap.keySet()){
KafkaCluster.LeaderOffset offset = topicAndPartitionLeaderOffsetMap.get(key);
long offset1 = offset.offset();
offsets.put(key, offset1);
}
//TODO
scala.collection.immutable.Map<TopicAndPartition, Object>
immutableOffsets = toScalaImmutableMap(offsets);
kafkaCluster.setConsumerOffsets(groupId,immutableOffsets);
}
}
}
/**
* 更新zookeeper上的消费offsets
* @param rdd
*/
public void updateZKOffsets(JavaRDD<String> rdd){
String groupId = kafkaParams.get("group.id").get();
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
for (OffsetRange offset: offsetRanges){
TopicAndPartition topicAndPartition = new TopicAndPartition(offset.topic(), offset.partition());
Map<TopicAndPartition, Object> offsets = new HashMap<>();
offsets.put(topicAndPartition, offset.untilOffset());
Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, Object>>
o = kafkaCluster.setConsumerOffsets(groupId, toScalaImmutableMap(offsets));
if (o.isLeft()){
System.out.println("Error updating the offset to Kafka cluster: ${o.left.get}");
}
}
}
/**
* java Map convert immutable.Map
* @param javaMap
* @param <K>
* @param <V>
* @return
*/
private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(java.util.Map<K, V> javaMap) {
final java.util.List<scala.Tuple2<K, V>> list = new java.util.ArrayList<>(javaMap.size());
for (final java.util.Map.Entry<K, V> entry : javaMap.entrySet()) {
list.add(scala.Tuple2.apply(entry.getKey(), entry.getValue()));
}
final scala.collection.Seq<Tuple2<K, V>> seq = scala.collection.JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(seq);
}
}
代码语言:javascript复制import org.apache.spark.SparkConf;
import org.apache.spark.SparkException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.JavaKafkaManager;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
public class KafkaManagerDemo {
public static void main(String[] args) throws SparkException, InterruptedException {
SparkConf sparkConf = new SparkConf().setAppName(KafkaManagerDemo.class.getName());
sparkConf.setMaster("local[3]");
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "5");
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaStreamingContext javaStreamingContext =
new JavaStreamingContext(javaSparkContext, Durations.seconds(5));
javaStreamingContext.sparkContext().setLogLevel("WARN");
String brokers = "localhost:9092";
String topics = "finance_test2";
String groupId = "test22";
HashSet<String> topcisSet = new HashSet<>();
topcisSet.add(topics);
Map<String,String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", brokers);
kafkaParams.put("group.id", groupId);
kafkaParams.put("auto.offset.reset", "smallest");
JavaKafkaManager javaKafkaManager = new JavaKafkaManager(kafkaParams);
JavaInputDStream<String> message
= javaKafkaManager.createDirectStream(javaStreamingContext, kafkaParams, topcisSet);
message.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
@Override
public JavaRDD<String> call(JavaRDD<String> v1) throws Exception {
return v1;
}
}).foreachRDD(new VoidFunction<JavaRDD<String>>() {
@Override
public void call(JavaRDD<String> rdd) throws Exception {
System.out.println(rdd);
if (!rdd.isEmpty()){
rdd.foreach(new VoidFunction<String>() {
@Override
public void call(String r) throws Exception {
System.out.println(r);
}
});
javaKafkaManager.updateZKOffsets(rdd);
}
}
});
javaStreamingContext.start();
javaStreamingContext.awaitTermination();
}
}
5、使用Scala来管理offset
代码语言:javascript复制package org.apache.spark.streaming.kafka
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.Decoder
import org.apache.spark.SparkException
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset
import scala.reflect.ClassTag
/**
* 自己管理offset
*/
class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {
private val kc = new KafkaCluster(kafkaParams)
/**
* 创建数据流
*/
def createDirectStream[K: ClassTag,
V: ClassTag,
KD <: Decoder[K]: ClassTag,
VD <: Decoder[V]: ClassTag](ssc: StreamingContext,
kafkaParams: Map[String, String],
topics: Set[String]): InputDStream[(K, V)] = {
val groupId = kafkaParams.get("group.id").get
// 在zookeeper上读取offsets前先根据实际情况更新offsets
setOrUpdateOffsets(topics, groupId)
//从zookeeper上读取offset开始消费message
val messages = {
val partitionsE = kc.getPartitions(topics)
if (partitionsE.isLeft)
throw new SparkException(s"get kafka partition failed: ${partitionsE.left.get}")
val partitions = partitionsE.right.get
val consumerOffsetsE = kc.getConsumerOffsets(groupId, partitions)
if (consumerOffsetsE.isLeft)
throw new SparkException(s"get kafka consumer offsets failed: ${consumerOffsetsE.left.get}")
val consumerOffsets = consumerOffsetsE.right.get
KafkaUtils.createDirectStream[K, V, KD, VD, (K, V)](
ssc, kafkaParams, consumerOffsets, (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message))
}
messages
}
/**
* 创建数据流前,根据实际消费情况更新消费offsets
* @param topics
* @param groupId
*/
private def setOrUpdateOffsets(topics: Set[String], groupId: String): Unit = {
topics.foreach(topic => {
var hasConsumed = true
val partitionsE = kc.getPartitions(Set(topic))
if (partitionsE.isLeft)
throw new SparkException(s"get kafka partition failed: ${partitionsE.left.get}")
val partitions = partitionsE.right.get
val consumerOffsetsE = kc.getConsumerOffsets(groupId, partitions)
if (consumerOffsetsE.isLeft) hasConsumed = false
if (hasConsumed) {// 消费过
/**
* 如果streaming程序执行的时候出现kafka.common.OffsetOutOfRangeException,
* 说明zk上保存的offsets已经过时了,即kafka的定时清理策略已经将包含该offsets的文件删除。
* 针对这种情况,只要判断一下zk上的consumerOffsets和earliestLeaderOffsets的大小,
* 如果consumerOffsets比earliestLeaderOffsets还小的话,说明consumerOffsets已过时,
* 这时把consumerOffsets更新为earliestLeaderOffsets
*/
val earliestLeaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
if (earliestLeaderOffsetsE.isLeft)
throw new SparkException(s"get earliest leader offsets failed: ${earliestLeaderOffsetsE.left.get}")
val earliestLeaderOffsets = earliestLeaderOffsetsE.right.get
val consumerOffsets = consumerOffsetsE.right.get
// 可能只是存在部分分区consumerOffsets过时,所以只更新过时分区的consumerOffsets为earliestLeaderOffsets
var offsets: Map[TopicAndPartition, Long] = Map()
consumerOffsets.foreach({ case(tp, n) =>
val earliestLeaderOffset = earliestLeaderOffsets(tp).offset
if (n < earliestLeaderOffset) {
println("consumer group:" groupId ",topic:" tp.topic ",partition:" tp.partition
" offsets已经过时,更新为" earliestLeaderOffset)
offsets = (tp -> earliestLeaderOffset)
}
})
if (!offsets.isEmpty) {
kc.setConsumerOffsets(groupId, offsets)
}
} else {// 没有消费过
val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
var leaderOffsets: Map[TopicAndPartition, LeaderOffset] = null
if (reset == Some("smallest")) {
val leaderOffsetsE = kc.getEarliestLeaderOffsets(partitions)
if (leaderOffsetsE.isLeft)
throw new SparkException(s"get earliest leader offsets failed: ${leaderOffsetsE.left.get}")
leaderOffsets = leaderOffsetsE.right.get
} else {
val leaderOffsetsE = kc.getLatestLeaderOffsets(partitions)
if (leaderOffsetsE.isLeft)
throw new SparkException(s"get latest leader offsets failed: ${leaderOffsetsE.left.get}")
leaderOffsets = leaderOffsetsE.right.get
}
val offsets = leaderOffsets.map {
case (tp, offset) => (tp, offset.offset)
}
kc.setConsumerOffsets(groupId, offsets)
}
})
}
/**
* 更新zookeeper上的消费offsets
* @param rdd
*/
def updateZKOffsets(rdd: RDD[(String, String)]) : Unit = {
val groupId = kafkaParams.get("group.id").get
val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
for (offsets <- offsetsList) {
val topicAndPartition = TopicAndPartition(offsets.topic, offsets.partition)
val o = kc.setConsumerOffsets(groupId, Map((topicAndPartition, offsets.untilOffset)))
if (o.isLeft) {
println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
}
}
}
}
代码语言:javascript复制import kafka.serializer.StringDecoder
import org.apache.spark.rdd.RDD
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaManager
import org.apache.spark.streaming.{Seconds, StreamingContext}
object SparkKafkaStreaming {
/* def dealLine(line: String): String = {
val list = line.split(',').toList
// val list = AnalysisUtil.dealString(line, ',', '"')// 把dealString函数当做split即可
list.get(0).substring(0, 10) "-" list.get(26)
}*/
def processRdd(rdd: RDD[(String, String)]): Unit = {
val lines = rdd.map(_._2).map(x => (1,1)).reduceByKey(_ _)
/*val words = lines.map(_.split(" "))
val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ _)*/
lines.foreach(println)
}
def main(args: Array[String]) {
if (args.length < 3) {
System.err.println(
s"""
|Usage: DirectKafkaWordCount <brokers> <topics> <groupid>
| <brokers> is a list of one or more Kafka brokers
| <topics> is a list of one or more kafka topics to consume from
| <groupid> is a consume group
|
""".stripMargin)
System.exit(1)
}
Logger.getLogger("org").setLevel(Level.WARN)
val Array(brokers, topics, groupId) = args
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
sparkConf.setMaster("local[3]")
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "5")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val ssc = new StreamingContext(sparkConf, Seconds(5))
ssc.sparkContext.setLogLevel("WARN")
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String](
"metadata.broker.list" -> brokers,
"group.id" -> groupId,
"auto.offset.reset" -> "smallest"
)
val km = new KafkaManager(kafkaParams)
val messages = km.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topicsSet)
messages.foreachRDD(rdd => {
if (!rdd.isEmpty()) {
// 先处理消息
processRdd(rdd)
// 再更新offsets
km.updateZKOffsets(rdd)
}
})
ssc.start()
ssc.awaitTermination()
}
}