Elastic Data Processing with Apache Flink and Apache Pulsar -- Sijie Guo(Apache Pulsar)
More and more applications are using Flink for low-latency data processing. Flink unifies batch and stream processing using one computation engine. However in reality, in order to really unify batch and stream processing, it requires a data system offers one unified data representation for both batch and streaming data. Nowadays, streaming data is typically stored in a log storage or messaging system, while batch data is stored in distributed filesystem and object stores. That means that data scientists still need write two different computing jobs to access same data stored in different data systems.
越来越多的应用程序使用Flink进行低延迟数据处理。Flink使用一个计算引擎来统一批处理和流处理。然而在现实中,为了真正统一批处理和流处理,需要一个数据系统为批处理和流处理数据提供一个统一的数据表示。现在,流式数据通常存储在日志存储或消息传递系统中,而批处理数据存储在分布式文件系统和对象存储中。这意味着数据科学家仍然需要编写两个不同的计算作业来访问存储在不同数据系统中的相同数据。
Apache Pulsar is the next generation messaging and streaming data system. It was originally built at Yahoo, and has graduated from Apache Incubator and become a Top-Level-Project. Pulsar separates messaging serving and data storage into two layers. Such layered architecture provides high throughput and low-latency while ensuring high availability and scalability. Pulsar’s segment centric storage design along with layered architecture makes Pulsar a perfect unbounded streaming data system, which can well fit into Flink’s computation model.
Apache Pulsar是下一代消息和流数据系统。它最初是在雅虎(Yahoo)建立的,现在已经从Apache孵化器中毕业,成为一个顶级项目。Pulsar将消息服务和数据存储分为两层。这种分层体系结构提供了高吞吐量和低延迟,同时确保了高可用性和可扩展性。Pulsar的以段为中心的存储设计和分层结构使其成为一个完美的无边界流数据系统,可以很好地融入Flink的计算模型。
In this talk, Sijie Guo from Apache Pulsar PMC, will introduce Pulsar and its layered architecture and segment-centric storage, detailing how this architecture can well integrate with Flink to provide elastic unified batch and stream processing.
在本文中,来自Apache Pulsar PMC的Sijie Guo将介绍Pulsar及其分层体系结构和以段为中心的存储,详细说明该体系结构如何与Flink很好地集成,以提供弹性统一的批处理和流处理。