Build a Table-centric Apache Flink Ecosystem -- Shaoxuan Wang(Alibaba)
Flink Table API was initially created to address the relational query use case. It has been a good addition to DataStream and DataSet API for users to write declarative queries. Moreover, Table API provides a unified API for batch and stream processing. We have been exploring extending the capability of Flink Table API to go beyond the classical relational query. With these work, we are establishing an ecosystem on top of the Table API. This talk will introduce the following enhancements we have made on Table API to expand its horizon. Most of the work has been or will be contributed back to Apache Flink. We will also share our experience of building an ecosystem around Flink Table API, and our vision for Table API in the future.
Flink Table API 最初是为解决关系查询用例而创建的。它是对数据流和数据集API的一个很好的添加,用户可以编写声明性查询。此外,表API为批处理和流处理提供了统一的API。我们一直在探索扩展Flink Table API的功能,使其超越传统的关系查询。通过这些工作,我们将在 Table API之上建立一个生态系统。本文将介绍我们在 Table API上所做的以下增强,以扩展其范围。大部分的工作已经或将要贡献给阿帕奇弗林克。我们还将分享我们围绕Flink Table API构建生态系统的经验,以及我们未来对Table API的愿景。
Non-relational processing API 非关系处理API
Relational query is natively supported by Table API. It is also very powerful to express complicated computation logic. However, non-relational API become handy to perform a general purpose computation. We have introduced a set of non-relational methods, such as map() and flatMap(), to Table API in a systematic manner to improve the user experience in general.
Table API本机支持关系查询。表示复杂的计算逻辑也非常强大。然而,非关系API在执行通用计算时变得很方便。我们以系统的方式向 Table API引入了一组非关系方法,如map()和flatmap(),以提高一般用户体验。
Interactive programming 交互式编程
Ad-hoc queries is a pretty common use case for processing engines, especially for batch processing. In order to meet the requirements for such use cases, we introduced interactive programming to Table API, which allows users to cache the intermediate result. We envision the underlying service, which caches the intermediate Flink Table, will grow significantly to provide more sophisticated capabilities.
Ad-hoc查询是处理引擎很普遍的应用,特别是批处理引擎的一个非常常见的用例。为了满足这些用例的需求,我们在 Table API中引入了交互式编程,允许用户缓存中间结果。我们设想,缓存中间Flink Table 的底层服务将显著增长,以提供更复杂的功能。
Iterative processing 交互处理
Compared with DataSet and DataStream, one thing missing from Table is native iteration support. Instead of naively copying the native iteration API from DataSet / DataStream, we designed a new API to address the caveats that we have seen in the existing iteration support in DataStream and DataSet.
与数据集和数据流相比,表中缺少的一件事是本机迭代支持。我们没有天真地从数据集/数据流复制本机迭代API,而是设计了一个新的API来解决我们在数据流和数据集的现有迭代支持中看到的警告。
ML on Table API
One important part of the Flink ecosystem is ML. We have proposed to build a ML on top of Table API, so that the algorithm engineers can also benefit from the optimizations provided by Flink, in both batch and stream jobs.
Flink 生态系统的一个重要部分是ML。我们建议在 Table API的基础上构建一个ML,这样算法工程师也可以从Flink提供的批处理和流作业优化中受益。