The business world communicates, thrives, and operates on data. This lifeblood, connecting today with tomorrow, has to be kept masterfully in motion, and that is exactly where state-of-the-art workflow management lends a hand: digital processes are executed, a variety of systems are orchestrated, and data processing is automated. In this article, we show how all of this can be done comfortably with the open-source workflow management platform Apache Airflow, and explain the key functionalities, components, and most important terms for a trouble-free start.
Why implement digital workflow management?
Executing workflows by hand, or merely launching them with cron jobs, is no longer state of the art, and many companies are therefore looking for a cron alternative. As soon as digital tasks, or entire processes, need to run repeatedly and reliably, an automated solution is required. Beyond the pure execution of work steps, several other aspects matter:
**_Troubleshooting_** The true strength of a workflow management platform becomes apparent when unforeseen errors occur. Besides notifying users and pinpointing exactly where in the process an error happened, it should also document failures automatically. Ideally, a retry is triggered automatically after a given time window, so that short-lived system reachability problems resolve on their own. Task-specific system logs should be available to the user for quick troubleshooting.
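A minimal sketch of how such retry and notification behaviour might be configured in an Airflow DAG; the DAG id, schedule, alert address, and endpoint are illustrative assumptions, not taken from the article:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Retry and notification behaviour is set per task, or DAG-wide via
# default_args. All values here are illustrative.
default_args = {
    "retries": 3,                         # re-run a failed task up to three times
    "retry_delay": timedelta(minutes=5),  # wait five minutes between attempts
    "email": ["ops@example.com"],         # hypothetical alert address
    "email_on_failure": True,             # notify once all retries are exhausted
}

with DAG(
    dag_id="retry_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # A call to an external system; short outages are absorbed by the
    # automatic retries configured above.
    fetch = BashOperator(
        task_id="fetch_data",
        bash_command="curl -f https://example.com/data",  # hypothetical endpoint
    )
```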
**_Flexibility in the design of the workflow_** The modern challenges of workflow management go beyond hard-coded workflows. To let a workflow adapt dynamically to the current execution interval, for example, the execution context should be accessible through variables at runtime. Concepts such as conditions are also of growing value to users when designing flexible workflows.
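In Airflow, this runtime context is exposed through Jinja template variables such as `{{ ds }}`, the logical date of the current run. The following minimal, assumed example shows the idea:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="templating_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "{{ ds }}" is rendered at runtime to the logical date of the run
    # (e.g. "2021-01-01"), so the same task definition processes a
    # different partition on every execution interval.
    export = BashOperator(
        task_id="export_partition",
        bash_command="echo processing partition {{ ds }}",
    )
```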
**_Monitoring execution times_** A workflow management system is a central place that tracks not only the status but also the execution time of workflows. Execution times can be monitored automatically by means of service level agreements (SLAs). Unexpectedly long runtimes caused by an unusually large volume of data are thus detected and can optionally trigger a notification.
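A sketch of what SLA monitoring could look like in a DAG; the `notify_sla_miss` handler and the 30-minute threshold are hypothetical choices:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Hypothetical handler: in practice this might page an on-call
    # channel; here it only logs which tasks missed their SLA.
    print(f"SLA missed for: {task_list}")


with DAG(
    dag_id="sla_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=notify_sla_miss,
    catchup=False,
) as dag:
    # If this task has not finished within 30 minutes of the scheduled
    # time, Airflow records an SLA miss and invokes the callback above.
    load = BashOperator(
        task_id="load_data",
        bash_command="sleep 5",
        sla=timedelta(minutes=30),
    )
```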
Out of these challenges, Airflow was developed in 2014 as Airbnb's internal workflow management platform to successfully manage its numerous, complex workflows. Apache Airflow was open source from the beginning and is now freely available to users under the Apache License.
Apache Airflow Features
Since Airflow became a top-level project of the Apache Software Foundation in 2019, the contributing community has received an enormous growth boost. As a result, the feature set has grown considerably over time, with regular releases addressing the current needs of users.
Rich web interface
Compared to other workflow management platforms, the rich web interface is particularly impressive. The status of executions, the resulting runtimes and, of course, the log files are directly accessible through the elegantly designed web interface. Key functions for managing workflows, such as starting, pausing, and deleting a workflow, are available right from the start page without any detours. This ensures intuitive usability even without programming knowledge. Access works best from a desktop, but is also possible from mobile devices with some loss of comfort.
Command line interface and API
Apache Airflow is not only available for clicking. For technical users there is a command line interface (CLI) that covers the main functions as well. Through the redesigned REST API, other systems can also access Airflow via securely authenticated requests, which enables a whole range of new use cases and system integrations.
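For illustration, a request like the following could trigger a DAG run through the stable REST API of Airflow 2.x; the URL, credentials, and the DAG id `example_dag` are assumptions:

```python
import requests

# Assumed local webserver with the basic-auth API backend enabled.
AIRFLOW_URL = "http://localhost:8080"

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/example_dag/dagRuns",
    auth=("admin", "admin"),           # hypothetical credentials
    json={"conf": {"source": "api"}},  # optional run configuration
)
response.raise_for_status()
print(response.json()["dag_run_id"])
```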
Realization of complex workflows with internal and external dependencies
In Apache Airflow, workflows are defined in Python code. The order of tasks is easy to customize: predecessors, successors, and parallel tasks can all be declared. In addition to these internal dependencies, external dependencies can be implemented as well.
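A minimal sketch of such internal ordering, using illustrative task names: `extract` runs first, two transforms run in parallel, and `load` waits for both:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dependency_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform_a = BashOperator(task_id="transform_a", bash_command="echo a")
    transform_b = BashOperator(task_id="transform_b", bash_command="echo b")
    load = BashOperator(task_id="load", bash_command="echo load")

    # extract runs first, the two transforms run in parallel,
    # and load waits until both have succeeded.
    extract >> [transform_a, transform_b] >> load
```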
For example, the workflow can pause until a file appears on a cloud storage or until an SQL statement returns a valid result. Advanced features, such as the reuse of workflow parts (TaskGroups) and conditional branching, delight even demanding users.
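Waiting for a file could, for instance, be sketched with Airflow's built-in `FileSensor`; the file path and polling interval below are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="sensor_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The workflow pauses here until the file exists; poke_interval
    # controls how often (in seconds) the sensor re-checks.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/export.csv",  # hypothetical path
        poke_interval=60,
    )
    process = BashOperator(task_id="process_file", bash_command="echo processing")

    wait_for_file >> process
```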
Scalability and containerization
Apache Airflow can initially be deployed on a single server and then scale horizontally as the number of tasks grows. Deployment on distributed systems is mature, and several architecture variants (Kubernetes, Celery, Dask) are supported.
Customizability with plug-ins and macros
Many integrations with Apache Hive, the Hadoop Distributed File System (HDFS), Amazon S3, and others ship with the default installation. Further ones can be added through custom task classes. Thanks to its open-source nature, even the core of the application is customizable, and the community provides well-documented plug-ins for most requirements.
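A custom task class is, in essence, a subclass of `BaseOperator` with an `execute()` method; the following `GreetOperator` is a purely hypothetical sketch:

```python
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Hypothetical custom task class: subclass BaseOperator and
    implement execute() to integrate a system that Airflow does not
    support out of the box."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # The return value is pushed to XCom, so downstream tasks
        # can pick it up.
        message = f"Hello, {self.name}!"
        self.log.info(message)
        return message
```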