It seems like almost every data-heavy Python shop is using Airflow in some way these days. It shouldn't take much time in Airflow's interface to figure out why: Airflow is the missing piece data engineers need to standardize the creation of ETL pipelines. The best part of Airflow, of course, is that it's one of the rare projects donated to the Apache Foundation which is written in Python. If you happen to be a data engineer who isn't using Airflow (or an equivalent) yet, you're in for a treat. It won't take much time using Airflow before you wonder how you ever managed without it.

What's the Point of Airflow?

Airflow provides countless benefits to those in the pipeline business. It isn't too crazy to group these benefits into two main categories: code quality and visibility.

Airflow gives us a better way to build data pipelines by serving as a sort of "framework" for creating them. In the same way a web framework helps developers by abstracting common patterns, Airflow provides data engineers with tools that trivialize certain repetitive aspects of pipeline creation. Airflow also comes with numerous powerful integrations that serve almost any need when it comes to outputting data. By leveraging these tools, engineers see their pipelines abiding by a well-understood format, which makes their code readable to others.

The more obvious benefits of Airflow center around its powerful GUI. Wrangling multiple failure-prone pipelines might be the least glorious aspect of any data engineer's job. By creating our pipelines within Airflow, we gain immediate visibility across all of them, letting us quickly spot areas of failure. Even more impressive, the code we write is visually represented in Airflow's GUI: not only can we check the heartbeat of our pipelines, we can also view graphical representations of the very code we write.

To get started with Airflow, we should stop throwing the word "pipeline" around. Airflow refers to what we've been calling "pipelines" as DAGs (directed acyclic graphs).

What is a DAG?

The OG Dag

In computer science, a directed acyclic graph is a workflow that only flows in a single direction. Each "step" in the workflow (a node, also called a vertex) is reached via the steps before it, and the workflow can never loop back on itself; the connections between nodes are called edges. If this remains unclear, consider how nodes in a tree data structure relate to one another: every node has a "parent" node, which of course means that a child node cannot be its parent's parent. That's it - there's no need for fancy language here.

Nodes in a DAG can have numerous "child" nodes. Interestingly, a "child" node can also have multiple parents (this is where our tree analogy fails us). Here's an example:

An example DAG structure

In the above example, the DAG begins with nodes 1, 2, and 3 kicking things off. At various points in the pipeline, information is consolidated or broken out.

We'll dig deeper into DAGs, but first, let's install Airflow.
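Before installing anything, the branching-and-merging structure described above can be sketched with nothing but Python's standard library. This is a minimal sketch, not Airflow code: the task names (`extract_a`, `merge`, and so on) are hypothetical stand-ins for real pipeline steps, and the parent/child relationships mirror the fan-out and fan-in described in the example.

```python
# A toy model of a DAG: each node maps to the set of its parents.
# Note that "merge" and "load" each have multiple parents, which is
# exactly where the tree analogy breaks down.
from graphlib import TopologicalSorter

dag = {
    "extract_a": set(),                     # three nodes kick things off
    "extract_b": set(),
    "extract_c": set(),
    "merge": {"extract_a", "extract_b"},    # information is consolidated
    "transform": {"extract_c"},
    "load": {"merge", "transform"},         # fan-in again before loading
}

# A valid execution order always runs parents before their children.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Running a DAG is essentially computing an order like this one; schedulers such as Airflow do the same bookkeeping (plus retries, scheduling, and parallelism) for us.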