Apache Airflow in Nutshell
What is Airflow?
Airflow is a platform to Programmatically Author, Schedule and Monitor Workflows or Data Pipeline.
What is Workflow?
§ A Sequence of tasks.
§ Schedule or Triggered an email
§ Frequently used to handle big data processing pipeline.
E.g. of Workflow
Traditional ETL Approach:
Time Based Scheduler in UNIX like Computers for Daily Extraction is required if Data is generating every day.
Source-> CRON Script-> DB
These CRON Scripts has lot of problem and not scalable hence need of Airflow comes into the picture.
Some Problems with CRON jobs:
1) Failures-> if Failure happens. (How many time? How often?)
2) Monitoring -> Success or Failure States.
3) Dependencies -> Execution Dependencies.
4) Scalability -> New Scheduler is hard to add.
5) Deployment -> Deploy New Changes.
6) Process historic data -> Rerun Historic Data.
Many Problem one Solution -> Apache Airflow
Airflow Framework to define tasks and dependencies in python, Executing, Scheduling, Distributing tasks across worker Nodes. View of present and past runs, logging feature Extensible through plugins. Nice UI, Possibility to define REST interface, Interact well with databases.
Airflow DAG (Directed Acyclic Graph) with Multiple tasks which can be executed independently or interdependently.
1) Can handle upstream | downstream dependencies.
2) Jobs can pass parameters to other jobs.
3) Handle Errors and failures.
4) Data Sensors to trigger a DAG when data arrives.
5) Job testing through airflow itself.
6) Monitoring and email alerts.
7) Community supports
8) Accessibility of log files.
9) Implementing Trigger rules.
So as you get Familiar with Apache Airflow and its benefits. Let’s write our first Program in Airflow-Python.
Writing Your First Airflow Program consists of below steps:
Step1: Importing Modules.
Step2: Default Arguments.
Step3: Instantiate a DAG.
Step5: Setting up Dependencies.
Install Python and Airflow setup on your Machine.
Step 1: Importing Modules
Import Python dependencies
Airflow schedules the job on the principal of Directed Acyclic Graph i.e. DAG.
Hence we need to import DAG. If we wrote any Python Function or Shell Script according to that we have to import Operators.
We can also import date time or in built date function of airflow.
There are various modules and functionality airflow offers so do explore further as per of your requirements.
Step 2: Default Arguments
Here, we all define some default arguments. Which can be like Owner, Dag should depend on past jobs or not, whom to mail if failure occurs.
If failure occur how many times it should retries to execute the dag in how much interval of time.
Step 3: Instantiate a DAG.
Here we are Instantiate DAG just like objects in case of classes.
We need to pass all the arguments to constructor so that it didn’t give any errors.
Schedule Interval is used for how often the tasks should get executed.
Find Below chart for better understanding.
Step 4: Tasks.
Here we can define any number of python function (or Bash Functions, etc.). Which will act as Tasks. We can write any ETL jobs inside these functions and execute as per convenience.
Step 5: Setting up Dependencies.
In which order task should be executed in.
t1.set_downstream(t2) == t1>>t2 (This means that t2 will depend on t1)
t1.set_upstream(t2) == t1<<t2 (This means that t1 will depend on t2)
for multiple task dependencies.
t1>>[t2,t3] (This means that t1 will depend on t2 and t3)
[t2,t3]<<t1 (This means that t2,t3 will depend on t1)
Follow for more Technical stories!!