Running Apache Airflow on a Virtual Machine (VM) and Writing Your First DAG
🌟 Welcome, Cloud Enthusiasts! 🌟
As someone who lives and breathes the cloud, I’ve explored countless tools and platforms, but Apache Airflow remains one of my favorites when it comes to orchestrating complex workflows. It’s like the director of a blockbuster data pipeline movie, seamlessly coordinating every moving part.
In this blog, I’m stepping away from managed solutions like Cloud Composer to show you how to set up Airflow on a Google Cloud Virtual Machine (VM). Why a VM, you ask? Simple! It’s cost-effective, highly customizable, and perfect for learning or running your workflows at scale. Whether you’re new to Airflow or looking for a flexible deployment, this guide has got you covered. So, let’s roll up our sleeves and get started! 🚀
What is Apache Airflow?
Apache Airflow is an open-source platform designed to programmatically author, schedule, and monitor workflows as Directed Acyclic Graphs (DAGs). It is commonly used for orchestrating complex workflows and automating tasks across systems.
Apache Airflow is a cornerstone of modern data engineering due to its ability to automate, schedule, and manage complex workflows efficiently. Its rich variety of providers (e.g., AWS, GCP, Azure) and operators (e.g., BashOperator, PythonOperator, BigQueryOperator) make it highly versatile for orchestrating tasks across diverse systems.
Airflow’s features — like task dependency management, scalability, monitoring, and extensibility — allow data engineers to build robust and reliable data pipelines. It simplifies workflows, reduces manual intervention, and ensures reproducibility, making it indispensable for handling the increasing complexity of data processes in today’s data-driven world.
Different Ways to Use Apache Airflow
Apache Airflow offers flexible deployment options to suit various needs:
- Managed Services like Cloud Composer and AWS MWAA handle infrastructure and scaling, perfect for those focused on workflows rather than setup.
- Kubernetes provides scalability and high availability, ideal for large-scale, containerized environments.
- Virtual Machines (VMs) offer cost-effectiveness and complete control, making them great for small teams or those learning Airflow.
- Bare Metal deployments provide full hardware control but are less common today.
In this blog, we focus on Airflow on VMs, a practical and customizable option to start building reliable workflows without the complexities of managed or containerized environments.
Deploying Apache Airflow on a virtual machine is cost-effective and ideal for small teams or beginners. It allows you to:
- Customize the setup for specific needs.
- Learn Airflow’s components deeply without relying on managed services.
In this blog, we’ll walk you through the process of setting up Airflow on a VM to get you started!
Prerequisites
- An account with any cloud provider to create a VM; I am using Google Cloud here.
- VM Setup:
- A default machine with 2 vCPUs, 4 GB RAM, and 50 GB storage is enough to start (or tailor it to your Airflow needs).
- Python Installation:
sudo apt install python3 python3-pip python3-venv
- Firewall Rules:
- Allow inbound traffic on port 8080 so you can reach the Airflow web UI (an example command follows this list).
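On Google Cloud, the firewall rule can be created from the command line. A minimal sketch; the rule name and the airflow-vm network tag are placeholders, and in practice you would restrict --source-ranges to your own IP rather than opening the port to the whole internet:
gcloud compute firewall-rules create allow-airflow-web \
    --allow=tcp:8080 \
    --source-ranges=0.0.0.0/0 \
    --target-tags=airflow-vm
The rule only applies to VMs carrying the airflow-vm network tag, so add that tag to your instance (or drop --target-tags to apply the rule network-wide).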
Installing Apache Airflow
Here are the step-by-step commands for the installation.
- Create a Python virtual environment (recommended) and activate it:
python3 -m venv ~/airflow-env
source ~/airflow-env/bin/activate
Install Airflow:
Decide which version you are installing (latest or a long-term support release) and whether to include extras such as postgres, mysql, or google for the providers you need. The simplest install is:
pip3 install apache-airflow
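For a reproducible install, the Airflow project recommends pinning a version and installing against its constraints file. A sketch assuming Airflow 2.7.3 on Python 3.10 with the google extra; adjust both version numbers to match your environment:
pip3 install "apache-airflow[google]==2.7.3" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.3/constraints-3.10.txt"
The constraints file pins every transitive dependency to versions known to work together for that Airflow release.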
Starting Airflow
You can start the Airflow services individually or start everything at once. Here are the main Airflow commands.
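If you start the scheduler and webserver individually, initialize the Airflow metadata database first (the airflow standalone command shown further below does this for you). In Airflow 2.x the command is:
airflow db init
Newer 2.x releases also provide airflow db migrate, which supersedes db init.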
Start Airflow Scheduler
The scheduler orchestrates task execution. Start it with:
airflow scheduler
Start Airflow Webserver
The webserver provides the UI to manage and monitor DAGs. Start it with:
airflow webserver -p 8080
Starting Both Components Together
Use the airflow standalone command for quick local setups. It initializes the database, starts the scheduler, and launches the webserver in a single command:
airflow standalone
- Access the Airflow web UI at http://<VM_IP>:8080 and log in with the admin credentials that airflow standalone prints to the console (the generated password is also saved to standalone_admin_password.txt in your Airflow home directory). If you started the scheduler and webserver individually, create an admin user yourself with the command shown below.
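When you are not using standalone, no admin user exists yet. A minimal sketch of creating one with the Airflow CLI; the username, name, email, and password are placeholders to replace with your own values:
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com \
    --password <choose-a-strong-password>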
- Run Airflow in the Background:
- Use nohup, screen, or tmux to keep Airflow running after your SSH session disconnects.
- Example with nohup (redirecting both stdout and stderr to a log file and sending the process to the background with &):
nohup airflow standalone > airflow.log 2>&1 &
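To watch the logs or stop the background process later, something along these lines works; pkill matches against the full command line, so double-check the pattern before running it on a shared machine:
tail -f airflow.log
pkill -f "airflow standalone"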
Customizing Airflow
- The airflow.cfg file is the primary configuration file for Apache Airflow. It defines how your Airflow instance operates, with settings ranging from database connections to executor configuration, logging, and scheduling behavior.
- Customize settings such as dags_folder, base_url, or the logging paths (see the example below).
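As an illustration, here is what those settings might look like in airflow.cfg (found in your Airflow home directory, ~/airflow by default). The paths and URL are placeholders, and the exact section holding the logging options differs slightly between Airflow versions:
[core]
dags_folder = /home/airflow/dags

[logging]
base_log_folder = /home/airflow/logs

[webserver]
base_url = http://<VM_IP>:8080
Any of these can also be overridden with environment variables of the form AIRFLOW__SECTION__KEY, for example AIRFLOW__CORE__DAGS_FOLDER.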
Creating and Deploying a Simple DAG
- Now we will write a simple DAG that creates a Google Cloud Storage (GCS) bucket.
- Writing a DAG is simple: write a DAG definition file and put it in the dags/ folder.
# pip install apache-airflow-providers-google
from airflow import DAG
from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator
from datetime import datetime

# Default arguments applied to every task in the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 1,
}

# Define the DAG
with DAG(
    dag_id='create_gcs_bucket',
    default_args=default_args,
    description='A simple DAG to create a GCS bucket',
    schedule_interval=None,  # Run on demand
    start_date=datetime(2023, 1, 1),
    catchup=False,
    tags=['gcs', 'example'],
) as dag:

    # Task to create a GCS bucket
    create_bucket = GCSCreateBucketOperator(
        task_id='create_bucket',
        bucket_name='airflow-test-001',  # Replace with your desired bucket name
        location='US',  # Specify the GCS bucket location
        project_id='tt-dev-001',  # Replace with your GCP project ID
    )

    # Single-task DAG, so there are no dependencies to declare
    create_bucket
- Install the Google provider package (not needed if you installed Airflow with the google extra):
pip install apache-airflow-providers-google
- Place the DAG file in the dags_folder and refresh the UI (an example copy command follows).
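Assuming the default Airflow home of ~/airflow, whose dags_folder is ~/airflow/dags, copying the file might look like this; create_gcs_bucket.py is simply whatever name you gave the DAG file:
mkdir -p ~/airflow/dags
cp create_gcs_bucket.py ~/airflow/dags/
With default settings the scheduler only re-scans the folder every few minutes, so the DAG can take a short while to appear in the UI.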
- Test and Monitor the DAG:
- Trigger the DAG and watch its execution status in the UI. You can also test it from the command line, as shown below.
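A quick way to sanity-check the DAG from the VM before relying on the UI; these are standard Airflow CLI commands, and the date is just an example logical date:
airflow dags list
airflow tasks test create_gcs_bucket create_bucket 2023-01-01
airflow dags trigger create_gcs_bucket
Note that GCSCreateBucketOperator authenticates through Airflow's google_cloud_default connection; on a Google Cloud VM with nothing configured it typically falls back to the VM's default service account, which must have permission to create buckets.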
Check the DAG Status
Once the run succeeds, the DAG run is marked successful and the bucket has been created.
🎉 Congratulations, you did it! 🎉
By now, you’ve set up Apache Airflow on a Google Cloud VM and are ready to orchestrate your workflows like a pro. Whether you’re processing data, triggering ETL jobs, or managing machine learning pipelines, your Airflow instance is all set to handle the action.