Airflow UI Plugin for monitoring DAG failures and SLA misses

Airflow Watcher 👁️

An Airflow UI plugin for monitoring DAG failures and SLA misses/delays.

Features

  • 🚨 DAG Failure Monitoring: Real-time tracking of DAG and task failures
  • ⏰ SLA Miss Detection: Alerts when DAGs miss their SLA deadlines
  • 📊 Dashboard View: Custom Airflow UI view for monitoring status
  • 🔔 Multi-channel Notifications: Slack, Email, and PagerDuty alerts
  • 📈 Trend Analysis: Historical failure and SLA miss trends
  • 📡 Metrics Export: StatsD/Datadog and Prometheus support
  • ⚙️ Flexible Alert Rules: Pre-defined templates or custom rules

Installation

📖 See INSTALL.md for detailed installation and configuration instructions.

Alerting & Monitoring

📖 See ALERTING.md for complete alerting configuration:

  • Slack - Rich notifications with blocks
  • Email - SMTP-based alerts
  • PagerDuty - Incident management with deduplication
  • StatsD/Datadog - Real-time metrics
  • Prometheus - /metrics endpoint for scraping

Quick Setup

# Slack alerts
export AIRFLOW_WATCHER_SLACK_WEBHOOK_URL="https://hooks.slack.com/..."

# PagerDuty (optional)
export AIRFLOW_WATCHER_PAGERDUTY_ROUTING_KEY="your-key"

# Choose alert template
export AIRFLOW_WATCHER_ALERT_TEMPLATE="production_balanced"
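
Before starting the webserver, it can help to confirm that the variables from the Quick Setup block are actually present in the environment. The following is a minimal pre-flight sketch, not part of the plugin: the helper name missing_settings is hypothetical, and treating the Slack webhook and alert template as required is an assumption (the text above only marks PagerDuty as optional).

```python
import os

# Names copied from the Quick Setup block above.
# Assumption: Slack and the alert template are treated as required here.
REQUIRED = ("AIRFLOW_WATCHER_SLACK_WEBHOOK_URL", "AIRFLOW_WATCHER_ALERT_TEMPLATE")
OPTIONAL = ("AIRFLOW_WATCHER_PAGERDUTY_ROUTING_KEY",)


def missing_settings(env, names=REQUIRED):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


# Example mapping; pass os.environ to check the real environment instead.
example_env = {"AIRFLOW_WATCHER_SLACK_WEBHOOK_URL": "https://hooks.slack.com/..."}
missing = missing_settings(example_env)
print(missing)
```

Calling missing_settings(os.environ) in a deployment script gives an early, readable failure instead of silently missing alerts.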

Usage

Once installed, the plugin will automatically:

  1. Register with Airflow's plugin system
  2. Add a "Watcher" menu item to the Airflow UI
  3. Start monitoring DAG failures and SLA misses

Watcher Menu

Navigate to Watcher in the Airflow UI navigation to access:

  • Airflow Dashboard - Overview metrics
  • Airflow Health - DAG health status (success/failed/delayed/stale)
  • DAG Scheduling - Queue and pool utilization
  • DAG Failures - Recent failures with details
  • SLA Tracker - SLA misses and delays
  • Task Health - Long-running and zombie tasks
  • Dependencies - Cross-DAG dependency tracking

Role-Based Access Control (RBAC)

Airflow Watcher integrates with Airflow's built-in FAB security manager to enforce DAG-level access control. No separate configuration is needed: it reads directly from Airflow's role and permission system.

How It Works

  • Admin / Op roles see all DAGs across every Watcher page and API endpoint
  • Custom roles only see DAGs they have can_read permission on
  • Filtering is mandatory and applied server-side, so restricted users cannot bypass it
  • Aggregate stats (failure counts, SLA misses, health scores) are recomputed per user, so no global data leaks
  • A 🔒 badge appears in the filter bar for non-admin users
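
The filtering rules above can be illustrated with a small, self-contained sketch. This is not the plugin's code: FailureRecord, visible_failures, and failure_count are hypothetical names, the task ids are made up, and the set of readable DAG ids stands in for whatever Airflow's security manager reports for the current user. The point it shows is that aggregates are computed from the already-filtered rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    """Hypothetical row describing one failed task in a DAG."""
    dag_id: str
    task_id: str


def visible_failures(failures, readable_dag_ids, sees_all=False):
    """Return only the failures the current user may see.

    sees_all models an Admin/Op user; readable_dag_ids models the DAGs a
    custom role has can_read on.
    """
    if sees_all:
        return list(failures)
    return [f for f in failures if f.dag_id in readable_dag_ids]


def failure_count(failures, readable_dag_ids, sees_all=False):
    """Recompute the aggregate from the filtered rows, never the global set."""
    return len(visible_failures(failures, readable_dag_ids, sees_all))


ALL_FAILURES = [
    FailureRecord("weather_data_pipeline", "fetch"),
    FailureRecord("ecommerce_sales_etl", "load"),
    FailureRecord("ecommerce_sales_etl", "transform"),
]

weather_count = failure_count(ALL_FAILURES, {"weather_data_pipeline"})
admin_count = failure_count(ALL_FAILURES, set(), sees_all=True)
print(weather_count, admin_count)  # prints "1 3"
```

Because the count is derived after filtering, a restricted user cannot infer how many failures exist in DAGs outside their role.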

Setting Up DAG-Level Permissions

Add access_control to your DAG definitions to grant team-specific access:

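from datetime import datetime

from airflow import DAG

# Grant the team_weather role read and edit access to this DAG only.
dag = DAG(
    dag_id="weather_data_pipeline",
    schedule="@hourly",  # replaces the deprecated schedule_interval (Airflow 2.4+)
    start_date=datetime(2024, 1, 1),  # example value; adjust to your pipeline
    access_control={
        "team_weather": {"can_read", "can_edit"},
    },
)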

Then create matching roles in Airflow (Admin → Security → List Roles) and assign users to them. The Watcher plugin will automatically pick up the permissions.

What Gets Filtered

| Area              | Filtering                                                                  |
|-------------------|----------------------------------------------------------------------------|
| Dashboard stats   | Failure count, SLA misses, health score, all scoped to the user's DAGs     |
| Failures page     | Only failures from accessible DAGs                                         |
| SLA page          | Only SLA misses from accessible DAGs                                       |
| Health page       | Health status, stale DAGs, scheduling lag, filtered to accessible DAGs     |
| Task health       | Long-running tasks, zombies, retries, filtered to accessible DAGs          |
| Scheduling        | Concurrent runs, delayed DAGs, filtered to accessible DAGs                 |
| Dependencies      | Cross-DAG deps, correlations, filtered to accessible DAGs                  |
| All API endpoints | Same RBAC enforcement as UI pages                                          |

Demo Users

The demo environment includes pre-configured RBAC users:

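| User           | Role           | Visible DAGs                                   |
|----------------|----------------|------------------------------------------------|
| admin          | Admin          | All 8 DAGs                                     |
| weather_user   | team_weather   | weather_data_pipeline, stock_market_collector  |
| ecommerce_user | team_ecommerce | ecommerce_sales_etl, data_quality_checks       |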

Passwords are configured in demo/docker-compose.yml. Change them before any shared deployment.

cd demo
docker-compose up -d
# Visit http://localhost:8080 and login as any user above

Architecture

+--------------------------------------------------------------+
|                   Airflow Webserver                          |
|                                                              |
|  +--------------------------------------------------------+  |
|  |              Airflow Watcher Plugin                    |  |
|  |                                                        |  |
|  |  +--------------+    +------------------------------+  |  |
|  |  | Flask Views  |    |        Monitors (6)          |  |  |
|  |  | (Dashboard)  |<---|  - DAG Failure Monitor       |  |  |
|  |  |              |    |  - SLA Monitor               |  |  |
|  |  | REST API     |    |  - Task Health Monitor       |  |  |
|  |  | /api/watcher |    |  - Scheduling Monitor        |  |  |
|  |  +--------------+    |  - Dependency Monitor        |  |  |
|  |         |            |  - DAG Health Monitor        |  |  |
|  |         |            +--------------+---------------+  |  |
|  |         |                           |                  |  |
|  |         |            +--------------v---------------+  |  |
|  |         |            |    Metrics Collector         |  |  |
|  |         |            |    (WatcherMetrics)          |  |  |
|  |         |            +--------------+---------------+  |  |
|  |         |                           |                  |  |
|  |         v                           v                  |  |
|  |  +--------------+    +------------------------------+  |  |
|  |  |  Notifiers   |    |        Emitters              |  |  |
|  |  |  - Slack     |    |  - StatsD / Datadog (UDP)    |  |  |
|  |  |  - Email     |    |  - Prometheus (/metrics)     |  |  |
|  |  |  - PagerDuty |    |                              |  |  |
|  |  +--------------+    +------------------------------+  |  |
|  +--------------------------------------------------------+  |
|                          |                                   |
|                          v                                   |
|              +-----------------------+                       |
|              |  Airflow Metadata DB  |                       |
|              |  (PostgreSQL/MySQL)   |                       |
|              +-----------------------+                       |
+--------------------------------------------------------------+

Everything runs inside the Airflow webserver process. No separate workers, no message queues, no external databases. The plugin reads from the same metadata DB that Airflow already maintains.
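
To make the "reads from the same metadata DB" idea concrete, here is a minimal sketch of the kind of read-only aggregate a failure monitor could run. It is an illustration under stated assumptions, not the plugin's implementation: an in-memory SQLite database stands in for PostgreSQL/MySQL, the dag_run table is reduced to three columns, and failed_runs_per_dag is a hypothetical helper name.

```python
import sqlite3

# In-memory stand-in for the metadata DB; the real one is PostgreSQL/MySQL.
conn = sqlite3.connect(":memory:")
# Simplified dag_run table; the real table has many more columns.
conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("weather_data_pipeline", "r1", "failed"),
        ("weather_data_pipeline", "r2", "success"),
        ("ecommerce_sales_etl", "r3", "failed"),
        ("ecommerce_sales_etl", "r4", "failed"),
    ],
)


def failed_runs_per_dag(connection):
    """Count failed runs per DAG with a single read-only query."""
    rows = connection.execute(
        "SELECT dag_id, COUNT(*) FROM dag_run "
        "WHERE state = 'failed' GROUP BY dag_id ORDER BY dag_id"
    )
    return dict(rows.fetchall())


counts = failed_runs_per_dag(conn)
print(counts)  # {'ecommerce_sales_etl': 2, 'weather_data_pipeline': 1}
```

Since the query only reads rows Airflow already writes, this style of monitoring needs no extra storage, which matches the single-process design described above.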

Project Structure

airflow-watcher/
├── src/
│   └── airflow_watcher/
│       ├── __init__.py
│       ├── plugins/           # Airflow plugin definitions
│       ├── views/             # Flask Blueprint views
│       ├── monitors/          # DAG & SLA monitoring logic
│       ├── notifiers/         # Slack, email notifications
│       └── templates/         # Jinja2 templates
├── demo/                      # Local demo Airflow environment
│   ├── dags/                  # Sample DAGs for testing
│   ├── plugins/               # Plugin copy for demo
│   └── docker-compose.yml     # Docker setup
├── tests/
└── pyproject.toml

Demo Environment

To test the plugin locally with sample DAGs:

cd demo
docker-compose up -d

Then visit http://localhost:8080 and navigate to the Watcher menu.

See demo/README.md for more details.

MWAA Integration

Setup

  1. Add airflow-watcher to your MWAA requirements.txt:

     airflow-watcher==0.1.2

     For Prometheus metrics support:

     airflow-watcher[all]==0.1.2

  2. Upload requirements.txt to your MWAA S3 bucket:

     aws s3 cp requirements.txt s3://<your-mwaa-bucket>/requirements.txt

  3. Update your MWAA environment to pick up the new requirements (via AWS Console or CLI):

     aws mwaa update-environment \
       --name <your-environment-name> \
       --requirements-s3-path requirements.txt \
       --requirements-s3-object-version <version-id>

     Note: No plugins.zip is needed. Airflow auto-discovers airflow-watcher via the airflow.plugins entry point when installed via pip (Airflow 2.7+).

  4. Wait for the environment to finish updating (takes a few minutes).

  5. Verify at:

     https://<your-mwaa-url>/api/watcher/health

Environment Variables (optional)

Configure via MWAA Airflow configuration overrides:

| Variable                           | Purpose             |
|------------------------------------|---------------------|
| AIRFLOW_WATCHER__SLACK_WEBHOOK_URL | Slack notifications |
| AIRFLOW_WATCHER__PAGERDUTY_API_KEY | PagerDuty alerts    |
| AIRFLOW_WATCHER__ENABLE_PROMETHEUS | Prometheus metrics  |

Testing Locally with MWAA Local Runner

git clone https://github.com/aws/aws-mwaa-local-runner.git
cd aws-mwaa-local-runner
echo "airflow-watcher==0.1.2" >> requirements/requirements.txt
./mwaa-local-env build-image
./mwaa-local-env start

Visit http://localhost:8080/api/watcher/health to verify.
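
The verification step can also be scripted, for example in a smoke test after deployment. The sketch below only checks the HTTP status of the health path named above; it assumes the endpoint is reachable without authentication (a login redirect that returns 200 would be misread as healthy), and it makes no assumption about the response body, which this document does not describe. The helper names health_url and is_healthy are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(base_url):
    """Join the webserver base URL with the health path from the text above."""
    return base_url.rstrip("/") + "/api/watcher/health"


def is_healthy(base_url, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and non-2xx HTTP responses.
        return False
```

For the local runner, is_healthy("http://localhost:8080") would be the call to make once the containers are up.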

Note: If using Slack or PagerDuty notifications, ensure your MWAA VPC has a NAT gateway for outbound internet access.

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check src tests
black --check src tests

# Type checking
mypy src

License

Apache License 2.0 - See LICENSE for details.

Author

Ramanujam Solaimalai (@ram07eng)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
