ImpactU Airflow ETL

Central repository for Apache Airflow DAGs and ETL (Extraction, Transformation, and Loading) processes for the ImpactU project. This package includes the full logic for data orchestration, source extraction, and data processing.

🚀 Description

This project orchestrates the collection of data from various scientific and academic sources, its processing with the Kahi tool, and the subsequent loading of the results into query systems such as MongoDB and Elasticsearch.
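Conceptually, each pipeline chains three stages: extract raw records from a source, normalize them, and load the result into a query system. The plain-Python sketch below illustrates that flow only; the function names and record shapes are hypothetical, not the package's actual API, and in production each stage runs as a separate Airflow task.

```python
# Illustrative extract -> transform -> load flow. All names here are
# hypothetical stand-ins for the real pipeline stages.

def extract(source: str) -> list[dict]:
    """Pull raw records from a source (e.g. OpenAlex, ORCID, ROR)."""
    return [{"source": source, "id": 1}]

def transform(records: list[dict]) -> list[dict]:
    """Normalize records (the role Kahi plays in the real pipeline)."""
    return [{**r, "normalized": True} for r in records]

def load(records: list[dict], destination: str) -> int:
    """Write records to a query system such as MongoDB or Elasticsearch."""
    # A real loader would use a pymongo or elasticsearch client here.
    return len(records)

def run_pipeline(source: str, destination: str) -> int:
    return load(transform(extract(source)), destination)
```

For example, `run_pipeline("openalex", "mongodb")` would extract, normalize, and load one batch end to end.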

📦 Installation

You can install the package directly from PyPI:

pip install impactu_airflow

Or for development:

git clone https://github.com/colav/impactu_airflow.git
cd impactu_airflow
pip install -e .

📂 Project Structure

The repository is organized by data lifecycle stages and Airflow components:

  • dags/: Apache Airflow DAG definitions.
  • extract/: Extraction logic for sources like OpenAlex, ORCID, ROR, etc.
  • transform/: Transformation and normalization processes (Kahi).
  • load/: Loading scripts to final destinations (MongoDB, Elasticsearch).
  • impactu/: Core utilities and shared logic for the project.
  • deploys/: Deployment logic for external services (APIs, databases) via DAGs.
  • backups/: Database backup automation via DAGs.
  • tests/: Integration, unit, and data quality tests.

📋 Requirements and Architecture

For details on design principles (Checkpoints, Idempotency, Parallelism), see the System Requirements document.
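The checkpoint and idempotency principles can be sketched in plain Python: a task records which units of work it has completed, so a re-run after a failure skips finished work instead of duplicating it. The in-memory store below is a hypothetical stand-in, not the project's actual implementation; a real deployment would persist checkpoint state (e.g. in a MongoDB collection).

```python
class CheckpointStore:
    """Hypothetical in-memory checkpoint store; a real system would
    persist completed keys so state survives worker restarts."""

    def __init__(self):
        self._done = set()

    def is_done(self, key: str) -> bool:
        return key in self._done

    def mark_done(self, key: str) -> None:
        self._done.add(key)

def idempotent_extract(pages, store, fetch):
    """Fetch only pages not yet checkpointed, so re-runs are safe."""
    results = []
    for page in pages:
        key = f"extract:{page}"
        if store.is_done(key):
            continue  # already fetched in a previous (possibly partial) run
        results.append(fetch(page))
        store.mark_done(key)
    return results
```

Running the same extraction twice fetches each page only once, which is what makes retries after a mid-run failure safe.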

🛠 DAG Naming Standard

To maintain consistency in the Airflow interface, we follow this convention:

Type            Format                   Example
Extraction      extract_{source}         extract_openalex
Transformation  transform_{entity}       transform_sources
Loading         load_{db}_{env}          load_mongodb_production
Deployment      deploy_{service}_{env}   deploy_mongodb_production
Backup          backup_{db}_{name}       backup_mongodb_kahi
Tests           tests_{service}          tests_kahi
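A convention like this is easy to enforce mechanically. The validator below is a hypothetical sketch, not part of the package: one regular expression per DAG type, mirroring the table above.

```python
import re

# One pattern per DAG type from the naming convention table.
DAG_ID_PATTERNS = [
    re.compile(r"^extract_[a-z0-9_]+$"),    # extract_{source}
    re.compile(r"^transform_[a-z0-9_]+$"),  # transform_{entity}
    re.compile(r"^load_[a-z0-9_]+_[a-z0-9]+$"),    # load_{db}_{env}
    re.compile(r"^deploy_[a-z0-9_]+_[a-z0-9]+$"),  # deploy_{service}_{env}
    re.compile(r"^backup_[a-z0-9_]+_[a-z0-9]+$"),  # backup_{db}_{name}
    re.compile(r"^tests_[a-z0-9_]+$"),      # tests_{service}
]

def is_valid_dag_id(dag_id: str) -> bool:
    """Return True if dag_id matches any of the convention's formats."""
    return any(p.match(dag_id) for p in DAG_ID_PATTERNS)
```

A check like this could run in CI so a misnamed DAG fails review before it reaches the Airflow interface.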

⚙️ Development and Deployment

This repository focuses exclusively on DAG logic and ETL processes. The base infrastructure is provided by the Chia repository.

For details on the CI/CD strategy, image building, and environment management, see the document: 👉 README_DEVOPS.md

Local Workflow

  1. Clone the repository.
  2. Install dependencies: pip install -r requirements.txt.
  3. Develop DAGs in the dags/ folder.
  4. Validate integrity: pytest tests/etl/test_dag_integrity.py.

Colav - ImpactU

Download files

Source distribution:  impactu_airflow-0.1.0.tar.gz (11.0 kB)
Built distribution:   impactu_airflow-0.1.0-py3-none-any.whl (10.2 kB)

File details

Details for the file impactu_airflow-0.1.0.tar.gz:

  • Size: 11.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

Hashes for impactu_airflow-0.1.0.tar.gz:

Algorithm    Hash digest
SHA256       e57c2bbc1945beef3eb85068cc961fd84cd2d6ccf4a3c6274bdf753999c357bd
MD5          076c3e3463b16bb1a6dd5c1f3322d9a8
BLAKE2b-256  36b4b35667867b3ddc5039cfec3e9541d5bc9724f4e1a77c56d6e1d568ff586c

Details for the file impactu_airflow-0.1.0-py3-none-any.whl:

Hashes for impactu_airflow-0.1.0-py3-none-any.whl:

Algorithm    Hash digest
SHA256       49a7d1337c3e94bc39c312711a947fe020e720c12187f256593a0d789b086a13
MD5          1348f85086f0a417988c82cf13ddbfbf
BLAKE2b-256  b89a82223177a3896f6ba0a9d754b5534c1be4e5d9540ee3b3cb3dd6e919799e
