Apache Airflow DAGs and ETL processes for ImpactU
ImpactU Airflow ETL
Central repository for Apache Airflow DAGs and ETL (Extraction, Transformation, and Loading) processes for the ImpactU project. This package includes the full logic for data orchestration, source extraction, and data processing.
🚀 Description
This project orchestrates the collection of data from various scientific and academic sources, its processing with the Kahi tool, and its subsequent loading into query systems such as MongoDB and Elasticsearch.
📦 Installation
You can install the package directly from PyPI:

```shell
pip install impactu_airflow
```

Or for development:

```shell
git clone https://github.com/colav/impactu_airflow.git
cd impactu_airflow
pip install -e .
```
📂 Project Structure
The repository is organized by data lifecycle stages and Airflow components:
- `dags/`: Apache Airflow DAG definitions.
- `extract/`: Extraction logic for sources like OpenAlex, ORCID, ROR, etc.
- `transform/`: Transformation and normalization processes (Kahi).
- `load/`: Loading scripts to final destinations (MongoDB, Elasticsearch).
- `impactu/`: Core utilities and shared logic for the project.
- `deploys/`: Deployment logic for external services (APIs, databases) via DAGs.
- `backups/`: Database backup automation via DAGs.
- `tests/`: Integration, unit, and data quality tests.
📋 Requirements and Architecture
For details on design principles (Checkpoints, Idempotency, Parallelism), see the System Requirements document.
🛠 DAG Naming Standard
To maintain consistency in the Airflow interface, we follow this convention:
| Type | Format | Example |
|---|---|---|
| Extraction | `extract_{source}` | `extract_openalex` |
| Transformation | `transform_{entity}` | `transform_sources` |
| Loading | `load_{db}_{env}` | `load_mongodb_production` |
| Deployment | `deploy_{service}_{env}` | `deploy_mongodb_production` |
| Backup | `backup_{db}_{name}` | `backup_mongodb_kahi` |
| Tests | `tests_{service}` | `tests_kahi` |
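A convention like this is easy to enforce programmatically. The following sketch is illustrative and not part of the package; the helper name and regex patterns are assumptions based only on the table above:

```python
import re

# One pattern per DAG type in the naming convention.
# The {source}/{entity}/{db}/{env}/{name}/{service} segments are assumed
# to be lowercase alphanumeric tokens.
DAG_ID_PATTERNS = [
    r"^extract_[a-z0-9]+$",          # extract_{source}
    r"^transform_[a-z0-9]+$",        # transform_{entity}
    r"^load_[a-z0-9]+_[a-z0-9]+$",   # load_{db}_{env}
    r"^deploy_[a-z0-9]+_[a-z0-9]+$", # deploy_{service}_{env}
    r"^backup_[a-z0-9]+_[a-z0-9]+$", # backup_{db}_{name}
    r"^tests_[a-z0-9]+$",            # tests_{service}
]

def is_valid_dag_id(dag_id: str) -> bool:
    """Return True if the DAG id matches one of the naming patterns."""
    return any(re.match(p, dag_id) for p in DAG_ID_PATTERNS)

print(is_valid_dag_id("extract_openalex"))     # True
print(is_valid_dag_id("openalex_extraction"))  # False
```

A check like this could run in CI so that nonconforming DAG ids are rejected before they reach the Airflow interface.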
⚙️ Development and Deployment
This repository focuses exclusively on DAG logic and ETL processes. The base infrastructure is provided by the Chia repository.
For details on the CI/CD strategy, image building, and environment management, see the document: 👉 README_DEVOPS.md
Local Workflow
- Clone the repository.
- Install dependencies: `pip install -r requirements.txt`
- Develop DAGs in the `dags/` folder.
- Validate integrity: `pytest tests/etl/test_dag_integrity.py`
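The repository's actual integrity test is not shown here. As a self-contained illustration of the kind of static check such a test can perform, the sketch below collects `dag_id=` string literals from DAG source code without importing Airflow or executing the files; the function names are hypothetical:

```python
import ast
from pathlib import Path

def find_dag_ids(source: str) -> list[str]:
    """Collect every string passed as a dag_id= keyword argument.

    This is a lightweight static scan: it parses the source with the
    ast module and never executes the DAG file.
    """
    ids = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if kw.arg == "dag_id" and isinstance(kw.value, ast.Constant):
                    ids.append(kw.value.value)
    return ids

def scan_dags_folder(dags_dir: Path) -> dict[str, list[str]]:
    """Map each .py file under dags/ to the dag_ids it declares."""
    return {p.name: find_dag_ids(p.read_text()) for p in dags_dir.glob("*.py")}

print(find_dag_ids('dag = DAG(dag_id="extract_openalex", schedule=None)'))
# → ['extract_openalex']
```

A fuller integrity test would typically also load the DAGs with Airflow itself (e.g. via a `DagBag`) to catch import errors, which this static sketch cannot do.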
Colav - ImpactU
File details
Details for the file impactu_airflow-0.1.0.tar.gz.
File metadata
- Download URL: impactu_airflow-0.1.0.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e57c2bbc1945beef3eb85068cc961fd84cd2d6ccf4a3c6274bdf753999c357bd` |
| MD5 | `076c3e3463b16bb1a6dd5c1f3322d9a8` |
| BLAKE2b-256 | `36b4b35667867b3ddc5039cfec3e9541d5bc9724f4e1a77c56d6e1d568ff586c` |
File details
Details for the file impactu_airflow-0.1.0-py3-none-any.whl.
File metadata
- Download URL: impactu_airflow-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `49a7d1337c3e94bc39c312711a947fe020e720c12187f256593a0d789b086a13` |
| MD5 | `1348f85086f0a417988c82cf13ddbfbf` |
| BLAKE2b-256 | `b89a82223177a3896f6ba0a9d754b5534c1be4e5d9540ee3b3cb3dd6e919799e` |