A YAML-driven data ingestion framework. Define collectors; dcf handles the rest.

dcf

D.ata C.ollection F.ramework

uvx --from dcf-core dcf init

How it works

  1. Define a data collector in YAML: its source, schema, and cadence
  2. Run it with dcf run
  3. Query the collected data from your data lake with dcf query

Example

name: dcf_commits
namespace: github

source:
  type: http
  url: https://api.github.com/repos/zephschafer/dcf/commits
  method: GET
  params:
    - name: per_page
      type: integer
      value: 100
    - name: since
      type: string
    - name: until
      type: string
  schema:
    columns:
      - {name: sha,          path: sha,                type: string}
      - {name: author,       path: commit.author.name, type: string}
      - {name: message,      path: commit.message,     type: string}
      - {name: committed_at, path: commit.author.date, type: timestamp}

cadence:
  strategy: incremental
  primary_key: sha
  iterate:
    - type: date_range
      params: [since, until]
      start: "2024-01-01"
      end: today
      step: 30 days

deployment:
  schedule: "0 8 * * *"

Run the collector, then query the result:

uv run dcf run dcf_commits
uv run dcf query 'SELECT * FROM github.dcf_commits LIMIT 5'
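
To make the iterate block in the example more concrete, here is a minimal sketch of how a date_range with a start, an end, and a 30-day step could expand into (since, until) pairs, one per request. This is not dcf's actual implementation: the function name date_windows is hypothetical, and the half-open, non-overlapping windows are an assumption. The end date is pinned instead of using today so the result is reproducible.

```python
from datetime import date, timedelta


def date_windows(start: date, end: date, step_days: int):
    """Yield (since, until) pairs covering [start, end) in steps of step_days.

    Hypothetical helper: the window boundaries dcf really uses may differ.
    """
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the end date.
        upper = min(cursor + timedelta(days=step_days), end)
        yield cursor, upper
        cursor = upper


# start and step mirror the example collector; end stands in for "today".
windows = list(date_windows(date(2024, 1, 1), date(2024, 3, 15), step_days=30))

for since, until in windows:
    # Each pair would fill the `since` and `until` params of one HTTP request.
    print(since.isoformat(), until.isoformat())
```

With these inputs the sketch produces three windows, the last one shortened to stop at the end date.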

Install

pip install dcf-core

The package is published on PyPI as dcf-core; the CLI command it installs is dcf.


Quickstart

mkdir my-project && cd my-project
uvx --from dcf-core dcf init
uv sync
uv run dcf run dcf_commits
uv run dcf query 'SELECT * FROM github.dcf_commits'

dcf init creates pyproject.toml, project.yml, .gitignore, collectors/, and an example collector.


Contributing

git clone https://github.com/zephschafer/dcf && cd dcf && uv sync

Point a local project at your checkout by adding this to the project's pyproject.toml:

[tool.uv.sources]
dcf-core = { path = "../dcf", editable = true }

To verify changes:

uv run dcf run dcf_commits
uv run dcf query 'SELECT * FROM github.dcf_commits'

Releasing: bump the version in pyproject.toml and push to main. GitHub Actions then publishes to PyPI automatically.


Package structure

dcf/
├── cli.py              Entry point (Typer)
├── config/
│   ├── models.py       Pydantic models for collector YAML
│   └── loader.py       YAML loading + env var resolution
├── engine/
│   ├── runner.py       Outer loop (iterate → fetch → project → write)
│   ├── fetcher.py      HTTP and Python source fetchers
│   ├── iterator.py     Date range and categorical iteration
│   ├── projector.py    Schema projection and path extraction
│   └── transforms.py   Column transforms
├── writer/
│   └── iceberg.py      Write strategies (incremental / append / full_refresh)
└── gcp/                GCP auth, provisioning, Terraform wrappers
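
The project and write steps of the runner loop can be illustrated with a small standalone sketch. It is only an approximation under stated assumptions, not the code in projector.py or iceberg.py: the helpers extract, project, and upsert are hypothetical names, an in-memory dict stands in for the Iceberg table, type casting (for example to timestamp) is left out, and "incremental" is assumed to mean that a row replaces any earlier row with the same primary key.

```python
from functools import reduce


def extract(record: dict, path: str):
    """Follow a dotted path such as 'commit.author.name'; None if a key is missing."""
    return reduce(
        lambda node, key: node.get(key) if isinstance(node, dict) else None,
        path.split("."),
        record,
    )


def project(record: dict, columns: list[dict]) -> dict:
    """Build one flat row from a nested record using name/path column specs."""
    return {col["name"]: extract(record, col["path"]) for col in columns}


def upsert(table: dict, rows: list[dict], primary_key: str) -> dict:
    """Assumed incremental behaviour: the newest row wins for each key."""
    for row in rows:
        table[row[primary_key]] = row
    return table


# Column specs copied from the example collector (types omitted in this sketch).
columns = [
    {"name": "sha", "path": "sha"},
    {"name": "author", "path": "commit.author.name"},
    {"name": "message", "path": "commit.message"},
]

# Made-up API records; the second batch repeats sha "a1" with a new message.
batch_1 = [{"sha": "a1", "commit": {"author": {"name": "Ada"}, "message": "init"}}]
batch_2 = [
    {"sha": "a1", "commit": {"author": {"name": "Ada"}, "message": "init (amended)"}},
    {"sha": "b2", "commit": {"message": "no author field"}},
]

table: dict = {}
for batch in (batch_1, batch_2):
    upsert(table, [project(r, columns) for r in batch], primary_key="sha")
```

After both batches the table holds two rows: the repeated key keeps only its latest version, and the record without an author field gets None in the author column.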

Download files

Source Distribution

dcf_core-1.0.0.tar.gz (61.2 kB)

Uploaded Source

Built Distribution

dcf_core-1.0.0-py3-none-any.whl (64.9 kB)

Uploaded Python 3

File details

Details for the file dcf_core-1.0.0.tar.gz.

File metadata

  • Download URL: dcf_core-1.0.0.tar.gz
  • Upload date:
  • Size: 61.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for dcf_core-1.0.0.tar.gz
Algorithm    Hash digest
SHA256       b9108676a5929e0a05e9cb29b45524b61152b873c8d07aa06a6d2674e6de680f
MD5          bf3bd02849c761ca8d0281388646cbae
BLAKE2b-256  c79f45b39ab6cf0f425f4ad2de38581277d96d9cf8b52ad26f814510d139b819

File details

Details for the file dcf_core-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: dcf_core-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 64.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.14 {"installer":{"name":"uv","version":"0.11.14","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for dcf_core-1.0.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       2f0f67ca9256b17b82107c29334dcfcd46a36744f3a900f7d1975032d3b22bf4
MD5          41fd750bbd19b1a0111e35a8cedee422
BLAKE2b-256  04577febdefd0991fa3f48fb8dddf91fdae9ba42fcb061552ad49ef453cd2aeb
