YAML-driven data ingestion framework. Define collectors; dcf handles the rest.

dcf

D.ata C.ollection F.ramework

uvx --from dcf-core dcf init

How it works

  1. Define a data collector in YAML — source, schema, cadence
  2. Run it with dcf run
  3. Query data from your data lake

Quickstart

mkdir my-project && cd my-project
uvx --from dcf-core dcf init
uv sync
uv run dcf run dcf_commits
uv run dcf query 'SELECT * FROM github.dcf_commits'

dcf init creates pyproject.toml, project.yml, .gitignore, collectors/, and an example collector.


Example

dcf collector

name: dcf_commits
namespace: github

source:
  type: http
  url: https://api.github.com/repos/zephschafer/dcf/commits
  method: GET
  params:
    - name: per_page
      type: integer
      value: 100
    - name: since
      type: string
    - name: until
      type: string
  schema:
    columns:
      - {name: sha,          path: sha,                type: string}
      - {name: author,       path: commit.author.name, type: string}
      - {name: message,      path: commit.message,     type: string}
      - {name: committed_at, path: commit.author.date, type: timestamp}

cadence:
  strategy: incremental
  primary_key: sha
  iterate:
    - type: date_range
      params: [since, until]
      start: "2024-01-01"
      end: today
      step: 30 days

deployment:
  schedule: "0 8 * * *"  # cron expression: every day at 08:00
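
The two less obvious parts of the collector above are the date_range iteration and the dotted schema paths. The sketch below shows one way to read them. It is not dcf's actual implementation: the function names date_windows, extract_path, and project are hypothetical, the windows are assumed to be half-open and non-overlapping, and a missing key is assumed to produce None.

```python
from datetime import date, timedelta


def date_windows(start: date, end: date, step_days: int):
    """Yield (since, until) pairs that cover start..end in step_days chunks.

    Assumption: windows are half-open, so each `until` is the next `since`.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=step_days), end)
        yield cursor, window_end
        cursor = window_end


def extract_path(record: dict, path: str):
    """Follow a dotted path such as 'commit.author.name' through nested dicts.

    Assumption: a missing key yields None instead of raising.
    """
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def project(record: dict, columns: dict) -> dict:
    """Build one output row from a mapping of column name to dotted path."""
    return {name: extract_path(record, path) for name, path in columns.items()}


# Column name -> path, mirroring the schema in the example collector.
COLUMNS = {
    "sha": "sha",
    "author": "commit.author.name",
    "message": "commit.message",
    "committed_at": "commit.author.date",
}

if __name__ == "__main__":
    # Each window would fill the `since` and `until` request params.
    for since, until in date_windows(date(2024, 1, 1), date.today(), 30):
        print(since.isoformat(), until.isoformat())
```

With this reading, each (since, until) pair becomes one HTTP request, and each item in the response is flattened into a row by project.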

dcf run

uv run dcf run dcf_commits

dcf query

uv run dcf query 'SELECT * FROM github.dcf_commits LIMIT 5'

Contributing

git clone https://github.com/zephschafer/dcf && cd dcf && uv sync

Point a local project at your checkout by adding this to the project's pyproject.toml:

[tool.uv.sources]
dcf-core = { path = "../dcf", editable = true }

To verify changes:

uv run dcf run dcf_commits
uv run dcf query 'SELECT * FROM github.dcf_commits'

Releasing: bump the version in pyproject.toml and push to main. GitHub Actions then publishes to PyPI automatically.


Package structure

dcf/
├── cli.py              Entry point (Typer)
├── config/
│   ├── models.py       Pydantic models for collector YAML
│   └── loader.py       YAML loading + env var resolution
├── engine/
│   ├── runner.py       Outer loop (iterate → fetch → project → write)
│   ├── fetcher.py      HTTP and Python source fetchers
│   ├── iterator.py     Date range and categorical iteration
│   ├── projector.py    Schema projection and path extraction
│   └── transforms.py   Column transforms
├── writer/
│   └── iceberg.py      Write strategies (incremental / append / full_refresh)
└── gcp/                GCP auth, provisioning, Terraform wrappers
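
The three write strategies named for writer/iceberg.py can be illustrated on plain Python lists. This is a sketch of the likely semantics, not the package's code: write_rows is a hypothetical helper, and it assumes that incremental means "upsert by primary_key", append means "add rows without deduplication", and full_refresh means "replace everything". The real writer targets Iceberg tables, not in-memory lists.

```python
def write_rows(existing, new_rows, strategy, primary_key=None):
    """Return the table contents after applying one write strategy.

    `existing` and `new_rows` are lists of dict rows; nothing is mutated.
    """
    if strategy == "full_refresh":
        # Drop whatever was there and keep only the new batch.
        return list(new_rows)
    if strategy == "append":
        # Keep everything, duplicates included.
        return list(existing) + list(new_rows)
    if strategy == "incremental":
        if primary_key is None:
            raise ValueError("incremental writes need a primary_key")
        # Upsert: a new row replaces an existing row with the same key.
        merged = {row[primary_key]: row for row in existing}
        for row in new_rows:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    table = [{"sha": "a", "message": "old"}, {"sha": "b", "message": "keep"}]
    batch = [{"sha": "a", "message": "new"}, {"sha": "c", "message": "add"}]
    for strategy in ("incremental", "append", "full_refresh"):
        print(strategy, write_rows(table, batch, strategy, primary_key="sha"))
```

Under these assumptions, rerunning an incremental collector over an overlapping date window would not duplicate rows, which is why the example collector pairs strategy: incremental with primary_key: sha.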
