D.ata C.ollection F.ramework - Define a data source, run dcf, query the data

dcf

D.ata C.ollection F.ramework

uvx --from dcf-core dcf init

How it works

  1. Define a data collector in YAML — source, schema, cadence
  2. Run it with dcf run
  3. Query data from your data lake

▶️ Video Tutorial: Run a DCF Collector with GitHub API


Quickstart

Get real data from an API into your lakehouse and query it with SQL, in five commands.

mkdir dcf-demo && cd dcf-demo
uvx --from dcf-core dcf init
uv sync
uv run dcf run so_questions
uv run dcf query 'SELECT * FROM stackoverflow.so_questions'

dcf init creates pyproject.toml, profiles.yml, .gitignore, collectors/, and an example collector.


Example

dcf collector

name: so_questions
namespace: stackoverflow

source:
  type: http
  url: https://api.stackexchange.com/2.3/questions
  response:
    records_path: items
  params:
    - name: site
      type: string
      value: stackoverflow
    - name: fromdate
      type: string
      format: "%s"
    - name: todate
      type: string
      format: "%s"
  schema:
    columns:
      - name: question_id
        path: question_id
        type: integer
      - name: title
        path: title
        type: string
      - name: creation_date
        path: creation_date
        type: integer

cadence:
  strategy: incremental
  primary_key: question_id
  iterate:
    - type: date_range
      params: [fromdate, todate]
      start: "2025-01-01"
      end: today
      step: 30 days
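
The date_range entry under cadence.iterate can be read as "split the span from start to end into 30-day windows and fill fromdate and todate for each request". The sketch below illustrates that reading in plain Python. It is not dcf's implementation: date_windows and to_epoch_seconds are hypothetical helpers, the "%s" format is assumed to mean Unix seconds at midnight UTC, and a fixed end date stands in for today so the result is reproducible.

```python
from datetime import date, datetime, timedelta, timezone


def date_windows(start: date, end: date, step_days: int):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past the end date.
        window_end = min(cursor + timedelta(days=step_days), end)
        yield cursor, window_end
        cursor = window_end


def to_epoch_seconds(day: date) -> str:
    """Render a date as Unix seconds at midnight UTC (assumed meaning of "%s")."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return str(int(midnight.timestamp()))


# Fixed end date instead of "today", purely for a reproducible illustration.
windows = list(date_windows(date(2025, 1, 1), date(2025, 3, 15), step_days=30))

for window_start, window_end in windows:
    # One set of query parameters per window, mirroring the params in the YAML.
    params = {
        "site": "stackoverflow",
        "fromdate": to_epoch_seconds(window_start),
        "todate": to_epoch_seconds(window_end),
    }
    print(params)
```

Under these assumptions each window produces one request, so a longer date span or a smaller step means more calls to the API.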

dcf run

uv run dcf run so_questions

dcf query

uv run dcf query 'SELECT * FROM stackoverflow.so_questions LIMIT 5'

Contributing

git clone https://github.com/zephschafer/dcf && cd dcf && uv sync

Point a local project at your checkout:

[tool.uv.sources]
dcf-core = { path = "../dcf", editable = true }

To verify changes:

uv run dcf run so_questions
uv run dcf query 'SELECT * FROM stackoverflow.so_questions'

Releasing: bump version in pyproject.toml and push to main — GitHub Actions publishes to PyPI automatically.


Package structure

dcf/
├── cli.py              Entry point (Typer)
├── config/
│   ├── models.py       Pydantic models for collector YAML
│   └── loader.py       YAML loading + env var resolution
├── engine/
│   ├── runner.py       Outer loop (iterate → fetch → project → write)
│   ├── fetcher.py      HTTP and Python source fetchers
│   ├── iterator.py     Date range and categorical iteration
│   ├── projector.py    Schema projection and path extraction
│   └── transforms.py   Column transforms
├── writer/
│   └── iceberg.py      Write strategies (incremental / append / full_refresh)
└── gcp/                GCP auth, provisioning, Terraform wrappers
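
The outer loop noted for runner.py (iterate → fetch → project → write) can be pictured with a small in-memory sketch. Everything here is illustrative rather than dcf's actual code: extract_path, project, write_incremental, run, and fake_fetch are hypothetical names, paths are assumed to be dot-separated keys, the incremental strategy is assumed to upsert rows by primary key, and a dict stands in for the Iceberg table.

```python
from typing import Any, Callable, Iterable

Record = dict[str, Any]


def extract_path(obj: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts; None if a key is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def project(response: dict, records_path: str, columns: list[dict]) -> list[Record]:
    """Pull the record list out of a response and keep only the declared columns."""
    records = extract_path(response, records_path) or []
    return [{c["name"]: extract_path(r, c["path"]) for c in columns} for r in records]


def write_incremental(table: dict, rows: Iterable[Record], primary_key: str) -> None:
    """Upsert rows by primary key (assumed semantics of the incremental strategy)."""
    for row in rows:
        table[row[primary_key]] = row


def run(windows: Iterable[Any], fetch: Callable[[Any], dict],
        records_path: str, columns: list[dict], primary_key: str) -> dict:
    """Iterate over windows, then fetch, project, and write each batch."""
    table: dict = {}
    for window in windows:
        response = fetch(window)
        rows = project(response, records_path, columns)
        write_incremental(table, rows, primary_key)
    return table


def fake_fetch(window: str) -> dict:
    """Stand-in for an HTTP call; returns canned responses per window."""
    pages = {
        "w1": {"items": [{"question_id": 1, "title": "A", "owner": {"id": 7}}]},
        "w2": {"items": [{"question_id": 1, "title": "A (edited)"},
                         {"question_id": 2, "title": "B"}]},
    }
    return pages[window]


columns = [
    {"name": "question_id", "path": "question_id"},
    {"name": "title", "path": "title"},
]
table = run(["w1", "w2"], fake_fetch, "items", columns, "question_id")
print(table)
```

In this sketch a record seen again in a later window replaces the earlier copy instead of duplicating it, which is one plausible reason to declare a primary_key alongside the incremental strategy.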
