dpetl — Data package ETL

dpetl is a command-line interface (CLI) tool designed to run the three ETL phases (Extract, Transform, Load).[^1]

[^1]: Currently, only the Extract phase is implemented.

It is designed to work alongside the Data Package standard specification.

Installation

dpetl requires Python 3.10 or later. Install it with pip or Poetry:

# using pip
pip install dpetl

# using poetry
poetry add dpetl

Usage

Activate your virtual environment first.

Use the --help flag to inspect the CLI documentation:

dpetl --help

Currently, only the extract command is available:

# Run extract using the default datapackage.yaml descriptor
dpetl extract

# Specify a descriptor explicitly
dpetl extract -d path/to/datapackage.yaml
# or
dpetl extract --descriptor path/to/datapackage.yaml

How It Works

The CLI loads one or more Data Package descriptors (via the frictionless-py Python package) and iterates over the resources of each.

A .toml file can also be passed as the descriptor (using the -d flag) to run the command over several data packages in one call. The .toml file must follow this pattern:

title = 'dados_orcamentarios'

[datapackages] # required

[datapackages.dados_siafi]
path = 'datapackages/dados_siafi/datapackage.yaml' # descriptor required via path property

[datapackages.dados_sisor]
path = 'datapackages/dados_sisor/datapackage.yaml' # descriptor required via path property

For each resource found, the dpetl extract command reads its dpetl_extract custom property. The mode key determines which extractor runs. The currently available modes are:

  • api.
  • email.
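
The mode lookup described above can be pictured as a simple dispatch table. The following is only an illustrative sketch with resources represented as plain dictionaries; the function names are hypothetical, and raising an error for a missing or unknown mode is an assumption, since the behavior of dpetl in those cases is not documented here.

```python
from typing import Callable


def extract_api(resource: dict) -> str:
    # Placeholder: a real extractor would download the file.
    return f"api extractor would handle {resource['name']}"


def extract_email(resource: dict) -> str:
    # Placeholder: a real extractor would fetch e-mail attachments.
    return f"email extractor would handle {resource['name']}"


# One entry per mode listed above.
EXTRACTORS: dict[str, Callable[[dict], str]] = {
    "api": extract_api,
    "email": extract_email,
}


def run_extract(resource: dict) -> str:
    """Pick the extractor named by dpetl_extract.mode and run it."""
    settings = resource.get("dpetl_extract") or {}
    mode = settings.get("mode")
    if mode not in EXTRACTORS:
        # Assumption: unknown or missing modes are rejected.
        raise ValueError(f"unsupported dpetl_extract mode: {mode!r}")
    return EXTRACTORS[mode](resource)


print(run_extract({"name": "invoices", "dpetl_extract": {"mode": "api"}}))
```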

Example Data Package Configuration

# datapackage.yaml
resources:
  - name: invoices
    path: data/invoices.csv
    sources:
      - method: get
        path: https://api.example.com/invoices
    dpetl_extract:
      mode: api

  - name: payroll_from_email
    path: data/payroll.xlsx
    dpetl_extract:
      mode: email
      mailbox: INBOX  # optional (Defaults to INBOX)
      criteria:
        subject: "Payroll Report" # optional (Defaults to resource name. See also the flag --add-package-name)

Extractors

Email Extractor

  • Connects to an IMAP server using environment variables:

    • EMAIL_USER.
    • EMAIL_PWD.
    • EMAIL_IMAP.
    • HTTP_PROXY[^2].

[^2]: Needed only when the command runs behind a corporate network that requires a proxy. The HTTP_PROXY, HTTPS_PROXY, http_proxy and https_proxy environment variables are all accepted. You may need to include authentication in the proxy address (http://<user>:<pwd>@<host>:<port>); see the related issue comment for the reason.

  • Reads configuration from:
dpetl_extract:
  mode: email
  mailbox: INBOX        # optional (Defaults to INBOX)
  criteria:             # optional
    subject: "Report"   # optional (Defaults to resource name. See also the flag --add-package-name)
    from_: "finance@example.com" # optional
    date_gte: 2024-01-01 # optional (See also the flag --today-email)

Behavior:

  • If dpetl_extract.mailbox is not provided, INBOX is used.
  • If dpetl_extract.criteria.subject is not provided, it defaults to the resource name.
  • If the flag --add-package-name is provided, the e-mail subject pattern is {package_name}_{resource_name} instead of just the resource name.
  • If the flag --today-email is provided, the date on which the command runs is used as the date search criterion (see date_gte above).
  • The extractor searches for the most recent matching e-mail.
  • All e-mail attachments are saved to resource.path.

API Extractor

  • Reads resource.sources.
  • Searches for a source containing a method.
  • Downloads the file.
  • Saves it to resource.path.
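
The four steps above could be sketched roughly as follows, using only the Python standard library and plain dictionaries in place of frictionless resources. pick_source and download are hypothetical helpers, and choosing the first source that declares a method is an assumption; the download function is defined but not called here, because it needs network access.

```python
import shutil
import urllib.request
from pathlib import Path


def pick_source(sources: list[dict]) -> dict:
    """Return the first source that declares a method (assumed rule)."""
    for source in sources:
        if "method" in source:
            return source
    raise ValueError("no source declares a method")


def download(resource: dict) -> Path:
    """Fetch the chosen source and save it to the resource path."""
    source = pick_source(resource.get("sources", []))
    request = urllib.request.Request(
        source["path"], method=source["method"].upper()
    )
    target = Path(resource["path"])
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response body straight into the target file.
    with urllib.request.urlopen(request) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


invoices = {
    "name": "invoices",
    "path": "data/invoices.csv",
    "sources": [
        {"title": "documentation only"},
        {"method": "get", "path": "https://api.example.com/invoices"},
    ],
}
chosen = pick_source(invoices["sources"])
print(chosen["path"])
```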

Design Philosophy

The dpetl package follows a convention-over-configuration philosophy, treating the Data Package descriptor as the single source of truth for the ETL process.

Each resource declares how it should be processed through structured metadata, enabling reproducible, declarative, and version-controlled data workflows.

The goal is to keep the CLI simple while allowing flexible strategies driven entirely by configuration rather than imperative scripting.
