A Swiss Army knife for simple ETL operations

ETLPlus

ETLPlus is a Swiss Army knife for simple ETL operations, offering both a Python package and a command-line interface for data extraction, validation, transformation, and loading.

Features

  • Check data pipeline definitions before running them:

    • Summarize jobs, sources, targets, and transforms
    • Confirm configuration changes by printing focused sections on demand
  • Render SQL DDL from shared table specs:

    • Generate CREATE TABLE or view statements
    • Swap templates or direct output to files for database migrations
  • Extract data from multiple sources:

    • Files (CSV, JSON, XML, YAML)
    • Databases (connection string support)
    • REST APIs (GET)
  • Validate data with flexible rules:

    • Type checking
    • Required fields
    • Value ranges (min/max)
    • String length constraints
    • Pattern matching
    • Enum validation
  • Transform data with powerful operations:

    • Filter records
    • Map/rename fields
    • Select specific fields
    • Sort data
    • Aggregate functions (avg, count, max, min, sum)
  • Load data to multiple targets:

    • Files (CSV, JSON, XML, YAML)
    • Databases (connection string support)
    • REST APIs (PATCH, POST, PUT)

Installation

pip install etlplus

For development:

pip install -e ".[dev]"

Quickstart

Get up and running in under a minute.

Command line interface:

# Inspect help and version
etlplus --help
etlplus --version

# One-liner: extract CSV, filter, select, and write JSON
etlplus extract file examples/data/sample.csv \
  | etlplus transform --operations '{"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}' \
  - temp/sample_output.json

Python API:

from etlplus import extract, transform, validate, load

# Extract records from a CSV file.
data = extract("file", "input.csv")

# Keep records with age > 25, then project two fields.
ops = {"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}
filtered = transform(data, ops)

# Validate the result and write it out as JSON.
rules = {"name": {"type": "string", "required": True}, "email": {"type": "string", "required": True}}
assert validate(filtered, rules)["valid"]
load(filtered, "file", "temp/sample_output.json", file_format="json")

Usage

Command Line Interface

ETLPlus provides a powerful CLI for ETL operations:

# Show help
etlplus --help

# Show version
etlplus --version

The CLI is implemented with Typer (Click-based). There is no argparse compatibility layer, so rely on the documented commands/flags and run etlplus <command> --help for current options.

Example error messages:

  • If you omit a required argument: Error: Missing required argument 'SOURCE'.
  • If you place an option before its argument: Error: Option '--source-format' must follow the 'SOURCE' argument.

Argument Order and Required Options

For each command, positional arguments must precede options. Required options must follow their associated argument:

  • extract: etlplus extract SOURCE [--source-format ...] [--source-type ...]
    • SOURCE is required. --source-format and --source-type must follow SOURCE.
  • transform: etlplus transform [--operations ...] SOURCE [--source-format ...] [--source-type ...] TARGET [--target-format ...] [--target-type ...]
    • SOURCE and TARGET are required. Format/type options must follow their respective argument.
  • load: etlplus load TARGET [--target-format ...] [--target-type ...] [--source-format ...]
    • TARGET is required. --target-format and --target-type must follow TARGET.
  • validate: etlplus validate SOURCE [--rules ...] [--source-format ...] [--source-type ...]
    • SOURCE is required. --rules and format/type options must follow SOURCE.

If required arguments or options are missing, or if options are placed before their associated argument, the CLI will display a clear error message.

Check Pipelines

Use etlplus check to explore pipeline YAML definitions without running them. The command can print job names, summarize configured sources and targets, or drill into specific sections.

List jobs and show a pipeline summary:

etlplus check --config examples/configs/pipeline.yml --jobs
etlplus check --config examples/configs/pipeline.yml --summary

Show sources or transforms for troubleshooting:

etlplus check --config examples/configs/pipeline.yml --sources
etlplus check --config examples/configs/pipeline.yml --transforms

Render SQL DDL

Use etlplus render to turn table schema specs into ready-to-run SQL. Render from a pipeline config or from a standalone schema file, and choose the built-in ddl or view templates (or provide your own).

Render all tables defined in a pipeline:

etlplus render --config examples/configs/pipeline.yml --template ddl

Render a single table in that pipeline:

etlplus render --config examples/configs/pipeline.yml --table customers --template view

Render from a standalone table spec to a file:

etlplus render --spec schemas/customer.yml --template view -o temp/customer_view.sql

Extract Data

Note: For file sources, the format is normally inferred from the filename extension. Use --source-format to override inference when a file lacks an extension or when you want to force a specific parser.

Extract from JSON file:

etlplus extract file examples/data/sample.json

Extract from CSV file:

etlplus extract file examples/data/sample.csv

Extract from XML file:

etlplus extract file examples/data/sample.xml

Extract from REST API:

etlplus extract api https://api.example.com/data

Save extracted data to file:

etlplus extract file examples/data/sample.csv > temp/sample_output.json

Validate Data

Validate data from file or JSON string:

etlplus validate '{"name": "John", "age": 30}' --rules '{"name": {"type": "string", "required": true}, "age": {"type": "number", "min": 0, "max": 150}}'

Validate from file:

etlplus validate examples/data/sample.json --rules '{"email": {"type": "string", "pattern": "^[\\w.-]+@[\\w.-]+\\.\\w+$"}}'

Transform Data

When piping data through etlplus transform:

  • Use --source-format whenever the SOURCE argument is - or a literal payload, mirroring the etlplus extract semantics.
  • Use --target-format to control the emitted format for STDOUT or other non-file outputs, just like etlplus load.
  • File paths continue to infer formats from their extensions.
  • Use --source-type to override the inferred source connector type and --target-type to override the inferred target connector type, matching the etlplus extract/etlplus load behavior.

Transform file inputs while overriding connector types:

etlplus transform \
  --operations '{"select": ["name", "email"]}' \
  examples/data/sample.json  --source-type file \
  temp/selected_output.json --target-type file

Filter and select fields:

etlplus transform \
  --operations '{"filter": {"field": "age", "op": "gt", "value": 26}, "select": ["name"]}' \
  '[{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]'

Sort data:

etlplus transform \
  --operations '{"sort": {"field": "age", "reverse": true}}' \
  examples/data/sample.json

Aggregate data:

etlplus transform \
  --operations '{"aggregate": {"field": "age", "func": "sum"}}' \
  examples/data/sample.json

Map/rename fields:

etlplus transform \
  --operations '{"map": {"name": "new_name"}}' \
  examples/data/sample.json

Load Data

etlplus load consumes JSON from STDIN; provide only the target argument plus optional flags.

Load to JSON file:

etlplus extract file examples/data/sample.json \
  | etlplus load temp/sample_output.json --target-type file

Load to CSV file:

etlplus extract file examples/data/sample.csv \
  | etlplus load temp/sample_output.csv --target-type file

Load to REST API:

cat examples/data/sample.json \
  | etlplus load https://api.example.com/endpoint --target-type api

Python API

Use ETLPlus as a Python library:

from etlplus import extract, validate, transform, load

# Extract data
data = extract("file", "data.json")

# Validate data
validation_rules = {
    "name": {"type": "string", "required": True},
    "age": {"type": "number", "min": 0, "max": 150}
}
result = validate(data, validation_rules)
if result["valid"]:
    print("Data is valid!")

# Transform data
operations = {
    "filter": {"field": "age", "op": "gt", "value": 18},
    "select": ["name", "email"]
}
transformed = transform(data, operations)

# Load data
load(transformed, "file", "temp/sample_output.json", file_format="json")

For YAML-driven pipelines executed end-to-end (extract → validate → transform → load), see the CLI quick reference for pipelines:

# List jobs or show a pipeline summary
etlplus check --config examples/configs/pipeline.yml --jobs
etlplus check --config examples/configs/pipeline.yml --summary

# Run a job
etlplus run --config examples/configs/pipeline.yml --job file_to_file_customers

Complete ETL Pipeline Example

# 1. Extract from CSV
etlplus extract file examples/data/sample.csv > temp/sample_extracted.json

# 2. Transform (filter and select fields)
etlplus transform \
  --operations '{"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}' \
  temp/sample_extracted.json \
  temp/sample_transformed.json

# 3. Validate transformed data
etlplus validate \
  temp/sample_transformed.json \
  --rules '{"name": {"type": "string", "required": true}, "email": {"type": "string", "required": true}}'

# 4. Load to CSV
cat temp/sample_transformed.json \
  | etlplus load temp/sample_output.csv
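
The same four steps can be written with the Python API shown above. The following is a sketch that mirrors the shell pipeline; the file_format="csv" keyword is assumed by analogy with the file_format="json" usage in the Quickstart:

from etlplus import extract, transform, validate, load

# 1. Extract from CSV
data = extract("file", "examples/data/sample.csv")

# 2. Transform (filter and select fields)
ops = {"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}
transformed = transform(data, ops)

# 3. Validate transformed data
rules = {"name": {"type": "string", "required": True}, "email": {"type": "string", "required": True}}
assert validate(transformed, rules)["valid"]

# 4. Load to CSV (file_format="csv" assumed by analogy with the JSON example)
load(transformed, "file", "temp/sample_output.csv", file_format="csv")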

Format Overrides

--source-format and --target-format override whichever format would normally be inferred from a file extension. This is useful when an input lacks an extension (for example, records.txt that actually contains CSV) or when you intentionally want to treat a file as another format.

Examples (zsh):

# Force CSV parsing for an extension-less file
etlplus extract data.txt --source-type file --source-format csv

# Write CSV to a file without the .csv suffix
etlplus load output.bin --target-type file --target-format csv < data.json

# Leave the flags off when extensions already match the desired format
etlplus extract data.csv --source-type file
etlplus load data.json --target-type file  < data.json

Transformation Operations

Filter Operations

Supported operators:

  • eq: Equal
  • ne: Not equal
  • gt: Greater than
  • gte: Greater than or equal
  • lt: Less than
  • lte: Less than or equal
  • in: Value in list
  • contains: List/string contains value

Example:

{
  "filter": {
    "field": "status",
    "op": "in",
    "value": ["active", "pending"]
  }
}
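
To pin down the operator semantics, here is a plain-Python sketch of how each documented operator compares a record's field value against the rule's value (an illustration of the documented behavior, not ETLPlus internals):

import operator

# Plain-Python reading of the documented filter operators
# (illustrative only -- not ETLPlus's actual implementation).
FILTER_OPS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "in": lambda field_value, value: field_value in value,        # value in list
    "contains": lambda field_value, value: value in field_value,  # list/string contains value
}

spec = {"field": "status", "op": "in", "value": ["active", "pending"]}
record = {"status": "active"}
assert FILTER_OPS[spec["op"]](record[spec["field"]], spec["value"])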

Aggregation Functions

Supported functions:

  • sum: Sum of values
  • avg: Average of values
  • min: Minimum value
  • max: Maximum value
  • count: Count of values

Example:

{
  "aggregate": {
    "field": "revenue",
    "func": "sum"
  }
}
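
As a plain-Python illustration of the documented functions applied to a field (again a sketch; the library's actual output shape may differ):

# Aggregating the "revenue" field across records (illustrative only).
records = [{"revenue": 100}, {"revenue": 250}, {"revenue": 50}]
values = [r["revenue"] for r in records]

aggregates = {
    "sum": sum(values),                # 400
    "avg": sum(values) / len(values),  # 133.33...
    "min": min(values),                # 50
    "max": max(values),                # 250
    "count": len(values),              # 3
}
print(aggregates)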

Validation Rules

Supported validation rules:

  • type: Data type (string, number, integer, boolean, array, object)
  • required: Field is required (true/false)
  • min: Minimum value for numbers
  • max: Maximum value for numbers
  • minLength: Minimum length for strings
  • maxLength: Maximum length for strings
  • pattern: Regex pattern for strings
  • enum: List of allowed values

Example:

{
  "email": {
    "type": "string",
    "required": true,
    "pattern": "^[\\w.-]+@[\\w.-]+\\.\\w+$"
  },
  "age": {
    "type": "number",
    "min": 0,
    "max": 150
  },
  "status": {
    "type": "string",
    "enum": ["active", "inactive", "pending"]
  }
}
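
These rules can be passed straight to the validate function from the Python API. A short sketch, mirroring the CLI example above that validates a single JSON object (the sample record here is illustrative):

from etlplus import validate

rules = {
    "email": {"type": "string", "required": True, "pattern": r"^[\w.-]+@[\w.-]+\.\w+$"},
    "age": {"type": "number", "min": 0, "max": 150},
    "status": {"type": "string", "enum": ["active", "inactive", "pending"]},
}

record = {"email": "jane@example.com", "age": 34, "status": "active"}
result = validate(record, rules)
print(result["valid"])  # True for this record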

Development

API Client Docs

Looking for the HTTP client and pagination helpers? See the dedicated docs in etlplus/api/README.md for:

  • Quickstart with EndpointClient
  • Authentication via EndpointCredentialsBearer
  • Pagination with PaginationConfig (page and cursor styles)
  • Tips on records_path and cursor_path

Runner Internals and Connectors

Curious how the pipeline runner composes API requests, pagination, and load calls?

  • Runner overview and helpers: docs/run-module.md
  • Unified "connector" vocabulary (API/File/DB): etlplus/config/connector.py
    • API/file targets reuse the same shapes as sources; API targets typically set a method.

Running Tests

pytest tests/ -v

Test Layers

We split tests into two layers:

  • Unit (tests/unit/): single function or class, no real I/O, fast, uses stubs/monkeypatch (e.g. etlplus.cli.create_parser, transform + validate helpers).
  • Integration (tests/integration/): end-to-end flows (CLI main(), pipeline run(), pagination + rate limit defaults, file/API connector interactions) may touch temp files and use fake clients.

If a test calls etlplus.cli.main() or etlplus.run.run() it’s integration by default. Full criteria: CONTRIBUTING.md#testing.
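
For instance, a unit-layer test exercises one helper on in-memory data with no real I/O. A hypothetical sketch using the public transform function (not an actual test from tests/unit/):

from etlplus import transform

def test_filter_keeps_matching_records():
    # Unit layer: single function, in-memory data, no files or network.
    data = [{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]
    ops = {"filter": {"field": "age", "op": "gt", "value": 26}}
    assert transform(data, ops) == [{"name": "John", "age": 30}]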

Code Coverage

pytest tests/ --cov=etlplus --cov-report=html

Linting

flake8 etlplus/
black etlplus/

Updating Demo Snippets

DEMO.md shows the real output of etlplus --version captured from a freshly built wheel. Regenerate the snippet (and the companion file docs/snippets/installation_version.md) after changing anything that affects the version string:

make demo-snippets

The helper script in tools/update_demo_snippets.py builds the wheel, installs it into a throwaway virtual environment, runs etlplus --version, and rewrites the snippet between the markers in DEMO.md.

Releasing to PyPI

setuptools-scm derives the package version from Git tags, so publishing is now entirely tag driven; no hand-editing of pyproject.toml, setup.py, or etlplus/__version__.py is required.

  1. Ensure main is green and the changelog/docs are up to date.
  2. Create and push a SemVer tag matching the v*.*.* pattern:
git tag -a v1.4.0 -m "Release v1.4.0"
git push origin v1.4.0
  3. GitHub Actions fetches tags, builds the sdist/wheel, and publishes to PyPI via the publish job in .github/workflows/ci.yml.

If you want an extra smoke-test before tagging, run make dist && pip install dist/*.whl locally; this exercises the same build path the workflow uses.

License

This project is licensed under the MIT License.

Contributing

Code and codeless contributions are welcome! If you’d like to add a new feature, fix a bug, or improve the documentation, please feel free to submit a pull request as follows:

  1. Fork this repository.
  2. Create a new feature branch for your changes (git checkout -b feature/feature-name).
  3. Commit your changes (git commit -m "Add feature").
  4. Push to your branch (git push origin feature/feature-name).
  5. Submit a pull request with a detailed description.

If you choose to be a code contributor, please first refer to the contributing documents (see CONTRIBUTING.md).

Acknowledgments

ETLPlus is inspired by common work patterns in data engineering and software engineering patterns in Python development, aiming to increase productivity and reduce boilerplate code. Feedback and contributions are always appreciated!
