
A Swiss Army knife for simple ETL operations

Project description

ETLPlus


ETLPlus is a Swiss Army knife for simple ETL operations, offering both a Python package and a command-line interface for data extraction, validation, transformation, and loading.

Features

  • Extract data from multiple sources:

    • Files (CSV, JSON, XML, YAML)
    • Databases (connection string support)
    • REST APIs (GET)
  • Validate data with flexible rules:

    • Type checking
    • Required fields
    • Value ranges (min/max)
    • String length constraints
    • Pattern matching
    • Enum validation
  • Transform data with powerful operations:

    • Filter records
    • Map/rename fields
    • Select specific fields
    • Sort data
    • Aggregate functions (avg, count, max, min, sum)
  • Load data to multiple targets:

    • Files (CSV, JSON, XML, YAML)
    • Databases (connection string support)
    • REST APIs (PATCH, POST, PUT)

Installation

pip install etlplus

For development:

pip install -e ".[dev]"

Quickstart

Get up and running in under a minute.

Command line interface:

# Inspect help and version
etlplus --help
etlplus --version

# One-liner: extract CSV, filter, select, and write JSON
etlplus extract file examples/data/sample.csv \
  | etlplus transform - --operations '{"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}' \
  -o temp/sample_output.json

Python API:

from etlplus import extract, transform, validate, load

data = extract("file", "input.csv")
ops = {"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}
filtered = transform(data, ops)
rules = {"name": {"type": "string", "required": True}, "email": {"type": "string", "required": True}}
assert validate(filtered, rules)["valid"]
load(filtered, "file", "temp/sample_output.json", file_format="json")

Usage

Command Line Interface

ETLPlus provides a powerful CLI for ETL operations:

# Show help
etlplus --help

# Show version
etlplus --version

Extract Data

Note: For file sources, the format is inferred from the filename extension; the --format option is ignored. To treat passing --format as an error for file sources, either set ETLPLUS_FORMAT_BEHAVIOR=error or pass the CLI flag --strict-format.

Extract from JSON file:

etlplus extract file examples/data/sample.json

Extract from CSV file:

etlplus extract file examples/data/sample.csv

Extract from XML file:

etlplus extract file examples/data/sample.xml

Extract from REST API:

etlplus extract api https://api.example.com/data

Save extracted data to file:

etlplus extract file examples/data/sample.csv -o temp/sample_output.json

Validate Data

Validate data from file or JSON string:

etlplus validate '{"name": "John", "age": 30}' --rules '{"name": {"type": "string", "required": true}, "age": {"type": "number", "min": 0, "max": 150}}'

Validate from file:

etlplus validate examples/data/sample.json --rules '{"email": {"type": "string", "pattern": "^[\\w.-]+@[\\w.-]+\\.\\w+$"}}'

Transform Data

Filter and select fields:

etlplus transform '[{"name": "John", "age": 30}, {"name": "Jane", "age": 25}]' \
  --operations '{"filter": {"field": "age", "op": "gt", "value": 26}, "select": ["name"]}'

Sort data:

etlplus transform examples/data/sample.json --operations '{"sort": {"field": "age", "reverse": true}}'

Aggregate data:

etlplus transform examples/data/sample.json --operations '{"aggregate": {"field": "age", "func": "sum"}}'

Map/rename fields:

etlplus transform examples/data/sample.json --operations '{"map": {"name": "new_name"}}'

Load Data

Load to JSON file:

etlplus load '{"name": "John", "age": 30}' file temp/sample_output.json

Load to CSV file:

etlplus load '[{"name": "John", "age": 30}]' file temp/sample_output.csv

Load to REST API:

etlplus load examples/data/sample.json api https://api.example.com/endpoint

Python API

Use ETLPlus as a Python library:

from etlplus import extract, validate, transform, load

# Extract data
data = extract("file", "data.json")

# Validate data
validation_rules = {
    "name": {"type": "string", "required": True},
    "age": {"type": "number", "min": 0, "max": 150}
}
result = validate(data, validation_rules)
if result["valid"]:
    print("Data is valid!")

# Transform data
operations = {
    "filter": {"field": "age", "op": "gt", "value": 18},
    "select": ["name", "email"]
}
transformed = transform(data, operations)

# Load data
load(transformed, "file", "temp/sample_output.json", file_format="json")
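
The four calls compose naturally. Here is a minimal sketch of a reusable pipeline helper built only from the functions shown above (the early exit on invalid data is our design choice, not an ETLPlus requirement):

from etlplus import extract, validate, transform, load

def run_pipeline(source_path, target_path, rules, operations):
    # Extract, transform, then gate the load on a passing validation.
    data = extract("file", source_path)
    transformed = transform(data, operations)
    result = validate(transformed, rules)
    if not result["valid"]:
        raise ValueError("validation failed; aborting load")
    load(transformed, "file", target_path, file_format="json")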

For YAML-driven pipelines executed end-to-end (extract → validate → transform → load), see:

Complete ETL Pipeline Example

# 1. Extract from CSV
etlplus extract file examples/data/sample.csv -o temp/sample_extracted.json

# 2. Transform (filter and select fields)
etlplus transform temp/sample_extracted.json \
  --operations '{"filter": {"field": "age", "op": "gt", "value": 25}, "select": ["name", "email"]}' \
  -o temp/sample_transformed.json

# 3. Validate transformed data
etlplus validate temp/sample_transformed.json \
  --rules '{"name": {"type": "string", "required": true}, "email": {"type": "string", "required": true}}'

# 4. Load to CSV
etlplus load temp/sample_transformed.json file temp/sample_output.csv

Environment Variables

ETLPlus honors a small number of environment toggles to refine CLI behavior:

  • ETLPLUS_FORMAT_BEHAVIOR: controls what happens when --format is provided for file sources or targets (extract/load) where the format is inferred from the filename extension.
    • error|fail|strict: treat as error (non-zero exit)
    • warn (default): print a warning to stderr
    • ignore|silent: no message
  • Precedence: the CLI flag --strict-format overrides the environment.

Examples (zsh):

# Warn (default)
etlplus extract file data.csv --format csv
etlplus load data.json file out.csv --format csv

# Enforce error via environment
ETLPLUS_FORMAT_BEHAVIOR=error \
  etlplus extract file data.csv --format csv
ETLPLUS_FORMAT_BEHAVIOR=error \
  etlplus load data.json file out.csv --format csv

# Equivalent strict behavior via flag (overrides environment)
etlplus extract file data.csv --format csv --strict-format
etlplus load data.json file out.csv --format csv --strict-format

# Recommended: rely on extension, no --format needed for files
etlplus extract file data.csv
etlplus load data.json file out.csv

Transformation Operations

Filter Operations

Supported operators:

  • eq: Equal
  • ne: Not equal
  • gt: Greater than
  • gte: Greater than or equal
  • lt: Less than
  • lte: Less than or equal
  • in: Value in list
  • contains: List/string contains value

Example:

{
  "filter": {
    "field": "status",
    "op": "in",
    "value": ["active", "pending"]
  }
}
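
The difference between in and contains is easy to mix up. In plain-Python terms (illustrative, not ETLPlus internals):

# "in": the record's field value must be a member of the supplied list.
record = {"status": "active", "tags": ["etl", "python"]}
assert record["status"] in ["active", "pending"]

# "contains": the record's field (a list or string) must contain the supplied value.
assert "etl" in record["tags"]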

Aggregation Functions

Supported functions:

  • sum: Sum of values
  • avg: Average of values
  • min: Minimum value
  • max: Maximum value
  • count: Count of values

Example:

{
  "aggregate": {
    "field": "revenue",
    "func": "sum"
  }
}
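
Conceptually, the aggregation reduces one field across all records. A plain-Python equivalent (illustrative only):

records = [{"revenue": 100}, {"revenue": 250}, {"revenue": 75}]
total = sum(record["revenue"] for record in records)  # "func": "sum" over "field": "revenue"
# total == 425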

Validation Rules

Supported validation rules:

  • type: Data type (string, number, integer, boolean, array, object)
  • required: Field is required (true/false)
  • min: Minimum value for numbers
  • max: Maximum value for numbers
  • minLength: Minimum length for strings
  • maxLength: Maximum length for strings
  • pattern: Regex pattern for strings
  • enum: List of allowed values

Example:

{
  "email": {
    "type": "string",
    "required": true,
    "pattern": "^[\\w.-]+@[\\w.-]+\\.\\w+$"
  },
  "age": {
    "type": "number",
    "min": 0,
    "max": 150
  },
  "status": {
    "type": "string",
    "enum": ["active", "inactive", "pending"]
  }
}
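
These rules plug directly into the validate() call shown earlier. A minimal sketch (the sample record is made up for illustration):

from etlplus import validate

rules = {
    "email": {"type": "string", "required": True, "pattern": r"^[\w.-]+@[\w.-]+\.\w+$"},
    "age": {"type": "number", "min": 0, "max": 150},
    "status": {"type": "string", "enum": ["active", "inactive", "pending"]},
}
record = {"email": "jane@example.com", "age": 30, "status": "active"}
assert validate(record, rules)["valid"]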

Development

API Client Docs

Looking for the HTTP client and pagination helpers? See the dedicated docs in etlplus/api/README.md for:

  • Quickstart with EndpointClient
  • Authentication via EndpointCredentialsBearer
  • Pagination with PaginationConfig (page and cursor styles)
  • Tips on records_path and cursor_path

Runner Internals and Connectors

Curious how the pipeline runner composes API requests, pagination, and load calls?

  • Runner overview and helpers: docs/run-module.md
  • Unified "connector" vocabulary (API/File/DB): etlplus/config/connector.py
    • API/file targets reuse the same shapes as sources; API targets typically set a method.

Running Tests

pytest tests/ -v

Test Layers

We split tests into two layers:

  • Unit (tests/unit/): single function or class, no real I/O, fast, uses stubs/monkeypatch (e.g. etlplus.cli.create_parser, transform + validate helpers).
  • Integration (tests/integration/): end-to-end flows (CLI main(), pipeline run(), pagination + rate-limit defaults, file/API connector interactions); may touch temp files and use fake clients.

If a test calls etlplus.cli.main() or etlplus.run.run(), it’s integration by default. Full criteria: CONTRIBUTING.md#testing.
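
As a concrete illustration of the unit layer, a test like the following touches only etlplus.cli.create_parser (named above) and no real I/O; the test name and assertion are ours, not from the suite:

from etlplus.cli import create_parser

def test_create_parser_builds():
    # Unit layer: a single function, no files, no network.
    parser = create_parser()
    assert parser is not None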

Code Coverage

pytest tests/ --cov=etlplus --cov-report=html

Linting

flake8 etlplus/
black etlplus/

Releasing to PyPI

Maintainers build releases from the repository root using the pyproject.toml configuration:

make dist          # build sdist + wheel into ./dist and run twine check

Then upload the artifacts in dist/ with twine (installed by make dist):

export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."  # your PyPI API token
python -m twine upload dist/*

License

This project is licensed under the MIT License.

Contributing

Code and codeless contributions are welcome! If you’d like to add a new feature, fix a bug, or improve the documentation, please feel free to submit a pull request as follows:

  1. Fork this repository.
  2. Create a new feature branch for your changes (git checkout -b feature/feature-name).
  3. Commit your changes (git commit -m "Add feature").
  4. Push to your branch (git push origin feature/feature-name).
  5. Submit a pull request with a detailed description.

If you choose to be a code contributor, please first refer to these documents:

Acknowledgments

ETLPlus is inspired by common patterns in data engineering and Python software development, aiming to increase productivity and reduce boilerplate code. Feedback and contributions are always appreciated!


Download files

Download the file for your platform.

Source Distribution

etlplus-0.3.3.tar.gz (96.9 kB)


Built Distribution


etlplus-0.3.3-py3-none-any.whl (114.8 kB)


File details

Details for the file etlplus-0.3.3.tar.gz.

File metadata

  • Download URL: etlplus-0.3.3.tar.gz
  • Upload date:
  • Size: 96.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for etlplus-0.3.3.tar.gz

  • SHA256: d32ce9a864759fbb287e0e85682a074697a4aa850473accb50a6959380af5169
  • MD5: a4d7824d4621927036c31eafa2f02f36
  • BLAKE2b-256: f548ddf93096c344b3b8f22412747d699b2c1b8fb9c2eec0c686526605edfcb1


Provenance

The following attestation bundles were made for etlplus-0.3.3.tar.gz:

Publisher: ci.yml on Dagitali/ETLPlus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file etlplus-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: etlplus-0.3.3-py3-none-any.whl
  • Upload date:
  • Size: 114.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for etlplus-0.3.3-py3-none-any.whl

  • SHA256: 92bf7036d19cb1a8ba0a0950df85ad9e4b3a55f51141d4951340c81be80195ba
  • MD5: d8a396c722220467faf50a2228c61114
  • BLAKE2b-256: e46858827604a366d330c21645f77422bed0a22f31054b8994aa61013fa9c242


Provenance

The following attestation bundles were made for etlplus-0.3.3-py3-none-any.whl:

Publisher: ci.yml on Dagitali/ETLPlus

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
