A Python library for generating synthetic test data and validating ETL outputs.

Maintainers

kkartas

ETLForge

A Python library for generating synthetic test data and validating ETL (Extract, Transform, Load) outputs. ETL processes are fundamental data workflows that extract data from various sources, transform it according to business rules, and load it into target systems like data warehouses or databases. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality throughout your ETL pipelines.

Features

Test Data Generator

Generate synthetic data based on YAML/JSON schema definitions
Support for multiple data types: int, float, string, date, category
Advanced constraints: ranges, uniqueness, nullable fields, categorical values
Integration with Faker for realistic string generation
Export to CSV or Excel formats

Data Validator

Validate CSV/Excel files against schema definitions
Comprehensive validation checks:
- Column existence
- Data type matching
- Value constraints (ranges, categories)
- Uniqueness validation
- Null value validation
- Date format validation
Generate detailed reports of invalid rows
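
The checks listed above can be illustrated with a small standard-library sketch. This is not ETLForge's implementation: the helper check_rows, the PY_TYPES mapping, and the error message wording are hypothetical, and rows are plain dictionaries rather than CSV/Excel data.

```python
from typing import Any

# Illustrative schema in the same shape as the examples in this README.
SCHEMA = {
    "fields": [
        {"name": "id", "type": "int", "unique": True, "range": {"min": 1, "max": 100}},
        {"name": "tier", "type": "category", "values": ["Bronze", "Silver"], "nullable": True},
    ]
}

# Assumed mapping from schema type names to Python types (date is omitted here).
PY_TYPES = {"int": int, "float": (int, float), "string": str, "category": str}


def check_rows(rows: list[dict[str, Any]], schema: dict) -> list[str]:
    """Return error messages for rows that violate the schema (hypothetical helper)."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        seen = set()
        for i, row in enumerate(rows):
            if name not in row:  # column existence
                errors.append(f"row {i}: missing column {name!r}")
                continue
            value = row[name]
            if value is None:  # null value validation
                if not field.get("nullable", False):
                    errors.append(f"row {i}: {name} must not be null")
                continue
            # data type matching (bool is rejected even though it subclasses int)
            if isinstance(value, bool) or not isinstance(value, PY_TYPES[field["type"]]):
                errors.append(f"row {i}: {name} has wrong type")
                continue
            rng = field.get("range")
            if rng and not (rng["min"] <= value <= rng["max"]):  # range constraint
                errors.append(f"row {i}: {name}={value} is out of range")
            if "values" in field and value not in field["values"]:  # category constraint
                errors.append(f"row {i}: {name}={value!r} is not an allowed value")
            if field.get("unique"):  # uniqueness validation
                if value in seen:
                    errors.append(f"row {i}: {name}={value} is duplicated")
                seen.add(value)
    return errors


rows = [
    {"id": 1, "tier": "Gold"},    # category not allowed
    {"id": 1, "tier": None},      # duplicate id, null tier is allowed
    {"id": 500, "tier": "Bronze"},  # id out of range
    {"tier": "Silver"},           # id column missing
]
for message in check_rows(rows, SCHEMA):
    print(message)
```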

Dual Interface

Command-line interface for quick operations
Python library for integration into existing workflows

Installation

Prerequisites

Python 3.9 or higher
pip package manager

Install from PyPI (Recommended)

pip install etl-forge

Install from Source

For development or latest features:

git clone https://github.com/kkartas/ETLForge.git
cd ETLForge
pip install -e ".[dev]"

Dependencies

Core dependencies (6 total, automatically installed):

pandas>=1.3.0 - Data manipulation and analysis
pyyaml>=5.4.0 - YAML parsing for schema files
click>=8.0.0 - Command-line interface framework
openpyxl>=3.0.0 - Excel file support
numpy>=1.21.0 - Numerical computing
psutil>=5.9.0 - System monitoring for benchmarks

Optional dependencies for enhanced features:

# For realistic data generation using Faker templates
pip install "etl-forge[faker]"

# For development (testing, linting, documentation)
pip install "etl-forge[dev]"

Verify Installation

# CLI verification (may require adding Scripts directory to PATH on Windows)
etl-forge --version

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli --version

# Library verification
python -c "from etl_forge import DataGenerator, DataValidator; print('Installation verified')"

CLI Access Note

On some systems (especially Windows), the etl-forge command may not be directly accessible. In such cases, use:

python -m etl_forge.cli [command] [options]

Complete Example

For a comprehensive demonstration of ETLForge's capabilities, see the included example.py file:

# Run the complete example
python example.py

This example demonstrates:

Schema-driven data generation with realistic data (using Faker)
Data validation with the same schema
Error detection and reporting
Complete ETL testing workflow

Key snippet from example.py:

from etl_forge import DataGenerator, DataValidator

# Single schema drives both generation and validation
schema = {
    "fields": [
        {"name": "customer_id", "type": "int", "unique": True, "range": {"min": 1, "max": 10000}},
        {"name": "name", "type": "string", "faker_template": "name"},
        {"name": "email", "type": "string", "unique": True, "faker_template": "email"},
        {"name": "purchase_amount", "type": "float", "range": {"min": 10.0, "max": 5000.0}, "nullable": True},
        {"name": "customer_tier", "type": "category", "values": ["Bronze", "Silver", "Gold", "Platinum"]}
    ]
}

# Generate test data
generator = DataGenerator(schema)
df = generator.generate_data(1000)
generator.save_data(df, 'customer_test_data.csv')

# Validate with the same schema
validator = DataValidator(schema)
result = validator.validate('customer_test_data.csv')
print(f"Validation passed: {result.is_valid}")

This demonstrates ETLForge's key advantage: a single schema serves two purposes. The same schema definition drives both data generation and validation, which keeps the test data and the validation rules in sync.

Quick Start

1. Create a Schema

Create a schema.yaml file defining your data structure:

fields:
  - name: id
    type: int
    unique: true
    nullable: false
    range:
      min: 1
      max: 10000

  - name: name
    type: string
    nullable: false
    faker_template: name

  - name: department
    type: category
    nullable: false
    values:
      - Engineering
      - Marketing
      - Sales

2. Generate Test Data

Command Line:

# Direct CLI command (if available)
etl-forge generate --schema schema.yaml --rows 500 --output sample.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv

Python Library:

from etl_forge import DataGenerator

generator = DataGenerator('schema.yaml')
df = generator.generate_data(500)
generator.save_data(df, 'sample.csv')

3. Validate Data

Command Line:

# Direct CLI command (if available)
etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check --input sample.csv --schema schema.yaml --report invalid_rows.csv

Python Library:

from etl_forge import DataValidator

validator = DataValidator('schema.yaml')
result = validator.validate('sample.csv')
print(f"Validation passed: {result.is_valid}")

Schema Definition

Supported Field Types

Integer (`int`)

- name: age
  type: int
  nullable: false
  range:
    min: 18
    max: 65
  unique: false
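
When unique: true is combined with a range, the range must contain at least as many integers as the number of rows requested. A minimal sketch of that constraint, using sampling without replacement; the helper unique_ints is hypothetical and does not describe how ETLForge generates values internally.

```python
import random


def unique_ints(field: dict, n: int, rng: random.Random) -> list[int]:
    """Draw n distinct integers from the field's inclusive range (hypothetical helper)."""
    low, high = field["range"]["min"], field["range"]["max"]
    population = range(low, high + 1)
    if n > len(population):
        raise ValueError(
            f"cannot draw {n} unique values from a range of {len(population)} integers"
        )
    # sample() never repeats an element, so uniqueness holds by construction.
    return rng.sample(population, n)


id_field = {"name": "id", "type": "int", "unique": True, "range": {"min": 1, "max": 10000}}
ids = unique_ints(id_field, 500, random.Random(0))
print(len(ids), len(set(ids)))
```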

Float (`float`)

- name: salary
  type: float
  nullable: true
  range:
    min: 30000.0
    max: 150000.0
  precision: 2
  null_rate: 0.1

String (`string`)

- name: email
  type: string
  nullable: false
  unique: true
  length:
    min: 10
    max: 50
  faker_template: email  # Optional: uses Faker library

Date (`date`)

- name: hire_date
  type: date
  nullable: false
  range:
    start: '2020-01-01'
    end: '2024-12-31'
  format: '%Y-%m-%d'
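
The format value is a strftime/strptime pattern. The sketch below shows how such a pattern can be used both to render a random date inside the range and to check that a string matches the pattern. The helper names are hypothetical, and it assumes the range bounds are always written as ISO dates (YYYY-MM-DD) regardless of format.

```python
import random
from datetime import datetime, timedelta

hire_date = {
    "name": "hire_date",
    "type": "date",
    "range": {"start": "2020-01-01", "end": "2024-12-31"},
    "format": "%Y-%m-%d",
}


def random_date_string(field: dict, rng: random.Random) -> str:
    """Pick a date in the inclusive range and render it with the field's format."""
    fmt = field.get("format", "%Y-%m-%d")
    # Assumption: range bounds are ISO dates, independent of the output format.
    start = datetime.strptime(field["range"]["start"], "%Y-%m-%d").date()
    end = datetime.strptime(field["range"]["end"], "%Y-%m-%d").date()
    offset = rng.randint(0, (end - start).days)
    return (start + timedelta(days=offset)).strftime(fmt)


def matches_format(text: str, fmt: str) -> bool:
    """Return True if text parses with the given strptime pattern."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


sample_date = random_date_string(hire_date, random.Random(0))
print(sample_date, matches_format(sample_date, hire_date["format"]))
```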

Category (`category`)

- name: status
  type: category
  nullable: false
  values:
    - Active
    - Inactive
    - Pending

Schema Constraints

nullable: Allow null values (default: false)
unique: Ensure all values are unique (default: false)
range: Define min/max values for numeric types or start/end dates
values: List of allowed values for categorical fields
length: Min/max length for string fields
precision: Decimal places for float fields
format: Date format string (default: '%Y-%m-%d')
faker_template: Faker method name for realistic string generation
null_rate: Probability of null values when nullable: true (default: 0.1)
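
To make the interaction between nullable, null_rate, range, and precision concrete, here is a minimal sketch of one plausible interpretation for a float field. The function generate_float is hypothetical; ETLForge's own generator may differ in detail.

```python
import random
from typing import Optional


def generate_float(field: dict, rng: random.Random) -> Optional[float]:
    """Generate one value for a float field (hypothetical helper)."""
    # null_rate only applies when the field is nullable; 0.1 is the documented default.
    if field.get("nullable", False) and rng.random() < field.get("null_rate", 0.1):
        return None
    value = rng.uniform(field["range"]["min"], field["range"]["max"])
    # precision is the number of decimal places to keep.
    return round(value, field["precision"]) if "precision" in field else value


salary_field = {
    "name": "salary",
    "type": "float",
    "nullable": True,
    "range": {"min": 30000.0, "max": 150000.0},
    "precision": 2,
    "null_rate": 0.1,
}

rng = random.Random(42)
values = [generate_float(salary_field, rng) for _ in range(1000)]
null_count = sum(v is None for v in values)
print(f"{null_count} of {len(values)} values are null")
```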

Command Line Interface

Generate Data

# Direct CLI command (if available)
etl-forge generate [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate [OPTIONS]

Options:
  -s, --schema PATH     Path to schema file (YAML or JSON) [required]
  -r, --rows INTEGER    Number of rows to generate (default: 100)
  -o, --output PATH     Output file path (CSV or Excel) [required]
  -f, --format [csv|excel]  Output format (auto-detected if not specified)
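
The --format option is described as auto-detected when omitted. The README does not state the detection rule, so the sketch below shows one plausible rule based on the output file extension. The function resolve_format and the extension mapping are assumptions for illustration only.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping from file extension to the CLI's format names.
EXTENSION_FORMATS = {".csv": "csv", ".xlsx": "excel", ".xls": "excel"}


def resolve_format(output: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if given, else infer it from the extension."""
    if explicit is not None:
        return explicit
    suffix = Path(output).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"cannot infer output format from {output!r}; pass --format")
    return EXTENSION_FORMATS[suffix]


print(resolve_format("sample.csv"))
print(resolve_format("report.XLSX"))
print(resolve_format("data.out", explicit="csv"))
```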

Validate Data

# Direct CLI command (if available)
etl-forge check [OPTIONS]

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check [OPTIONS]

Options:
  -i, --input PATH      Path to input data file [required]
  -s, --schema PATH     Path to schema file [required]
  -r, --report PATH     Path to save invalid rows report (optional)
  -v, --verbose         Show detailed validation errors

Create Example Schema

# Direct CLI command (if available)
etl-forge create-schema example_schema.yaml

# Alternative CLI access (works on all platforms)
python -m etl_forge.cli create-schema example_schema.yaml

Library Usage

Data Generation

from etl_forge import DataGenerator

# Initialize with schema
generator = DataGenerator('schema.yaml')

# Generate data
df = generator.generate_data(1000)

# Save to file
generator.save_data(df, 'output.csv')

# Or do both in one step
df = generator.generate_and_save(1000, 'output.xlsx', 'excel')

Data Validation

from etl_forge import DataValidator

# Initialize validator
validator = DataValidator('schema.yaml')

# Validate data
result = validator.validate('data.csv')

# Check results
if result.is_valid:
    print("Data is valid!")
else:
    print(f"Found {len(result.errors)} validation errors")
    print(f"Invalid rows: {len(result.invalid_rows)}")

# Generate report
result = validator.validate_and_report('data.csv', 'errors.csv')

# Print summary
validator.print_validation_summary(result)
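
In a CI job it is often convenient to turn the validation outcome into a process exit code. The sketch below relies only on the is_valid attribute shown above; the helper exit_code_for is hypothetical, and a SimpleNamespace stands in for a real result so the snippet runs without ETLForge installed.

```python
from types import SimpleNamespace


def exit_code_for(result) -> int:
    """Map a validation result to a shell exit code: 0 on success, 1 on failure."""
    return 0 if result.is_valid else 1


# Stand-ins for the objects returned by validator.validate(...).
passing = SimpleNamespace(is_valid=True)
failing = SimpleNamespace(is_valid=False)
print(exit_code_for(passing), exit_code_for(failing))

# With ETLForge installed, the same helper could be used as:
#   import sys
#   sys.exit(exit_code_for(validator.validate('data.csv')))
```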

Advanced Usage

import pandas as pd
from etl_forge import DataGenerator, DataValidator

# Use schema as dictionary
schema_dict = {
    'fields': [
        {'name': 'id', 'type': 'int', 'unique': True},
        {'name': 'name', 'type': 'string', 'faker_template': 'name'}
    ]
}

generator = DataGenerator(schema_dict)
validator = DataValidator(schema_dict)

# Validate a DataFrame directly instead of a file path
df = pd.read_csv('data.csv')
result = validator.validate(df)

Faker Integration

When the Faker library is installed (pip install "etl-forge[faker]"), set faker_template on a string field to generate realistic values:

- name: first_name
  type: string
  faker_template: first_name

- name: address
  type: string
  faker_template: address

- name: phone
  type: string
  faker_template: phone_number

Common Faker templates:

name, first_name, last_name
email, phone_number
address, city, country
company, job
date, time
Many more are available; see the Faker documentation for the full list of providers.
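
Each template name corresponds to a method on a Faker instance, for example Faker().first_name(). The sketch below shows how a template name could be resolved to such a method by name. The helper render_template is hypothetical and is not ETLForge's code; a stub provider is used so the snippet runs without Faker installed.

```python
def render_template(provider, template: str) -> str:
    """Call the provider method named by template and return its value as text."""
    method = getattr(provider, template, None)
    if not callable(method):
        raise ValueError(f"unknown Faker template: {template!r}")
    return str(method())


class StubProvider:
    """Stand-in with the same method-per-template shape as a Faker instance."""

    def first_name(self) -> str:
        return "Ada"

    def city(self) -> str:
        return "London"


stub = StubProvider()
print(render_template(stub, "first_name"), render_template(stub, "city"))

# With Faker installed, pass a real instance instead of the stub:
#   from faker import Faker
#   render_template(Faker(), "email")
```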

Testing

Run the test suite:

pytest tests/

Run with coverage:

pytest tests/ --cov=etl_forge --cov-report=html

Performance

Performance benchmarks are available in BENCHMARKS.md. To reproduce them, run:

python benchmark.py

Then, to visualize the results:

python plot_benchmark.py

Troubleshooting

Running Examples from Cloned Repository

If you have cloned the repository and running python example.py fails with ModuleNotFoundError: No module named 'yaml', Python is most likely importing the local etl_forge source directory rather than an installed package, so dependencies such as PyYAML are missing from the active environment.

Solution 1: Install in Development Mode (if you want to modify the source code)

git clone https://github.com/kkartas/ETLForge.git
cd ETLForge
pip install -e .  # Or pip install -e ".[faker]" for full features
python example.py

Solution 2: Use the PyPI Package (if you just want to run the example)

# Install from PyPI
pip install "etl-forge[faker]"

# Download and run the example from outside the repository
curl -O https://raw.githubusercontent.com/kkartas/ETLForge/main/example.py
python example.py

Common Issues

Issue: etl-forge command not found

Solution: Use python -m etl_forge.cli instead, or add Python's Scripts directory to PATH
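
To find out which directory needs to be on PATH, the standard library can report where the current interpreter installs console scripts. This is a minimal diagnostic sketch; note that installs made with pip install --user go to a different, per-user scripts directory than the one printed here.

```python
import shutil
import sysconfig

# Directory where this interpreter's default install scheme places console scripts.
scripts_dir = sysconfig.get_path("scripts")
print(f"Console scripts directory: {scripts_dir}")

# shutil.which returns the full path if the command is reachable via PATH, else None.
found = shutil.which("etl-forge")
if found:
    print(f"etl-forge found at: {found}")
else:
    print("etl-forge is not on PATH; add the directory above or use python -m etl_forge.cli")
```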

Issue: Faker templates not working

Solution: Install with Faker support: pip install "etl-forge[faker]"

Issue: Excel files not supported

Solution: openpyxl is a core dependency and should be installed automatically. If it is missing, run: pip install openpyxl

Citation

If you use ETLForge in your research or work, please cite it using the information in CITATION.cff.

Release history

1.1.1 - Feb 27, 2026
1.1.0 - Dec 12, 2025
1.0.4 - Nov 12, 2025 (this version)
1.0.3 - Jun 17, 2025
1.0.1 - Jun 17, 2025
1.0.0 - Jun 17, 2025

Download files

Source Distribution

etl_forge-1.0.4.tar.gz (24.8 kB view details)

Uploaded Nov 12, 2025 Source

Built Distribution

etl_forge-1.0.4-py3-none-any.whl (17.9 kB view details)

Uploaded Nov 12, 2025 Python 3

File details

Details for the file etl_forge-1.0.4.tar.gz.

File metadata

Download URL: etl_forge-1.0.4.tar.gz
Upload date: Nov 12, 2025
Size: 24.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for etl_forge-1.0.4.tar.gz
Algorithm	Hash digest
SHA256	`febe271c5cb79c33cc0b0811efd2ddcfaef557119b7743bd6c9ef1a1be4dd8dc`
MD5	`fbb8c107ecb2fb60a04617962b1dd2dc`
BLAKE2b-256	`c204a31af46fd7ab8ed1c0d396bf15d0a9d7ce1a4202b835f4fff85b025e72d7`
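
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; the helper name sha256_of is an illustration, and the file is assumed to have been downloaded into the current directory.

```python
import hashlib

# SHA256 digest published above for etl_forge-1.0.4.tar.gz.
EXPECTED_SHA256 = "febe271c5cb79c33cc0b0811efd2ddcfaef557119b7743bd6c9ef1a1be4dd8dc"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# After downloading the source distribution:
#   if sha256_of("etl_forge-1.0.4.tar.gz") != EXPECTED_SHA256:
#       raise SystemExit("checksum mismatch: do not install this file")
```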

Provenance

The following attestation bundles were made for etl_forge-1.0.4.tar.gz:

Publisher: publish.yml on kkartas/ETLForge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: etl_forge-1.0.4.tar.gz
- Subject digest: febe271c5cb79c33cc0b0811efd2ddcfaef557119b7743bd6c9ef1a1be4dd8dc
- Sigstore transparency entry: 697476109
- Sigstore integration time: Nov 12, 2025
Source repository:
- Permalink: kkartas/ETLForge@490163f4868d2e05538869e1f0070884fb4a5c7d
- Branch / Tag: refs/tags/V1.0.4
- Owner: https://github.com/kkartas
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@490163f4868d2e05538869e1f0070884fb4a5c7d
- Trigger Event: release

File details

Details for the file etl_forge-1.0.4-py3-none-any.whl.

File metadata

Download URL: etl_forge-1.0.4-py3-none-any.whl
Upload date: Nov 12, 2025
Size: 17.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for etl_forge-1.0.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`48ee92a7e49803b97fa2c5cbafb6837107f883a31dcc56f6f4f94d23579830ad`
MD5	`0c0ca2126c592a6a5ac0dd1e3c2a60a9`
BLAKE2b-256	`fd807eabe151d3238cd44e98197a1ad35a70d80b74f14708b85cfd6832f032bd`

Provenance

The following attestation bundles were made for etl_forge-1.0.4-py3-none-any.whl:

Publisher: publish.yml on kkartas/ETLForge

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: etl_forge-1.0.4-py3-none-any.whl
- Subject digest: 48ee92a7e49803b97fa2c5cbafb6837107f883a31dcc56f6f4f94d23579830ad
- Sigstore transparency entry: 697476118
- Sigstore integration time: Nov 12, 2025
Source repository:
- Permalink: kkartas/ETLForge@490163f4868d2e05538869e1f0070884fb4a5c7d
- Branch / Tag: refs/tags/V1.0.4
- Owner: https://github.com/kkartas
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@490163f4868d2e05538869e1f0070884fb4a5c7d
- Trigger Event: release
