ETLForge
A Python library for generating synthetic test data and validating ETL outputs. ETLForge provides both command-line tools and library functions to help you create realistic test datasets and validate data quality.
Features
🎲 Test Data Generator
- Generate synthetic data based on YAML/JSON schema definitions
- Support for multiple data types:
int, float, string, date, category
- Advanced constraints: ranges, uniqueness, nullable fields, categorical values
- Integration with Faker for realistic string generation
- Export to CSV or Excel formats
✅ Data Validator
- Validate CSV/Excel files against schema definitions
- Comprehensive validation checks:
- Column existence
- Data type matching
- Value constraints (ranges, categories)
- Uniqueness validation
- Null value validation
- Date format validation
- Generate detailed reports of invalid rows
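To illustrate what a check like date format validation involves, here is a minimal stand-alone sketch in plain Python. It is illustrative only, not ETLForge's implementation; the helper name and defaults are hypothetical:

```python
from datetime import datetime

def check_date(value, fmt="%Y-%m-%d", start="2020-01-01", end="2024-12-31"):
    """Return True if value parses with fmt and falls within [start, end]."""
    try:
        d = datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        return False  # wrong format, or not a string at all
    return datetime.strptime(start, fmt) <= d <= datetime.strptime(end, fmt)

print(check_date("2023-05-17"))   # True: valid format, in range
print(check_date("17/05/2023"))   # False: wrong format
```

A real validator applies a battery of such checks per column and collects the failing row indices for the report.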
🔧 Dual Interface
- Command-line interface for quick operations
- Python library for integration into existing workflows
Installation
Prerequisites
- Python 3.8 or higher
- pip package manager
Install from PyPI (Recommended)
pip install etl-forge
Install from Source
For development or latest features:
git clone https://github.com/kkartas/etl-forge.git
cd etl-forge
pip install -e ".[dev]"
Dependencies
Core dependencies (6 total, automatically installed):
- pandas>=1.3.0 - Data manipulation and analysis
- pyyaml>=5.4.0 - YAML parsing for schema files
- click>=8.0.0 - Command-line interface framework
- openpyxl>=3.0.0 - Excel file support
- numpy>=1.21.0 - Numerical computing
- psutil>=5.9.0 - System monitoring for benchmarks
Optional dependencies for enhanced features:
# For realistic data generation using Faker templates
pip install etl-forge[faker]
# For development (testing, linting, documentation)
pip install etl-forge[dev]
Verify Installation
# CLI verification (may require adding Scripts directory to PATH on Windows)
etl-forge --version
# Alternative CLI access (works on all platforms)
python -m etl_forge.cli --version
# Library verification
python -c "from etl_forge import DataGenerator, DataValidator; print('✅ Installation verified')"
CLI Access Note
On some systems (especially Windows), the etl-forge command may not be directly accessible. In such cases, use:
python -m etl_forge.cli [command] [options]
Quick Start
1. Create a Schema
Create a schema.yaml file defining your data structure:
fields:
- name: id
type: int
unique: true
nullable: false
range:
min: 1
max: 10000
- name: name
type: string
nullable: false
faker_template: name
- name: department
type: category
nullable: false
values:
- Engineering
- Marketing
- Sales
2. Generate Test Data
Command Line:
# Direct CLI command (if available)
etl-forge generate --schema schema.yaml --rows 500 --output sample.csv
# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate --schema schema.yaml --rows 500 --output sample.csv
Python Library:
from etl_forge import DataGenerator
generator = DataGenerator('schema.yaml')
df = generator.generate_data(500)
generator.save_data(df, 'sample.csv')
3. Validate Data
Command Line:
# Direct CLI command (if available)
etl-forge check --input sample.csv --schema schema.yaml --report invalid_rows.csv
# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check --input sample.csv --schema schema.yaml --report invalid_rows.csv
Python Library:
from etl_forge import DataValidator
validator = DataValidator('schema.yaml')
result = validator.validate('sample.csv')
print(f"Validation passed: {result.is_valid}")
Schema Definition
Supported Field Types
Integer (int)
- name: age
type: int
nullable: false
range:
min: 18
max: 65
unique: false
Float (float)
- name: salary
type: float
nullable: true
range:
min: 30000.0
max: 150000.0
precision: 2
null_rate: 0.1
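One way to read this definition: with null_rate: 0.1 roughly 10% of generated values are null, and the rest are floats drawn from the range and rounded to precision decimal places. A rough stand-alone sketch of that behavior (not ETLForge's actual generator):

```python
import random

def sample_salary(null_rate=0.1, lo=30000.0, hi=150000.0, precision=2):
    """Return None with probability null_rate, else a rounded uniform float."""
    if random.random() < null_rate:
        return None
    return round(random.uniform(lo, hi), precision)

random.seed(42)
values = [sample_salary() for _ in range(1000)]
print(sum(v is None for v in values), "nulls out of", len(values))
```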
String (string)
- name: email
type: string
nullable: false
unique: true
length:
min: 10
max: 50
faker_template: email # Optional: uses Faker library
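The length constraint means every generated string has a length in [10, 50]. Without a Faker template, a generator could fall back to random characters, roughly like this hypothetical sketch (a real generator honoring unique: true would also have to reject duplicates):

```python
import random
import string

def rand_string(min_len=10, max_len=50):
    """Random lowercase string whose length lies in [min_len, max_len]."""
    n = random.randint(min_len, max_len)
    return "".join(random.choices(string.ascii_lowercase, k=n))

random.seed(7)
samples = [rand_string() for _ in range(100)]
print(samples[0])
```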
Date (date)
- name: hire_date
type: date
nullable: false
range:
start: '2020-01-01'
end: '2024-12-31'
format: '%Y-%m-%d'
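Conceptually, generating such a field amounts to drawing a date uniformly from [start, end] and rendering it with the format string. A minimal illustrative sketch (function name and defaults are invented for this example):

```python
import random
from datetime import date, timedelta

def rand_date(start="2020-01-01", end="2024-12-31", fmt="%Y-%m-%d"):
    """Uniform random date in [start, end], returned as a formatted string."""
    d0 = date.fromisoformat(start)
    d1 = date.fromisoformat(end)
    offset = random.randrange((d1 - d0).days + 1)
    return (d0 + timedelta(days=offset)).strftime(fmt)

random.seed(3)
print(rand_date())
```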
Category (category)
- name: status
type: category
nullable: false
values:
- Active
- Inactive
- Pending
Schema Constraints
- nullable: Allow null values (default: false)
- unique: Ensure all values are unique (default: false)
- range: Define min/max values for numeric types or start/end dates
- values: List of allowed values for categorical fields
- length: Min/max length for string fields
- precision: Decimal places for float fields
- format: Date format string (default: '%Y-%m-%d')
- faker_template: Faker method name for realistic string generation
- null_rate: Probability of null values when nullable: true (default: 0.1)
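Conceptually, a validator applies these constraints field by field. The hypothetical checker below (not ETLForge's API) shows how nullable, range, and values might combine for a single value:

```python
def check_value(value, field):
    """Return True if value satisfies a field definition's constraints."""
    if value is None:
        return bool(field.get("nullable", False))
    rng = field.get("range")
    if rng is not None and not (rng["min"] <= value <= rng["max"]):
        return False
    allowed = field.get("values")
    if allowed is not None and value not in allowed:
        return False
    return True

age = {"name": "age", "type": "int", "nullable": False,
       "range": {"min": 18, "max": 65}}
print(check_value(30, age))    # True
print(check_value(None, age))  # False: not nullable
print(check_value(99, age))    # False: out of range
```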
Command Line Interface
Generate Data
# Direct CLI command (if available)
etl-forge generate [OPTIONS]
# Alternative CLI access (works on all platforms)
python -m etl_forge.cli generate [OPTIONS]
Options:
-s, --schema PATH Path to schema file (YAML or JSON) [required]
-r, --rows INTEGER Number of rows to generate (default: 100)
-o, --output PATH Output file path (CSV or Excel) [required]
-f, --format [csv|excel] Output format (auto-detected if not specified)
Validate Data
# Direct CLI command (if available)
etl-forge check [OPTIONS]
# Alternative CLI access (works on all platforms)
python -m etl_forge.cli check [OPTIONS]
Options:
-i, --input PATH Path to input data file [required]
-s, --schema PATH Path to schema file [required]
-r, --report PATH Path to save invalid rows report (optional)
-v, --verbose Show detailed validation errors
Create Example Schema
# Direct CLI command (if available)
etl-forge create-schema example_schema.yaml
# Alternative CLI access (works on all platforms)
python -m etl_forge.cli create-schema example_schema.yaml
Library Usage
Data Generation
from etl_forge import DataGenerator
# Initialize with schema
generator = DataGenerator('schema.yaml')
# Generate data
df = generator.generate_data(1000)
# Save to file
generator.save_data(df, 'output.csv')
# Or do both in one step
df = generator.generate_and_save(1000, 'output.xlsx', 'excel')
Data Validation
from etl_forge import DataValidator
# Initialize validator
validator = DataValidator('schema.yaml')
# Validate data
result = validator.validate('data.csv')
# Check results
if result.is_valid:
print("✅ Data is valid!")
else:
print(f"❌ Found {len(result.errors)} validation errors")
print(f"Invalid rows: {len(result.invalid_rows)}")
# Generate report
result = validator.validate_and_report('data.csv', 'errors.csv')
# Print summary
validator.print_validation_summary(result)
Advanced Usage
# Use schema as dictionary
schema_dict = {
'fields': [
{'name': 'id', 'type': 'int', 'unique': True},
{'name': 'name', 'type': 'string', 'faker_template': 'name'}
]
}
generator = DataGenerator(schema_dict)
validator = DataValidator(schema_dict)
# Validate DataFrame directly
import pandas as pd
df = pd.read_csv('data.csv')
result = validator.validate(df)
Faker Integration
When the faker extra is installed (pip install etl-forge[faker]), string fields can use faker_template to generate realistic values:
- name: first_name
type: string
faker_template: first_name
- name: address
type: string
faker_template: address
- name: phone
type: string
faker_template: phone_number
Common Faker templates:
- name, first_name, last_name
- email, phone_number
- address, city, country
- company, job
- date, time
- And many more! See the Faker documentation.
Testing
Run the test suite:
pytest tests/
Run with coverage:
pytest tests/ --cov=etl_forge --cov-report=html
Performance
Performance benchmarks are available in BENCHMARKS.md. To reproduce them, run:
python benchmark.py
Then, to visualize the results:
python plot_benchmark.py
Citation
If you use ETLForge in your research or work, please cite it using the information in CITATION.cff.