Forklift
A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.
Overview
Forklift is a comprehensive data processing tool that provides:
- High-performance data import with PyArrow streaming for CSV, Excel, FWF, and SQL sources
- Intelligent schema generation that analyzes your data and creates standardized schema definitions
- Robust validation with configurable error handling and constraint validation
- S3 streaming support for both input and output operations
- Multiple output formats including Parquet, with comprehensive metadata and manifests
Key Features
🚀 Data Import & Processing
- Stream large files efficiently with PyArrow
- Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources
- Configurable batch processing with memory optimization
- Comprehensive validation with detailed error reporting
- S3 integration for cloud-native workflows
🔍 Schema Generation
- Intelligent schema inference from data analysis
- Privacy-first approach - no sensitive sample data included by default
- Multiple file format support - CSV, Excel, Parquet
- Flexible output options - stdout, file, or clipboard
- Standards-compliant schemas following JSON Schema with Forklift extensions
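Schema inference of this kind is conceptually simple; the following stdlib-only sketch shows one way to guess a column's JSON Schema type from sampled string values (illustrative only, not Forklift's actual algorithm):

```python
def infer_type(values):
    """Guess a JSON Schema type for a column from its string values."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "null"
    # Try the narrowest numeric type first.
    for candidate, cast in (("integer", int), ("number", float)):
        try:
            for v in non_null:
                cast(v)
            return candidate
        except ValueError:
            continue
    if all(v.lower() in ("true", "false") for v in non_null):
        return "boolean"
    return "string"

print(infer_type(["1", "2", "3"]))     # integer
print(infer_type(["1.5", "2"]))        # number
print(infer_type(["true", "false"]))   # boolean
print(infer_type(["alice", "7"]))      # string
```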
🛡️ Validation & Quality
- JSON Schema validation with custom extensions
- Primary key inference and enforcement
- Constraint validation (unique, not-null, primary key)
- Data type validation and conversion
- Configurable error handling modes (fail-fast, fail-complete, bad-rows)
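The three modes behave roughly as their names suggest; this hypothetical sketch (not Forklift's API) shows the distinction between failing immediately, failing after a full scan, and routing bad rows aside:

```python
def process(rows, validate, mode="fail-fast"):
    """Illustrate three error-handling strategies for row validation."""
    good, bad = [], []
    for i, row in enumerate(rows):
        if validate(row):
            good.append(row)
        elif mode == "fail-fast":
            raise ValueError(f"row {i} failed validation")  # stop immediately
        else:
            bad.append((i, row))  # collect for later
    if mode == "fail-complete" and bad:
        raise ValueError(f"{len(bad)} rows failed validation")  # report all at end
    return good, bad  # "bad-rows": keep going, return failures separately

rows = [{"id": 1}, {"id": None}, {"id": 3}]
ok, rejected = process(rows, lambda r: r["id"] is not None, mode="bad-rows")
print(len(ok), len(rejected))  # 2 1
```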
Installation
pip install forklift-etl
Optional Dependencies
# For Excel support
pip install openpyxl
# For clipboard functionality
pip install pyperclip
Quick Start
Data Import
from forklift import import_csv

# Import CSV to Parquet with validation
results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json",
)
print("Import completed successfully!")
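The schema.json referenced here follows JSON Schema; the Forklift-specific extension keys are not documented in this README, but the plain JSON Schema core for a hypothetical two-column table would look like this:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"}
  },
  "required": ["id"]
}
```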
Schema Generation
import forklift

# Generate schema from CSV (analyzes entire file by default)
schema = forklift.generate_schema_from_csv("data.csv")

# Generate with limited row analysis
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)

# Save schema to file
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv",
)

# Generate with primary key inference
schema = forklift.generate_schema_from_csv(
    "data.csv",
    infer_primary_key_from_metadata=True,
)
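Primary key inference typically means finding a column whose values are all present and unique; a stdlib-only sketch of the single-column case (illustrative, not the library's actual algorithm):

```python
def infer_primary_key(columns):
    """Return the first column whose values are all non-null and unique."""
    for name, values in columns.items():
        if all(v is not None for v in values) and len(set(values)) == len(values):
            return name
    return None

data = {
    "email": ["a@x.com", "b@x.com", "a@x.com"],  # duplicate -> not a key
    "user_id": [101, 102, 103],                  # unique, non-null -> key
}
print(infer_primary_key(data))  # user_id
```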
Reading Data for Analysis
import forklift
# Read CSV into DataFrame for analysis
df = forklift.read_csv("data.csv")
# Read Excel with specific sheet
df = forklift.read_excel("data.xlsx", sheet_name="Sheet1")
# Read Fixed-Width File with schema
df = forklift.read_fwf("data.txt", schema_path="fwf_schema.json")
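Fixed-width parsing relies on a column specification giving each field's character offsets; here is a minimal stdlib sketch of what a FWF reader does conceptually (the field names and widths are hypothetical, not taken from the library):

```python
# Hypothetical column spec: (name, start, end) character offsets.
SPEC = [("id", 0, 4), ("name", 4, 14), ("amount", 14, 22)]

def parse_fwf_line(line, spec):
    """Slice one fixed-width record into a dict of stripped fields."""
    return {name: line[start:end].strip() for name, start, end in spec}

line = "0001" + "alice".ljust(10) + "123.45".rjust(8)
record = parse_fwf_line(line, SPEC)
print(record)  # {'id': '0001', 'name': 'alice', 'amount': '123.45'}
```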
CLI Usage
Data Import
# Import CSV with schema validation
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json
# Import from S3
forklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv
# Import Excel file
forklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet "Sheet1"
# Import Fixed-Width File
forklift ingest data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json
Schema Generation
# Generate schema from CSV (analyzes entire file by default)
forklift generate-schema data.csv --file-type csv
# Generate with limited row analysis
forklift generate-schema data.csv --file-type csv --nrows 1000
# Save to file
forklift generate-schema data.csv --file-type csv --output file --output-path schema.json
# Include sample data for development (explicit opt-in)
forklift generate-schema data.csv --file-type csv --include-sample
# Copy to clipboard
forklift generate-schema data.csv --file-type csv --output clipboard
# Excel files
forklift generate-schema data.xlsx --file-type excel --sheet "Sheet1"
# Parquet files
forklift generate-schema data.parquet --file-type parquet
# With primary key inference
forklift generate-schema data.csv --file-type csv --infer-primary-key
Core Components
- Import Engine: High-performance data processing with PyArrow
- Schema Generator: Intelligent schema inference and generation
- Validation System: Constraint validation and error handling
- Processors: Pluggable data transformation components
- I/O Operations: S3 and local file system support
Documentation
For detailed documentation, see the docs/ directory:
- Usage Guide - Comprehensive usage examples and workflows
- Schema Standards - JSON Schema format and extensions
- API Reference - Complete API documentation
- Constraint Validation - Validation features
- S3 Integration - S3 usage and testing
Examples
See the examples/ directory for comprehensive examples:
- getting_started.py - Start here: a complete introduction to CSV processing with schema validation, covering basic usage, full-schema validation, and passthrough mode for processing a subset of columns
- calculated_columns_demo.py - Calculated columns functionality
- constraint_validation_demo.py - Constraint validation examples
- validation_demo.py - Data validation with bad rows handling
- datetime_features_example.py - Date/time processing examples
- And more...
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.