Forklift
A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.
Overview
Forklift is a comprehensive data processing tool that provides:
- High-performance data import with PyArrow streaming for CSV, Excel, FWF, and SQL sources
- Intelligent schema generation that analyzes your data and creates standardized schema definitions
- Robust validation with configurable error handling and constraint validation
- S3 streaming support for both input and output operations
- Multiple output formats including Parquet, with comprehensive metadata and manifests
Key Features
🚀 Data Import & Processing
- Stream large files efficiently with PyArrow
- Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources
- Configurable batch processing with memory optimization
- Comprehensive validation with detailed error reporting
- S3 integration for cloud-native workflows
🔍 Schema Generation
- Intelligent schema inference from data analysis
- Privacy-first approach - no sensitive sample data included by default
- Multiple file format support - CSV, Excel, Parquet
- Flexible output options - stdout, file, or clipboard
- Standards-compliant schemas following JSON Schema with Forklift extensions
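Schema inference of this kind is conceptually simple; the following stdlib-only sketch shows one way to guess a column's JSON Schema type from sampled string values (illustrative only, not Forklift's actual algorithm):

```python
def infer_type(values):
    """Guess a JSON Schema type for a column from its string values."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "null"
    # Try the narrowest numeric type first.
    for candidate, cast in (("integer", int), ("number", float)):
        try:
            for v in non_null:
                cast(v)
            return candidate
        except ValueError:
            continue
    if all(v.lower() in ("true", "false") for v in non_null):
        return "boolean"
    return "string"

print(infer_type(["1", "2", "3"]))     # integer
print(infer_type(["1.5", "2"]))        # number
print(infer_type(["true", "false"]))   # boolean
print(infer_type(["alice", "7"]))      # string
```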
🛡️ Validation & Quality
- JSON Schema validation with custom extensions
- Primary key inference and enforcement
- Constraint validation (unique, not-null, primary key)
- Data type validation and conversion
- Configurable error handling modes (fail-fast, fail-complete, bad-rows)
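The three modes behave roughly as their names suggest; this hypothetical sketch (not Forklift's API) shows the distinction between failing immediately, failing after a full scan, and routing bad rows aside:

```python
def process(rows, validate, mode="fail-fast"):
    """Illustrate three error-handling strategies for row validation."""
    good, bad = [], []
    for i, row in enumerate(rows):
        if validate(row):
            good.append(row)
        elif mode == "fail-fast":
            raise ValueError(f"row {i} failed validation")  # stop immediately
        else:
            bad.append((i, row))  # collect for later
    if mode == "fail-complete" and bad:
        raise ValueError(f"{len(bad)} rows failed validation")  # report all at end
    return good, bad  # "bad-rows": keep going, return failures separately

rows = [{"id": 1}, {"id": None}, {"id": 3}]
ok, rejected = process(rows, lambda r: r["id"] is not None, mode="bad-rows")
print(len(ok), len(rejected))  # 2 1
```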
Installation
pip install forklift-etl
Optional Dependencies
# For Excel support
pip install openpyxl
# For clipboard functionality
pip install pyperclip
Quick Start
Data Import
from forklift import import_csv

# Import CSV to Parquet with validation
results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json",
)
print("Import completed successfully!")
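The schema.json referenced here follows JSON Schema; the Forklift-specific extension keys are not documented in this README, but the plain JSON Schema core for a hypothetical two-column table would look like this:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "id": {"type": "integer"},
    "name": {"type": "string"}
  },
  "required": ["id"]
}
```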
Schema Generation
import forklift

# Generate schema from CSV (analyzes entire file by default)
schema = forklift.generate_schema_from_csv("data.csv")

# Generate with limited row analysis
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)

# Save schema to file
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv",
)

# Generate with primary key inference
schema = forklift.generate_schema_from_csv(
    "data.csv",
    infer_primary_key_from_metadata=True,
)
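Primary key inference typically means finding a column whose values are all present and unique; a stdlib-only sketch of the single-column case (illustrative, not the library's actual algorithm):

```python
def infer_primary_key(columns):
    """Return the first column whose values are all non-null and unique."""
    for name, values in columns.items():
        if all(v is not None for v in values) and len(set(values)) == len(values):
            return name
    return None

data = {
    "email": ["a@x.com", "b@x.com", "a@x.com"],  # duplicate -> not a key
    "user_id": [101, 102, 103],                  # unique, non-null -> key
}
print(infer_primary_key(data))  # user_id
```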
Reading Data for Analysis
import forklift
# Read CSV into DataFrame for analysis
df = forklift.read_csv("data.csv")
# Read Excel with specific sheet
df = forklift.read_excel("data.xlsx", sheet_name="Sheet1")
# Read Fixed-Width File with schema
df = forklift.read_fwf("data.txt", schema_path="fwf_schema.json")
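Fixed-width parsing relies on a column specification giving each field's character offsets; here is a minimal stdlib sketch of what a FWF reader does conceptually (the field names and widths are hypothetical, not taken from the library):

```python
# Hypothetical column spec: (name, start, end) character offsets.
SPEC = [("id", 0, 4), ("name", 4, 14), ("amount", 14, 22)]

def parse_fwf_line(line, spec):
    """Slice one fixed-width record into a dict of stripped fields."""
    return {name: line[start:end].strip() for name, start, end in spec}

line = "0001" + "alice".ljust(10) + "123.45".rjust(8)
record = parse_fwf_line(line, SPEC)
print(record)  # {'id': '0001', 'name': 'alice', 'amount': '123.45'}
```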
CLI Usage
Data Import
# Import CSV with schema validation
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json
# Import from S3
forklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv
# Import Excel file
forklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet "Sheet1"
# Import Fixed-Width File
forklift ingest data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json
Schema Generation
# Generate schema from CSV (analyzes entire file by default)
forklift generate-schema data.csv --file-type csv
# Generate with limited row analysis
forklift generate-schema data.csv --file-type csv --nrows 1000
# Save to file
forklift generate-schema data.csv --file-type csv --output file --output-path schema.json
# Include sample data for development (explicit opt-in)
forklift generate-schema data.csv --file-type csv --include-sample
# Copy to clipboard
forklift generate-schema data.csv --file-type csv --output clipboard
# Excel files
forklift generate-schema data.xlsx --file-type excel --sheet "Sheet1"
# Parquet files
forklift generate-schema data.parquet --file-type parquet
# With primary key inference
forklift generate-schema data.csv --file-type csv --infer-primary-key
Core Components
- Import Engine: High-performance data processing with PyArrow
- Schema Generator: Intelligent schema inference and generation
- Validation System: Constraint validation and error handling
- Processors: Pluggable data transformation components
- I/O Operations: S3 and local file system support
Documentation
For detailed documentation, see the docs/ directory:
- Usage Guide - Comprehensive usage examples and workflows
- Schema Standards - JSON Schema format and extensions
- API Reference - Complete API documentation
- Constraint Validation - Validation features
- S3 Integration - S3 usage and testing
Examples
See the examples/ directory for comprehensive examples:
- getting_started.py - Start here: a complete introduction to CSV processing with schema validation, covering basic usage, full-schema validation, and passthrough mode for processing a subset of columns
- calculated_columns_demo.py - Calculated columns functionality
- constraint_validation_demo.py - Constraint validation examples
- validation_demo.py - Data validation with bad rows handling
- datetime_features_example.py - Date/time processing examples
- And more...
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Run the test suite
- Submit a pull request
License
This project is licensed under the MIT License - see the LICENSE file for details.