
Forklift

A powerful data processing and schema generation tool with PyArrow streaming, validation, and S3 support.

Overview

Forklift is a comprehensive data processing tool that provides:

  • High-performance data import with PyArrow streaming for CSV, Excel, FWF, and SQL sources
  • Intelligent schema generation that analyzes your data and creates standardized schema definitions
  • Robust validation with configurable error handling and constraint validation
  • S3 streaming support for both input and output operations
  • Multiple output formats including Parquet, with comprehensive metadata and manifests

Key Features

🚀 Data Import & Processing

  • Stream large files efficiently with PyArrow
  • Support for CSV, Excel, Fixed-Width Files (FWF), and SQL sources
  • Configurable batch processing with memory optimization
  • Comprehensive validation with detailed error reporting
  • S3 integration for cloud-native workflows
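The batch-streaming idea — reading a large file in fixed-size chunks so memory use stays bounded — can be sketched with the standard library alone. This is purely illustrative of the concept; Forklift's actual engine streams PyArrow record batches, not Python lists:

```python
import csv
import io
from itertools import islice

def iter_batches(fh, batch_size=2):
    """Yield lists of rows, at most batch_size rows each, so memory
    use stays bounded regardless of total file size."""
    reader = csv.DictReader(fh)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        yield batch

# A small in-memory "file" stands in for a large CSV on disk.
data = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
batches = list(iter_batches(data, batch_size=2))
# 3 rows with batch_size=2 -> 2 batches (2 rows, then 1 row)
```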

🔍 Schema Generation

  • Intelligent schema inference from data analysis
  • Privacy-first approach - no sensitive sample data included by default
  • Multiple file format support - CSV, Excel, Parquet
  • Flexible output options - stdout, file, or clipboard
  • Standards-compliant schemas following JSON Schema with Forklift extensions
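A generated schema is a JSON document. The Forklift-specific extensions are not shown here, but a minimal standard JSON Schema for a two-column CSV might look like the following (illustrative field layout, not captured output from the tool):

```python
import json

# Hypothetical example: a standard JSON Schema describing a CSV with
# an integer "id" column and a string "name" column. Forklift layers
# its own extensions on top of a base structure like this.
schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
    },
    "required": ["id"],
}

with open("schema.json", "w") as fh:
    json.dump(schema, fh, indent=2)
```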

🛡️ Validation & Quality

  • JSON Schema validation with custom extensions
  • Primary key inference and enforcement
  • Constraint validation (unique, not-null, primary key)
  • Data type validation and conversion
  • Configurable error handling modes (fail-fast, fail-complete, bad-rows)
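The three modes differ in *when* errors surface: fail-fast stops at the first bad row, fail-complete validates everything before raising, and bad-rows diverts failures and keeps going. A conceptual sketch in plain Python (not Forklift's internals; the mode names mirror the bullet above):

```python
def process(rows, validate, mode="fail-fast"):
    """Conceptual sketch of the three error-handling modes.

    fail-fast:     raise on the first invalid row
    fail-complete: validate all rows, then raise if any failed
    bad-rows:      divert invalid rows and keep processing the rest
    """
    good, bad = [], []
    for i, row in enumerate(rows):
        if validate(row):
            good.append(row)
        else:
            if mode == "fail-fast":
                raise ValueError(f"row {i} failed validation")
            bad.append((i, row))
    if mode == "fail-complete" and bad:
        raise ValueError(f"{len(bad)} rows failed validation")
    return good, bad

rows = [{"id": 1}, {"id": None}, {"id": 3}]
not_null = lambda r: r["id"] is not None

good, bad = process(rows, not_null, mode="bad-rows")
# good keeps rows 0 and 2; bad captures row 1
```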

Installation

pip install forklift-etl

Optional Dependencies

# For Excel support
pip install openpyxl

# For clipboard functionality
pip install pyperclip

Quick Start

Data Import

import forklift

# Import CSV to Parquet with validation
from forklift import import_csv

results = import_csv(
    source="data.csv",
    destination="./output/",
    schema_path="schema.json"
)

print("Import completed successfully!")

Schema Generation

import forklift

# Generate schema from CSV (analyzes entire file by default)
schema = forklift.generate_schema_from_csv("data.csv")

# Generate with limited row analysis
schema = forklift.generate_schema_from_csv("data.csv", nrows=1000)

# Save schema to file
forklift.generate_and_save_schema(
    input_path="data.csv",
    output_path="schema.json",
    file_type="csv"
)

# Generate with primary key inference
schema = forklift.generate_schema_from_csv(
    "data.csv", 
    infer_primary_key_from_metadata=True
)
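Primary key inference, conceptually, looks for a column whose values are unique and never null. A toy sketch of that check (not Forklift's actual algorithm):

```python
def infer_primary_key(rows):
    """Return the first column whose values are all non-null and unique,
    or None if no column qualifies -- a toy version of PK inference."""
    if not rows:
        return None
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(v is not None for v in values) and len(set(values)) == len(values):
            return col
    return None

rows = [
    {"id": 1, "city": "NYC"},
    {"id": 2, "city": "NYC"},
    {"id": 3, "city": "LA"},
]
# "id" is unique and non-null; "city" repeats, so "id" wins
```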

Reading Data for Analysis

import forklift

# Read CSV into DataFrame for analysis
df = forklift.read_csv("data.csv")

# Read Excel with specific sheet
df = forklift.read_excel("data.xlsx", sheet_name="Sheet1")

# Read Fixed-Width File with schema
df = forklift.read_fwf("data.txt", schema_path="fwf_schema.json")

CLI Usage

Data Import

# Import CSV with schema validation
forklift ingest data.csv --dest ./output/ --input-kind csv --schema schema.json

# Import from S3
forklift ingest s3://bucket/data.csv --dest s3://bucket/output/ --input-kind csv

# Import Excel file
forklift ingest data.xlsx --dest ./output/ --input-kind excel --sheet "Sheet1"

# Import Fixed-Width File
forklift ingest data.txt --dest ./output/ --input-kind fwf --fwf-spec schema.json

Schema Generation

# Generate schema from CSV (analyzes entire file by default)
forklift generate-schema data.csv --file-type csv

# Generate with limited row analysis
forklift generate-schema data.csv --file-type csv --nrows 1000

# Save to file
forklift generate-schema data.csv --file-type csv --output file --output-path schema.json

# Include sample data for development (explicit opt-in)
forklift generate-schema data.csv --file-type csv --include-sample

# Copy to clipboard
forklift generate-schema data.csv --file-type csv --output clipboard

# Excel files
forklift generate-schema data.xlsx --file-type excel --sheet "Sheet1"

# Parquet files
forklift generate-schema data.parquet --file-type parquet

# With primary key inference
forklift generate-schema data.csv --file-type csv --infer-primary-key

Core Components

  • Import Engine: High-performance data processing with PyArrow
  • Schema Generator: Intelligent schema inference and generation
  • Validation System: Constraint validation and error handling
  • Processors: Pluggable data transformation components
  • I/O Operations: S3 and local file system support
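The processor layer can be thought of as objects exposing a common transform hook that a pipeline applies in sequence. A hypothetical minimal interface (names invented for illustration; Forklift's real processor API may differ):

```python
from typing import Protocol

class Processor(Protocol):
    """Hypothetical interface: take a batch of rows, return a batch."""
    def process(self, batch: list[dict]) -> list[dict]: ...

class UppercaseNames:
    """Example processor: upper-case every 'name' field in the batch."""
    def process(self, batch):
        return [{**row, "name": row["name"].upper()} for row in batch]

def run_pipeline(batch, processors):
    # Apply each processor in order, chaining the transformations.
    for p in processors:
        batch = p.process(batch)
    return batch

out = run_pipeline([{"name": "ada"}], [UppercaseNames()])
```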

Documentation

For detailed documentation, see the docs/ directory.

Examples

See the examples/ directory for comprehensive examples:

  • getting_started.py - Start here! Complete introduction to CSV processing with schema validation, including basic usage, complete schema validation, and passthrough mode for processing subsets of columns
  • calculated_columns_demo.py - Calculated columns functionality
  • constraint_validation_demo.py - Constraint validation examples
  • validation_demo.py - Data validation with bad rows handling
  • datetime_features_example.py - Date/time processing examples
  • And more...

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.
