
A high-performance preprocessing and ETL engine for sanitizing raw data streams, accelerated by Rust.


Phaeton


⚠️ Project Status: Phaeton is currently in Stable Beta (v0.3.0). The core streaming engine is fully functional. However, please note that some auxiliary methods (marked in docs) are currently placeholders and will be implemented in future versions.

Phaeton is a specialized, Rust-powered preprocessing and ETL engine designed to sanitize raw data streams before they reach your analytical environment.

It acts as the strictly typed "Gatekeeper" of your data pipeline. Unlike traditional DataFrame libraries that attempt to load entire datasets into RAM, Phaeton employs a zero-copy streaming architecture. It processes data chunk by chunk, filtering noise, fixing encodings, and standardizing formats, keeping memory complexity O(1) relative to file size.

This allows you to process massive datasets on standard hardware without memory spikes, delivering clean, high-quality data to downstream tools like Pandas, Polars, or ML models.
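For intuition, the chunked model can be sketched in pure Python. This is an illustrative stand-in for Phaeton's Rust internals, not its actual implementation: only one batch is ever held in memory, so usage is bounded by batch size rather than file size.

```python
import csv
import io

def stream_clean(reader, batch_size=25_000):
    """Yield cleaned rows batch by batch; only one batch lives in memory."""
    batch = []
    for row in reader:
        name = row["name"].strip()        # scrub: trim whitespace
        if not name:                      # prune: drop rows missing 'name'
            continue
        batch.append({"name": name})
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

raw = "name\n  alice \nbob\n \ncarol\ndave\n"
reader = csv.DictReader(io.StringIO(raw))
batches = list(stream_clean(reader, batch_size=2))
# Two batches of two rows each; the whitespace-only row was pruned.
```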

The Philosophy: Don't waste memory loading garbage. Clean the stream first, then analyze the gold.


Key Features

  • Streaming Architecture: Processes files chunk-by-chunk. Memory usage remains stable regardless of whether the file is 100MB or 100GB.
  • Parallel Execution: Utilizes all CPU cores via Rust's Rayon library to handle heavy lifting (regex, fuzzy matching) without blocking Python.
  • Strict Quarantine: Bad data isn't just dropped silently; it's quarantined into a separate file with a generated _phaeton_reason column for auditing.
  • Smart Casting: Automatically handles messy formats (e.g., "Rp 5.250.000,00" -> 5250000 as int) without complex manual parsing.
  • Privacy & Security: Built-in email masking and SHA-256 hashing for PII compliance.
  • Configurable Engine: Full control over batch_size and worker threads to tune performance for low-memory devices or high-end servers.
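The "Smart Casting" behavior above can be approximated in plain Python. This naive sketch (not Phaeton's actual parser) handles only the dot-grouped, comma-decimal format shown in the example:

```python
import re

def smart_cast_currency(raw):
    """'Rp 5.250.000,00' -> 5250000: strip symbols, drop the decimal tail, cast."""
    s = re.sub(r"[^\d.,]", "", raw)       # remove currency symbols and spaces
    if "," in s:                          # treat the final comma as the decimal mark
        s, _, _ = s.rpartition(",")
    return int(re.sub(r"\D", "", s))      # drop grouping dots and cast

smart_cast_currency("Rp 5.250.000,00")    # -> 5250000
```

Real-world locale detection (e.g. US-style "5,000" where the comma is a grouping separator) needs more context than this sketch uses.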

Performance Benchmark

Phaeton is optimized for "Dirty Data" scenarios involving heavy string parsing, regex filtering, and fuzzy matching.

Test Scenario:

  • Dataset: 1 Million Rows of generated mixed dirty data.
  • Operations: Trim whitespace, Currency scrubbing ($ 50.000,00 -> 50000), Type casting, Fuzzy Alignment (Typo correction for City names), and Filtering.
  • Hardware: Entry-level Laptop (Intel Core i3-1220P, 16GB RAM).

Results:

| OS Environment | Speed (Rows/sec) | Duration (1M Rows) | Throughput |
|---|---|---|---|
| Windows 11 | ~820,000 rows/s | 1.21s | ~70 MB/s |
| Linux (Arch) | ~575,000 rows/s | 1.73s | ~49 MB/s |

⚠️ Note on I/O Bottleneck: The performance difference above is due to hardware configuration during testing.

  • Windows: Ran on internal NVMe SSD (High I/O speed).
  • Linux: Ran on External SSD via USB 3.2 enclosure (I/O Bottleneck).

In an equal hardware environment, Phaeton performs identically on Linux and Windows. The engine is heavily I/O bound; faster disk = faster processing.


Usage Example

The example below uses only features available in the current version.

import phaeton

# 1. Initialize Engine
# 'strict=True' enables schema validation before execution starts.
eng = phaeton.Engine(workers=0, batch_size=25_000, strict=True)

# 2. Define Base Pipeline (Shared Logic)
base = (
    eng.ingest("dirty_data.csv")
        # Critical Data Filter: Drop row if 'email' OR 'username' is missing
        .prune(['email', 'username'])
        
        # Deduplication: Ensure email uniqueness across the dataset
        .dedupe('email')
        
        # Cleaning & Normalization
        .scrub('username', mode='trim')             # Remove whitespace
        .scrub('salary', mode='currency')           # Normalize format ("$ 5,000" -> "5000")
        
        # Type Enforcement: Validate data is integer, strip noise if needed
        .cast('salary', dtype='int', clean=True)
        
        # Imputation: Fill missing status with a default value
        .fill('status', value='UNKNOWN')
        
        # Correction: Fix typos using Jaro-Winkler distance
        .fuzzyalign('city',
            ref=['Jakarta', 'Minnesota'],
            threshold=0.85 
        ) # e.g., "Jkarta" -> "Jakarta"
)

# 3. Pipeline Branching using .fork()

# Pipeline 1: Secure & Clean Active Users
p1 = (
    base.fork('Active Users')
        .keep('status', match='ACTIVE', mode='exact')
        .hash('email', salt='s3cret')               # Anonymize PII (SHA-256)
        .dump('clean_active.csv')
)

# Pipeline 2: Audit Banned Users
p2 = (
    base.fork('Banned Analysis')
        .keep('status', match='BANNED', mode='exact')
        .quarantine('quarantine_banned.csv')        # Isolate bad rows for review
        .dump('clean_banned.csv')
)

# 4. Execute Pipelines in Parallel
# Returns a list of result statistics
stats = eng.exec([p1, p2])

print(f"Pipeline 1 (Active) | Processed: {stats[0].processed}, Saved: {stats[0].saved}")
print(f"Pipeline 2 (Banned) | Processed: {stats[1].processed}, Saved: {stats[1].saved}")

Installation

Phaeton ships abi3 wheels, so a single wheel per platform covers Python 3.8+. No Rust compiler needed.

pip install phaeton

Supported: Python 3.8+ on Windows, Linux, and macOS (Intel & Apple Silicon).


API Reference

1. Engine & Diagnostics

| Method | Description |
|---|---|
| phaeton.probe(path) | Detects encoding and delimiter automatically. |
| eng.ingest(source) | Creates a new pipeline builder. |
| eng.exec(pipelines) | Executes pipelines in parallel threads. |
| eng.validate(pipelines) | Runs a schema dry-run check without executing data processing. |
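As a rough stdlib stand-in for phaeton.probe's delimiter detection (an approximation only, without Phaeton's encoding sniffing), csv.Sniffer can guess the delimiter from a sample:

```python
import csv

sample = "id;name;city\n1;Alice;Jakarta\n2;Bob;Bandung\n"
dialect = csv.Sniffer().sniff(sample, delimiters=";,|\t")
print(dialect.delimiter)    # ';'
```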

2. Pipeline: Cleaning & Transformation

Methods to sanitize data content.

| Method | Description |
|---|---|
| .decode(encoding) | Fixes file encoding (e.g., latin-1 or cp1252). Mandatory as the first step if the encoding is broken. |
| .scrub(col, mode) | Basic string cleaning. Modes: 'trim', 'lower', 'upper', 'currency', 'html', 'numeric_only', 'email' (masking). |
| .fill(col, val, method) | Imputes missing values. Methods: 'fixed' (constant value) or 'ffill' (forward fill). |
| .dedupe(col) | Removes duplicates. col can be None (full row), str (single column), or list (composite key). |
| .fuzzyalign(col, ref, threshold) | Fixes typos using Jaro-Winkler distance against a reference list. |
| .cast(col, dtype, clean) | Smart cast. Converts types (int/float/bool). Set clean=True to strip non-numeric characters before casting. |
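For intuition about .fuzzyalign, here is a from-scratch Jaro-Winkler similarity in pure Python. It is illustrative only; Phaeton's Rust implementation may score edge cases differently, so thresholds may not transfer exactly.

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    used1, used2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):                      # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used2[j] and s2[j] == c:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a = [c for c, m in zip(s1, used1) if m]
    b = [c for c, m in zip(s2, used2) if m]
    transpositions = sum(x != y for x, y in zip(a, b)) / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):                # common-prefix bonus
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def fuzzy_align(value, ref, threshold=0.8):
    """Snap 'value' to its closest reference string if similar enough."""
    best = max(ref, key=lambda r: jaro_winkler(value, r))
    return best if jaro_winkler(value, best) >= threshold else value

fuzzy_align("Jkarta", ["Jakarta", "Minnesota"])     # -> 'Jakarta'
```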

3. Pipeline: Structure & Security

Methods to manage columns and privacy.

| Method | Description |
|---|---|
| .headers(style) | Standardizes header casing. Styles: 'snake', 'camel', 'pascal', 'kebab', 'constant'. |
| .rename(mapping) | Renames specific columns using a dictionary mapping ({'old': 'new'}). |
| .hash(col, salt) | Applies salted SHA-256 hashing to specific columns for PII anonymization. |
| .map(col, mapping) | Maps values using a dictionary lookup (VLOOKUP style). |
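.hash's salted SHA-256 corresponds to the standard construction below, sketched with hashlib. The exact salting scheme Phaeton uses (e.g. salt placement) is an assumption here:

```python
import hashlib

def hash_pii(value, salt):
    """Deterministic salted SHA-256: same value + salt -> same 64-char hex digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

digest = hash_pii("alice@example.com", salt="s3cret")
print(len(digest))    # 64
```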

4. Pipeline: Output & Flow

Methods to save the final results or handle rejected data.

| Method | Description |
|---|---|
| .quarantine(path) | Saves rejected rows (with reasons) to a separate CSV file. |
| .dump(path, format) | Saves clean data to .csv. |
| .fork(tag) | Creates a branch of the pipeline. |
| .peek(n, col) | Runs a dry-run preview. n: row limit; col: specific column(s) to inspect (optional). |
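The quarantine flow described above (reject rows with an auditable _phaeton_reason column rather than dropping them silently) can be sketched in pure Python; the rule names here are hypothetical, for illustration only:

```python
def partition_rows(rows):
    """Split rows into clean and rejected; rejects carry a reason column."""
    clean, rejected = [], []
    for row in rows:
        if not row.get("email"):
            rejected.append({**row, "_phaeton_reason": "missing email"})
        elif not row.get("salary", "").isdigit():
            rejected.append({**row, "_phaeton_reason": "salary not numeric"})
        else:
            clean.append(row)
    return clean, rejected

rows = [
    {"email": "a@x.com", "salary": "5000"},
    {"email": "", "salary": "100"},
    {"email": "b@x.com", "salary": "N/A"},
]
clean, rejected = partition_rows(rows)
```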

⚠️ Placeholder Methods (Coming Soon)

These methods are present in the API for compatibility but do not perform operations yet in v0.3.0.

  • reformat(col, ...): Date parsing/reformatting.
  • split(col, ...): Splitting columns.
  • combine(cols, ...): Merging columns.

Roadmap

Phaeton is currently in Stable Beta (v0.3.0). Here is the status of our development:

| Feature | Status | Implementation Notes |
|---|---|---|
| Parallel Streaming Engine | ✅ Ready | Powered by Rust's Rayon (multi-core) |
| Filter Logic & Regex | ✅ Ready | keep, discard, prune implemented |
| Text Scrubbing | ✅ Ready | HTML, currency, email masking, etc. |
| Type Enforcement | ✅ Ready | Validates data types & scrubs noise for clean CSV output |
| Fuzzy Alignment | ✅ Ready | Jaro-Winkler for typo correction |
| Quarantine System | ✅ Ready | Full audit trail for rejected rows |
| Deduplication | ✅ Ready | Row-level & column-level dedupe |
| Hashing & Anonymization | ✅ Ready | SHA-256 for PII data |
| Header Normalization | ✅ Ready | snake_case, camelCase conversions |
| Strict Schema Validation | ✅ Ready | Engine(strict=True) |
| Inspector Engine | 📝 Planned | Dedicated stream for data profiling (read-only) |
| Date Normalization | 📝 Planned | Auto-detect & reformat dates |
| Parquet/Arrow Support | 📝 Planned | Native output integration |

Contributing

This project is built with Maturin (PyO3 + Rust). Interested in contributing?

  1. Clone this repository.
  2. Ensure Rust & Cargo are installed.
  3. Set up the environment and build:
# Setup virtual environment (optional)
python -m venv .venv
source .venv/bin/activate

# Build & install the package in development mode
pip install maturin
maturin develop --release
