Unit-Safe Data Pipeline Schema and Dimensional Algebra Framework

These details have not been verified by PyPI

Project description

Chisa — Unit-Safe Data Pipeline Schema

Normalize messy heterogeneous units and enforce physical integrity before your data hits ML or production systems.

Chisa is a declarative schema validation and semantic data transformation tool designed for Data Engineers. It rescues your data pipelines from the nightmare of mixed units, bizarre abbreviations, and impossible physical values.

While standard schema tools (like Pydantic or Pandera) focus on validating data types (e.g., ensuring a value is a float), Chisa validates physical reality. If you are ingesting IoT sensor streams, parsing messy logistics CSVs, or processing manufacturing Excel sheets, Chisa ensures your numbers obey the laws of physics before they enter your database.

🚀 The Nightmare vs. The Chisa Way

Real-world data is rarely clean. A single dataset might contain "1.5e3 lbs", " -5 kg ", missing values, and unrecognized units like "20 pallets". Standard pandas workflows force you to write fragile regex and manual if-else blocks.

Chisa solves this declaratively.

import pandas as pd
import chisa as cs
from chisa import u

class GlobalFreightSchema(cs.Schema):
    gross_weight: u.Kilogram = cs.Field(
        source="Weight_Log", 
        parse_string=True, 
        on_error='coerce', 
        round=2,
        min=0 # Axiom Bound: Cargo mass cannot be negative!
    )
    cargo_volume: u.CubicMeter = cs.Field(
        source="Volume_Log", 
        parse_string=True, 
        on_error='coerce'
    )

df_messy = pd.DataFrame({
    'Weight_Log': ["1.5e3 lbs", "  -5 kg  ", "20 pallets", "150", "kg"],
    'Volume_Log': ["100 m^3", "500 cu_ft", "1000", "", "NaN"]
})

# Execute the pipeline instantly via vectorized masking
clean_df = GlobalFreightSchema.normalize(df_messy)

The Output: Chisa parses "1.5e3 lbs" to 680.39 kg, converts "cu_ft" values to cubic meters, and nullifies physical anomalies (like -5 kg), bare numbers without a unit ("150", "1000"), and unrecognized units ("20 pallets") to NaN, all automatically.
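
To make the behaviour described above concrete, here is a minimal sketch of the same idea in plain pandas. It is not Chisa's implementation: the TO_KG table and the normalize_mass helper are hypothetical names used only for illustration, and the sketch covers a single mass column.

```python
import pandas as pd

# Hypothetical conversion table for this sketch only (factor to kilograms).
TO_KG = {"kg": 1.0, "lbs": 0.45359237, "oz": 0.028349523125}

def normalize_mass(raw: pd.Series, minimum: float = 0.0, ndigits: int = 2) -> pd.Series:
    """Parse strings like '1.5e3 lbs' into kilograms; anything invalid becomes NaN."""
    # Split every cell into a numeric part and a unit part in one vectorized pass.
    parts = raw.str.strip().str.extract(
        r"^(?P<value>[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*(?P<unit>[A-Za-z_]+)$"
    )
    value = pd.to_numeric(parts["value"], errors="coerce")
    # Unknown units (e.g. 'pallets') map to NaN, which propagates through the product.
    factor = parts["unit"].map(TO_KG)
    kg = (value * factor).round(ndigits)
    # Boolean mask enforces the lower bound: values below it are nullified.
    return kg.where(kg >= minimum)

weights = pd.Series(["1.5e3 lbs", "  -5 kg  ", "20 pallets", "150", "kg"])
result = normalize_mass(weights)
print(result)
```

Only the first row survives in this sketch: the negative mass fails the bound, "pallets" is not in the table, and the last two cells lack either a unit or a number.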

🧠 Smart Error Intelligence

Data pipelines shouldn't just crash; they should tell you how to fix them. If you enforce strict data rules (on_error='raise'), Chisa raises an error that names the failing field, the row index, the expected dimension, a sample of the raw value, and a suggested fix, which makes debugging massive DataFrames far easier:

NormalizationError: Normalization failed for field 'gross_weight' at index [2].
   ► Issue              : Unrecognized unit 'pallets'
   ► Expected Dimension : mass
   ► Raw Value Sample   : '20 pallets'
   ► Suggestion         : Fix the raw data, register the unit, or set Field(on_error='coerce').
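
For readers who want to catch and inspect such failures programmatically, the sketch below shows one way a structured error of this shape could be built. The class name NormalizationErrorSketch and its attributes are hypothetical; Chisa's real exception may expose different fields.

```python
class NormalizationErrorSketch(ValueError):
    """Hypothetical structured error; not Chisa's actual exception class."""

    def __init__(self, field, indices, issue, expected_dimension, sample):
        # Keep the details as attributes so callers can react without parsing text.
        self.field = field
        self.indices = indices
        self.issue = issue
        self.expected_dimension = expected_dimension
        self.sample = sample
        message = (
            f"Normalization failed for field {field!r} at index {indices}.\n"
            f"   Issue              : {issue}\n"
            f"   Expected Dimension : {expected_dimension}\n"
            f"   Raw Value Sample   : {sample!r}"
        )
        super().__init__(message)

try:
    raise NormalizationErrorSketch(
        field="gross_weight",
        indices=[2],
        issue="Unrecognized unit 'pallets'",
        expected_dimension="mass",
        sample="20 pallets",
    )
except NormalizationErrorSketch as err:
    caught = err
    print(err)
```

Keeping the row indices on the exception object lets a caller quarantine the offending rows instead of aborting the whole batch.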

⚡ Performance: The Vectorization Advantage

Standard unit libraries (like Pint) struggle with heterogeneous strings (mixed units in the same column), forcing developers to use slow pandas.apply() loops to parse row-by-row. Chisa bypasses this entirely using native NumPy vectorization and Pandas Boolean masking.

When stress-tested against 100,000 rows of heterogeneous data (e.g., a mix of lbs and oz targeting kg):

Traditional (Pint + Pandas Apply): ~14.71 seconds
Chisa (Vectorized Schema): ~0.046 seconds (>316x Faster)

Transparency Note: You can reproduce this benchmark (a latency reduction of more than 99.6%) using the benchmarks/benchmark_vs_pint.py script included in this repository.
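
The speedup and latency figures follow directly from the two reported timings. The short check below recomputes them; the timings are the approximate values quoted above, so the results are approximate too.

```python
pint_seconds = 14.71   # reported timing for Pint + pandas.apply
chisa_seconds = 0.046  # reported timing for the vectorized schema

# Speedup is the ratio of the two timings.
speedup = pint_seconds / chisa_seconds
# Latency reduction is the share of the original time that was eliminated.
reduction_pct = (1 - chisa_seconds / pint_seconds) * 100

print(f"speedup: {speedup:.1f}x")
print(f"latency reduction: {reduction_pct:.2f}%")
```

With these inputs the ratio comes out at roughly 320x and the reduction at roughly 99.7%, which is consistent with the ">316x" figure quoted above.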

🪝 Pipeline Hooks (Inversion of Control)

Need to filter offline sensors before parsing, or trigger an alarm if a physical threshold is breached? Inject your own domain logic directly into the validation lifecycle.

class ColdChainPipeline(cs.Schema):
    temp: u.Celsius = cs.Field(source="raw_temp", parse_string=True)

    @cs.pre_normalize
    def drop_calibration_pings(cls, raw_df):
        """Runs BEFORE Chisa parses the strings. Removes sensor test pings."""
        return raw_df[raw_df['status'] != 'CALIBRATION']

    @cs.post_normalize
    def enforce_spoilage_check(cls, clean_df):
        """Runs AFTER all temperatures (e.g., Fahrenheit) are standardized to Celsius."""
        if clean_df['temp'].max() > -20.0:
            raise ValueError("CRITICAL: Vaccine shipment spoiled! Temp exceeded -20°C.")
        return clean_df
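
To illustrate the inversion-of-control pattern behind these hooks, here is a minimal, self-contained sketch in plain Python and pandas. It is an assumption about how such a lifecycle could be wired, not Chisa's implementation: MiniSchema, ColdChainSketch, the decorators, and the raw_temp_f column are all hypothetical.

```python
import pandas as pd

def pre_normalize(func):
    """Mark a function as a hook that runs before conversion (hypothetical)."""
    func._hook_stage = "pre"
    return func

def post_normalize(func):
    """Mark a function as a hook that runs after conversion (hypothetical)."""
    func._hook_stage = "post"
    return func

class MiniSchema:
    """Toy base class: finds marked hooks and calls them around _convert()."""

    @classmethod
    def _hooks(cls, stage):
        # Hooks are the plain functions in the class body carrying the marker.
        return [
            attr for attr in vars(cls).values()
            if callable(attr) and getattr(attr, "_hook_stage", None) == stage
        ]

    @classmethod
    def normalize(cls, df):
        for hook in cls._hooks("pre"):
            df = hook(cls, df)
        df = cls._convert(df)
        for hook in cls._hooks("post"):
            df = hook(cls, df)
        return df

class ColdChainSketch(MiniSchema):
    @classmethod
    def _convert(cls, df):
        # Stand-in for real parsing: Fahrenheit readings converted to Celsius.
        out = df.copy()
        out["temp"] = (out["raw_temp_f"] - 32.0) * 5.0 / 9.0
        return out[["temp"]]

    @pre_normalize
    def drop_calibration_pings(cls, raw_df):
        return raw_df[raw_df["status"] != "CALIBRATION"]

    @post_normalize
    def enforce_spoilage_check(cls, clean_df):
        if clean_df["temp"].max() > -20.0:
            raise ValueError("Temperature exceeded -20 °C")
        return clean_df

readings = pd.DataFrame({
    "raw_temp_f": [-13.0, -22.0, 70.0],
    "status": ["OK", "OK", "CALIBRATION"],
})
clean = ColdChainSketch.normalize(readings)
print(clean)
```

The framework owns the order of execution and the user only supplies the callbacks, so the 70 °F calibration ping is dropped before the spoilage check ever sees it.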

🏎️ The Fluent API (Quick Inline Conversions)

For simple scripts, logging, or UI components where you don't need full declarative schemas, Chisa provides a highly readable, chainable Fluent API.

import chisa as cs

# Simple scalar conversion
speed = cs.convert(120, 'km/h').to('m/s').resolve()
print(speed) # 33.333333333

# Powerful cosmetic formatting for logs
text = cs.convert(1000, 'm').to('cm').use(format='verbose', delim=True).resolve()
print(text) # "1,000 m = 100,000 cm"
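
The chaining style works because each step returns the object it was called on. The sketch below shows that mechanism with a hypothetical Conversion class and a tiny hand-written UNITS table; it is not Chisa's API and supports only the four units listed.

```python
# Hypothetical table for this sketch: unit -> (dimension, factor to the base unit).
UNITS = {
    "km/h": ("speed", 1000.0 / 3600.0),
    "m/s": ("speed", 1.0),
    "m": ("length", 1.0),
    "cm": ("length", 0.01),
}

class Conversion:
    """Toy fluent converter; .to() returns self so calls can be chained."""

    def __init__(self, value, source):
        self.value = value
        self.source = source
        self.target = None

    def to(self, target):
        # Refuse conversions across dimensions (e.g. speed to length).
        if UNITS[target][0] != UNITS[self.source][0]:
            raise ValueError(f"cannot convert {self.source} to {target}")
        self.target = target
        return self

    def resolve(self):
        if self.target is None:
            raise ValueError("call .to(...) before .resolve()")
        # Go through the base unit: source -> base -> target.
        return self.value * UNITS[self.source][1] / UNITS[self.target][1]

speed = Conversion(120, "km/h").to("m/s").resolve()
length = Conversion(1000, "m").to("cm").resolve()
print(round(speed, 6), round(length, 6))
```

Deferring the arithmetic to resolve() is what leaves room for intermediate steps such as formatting options between to() and the final result.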

📚 Examples & Tutorials

To help you integrate Chisa into your existing workflows, we provide a comprehensive suite of examples in the examples/ directory.

Interactive Crash Course (Google Colab)

The fastest way to learn Chisa is through our interactive notebooks. No local installation required!

Tutorial	Description	Link
01. Fundamentals	Core concepts, Axiom Engine, and Type Safety.
02. Workflow Demo	Real-world engineering with Pandas & Matplotlib.

Python Scripts Reference

For detailed, standalone script implementations, explore our examples/ directory:

Phase 1: Declarative Data Pipelines (Data Ingestion)
- 01_wearable_health_data.py: Standardizing messy smartwatch exports (BPM, kcal vs cal, body temperature).
- 02_food_manufacturing_scale.py: Safely converting industrial recipe batches across cups, tablespoons, grams, and fluid ounces.
- 03_multi_region_tariffs.py: Parsing mixed currency and weight strings (lbs, oz, kg) in a single pass to calculate global shipping costs.
- 04_energy_grid_audits.py: Normalizing utility bill chaos (MMBtu, kWh, Joules) into a single unified Pandas cost report.

Phase 2: High-Performance Vectorization & Algebra
- 05_f1_telemetry_vectorization.py: Array math on RPM, Speed, and Tire Pressure operating on millions of rows in milliseconds.
- 06_structural_stress_testing.py: Cross-unit algebra combining Kips, Newtons, and Pound-force over Square Meters for civil engineering loads.
- 07_financial_billing_precision.py: Understanding when to use .mag (fast Python floats for Math/ML) vs .exact (high-precision Decimals for strict financial audits).

Phase 3: The Axiom Engine (Domain-Driven Engineering)
- 08_gas_pipeline_thermodynamics.py: Using Contextual Shifts to dynamically calculate industrial gas volume expansion based on real-time temperature and pressure (PV=nRT).
- 09_end_to_end_esg_pipeline.py: The Grand Unified Theory of Chisa. Synthesizing a custom dimension (Carbon Intensity), cleaning data into it via Schema, and guarding algorithms with @require and @prepare.

Phase 4: Real-World Ecosystem Integration
- 10_pandas_groupby_physics.py: Integrating Chisa arrays directly with Pandas GroupBy to aggregate daily IoT power production into monthly summaries.
- 11_scikit_learn_transformer.py: Building a custom ML BaseEstimator to autonomously normalize heterogeneous unit arrays before training a Random Forest.
- 12_handling_sensor_drift.py: Using NumPy array masks and vectorization to neutralize factory machine calibration errors without slow for loops.
- 13_dynamic_alert_thresholds.py: Simulating an IoT streaming pipeline where safety limits (@axiom.bound) change dynamically based on the machine's operating context.
- 14_cloud_compute_costs.py: Utilizing extreme Metaclass algebra (Currency / (RAM * Time)) to synthesize and calculate abstract Server Compute billing rates ($ / GB-Hour).
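
The float-versus-Decimal trade-off mentioned for 07_financial_billing_precision.py can be seen with the standard library alone. The sketch below does not use Chisa; it only shows why binary floats suit fast math while Decimal suits audits where every cent must reconcile.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2
# Decimal stores base-10 digits, so the same sum is exact.
exact_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)             # False
print(exact_total == Decimal("0.3"))  # True
```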

🔬 The Engine: Explicit Dimensional Algebra

While Chisa's Schema is built for Data Engineering pipelines, underneath it lies a highly strict, Metaclass-driven Object-Oriented physics engine. If you are a Data Scientist, you can extract your clean data into Chisa Arrays for cross-dimensional mathematics with zero memory leaks.

import numpy as np
import chisa as cs
from chisa import u

# Seamless cross-unit Metaclass Vectorized Synthesis (Mass * Acceleration = Force)
Mass = u.Kilogram(np.random.uniform(10, 100, 1_000_000))
Acceleration = (u.Meter / (u.Second ** 2))(np.random.uniform(0.5, 9.8, 1_000_000))

Force = Mass * Acceleration
Force_kN_array = Force.to(u.Newton * 1000).mag
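
As a cross-check on the arithmetic, the same computation can be written in plain NumPy with the units tracked by hand. This sketch assumes nothing about Chisa's internals; the variable names carry the units that the library would otherwise enforce.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mass_kg = rng.uniform(10, 100, 1_000_000)
accel_m_s2 = rng.uniform(0.5, 9.8, 1_000_000)

# kg * m/s^2 = N, computed element-wise over the whole array.
force_n = mass_kg * accel_m_s2
# 1 kN = 1000 N.
force_kn = force_n / 1000.0

print(force_kn.shape, float(force_kn.min()), float(force_kn.max()))
```

Nothing in this plain version stops you from adding force_n to mass_kg by mistake, which is the class of error a dimensional algebra layer is meant to catch.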

📖 Deep Dive: For advanced features like Dynamic Contextual Scaling (Mach), Axiom Bound derivation, and Registry Introspection, please refer to our Advanced Physics Documentation.

📦 Installation

Install via pip:

pip install chisa

Requirements:

Python 3.8+ (note that numpy >= 1.26.0 itself requires Python 3.9 or newer)
numpy >= 1.26.0
pandas >= 2.0.0

🛠 Roadmap & TODOs

String Expression Parser: Upgrade the registry to autonomously parse complex composite strings (e.g., "kg * m / s^2").
Global Context Manager: Introduce chisa.conf() to temporarily force data types or ignore boundary rules.
Polars Integration: Expand Schema.normalize() to support Polars DataFrames for ultra-fast Rust-based data processing.

🤝 Contributing

Contributions are what make the open-source community an amazing place to learn, inspire, and create. Any contributions you make to Chisa are greatly appreciated.

License

Distributed under the MIT License. See the LICENSE file for more information.

Project details

Release history

- 0.4.2: Apr 11, 2026
- 0.4.1: Apr 11, 2026
- 0.4.0: Apr 6, 2026
- 0.3.1: Mar 13, 2026
- 0.3.0: Mar 13, 2026
- 0.2.3: Feb 28, 2026
- 0.2.2: Feb 27, 2026
- 0.2.1 (this version): Feb 27, 2026

Download files

Download the file for your platform.

Source Distribution

phaethon-0.2.1.tar.gz (52.1 kB view details)

Uploaded Feb 27, 2026 Source

Built Distribution

phaethon-0.2.1-py3-none-any.whl (56.5 kB view details)

Uploaded Feb 27, 2026 Python 3

File details

Details for the file phaethon-0.2.1.tar.gz.

File metadata

Download URL: phaethon-0.2.1.tar.gz
Upload date: Feb 27, 2026
Size: 52.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for phaethon-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`32d548e51b55d6ac494747078425a84e190f3fa2a8e1c5a715ed10fbb602d51a`
MD5	`742fcbb2e7a63f5fdcdb378d54c9b307`
BLAKE2b-256	`e2f693a8921e9b34e7a1ad2b5651224e6fbd5df1036d4a0e37af3de13a975fae`
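
A downloaded file can be checked against the digests listed above with Python's standard hashlib module. The sketch below is one way to do it; the helper names file_digests and matches_sha256 are illustrative, and BLAKE2b-256 is taken to mean BLAKE2b with a 32-byte digest.

```python
import hashlib
import tempfile
from pathlib import Path

def file_digests(path):
    """Return the three digests listed on the page for the file at path."""
    data = Path(path).read_bytes()
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "md5": hashlib.md5(data).hexdigest(),
        # BLAKE2b-256 is BLAKE2b truncated to a 32-byte (256-bit) digest.
        "blake2b-256": hashlib.blake2b(data, digest_size=32).hexdigest(),
    }

def matches_sha256(path, expected_hex):
    """Compare the file's SHA256 with a published hex digest."""
    return file_digests(path)["sha256"] == expected_hex.strip().lower()

# Demonstration on a throwaway file so the sketch runs without a download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    digests = file_digests(sample)
    ok = matches_sha256(sample, digests["sha256"])

print(digests["sha256"], ok)
```

To verify the source distribution, call matches_sha256 on the downloaded phaethon-0.2.1.tar.gz with the SHA256 value from the table.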

File details

Details for the file phaethon-0.2.1-py3-none-any.whl.

File metadata

Download URL: phaethon-0.2.1-py3-none-any.whl
Upload date: Feb 27, 2026
Size: 56.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for phaethon-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e0272f4ee503db1211b336092cc6e8d917772159e3b322fdae2cceb00cb83f16`
MD5	`ef3f693d5e2cf6aeff49b134d8b8681a`
BLAKE2b-256	`29476fbb9ee0ffdd43d175e7a05c6c3c070382b005d4f76d570888f0d113f354`
