
Data Signing & Quality - Offline data integrity with CRC32 file hashing and MD5 schema hashing


fosho - Data Validation for Half-Asleep Scientists

File-&-schema Offline Signing & Hash Observatory

Stop wondering if your downstream scripts are using the data you think they are. fosho gives you confidence that your data hasn't changed under your feet.

Zombie-Proof Steps (Copy-Paste Ready)

Scenario: You have downstream_script.py that keeps breaking because your data changes. You want it to fail fast with clear errors instead of producing wrong results.

Step 1: You already have data + a script that breaks

# Your current situation:
# - data/my_data.csv (keeps changing)  
# - downstream_script.py (breaks silently when data changes)

Step 2: Generate schema and manifest (once)

# Scan your data directory - generates schemas automatically
uv run fosho scan data/

# This creates:
# - schemas/my_data_schema.py (auto-generated schema)
# - manifest.json (tracks file hashes and signing status)
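The manifest's exact format isn't shown here, but conceptually it maps each data file to its recorded hashes and signing state. A minimal sketch of what an entry might contain (field names are assumptions, not fosho's documented format):

```python
import json

# Hypothetical manifest entry -- the field names below are illustrative
# assumptions, not fosho's actual on-disk format.
manifest = {
    "data/my_data.csv": {
        "crc32": "3b7a1c9e",                       # hash of the file's bytes
        "schema": "schemas/my_data_schema.py",     # generated schema module
        "signed": False,                           # flips to True after `fosho sign`
    }
}

print(json.dumps(manifest, indent=2))
```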

Step 3: Look at schema, edit if needed

cat schemas/my_data_schema.py

You'll see something like:

import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int),
    "name": pa.Column(str),
    "score": pa.Column(float, nullable=True),
})

Edit this file if you want stricter validation (ranges, required values, etc.).
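For example, the auto-generated schema above could be tightened with pandera checks. The specific bounds here are made up for illustration; pick ones that match your data:

```python
import pandera as pa

# Stricter version of the auto-generated schema (example bounds are
# assumptions -- adjust to what your data actually guarantees).
schema = pa.DataFrameSchema({
    "id": pa.Column(int, checks=pa.Check.ge(1)),          # ids start at 1
    "name": pa.Column(str, checks=pa.Check.str_length(min_value=1)),
    "score": pa.Column(float, checks=pa.Check.in_range(0.0, 100.0),
                       nullable=True),
})
```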

Step 4: Sign your data (approve current state)

# Approve the current data state
uv run fosho sign

# Check status
uv run fosho status

Step 5: Replace your pandas.read_csv() calls

Before (dangerous):

import pandas as pd
df = pd.read_csv('data/my_data.csv')  # No check that the file still matches what you expect

After (safe):

import fosho

# Load with validation
df = fosho.read_csv(
    file='data/my_data.csv',
    schema='schemas/my_data_schema.py',
    manifest_path='manifest.json'
)

# Must validate before use - crashes if data changed since signing
validated_df = df.validate()  # 🚨 CRASHES if data changed

Step 6: Run your script

  • If data matches schema: Script runs normally
  • 🚨 If data changed: Script crashes with a clear error message
  • 🎯 No more silent failures: You know immediately when the data structure changes

What This Solves

Before: "Wait, did my preprocessing script change this CSV? Is my downstream analysis using old data?"

After: Your script crashes with a clear error if the data changed. No more silent failures.

The Magic

  1. File hashing - Detects when CSVs change (even 1 byte)
  2. Schema validation - Ensures data structure matches expectations
  3. Signing workflow - Explicit approval step prevents accidents
  4. Fail-fast - Scripts error immediately if using stale/changed data

Concrete Example

Your messy situation:

# You have this data that keeps changing
cat data/sales.csv
# id,product,revenue
# 1,widget,100.50
# 2,gadget,75.25

# And this script that breaks when data structure changes
cat analyze_sales.py
# import pandas as pd
# df = pd.read_csv('data/sales.csv')
# print(df['revenue'].mean())  # Breaks if 'revenue' column disappears

The fosho solution:

# 1. Scan and generate schema
uv run fosho scan data/

# 2. Check what it generated
cat schemas/sales_schema.py
# schema = pa.DataFrameSchema({
#     "id": pa.Column(int),
#     "product": pa.Column(str), 
#     "revenue": pa.Column(float),
# })

# 3. Sign the data state
uv run fosho sign

# 4. Update your script (safer approach)
cat analyze_sales_safe.py
# import fosho
# 
# df = fosho.read_csv(
#     file='data/sales.csv',
#     schema='schemas/sales_schema.py',
#     manifest_path='manifest.json'
# )
# validated_df = df.validate()  # <- This line protects you
# print(validated_df['revenue'].mean())  # Now safe!

Result: Your script now crashes immediately with a clear error if someone changes the data structure, instead of producing wrong results.

When Things Change

If your data changes:

# Re-scan to update checksums
uv run fosho scan data/

# Review what changed
uv run fosho status

# Re-approve if changes are intentional
uv run fosho sign

Your Python scripts will refuse to run until you explicitly re-approve the changes.

Commands

  • fosho scan data/ - Find CSVs, generate schemas, update manifest
  • fosho sign - Approve all current data/schemas
  • fosho status - Show what's signed vs unsigned
  • fosho verify - Check if everything still matches

Installation

cd your-project
uv add fosho  # or pip install fosho

Philosophy

Data scientists need simple validation first, not complex rules. fosho generates minimal schemas (just column types + nullability) so you can:

  1. Get protection immediately
  2. Add more validation rules later as you learn about your data
  3. Never wonder "is my script using the right data?"

Perfect for preprocessing pipelines that keep changing and downstream analyses that need to stay in sync.
