The Ultimate Data Cleaning Engine for Python

🌊 Tidely

The production-grade data cleaning engine for Python.


What is Tidely?

Tidely is a local-first, deterministic data cleaning library designed to replace hundreds of lines of fragile Pandas preprocessing code with a single, highly optimized command.

Tidely automatically profiles your dataset, infers semantic types (Dates, Emails, Currency, IDs), downcasts column types to cut memory footprint by up to 85%, and structures unstructured text, all without silently mutating your business logic or randomly dropping values.

Why Tidely Exists

Data scientists and engineers spend 80% of their time writing repetitive data cleaning boilerplate: fixing M/D/YYYY dates, trimming whitespace, downcasting 64-bit floats to save memory, parsing currency symbols, and dropping exact duplicate rows.

Tidely eliminates this boilerplate entirely. It is built on three core philosophies:

  1. Never silently delete data. Every transformation is tracked, explained, and non-destructive.
  2. Local-first and Secure. Tidely runs entirely on your CPU. No API keys, no LLMs, no cloud processing.
  3. Deterministic. The same dirty DataFrame yields the exact same clean DataFrame, every single time.
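The first and third principles can be illustrated with a minimal standard-library sketch. It is not Tidely's implementation: `CleanResult` and `drop_exact_duplicates` are hypothetical names invented here, and rows are modeled as plain dictionaries with hashable values rather than DataFrame rows. The sketch returns a new list, leaves the input untouched, and records what it removed.

```python
from dataclasses import dataclass, field


@dataclass
class CleanResult:
    """Hypothetical container: cleaned rows plus a log of every change."""
    rows: list
    log: list = field(default_factory=list)


def drop_exact_duplicates(rows):
    """Return a new list without exact duplicate rows; the input is not modified."""
    seen = set()
    kept = []
    dropped = []
    for index, row in enumerate(rows):
        # Order-independent fingerprint of the row (values must be hashable).
        key = tuple(sorted(row.items()))
        if key in seen:
            dropped.append(index)
        else:
            seen.add(key)
            kept.append(dict(row))  # copy so the caller's rows stay untouched
    result = CleanResult(rows=kept)
    if dropped:
        result.log.append(
            f"dropped {len(dropped)} exact duplicate row(s) at positions {dropped}"
        )
    return result


dirty = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Lima"},
    {"id": 1, "city": "Oslo"},
]
first = drop_exact_duplicates(dirty)
second = drop_exact_duplicates(dirty)
print(first.log)
print(first.rows == second.rows)  # same input, same output
```

Because the function keeps the first occurrence and iterates in input order, repeated runs on the same input produce the same rows and the same log.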

⚡ Quick Start

Installation

pip install tidely

The One-Minute Example

import pandas as pd
import tidely as td

# 1. Load your dirty data
df = pd.read_csv("dirty_data.csv")

# 2. Clean it automatically
result = td.clean(df)

# 3. Retrieve the clean, memory-optimized DataFrame
clean_df = result.df

# 4. View a detailed, explainable summary of what changed
print(result.summary())

🔍 Before vs After

Before Tidely:

import pandas as pd

df = pd.read_csv("data.csv")
df.drop_duplicates(inplace=True)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# regex=False treats '$' as a literal character, not the end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)
df['category'] = df['category'].astype('category')
df['is_active'] = df['is_active'].map({'yes': True, 'no': False})
# ... 50 more lines of boilerplate ...

After Tidely:

import pandas as pd
import tidely as td

df = td.clean(pd.read_csv("data.csv")).df

🚀 Core Features

  • Semantic Intelligence: Natively infers and standardizes Emails, URLs, Currencies, Boolean permutations (yes/y/true/1), IPv4, SSNs, and Dates (including US formats like MM/DD/YYYY).
  • Memory Optimization: Automatically downcasts over-provisioned 64-bit integers/floats to 16/32-bit types, and converts low-cardinality string columns to the Categorical dtype. Safely reduces Pandas memory footprints by 40-85%.
  • Zero-Corruption Duplicate Removal: Identifies and drops exact duplicate rows that skew statistical modeling.
  • Deep Explainability: Generates an exhaustive summary() explaining what was changed, why it was changed, and the impact of the change.
  • Business Logic Protection: Explicitly issues Warnings for missing financial or identifier data rather than blindly imputing zeros.
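To make the semantic inference idea concrete, here is a rough standard-library sketch of how a column of strings might be classified. It is an illustration under stated assumptions, not Tidely's actual rules: the function names, the token sets, the simplified email pattern, and the "every non-empty value must match" policy are all invented for this example.

```python
import ipaddress
import re
from decimal import Decimal, InvalidOperation

# Assumed token sets; the real library may accept more spellings.
TRUE_TOKENS = {"yes", "y", "true", "1"}
FALSE_TOKENS = {"no", "n", "false", "0"}
# Deliberately simple pattern, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def parse_bool(text):
    """Map common boolean spellings to True/False; None if unrecognized."""
    token = text.strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None


def parse_currency(text):
    """Strip a leading currency symbol and thousands separators; None if unparseable."""
    cleaned = text.strip().lstrip("$€£").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def is_ipv4(text):
    """True if the text is a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
        return True
    except ValueError:
        return False


def infer_semantic_type(values):
    """Label a column by the first check that every non-empty value passes."""
    present = [v for v in values if v.strip()]  # missing values are skipped, not imputed
    if not present:
        return "empty"
    checks = [
        ("boolean", lambda v: parse_bool(v) is not None),
        ("ipv4", is_ipv4),
        ("email", lambda v: EMAIL_RE.match(v.strip()) is not None),
        ("currency", lambda v: v.strip()[:1] in "$€£" and parse_currency(v) is not None),
    ]
    for label, check in checks:
        if all(check(v) for v in present):
            return label
    return "text"


print(infer_semantic_type(["yes", "N", "TRUE", "0"]))
print(infer_semantic_type(["$1,200.50", " $3 "]))
print(infer_semantic_type(["10.0.0.1", "192.168.1.20"]))
```

The checks run in a fixed order, so the result does not depend on anything but the input values. Empty strings are ignored during inference rather than replaced, in the spirit of the Business Logic Protection bullet.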

Supported DataFrames

Tidely currently supports:

  • pandas.DataFrame
  • polars.DataFrame
  • polars.LazyFrame
  • pyarrow.Table

🏎️ Performance Philosophy

Tidely is designed for enterprise scale. It relies heavily on vectorized operations backed by pandas and polars.

During internal benchmarking, Tidely processed 10,000,000 rows of mixed-type data in under 26 seconds, shrinking the DataFrame from 591 MB to 85 MB without corrupting type definitions. It relies purely on algorithmic inference, with no slow machine learning heuristics or network latency.
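The quoted benchmark figures correspond to a reduction of about 85.6% (1 - 85/591 ≈ 0.856). The kind of downcasting that produces such savings can be sketched with plain pandas. This is an illustration of the general technique, not Tidely's internals: `shrink` and its `max_category_ratio` threshold are hypothetical, and only `pd.to_numeric(..., downcast=...)` and `astype("category")` are standard pandas calls.

```python
import numpy as np
import pandas as pd


def shrink(df, max_category_ratio=0.5):
    """Return a copy of df with smaller numeric dtypes and categorical strings.

    max_category_ratio is an assumed heuristic: a string column is converted
    only when its distinct values are at most this fraction of the row count.
    """
    out = df.copy()  # the caller's DataFrame is left unchanged
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            # pandas only downcasts to float32 when the values survive the round trip
            out[name] = pd.to_numeric(col, downcast="float")
        elif pd.api.types.is_string_dtype(col) and col.nunique() <= max_category_ratio * len(col):
            out[name] = col.astype("category")
    return out


df = pd.DataFrame(
    {
        "qty": np.arange(1000, dtype="int64"),
        "price": np.arange(1000, dtype="float64") * 0.5,
        "region": ["north", "south"] * 500,
    }
)
small = shrink(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```

The actual savings depend heavily on the data: columns whose values need the full 64-bit range, or string columns with mostly unique values, will not shrink this way.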


🛡️ Validation Summary (Public Beta)

Tidely v1.0 has completed an extensive internal validation campaign covering more than twenty real-world datasets across healthcare, finance, retail, manufacturing, government, environmental science, e-commerce, and enterprise Excel workflows.

The library has also passed property-based testing (Hypothesis), fuzz testing, large-scale stress testing up to 10 million rows, API stability checks, and cross-version compatibility testing.

Based on these results, Tidely is now entering Public Beta, where broader community feedback will continue to strengthen its reliability.


📚 Documentation

Detailed documentation is available in the docs/ directory.


🛣️ Roadmap

  • Multi-threaded processing for CSV batch-cleaning.
  • Out-of-core chunked processing for data exceeding local RAM.
  • Geographic coordinate standardization (Lat/Lon).
  • Enhanced HTML extraction capabilities.

🤝 Contributing

Tidely is an open-source project and community contributions are highly welcome. Please review our CONTRIBUTING.md and CODE_OF_CONDUCT.md before submitting pull requests.

License

Tidely is released under the MIT License.
