The Ultimate Data Cleaning Engine for Python
What is Tidely?
Tidely is a local-first, deterministic data cleaning library designed to replace hundreds of lines of fragile Pandas preprocessing code with a single, highly optimized command.
Tidely automatically profiles your dataset, infers semantic types (Dates, Emails, Currency, IDs), reduces memory footprint by up to 85% through safe downcasting, and structures unstructured text, all without silently mutating your business logic or randomly dropping values.
Why Tidely Exists
Data scientists and engineers spend 80% of their time writing repetitive data cleaning boilerplate: fixing M/D/YYYY dates, trimming whitespace, downcasting 64-bit floats to save memory, parsing currency symbols, and dropping exact duplicate rows.
Tidely eliminates this boilerplate entirely. It is built on three core philosophies:
- Never silently delete data. Every transformation is tracked, explained, and non-destructive.
- Local-first and Secure. Tidely runs entirely on your CPU. No API keys, no LLMs, no cloud processing.
- Deterministic. The same dirty DataFrame yields the exact same clean DataFrame, every single time.
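To make the first and third principles concrete, here is a minimal sketch in plain Python of a tracked, non-destructive, deterministic transformation. It is not Tidely's implementation: `TrackedResult` and `strip_whitespace` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedResult:
    """Hypothetical container: cleaned values plus a log of every change."""
    values: list
    changes: list = field(default_factory=list)


def strip_whitespace(values):
    """Return cleaned values and a change log; the input list is never modified.

    Every input element produces exactly one output element, so nothing is
    dropped, and each change is recorded as (index, before, after).
    """
    result = TrackedResult(values=[])
    for index, value in enumerate(values):
        cleaned = value.strip() if isinstance(value, str) else value
        if cleaned != value:
            result.changes.append((index, value, cleaned))
        result.values.append(cleaned)
    return result


raw = ["  alice ", "bob", None, "carol\t"]
first = strip_whitespace(raw)
second = strip_whitespace(raw)  # same input, so the result should be identical
```

Because the function has no hidden state and never mutates its input, running it twice on the same data gives equal results, and the change log explains each difference between input and output.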
⚡ Quick Start
Installation
pip install tidely
The One-Minute Example
import pandas as pd
import tidely as td
# 1. Load your dirty data
df = pd.read_csv("dirty_data.csv")
# 2. Clean it automatically
result = td.clean(df)
# 3. Retrieve the clean, memory-optimized DataFrame
clean_df = result.df
# 4. View a detailed, explainable summary of what changed
print(result.summary())
🔍 Before vs After
Before Tidely:
df = pd.read_csv("data.csv")
df.drop_duplicates(inplace=True)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)  # regex=False: treat '$' literally, not as an end-of-string anchor
df['category'] = df['category'].astype('category')
df['is_active'] = df['is_active'].map({'yes': True, 'no': False})
# ... 50 more lines of boilerplate ...
After Tidely:
import pandas as pd
import tidely as td
df = td.clean(pd.read_csv("data.csv")).df
🚀 Core Features
- Semantic Intelligence: Natively infers and standardizes Emails, URLs, Currencies, Boolean permutations (yes/y/true/1), IPv4, SSNs, and Dates (including US formats like MM/DD/YYYY).
- Memory Optimization: Automatically downcasts over-provisioned 64-bit integers/floats to 16/32-bit types, and converts low-cardinality strings to Categorical pointers. Safely reduces Pandas memory footprints by 40-85%.
- Zero-Corruption Duplicate Removal: Identifies and drops exact duplicate rows that skew statistical modeling.
- Deep Explainability: Generates an exhaustive summary() explaining what was changed, why it was changed, and the impact of the change.
- Business Logic Protection: Explicitly issues Warnings for missing financial or identifier data rather than blindly imputing zeros.
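As an illustration of the kind of rule-based semantic parsing described above, the following sketch normalizes Boolean permutations and currency strings using only the standard library. It is an assumption-laden example, not Tidely's detection engine: the function names, the false-token set, and the supported currency symbols are all hypothetical.

```python
import re
from decimal import Decimal

# The true tokens come from the feature list; the false tokens are an assumed mirror.
TRUE_TOKENS = {"yes", "y", "true", "1"}
FALSE_TOKENS = {"no", "n", "false", "0"}

# Plain decimal numbers only, so strings like "NaN" or "Infinity" are rejected.
NUMBER_PATTERN = re.compile(r"-?\d+(\.\d+)?")


def parse_bool(value):
    """Map common yes/no spellings to True/False; return None when unrecognized."""
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None


def parse_currency(value):
    """Parse strings such as '$1,234.50' into a Decimal.

    Returns None (not zero) when the value cannot be parsed, so a missing
    amount stays visibly missing.
    """
    text = re.sub(r"[$€£,\s]", "", str(value))
    if NUMBER_PATTERN.fullmatch(text):
        return Decimal(text)
    return None
```

Returning `None` for unparseable input, rather than a default such as zero, keeps missing financial values distinguishable from genuine zero amounts.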
Supported DataFrames
Tidely currently supports:
- pandas.DataFrame
- polars.DataFrame
- polars.LazyFrame
- pyarrow.Table
🏎️ Performance Philosophy
Tidely is designed for enterprise scale. It relies heavily on vectorized operations backed by pandas and polars.
During internal benchmarking, Tidely processed 10,000,000 rows of mixed-type data in under 26 seconds, shrinking the DataFrame from 591 MB to 85 MB without corrupting type definitions. It relies purely on algorithmic inference, with no slow machine learning heuristics or network latency.
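The memory savings come from techniques that can be reproduced with plain pandas. The sketch below shows one way to downcast numeric columns and convert low-cardinality strings to categoricals. It is a simplified illustration, not Tidely's actual algorithm: `downcast_frame` and the 50% cardinality threshold are assumptions made for the example.

```python
import pandas as pd


def downcast_frame(df):
    """Return a copy with smaller dtypes where values fit; the input is untouched."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col):
            # Picks the smallest integer type that can hold every value.
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_float_dtype(col):
            # Only use float32 when the round trip back to float64 is exact.
            as32 = col.astype("float32")
            if as32.astype("float64").equals(col.astype("float64")):
                out[name] = as32
        elif pd.api.types.is_string_dtype(col) and col.nunique() < 0.5 * len(col):
            # Assumed threshold: treat a column as low-cardinality when fewer
            # than half of its values are distinct.
            out[name] = col.astype("category")
    return out


df = pd.DataFrame({
    "id": range(1000),
    "price": [0.5, 1.25, 2.0, 4.75] * 250,
    "status": ["open", "closed"] * 500,
})
small = downcast_frame(df)

before_bytes = df.memory_usage(deep=True).sum()
after_bytes = small.memory_usage(deep=True).sum()
```

Comparing `before_bytes` with `after_bytes` shows the reduction for a given dataset. Actual savings depend heavily on the value ranges and string cardinality of the data.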
🛡️ Validation Summary (Public Beta)
Tidely v1.0 has completed an extensive internal validation campaign covering more than twenty real-world datasets across healthcare, finance, retail, manufacturing, government, environmental science, e-commerce, and enterprise Excel workflows.
The library has also passed property-based testing (Hypothesis), fuzz testing, large-scale stress testing up to 10 million rows, API stability checks, and cross-version compatibility testing.
Based on these results, Tidely is now entering Public Beta, where broader community feedback will continue to strengthen its reliability.
📚 Documentation
Detailed documentation is available in the docs/ directory:
- Introduction & Philosophy
- Installation Guide
- Cleaning Guide
- Semantic Detection Engine
- Memory & Performance
- Validation Guide
- FAQ
🛣️ Roadmap
- Multi-threaded processing for CSV batch-cleaning.
- Out-of-core chunked processing for data exceeding local RAM.
- Geographic coordinate standardization (Lat/Lon).
- Enhanced HTML extraction capabilities.
🤝 Contributing
Tidely is an open-source project and community contributions are highly welcome. Please review our CONTRIBUTING.md and CODE_OF_CONDUCT.md before submitting pull requests.
License
Tidely is released under the MIT License.
File details
Details for the file tidely-1.0.0b1.tar.gz.
File metadata
- Download URL: tidely-1.0.0b1.tar.gz
- Upload date:
- Size: 29.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5265d7c96d11fe052146b30e8ecc26fbb3f522c23107925768a65be5a241b124 |
| MD5 | f48791b6bef08d19a50d28b72342cf10 |
| BLAKE2b-256 | 14f4119f6142928f5ff5e43e9a58f8a4b8f1aed945984b9ae7a3dccd38d8892e |
File details
Details for the file tidely-1.0.0b1-py3-none-any.whl.
File metadata
- Download URL: tidely-1.0.0b1-py3-none-any.whl
- Upload date:
- Size: 39.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 457cd4f8dc8b6ae9ac7bb573a279c19306164029f20cbaf293c442bf93791960 |
| MD5 | b3c2e2c780963fd75049df8b3fd889b3 |
| BLAKE2b-256 | 6cb8d72f48ecde3a0113e9c07a9fb39c5e7b248c5ad14642268d82f417e0714f |