The Ultimate Data Cleaning Engine for Python

These details have not been verified by PyPI

Project description

Tidely

Zero-configuration semantic data cleaning for modern Python workflows.

2. Elevator Pitch

Tidely profiles, cleans, validates, and optimizes tabular datasets with a single line of code, preparing them for downstream analytics and machine learning workflows.

3. Installation & Compatibility

To install Tidely, use pip:

pip install tidely

To upgrade to the latest stable version:

pip install -U tidely

Python Support

Tidely is tested and fully compatible with the following Python versions:

✅ Python 3.12
✅ Python 3.13
✅ Python 3.14

4. Quick Start

Clean and export a tabular dataset in a few lines of Python:

import tidely as td

# Automatically detect, profile, and clean a dataset
result = td.clean("dirty_data.csv")

# Print an explainable summary of all applied fixes
print(result.summary())

# Export the clean dataset
result.export("clean_data.csv")

5. Why Tidely Exists

In modern machine learning and data engineering, cleaning messy datasets is often the most time-consuming task. Engineers routinely spend hours writing fragile, repetitive scripts to fix missing values, coerce types, remove duplicates, and standardize semantic structures.

These tasks lead to bloated codebases, silent data bugs, and high memory overhead. Tidely aims to remove this friction by acting as a deterministic cleaning scheduler: it infers column types, corrects data anomalies, and downcasts numeric types, applying rules that are designed to preserve valid information.

6. Before Tidely vs. After Tidely

Manual Preprocessing Script (45+ Lines of Pandas)

import pandas as pd
import numpy as np

# Load raw file
df = pd.read_csv("dirty_data.csv")

# Clean duplicate records
df = df.drop_duplicates()

# Impute missing values with group medians
df["Salary"] = df.groupby("Department")["Salary"].transform(lambda x: x.fillna(x.median()))

# Clean and pad ZIP codes
df["Zip"] = df["Zip"].astype(str).str.replace(r"\.0$", "", regex=True)
df["Zip"] = df["Zip"].apply(lambda x: x.zfill(5) if x != "nan" else np.nan)

# Standardize email formatting (the .str accessor leaves missing values as NaN
# instead of turning them into the string "nan")
df["Email"] = df["Email"].str.strip().str.lower()

# Clip coordinates
df["Latitude"] = pd.to_numeric(df["Latitude"], errors="coerce")
df["Latitude"] = df["Latitude"].clip(-90.0, 90.0)

# Save
df.to_csv("clean_data.csv", index=False)

The 2-Line Tidely API

import tidely as td
cleaned_df = td.clean("dirty_data.csv").df

7. Real Cleaning Example

Before Cleaning

id	email	join_date	salary	Latitude	Zip
1	JOHN.DOE@GMAIL.COM	2026/06/30	50000	221.5	123
2	jane.smith@gmail.com	06-30-2026	N/A	-45.2	00987
?	invalid_email	2026-06-30	45000	92.0	8765
1	JOHN.DOE@GMAIL.COM	2026/06/30	50000	221.5	123

After Tidely

id	email	join_date	salary	Latitude	Zip
1	john.doe@gmail.com	2026-06-30	50000	90.0	00123
2	jane.smith@gmail.com	2026-06-30	null	-45.2	00987
null	null	2026-06-30	45000	90.0	08765

Applied Fixes Breakdown

Email: Lowercased, stripped, and standardized; values that are not valid addresses (invalid_email) set to null.
Dates: Mixed layouts (2026/06/30, 06-30-2026) normalized to YYYY-MM-DD.
Duplicates: Removed exact duplicate rows (row 4 dropped).
Missing Values: Placeholders ? and N/A mapped to native null.
Outliers: Latitudes clipped to the physical range [-90.0, 90.0].
ZIP codes: Left-padded with zeros to 5 digits.
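
As a rough illustration of the fixes listed above, the sketch below reproduces the "After Tidely" table from the "Before Cleaning" rows using only the Python standard library. It is not Tidely's implementation: the helper names, the placeholder list, the accepted date layouts, and the deliberately loose email pattern are all assumptions made for this example.

```python
import re
from datetime import datetime

COLUMNS = ("id", "email", "join_date", "salary", "Latitude", "Zip")

# Rows copied from the "Before Cleaning" table.
RAW_ROWS = [
    ("1", "JOHN.DOE@GMAIL.COM", "2026/06/30", "50000", "221.5", "123"),
    ("2", "jane.smith@gmail.com", "06-30-2026", "N/A", "-45.2", "00987"),
    ("?", "invalid_email", "2026-06-30", "45000", "92.0", "8765"),
    ("1", "JOHN.DOE@GMAIL.COM", "2026/06/30", "50000", "221.5", "123"),
]

NULL_TOKENS = {"?", "N/A", ""}                        # assumed placeholder list
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m-%d-%Y")   # assumed accepted layouts
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose, illustrative check


def to_null(value):
    """Map placeholder tokens to None; otherwise return the stripped value."""
    value = value.strip()
    return None if value in NULL_TOKENS else value


def parse_date(value):
    """Try each known layout and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(raw):
    row = {name: to_null(value) for name, value in zip(COLUMNS, raw)}
    email = row["email"].lower() if row["email"] else None
    row["email"] = email if email and EMAIL_RE.match(email) else None
    row["join_date"] = parse_date(row["join_date"]) if row["join_date"] else None
    row["id"] = int(row["id"]) if row["id"] else None
    row["salary"] = int(row["salary"]) if row["salary"] else None
    if row["Latitude"] is not None:
        # Clip to the physical latitude range.
        row["Latitude"] = max(-90.0, min(90.0, float(row["Latitude"])))
    if row["Zip"] is not None:
        row["Zip"] = row["Zip"].zfill(5)
    return row


def clean_rows(raw_rows):
    """Drop exact duplicate raw rows, then clean the survivors in order."""
    seen, cleaned = set(), []
    for raw in raw_rows:
        if raw in seen:
            continue
        seen.add(raw)
        cleaned.append(clean_row(raw))
    return cleaned


CLEANED = clean_rows(RAW_ROWS)
for row in CLEANED:
    print(row)
```

Deduplication runs on the raw rows, which matches the "exact duplicate rows" wording; a cleaner that deduplicates after normalization could drop more rows than this sketch does.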

8. Features

Tidely's capabilities are divided into four core categories:

Inspection & DNA Profiling: Infers data structure, delimiter, encoding, and formats, and calculates a 5-dimension data quality trust score.
Semantic Inference: Automatically maps columns to semantic roles (e.g. Email, DNA Sequence, Currency, Coordinate, ZIP Code, Phone) based on regex and entropy rules.
Programmatic Cleaning: Executes out-of-core null conversions, group-by imputation, outlier clipping, and strict primary key deduplication.
Memory Optimization: Safely downcasts integer widths and compresses repeating strings to categorical representations, reducing memory footprint by up to 61%.
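
To make the Memory Optimization item more concrete, here is a minimal pure-Python sketch of the two decisions involved: picking the narrowest signed integer width that holds a column, and dictionary-encoding a string column that repeats often. The function names and the 0.5 uniqueness threshold are assumptions for illustration; Tidely's own rules may differ.

```python
from collections import Counter

INT_WIDTHS = (8, 16, 32, 64)  # candidate signed integer widths, in bits


def smallest_int_width(values):
    """Return the narrowest signed width (in bits) that holds every value."""
    lo, hi = min(values), max(values)
    for bits in INT_WIDTHS:
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return bits
    raise OverflowError("values do not fit in a 64-bit signed integer")


def encode_categorical(values, max_unique_ratio=0.5):
    """Dictionary-encode a string column when it repeats often enough.

    Returns (categories, codes), or None when the column is too unique
    for the encoding to pay off. The threshold is an assumption.
    """
    counts = Counter(values)
    if len(counts) / len(values) > max_unique_ratio:
        return None
    categories = sorted(counts)
    index = {category: code for code, category in enumerate(categories)}
    return categories, [index[value] for value in values]


print(smallest_int_width([23, 45, 67, 120]))    # fits in 8 bits
print(smallest_int_width([50000, 45000]))       # needs 32 bits
print(encode_categorical(["HR", "IT", "HR", "IT", "HR", "Ops"]))
```

Storing small integer codes plus one copy of each distinct string is what makes categorical encoding cheaper than repeating the strings themselves.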

9. What Tidely Cleans

Cleaning Task	Supported	Planned
Missing values imputation	✅
Duplicate rows deduplication	✅
Primary key enforcement	✅
Email formatting standardization	✅
Phone number cleaning	✅
Coordinate limits boundary clipping	✅
ZIP code padding	✅
Biological DNA sequence protection	✅
Currency standardisation	✅
Categorical conversion & downcasting	✅
Empty/unnamed column names	✅
Mixed datatypes coercion	✅
Unicode C0/C1 control character stripping	✅
Out-of-core streaming execution	✅
Deep Learning semantic classification		✅
Time-series timezone alignment		✅
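
One of the rows above, Unicode C0/C1 control character stripping, is simple enough to sketch directly. The version below is an illustration under stated assumptions, not Tidely's rule: it keeps tab, line feed, and carriage return because they usually carry meaning in text data, and it also removes DEL (U+007F), which sits between the C0 and C1 ranges. The function name is hypothetical.

```python
import re

# C0 controls are U+0000-U+001F and C1 controls are U+0080-U+009F.
# Tab (\x09), line feed (\x0a), and carriage return (\x0d) are kept.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_control_chars(text: str) -> str:
    """Remove C0/C1 control characters (and DEL) from a string."""
    return CONTROL_CHARS.sub("", text)


print(repr(strip_control_chars("Jane\x00 Smith\x85\tOK\n")))
```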

10. Supported Formats

Format Extension	Reader Engine	Memory Mode	Native Integration
`.csv`	Polars / DuckDB	Native / Streaming / Lazy	Polars, Pandas, Arrow
`.parquet`	Polars / DuckDB	Native / Streaming / Lazy	Polars, Pandas, Arrow
`.xlsx` / `.xls`	Calamine	Eager	Polars, Pandas, Arrow
`.arff`	Custom Parser	Eager	Polars, Pandas
`.json` / `.ndjson`	Polars	Eager	Polars, Pandas
`.feather` / `.arrow`	PyArrow	Eager	Arrow, Pandas, Polars

11. How Tidely Works

The flowchart below demonstrates the execution path from raw data input to production-ready output:

graph TD
    A[Raw Dataset Input] --> B[Adapter / Loader]
    B --> C[Inspection Engine: dna, encoding, size]
    C --> D[Semantic Engine: pattern inference]
    D --> E[Decision Engine: clean plan builder]
    E --> F[Hardware Selection: Eager / Lazy / DuckDB / Streaming]
    F --> G[Rule Engine: missing, outliers, formatting]
    G --> H[Cleaning Pipeline Execution]
    H --> I[Validation Guard: zero data loss check]
    I --> J[Clean DataFrame & Trust Score HTML Report]

12. Automatic Backend Selection

Tidely dynamically routes datasets depending on their file size and host system resources to prevent Out-Of-Memory (OOM) crashes:

graph TD
    A[Dataset File Path] --> B[Estimate Size]
    B --> C{Fits in Memory?<br>Size < 50% Free RAM}
    C -->|Yes| D{Size < 10MB?}
    C -->|No| E{Format CSV/Parquet?}
    D -->|Yes| F[Polars Eager Backend]
    D -->|No| G[Polars Lazy Backend]
    E -->|Yes| H[DuckDB Query Engine]
    E -->|No| I[Chunked Streaming Engine]
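
The routing in the flowchart above can be written as a small decision function. This is a sketch of the documented branches only: the function name, the 1 MB = 1024 * 1024 bytes convention, and the "chunked_streaming" label are assumptions, while the other backend labels follow the names used in the benchmark table.

```python
from pathlib import Path

MB = 1024 * 1024
EAGER_LIMIT_BYTES = 10 * MB            # "Size < 10MB?" branch
MEMORY_FRACTION = 0.5                  # "Size < 50% Free RAM" branch
DUCKDB_FORMATS = {".csv", ".parquet"}  # "Format CSV/Parquet?" branch


def select_backend(path: str, size_bytes: int, free_ram_bytes: int) -> str:
    """Pick a backend label from file size, free RAM, and file extension."""
    if size_bytes < MEMORY_FRACTION * free_ram_bytes:
        return "polars_eager" if size_bytes < EAGER_LIMIT_BYTES else "polars_lazy"
    if Path(path).suffix.lower() in DUCKDB_FORMATS:
        return "duckdb"
    return "chunked_streaming"


free_ram = 8 * 1024 * MB
print(select_backend("small.csv", 5 * MB, free_ram))
print(select_backend("medium.csv", 50 * MB, free_ram))
print(select_backend("huge.csv", 10 * 1024 * MB, free_ram))
print(select_backend("huge.ndjson", 10 * 1024 * MB, free_ram))
```

In practice the size and free-memory figures would come from the file system and the operating system; they are passed in as plain numbers here so the logic can be tested in isolation.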

13. Architecture

Tidely consists of the following core modules:

adapter.py: Standardizes input loading and estimates file size before loading to memory.
semantic.py: Performs regex and probabilistic pattern matches to identify column semantics.
decision_engine.py: Builds the execution plan and selects the backend routing strategy.
plan.py: Tracks and prioritizes the list of RepairAction items to perform.
rules.py: Implements vectorized cleaning algorithms (mean, median, and mode imputation; Z-score clipping bounds).
clean_engine.py: Translates rules into execution steps and compiles plans into SQL.
streaming.py: Executes out-of-core file-to-file conversions using DuckDB or batched readers.
result.py: Encapsulates the clean DataFrame, audit trails, and HTML report exporting.
api.py: Exposes the public td.clean and td.inspect interfaces.
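
As an illustration of the regex side of what semantic.py is described as doing, the sketch below assigns a semantic role to a column when enough of its non-empty values match a pattern. Everything here is hypothetical: the pattern table, the infer_role name, and the 0.8 match threshold are assumptions, and the entropy and probabilistic rules mentioned in this document are not covered.

```python
import re

# Hypothetical role patterns; real rules would be broader and locale-aware.
ROLE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "zip_code": re.compile(r"^\d{5}$"),
    "dna_sequence": re.compile(r"^[ACGT]{8,}$"),
}


def infer_role(values, min_match_ratio=0.8):
    """Return (role, ratio) for the best-matching pattern.

    The role is None when no pattern reaches min_match_ratio.
    """
    non_null = [v.strip() for v in values if v and v.strip()]
    if not non_null:
        return None, 0.0
    best_role, best_ratio = None, 0.0
    for role, pattern in ROLE_PATTERNS.items():
        ratio = sum(1 for v in non_null if pattern.match(v)) / len(non_null)
        if ratio > best_ratio:
            best_role, best_ratio = role, ratio
    if best_ratio < min_match_ratio:
        return None, best_ratio
    return best_role, best_ratio


print(infer_role(["a@x.com", "b@y.org", "c@z.net", "d@w.io", "oops"]))
print(infer_role(["00123", "00987", "08765"]))
print(infer_role(["hello", "world"]))
```

Tolerating a share of non-matching values matters because dirty columns are exactly the ones that need a role assigned before they can be repaired.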

14. Performance Benchmarks

Environment Specifications

CPU: Intel i5-13420H (8 cores, 12 threads @ 3.4 GHz)
RAM: 16 GB DDR4
Storage: NVMe PCIe Gen4 SSD
OS: Windows 11 Home
Python: 3.14.0a2
DuckDB: 1.5.4 | Polars: 1.5.0

Benchmark Execution Metrics

Note (Quality Score Disclaimer): The quality score is an internal evaluation metric designed to compare cleaning workflows under the same benchmark conditions. It is not an industry-standard benchmark.

Dataset	Size	Backend	Duration	Peak RAM	Speed (Rows/sec)	Health Diff
`311_ServiceRequest`	0.40 MB	`polars_eager`	0.13s	14 MB	104	91% ➔ 93%
`Allegations-of-Harassment`	0.02 MB	`polars_eager`	0.08s	8 MB	712	91% ➔ 92%
`credits.csv`	3.64 MB	`polars_eager`	1.16s	34 MB	67,243	93% ➔ 95%
`Crunchy Corner Budget`	10.04 MB	`polars_lazy`	7.03s	68 MB	6,958	81% ➔ 98%
`customers-2000000.csv`	333.24 MB	`duckdb`	3.66s	42 MB	546,597	90% ➔ 93%
`dataset_31_credit-g`	0.15 MB	`polars_eager`	0.52s	11 MB	1,886	84% ➔ 90%
`Parking_Meters`	2.41 MB	`polars_eager`	0.09s	16 MB	365,213	92% ➔ 96%
`Uncleaned-data.txt`	58.01 MB	`polars_lazy`	7.83s	118 MB	43,817	88% ➔ 98%
`y_amazon-google-large.csv`	110.07 MB	`duckdb`	1.13s	28 MB	2,705,934	96% ➔ 96%

15. Technical Validation Campaign

To assess production readiness, Tidely v1.4.2 was run against the following technical validation suite:

Fuzz & Edge-Case Testing: Validated against corrupted encodings, duplicate headers, missing headers, scientific notation, and timezone anomalies.
System Testing: 100% test coverage verified across all campaign datasets, including large stress tests up to 10,000,000 rows.
Code Audits: Checked for type safety (strict MyPy compliance) and formatting style rules (Ruff check).
Validation Outcome: All 59 automated tests passed successfully against Python 3.14 with 0 MyPy issues and 0 Ruff violations.

16. Ecosystem Comparison

Tidely complements, rather than replaces, existing data quality and processing packages:

Dimension	Tidely	Pandera	Great Expectations	Pandas / Polars
Auto-Cleaning / Repair	✅ Yes	❌ No	❌ No	❌ No
Semantic Inference	✅ Yes	❌ No	❌ No	❌ No
One-line Cleaning API	✅ Yes	❌ No	❌ No	❌ No
Streaming / Out-of-Core	✅ Yes	❌ No	❌ No	✅ (Polars/Lazy)
Validation Schemas	❌ No	✅ Yes	✅ Yes	❌ No
Interactive HTML Reports	✅ Yes	❌ No	✅ Yes	❌ No

17. Cleaning Workflow Comparison

Comparison of manual script maintenance against Tidely's automated cleaner on y_amazon-google-large.csv:

Dimension	Manual Pandas	Manual Polars	DuckDB SQL Script	Tidely
Lines of Code	45 lines	35 lines	50 lines	2 lines
Automatic Routing	❌ No	❌ No	❌ No	✅ Yes
Out-of-core Streaming	❌ No	❌ No	✅ Yes	✅ Yes
Semantic Inference	❌ No	❌ No	❌ No	✅ Yes
Deduplication Guard	Manual	Manual	Manual	✅ Automatic
Outlier Boundary Clip	Manual	Manual	Manual	✅ Automatic
Imputation Strategy	Manual	Manual	Manual	✅ Automatic
Interactive Reports	❌ No	❌ No	❌ No	✅ Yes

18. Technical Validation Report

A summary of findings from our campaign audits:

Data Integrity: No unintended data corruption was observed across the evaluated datasets.
Strengths: Zero-configuration loading, robust memory footprint downcasting, and low-memory file-to-file COPY execution.
Limitations: Excel files larger than 100MB cannot currently be streamed out-of-core due to engine constraints and must be loaded in memory.

19. Reports & CLI Outputs

Tidely produces two kinds of output: an interactive HTML report and a terminal summary.

Interactive HTML Quality Report

Exports a multi-tab dashboard displaying column diagnostic metrics, applied transformations, cleaned preview grids, and engine execution details:

[Tab: Column Diagnostics]
├─ id: (90% trust, inferred Key)
├─ email: (85% trust, inferred Email, standardisation applied)
└─ salary: (95% trust, inferred Number, group-by median imputed)

[Tab: Applied Transformations]
├─ Duplicate Rows: Dropped 5 duplicate rows.
└─ Coordinate Normalization: Clipped 12 outliers in 'Latitude'.

CLI Terminal Output

$ tidely inspect dataset.csv

SPOTLESS INSPECTION SUMMARY
==========================================
Overall Health Trust Score: 89%
Total Columns: 6 | Total Rows: 10,000
Selected Engine: polars_eager (Low latency)
==========================================
id        ➔ Inferred ID/Key (High Confidence)
email     ➔ Inferred Email  (12 formatting issues)
latitude  ➔ Inferred Lat    (2 outlier values)

20. API Usage

Python API

import tidely as td

# Inspect dataset metrics
profile = td.inspect("data.csv")
profile.show()

# Clean dataset
result = td.clean("data.csv")
df = result.df

# Print transformations summary
print(result.summary())

# Revert repairs and retrieve original dataset
original_df = result.undo()

# Export clean dataset & HTML report
result.export("clean.csv")
result.export("report.html")

Command Line Interface

# Clean a CSV file and save output
tidely clean input.csv --out clean.csv

# Inspect dataset structure and trust score
tidely inspect input.csv

# Generate HTML quality report
tidely report input.csv --out report.html

21. FAQ

Does Tidely replace Pandas or Polars?

No. Tidely is a data preparation layer. It automatically sanitizes datasets and returns standard dataframes to be loaded directly into Pandas, Polars, or Scikit-learn.

How does it handle large files (e.g. 10GB CSV)?

Tidely estimates the file size and routes the job to DuckDB or chunked streaming. The dataset is processed block by block rather than loaded whole, which keeps peak RAM low: the DuckDB runs in the benchmark table peaked at 42 MB or less.
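
To show what block-by-block processing means in practice, here is a minimal file-to-file sketch using only the standard library. It is not Tidely's streaming engine: the function name, the chunk size, and the single example rule (mapping placeholder tokens to empty fields) are assumptions for illustration, and it makes no claim about actual peak memory.

```python
import csv
import tempfile
from itertools import islice
from pathlib import Path

NULL_TOKENS = {"?", "N/A", ""}  # assumed placeholder list


def clean_csv_in_chunks(src, dst, chunk_rows=50_000):
    """Stream src to dst, holding at most chunk_rows data rows in memory."""
    total = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:          # empty input file: nothing to do
            return 0
        writer.writerow(header)     # copy the header row unchanged
        while True:
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                break
            # Example rule: map placeholder tokens to empty (null) fields.
            writer.writerows(
                ["" if cell.strip() in NULL_TOKENS else cell for cell in row]
                for row in chunk
            )
            total += len(chunk)
    return total


# Small demonstration on a temporary file.
workdir = Path(tempfile.mkdtemp())
src, dst = workdir / "dirty.csv", workdir / "clean.csv"
src.write_text("id,salary\n1,50000\n2,N/A\n?,45000\n", encoding="utf-8")
ROWS_WRITTEN = clean_csv_in_chunks(src, dst, chunk_rows=2)
OUTPUT_TEXT = dst.read_text(encoding="utf-8")
print(ROWS_WRITTEN)
print(OUTPUT_TEXT)
```

Rules that need global context, such as group medians or deduplication across the whole file, do not fit this single-pass shape and would need a query engine or a second pass.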

Does it send dataset contents to the cloud?

No. Tidely is completely offline. No data leaves your machine; all type inferences and clean rules execute locally.

Can I undo transformations?

Yes. Call result.undo() to retrieve the original raw DataFrame.

Can I use Tidely inside Airflow or Prefect?

Yes. Tidely runs as a standard Python package, making it easy to drop into any ETL orchestrator task block.

22. Version Roadmap

v1.4.2 (Current Stable): Production hardening release — extensive loader, exporter, and semantic improvements, strict ML safety audits.
v1.4.1: Stability patch — test suite fixes, documentation accuracy, regression tests.
v1.4.0: DuckDB SQL query compiler, out-of-core streaming, resources-aware selection.
v1.3: Native ARFF parser, DNA protection rules, Polars fallback.
v2.0 (Planned): Deep Learning semantic models, timezone alignment.

23. Contributing

Fork the repo and set up development dependencies:
```
pip install -e ".[dev]"
```

Verify code standards and formatting:

python -m ruff check src/
python -m mypy src/

Run the pytest suite (PowerShell syntax shown; on bash or zsh, use PYTHONPATH=src python -m pytest):

$env:PYTHONPATH="src"; python -m pytest

24. License

Tidely is released under the MIT License.

Built with ❤️ by Aaryan Rawat


Release history

Version	Release date
1.4.2 (this version)	Jul 1, 2026
1.4.1	Jun 30, 2026
1.4.0	Jun 30, 2026
1.3.0b2 (pre-release)	Jun 30, 2026
1.3.0b1 (pre-release)	Jun 30, 2026
1.0.0b2 (pre-release)	Jun 29, 2026
1.0.0b1 (pre-release)	Jun 29, 2026
0.3.0	Jun 29, 2026

Download files

Source Distribution

tidely-1.4.2.tar.gz (80.8 kB)

Uploaded Jul 1, 2026 Source

Built Distribution

tidely-1.4.2-py3-none-any.whl (79.1 kB)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file tidely-1.4.2.tar.gz.

File metadata

Download URL: tidely-1.4.2.tar.gz
Upload date: Jul 1, 2026
Size: 80.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tidely-1.4.2.tar.gz
Algorithm	Hash digest
SHA256	`b143b37c561cc3dd08785725f1da56430237795717bb3b22c1453dc2e95ad4b8`
MD5	`ebab428ffe1ff183f092538e1220ef3e`
BLAKE2b-256	`134f683fecf7085e79569225b3e25e7e62643682d875e8672c05fc643918f222`

File details

Details for the file tidely-1.4.2-py3-none-any.whl.

File metadata

Download URL: tidely-1.4.2-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 79.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tidely-1.4.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7aaa1aa0720d5a1677b18b20fffb287b214bb9137b787954b886fc78a1968560`
MD5	`b197ecb6cb787c8240e9c338809b59a6`
BLAKE2b-256	`e5ef6ee76b718aae997fc71ba3a24a161079db1fec4b1ae819b1b29826eb9903`
