The Ultimate Data Cleaning Engine for Python

Tidely

Zero-configuration semantic data cleaning for modern Python workflows.


2. Elevator Pitch

Tidely automatically profiles, cleans, validates, and optimizes tabular datasets with a single line of code, preparing them for downstream analytics and machine learning workflows.


3. Installation & Compatibility

To install Tidely, use pip:

pip install tidely

To upgrade to the latest stable version:

pip install -U tidely

Python Support

Tidely is tested and fully compatible with the following Python versions:

  • Python 3.12
  • Python 3.13
  • Python 3.14

4. Quick Start

Clean and export any tabular dataset with a few lines of Python:

import tidely as td

# Automatically detect, profile, and clean a dataset
result = td.clean("dirty_data.csv")

# Print an explainable summary of all applied fixes
print(result.summary())

# Export the clean dataset
result.export("clean_data.csv")

5. Why Tidely Exists

In modern machine learning and data engineering, cleaning messy datasets remains one of the most time-consuming tasks. Engineers routinely spend hours writing fragile, repetitive scripts to fix missing values, coerce types, remove duplicates, and standardize semantic structures.

These scripts lead to bloated codebases, silent data bugs, and heavy memory overhead. Tidely aims to eliminate this friction. It acts as a deterministic cleaning scheduler that infers column types, corrects data anomalies, and downcasts numeric types, and it is designed to preserve valid information while applying its cleaning rules.


6. Before Tidely vs. After Tidely

Manual Preprocessing Script (45+ Lines of Pandas)

import pandas as pd
import numpy as np

# Load raw file
df = pd.read_csv("dirty_data.csv")

# Clean duplicate records
df = df.drop_duplicates()

# Impute missing values with group medians
df["Salary"] = df.groupby("Department")["Salary"].transform(lambda x: x.fillna(x.median()))

# Clean and pad ZIP codes
df["Zip"] = df["Zip"].astype(str).str.replace(r"\.0$", "", regex=True)
df["Zip"] = df["Zip"].apply(lambda x: x.zfill(5) if x != "nan" else np.nan)

# Standardize email structures (missing values stay missing instead of becoming "nan")
df["Email"] = df["Email"].str.strip().str.lower()

# Clip coordinates
df["Latitude"] = pd.to_numeric(df["Latitude"], errors="coerce")
df["Latitude"] = df["Latitude"].clip(-90.0, 90.0)

# Save
df.to_csv("clean_data.csv", index=False)

The 2-Line Tidely API

import tidely as td
cleaned_df = td.clean("dirty_data.csv").df

7. Real Cleaning Example

Before Cleaning

| id | email | join_date | salary | Latitude | Zip |
|---|---|---|---|---|---|
| 1 | JOHN.DOE@GMAIL.COM | 2026/06/30 | 50000 | 221.5 | 123 |
| 2 | jane.smith@gmail.com | 06-30-2026 | N/A | -45.2 | 00987 |
| ? | invalid_email | 2026-06-30 | 45000 | 92.0 | 8765 |
| 1 | JOHN.DOE@GMAIL.COM | 2026/06/30 | 50000 | 221.5 | 123 |

After Tidely

| id | email | join_date | salary | Latitude | Zip |
|---|---|---|---|---|---|
| 1 | john.doe@gmail.com | 2026-06-30 | 50000 | 90.0 | 00123 |
| 2 | jane.smith@gmail.com | 2026-06-30 | null | -45.2 | 00987 |
| null | null | 2026-06-30 | 45000 | 90.0 | 08765 |

Applied Fixes Breakdown

  • Email: Lowercased, stripped, and standardized; the unparseable value invalid_email mapped to null.
  • Dates: Mixed formats (2026/06/30, 06-30-2026) normalized to ISO 8601 (2026-06-30).
  • Duplicates: Removed exact duplicate rows (row 4 dropped).
  • Missing Values: Placeholders ? and N/A mapped to native null.
  • Outliers: Latitudes clipped to the physical bounds [-90.0, 90.0].
  • ZIP codes: Left-padded to exactly 5 digits.
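
The fixes listed above can be reproduced by hand on the sample rows. The sketch below is not Tidely's implementation: every helper name, the placeholder list, and the set of accepted date formats are assumptions chosen for illustration. It uses only the standard library and keeps values as strings except for the clipped latitude.

```python
from datetime import datetime

# All names below are illustrative assumptions, not Tidely's actual API.
NULL_TOKENS = {"", "?", "n/a", "na", "null", "none"}   # assumed placeholder list
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m-%d-%Y")    # assumed input formats


def to_null(value):
    """Map placeholder strings to None; otherwise return the stripped value."""
    stripped = value.strip()
    return None if stripped.lower() in NULL_TOKENS else stripped


def clean_email(value):
    """Lowercase an address; values without a local@domain.tld shape become None."""
    value = to_null(value)
    if value is None:
        return None
    local, sep, domain = value.lower().partition("@")
    return f"{local}@{domain}" if sep and local and "." in domain else None


def clean_date(value):
    """Try each assumed format in turn and emit ISO 8601 (YYYY-MM-DD)."""
    value = to_null(value)
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clip_latitude(value):
    """Parse a latitude and clip it to the physical range [-90, 90]."""
    value = to_null(value)
    return None if value is None else max(-90.0, min(90.0, float(value)))


def pad_zip(value):
    """Left-pad a ZIP code with zeros to five characters."""
    value = to_null(value)
    return None if value is None else value.zfill(5)


RULES = {"id": to_null, "email": clean_email, "join_date": clean_date,
         "salary": to_null, "Latitude": clip_latitude, "Zip": pad_zip}


def clean_rows(rows):
    """Drop exact duplicate rows, then apply one rule per column."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({col: RULES[col](val) for col, val in row.items()})
    return cleaned


COLUMNS = ("id", "email", "join_date", "salary", "Latitude", "Zip")
RAW = [
    ("1", "JOHN.DOE@GMAIL.COM", "2026/06/30", "50000", "221.5", "123"),
    ("2", "jane.smith@gmail.com", "06-30-2026", "N/A", "-45.2", "00987"),
    ("?", "invalid_email", "2026-06-30", "45000", "92.0", "8765"),
    ("1", "JOHN.DOE@GMAIL.COM", "2026/06/30", "50000", "221.5", "123"),
]
CLEANED = clean_rows([dict(zip(COLUMNS, r)) for r in RAW])
for row in CLEANED:
    print(row)
```

Deduplication runs on the raw rows here, so only byte-identical rows are dropped. Note that a format list like this one resolves ambiguous dates such as 06-07-2026 as month first, which is a policy choice rather than something the data can decide.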

8. Features

Tidely's capabilities are divided into four core categories:

  • Inspection & DNA Profiling: Infers data structure, delimiter, encoding, and formats, and calculates a 5-dimension data quality trust score.
  • Semantic Inference: Automatically maps columns to semantic roles (e.g. Email, DNA Sequence, Currency, Coordinate, ZIP Code, Phone) based on regex and entropy rules.
  • Programmatic Cleaning: Executes out-of-core null conversions, group-by imputation, outlier clipping, and strict primary key deduplication.
  • Memory Optimization: Safely downcasts integer widths and compresses repeating strings to categorical representations, reducing memory footprint by up to 61%.
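
To make the semantic inference idea concrete, here is a minimal sketch of regex-based role detection: a column gets a role when a large enough share of its non-empty values fully match that role's pattern. The patterns, the 0.8 threshold, and the function name are assumptions for illustration, and the entropy rules mentioned above are not modelled.

```python
import re

# Assumed patterns; a real rule set would be broader and more careful.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
    "dna_sequence": re.compile(r"[ACGTN]{10,}"),
}


def infer_role(values, threshold=0.8):
    """Return the role whose pattern matches the largest share of values.

    Returns None when the column is empty or no role reaches `threshold`.
    """
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    best_role, best_ratio = None, 0.0
    for role, pattern in PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        ratio = matches / len(non_empty)
        if ratio > best_ratio:
            best_role, best_ratio = role, ratio
    return best_role if best_ratio >= threshold else None


print(infer_role(["john.doe@gmail.com", "jane.smith@gmail.com", ""]))
print(infer_role(["00123", "00987", "08765"]))
print(infer_role(["hello", "world"]))
```

Using a match ratio rather than requiring every value to match lets a column with a few malformed entries still be recognized, which is what makes later repair of those entries possible.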

9. What Tidely Cleans

Cleaning Task Supported Partial Planned
Missing values imputation
Duplicate rows deduplication
Primary key enforcement
Email formatting standardization
Phone number cleaning
Coordinate limits boundary clipping
ZIP code padding
Biological DNA sequence protection
Currency standardisation
Categorical conversion & downcasting
Empty/unnamed column names
Mixed datatypes coercion
Unicode C0/C1 control character stripping
Out-of-core streaming execution
Deep Learning semantic classification
Time-series timezone alignment

10. Supported Formats

| Format Extension | Reader Engine | Memory Mode | Native Integration |
|---|---|---|---|
| .csv | Polars / DuckDB | Native / Streaming / Lazy | Polars, Pandas, Arrow |
| .parquet | Polars / DuckDB | Native / Streaming / Lazy | Polars, Pandas, Arrow |
| .xlsx / .xls | Calamine | Eager | Polars, Pandas, Arrow |
| .arff | Custom Parser | Eager | Polars, Pandas |
| .json / .ndjson | Polars | Eager | Polars, Pandas |
| .feather / .arrow | PyArrow | Eager | Arrow, Pandas, Polars |

11. How Tidely Works

The flowchart below demonstrates the execution path from raw data input to production-ready output:

graph TD
    A[Raw Dataset Input] --> B[Adapter / Loader]
    B --> C[Inspection Engine: dna, encoding, size]
    C --> D[Semantic Engine: pattern inference]
    D --> E[Decision Engine: clean plan builder]
    E --> F[Hardware Selection: Eager / Lazy / DuckDB / Streaming]
    F --> G[Rule Engine: missing, outliers, formatting]
    G --> H[Cleaning Pipeline Execution]
    H --> I[Validation Guard: zero data loss check]
    I --> J[Clean DataFrame & Trust Score HTML Report]

12. Automatic Backend Selection

Tidely dynamically routes datasets depending on their file size and host system resources to prevent Out-Of-Memory (OOM) crashes:

graph TD
    A[Dataset File Path] --> B[Estimate Size]
    B --> C{Fits in Memory?<br>Size < 50% Free RAM}
    C -->|Yes| D{Size < 10MB?}
    C -->|No| E{Format CSV/Parquet?}
    D -->|Yes| F[Polars Eager Backend]
    D -->|No| G[Polars Lazy Backend]
    E -->|Yes| H[DuckDB Query Engine]
    E -->|No| I[Chunked Streaming Engine]
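
The decision tree above translates almost directly into a small function. This is a sketch of the flowchart, not Tidely's routing code: the function name and the "streaming" label are assumptions (the other three labels follow the backend names in the benchmark table), and "10MB" is read as 10 MiB.

```python
EAGER_LIMIT_BYTES = 10 * 1024 * 1024   # "Size < 10MB?" node, read as 10 MiB
MEMORY_FRACTION = 0.5                  # "Size < 50% Free RAM" node
DUCKDB_FORMATS = {"csv", "parquet"}    # "Format CSV/Parquet?" node


def choose_backend(size_bytes, free_ram_bytes, extension):
    """Pick a backend label by following the flowchart top to bottom."""
    fits_in_memory = size_bytes < MEMORY_FRACTION * free_ram_bytes
    if fits_in_memory:
        if size_bytes < EAGER_LIMIT_BYTES:
            return "polars_eager"
        return "polars_lazy"
    if extension.lower().lstrip(".") in DUCKDB_FORMATS:
        return "duckdb"
    return "streaming"  # hypothetical label for the chunked streaming engine


MIB, GIB = 1024 ** 2, 1024 ** 3
print(choose_backend(5 * MIB, 8 * GIB, ".csv"))
print(choose_backend(50 * MIB, 8 * GIB, ".csv"))
print(choose_backend(10 * GIB, 8 * GIB, ".parquet"))
print(choose_backend(10 * GIB, 8 * GIB, ".xlsx"))
```

In practice the inputs could come from `os.path.getsize(path)` and, if the third-party psutil package is available, `psutil.virtual_memory().available`. On-disk size is only an estimate of in-memory size, especially for compressed formats such as Parquet.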

13. Architecture

Tidely consists of the following core modules:

  • adapter.py: Standardizes input loading and estimates file size before loading to memory.
  • semantic.py: Performs regex and probabilistic pattern matches to identify column semantics.
  • decision_engine.py: Builds the execution plan and selects the backend routing strategy.
  • plan.py: Tracks and prioritizes the list of RepairAction items to perform.
  • rules.py: Vectorized cleaning algorithms (means, medians, modes, Z-score clipping bounds).
  • clean_engine.py: Translates rules into execution steps and compiles plans into SQL.
  • streaming.py: Executes out-of-core file-to-file conversions using DuckDB or batched readers.
  • result.py: Encapsulates the clean DataFrame, audit trails, and HTML report exporting.
  • api.py: Exposes the public td.clean and td.inspect interfaces.
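
The rules.py entry mentions Z-score clipping bounds. One common way to compute such bounds is mean ± z standard deviations, sketched below with the standard library. The function names and the default z are assumptions, and Tidely's own rule may differ in the statistic or threshold it uses.

```python
import statistics


def zscore_bounds(values, z=3.0):
    """Return (low, high) = mean -/+ z population standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean - z * std, mean + z * std


def clip_outliers(values, z=3.0):
    """Clip every value into the Z-score bounds computed from the same data."""
    low, high = zscore_bounds(values, z)
    return [min(max(v, low), high) for v in values]


salaries = [10.0] * 9 + [100.0]
print(zscore_bounds(salaries, z=2.0))
print(clip_outliers(salaries, z=2.0)[-1])
```

Because the mean and standard deviation are computed from data that includes the outliers, a single extreme value widens the bounds. That is why the example uses z=2.0; with z=3.0 the value 100.0 would sit exactly on the upper bound and remain unchanged.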

14. Performance Benchmarks

Environment Specifications

  • CPU: Intel i5-13420H (8 Cores, 12 threads @ 3.4GHz)
  • RAM: 16GB DDR4
  • Storage: NVMe PCIe Gen4 SSD
  • OS: Windows 11 Home
  • Python: 3.14.0a2
  • DuckDB: 1.5.4 | Polars: 1.5.0

Benchmark Execution Metrics

[!NOTE] Quality Score Disclaimer: The quality score is an internal evaluation metric designed to compare cleaning workflows under the same benchmark conditions. It is not an industry-standard benchmark.

| Dataset | Size | Backend | Duration | Peak RAM | Speed (Rows/sec) | Health Diff |
|---|---|---|---|---|---|---|
| 311_ServiceRequest | 0.40 MB | polars_eager | 0.13s | 14 MB | 104 | 91% ➔ 93% |
| Allegations-of-Harassment | 0.02 MB | polars_eager | 0.08s | 8 MB | 712 | 91% ➔ 92% |
| credits.csv | 3.64 MB | polars_eager | 1.16s | 34 MB | 67,243 | 93% ➔ 95% |
| Crunchy Corner Budget | 10.04 MB | polars_lazy | 7.03s | 68 MB | 6,958 | 81% ➔ 98% |
| customers-2000000.csv | 333.24 MB | duckdb | 3.66s | 42 MB | 546,597 | 90% ➔ 93% |
| dataset_31_credit-g | 0.15 MB | polars_eager | 0.52s | 11 MB | 1,886 | 84% ➔ 90% |
| Parking_Meters | 2.41 MB | polars_eager | 0.09s | 16 MB | 365,213 | 92% ➔ 96% |
| Uncleaned-data.txt | 58.01 MB | polars_lazy | 7.83s | 118 MB | 43,817 | 88% ➔ 98% |
| y_amazon-google-large.csv | 110.07 MB | duckdb | 1.13s | 28 MB | 2,705,934 | 96% ➔ 96% |

15. Technical Validation Campaign

To guarantee production safety, Tidely v1.4.2 was audited against a rigorous technical validation suite:

  • Fuzz & Edge-Case Testing: Validated against corrupted encodings, duplicate headers, missing headers, scientific notation, and timezone anomalies.
  • System Testing: 100% test coverage verified across all Campaign datasets, including large stress tests up to 10,000,000 rows.
  • Code Audits: Checked for type safety (strict mypy compliance) and formatting style rules (Ruff check).
  • Validation Outcome: All 59 automated tests passed against Python 3.14 with 0 mypy issues and 0 Ruff violations.

16. Ecosystem Comparison

Tidely complements, rather than replaces, existing data quality and processing packages:

| Dimension | Tidely | Pandera | Great Expectations | Pandas / Polars |
|---|---|---|---|---|
| Auto-Cleaning / Repair | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Semantic Inference | ✅ Yes | ❌ No | ❌ No | ❌ No |
| One-line Cleaning API | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Streaming / Out-of-Core | ✅ Yes | ❌ No | ❌ No | ✅ (Polars/Lazy) |
| Validation Schemas | ❌ No | ✅ Yes | ✅ Yes | ❌ No |
| Interactive HTML Reports | ✅ Yes | ❌ No | ✅ Yes | ❌ No |

17. Cleaning Workflow Comparison

Comparison of manual script maintenance against Tidely's automated cleaner on y_amazon-google-large.csv:

| Dimension | Manual Pandas | Manual Polars | DuckDB SQL Script | Tidely |
|---|---|---|---|---|
| Lines of Code | 45 lines | 35 lines | 50 lines | 2 lines |
| Automatic Routing | ❌ No | ❌ No | ❌ No | ✅ Yes |
| Out-of-core Streaming | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
| Semantic Inference | ❌ No | ❌ No | ❌ No | ✅ Yes |
| Deduplication Guard | Manual | Manual | Manual | ✅ Automatic |
| Outlier Boundary Clip | Manual | Manual | Manual | ✅ Automatic |
| Imputation Strategy | Manual | Manual | Manual | ✅ Automatic |
| Interactive Reports | ❌ No | ❌ No | ❌ No | ✅ Yes |

18. Technical Validation Report

A summary of findings from our campaign audits:

  • Data Integrity: No unintended data corruption was observed across the evaluated datasets.
  • Strengths: Zero-configuration loading, memory-reducing downcasting, and low-memory file-to-file COPY execution.
  • Limitations: Excel files larger than 100MB cannot currently be streamed out-of-core due to engine constraints and must be loaded in memory.
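
The downcasting strength noted above rests on a simple check: find the narrowest integer type whose range covers the column's minimum and maximum. A minimal sketch follows; the function name and dtype labels are assumptions, and it handles signed integers only.

```python
# Signed integer ranges, narrowest first.
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the narrowest signed dtype name that holds every non-null value.

    Returns None for an all-null column and raises OverflowError when even
    int64 is too narrow.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    low, high = min(present), max(present)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= low and high <= dtype_max:
            return name
    raise OverflowError("values do not fit in a 64-bit signed integer")


print(smallest_int_dtype([0, 100, -5]))
print(smallest_int_dtype([0, 200, None]))
print(smallest_int_dtype([50_000]))
```

Checking both ends of the range is what makes the downcast safe: a column that fits today can still overflow if later data exceeds the chosen width, so a streaming pipeline would need to compute the range over the full dataset before narrowing.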

19. Reports & CLI Outputs

Tidely generates rich visual interfaces:

Interactive HTML Quality Report

Exports a multi-tab dashboard displaying column diagnostic metrics, applied transformations, cleaned preview grids, and engine execution details:

[Tab: Column Diagnostics]
├─ id: (90% trust, inferred Key)
├─ email: (85% trust, inferred Email, standardisation applied)
└─ salary: (95% trust, inferred Number, group-by median imputed)

[Tab: Applied Transformations]
├─ Duplicate Rows: Dropped 5 duplicate rows.
└─ Coordinate Normalization: Clipped 12 outliers in 'Latitude'.

CLI Terminal Output

$ tidely inspect dataset.csv

SPOTLESS INSPECTION SUMMARY
==========================================
Overall Health Trust Score: 89%
Total Columns: 6 | Total Rows: 10,000
Selected Engine: polars_eager (Low latency)
==========================================
id         Inferred ID/Key (High Confidence)
email      Inferred Email  (12 formatting issues)
latitude   Inferred Lat    (2 outlier values)

20. API Usage

Python API

import tidely as td

# Inspect dataset metrics
profile = td.inspect("data.csv")
profile.show()

# Clean dataset
result = td.clean("data.csv")
df = result.df

# Print transformations summary
print(result.summary())

# Revert repairs and retrieve original dataset
original_df = result.undo()

# Export clean dataset & HTML report
result.export("clean.csv")
result.export("report.html")

Command Line Interface

# Clean a CSV file and save output
tidely clean input.csv --out clean.csv

# Inspect dataset structure and trust score
tidely inspect input.csv

# Generate HTML quality report
tidely report input.csv --out report.html

21. FAQ

Does Tidely replace Pandas or Polars?

No. Tidely is a data preparation layer. It automatically sanitizes datasets and returns standard dataframes to be loaded directly into Pandas, Polars, or Scikit-learn.

How does it handle large files (e.g. 10GB CSV)?

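Tidely estimates the file size and routes the job to DuckDB or chunked streaming. The dataset is processed block by block rather than loaded whole, with peak RAM usage intended to stay under 45 MB (the DuckDB runs in the benchmarks above peaked at 28 to 42 MB).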
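
Block-by-block processing can be pictured with a plain standard-library loop that reads a fixed number of rows, cleans them, writes them out, and moves on. This is a sketch under assumptions, not Tidely's streaming engine: the function names, the batch size, and the whitespace-stripping rule are placeholders.

```python
import csv
import io
from itertools import islice


def iter_batches(reader, batch_size):
    """Yield lists of at most `batch_size` rows until the reader is exhausted."""
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def clean_stream(src, dst, batch_size=50_000):
    """Copy CSV rows from `src` to `dst` one batch at a time; return the row count."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    total = 0
    for batch in iter_batches(reader, batch_size):
        # Placeholder rule: strip surrounding whitespace from every string cell.
        cleaned = [
            {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
            for row in batch
        ]
        writer.writerows(cleaned)
        total += len(batch)
    return total


source = io.StringIO("id,email\n1, john.doe@gmail.com \n2,jane.smith@gmail.com\n")
target = io.StringIO()
print(clean_stream(source, target, batch_size=1))
print(target.getvalue())
```

Only one batch is held in memory at a time, so peak usage depends on the batch size rather than the file size. Rules that need a global view, such as exact deduplication or medians, require extra state carried across batches or a separate pass.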

Does it send dataset contents to the cloud?

No. Tidely is completely offline. No data leaves your machine; all type inferences and clean rules execute locally.

Can I undo transformations?

Yes. Call result.undo() to retrieve the original raw DataFrame.

Can I use Tidely inside Airflow or Prefect?

Yes. Tidely runs as a standard Python package, making it easy to drop into any ETL orchestrator task block.


22. Version Roadmap

  • v1.4.2 (Current Stable): Production hardening release with extensive loader, exporter, and semantic improvements, plus strict ML safety audits.
  • v1.4.1: Stability patch with test suite fixes, documentation accuracy, and regression tests.
  • v1.4.0: DuckDB SQL query compiler, out-of-core streaming, resources-aware selection.
  • v1.3: Native ARFF parser, DNA protection rules, Polars fallback.
  • v2.0 (Planned): Deep Learning semantic models, timezone alignment.

23. Contributing

  1. Fork the repo and set up development dependencies:
    pip install -e ".[dev]"
    
  2. Verify code standards and formatting:
    python -m ruff check src/
    python -m mypy src/
    
  3. Run the pytest suite (PowerShell syntax shown; on bash or zsh use PYTHONPATH=src python -m pytest):
    $env:PYTHONPATH="src"; python -m pytest
    

24. License

Tidely is released under the MIT License.


Built with ❤️ by Aaryan Rawat
