
The Ultimate Data Cleaning Engine for Python


Tidely: The Operating System for Data Quality

Tidely is a production-grade Python package that acts as "The Operating System for Data Quality." Instead of introducing custom data wrappers, Tidely integrates seamlessly into existing pipelines by accepting and returning standard Pandas DataFrames, Polars DataFrames/LazyFrames, and PyArrow Tables.

Tidely is built around two primitives:

  1. td.inspect(df): Generates a Dataset Intelligence Report covering Trust Scores, DNA signatures, and column semantics.
  2. td.clean(df): Builds and executes an explainable, deterministic cleaning plan that handles missing data, duplicate rows, memory bloat, and semantically noisy strings (dates, emails, phone numbers).

🚀 Key Features

  1. Zero Friction API: Call td.inspect() or td.clean() on any Pandas DataFrame, Polars DataFrame or LazyFrame, or PyArrow Table.
  2. Lighthouse Dataset Trust Scores: Computes multi-dimensional quality scores (Overall, Reliability, ML Readiness, Memory Efficiency, Schema Stability, and Semantic Quality).
  3. Deep Semantic Engine: Uses heuristic regexes and checksum algorithms (Luhn for credit cards, Verhoeff for Aadhaar) to validate PAN, GSTIN, IP addresses, emails, phone numbers, and currencies.
  4. Explainable Cleaning: Automatically converts types, normalizes PII formats, imputes missing values, and drops exact duplicates, reporting what changed, why it changed, and how much it raised the Trust Score. By default, Tidely does not forward-fill missing values, because forward-filling can fabricate values in cross-sectional data where row order carries no meaning; it uses constant or mode imputation instead.
  5. Streaming Native: Built on Polars, td.clean() supports collect(streaming=True) for datasets larger than memory.
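
The imputation default in item 4 can be illustrated without Tidely itself. The following is a minimal sketch in plain Python, not Tidely's implementation; impute_mode and forward_fill are hypothetical helper names, and the "unknown" fallback is an assumed constant. It contrasts the two strategies on a column where row order carries no meaning.

```python
from collections import Counter

def impute_mode(values, fallback="unknown"):
    """Replace None with the most frequent observed value.

    Hypothetical helper for illustration only. Falls back to a constant
    when the column has no observed values at all.
    """
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0] if observed else fallback
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the previous observed value (shown for contrast)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# A cross-sectional column: neighbouring rows are unrelated customers.
countries = ["IN", "US", None, "IN", None]

print(impute_mode(countries))   # fills both gaps with the column mode
print(forward_fill(countries))  # third row inherits "US" only because of row order
```

Mode imputation gives the same fill value however the rows are sorted, whereas the forward-filled result changes if the rows are reordered.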

📦 Installation

To install Tidely in your project, use pip or uv:

pip install tidely

or

uv add tidely

(Note: On Windows systems, Tidely automatically includes the tzdata package to support timezone-aware datetime validation.)


⚡ Quick Start

1. Dataset Inspection

import tidely as td
import polars as pl

# 1. Load your standard dataframe
df = pl.read_csv("messy_sales.csv")

# 2. Inspect the dataset
profile = td.inspect(df)

# Retrieve metrics programmatically
print(f"Overall Trust Score: {profile.trust_score.overall}/100")
print(f"ML Readiness: {profile.trust_score.ml_readiness}/100")

# 3. Display the stunning visual report in your terminal
profile.show()

2. Explainable Cleaning

import tidely as td
import polars as pl

df = pl.read_csv("messy_sales.csv")

# Generate the plan, show it in the terminal, and execute it
clean_df = td.clean(df)

# Alternatively, step through it manually:
plan = td.plan(df)
plan.show()

# Dry run to see exactly what rows will be affected before mutating
plan.execute(dry_run=True)

# Execute
clean_df = plan.execute()

3. Command Line Interface (CLI)

Tidely exposes a Typer-based CLI for instant dataset diagnostics directly from your terminal:

# Get a stunning visual diagnostic report
tidely inspect --input messy_sales.csv



🛠️ Benchmarks

Tidely is designed for speed. Our benchmarking suite compares it against PyJanitor, Pandera, ydata-profiling, and Great Expectations.

100,000 Rows (19 MB DataFrame)

Tool                 Time (s)   Peak Memory (MB)
Tidely               1.02       113.79
Pandera              1.18       14.38
PyJanitor            N/A*       N/A*
Great Expectations   N/A*       N/A*
ydata-profiling      N/A*       N/A*

*N/A: As of Pandas 2.x/3.x, pyjanitor and ydata-profiling have internal breaking changes that cause crashes, and Great Expectations 1.0+ has removed its standard from_pandas API, so these tools could not be measured.

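Despite running a detailed heuristic semantic analysis and executing data transformations, Tidely took about 14% less time in this run than Pandera, a pure schema-validation library (1.02 s versus 1.18 s). The trade-off is peak memory: 113.79 MB for Tidely versus 14.38 MB for Pandera, roughly 8x higher.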
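
For readers who want to take comparable measurements on their own data, here is a minimal sketch using only the standard library. It is not the project's benchmarking suite; measure is a hypothetical helper, and the list-building workload is a stand-in for the call you want to time.

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_mib) for a single call to fn.

    Hypothetical helper for illustration only.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)

# Stand-in workload; replace with the call you want to measure.
result, seconds, peak_mib = measure(lambda n: [i * i for i in range(n)], 100_000)
print(f"{seconds:.3f}s, peak {peak_mib:.2f} MiB")
```

Two caveats apply. tracemalloc only sees allocations made through Python's memory allocator, so buffers allocated by native code (for example inside Polars or PyArrow) are not counted, and a process-level metric such as resident set size is needed for a fair comparison. Tracing also slows execution, so time and memory are better measured in separate runs.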

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details on how to set up your development environment, run tests, and submit pull requests.


📚 API Reference

tidely.inspect(df: Any) -> DatasetProfile

Generates a comprehensive diagnostic profile.

  • df: The input data (Pandas DataFrame, Polars DataFrame/LazyFrame, PyArrow Table).
  • Returns: A DatasetProfile object. Call .show() to render it in the terminal.
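
Because the Quick Start shows the score exposed as profile.trust_score.overall, a profile can be used as a quality gate in a pipeline. The sketch below assumes only that attribute; meets_threshold is a hypothetical helper, the threshold of 80 is an arbitrary choice, and a SimpleNamespace stands in for a real DatasetProfile so the example runs without Tidely installed.

```python
from types import SimpleNamespace

def meets_threshold(profile, minimum_overall=80):
    """True when profile.trust_score.overall is at or above the threshold.

    Hypothetical helper; the default of 80 is an arbitrary illustration.
    """
    return profile.trust_score.overall >= minimum_overall

# Stand-in object; in practice `profile` would come from td.inspect(df).
fake_profile = SimpleNamespace(
    trust_score=SimpleNamespace(overall=72, ml_readiness=65)
)

print(meets_threshold(fake_profile))                      # False: 72 < 80
print(meets_threshold(fake_profile, minimum_overall=70))  # True: 72 >= 70
```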

tidely.plan(df: Any) -> RepairPlan

Generates a deterministic cleaning plan without mutating the data.

  • df: The input data.
  • Returns: A RepairPlan object. Call .show() to view the plan, and .execute() to run the transformations.
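
The plan-then-execute pattern can be sketched with a toy model. This is not Tidely's RepairPlan: ToyPlan, Step, and drop_exact_duplicates are hypothetical names, rows are plain dictionaries rather than a DataFrame, and the shape of the dry-run report is an assumption. The point is that steps are described first, can be previewed, and are applied to a copy rather than to the caller's data.

```python
from dataclasses import dataclass, field
from typing import Callable

Rows = list[dict]

@dataclass
class Step:
    description: str
    apply: Callable[[Rows], Rows]

@dataclass
class ToyPlan:
    rows: Rows
    steps: list[Step] = field(default_factory=list)

    def execute(self, dry_run: bool = False):
        """Run every step on a copy of the rows.

        With dry_run=True, return (description, rows_before, rows_after)
        tuples instead of the cleaned rows.
        """
        current = [dict(r) for r in self.rows]  # never touch the caller's rows
        report = []
        for step in self.steps:
            updated = step.apply(current)
            report.append((step.description, len(current), len(updated)))
            current = updated
        return report if dry_run else current

def drop_exact_duplicates(rows: Rows) -> Rows:
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

data = [{"id": 1, "city": "Pune"}, {"id": 1, "city": "Pune"}, {"id": 2, "city": None}]
plan = ToyPlan(data, [Step("drop exact duplicate rows", drop_exact_duplicates)])

print(plan.execute(dry_run=True))  # preview: one step, 3 rows in, 2 rows out
print(plan.execute())              # cleaned copy; `data` still has 3 rows
```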

tidely.clean(df: Any) -> pl.DataFrame

Automatically plans and executes all recommended data cleaning transformations.

  • df: The input data.
  • Returns: A cleaned Polars DataFrame.

❓ FAQ

Q: Does Tidely overwrite my original data?
No. Tidely always returns a new, sanitized DataFrame. It never mutates your data in place.

Q: Why does Tidely use Polars internally?
Polars is written in Rust, uses lazy execution graphs, and is multi-threaded. This lets Tidely inspect and clean datasets much faster than it could with native Pandas.

Q: Can I run this on huge datasets?
Yes. Pass a Polars LazyFrame to tidely.clean() and it will use streaming execution, collect(streaming=True), to process the data out of core where the query allows it.

Q: How does it know a column is a GSTIN or PAN?
Tidely uses a semantic engine that combines specialized regex heuristics with checksum algorithms (such as Luhn and Verhoeff) to deterministically validate PII and financial tokens.
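
For reference, the two checksum algorithms named above are small enough to show in full. This is a standalone sketch of the standard Luhn and Verhoeff validations, not Tidely's code; luhn_valid and verhoeff_valid are hypothetical function names. Passing a checksum shows only that the digits are internally consistent. It does not prove the number was ever issued, and the sketch does not check length or prefix rules.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# Verhoeff tables: multiplication in the dihedral group D5 and the
# position-dependent permutation.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

def verhoeff_valid(number: str) -> bool:
    """Verhoeff check: fold digits right to left through the tables; valid if 0."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    checksum = 0
    for index, digit in enumerate(reversed(digits)):
        checksum = _D[checksum][_P[index % 8][digit]]
    return checksum == 0

print(luhn_valid("4111 1111 1111 1111"))  # True: a widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # False: last digit altered
print(verhoeff_valid("2363"))             # True: 236 with check digit 3
print(verhoeff_valid("2364"))             # False: check digit altered
```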
