Cleanalytix is a modular Python library for profiling, scoring, and cleaning tabular datasets.

Cleanalytix

Cleanalytix is a Python library for profiling, scoring, cleaning, and monitoring the quality of tabular datasets with a single pipeline.

It is designed for pandas-first workflows and supports:

  • baseline dataset profiling and scoring
  • optional cleaning recommendations and automatic cleaning
  • optional production/new-dataset monitoring
  • optional business rules, thresholds, weights, and type inference for new data

Installation

From a source checkout:

git clone https://github.com/Probot-DATA/Cleanalytix_Repo
cd Cleanalytix_Repo
pip install -e ".[dev]"

The package is published on PyPI, so it can also be installed with:

pip install cleanalytix

Runtime requirements:

  • Python 3.9+
  • pandas
  • numpy
  • scikit-learn
  • nltk

Quick Start

import pandas as pd
from cleanalytix import Run_DQ_Pipeline

df = pd.read_csv("my_data.csv")

result = Run_DQ_Pipeline(
    dataset_names=["my_dataset"],
    dataset_list=[df],
)

print(result["base_data"]["dirty_scores"])
print(result["base_data"]["meta_before_cleaning"])
print(result["base_data"]["recommendations"])

Production / Monitoring Example

import pandas as pd
from cleanalytix import Run_DQ_Pipeline

train_df = pd.read_csv("train.csv")
prod_df = pd.read_csv("production.csv")

rules = {
    "age": lambda value: pd.isna(value) or 0 <= float(value) <= 120,
}

result = Run_DQ_Pipeline(
    dataset_names=["customers"],
    dataset_list=[train_df],
    new_dataset_list=[prod_df],
    rules=rules,
    cleaning=True,
    interactive=False,
    score_mode="exponential",
)

print(result["base_data"]["dirty_scores"])
print(result["base_data"]["cleaned_scores"])
print(result["prod_data"]["dirty_scores"])
print(result["prod_data"]["change_log"])
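
The `age` rule above calls `float(value)` directly, so a non-numeric entry such as "unknown" would raise `ValueError` inside the rule itself. How Cleanalytix handles an exception raised by a rule is not documented here, so the following is only a sketch of a more defensive rule, written with the standard library alone. The names `is_missing` and `age_rule` are hypothetical, and `is_missing` is a simplified stand-in for `pd.isna` on scalar values.

```python
import math


def is_missing(value):
    """Hypothetical stand-in for pd.isna on scalars: None or float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def age_rule(value):
    """Return True when the value is missing or a number in [0, 120]."""
    if is_missing(value):
        return True
    try:
        return 0 <= float(value) <= 120
    except (TypeError, ValueError):
        # Non-numeric input is treated as a rule violation instead of crashing.
        return False


# Try the rule on a few representative values before wiring it into a pipeline.
samples = [34, "57", None, float("nan"), -3, 150, "unknown"]
violations = [v for v in samples if not age_rule(v)]
print(violations)
```

A rule written this way could be passed as `rules={"age": age_rule}`, assuming the pipeline accepts any callable that takes a single value and returns a boolean, as the lambda above suggests.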

Public API

The primary entrypoint is:

from cleanalytix import Run_DQ_Pipeline

Additional building blocks are also exported:

from cleanalytix import (
    Compute_DQ_Score,
    DEFAULT_THRESHOLDS,
    generate_meta,
    cleaning_recommendations,
    get_cleaned_data,
    get_table_for_DQ_computation,
    summarize_dataset_health,
    learn_reference_profile,
    adjust_prod_meta_with_reference,
    infer_and_fix_types,
)

Pipeline Output Structure

Run_DQ_Pipeline returns:

{
    "base_data": {...},
    "prod_data": {...},
}

Both blocks expose the same keys:

  • dirty_scores
  • cleaned_scores
  • cleaned_datasets
  • meta_before_cleaning
  • meta_after_cleaning
  • recommendations
  • change_log
  • summarized_before
  • summarized_after
  • main_metrics_before
  • main_metrics_after
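
Because both blocks are documented to share one key set, downstream code can check a result before reading from it. The sketch below does this against a hand-built stand-in dictionary, not a real pipeline result. The helper name `missing_keys` is hypothetical, and treating an absent or empty block as "all keys missing" is an assumption, since the README does not say what `prod_data` holds when no new dataset is supplied.

```python
EXPECTED_BLOCKS = ("base_data", "prod_data")
EXPECTED_KEYS = (
    "dirty_scores",
    "cleaned_scores",
    "cleaned_datasets",
    "meta_before_cleaning",
    "meta_after_cleaning",
    "recommendations",
    "change_log",
    "summarized_before",
    "summarized_after",
    "main_metrics_before",
    "main_metrics_after",
)


def missing_keys(result):
    """Map each block name to the documented keys that are absent from it."""
    report = {}
    for block in EXPECTED_BLOCKS:
        present = result.get(block) or {}
        absent = [key for key in EXPECTED_KEYS if key not in present]
        if absent:
            report[block] = absent
    return report


# Stand-in result: prod_data deliberately lacks "change_log".
mock_result = {
    "base_data": {key: None for key in EXPECTED_KEYS},
    "prod_data": {key: None for key in EXPECTED_KEYS if key != "change_log"},
}
print(missing_keys(mock_result))
```

An empty report means every documented key is present in both blocks.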

Examples

Runnable examples live in the examples/ directory:

python examples/simple_usage.py
python examples/production_usage.py

These examples assume the package has already been installed in the active environment.

Validation

The validation folder contains a portable real-world validation workflow.

  • Large raw datasets are intentionally not committed to the repository.
  • Put the expected files under validation/datasets/ by following validation/datasets/README.md.
  • Run the validation script:
    python validation/run_validation.py
  • The script saves non-empty outputs to validation/outputs/<dataset_name>/.
  • The notebook validation/main.ipynb uses the same relative-path workflow.

Repository Layout

  • cleanalytix/ - installable library package
  • examples/ - small runnable examples
  • tests/ - smoke tests and lightweight sample fixtures
  • validation/ - public-friendly validation workflow and output folder
  • archive/legacy/ - historical prototype notebook/code kept for reference, not for active use

Known Limitations

  • Validation datasets are not bundled with the repository.
  • The yellow taxi validation workflow samples the first 20,000 rows from each configured monthly file to match the original project workflow and to keep validation practical.
  • Interactive cleaning is intended for notebook/CLI use and will prompt for input when interactive=True.
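
The row sampling mentioned for the yellow taxi workflow amounts to reading only the first N rows of each file. The sketch below illustrates that idea with the standard library on a small in-memory CSV. It is not the code in validation/run_validation.py; the function name `read_first_rows`, the column names, and the CSV format are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def read_first_rows(handle, limit):
    """Read at most `limit` data rows from an open CSV file handle."""
    reader = csv.DictReader(handle)
    # islice stops after `limit` rows, so the rest of the file is never parsed.
    return list(islice(reader, limit))


# Tiny in-memory file standing in for one monthly extract.
sample_file = io.StringIO("trip_id,fare\n1,12.5\n2,8.0\n3,15.25\n")
rows = read_first_rows(sample_file, limit=2)
print(rows)
```

With pandas, `pd.read_csv(path, nrows=20000)` has the same effect for CSV input.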

Contributing

See CONTRIBUTING.md.

License

MIT (c) 2026 Probot-DATA contributors
