Cleanalytix is a modular Python library for profiling, scoring, and cleaning tabular datasets.

Cleanalytix

Cleanalytix is a Python library for profiling, scoring, cleaning, and monitoring the quality of tabular datasets with a single pipeline.

It is designed for pandas-first workflows and supports:

baseline dataset profiling and scoring
optional cleaning recommendations and automatic cleaning
optional production/new-dataset monitoring
optional business rules, thresholds, weights, and type inference for new data

Installation

From a source checkout:

git clone https://github.com/Probot-DATA/Cleanalytix_Repo
cd Cleanalytix_Repo
pip install -e ".[dev]"

Version 0.1.0 is also published on PyPI and can be installed with:

pip install cleanalytix

Runtime requirements:

Python 3.9+
pandas
numpy
scikit-learn
nltk
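The requirements above can be checked before running anything. The following is a minimal sketch, not part of Cleanalytix: the helper names missing_modules and python_ok are hypothetical. Note that scikit-learn is imported as sklearn, so that is the name to look up.

```python
import importlib.util
import sys

# Import names for the runtime requirements listed above.
# scikit-learn is distributed as "scikit-learn" but imported as "sklearn".
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "nltk"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(version_info=sys.version_info):
    """True when the interpreter is Python 3.9 or newer (hypothetical helper)."""
    return tuple(version_info[:2]) >= (3, 9)


print("Python 3.9+:", python_ok())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty "Missing modules" list only means the packages are importable; it does not check their versions.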

Quick Start

import pandas as pd
from cleanalytix import Run_DQ_Pipeline

df = pd.read_csv("my_data.csv")

result = Run_DQ_Pipeline(
    dataset_names=["my_dataset"],
    dataset_list=[df],
)

print(result["base_data"]["dirty_scores"])
print(result["base_data"]["meta_before_cleaning"])
print(result["base_data"]["recommendations"])

Production / Monitoring Example

import pandas as pd
from cleanalytix import Run_DQ_Pipeline

train_df = pd.read_csv("train.csv")
prod_df = pd.read_csv("production.csv")

rules = {
    "age": lambda value: pd.isna(value) or 0 <= float(value) <= 120,
}

result = Run_DQ_Pipeline(
    dataset_names=["customers"],
    dataset_list=[train_df],
    new_dataset_list=[prod_df],
    rules=rules,
    cleaning=True,
    interactive=False,
    score_mode="exponential",
)

print(result["base_data"]["dirty_scores"])
print(result["base_data"]["cleaned_scores"])
print(result["prod_data"]["dirty_scores"])
print(result["prod_data"]["change_log"])
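It can help to try a rule on a small column before passing it to the pipeline. The sketch below uses only pandas and makes no claim about how Cleanalytix evaluates rules internally; the violations helper and the sample values are hypothetical. It also shows that the age rule above raises ValueError on non-numeric strings, so the helper treats an exception as a failed check.

```python
import pandas as pd


def age_rule(value):
    """Same check as the lambda above: missing values pass, others must be in [0, 120]."""
    return pd.isna(value) or 0 <= float(value) <= 120


def violations(series, rule):
    """Return the values for which `rule` is falsy or raises (hypothetical helper)."""
    def passes(value):
        try:
            return bool(rule(value))
        except (TypeError, ValueError):
            # float("abc") raises ValueError; count it as a violation here.
            return False

    mask = series.map(passes).astype(bool)
    return series[~mask]


# Hypothetical sample column mixing valid, missing, out-of-range, and malformed values.
ages = pd.Series([34, None, 150, "abc", "42"], dtype="object")
bad = violations(ages, age_rule)
print(bad.tolist())
```

Whether a malformed value should count as a violation or be handled by type inference first is a design choice; this sketch simply flags it.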

Public API

The primary entrypoint is:

from cleanalytix import Run_DQ_Pipeline

Additional building blocks are also exported:

from cleanalytix import (
    Compute_DQ_Score,
    DEFAULT_THRESHOLDS,
    generate_meta,
    cleaning_recommendations,
    get_cleaned_data,
    get_table_for_DQ_computation,
    summarize_dataset_health,
    learn_reference_profile,
    adjust_prod_meta_with_reference,
    infer_and_fix_types,
)

Pipeline Output Structure

Run_DQ_Pipeline returns:

{
    "base_data": {...},
    "prod_data": {...},
}

Both blocks use the same keys:

dirty_scores
cleaned_scores
cleaned_datasets
meta_before_cleaning
meta_after_cleaning
recommendations
change_log
summarized_before
summarized_after
main_metrics_before
main_metrics_after
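Because both blocks share the keys listed above, downstream code can check a result defensively before reading from it. This is a minimal sketch using a hand-built stand-in dictionary rather than a real pipeline result; the missing_keys helper is hypothetical, and the types of the values under each key are not specified here.

```python
# Keys documented above for both "base_data" and "prod_data".
EXPECTED_KEYS = [
    "dirty_scores",
    "cleaned_scores",
    "cleaned_datasets",
    "meta_before_cleaning",
    "meta_after_cleaning",
    "recommendations",
    "change_log",
    "summarized_before",
    "summarized_after",
    "main_metrics_before",
    "main_metrics_after",
]


def missing_keys(result):
    """Map each block name to the documented keys it lacks (hypothetical helper)."""
    report = {}
    for block in ("base_data", "prod_data"):
        present = result.get(block) or {}
        report[block] = [key for key in EXPECTED_KEYS if key not in present]
    return report


# Stand-in for a pipeline result; a real one would come from Run_DQ_Pipeline.
fake_result = {
    "base_data": dict.fromkeys(EXPECTED_KEYS),
    "prod_data": {"dirty_scores": None},
}
report = missing_keys(fake_result)
print(report["base_data"])
print(len(report["prod_data"]))
```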

Examples

Runnable examples live in examples/:

python examples/simple_usage.py
python examples/production_usage.py

These examples assume the package has already been installed in the active environment.

Validation

The validation folder contains a portable real-world validation workflow.

Large raw datasets are intentionally not committed to the repository.
Put the expected files under validation/datasets/ by following validation/datasets/README.md.
Run the validation script:

python validation/run_validation.py

The script saves non-empty outputs to validation/outputs/<dataset_name>/.
The notebook validation/main.ipynb uses the same relative-path workflow.

Repository Layout

cleanalytix/ - installable library package
examples/ - small runnable examples
tests/ - smoke tests and lightweight sample fixtures
validation/ - public-friendly validation workflow and output folder
archive/legacy/ - historical prototype notebook/code kept for reference, not for active use

Known Limitations

Validation datasets are not bundled with the repository.
The yellow taxi validation workflow samples the first 20,000 rows from each configured monthly file to match the original project workflow and to keep validation practical.
Interactive cleaning is intended for notebook/CLI use and will prompt for input when interactive=True.
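The first-N-rows sampling described above can be reproduced with plain pandas. This is a sketch under assumptions: it assumes CSV input (the format of the validation files is not stated here), the load_sample helper is hypothetical, and the column names in the demo data are made up.

```python
import io

import pandas as pd

SAMPLE_ROWS = 20_000  # row cap mentioned in the limitation above


def load_sample(source, n_rows=SAMPLE_ROWS):
    """Read only the first `n_rows` data rows of a CSV (hypothetical helper)."""
    return pd.read_csv(source, nrows=n_rows)


# Tiny in-memory CSV standing in for one monthly file.
csv_text = "trip_id,fare\n" + "\n".join(f"{i},{i * 1.5}" for i in range(5))
sample = load_sample(io.StringIO(csv_text), n_rows=3)
print(len(sample))
```

Taking the first rows is fast and reproducible, but it is not a random sample, so any ordering in the file (for example by pickup time) carries over into the scores.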

Contributing

See CONTRIBUTING.md.

License

Release history

0.1.0 (May 22, 2026)

Download files

Source Distribution

cleanalytix-0.1.0.tar.gz (23.3 kB)

Uploaded May 22, 2026 Source

Built Distribution

cleanalytix-0.1.0-py3-none-any.whl (23.5 kB)

Uploaded May 22, 2026 Python 3

File details

Details for the file cleanalytix-0.1.0.tar.gz.

File metadata

Download URL: cleanalytix-0.1.0.tar.gz
Upload date: May 22, 2026
Size: 23.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for cleanalytix-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8a9af22d8343dd02562dbe51935c1536e5302f71869de8e4a71cc9518de39955`
MD5	`cea4911e2a5f4942015701be9661cac0`
BLAKE2b-256	`0839fe0584ee003bdad904020c49c1fd167fc0966dfd887047a487a4c02472ad`

File details

Details for the file cleanalytix-0.1.0-py3-none-any.whl.

File metadata

Download URL: cleanalytix-0.1.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 23.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for cleanalytix-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`de0929b9d0fb3d675cd639125460d3c3110976d0e84575d5da2e3f1078f2107b`
MD5	`d9b45acfae7dd78cd0870fa116284b85`
BLAKE2b-256	`e1f2a7f1cfb6d78f3bcd2ddca808068d4254e4d38cc87953c7e9d6006f68c909`
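A downloaded file can be compared against the SHA256 digests listed above using only the standard library. This is a minimal sketch: the helper names sha256_of and matches_published_hash are hypothetical, and the local path is whatever location the file was saved to.

```python
import hashlib

# SHA256 digests copied from the tables above.
PUBLISHED_SHA256 = {
    "cleanalytix-0.1.0.tar.gz":
        "8a9af22d8343dd02562dbe51935c1536e5302f71869de8e4a71cc9518de39955",
    "cleanalytix-0.1.0-py3-none-any.whl":
        "de0929b9d0fb3d675cd639125460d3c3110976d0e84575d5da2e3f1078f2107b",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """True when the file at `path` has the digest listed for `filename`."""
    return sha256_of(path) == PUBLISHED_SHA256[filename]
```

For example, matches_published_hash("downloads/cleanalytix-0.1.0.tar.gz", "cleanalytix-0.1.0.tar.gz") would return True only for an unmodified copy of the source distribution; the downloads/ path is illustrative.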
