Cleanalytix
Cleanalytix is a Python library for profiling, scoring, cleaning, and monitoring the quality of tabular datasets with a single pipeline.
It is designed for pandas-first workflows and supports:
- baseline dataset profiling and scoring
- optional cleaning recommendations and automatic cleaning
- optional production/new-dataset monitoring
- optional business rules, thresholds, weights, and type inference for new data
Installation
From a source checkout:
git clone https://github.com/Probot-DATA/Cleanalytix_Repo
cd Cleanalytix_Repo
pip install -e ".[dev]"
Once this project is published to PyPI, the install command will be:
pip install cleanalytix
Runtime requirements:
- Python 3.9+
- pandas
- numpy
- scikit-learn
- nltk
Quick Start
import pandas as pd
from cleanalytix import Run_DQ_Pipeline
df = pd.read_csv("my_data.csv")
result = Run_DQ_Pipeline(
    dataset_names=["my_dataset"],
    dataset_list=[df],
)
print(result["base_data"]["dirty_scores"])
print(result["base_data"]["meta_before_cleaning"])
print(result["base_data"]["recommendations"])
Production / Monitoring Example
import pandas as pd
from cleanalytix import Run_DQ_Pipeline
train_df = pd.read_csv("train.csv")
prod_df = pd.read_csv("production.csv")
rules = {
    "age": lambda value: pd.isna(value) or 0 <= float(value) <= 120,
}
result = Run_DQ_Pipeline(
    dataset_names=["customers"],
    dataset_list=[train_df],
    new_dataset_list=[prod_df],
    rules=rules,
    cleaning=True,
    interactive=False,
    score_mode="exponential",
)
print(result["base_data"]["dirty_scores"])
print(result["base_data"]["cleaned_scores"])
print(result["prod_data"]["dirty_scores"])
print(result["prod_data"]["change_log"])
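The `age` rule above calls `float(value)` directly, which raises `ValueError` for non-numeric strings such as `"unknown"`. This README does not say whether the pipeline catches exceptions raised inside a rule, so a more defensive rule may be worth considering. The sketch below is one possible shape, assuming a rule is a plain callable that takes a single cell value and returns `True` when the value is acceptable (as the lambda above suggests). The helper names `is_missing` and `age_rule` are hypothetical and not part of Cleanalytix.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (hypothetical helper).

    This does not cover pandas-specific markers such as pd.NA or NaT;
    swap in pd.isna(value) if those can appear in your data.
    """
    return value is None or (isinstance(value, float) and math.isnan(value))


def age_rule(value):
    """Return True when the value is missing or a number in [0, 120]."""
    if is_missing(value):
        return True
    try:
        number = float(value)
    except (TypeError, ValueError):
        # Non-numeric input counts as a rule violation instead of crashing.
        return False
    return 0 <= number <= 120


# Same dictionary shape as the `rules` argument in the example above.
rules = {"age": age_rule}

for candidate in [30, "42", 150, "unknown", None]:
    print(repr(candidate), rules["age"](candidate))
```

Because the rule is an ordinary function, it can be unit-tested on its own before it is passed to `Run_DQ_Pipeline`.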
Public API
The primary entrypoint is:
from cleanalytix import Run_DQ_Pipeline
Additional building blocks are also exported:
from cleanalytix import (
    Compute_DQ_Score,
    DEFAULT_THRESHOLDS,
    generate_meta,
    cleaning_recommendations,
    get_cleaned_data,
    get_table_for_DQ_computation,
    summarize_dataset_health,
    learn_reference_profile,
    adjust_prod_meta_with_reference,
    infer_and_fix_types,
)
Pipeline Output Structure
Run_DQ_Pipeline returns:
{
    "base_data": {...},
    "prod_data": {...},
}
Each block preserves the same keys:
- dirty_scores
- cleaned_scores
- cleaned_datasets
- meta_before_cleaning
- meta_after_cleaning
- recommendations
- change_log
- summarized_before
- summarized_after
- main_metrics_before
- main_metrics_after
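Since both blocks share the same key set, a small check can confirm that a result has the documented shape before downstream code indexes into it. The following is a minimal sketch that relies only on the key names listed above. The function `missing_keys` and the stand-in `fake_result` dictionary are hypothetical; the value types stored under each key are not described here, so the sketch checks key presence only.

```python
# Key names copied from the list above.
EXPECTED_KEYS = [
    "dirty_scores",
    "cleaned_scores",
    "cleaned_datasets",
    "meta_before_cleaning",
    "meta_after_cleaning",
    "recommendations",
    "change_log",
    "summarized_before",
    "summarized_after",
    "main_metrics_before",
    "main_metrics_after",
]


def missing_keys(result):
    """Return, per block, the expected keys that are absent from the result."""
    report = {}
    for block_name in ("base_data", "prod_data"):
        block = result.get(block_name) or {}
        report[block_name] = [key for key in EXPECTED_KEYS if key not in block]
    return report


# Stand-in for a real pipeline result; the values are placeholders only.
fake_result = {
    "base_data": {key: None for key in EXPECTED_KEYS},
    "prod_data": {"dirty_scores": None, "change_log": None},
}

report = missing_keys(fake_result)
print(report["base_data"])  # empty list: every expected key is present
print(report["prod_data"])  # the nine keys absent from the stand-in block
```

With a real run, pass the dictionary returned by `Run_DQ_Pipeline` in place of `fake_result`.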
Examples
Runnable examples live in the examples/ directory:
python examples/simple_usage.py
python examples/production_usage.py
These examples assume the package has already been installed in the active environment.
Validation
The validation folder contains a portable real-world validation workflow.
- Large raw datasets are intentionally not committed to the repository.
- Put the expected files under validation/datasets/ by following validation/datasets/README.md.
- Run the validation script:
python validation/run_validation.py
- The script saves non-empty outputs to validation/outputs/<dataset_name>/.
- The notebook validation/main.ipynb uses the same relative-path workflow.
Repository Layout
- cleanalytix/ - installable library package
- examples/ - small runnable examples
- tests/ - smoke tests and lightweight sample fixtures
- validation/ - public-friendly validation workflow and output folder
- archive/legacy/ - historical prototype notebook/code kept for reference, not for active use
Known Limitations
- Validation datasets are not bundled with the repository.
- The yellow taxi validation workflow samples the first 20,000 rows from each configured monthly file to match the original project workflow and to keep validation practical.
- Interactive cleaning is intended for notebook/CLI use and will prompt for input when interactive=True.
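To illustrate what "samples the first 20,000 rows" means in practice, here is a minimal standard-library sketch that reads only the leading rows of a CSV source. It is not the code used by validation/run_validation.py, and the file format of the validation datasets is an assumption. The function `read_first_rows` and the inline sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def read_first_rows(handle, limit=20_000):
    """Return up to `limit` data rows (as dicts) from an open CSV handle."""
    reader = csv.DictReader(handle)
    # islice stops after `limit` rows, so the rest of the file is never parsed.
    return list(islice(reader, limit))


# Tiny in-memory stand-in for a monthly file.
sample_csv = io.StringIO("trip_id,fare\n1,12.5\n2,8.0\n3,30.25\n")
rows = read_first_rows(sample_csv, limit=2)
print(rows)
```

In a pandas workflow, `pd.read_csv(path, nrows=20_000)` gives the equivalent head-of-file sample for CSV input. Note that taking the first rows is not a random sample, so any ordering in the source file carries over into the sample.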
Contributing
See CONTRIBUTING.md.
License
MIT (c) 2026 Probot-DATA contributors