CleansiPy is a modular Python package for cleaning text, categorical, numerical, and datetime data. It offers configurable pipelines with support for preprocessing, typo correction, encoding, imputation, logging, parallel processing, and audit reporting, and is aimed at data scientists handling messy, real-world datasets in ML workflows.

Project description

CleansiPy

CleansiPy 🧼📊 Clean your data like a pro — Text, Categorical, Numerical, and DateTime — all in one package.

🚀 Overview

CleansiPy is an all-in-one Python package designed to clean and preprocess messy datasets with ease and flexibility. It supports four major data types:

📝 Text – tokenization, stemming, lemmatization, stopword removal, n-gram generation, profanity filtering, emoji & HTML cleaning, and more.

🧮 Numerical – missing value handling, outlier detection, precision adjustment, type correction, and logging.

🧾 Categorical – typo correction, standardization, rare value grouping, encoding (OneHot, Label, Ordinal), and fuzzy matching.

🕒 DateTime – flexible parsing, timezone unification, feature extraction (day, month, weekday, etc.), imputation, and validation.

It’s built for data scientists, ML engineers, and analysts working on real-world data pipelines.

🔧 Installation

    pip install cleansipy

📦 Features

✅ Configurable, modular pipelines
✅ Works with pandas DataFrames
✅ Multi-core processing for speed
✅ NLTK/TextBlob integration for NLP
✅ sklearn support for encoding
✅ Detailed logs and cleaning reports
✅ Auto column detection
✅ Type-safe and test-friendly design

⚡ Quick Start

Set up your configuration:

After installing, run the following command to copy the default config.py to your project directory:
```
cleansipy-config
```
Then edit config.py to set your input/output file paths and other options before running the application.
Install requirements:
```
pip install -r requirements.txt
```
Or, if you want to use the package mode:
```
pip install .
```
Run the application:
```
python -m CleansiPy.app
```
Or, if you installed as a package and set up entry points:
```
puripy
```

🖼️ Logo

The official CleansiPy logo is included in the package at CleansiPy/assets/logo.png.

To access or display the logo programmatically:

from CleansiPy import get_logo_path, show_logo
print(get_logo_path())
show_logo()

📦 Package Structure

CleansiPy/
    __init__.py
    __main__.py
    app.py
    mainnum.py
    maincat.py
    maintext.py
    maindt.py
    logo.py
    config.py
    assets/
        logo.png
        README.txt
setup.py
requirements.txt
README.md

All main code is inside the CleansiPy/ directory for packaging.
The logo is in CleansiPy/assets/logo.png and accessible via get_logo_path() and show_logo().
To run the app: set up config, install requirements, then run python -m CleansiPy.app.

If the command does not produce a config.py, use this template:

# ==================== CATEGORICAL DATA CONFIG ====================
config2 = {
   # File paths
   "INPUT_FILE": r"testdata\xx.csv",              # Raw data source
   "OUTPUT_FILE": r"testdata\cleaned.csv",        # Cleaned data destination
   "TARGET_COLUMN": None,                         # Target variable for ML tasks
   "FILE_PATH": r"testdata\cleaning_report.txt",  # Cleaning audit log
   
   # Column selection
   "COLUMNS_TO_CLEAN": ["category", "other_cat_col"],  # Specific columns to process
   "EXCLUDE_COLUMNS": [],                         # Columns to skip
   
   # Core cleaning features
   "FIX_TYPOS": True,                            # Auto-correct spelling variations
   "GROUP_RARE": True,                           # Consolidate infrequent categories
   "RARE_THRESHOLD": 0.05,                       # Minimum frequency to keep as separate category
   "SIMILARITY_THRESHOLD": 80,                   # Fuzzy matching sensitivity (0-100)
   
   # Performance
   "MEMORY_EFFICIENT": False,                    # Optimize for large datasets
   "PARALLEL_JOBS": -1                           # CPU cores to use (-1 = all)
}
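
To make the `SIMILARITY_THRESHOLD` and `RARE_THRESHOLD` options more concrete, here is a minimal standard-library sketch of fuzzy typo correction and rare-value grouping. It is an illustration only and not part of config.py: the function names (`similarity`, `fix_typos`, `group_rare`) and the `"Other"` label are hypothetical, and CleansiPy's own matching logic may differ.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings (case-insensitive)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fix_typos(values, threshold=80):
    """Map each value to the most frequent value it closely resembles."""
    counts = Counter(values)
    canonical = []  # spellings accepted so far, most frequent first
    mapping = {}
    for value, _ in counts.most_common():
        # Reuse an accepted spelling if one is similar enough.
        match = next(
            (c for c in canonical if similarity(value, c) >= threshold), None
        )
        if match is None:
            canonical.append(value)
            match = value
        mapping[value] = match
    return [mapping[v] for v in values]


def group_rare(values, rare_threshold=0.05, label="Other"):
    """Replace categories whose relative frequency is below the threshold."""
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= rare_threshold else label for v in values]


if __name__ == "__main__":
    raw = ["apple"] * 10 + ["aple"] * 2 + ["banana"] * 10 + ["kiwi"]
    cleaned = group_rare(fix_typos(raw, threshold=80), rare_threshold=0.05)
    print(Counter(cleaned))
```

Processing the most frequent values first means common spellings become the canonical form, so a lower threshold merges more aggressively and a higher one is more conservative.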

# ==================== NUMERICAL DATA CONFIG ====================
DEFAULT_CONFIG = {
   # File handling
   'input_file': r'testdata\data.csv',
   'output_file': r'testdata\cleaned_output.csv',
   'report_file': r'testdata\textreport.txt',

   # Data type handling
   'type_conversion': {
       'numeric_cols': ['Sales_Before', 'Sales_After']  # Columns to force-convert to numeric
   },
   
   # Missing data
   'missing_values': {
       'strategy': 'mean',                      # Imputation method (mean/median/mode)
       'threshold': 0.5                         # Max allowed missingness per column
   },
   
   # Data validation
   'data_errors': {
       'constraints': {                         # Value range rules
           'Sales_Before': lambda x: (x >= 50) & (x <= 500)
       },
       'correction': 'median'                   # How to fix invalid values
   },
   
   # Outlier treatment
   'outliers': {
       'method': 'iqr',                         # Detection method (iqr/zscore)
       'action': 'cap',                         # Handling (cap/remove)
       'columns': ['Sales_Before', 'Sales_After']  # Columns to check
   },
   
   # Precision control
   'precision': {
       'Sales_Before': 2                        # Decimal places to round
   }
}
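
The numerical options above (mean imputation, constraint-based correction with the median, and IQR capping) can be sketched with the standard library alone. This is a rough illustration and not part of config.py: it works on plain lists rather than pandas columns, the helper names are hypothetical, and the 1.5 multiplier for the IQR fences is the conventional default rather than a value confirmed by the package.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def correct_invalid(values, is_valid):
    """Replace values that fail the constraint with the median of valid ones."""
    valid = [v for v in values if is_valid(v)]
    median = statistics.median(valid)
    return [v if is_valid(v) else median for v in values]


def cap_outliers_iqr(values, k=1.5):
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


if __name__ == "__main__":
    sales = [10, 12, 11, 13, 12, 11, 10, 13, 100]
    print(cap_outliers_iqr(sales))
    print(correct_invalid([60, 70, 80, 9999], lambda x: 50 <= x <= 500))
    print(impute_mean([1.0, None, 3.0]))
```

Capping keeps every row and only pulls extreme values back to the fence, which is why it is often preferred over removal when the dataset is small.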

# ==================== DATE/TIME CONFIG ====================
config3 = {
   # File setup
   "INPUT_FILE": r"testdata\dates.csv",
   "OUTPUT_FILE": r"testdata\cleaned.csv",
   "REPORT_FILE": r"testdata\date_cleaning_report.txt",

   # Date processing
   "PARSE_DATES": True,                        # Convert string dates
   "IMPUTE_MISSING": True,                     # Fill missing timestamps
   "IMPUTATION_METHOD": "linear",              # Filling strategy
   "STANDARDIZE_TIMEZONE": False,              # Convert to target timezone
   
   # Feature generation
   "EXTRACT_FEATURES": True,                   # Create calendar features
   "CALENDAR_FEATURES": ["year", "month"],     # Features to extract
   "FISCAL_YEAR_START": 10                     # Fiscal year starting month
}
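
As a rough illustration of what `PARSE_DATES`, `CALENDAR_FEATURES`, and `FISCAL_YEAR_START` might involve, the sketch below parses strings against a short list of formats and derives calendar features with the standard `datetime` module. It is not part of config.py. The helper names and the format list are hypothetical, and the fiscal-year convention is an assumption: the fiscal year is labelled by the calendar year in which it ends, so with a start month of 10, November 2024 falls in fiscal year 2025.

```python
from datetime import datetime


def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each format in turn; return None if none of them matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def fiscal_year(d, start_month=10):
    """Label the fiscal year by the calendar year in which it ends (assumed)."""
    if start_month == 1:
        return d.year
    return d.year + 1 if d.month >= start_month else d.year


def extract_features(d, features=("year", "month"), fiscal_year_start=10):
    """Build a feature dict from datetime attributes plus the fiscal year."""
    row = {name: getattr(d, name) for name in features}
    row["fiscal_year"] = fiscal_year(d, fiscal_year_start)
    return row


if __name__ == "__main__":
    for raw in ["2024-11-05", "30/09/2024", "not a date"]:
        parsed = parse_date(raw)
        print(raw, "->", extract_features(parsed) if parsed else None)
```

Returning `None` for unparseable strings keeps the failure visible, so a later imputation step can decide how to fill it.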

# ==================== TEXT CLEANING CONFIG ====================
config = {
   # Core text processing
   'lowercase': True,                          # Convert to lowercase
   'remove_punctuation': True,                 # Strip punctuation
   'remove_stopwords': True,                   # Filter common words
   'lemmatize': True,                          # Reduce to base form
   
   # Advanced cleaning
   'remove_urls': True,                        # Strip web addresses
   'remove_emojis': True,                      # Filter emoji characters
   'spelling_correction': False,               # Fix spelling errors
   
   # File handling
   'input_file': r"testdata\test.csv",
   'output_file': r"testdata\cleaned_text.csv",
   'text_column': None                         # Auto-detect text column
}
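
The text options above map onto a simple sequence of transformations. The following sketch shows one way such flags could drive a cleaning function using only the standard library. It is not part of config.py and not CleansiPy's implementation: `clean_text`, the URL pattern, and the tiny hard-coded stopword set are assumptions for illustration (the package itself advertises NLTK/TextBlob integration), and lemmatization is left out.

```python
import re
import string

# Simplified URL pattern; real-world URL detection is more involved.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Tiny illustrative stopword set, not a real stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in", "at"}


def clean_text(text, lowercase=True, remove_urls=True,
               remove_punctuation=True, remove_stopwords=True):
    """Apply the enabled cleaning steps and return the normalized text."""
    if remove_urls:
        text = URL_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    sample = "The docs are at https://example.com, see the README!"
    print(clean_text(sample))
```

URLs are removed before punctuation so that stripping characters such as `:` and `/` does not leave URL fragments behind as ordinary tokens.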

Author

Developed by Sambhranta Ghosh. Open to contributions, feedback, and improvements!

For more, see the in-code docstrings and comments, or visit the repository for more information and to contribute: https://github.com/Rickyy-Sam07/CleansiPy.git

Project details

Release history

0.1.13: Jul 14, 2025
0.1.12: Jul 13, 2025
0.1.10: Jun 12, 2025 (this version)
0.1.6: Jun 12, 2025
0.1.5: Jun 12, 2025
0.1.4: Jun 12, 2025
0.1.0: Jun 12, 2025

Download files

Source Distribution

cleansipy-0.1.10.tar.gz (40.7 kB)

Uploaded Jun 12, 2025 Source

Built Distribution

cleansipy-0.1.10-py3-none-any.whl (48.0 kB)

Uploaded Jun 12, 2025 Python 3

File details

Details for the file cleansipy-0.1.10.tar.gz.

File metadata

Download URL: cleansipy-0.1.10.tar.gz
Upload date: Jun 12, 2025
Size: 40.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for cleansipy-0.1.10.tar.gz
Algorithm	Hash digest
SHA256	`bdff38e7b453c08f6ccacab8bca216019f536e705afe5c3b80df2b08351e300b`
MD5	`a58946ba624549da166dcfe741e98abe`
BLAKE2b-256	`838fc5d9b0c6587acc62de5d1d3c997ceb8e01bde38b7ec51f5488b5715ff5a3`

File details

Details for the file cleansipy-0.1.10-py3-none-any.whl.

File metadata

Download URL: cleansipy-0.1.10-py3-none-any.whl
Upload date: Jun 12, 2025
Size: 48.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for cleansipy-0.1.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c75dca4961a37ad5b46bd42335dc63ba0b2044a0b51bbb4aedd9fa7b163c547`
MD5	`3b919e572c36c8bb3e649cb48293ba69`
BLAKE2b-256	`ac4b974baa88a6eceb924e9c0947ec5f77cc52cab92fa379438f5db03ccbc30b`
