Automatic data cleaning, EDA, and missing data visualization for pandas DataFrames
Pristinizer
Pristinizer is a lightweight Python package for automatic data cleaning, exploratory data analysis (EDA), and missing data visualization for pandas DataFrames.
It helps data scientists and ML engineers quickly clean and understand datasets with minimal effort.
Features
- Automatic data cleaning
  - Removes duplicate rows
  - Standardizes column names
  - Handles missing values
  - Removes empty rows and columns
- Dataset summary (EDA)
  - Column data types
  - Missing value counts and percentages
  - Unique value counts
- Missing data visualization
  - Missing value matrix
  - Missing value heatmap
  - Missing value bar chart
- Simple and easy-to-use API
Installation
From PyPI (recommended)
pip install pristinizer
From Source
git clone https://github.com/harmanbajwa2954/Pristinizer-pyProject.git
cd Pristinizer-pyProject
pip install .
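To confirm that either installation route worked, the installed version can be looked up with the standard library. This is a minimal sketch; the helper name `installed_version` is made up for illustration and is not part of Pristinizer.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("pristinizer"))
```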
Quick Start
import pandas as pd
import pristinizer as ps
# Load dataset
df = pd.read_csv("data.csv")
# Clean dataset
clean_df = ps.clean(df)
# Generate summary
summary = ps.summarize(df)
print(summary)
# Visualize missing data
ps.missing_matrix(df)
ps.missing_heatmap(df)
ps.missing_bar(df)
Examples
Input dataset
| Name | Age | Salary | City |
|---|---|---|---|
| A | 25 | 50000 | Delhi |
| B | NaN | 60000 | Mumbai |
| C | 30 | NaN | NaN |
Output of ps.summarize(df)
| column | datatype | missing_count | missing_% | unique_count |
|---|---|---|---|---|
| Age | float64 | 1 | 33.33 | 2 |
| Salary | float64 | 1 | 33.33 | 2 |
| City | object | 1 | 33.33 | 2 |
| Name | object | 0 | 0.00 | 3 |
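For readers who want to see where these numbers come from, the table can be approximated with plain pandas. This is a sketch, not Pristinizer's actual implementation: the function name `summarize_sketch` is hypothetical, and sorting by missing count is an assumption inferred from the row order shown above.

```python
import numpy as np
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Build a per-column summary similar to the example table (illustrative only)."""
    missing = df.isna()
    summary = pd.DataFrame({
        "column": list(df.columns),
        "datatype": [str(dtype) for dtype in df.dtypes],
        "missing_count": missing.sum().to_numpy(),
        # Share of missing rows per column, as a percentage with two decimals.
        "missing_%": (missing.mean() * 100).round(2).to_numpy(),
        # nunique() ignores NaN by default.
        "unique_count": df.nunique().to_numpy(),
    })
    # Assumption: columns with the most missing values are listed first.
    summary = summary.sort_values("missing_count", ascending=False, kind="stable")
    return summary.reset_index(drop=True)


# The input dataset from the example above.
df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
    "City": ["Delhi", "Mumbai", np.nan],
})
print(summarize_sketch(df))
```

The exact dtype label for text columns depends on the pandas version, so the `datatype` column may not match the table character for character.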
Available Functions
Data Cleaning
ps.clean(df)
Returns the cleaned DataFrame.
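The cleaning steps listed under Features can be approximated in plain pandas as shown below. This is only a sketch of what such a pipeline might look like: `clean_sketch` is a hypothetical name, the order of the steps is a guess, and the fill strategy (median for numeric columns, most frequent value otherwise) is an assumption, since this README does not say how `ps.clean` handles missing values.

```python
import re

import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate the documented cleaning steps (not Pristinizer's actual code)."""
    out = df.copy()

    # Standardize column names: trim, lowercase, replace other characters with "_".
    out.columns = [
        re.sub(r"[^0-9a-z]+", "_", str(col).strip().lower()).strip("_")
        for col in out.columns
    ]

    # Remove rows and columns that contain no data at all.
    out = out.dropna(axis=0, how="all").dropna(axis=1, how="all")

    # Remove duplicate rows (pandas treats NaN values as equal here).
    out = out.drop_duplicates()

    # Handle missing values (assumed strategy, see the note above).
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    return out.reset_index(drop=True)


raw = pd.DataFrame({
    " Name ": ["A", "B", "B", None, "C"],
    "Age (years)": [25.0, None, None, None, 30.0],
    "Empty": [None] * 5,
})
print(clean_sketch(raw))
```

The sketch works on a copy so the caller's DataFrame is left untouched; whether `ps.clean` does the same is not stated here.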
Dataset Summary
ps.summarize(df)
Returns a summary DataFrame containing:
- column name
- datatype
- missing count
- missing percentage
- unique count
Missing Data Visualization
Matrix view:
ps.missing_matrix(df)
Heatmap view:
ps.missing_heatmap(df)
Bar chart view:
ps.missing_bar(df)
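To give an idea of what the matrix and bar chart views convey, here is a rough matplotlib-only equivalent. It is a sketch under assumptions, not the code behind `ps.missing_matrix` or `ps.missing_bar`: the function name `missing_overview_sketch` is hypothetical, the styling is arbitrary, and the heatmap view is left out because this README does not specify what it plots.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def missing_overview_sketch(df: pd.DataFrame):
    """Draw a missing-value matrix and a per-column bar chart side by side."""
    mask = df.isna()
    labels = [str(col) for col in df.columns]
    fig, (ax_matrix, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    # Matrix view: one cell per value, dark where the value is missing.
    ax_matrix.imshow(
        mask.to_numpy(dtype=float), aspect="auto", cmap="gray_r", interpolation="none"
    )
    ax_matrix.set_xticks(range(len(labels)))
    ax_matrix.set_xticklabels(labels, rotation=45, ha="right")
    ax_matrix.set_ylabel("row")
    ax_matrix.set_title("Missing value matrix")

    # Bar chart view: number of missing values in each column.
    ax_bar.bar(labels, mask.sum().to_numpy())
    ax_bar.set_ylabel("missing values")
    ax_bar.set_title("Missing values per column")

    fig.tight_layout()
    return fig


df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
    "City": ["Delhi", "Mumbai", np.nan],
})
fig = missing_overview_sketch(df)
```

In a script, `fig.savefig(...)` writes the figure to disk; in an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead.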
Project Structure
pristinizer/
│
├── pristinizer/
│   ├── __init__.py
│   ├── cleaner.py
│   ├── eda.py
│   └── visualizer.py
│
├── README.md
├── pyproject.toml
└── LICENSE
Requirements
- Python ≥ 3.8
- pandas
- matplotlib
- seaborn
Use Cases
- Machine learning preprocessing
- Exploratory data analysis
- Data science projects
- Data cleaning automation
- Educational purposes
Future Features
- Outlier detection
- Automatic datatype conversion
- Feature importance analysis
- Full EDA reports
- Integration with ML pipelines
Contributing
Contributions are welcome.
Steps:
- Fork the repository
- Create a new branch
- Make changes
- Submit a pull request
License
MIT License
Author
Created by Harmanpreet Singh as a data science utility package to simplify preprocessing workflows.
Support
If you find this useful, consider starring the repository on GitHub.