Automatic data cleaning, EDA, and missing data visualization for pandas DataFrames

Pristinizer

Pristinizer is a lightweight Python package for automatic data cleaning, exploratory data analysis (EDA), and missing data visualization for pandas DataFrames.

It helps data scientists and ML engineers quickly clean and understand datasets with minimal effort.


Features

  • Automatic data cleaning

    • Removes duplicate rows
    • Standardizes column names
    • Handles missing values
    • Removes empty rows and columns
  • Dataset summary (EDA)

    • Column data types
    • Missing value counts and percentages
    • Unique value counts
  • Missing data visualization

    • Missing value matrix
    • Missing value heatmap
    • Missing value bar chart
  • Simple and easy-to-use API


Installation

From PyPI (recommended)

pip install pristinizer

From Source

git clone https://github.com/harmanbajwa2954/Pristinizer-pyProject.git
cd Pristinizer-pyProject
pip install .
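
After either install route, it can help to confirm that the package is visible to the interpreter you intend to use. The snippet below is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is made up for illustration and is not part of Pristinizer.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pristinizer is installed, otherwise None.
print(installed_version("pristinizer"))
```

If this prints `None`, the install most likely went into a different environment than the one running the script.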

Quick Start

import pandas as pd
import pristinizer as ps

# Load dataset
df = pd.read_csv("data.csv")

# Clean dataset
clean_df = ps.clean(df)

# Generate summary
summary = ps.summarize(df)
print(summary)

# Visualize missing data
ps.missing_matrix(df)
ps.missing_heatmap(df)
ps.missing_bar(df)

Examples

Input dataset

Name  Age  Salary  City
A     25   50000   Delhi
B     NaN  60000   Mumbai
C     30   NaN     NaN

Output of ps.summarize(df)

column  datatype  missing_count  missing_%  unique_count
Age     float64   1              33.33      2
Salary  float64   1              33.33      2
City    object    1              33.33      2
Name    object    0              0.00       3
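
To make the columns of that table concrete, here is a rough plain-pandas sketch that computes the same statistics for the example input. It is not Pristinizer's implementation: `summarize_sketch` is a hypothetical helper, and sorting by missing count is an assumption inferred from the row order shown above. The `datatype` strings can also differ between pandas versions for text columns.

```python
import numpy as np
import pandas as pd

# The example input dataset from above.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Age": [25, np.nan, 30],
        "Salary": [50000, 60000, np.nan],
        "City": ["Delhi", "Mumbai", np.nan],
    }
)


def summarize_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: dtype, missing stats and unique count per column."""
    summary = pd.DataFrame(
        {
            "column": list(frame.columns),
            "datatype": [str(dtype) for dtype in frame.dtypes],
            "missing_count": frame.isna().sum().to_numpy(),
            "missing_%": (frame.isna().mean() * 100).round(2).to_numpy(),
            # nunique() ignores missing values by default.
            "unique_count": frame.nunique().to_numpy(),
        }
    )
    # Assumed ordering: columns with the most missing values first.
    summary = summary.sort_values("missing_count", ascending=False, kind="stable")
    return summary.reset_index(drop=True)


print(summarize_sketch(df).to_string(index=False))
```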

Available Functions

Data Cleaning

ps.clean(df)

Returns cleaned DataFrame.
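
The README does not spell out how each cleaning step is carried out, so the following is only an illustration of the four listed steps in plain pandas, not the code behind `ps.clean`. The function name `clean_sketch`, the order of the steps, the snake_case naming rule, and the median/mode fill strategy are all assumptions.

```python
import numpy as np
import pandas as pd


def clean_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the four cleaning steps, in plain pandas."""
    out = frame.copy()

    # 1. Standardize column names: trim, lower-case, spaces to underscores.
    out.columns = [str(col).strip().lower().replace(" ", "_") for col in out.columns]

    # 2. Drop rows and columns that are entirely empty.
    out = out.dropna(axis="index", how="all").dropna(axis="columns", how="all")

    # 3. Drop exact duplicate rows (missing values compare as equal here).
    out = out.drop_duplicates()

    # 4. Fill remaining gaps: median for numeric columns, mode otherwise.
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    return out.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "First Name": ["A", "B", "B", None, "C"],
        " Age ": [25, np.nan, np.nan, np.nan, 35],
        "Notes": [None, None, None, None, None],
    }
)
cleaned = clean_sketch(raw)
print(cleaned)
```

In this toy frame the all-empty row and the all-empty `Notes` column are dropped, the repeated `B` row collapses to one, and the remaining missing age is filled with the median of 25 and 35.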

Dataset Summary

ps.summarize(df)

Returns summary DataFrame containing:

  • column name
  • datatype
  • missing count
  • missing percentage
  • unique count

Missing Data Visualization

Matrix view:

ps.missing_matrix(df)

Heatmap view:

ps.missing_heatmap(df)

Bar chart view:

ps.missing_bar(df)
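
For readers curious about what such views show, the sketch below draws a matrix view and a bar chart view directly with pandas and matplotlib. It is an approximation under stated assumptions, not the package's plotting code: the colors, layout, and output file name `missing_overview.png` are made up, and the heatmap view is left out.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Age": [25, np.nan, 30],
        "Salary": [50000, 60000, np.nan],
        "City": ["Delhi", "Mumbai", np.nan],
    }
)

# Boolean mask: True wherever a value is missing. Both views derive from it.
mask = df.isna()
missing_counts = mask.sum()

fig, (ax_matrix, ax_bar) = plt.subplots(1, 2, figsize=(9, 3))

# Matrix view: one cell per value; dark cells mark missing entries.
ax_matrix.imshow(mask.to_numpy().astype(int), aspect="auto", cmap="gray_r")
ax_matrix.set_xticks(range(len(df.columns)))
ax_matrix.set_xticklabels(list(df.columns))
ax_matrix.set_ylabel("row")
ax_matrix.set_title("Missing value matrix")

# Bar chart view: number of missing values per column.
ax_bar.bar(list(missing_counts.index), missing_counts.to_numpy())
ax_bar.set_ylabel("missing values")
ax_bar.set_title("Missing values per column")

fig.tight_layout()
fig.savefig("missing_overview.png")
```

In a notebook or interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.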

Project Structure

pristinizer/
├── pristinizer/
│   ├── __init__.py
│   ├── cleaner.py
│   ├── eda.py
│   └── visualizer.py
├── README.md
├── pyproject.toml
└── LICENSE


Requirements

  • Python ≥ 3.8
  • pandas
  • matplotlib
  • seaborn

Use Cases

  • Machine learning preprocessing
  • Exploratory data analysis
  • Data science projects
  • Data cleaning automation
  • Educational purposes

Future Features

  • Outlier detection
  • Automatic datatype conversion
  • Feature importance analysis
  • Full EDA reports
  • Integration with ML pipelines

Contributing

Contributions are welcome.

Steps:

  1. Fork the repository
  2. Create a new branch
  3. Make changes
  4. Submit a pull request

License

MIT License


Author

Created as a data science utility package to simplify preprocessing workflows.
By Harmanpreet Singh


Support

If you find this useful, consider starring the repository on GitHub.
