Automatic data cleaning, EDA, and missing data visualization for pandas DataFrames

Pristinizer

Pristinizer is a lightweight Python package for automatic data cleaning, exploratory data analysis (EDA), and missing data visualization for pandas DataFrames.

It helps data scientists and ML engineers quickly clean and understand datasets with minimal effort.


Features

  • Automatic data cleaning

    • Removes duplicate rows
    • Standardizes column names
    • Handles missing values
    • Removes empty rows and columns
  • Dataset summary (EDA)

    • Column data types
    • Missing value counts and percentages
    • Unique value counts
  • Missing data visualization

    • Missing value matrix
    • Missing value heatmap
    • Missing value bar chart
  • Simple and easy-to-use API


Installation

From PyPI (recommended)

pip install pristinizer

From Source

git clone https://github.com/harmanbajwa2954/Pristinizer-pyProject.git
cd Pristinizer-pyProject
pip install .

Quick Start

import pandas as pd
import pristinizer as ps

# Load dataset
df = pd.read_csv("data.csv")

# Clean dataset
clean_df = ps.clean(df)

# Generate summary
summary = ps.summarize(df)
print(summary)

# Visualize missing data
ps.missing_matrix(df)
ps.missing_heatmap(df)
ps.missing_bar(df)

Examples

Input dataset

Name  Age  Salary  City
A     25   50000   Delhi
B     NaN  60000   Mumbai
C     30   NaN     NaN

Output of ps.summarize(df)

column  datatype  missing_count  missing_%  unique_count
Age     float64   1              33.33      2
Salary  float64   1              33.33      2
City    object    1              33.33      2
Name    object    0              0.00       3

Available Functions

Data Cleaning

ps.clean(df)

Returns a cleaned DataFrame with the cleaning steps listed under Features applied.
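
To make the cleaning steps concrete, here is a minimal sketch of what such a pipeline could look like in plain pandas. It is not Pristinizer's implementation: the function name clean_sketch, the order of the steps, the lowercase-with-underscores naming rule, and the median/mode fill strategy are all assumptions chosen for illustration.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pipeline; NOT Pristinizer's actual code."""
    out = df.copy()

    # Standardize column names (assumed rule: trim, lowercase, spaces -> "_").
    out.columns = (
        out.columns.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )

    # Remove rows and columns that contain no data at all.
    out = out.dropna(axis=0, how="all").dropna(axis=1, how="all")

    # Remove exact duplicate rows.
    out = out.drop_duplicates()

    # Handle remaining gaps (assumed strategy: median for numeric
    # columns, most frequent value for everything else).
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    return out.reset_index(drop=True)


raw = pd.DataFrame({
    " Name ": ["A", "B", "B", None],
    "Age": [25.0, None, None, None],
    "Home City": ["Delhi", "Mumbai", "Mumbai", None],
    "Empty": [None, None, None, None],
})
cleaned = clean_sketch(raw)
print(cleaned)
```

The sketch works on a copy, so the input DataFrame is left untouched. Whether ps.clean does the same is not stated here, so check before relying on it.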

Dataset Summary

ps.summarize(df)

Returns a summary DataFrame containing, for each column:

  • column name
  • datatype
  • missing count
  • missing percentage
  • unique count
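
The fields above can be computed directly with pandas. The sketch below is an assumption about how such a summary might be built, not the package's code: summarize_sketch is a hypothetical name, the output column names are borrowed from the example table, and sorting by missing count is a guess based on the row order in that table.

```python
import numpy as np
import pandas as pd


def summarize_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary built with plain pandas (illustrative only)."""
    missing = df.isna()
    summary = pd.DataFrame({
        "column": list(df.columns),
        "datatype": df.dtypes.astype(str).to_numpy(),
        "missing_count": missing.sum().to_numpy(),
        # Share of missing values per column, as a percentage.
        "missing_%": (missing.mean() * 100).round(2).to_numpy(),
        # Distinct non-missing values per column.
        "unique_count": df.nunique(dropna=True).to_numpy(),
    })
    # Assumed ordering: columns with the most missing values first.
    return summary.sort_values(
        "missing_count", ascending=False, kind="stable"
    ).reset_index(drop=True)


df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
    "City": ["Delhi", "Mumbai", None],
})
summary = summarize_sketch(df)
print(summary)
```

Age and Salary come out as float64 rather than an integer type because NaN is a floating-point value, so a numeric column that contains it is stored as floats.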

Missing Data Visualization

Matrix view:

ps.missing_matrix(df)

Heatmap view:

ps.missing_heatmap(df)

Bar chart view:

ps.missing_bar(df)
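
All three views can be derived from the boolean mask df.isna(). As a rough illustration of the idea, the following sketch draws a matrix view and a bar chart with matplotlib. It is not how Pristinizer renders its plots: plot_missing_sketch is a hypothetical helper, and the colors and layout are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_missing_sketch(df: pd.DataFrame):
    """Draw a missing-value matrix and bar chart (illustrative only)."""
    mask = df.isna()  # True where a value is missing

    fig, (ax_matrix, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

    # Matrix view: one cell per value, dark where the value is missing.
    ax_matrix.imshow(
        mask.to_numpy(dtype=float),
        aspect="auto", cmap="gray_r", interpolation="none",
        vmin=0, vmax=1,
    )
    ax_matrix.set_xticks(range(len(df.columns)))
    ax_matrix.set_xticklabels(df.columns, rotation=45, ha="right")
    ax_matrix.set_ylabel("row")
    ax_matrix.set_title("Missing value matrix")

    # Bar view: number of missing values in each column.
    counts = mask.sum()
    ax_bar.bar(counts.index.astype(str), counts.to_numpy())
    ax_bar.tick_params(axis="x", labelrotation=45)
    ax_bar.set_ylabel("missing count")
    ax_bar.set_title("Missing values per column")

    fig.tight_layout()
    return fig


df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
    "City": ["Delhi", "Mumbai", None],
})
fig = plot_missing_sketch(df)
```

The helper returns the figure instead of displaying it, so the caller can choose between plt.show() and fig.savefig(...).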

Project Structure

pristinizer/
├── pristinizer/
│   ├── __init__.py
│   ├── cleaner.py
│   ├── eda.py
│   └── visualizer.py
├── README.md
├── pyproject.toml
└── LICENSE


Requirements

  • Python ≥ 3.8
  • pandas
  • matplotlib
  • seaborn

Use Cases

  • Machine learning preprocessing
  • Exploratory data analysis
  • Data science projects
  • Data cleaning automation
  • Educational purposes

Future Features

  • Outlier detection
  • Automatic datatype conversion
  • Feature importance analysis
  • Full EDA reports
  • Integration with ML pipelines

Contributing

Contributions are welcome.

Steps:

  1. Fork the repository
  2. Create a new branch
  3. Make changes
  4. Submit a pull request

License

MIT License


Author

Created by Harmanpreet Singh as a data science utility package to simplify preprocessing workflows.


Support

If you find this useful, consider starring the repository on GitHub.
