A comprehensive data quality analysis library for pandas DataFrames

These details have not been verified by PyPI

Project links

Project description

PyDataQuality

The Enterprise-Grade Data Quality Engine for Python.

PyDataQuality automates the tedious 80% of data science: validating, profiling, and cleaning new datasets. It transforms raw pandas DataFrames into publication-ready quality reports with a single line of code. Designed for high-velocity data teams, it features batch processing for big data, custom rule validation, and AI-powered remediation suggestions.

Data Quality Analysis

Features

Core Capabilities

Data Quality Analysis: Detect missing values, outliers, inconsistencies, and data type issues
Data Extraction: Pinpoint and extract problematic rows (e.g., specific outliers) for remediation via get_problematic_rows()
Multi-Format Support: Native support for CSV, Excel (.xlsx, .xls), JSON, and Parquet files
CLI Interface: Auto-detects file formats from the terminal - no coding required
Interactive Notebook Display: Direct rendering in Jupyter/Colab with show_report()
Batch Sampling: Process large datasets (GBs) efficiently with chunk-based sampling
Custom YAML Rules: Enterprise-grade validation with configuration files
AI Remediation Prompts: Auto-generated Python scripts for fixing detected issues
Comprehensive Visualizations: Publication-quality plots for data quality assessment
Professional Reports: HTML, text, and JSON reports with actionable insights
Easy Integration: Works seamlessly with pandas DataFrames from any source

Supported Input Formats

CLI: Auto-detects CSV, Excel (.xlsx, .xls), JSON, Parquet (python -m pydataquality data.csv)
Python API: Accepts any pandas DataFrame from:

CSV files (pd.read_csv())
Excel files (pd.read_excel())
JSON files (pd.read_json())
Parquet files (pd.read_parquet())
SQL databases (pd.read_sql())
APIs (any source that produces a DataFrame)
Cloud platforms (Snowflake, BigQuery, etc.)

Summary: PyDataQuality works with CSV, Excel, JSON, Parquet files, and any pandas DataFrame from SQL databases, APIs, or cloud platforms. The library is format-agnostic - it only requires a pandas DataFrame as input.

Why PyDataQuality?

The Problem

You just received a new dataset. You need to know what's wrong with it and how to fix it - fast. Existing tools either:

Take too long (pandas-profiling: 10+ minutes, 200MB reports)
Require too much setup (Great Expectations: YAML configs, checkpoints, data contexts)
Don't give actionable insights (pandas .describe(): just basic stats)

The Solution

PyDataQuality fills the gap between "too simple" and "too complex":

"I just got a new dataset. What's wrong with it, and how do I fix it? I need answers in 30 seconds, not 30 minutes."

Comparison with Alternatives

Feature	pandas-profiling	Great Expectations	PyDataQuality
One-liner usage	❌	❌	✅
Extract bad rows	❌	❌	✅
CLI support	❌	❌	✅
Fast on large data	❌	✅	✅
No config needed	✅	❌	✅
AI integration	❌	❌	✅
Beginner-friendly	⚠️	❌	✅

Real-World Example

Scenario: Data scientist gets a CSV with 1M rows

With pandas-profiling:

# Takes 10+ minutes, generates 200MB HTML
profile = ProfileReport(df)
profile.to_file("report.html")
# Now what? How do I fix the issues?

With PyDataQuality:

# Takes 30 seconds
analyzer = pdq.analyze_dataframe(df)
bad_ages = analyzer.get_problematic_rows('age', 'outliers')
bad_ages.to_csv('fix_these.csv')  # Send to data team

Bottom line: PyDataQuality is simple by design. It just works, without making you read 50 pages of documentation.

Installation

# Clone the repository
git clone https://github.com/DominionAkinrotimi/pydataquality.git
cd pydataquality

# Install in development mode
pip install -e .

# Install requirements directly
pip install -r requirements.txt

# For interactive notebook support (Jupyter/Colab)
pip install ".[notebook]"

Quick Start

import pandas as pd
import pydataquality as pdq

# Load your data
df = pd.read_csv('your_data.csv')

# Quick quality check
summary = pdq.quick_quality_check(df, name="My Dataset")

# Comprehensive analysis
analyzer = pdq.analyze_dataframe(df, name="My Dataset")

# Analysis with custom thresholds (e.g., stricter missing data check)
# Analysis with custom thresholds and excluded values
config = {
    'missing_critical': 0.1, 
    'outlier_threshold': 2.0,
    'exclude_values': {
        'age': [150, -5],  # specific values to ignore
        'sales': [-100]
    }
}
analyzer = pdq.analyze_dataframe(df, name="My Dataset", config=config)

# Generate visualizations
visualizer = pdq.create_visual_report(analyzer)

# Generate HTML report
pdq.generate_report(analyzer, output_path="quality_report.html", format='html')

# Get AI assistance for fixes (includes EDA context by default)
prompt = pdq.generate_ai_prompt(analyzer)
print(prompt) 

# Output example:
# "I have a dataset with 1000 rows. Issues: age (outliers: outside [18, 65]).
#  Statistical Context: age (int64): Mean=42, Max=150...
#  Please write a script..."

CLI Usage

Run the analysis directly from your terminal:

# Basic usage
python -m pydataquality data.csv

# Generate professional HTML report
python -m pydataquality data.csv --report html --theme professional

# Create visualizations
python -m pydataquality data.csv --visualize

Documentation

Examples

Check the examples directory for comprehensive usage examples:

Basic Usage: Simple analysis and reporting
Advanced Features: Custom configurations and visualizations
Real-world Scenarios: Handling various data quality issues

Project Structure

PyDataQuality/
├── pydataquality/       # Main package directory
│   ├── __init__.py      # Exports
│   ├── __main__.py      # Package entry point
│   ├── analyzer.py      # Core analysis engine
│   ├── cli.py           # Package CLI logic
│   ├── comparator.py    # Data drift detection
│   ├── config.py        # Settings management
│   ├── reporter.py      # Report generation
│   ├── utils.py         # Helper functions
│   └── visualizer.py    # Visualization engine
├── docs/                # Documentation
├── examples/            # Usage examples
├── tests/               # Unit tests
├── cli.py               # Root CLI script (local usage)
├── setup.py             # Installation script
└── requirements.txt     # Dependencies

Key Components

1. DataQualityAnalyzer

Core analysis engine that examines data structure, detects issues, and computes statistics.

2. DataQualityVisualizer

Creates comprehensive visualizations including:

Missing value heatmaps
Outlier detection plots
Distribution analysis
Correlation heatmaps
Categorical value distributions

3. QualityReportGenerator

Generates professional reports in multiple formats with actionable recommendations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with pandas, matplotlib, and seaborn
Inspired by real-world data quality challenges
Designed for data scientists and analysts

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Dec 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydataquality-0.1.0.tar.gz (48.7 kB view details)

Uploaded Dec 18, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pydataquality-0.1.0-py3-none-any.whl (56.6 kB view details)

Uploaded Dec 18, 2025 Python 3

File details

Details for the file pydataquality-0.1.0.tar.gz.

File metadata

Download URL: pydataquality-0.1.0.tar.gz
Upload date: Dec 18, 2025
Size: 48.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pydataquality-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9b456486983a505678988d868595847b85cca932bb2bbb7be12477695edc21dc`
MD5	`e9722e343fc10e630d27c53ea7789eba`
BLAKE2b-256	`3bb32ceaa36a7f7dc4c320e06bf778b5b72778af4dd3ad32ae1e842aff68ff55`

See more details on using hashes here.

File details

Details for the file pydataquality-0.1.0-py3-none-any.whl.

File metadata

Download URL: pydataquality-0.1.0-py3-none-any.whl
Upload date: Dec 18, 2025
Size: 56.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pydataquality-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6a2d8cce139636d15aed813dcb5248b61ea4d59e0d311b99eb52f057bae55e0`
MD5	`a2f89dc3d516390bc8ab789b1950b594`
BLAKE2b-256	`ef83f57bd4a6865eb87c66ca1ff133955df17034f91a40adf29842338bf4a04c`

See more details on using hashes here.

pydataquality 0.1.0

Navigation

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Project description

PyDataQuality

Features

Core Capabilities

Supported Input Formats

Why PyDataQuality?

The Problem

The Solution

Comparison with Alternatives

Real-World Example

Installation

Quick Start

CLI Usage

Documentation

Examples

Project Structure

Key Components

1. DataQualityAnalyzer

2. DataQualityVisualizer

3. QualityReportGenerator

Contributing

License

Acknowledgments

Project details

Verified details

Maintainers

Meta

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes