Skip to main content

A comprehensive data quality analysis library for pandas DataFrames

Project description

PyDataQuality

The Enterprise-Grade Data Quality Engine for Python.

PyDataQuality automates the tedious 80% of data science: validating, profiling, and cleaning new datasets. It transforms raw pandas DataFrames into publication-ready quality reports with a single line of code. Designed for high-velocity data teams, it features batch processing for big data, custom rule validation, and AI-powered remediation suggestions.

Data Quality Analysis

Features

Core Capabilities

  • Data Quality Analysis: Detect missing values, outliers, inconsistencies, and data type issues
  • Data Extraction: Pinpoint and extract problematic rows (e.g., specific outliers) for remediation via get_problematic_rows()
  • Multi-Format Support: Native support for CSV, Excel (.xlsx, .xls), JSON, and Parquet files
  • CLI Interface: Auto-detects file formats from the terminal - no coding required
  • Interactive Notebook Display: Direct rendering in Jupyter/Colab with show_report()
  • Batch Sampling: Process large datasets (GBs) efficiently with chunk-based sampling
  • Custom YAML Rules: Enterprise-grade validation with configuration files
  • AI Remediation Prompts: Auto-generated Python scripts for fixing detected issues
  • Comprehensive Visualizations: Publication-quality plots for data quality assessment
  • Professional Reports: HTML, text, and JSON reports with actionable insights
  • Easy Integration: Works seamlessly with pandas DataFrames from any source

Supported Input Formats

CLI: Auto-detects CSV, Excel (.xlsx, .xls), JSON, Parquet (python -m pydataquality data.csv)
Python API: Accepts any pandas DataFrame from:

  • CSV files (pd.read_csv())
  • Excel files (pd.read_excel())
  • JSON files (pd.read_json())
  • Parquet files (pd.read_parquet())
  • SQL databases (pd.read_sql())
  • APIs (any source that produces a DataFrame)
  • Cloud platforms (Snowflake, BigQuery, etc.)

Summary: PyDataQuality works with CSV, Excel, JSON, Parquet files, and any pandas DataFrame from SQL databases, APIs, or cloud platforms. The library is format-agnostic - it only requires a pandas DataFrame as input.

Why PyDataQuality?

The Problem

You just received a new dataset. You need to know what's wrong with it and how to fix it - fast. Existing tools either:

  • Take too long (pandas-profiling: 10+ minutes, 200MB reports)
  • Require too much setup (Great Expectations: YAML configs, checkpoints, data contexts)
  • Don't give actionable insights (pandas .describe(): just basic stats)

The Solution

PyDataQuality fills the gap between "too simple" and "too complex":

"I just got a new dataset. What's wrong with it, and how do I fix it? I need answers in 30 seconds, not 30 minutes."

Comparison with Alternatives

Feature pandas-profiling Great Expectations PyDataQuality
One-liner usage
Extract bad rows
CLI support
Fast on large data
No config needed
AI integration
Beginner-friendly ⚠️

Real-World Example

Scenario: Data scientist gets a CSV with 1M rows

With pandas-profiling:

# Takes 10+ minutes, generates 200MB HTML
profile = ProfileReport(df)
profile.to_file("report.html")
# Now what? How do I fix the issues?

With PyDataQuality:

# Takes 30 seconds
analyzer = pdq.analyze_dataframe(df)
bad_ages = analyzer.get_problematic_rows('age', 'outliers')
bad_ages.to_csv('fix_these.csv')  # Send to data team

Bottom line: PyDataQuality is simple by design. It just works, without making you read 50 pages of documentation.

Installation

# Clone the repository
git clone https://github.com/DominionAkinrotimi/pydataquality.git
cd pydataquality

# Install in development mode
pip install -e .

# Install requirements directly
pip install -r requirements.txt

# For interactive notebook support (Jupyter/Colab)
pip install ".[notebook]"

Quick Start

import pandas as pd
import pydataquality as pdq

# Load your data
df = pd.read_csv('your_data.csv')

# Quick quality check
summary = pdq.quick_quality_check(df, name="My Dataset")

# Comprehensive analysis
analyzer = pdq.analyze_dataframe(df, name="My Dataset")

# Analysis with custom thresholds (e.g., stricter missing data check)
# Analysis with custom thresholds and excluded values
config = {
    'missing_critical': 0.1, 
    'outlier_threshold': 2.0,
    'exclude_values': {
        'age': [150, -5],  # specific values to ignore
        'sales': [-100]
    }
}
analyzer = pdq.analyze_dataframe(df, name="My Dataset", config=config)

# Generate visualizations
visualizer = pdq.create_visual_report(analyzer)

# Generate HTML report
pdq.generate_report(analyzer, output_path="quality_report.html", format='html')

# Get AI assistance for fixes (includes EDA context by default)
prompt = pdq.generate_ai_prompt(analyzer)
print(prompt) 

# Output example:
# "I have a dataset with 1000 rows. Issues: age (outliers: outside [18, 65]).
#  Statistical Context: age (int64): Mean=42, Max=150...
#  Please write a script..."

CLI Usage

Run the analysis directly from your terminal:

# Basic usage
python -m pydataquality data.csv

# Generate professional HTML report
python -m pydataquality data.csv --report html --theme professional

# Create visualizations
python -m pydataquality data.csv --visualize

Documentation

Examples

Check the examples directory for comprehensive usage examples:

  1. Basic Usage: Simple analysis and reporting
  2. Advanced Features: Custom configurations and visualizations
  3. Real-world Scenarios: Handling various data quality issues

Project Structure

PyDataQuality/
├── pydataquality/       # Main package directory
│   ├── __init__.py      # Exports
│   ├── __main__.py      # Package entry point
│   ├── analyzer.py      # Core analysis engine
│   ├── cli.py           # Package CLI logic
│   ├── comparator.py    # Data drift detection
│   ├── config.py        # Settings management
│   ├── reporter.py      # Report generation
│   ├── utils.py         # Helper functions
│   └── visualizer.py    # Visualization engine
├── docs/                # Documentation
├── examples/            # Usage examples
├── tests/               # Unit tests
├── cli.py               # Root CLI script (local usage)
├── setup.py             # Installation script
└── requirements.txt     # Dependencies

Key Components

1. DataQualityAnalyzer

Core analysis engine that examines data structure, detects issues, and computes statistics.

2. DataQualityVisualizer

Creates comprehensive visualizations including:

  • Missing value heatmaps
  • Outlier detection plots
  • Distribution analysis
  • Correlation heatmaps
  • Categorical value distributions

3. QualityReportGenerator

Generates professional reports in multiple formats with actionable recommendations.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with pandas, matplotlib, and seaborn

  • Inspired by real-world data quality challenges

  • Designed for data scientists and analysts

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pydataquality-0.1.0.tar.gz (48.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pydataquality-0.1.0-py3-none-any.whl (56.6 kB view details)

Uploaded Python 3

File details

Details for the file pydataquality-0.1.0.tar.gz.

File metadata

  • Download URL: pydataquality-0.1.0.tar.gz
  • Upload date:
  • Size: 48.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pydataquality-0.1.0.tar.gz
Algorithm Hash digest
SHA256 9b456486983a505678988d868595847b85cca932bb2bbb7be12477695edc21dc
MD5 e9722e343fc10e630d27c53ea7789eba
BLAKE2b-256 3bb32ceaa36a7f7dc4c320e06bf778b5b72778af4dd3ad32ae1e842aff68ff55

See more details on using hashes here.

File details

Details for the file pydataquality-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pydataquality-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 56.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pydataquality-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e6a2d8cce139636d15aed813dcb5248b61ea4d59e0d311b99eb52f057bae55e0
MD5 a2f89dc3d516390bc8ab789b1950b594
BLAKE2b-256 ef83f57bd4a6865eb87c66ca1ff133955df17034f91a40adf29842338bf4a04c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page