A Python library for comprehensive data quality analysis and reporting on tabular datasets.

Data Guardian 🛡️

Data Guardian is a Python library that analyzes the quality of tabular datasets (CSV and Excel). It audits the data for common problems, scores the results, and generates detailed reports in text, HTML, and PDF formats, so you know how far you can trust your data before you use it.

Whether you're a data scientist, analyst, researcher, or civic tech worker, Data Guardian helps you quickly identify and understand issues like missing values, inconsistencies, outliers, and duplicates before you dive into deeper analysis, visualization, or machine learning.

✨ Key Features

Comprehensive Data Profiling: Identifies a wide range of common data quality issues:
- Missing Values (NaNs)
- Suspicious Null-like Strings (e.g., "NA", "Null", "", "--")
- Constant Columns (columns with only one unique value)
- Duplicated Rows
- Leading/Trailing Whitespace in string values
- Mixed Case Values (e.g., "Apple" vs "apple")
- Potential Numeric Types (object columns that appear fully numeric)
- Mixed Data Types within object columns
- Numerical Outliers (using IQR or Z-score methods)
- Categorical Outliers (rare categories)
Intuitive Quality Scoring: Generates scores from 0 to 100 for:
- Completeness
- Uniqueness
- Consistency
- Validity
- An Overall Quality Score
Detailed Reporting: Produces human-readable reports in multiple formats:
- Console Text Summary
- HTML Report (with basic styling, suitable for sharing)
- PDF Report (for archival and formal documentation)
Easy-to-Use API: Simple Python interface to integrate into your data workflows.
Command-Line Interface (CLI): Quickly analyze datasets directly from your terminal.
File Support: Natively handles CSV and Excel (.xls, .xlsx) files.
Configurable Analysis: Planned for a future release (ability to tune thresholds and checks).
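
To make a few of the checks listed above concrete, here is a minimal, dependency-free sketch of how such checks can be expressed in plain Python. It is not Data Guardian's implementation: the function names, the `NULL_LIKE` set, the `k=1.5` IQR multiplier, the `min_share` threshold, and the completeness formula are all assumptions chosen for illustration.

```python
import statistics
from collections import Counter

# Hypothetical defaults for illustration; not Data Guardian's actual list.
NULL_LIKE = {"", "na", "n/a", "null", "none", "--"}


def null_like_count(values):
    """Count strings that look like missing values (trimmed, case-insensitive)."""
    return sum(
        1 for v in values if isinstance(v, str) and v.strip().lower() in NULL_LIKE
    )


def padded_values(values):
    """Return the string values that carry leading or trailing whitespace."""
    return [v for v in values if isinstance(v, str) and v != v.strip()]


def duplicate_row_count(rows):
    """Count rows that repeat an earlier row exactly."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)


def iqr_outliers(numbers, k=1.5):
    """Flag numbers outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in numbers if x < low or x > high]


def rare_categories(values, min_share=0.05):
    """Return categories whose share of all values is below min_share."""
    counts = Counter(values)
    total = len(values)
    return sorted(c for c, n in counts.items() if n / total < min_share)


def completeness_score(values):
    """Percentage (0 to 100) of values that are neither None nor null-like."""
    if not values:
        return 100.0
    missing = sum(1 for v in values if v is None) + null_like_count(values)
    return 100.0 * (1 - missing / len(values))
```

Each function takes a single column as a list (or a list of rows for the duplicate check), which keeps the logic easy to read. A real profiler would typically run equivalent checks on a DataFrame.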

🚀 Installation

You can install Data Guardian using pip. Python 3.8 or higher is required.

pip install data-guardian
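
If you want to confirm the environment before or after installing, the following sketch checks the Python version requirement stated above and looks up the installed distribution. The helper names are hypothetical; only the standard library is used.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 8)  # minimum version stated in the installation note


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(dist_name="data-guardian"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("data-guardian version:", installed_version())
```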

⚡ Quickstart

from data_guardian import DatasetProfile, PDFReporter # Or HTMLReporter

# 1. Specify the path to your data file
file_path = "path/to/your/dataset.csv" # Or "path/to/your/dataset.xlsx"
# For example, if you downloaded the comprehensive_test_data.csv from the project:
# file_path = "data/comprehensive_test_data.csv"


# 2. Create a DatasetProfile instance
# The name is optional; if not provided, it uses the filename.
# The file_type ('csv' or 'excel') is also optional and will be inferred from the extension.
profile = DatasetProfile(source_path=file_path, name="My Sample Analysis")

# 3. Load the data
if profile.load_data():
    print(f"Successfully loaded: {profile.name}")

    # 4. Run all available data quality analyses
    # You can pass a configuration dictionary if needed for specific analyses,
    # e.g., custom_null_strings or outlier parameters.
    # analysis_config = {
    #     'custom_null_strings': ["N/A", "-", "Not Available"],
    #     'outliers_numerical': {'method': 'zscore', 'threshold': 3.0}
    # }
    # profile.run_analysis(config=analysis_config) # Pass config if using custom settings
    profile.run_analysis() # Uses default settings if config is not passed
    print(f"Analysis complete. Issues found: {len(profile.issues_found)}")

    # 5. Calculate quality scores based on the analysis
    profile.calculate_quality_scores()
    if profile.quality_score:
        print(f"Overall Quality Score: {profile.quality_score.overall_score:.2f}/100")

    # 6. Get a text summary report (printed to console)
    print("\n--- Text Summary Report ---")
    print(profile.get_summary_report())

    # 7. Generate a PDF report
    print("\n--- Generating PDF Report ---")
    pdf_reporter = PDFReporter(profile)
    pdf_output_path = "data_guardian_report.pdf"
    if pdf_reporter.generate_pdf_report(output_path=pdf_output_path):
        print(f"PDF report saved to: {pdf_output_path}")
    else:
        print("Failed to generate PDF report.")

    # # Alternatively, generate an HTML report
    # from data_guardian import HTMLReporter
    # print("\n--- Generating HTML Report ---")
    # html_reporter = HTMLReporter(profile)
    # html_output_path = "data_guardian_report.html"
    # if html_reporter.save_html_report(output_path=html_output_path):
    #     print(f"HTML report saved to: {html_output_path}")
    # else:
    #     print("Failed to generate HTML report.")

else:
    print(f"Failed to load data from: {file_path}")
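
To try the Quickstart without your own data, you can generate a small CSV that contains several of the issue types listed under Key Features. This is a sketch using only the standard library; the file name, column names, and values are invented for illustration and are unrelated to the project's comprehensive_test_data.csv.

```python
import csv
from pathlib import Path


def write_sample_csv(path="sample_with_issues.csv"):
    """Write a tiny CSV with deliberately planted data quality issues."""
    rows = [
        ["id", "city", "amount", "source"],  # "source" is a constant column
        ["1", "Paris", "10.5", "web"],
        ["2", " paris", "11.0", "web"],      # leading whitespace, mixed case
        ["3", "NA", "", "web"],              # null-like string, missing value
        ["4", "Lyon", "9.8", "web"],
        ["4", "Lyon", "9.8", "web"],         # duplicated row
        ["5", "Nice", "9999", "web"],        # likely numerical outlier
    ]
    target = Path(path)
    with target.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return target
```

Calling `file_path = str(write_sample_csv())` before step 2 of the Quickstart gives you a file to analyze. Which of the planted issues are reported depends on the library's default thresholds.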

Command-Line Interface (CLI)

data-guardian-cli path/to/your/dataset.csv -o quality_report.pdf

CLI examples:

# Analyze a CSV and generate a PDF report (default output name)
data-guardian-cli my_data.csv

# Analyze an Excel file and generate an HTML report with a custom name
data-guardian-cli financial_data.xlsx -o financial_audit.html --name "Financial Audit Q1"

# Analyze a CSV, specifying its type, and output to a custom PDF name
data-guardian-cli sales_records -t csv -o sales_quality.pdf --name "Sales Records"
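
For batch jobs, the CLI can also be driven from a Python script. The sketch below uses only the `-o` and `--name` options shown in the examples above. The helper names and the `reports` directory are hypothetical, and it assumes (as the examples suggest, but the source does not state) that the report format follows the output file's extension. The meaning of the CLI's exit codes is not documented here, so the script only records them.

```python
import subprocess
from pathlib import Path


def build_command(data_file, output_dir="reports", suffix=".pdf"):
    """Build the data-guardian-cli argument list for one input file."""
    data_file = Path(data_file)
    output = Path(output_dir) / (data_file.stem + "_quality" + suffix)
    return [
        "data-guardian-cli",
        str(data_file),
        "-o", str(output),
        "--name", data_file.stem,
    ]


def run_batch(data_dir, output_dir="reports"):
    """Run the CLI on every CSV in data_dir and collect the exit codes."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = {}
    for data_file in sorted(Path(data_dir).glob("*.csv")):
        completed = subprocess.run(build_command(data_file, output_dir), check=False)
        results[data_file.name] = completed.returncode
    return results
```

Passing the command as a list instead of a single string avoids shell quoting problems with file names that contain spaces.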

🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. If you have a suggestion, feel free to open an issue to discuss it, or fork the repo, create a branch from main, and open a pull request.

1. Fork the Project (click the "Fork" button on the GitHub repository page: https://github.com/SAAD2003D/data-guardian)
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

📧 Contact

saad fikri – fsaad1929@gmail.com

Project details

Version 0.1.0, released May 8, 2025

Download files

Source Distribution

data_guardian-0.1.0.tar.gz (25.8 kB)

Uploaded May 8, 2025 Source

Built Distribution

data_guardian-0.1.0-py3-none-any.whl (28.8 kB)

Uploaded May 8, 2025 Python 3

File details

Details for the file data_guardian-0.1.0.tar.gz.

File metadata

Download URL: data_guardian-0.1.0.tar.gz
Upload date: May 8, 2025
Size: 25.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.9

File hashes

Hashes for data_guardian-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`9c8be5bef7a82a6676f1e7cdfc416b4a70ab31a1a421f64213adddc23e7b15da`
MD5	`d1f4c49b811521fe901cf2a9e041820e`
BLAKE2b-256	`4810fbaf3fe79b7767e0764873bbef3eb9ecc932e5a93a64ea397e6d8b82d617`

File details

Details for the file data_guardian-0.1.0-py3-none-any.whl.

File metadata

Download URL: data_guardian-0.1.0-py3-none-any.whl
Upload date: May 8, 2025
Size: 28.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.9

File hashes

Hashes for data_guardian-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e0ca5d4424cdeb20198a6a498e623b9b6ad9431aece78a0d34672963ec9ff3a`
MD5	`35fbd8d6f2e957ad1b9e15ef93bc84d7`
BLAKE2b-256	`66a35ddb27bfec66f3deb14a49dff83ca0422115e2af5260401282fbb68a259e`
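
The SHA256 digests listed for the two files can be checked against a downloaded copy with Python's standard `hashlib` module. The sketch below copies the expected digests from the tables above; the function names are hypothetical, and the path you pass in is wherever you saved the file.

```python
import hashlib

# SHA256 digests copied from the "File hashes" tables above.
EXPECTED_SHA256 = {
    "data_guardian-0.1.0.tar.gz":
        "9c8be5bef7a82a6676f1e7cdfc416b4a70ab31a1a421f64213adddc23e7b15da",
    "data_guardian-0.1.0-py3-none-any.whl":
        "7e0ca5d4424cdeb20198a6a498e623b9b6ad9431aece78a0d34672963ec9ff3a",
}


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Return True if the file at path has the digest published for filename."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, `matches_published_hash("downloads/data_guardian-0.1.0.tar.gz", "data_guardian-0.1.0.tar.gz")` returns True only when the downloaded archive is byte-for-byte identical to the published one.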
