
A Python library for comprehensive data quality analysis and reporting on tabular datasets.


Data Guardian 🛡️

License: MIT

Data Guardian is a Python library for analyzing the quality of tabular datasets (CSV and Excel). It audits the data, scores it, and generates detailed reports in text, HTML, and PDF formats, so you can judge how far to trust a dataset before you use it.

Whether you're a data scientist, analyst, researcher, or civic tech worker, Data Guardian helps you quickly identify and understand issues like missing values, inconsistencies, outliers, and duplicates before you dive into deeper analysis, visualization, or machine learning.

✨ Key Features

  • Comprehensive Data Profiling: Identifies a wide range of common data quality issues:
    • Missing Values (NaNs)
    • Suspicious Null-like Strings (e.g., "NA", "Null", "", "--")
    • Constant Columns (columns with only one unique value)
    • Duplicated Rows
    • Leading/Trailing Whitespace in string values
    • Mixed Case Values (e.g., "Apple" vs "apple")
    • Potential Numeric Types (object columns that appear fully numeric)
    • Mixed Data Types within object columns
    • Numerical Outliers (using IQR or Z-score methods)
    • Categorical Outliers (rare categories)
  • Intuitive Quality Scoring: Generates scores from 0 to 100 for:
    • Completeness
    • Uniqueness
    • Consistency
    • Validity
    • An Overall Quality Score
  • Detailed Reporting: Produces human-readable reports in multiple formats:
    • Console Text Summary
    • HTML Report (with basic styling, suitable for sharing)
    • PDF Report (for archival and formal documentation)
  • Easy-to-Use API: Simple Python interface to integrate into your data workflows.
  • Command-Line Interface (CLI): Quickly analyze datasets directly from your terminal.
  • File Support: Natively handles CSV and Excel (.xls, .xlsx) files.
  • Configurable Analysis (planned): the ability to tune thresholds and checks.
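
To make the "IQR or Z-score" outlier checks in the list above more concrete, here is a minimal sketch of both rules using only the Python standard library. It is an illustration of the general techniques, not Data Guardian's actual implementation: the function names `iqr_outliers` and `zscore_outliers`, the multiplier `k=1.5`, and the use of the population standard deviation are all assumptions made for this example.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; k=1.5 is the conventional default.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`.

    Hypothetical helper for illustration; uses the population standard
    deviation, which is an assumption of this sketch.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # A constant column has no spread, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print("IQR outliers:", iqr_outliers(sample))
print("Z-score outliers (threshold 2.5):", zscore_outliers(sample, threshold=2.5))
```

On small samples the two rules can disagree: with only nine values, a single extreme point inflates the standard deviation enough that its z-score stays below 3.0, while the IQR rule still flags it. That is one reason a tool might offer both methods.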

🚀 Installation

You can install Data Guardian using pip. Python 3.8 or higher is required.

pip install data-guardian

⚡ Quickstart

from data_guardian import DatasetProfile, PDFReporter # Or HTMLReporter

# 1. Specify the path to your data file
file_path = "path/to/your/dataset.csv" # Or "path/to/your/dataset.xlsx"
# For example, if you downloaded the comprehensive_test_data.csv from the project:
# file_path = "data/comprehensive_test_data.csv"


# 2. Create a DatasetProfile instance
# The name is optional; if not provided, it uses the filename.
# The file_type ('csv' or 'excel') is also optional and will be inferred from the extension.
profile = DatasetProfile(source_path=file_path, name="My Sample Analysis")

# 3. Load the data
if profile.load_data():
    print(f"Successfully loaded: {profile.name}")

    # 4. Run all available data quality analyses
    # You can pass a configuration dictionary if needed for specific analyses,
    # e.g., custom_null_strings or outlier parameters.
    # analysis_config = {
    #     'custom_null_strings': ["N/A", "-", "Not Available"],
    #     'outliers_numerical': {'method': 'zscore', 'threshold': 3.0}
    # }
    # profile.run_analysis(config=analysis_config) # Pass config if using custom settings
    profile.run_analysis() # Uses default settings if config is not passed
    print(f"Analysis complete. Issues found: {len(profile.issues_found)}")

    # 5. Calculate quality scores based on the analysis
    profile.calculate_quality_scores()
    if profile.quality_score:
        print(f"Overall Quality Score: {profile.quality_score.overall_score:.2f}/100")

    # 6. Get a text summary report (printed to console)
    print("\n--- Text Summary Report ---")
    print(profile.get_summary_report())

    # 7. Generate a PDF report
    print("\n--- Generating PDF Report ---")
    pdf_reporter = PDFReporter(profile)
    pdf_output_path = "data_guardian_report.pdf"
    if pdf_reporter.generate_pdf_report(output_path=pdf_output_path):
        print(f"PDF report saved to: {pdf_output_path}")
    else:
        print("Failed to generate PDF report.")

    # # Alternatively, generate an HTML report
    # from data_guardian import HTMLReporter
    # print("\n--- Generating HTML Report ---")
    # html_reporter = HTMLReporter(profile)
    # html_output_path = "data_guardian_report.html"
    # if html_reporter.save_html_report(output_path=html_output_path):
    #     print(f"HTML report saved to: {html_output_path}")
    # else:
    #     print("Failed to generate HTML report.")

else:
    print(f"Failed to load data from: {file_path}")
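
The Quickstart steps can also be wrapped in a small pass/fail check, for example to stop a pipeline when a dataset scores too low. The sketch below relies only on the methods and attributes shown in the Quickstart (`load_data`, `run_analysis`, `calculate_quality_scores`, `quality_score.overall_score`). The helper name `passes_quality_gate` and the default threshold of 80 are hypothetical choices for this example, not part of the library.

```python
def passes_quality_gate(profile, min_score=80.0):
    """Run load -> analyze -> score on `profile` and compare to `min_score`.

    Hypothetical helper: `profile` is expected to behave like the
    DatasetProfile object used in the Quickstart above.
    """
    if not profile.load_data():
        # Data could not be loaded, so the gate fails.
        return False
    profile.run_analysis()
    profile.calculate_quality_scores()
    score = profile.quality_score
    if not score:
        # No score was produced, so the gate fails.
        return False
    return score.overall_score >= min_score


# Possible usage, assuming data-guardian is installed:
# from data_guardian import DatasetProfile
# profile = DatasetProfile(source_path="path/to/your/dataset.csv")
# if not passes_quality_gate(profile, min_score=80.0):
#     raise SystemExit("Dataset quality is below the required threshold.")
```

Keeping the threshold as a parameter lets each project choose its own bar; the value 80 here is arbitrary.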

Command-Line Interface (CLI)

data-guardian-cli path/to/your/dataset.csv -o quality_report.pdf

CLI examples:

# Analyze a CSV and generate a PDF report (default output name)
data-guardian-cli my_data.csv

# Analyze an Excel file and generate an HTML report with a custom name
data-guardian-cli financial_data.xlsx -o financial_audit.html --name "Financial Audit Q1"

# Analyze a CSV, specifying its type, and output to a custom PDF name
data-guardian-cli sales_records -t csv -o sales_quality.pdf --name "Sales Records"
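
If you want to call the CLI from a Python script (for instance, to audit several files in a loop), a sketch along these lines may help. It uses only the flags that appear in the examples above (`-t`, `-o`, `--name`). The helper names `build_cli_command` and `run_cli` are hypothetical, and the meaning of the CLI's exit code is not documented here, so treat the return value as an assumption to verify.

```python
import subprocess


def build_cli_command(data_path, output_path=None, name=None, file_type=None):
    """Build the argument list for data-guardian-cli (hypothetical helper).

    Only the flags shown in the CLI examples are used: -t, -o and --name.
    """
    cmd = ["data-guardian-cli", data_path]
    if file_type:
        cmd += ["-t", file_type]
    if output_path:
        cmd += ["-o", output_path]
    if name:
        cmd += ["--name", name]
    return cmd


def run_cli(data_path, **options):
    """Run the CLI and return its exit code (its semantics are an assumption)."""
    completed = subprocess.run(build_cli_command(data_path, **options), check=False)
    return completed.returncode


# Mirrors the third CLI example above.
print(build_cli_command("sales_records", output_path="sales_quality.pdf",
                        name="Sales Records", file_type="csv"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with names that contain spaces, such as "Sales Records".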

🤝 Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. If you have suggestions for improving the project, feel free to open an issue to discuss them, or fork the repo, create a branch from main, and open a pull request.

  1. Fork the Project (click the "Fork" button on the GitHub repository page: https://github.com/SAAD2003D/data-guardian)
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📧 Contact

Saad Fikri – fsaad1929@gmail.com
