A Python library for comprehensive data quality analysis and reporting on tabular datasets.
Data Guardian 🛡️
Data Guardian is a Python library that analyzes the quality of your tabular datasets (CSV and Excel). It audits each dataset, scores its quality, and generates detailed reports in text, HTML, and PDF formats, so you can judge how far to trust your data before you use it.
Whether you're a data scientist, analyst, researcher, or civic tech worker, Data Guardian helps you quickly identify and understand issues like missing values, inconsistencies, outliers, and duplicates before you dive into deeper analysis, visualization, or machine learning.
✨ Key Features
- Comprehensive Data Profiling: Identifies a wide range of common data quality issues:
  - Missing Values (NaNs)
  - Suspicious Null-like Strings (e.g., "NA", "Null", "", "--")
  - Constant Columns (columns with only one unique value)
  - Duplicated Rows
  - Leading/Trailing Whitespace in string values
  - Mixed Case Values (e.g., "Apple" vs "apple")
  - Potential Numeric Types (object columns that appear fully numeric)
  - Mixed Data Types within object columns
  - Numerical Outliers (using IQR or Z-score methods)
  - Categorical Outliers (rare categories)
- Intuitive Quality Scoring: Generates scores from 0 to 100 for:
  - Completeness
  - Uniqueness
  - Consistency
  - Validity
  - An Overall Quality Score
- Detailed Reporting: Produces human-readable reports in multiple formats:
  - Console Text Summary
  - HTML Report (with basic styling, suitable for sharing)
  - PDF Report (for archival and formal documentation)
- Easy-to-Use API: Simple Python interface to integrate into your data workflows.
- Command-Line Interface (CLI): Quickly analyze datasets directly from your terminal.
- File Support: Natively handles CSV and Excel (.xls, .xlsx) files.
- Configurable Analysis: (Future: ability to tune thresholds and checks).
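To make the "Numerical Outliers (using IQR or Z-score methods)" check more concrete, here is a minimal standard-library sketch of the two techniques. It is not Data Guardian's implementation: the function names `iqr_outliers` and `zscore_outliers`, the `k=1.5` multiplier, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds threshold (hypothetical helper)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (an assumption)
    if stdev == 0:
        # A constant column has no spread, so nothing can be flagged.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged_by_iqr = iqr_outliers(sample)
flagged_by_zscore = zscore_outliers(sample, threshold=2.5)
```

One design note: with the population standard deviation, no value in a sample of n points can have an absolute z-score above sqrt(n - 1), so a threshold of 3.0 never flags anything in a nine-value sample like the one above. The IQR rule does not have that limitation, which is one reason a tool might offer both methods.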
🚀 Installation
You can install Data Guardian using pip. Python 3.8 or higher is required.
pip install data-guardian
⚡ Quickstart
from data_guardian import DatasetProfile, PDFReporter  # Or HTMLReporter

# 1. Specify the path to your data file
file_path = "path/to/your/dataset.csv"  # Or "path/to/your/dataset.xlsx"
# For example, if you downloaded the comprehensive_test_data.csv from the project:
# file_path = "data/comprehensive_test_data.csv"

# 2. Create a DatasetProfile instance
# The name is optional; if not provided, it uses the filename.
# The file_type ('csv' or 'excel') is also optional and will be inferred from the extension.
profile = DatasetProfile(source_path=file_path, name="My Sample Analysis")

# 3. Load the data
if profile.load_data():
    print(f"Successfully loaded: {profile.name}")

    # 4. Run all available data quality analyses
    # You can pass a configuration dictionary if needed for specific analyses,
    # e.g., custom_null_strings or outlier parameters.
    # analysis_config = {
    #     'custom_null_strings': ["N/A", "-", "Not Available"],
    #     'outliers_numerical': {'method': 'zscore', 'threshold': 3.0}
    # }
    # profile.run_analysis(config=analysis_config)  # Pass config if using custom settings
    profile.run_analysis()  # Uses default settings if config is not passed
    print(f"Analysis complete. Issues found: {len(profile.issues_found)}")

    # 5. Calculate quality scores based on the analysis
    profile.calculate_quality_scores()
    if profile.quality_score:
        print(f"Overall Quality Score: {profile.quality_score.overall_score:.2f}/100")

    # 6. Get a text summary report (printed to console)
    print("\n--- Text Summary Report ---")
    print(profile.get_summary_report())

    # 7. Generate a PDF report
    print("\n--- Generating PDF Report ---")
    pdf_reporter = PDFReporter(profile)
    pdf_output_path = "data_guardian_report.pdf"
    if pdf_reporter.generate_pdf_report(output_path=pdf_output_path):
        print(f"PDF report saved to: {pdf_output_path}")
    else:
        print("Failed to generate PDF report.")

    # Alternatively, generate an HTML report
    # from data_guardian import HTMLReporter
    # print("\n--- Generating HTML Report ---")
    # html_reporter = HTMLReporter(profile)
    # html_output_path = "data_guardian_report.html"
    # if html_reporter.save_html_report(output_path=html_output_path):
    #     print(f"HTML report saved to: {html_output_path}")
    # else:
    #     print("Failed to generate HTML report.")
else:
    print(f"Failed to load data from: {file_path}")
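The Quickstart comments note that file_type ('csv' or 'excel') is inferred from the file extension when omitted. A minimal sketch of what such inference might look like follows. This is an illustration only, not the library's code: the `infer_file_type` function and the `EXTENSION_TO_TYPE` mapping are hypothetical names, and returning `None` for an unknown extension is an assumption.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping, built from the extensions the README lists as supported.
EXTENSION_TO_TYPE = {".csv": "csv", ".xls": "excel", ".xlsx": "excel"}


def infer_file_type(path: str) -> Optional[str]:
    """Return 'csv' or 'excel' based on the extension, or None if it is unknown."""
    # Lowercase the suffix so "REPORT.XLSX" is treated like "report.xlsx".
    return EXTENSION_TO_TYPE.get(Path(path).suffix.lower())


csv_type = infer_file_type("data/comprehensive_test_data.csv")
unknown_type = infer_file_type("sales_records")  # no extension to inspect
```

A path without an extension, such as `sales_records`, cannot be inferred this way, which is the situation where passing the type explicitly becomes necessary.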
Command-Line Interface (CLI)
data-guardian-cli path/to/your/dataset.csv -o quality_report.pdf
CLI examples:
# Analyze a CSV and generate a PDF report (default output name)
data-guardian-cli my_data.csv
# Analyze an Excel file and generate an HTML report with a custom name
data-guardian-cli financial_data.xlsx -o financial_audit.html --name "Financial Audit Q1"
# Analyze a CSV, specifying its type, and output to a custom PDF name
data-guardian-cli sales_records -t csv -o sales_quality.pdf --name "Sales Records"
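If you want to drive the CLI from a Python script, for instance to audit several files in a loop, the commands above can be assembled programmatically. The sketch below uses only the flags shown in the examples (`-t`, `-o`, `--name`); the helper names `build_cli_command` and `run_cli` are hypothetical, and the meaning of the CLI's exit code is an assumption, since the README does not document it.

```python
import subprocess
from typing import List, Optional


def build_cli_command(path: str,
                      output: Optional[str] = None,
                      name: Optional[str] = None,
                      file_type: Optional[str] = None) -> List[str]:
    """Assemble a data-guardian-cli argument list (hypothetical helper)."""
    cmd = ["data-guardian-cli", path]
    if file_type:
        cmd += ["-t", file_type]
    if output:
        cmd += ["-o", output]
    if name:
        cmd += ["--name", name]
    return cmd


def run_cli(cmd: List[str]) -> int:
    """Run the command and return its exit code (requires Data Guardian installed)."""
    # Passing a list avoids shell quoting problems with names that contain spaces.
    return subprocess.run(cmd, check=False).returncode


command = build_cli_command("sales_records", output="sales_quality.pdf",
                            name="Sales Records", file_type="csv")
```

Calling `run_cli(command)` would then execute the same invocation as the last example above.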
🤝 Contributing
Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. If you have suggestions for improvements, feel free to open an issue to discuss them, or fork the repo, create a branch from main, and open a pull request:
1. Fork the Project (click the "Fork" button on the GitHub repository page: https://github.com/SAAD2003D/data-guardian)
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request
📧 Contact
saad fikri – fsaad1929@gmail.com