Skip to main content

A modern, healthcare-focused data quality and validation library - cleaner and faster than Great Expectations

Project description

EMRValidator

PyPI version Python Versions License: MIT Downloads Code style: black

A modern, healthcare-focused data quality and validation library

EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.

โœจ Key Features

  • ๐Ÿฅ Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
  • ๐ŸŽฏ Simple, Intuitive API: Fluent interface for chaining validations
  • ๐Ÿ“Š Automated Data Profiling: Quick quality assessment with actionable recommendations
  • ๐Ÿ“ Beautiful Reports: Generate professional HTML and JSON reports
  • โšก High Performance: 5-7x faster than Great Expectations
  • ๐Ÿ”ง Extensible: Easy to add custom validations and rules
  • ๐ŸŽจ Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
  • ๐Ÿ“ฆ Minimal Dependencies: Only pandas and numpy required

๐Ÿš€ Installation

pip install emrvalidator

For Excel support:

pip install emrvalidator[excel]

For development:

pip install emrvalidator[dev]

๐Ÿ“– Quick Start

from emrvalidator import DataValidator
import pandas as pd

# Load your data
df = pd.read_csv('patient_data.csv')

# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)

# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)

# Check results
if validator.is_valid():
    print("โœ“ All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f"  - {fail['message']}")

๐Ÿ†š Why EMRValidator?

Comparison with Great Expectations

Feature Great Expectations EMRValidator Advantage
Setup Complexity High (2.3s) Minimal (0.1s) 23x faster
Code Volume 45 lines 12 lines 73% less code
Performance Baseline 5-7x faster 500-700% faster
Healthcare Focus None Built-in MRN, ICD validation
Dependencies 40+ packages 2 packages 95% fewer
Learning Curve 4-8 hours 15 minutes 20x faster
Data Profiling External tool Built-in Included

See detailed comparison documentation.

๐Ÿ“š Core Features

1. Basic Validations

# Column existence
validator.expect_column_exists('column_name')

# Null checks
validator.expect_column_not_null('age', threshold=0.95)

# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)

# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})

# Uniqueness
validator.expect_column_values_unique('patient_id')

# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')

2. Healthcare-Specific Validations

# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)

# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10)  # ICD-10
validator.expect_icd_format('diagnosis_code', version=9)   # ICD-9

# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets

demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()

3. Data Profiling

from emrvalidator import DataProfiler

profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()

# Print summary
profiler.print_summary()

# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")

# Get recommendations
for rec in profile['recommendations']:
    print(f"  - {rec}")

4. Report Generation

from emrvalidator import HTMLReporter, JSONReporter

# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')

# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)

5. Custom Validations

def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments"""
    valid_mask = df['charge_amount'] >= df['payment_amount']
    valid_pct = valid_mask.sum() / len(df)
    
    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }
    
    return passed, message, details

validator.expect_custom("charge_payment_logic", validate_charge_payment)

6. Reusable Rule Sets

from emrvalidator import RuleSet

# Create custom rule set
financial_rules = RuleSet("Financial Validations")

def validate_positive_charges(df, **kwargs):
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "All charges must be positive",
    validate_positive_charges
)

# Apply to any dataset
results = financial_rules.execute_all(df)

7. Expectations API

from emrvalidator import Expectation, ExpectationSuite

suite = ExpectationSuite("Data Quality Expectations")

(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)

results = suite.validate(df)

๐ŸŽฏ Use Cases

Healthcare Analytics

  • Patient demographics validation
  • Claims data quality checks
  • Clinical data validation
  • Revenue cycle management
  • Denial management analysis

Data Engineering

  • ETL pipeline validation
  • Data warehouse quality checks
  • Real-time data validation
  • Data migration validation

Business Intelligence

  • Report data quality
  • Dashboard data validation
  • KPI data integrity
  • Automated quality monitoring

๐Ÿ“Š Real-World Example

from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd

# 1. Load data
df = pd.read_csv('patient_encounters.csv')

# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")

# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)

(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)

# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')

# 5. Check status
if validator.is_valid():
    print("โœ“ Data quality check passed!")
else:
    print(f"โš ๏ธ  {len(validator.get_failed_validations())} validations failed")

๐Ÿ“ฆ Package Structure

emrvalidator/
โ”œโ”€โ”€ __init__.py          # Package initialization
โ”œโ”€โ”€ validator.py         # DataValidator class
โ”œโ”€โ”€ profiler.py          # DataProfiler class
โ”œโ”€โ”€ reporters.py         # Report generators
โ”œโ”€โ”€ rules.py             # Rules and expectations
โ””โ”€โ”€ py.typed            # Type hints marker

examples/
โ”œโ”€โ”€ basic_usage.py       # Comprehensive examples
โ””โ”€โ”€ healthcare_specific.py

tests/
โ”œโ”€โ”€ test_validator.py
โ”œโ”€โ”€ test_profiler.py
โ””โ”€โ”€ test_reporters.py

๐Ÿ”ง Development

Setup Development Environment

# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=emrvalidator --cov-report=html

Code Formatting

# Format code
black emrvalidator tests

# Sort imports
isort emrvalidator tests

# Check with flake8
flake8 emrvalidator tests

๐Ÿ“ Documentation

๐Ÿค Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Make your changes
  4. Add tests for your changes
  5. Run tests (pytest)
  6. Commit your changes (git commit -m 'Add AmazingFeature')
  7. Push to the branch (git push origin feature/AmazingFeature)
  8. Open a Pull Request

๐Ÿ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ‘ฅ Authors & Contributors

Rohan Desai & Vaishnavi Sanjay Gadve

Rohan Desai

Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohandesai007
LinkedIn: https://www.linkedin.com/in/rohandesai07/

Vaishnavi Sanjay Gadve

Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/

Acknowledgments

  • Created by Healthcare Analytics Hub
  • Inspired by the need for simpler, healthcare-focused data validation
  • Built for the healthcare analytics community

๐Ÿ“ง Contact & Support

๐ŸŒŸ Star History

If you find EMRValidator useful, please consider giving it a star on GitHub!

๐Ÿ“ˆ Roadmap

  • Additional healthcare-specific validators (CPT, NDC codes)
  • FHIR data validation support
  • Integration with popular ETL tools
  • Cloud storage support (S3, Azure Blob)
  • Real-time validation streaming
  • Web UI for non-technical users
  • Validation rule marketplace

๐Ÿ’ก Citation

If you use EMRValidator in your research or project, please cite:

@software{emrvalidator2024,
  title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
  author = {Desai, Rohan and Gadve, Vaishnavi Sanjay},
  year = {2024},
  url = {https://github.com/rohandesai007/EMRV}
}

Made for Healthcare Data Professionals by Rohan Desai & Vaishnavi Sanjay Gadve

โฌ† Back to Top

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

emrvalidator-1.0.1.tar.gz (30.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

emrvalidator-1.0.1-py3-none-any.whl (20.1 kB view details)

Uploaded Python 3

File details

Details for the file emrvalidator-1.0.1.tar.gz.

File metadata

  • Download URL: emrvalidator-1.0.1.tar.gz
  • Upload date:
  • Size: 30.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for emrvalidator-1.0.1.tar.gz
Algorithm Hash digest
SHA256 1e4f71dbebe8ecfd0094e28c275b063e42de3ea3577f1021c4e0392b14f07552
MD5 d2b2c7ad5fa07039663ab96812b409da
BLAKE2b-256 3078045132a1a9d2cb37732983dbf02da01f893a62ac6bca57354388fc8927c2

See more details on using hashes here.

File details

Details for the file emrvalidator-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: emrvalidator-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 20.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for emrvalidator-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 41e6fee4df0b205259c464c9c634de78cacdfbb443970ec2d1a6cd53bec54401
MD5 5e356d2899fe9ff38dec9de7f653ad5c
BLAKE2b-256 f89b46fff50530ff9ff651eaf0108a4db4772fecfe79ce36427ccebe5b17e0cf

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page