A Data Validation Tool for Healthcare Data

EMRValidator

A modern, healthcare-focused data quality and validation library

EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.

Key Features

  • Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
  • Simple, Intuitive API: Fluent interface for chaining validations
  • Automated Data Profiling: Quick quality assessment with actionable recommendations
  • Beautiful Reports: Generate professional HTML and JSON reports
  • High Performance: 5-7x faster than Great Expectations
  • Extensible: Easy to add custom validations and rules
  • Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
  • Minimal Dependencies: Only pandas and numpy required

Use EMRValidator If You Need:

  • Data quality rules for EMR, claims, or clinical datasets
  • Fast validation for millions of rows
  • Healthcare-specific formats (MRN and ICD today; CPT and NDC are listed on the roadmap)
  • Validation in ETL, Airflow, dbt, or LLM pipelines

Don't Use It If:

  • You need schema evolution tracking across multiple batches

Installation

pip install emrvalidator

For Excel support:

pip install "emrvalidator[excel]"  # quotes keep shells such as zsh from expanding the brackets

For development:

pip install "emrvalidator[dev]"

Quick Start

from emrvalidator import DataValidator
import pandas as pd

# Load your data
df = pd.read_csv('patient_data.csv')

# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)

# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)

# Check results
if validator.is_valid():
    print("✓ All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f"  - {fail['message']}")

Why EMRValidator?

Comparison with Great Expectations

| Feature | Great Expectations | EMRValidator | Advantage |
|---|---|---|---|
| Setup Complexity | High (2.3s) | Minimal (0.1s) | 23x faster |
| Code Volume | 45 lines | 12 lines | 73% less code |
| Performance | Baseline | 5-7x faster | 5-7x speedup |
| Healthcare Focus | None | Built-in | MRN, ICD validation |
| Dependencies | 40+ packages | 2 packages | 95% fewer |
| Learning Curve | 4-8 hours | 15 minutes | 16-32x shorter |
| Data Profiling | External tool | Built-in | Included |

See detailed comparison documentation.

Core Features

1. Basic Validations

# Column existence
validator.expect_column_exists('column_name')

# Null checks
validator.expect_column_not_null('age', threshold=0.95)

# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)

# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})

# Uniqueness
validator.expect_column_values_unique('patient_id')

# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')
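
The `threshold` argument in the calls above reads as the minimum fraction of rows that must satisfy a check. The sketch below illustrates that interpretation in plain Python. `passes_threshold` is a hypothetical helper, not part of EMRValidator, and the inclusive `>=` comparison is an assumption about how the library treats the boundary.

```python
from typing import Callable, Iterable, Tuple


def passes_threshold(
    values: Iterable[object],
    check: Callable[[object], bool],
    threshold: float = 1.0,
) -> Tuple[bool, float]:
    """Return (passed, valid_fraction) for a row-level check.

    Hypothetical helper: assumes `threshold` is the minimum fraction of
    rows that must satisfy `check` (inclusive comparison).
    """
    rows = list(values)
    if not rows:
        # Treat an empty column as trivially valid to avoid dividing by zero.
        return True, 1.0
    valid_fraction = sum(1 for v in rows if check(v)) / len(rows)
    return valid_fraction >= threshold, valid_fraction


# Made-up sample column with one missing value and one out-of-range value.
ages = [34, 58, None, 71, 45, 130, 22, 67, 80, 19]

# Null check: 9 of 10 values are present, so a 0.95 threshold fails.
not_null_ok, not_null_frac = passes_threshold(ages, lambda v: v is not None, 0.95)

# Range check: 8 of 10 values are non-null and within 0..120.
in_range_ok, in_range_frac = passes_threshold(
    ages, lambda v: v is not None and 0 <= v <= 120, 0.80
)
print(not_null_ok, not_null_frac)
print(in_range_ok, in_range_frac)
```

Under this reading, a threshold of 1.0 demands that every row pass, while lower values tolerate a known share of dirty rows.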

2. Healthcare-Specific Validations

# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)

# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10)  # ICD-10
validator.expect_icd_format('diagnosis_code', version=9)   # ICD-9

# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets

demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()
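
To give a sense of what a format check for ICD codes involves, here is a minimal sketch using regular expressions. It is not EMRValidator's implementation: `looks_like_icd` and both patterns are assumptions made for illustration, and they test only the general shape of a code, not whether the code exists in an official code set.

```python
import re

# Structural approximations only; they do not check that a code exists
# in the official ICD code sets.
# ICD-10: letter, digit, alphanumeric, then an optional dot and 1-4 more characters.
ICD10_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")
# ICD-9: 3 digits or V + 2 digits (optional 1-2 decimals), or E + 3 digits (optional 1 decimal).
ICD9_PATTERN = re.compile(
    r"^(?:[0-9]{3}|V[0-9]{2})(?:\.[0-9]{1,2})?$|^E[0-9]{3}(?:\.[0-9])?$"
)


def looks_like_icd(code: str, version: int = 10) -> bool:
    """Return True if `code` has the general shape of an ICD-9 or ICD-10 code."""
    if version == 10:
        pattern = ICD10_PATTERN
    elif version == 9:
        pattern = ICD9_PATTERN
    else:
        raise ValueError(f"Unsupported ICD version: {version}")
    # Normalize surrounding whitespace and case before matching.
    return bool(pattern.match(code.strip().upper()))


print(looks_like_icd("E11.9"))            # ICD-10 shaped
print(looks_like_icd("250.00", version=9))  # ICD-9 shaped
print(looks_like_icd("11.9"))             # missing the leading letter
```

A shape check like this catches truncated or mistyped codes cheaply; confirming that a code is actually valid would require a lookup against the published code list.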

3. Data Profiling

from emrvalidator import DataProfiler

profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()

# Print summary
profiler.print_summary()

# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")

# Get recommendations
for rec in profile['recommendations']:
    print(f"  - {rec}")
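
The formula behind `quality_score` is not described here. As a rough illustration of one ingredient a profiler might use, the sketch below computes per-column completeness and derives simple recommendations from it. The function names, the 0.95 cutoff, and the sample rows are all hypothetical, and rows are plain dictionaries so the example runs without pandas.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def column_completeness(rows: List[Row]) -> Dict[str, float]:
    """Return the fraction of non-missing values per column."""
    if not rows:
        return {}
    columns = sorted({key for row in rows for key in row})
    return {
        col: sum(1 for row in rows if row.get(col) is not None) / len(rows)
        for col in columns
    }


def recommend(completeness: Dict[str, float], minimum: float = 0.95) -> List[str]:
    """Suggest a follow-up for each column below the `minimum` completeness."""
    return [
        f"Column '{col}' is {frac:.0%} complete; investigate missing values"
        for col, frac in completeness.items()
        if frac < minimum
    ]


# Made-up sample rows for illustration.
rows = [
    {"mrn": "A1001", "age": 42, "gender": "F"},
    {"mrn": "A1002", "age": None, "gender": "M"},
    {"mrn": "A1003", "age": 67, "gender": None},
    {"mrn": "A1004", "age": None, "gender": "F"},
]
completeness = column_completeness(rows)
recommendations = recommend(completeness)
for rec in recommendations:
    print(f"  - {rec}")
```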

4. Report Generation

from emrvalidator import HTMLReporter, JSONReporter

# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')

# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)

5. Custom Validations

def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments"""
    valid_mask = df['charge_amount'] >= df['payment_amount']
    valid_pct = valid_mask.sum() / len(df)
    
    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }
    
    return passed, message, details

validator.expect_custom("charge_payment_logic", validate_charge_payment)
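
The custom function above follows a simple contract: accept the dataset plus keyword arguments and return `(passed, message, details)`. When several checks share that shape, a small factory can remove the repetition. The sketch below is one way to do that; `make_ratio_validator` is a hypothetical helper, and the demo uses a dict of lists as a stand-in for a DataFrame so that it runs without pandas. With a real DataFrame, `mask_fn` would return a boolean Series instead.

```python
from typing import Any, Callable, Dict, Sequence, Tuple

ValidationResult = Tuple[bool, str, Dict[str, Any]]


def make_ratio_validator(
    label: str,
    mask_fn: Callable[[Any], Sequence[bool]],
    min_ratio: float = 0.95,
) -> Callable[..., ValidationResult]:
    """Build a validator returning (passed, message, details).

    Hypothetical factory: `mask_fn` maps the dataset to one boolean per row.
    """

    def validate(df: Any, **kwargs: Any) -> ValidationResult:
        mask = list(mask_fn(df))
        total = len(mask)
        valid = sum(1 for ok in mask if ok)
        ratio = valid / total if total else 1.0  # empty data passes trivially
        details = {
            "valid_percentage": round(ratio * 100, 2),
            "invalid_count": total - valid,
        }
        message = f"{ratio * 100:.2f}% of rows satisfy {label}"
        return ratio >= min_ratio, message, details

    return validate


# Stand-in for a DataFrame so the sketch runs without pandas (made-up values).
claims = {
    "charge_amount": [120.0, 80.0, 45.5, 300.0],
    "payment_amount": [100.0, 95.0, 45.5, 250.0],
}

check_charge_payment = make_ratio_validator(
    "charge >= payment",
    lambda d: [c >= p for c, p in zip(d["charge_amount"], d["payment_amount"])],
    min_ratio=0.95,
)
passed, message, details = check_charge_payment(claims)
print(passed, message, details)
```

A function built this way has the same signature and return shape as `validate_charge_payment`, so it could presumably be registered through `expect_custom` in the same manner.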

6. Reusable Rule Sets

from emrvalidator import RuleSet

# Create custom rule set
financial_rules = RuleSet("Financial Validations")

def validate_positive_charges(df, **kwargs):
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "All charges must be positive",
    validate_positive_charges
)

# Apply to any dataset
results = financial_rules.execute_all(df)

7. Expectations API

from emrvalidator import Expectation, ExpectationSuite

suite = ExpectationSuite("Data Quality Expectations")

(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)

results = suite.validate(df)

Use Cases

Healthcare Analytics

  • Patient demographics validation
  • Claims data quality checks
  • Clinical data validation
  • Revenue cycle management
  • Denial management analysis

Data Engineering

  • ETL pipeline validation
  • Data warehouse quality checks
  • Real-time data validation
  • Data migration validation

Business Intelligence

  • Report data quality
  • Dashboard data validation
  • KPI data integrity
  • Automated quality monitoring

Real-World Example

from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd

# 1. Load data
df = pd.read_csv('patient_encounters.csv')

# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")

# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)

(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)

# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')

# 5. Check status
if validator.is_valid():
    print("✓ Data quality check passed!")
else:
    print(f"⚠  {len(validator.get_failed_validations())} validations failed")
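
In an ETL or scheduled pipeline, printing a warning is usually not enough; the job should stop when data quality is unacceptable. The sketch below shows one way to turn failed validations into a hard failure. `DataQualityError` and `enforce_quality_gate` are hypothetical names, and the input is assumed to be a list of dictionaries with a `message` key, matching how failures are printed in the Quick Start. The sample messages are made up.

```python
from typing import Dict, Iterable, List


class DataQualityError(RuntimeError):
    """Raised when too many validations fail (hypothetical exception)."""


def enforce_quality_gate(
    failed: Iterable[Dict[str, object]], max_failures: int = 0
) -> None:
    """Raise DataQualityError if more than `max_failures` validations failed."""
    messages: List[str] = [str(f.get("message", "unknown failure")) for f in failed]
    if len(messages) > max_failures:
        summary = "; ".join(messages)
        raise DataQualityError(f"{len(messages)} validation(s) failed: {summary}")


# Example input shaped like the failure dicts printed in the Quick Start.
failed_validations = [
    {"message": "Column 'discharge_date' has 3 null values"},
    {"message": "Column 'length_of_stay' has values outside 0-365"},
]

try:
    enforce_quality_gate(failed_validations)
    gate_error = None
except DataQualityError as exc:
    gate_error = str(exc)

print(gate_error)
```

With the library in place, the call would presumably be `enforce_quality_gate(validator.get_failed_validations())`. Letting the exception propagate causes most schedulers, Airflow included, to mark the task as failed.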

Package Structure

emrvalidator/
├── __init__.py          # Package initialization
├── validator.py         # DataValidator class
├── profiler.py          # DataProfiler class
├── reporters.py         # Report generators
├── rules.py             # Rules and expectations
└── py.typed             # Type hints marker

examples/
├── basic_usage.py       # Comprehensive examples
└── healthcare_specific.py

tests/
├── test_validator.py
├── test_profiler.py
└── test_reporters.py

Development

Setup Development Environment

# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=emrvalidator --cov-report=html

Code Formatting

# Format code
black emrvalidator tests

# Sort imports
isort emrvalidator tests

# Check with flake8
flake8 emrvalidator tests

Documentation

Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

How to Contribute

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Make your changes
  4. Add tests for your changes
  5. Run tests (pytest)
  6. Commit your changes (git commit -m 'Add AmazingFeature')
  7. Push to the branch (git push origin feature/AmazingFeature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Authors & Contributors

Rohan Desai
Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohan-desai
LinkedIn: https://www.linkedin.com/in/rohandesai07/

Vaishnavi Gadve
Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/

Acknowledgments

  • Created by Healthcare Analytics Hub
  • Inspired by the need for simpler, healthcare-focused data validation
  • Built for the healthcare analytics community

Contact & Support

Star History

If you find EMRValidator useful, please consider giving it a star on GitHub!

Roadmap

  • Additional healthcare-specific validators (CPT, NDC codes)
  • FHIR data validation support
  • Integration with popular ETL tools
  • Cloud storage support (S3, Azure Blob)
  • Real-time validation streaming
  • Web UI for non-technical users
  • Validation rule marketplace

Citation

If you use EMRValidator in your research or project, please cite:

@software{emrvalidator2025,
  title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
  author = {Desai, Rohan and Gadve, Vaishnavi},
  year = {2025},
  url = {https://github.com/rohandesai007/EMRV}
}
