A Data Validation Tool for Healthcare Data

EMRValidator

A modern, healthcare-focused data quality and validation library

EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.

✨ Key Features

🏥 Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
🎯 Simple, Intuitive API: Fluent interface for chaining validations
📊 Automated Data Profiling: Quick quality assessment with actionable recommendations
📝 Beautiful Reports: Generate professional HTML and JSON reports
⚡ High Performance: 5-7x faster than Great Expectations
🔧 Extensible: Easy to add custom validations and rules
🎨 Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
📦 Minimal Dependencies: Only pandas and numpy required

Use EMRValidator If You Need:

Data quality rules for EMR, claims, or clinical datasets
Fast validation for millions of rows
Healthcare-specific formats (ICD and MRN today; CPT and NDC are on the roadmap)
Validation in ETL, Airflow, dbt, or LLM pipelines

Don’t Use It If:

You need schema evolution tracking across multiple batches

🚀 Installation

pip install emrvalidator

For Excel support:

pip install emrvalidator[excel]

For development:

pip install emrvalidator[dev]

📖 Quick Start

from emrvalidator import DataValidator
import pandas as pd

# Load your data
df = pd.read_csv('patient_data.csv')

# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)

# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)

# Check results
if validator.is_valid():
    print("✓ All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f"  - {fail['message']}")

🆚 Why EMRValidator?

Comparison with Great Expectations

Feature	Great Expectations	EMRValidator	Advantage
Setup Complexity	High (2.3s)	Minimal (0.1s)	23x faster
Code Volume	45 lines	12 lines	73% less code
Performance	Baseline	5-7x faster	Shorter validation runs
Healthcare Focus	None	Built-in	MRN, ICD validation
Dependencies	40+ packages	2 packages	95% fewer
Learning Curve	4-8 hours	15 minutes	16-32x shorter
Data Profiling	External tool	Built-in	Included

See detailed comparison documentation.

📚 Core Features

1. Basic Validations

# Column existence
validator.expect_column_exists('column_name')

# Null checks
validator.expect_column_not_null('age', threshold=0.95)

# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)

# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})

# Uniqueness
validator.expect_column_values_unique('patient_id')

# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')
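
The `threshold` argument used above is not defined in this README. One plausible reading, stated here as an assumption, is that it is the minimum fraction of rows that must satisfy the check. The pandas-only sketch below illustrates that reading for a null check; `not_null_ratio` and `passes_threshold` are hypothetical helpers and not part of EMRValidator.

```python
import pandas as pd


def not_null_ratio(series: pd.Series) -> float:
    """Fraction of non-null values; an empty series counts as fully valid."""
    if len(series) == 0:
        return 1.0
    return float(series.notna().mean())


def passes_threshold(series: pd.Series, threshold: float = 1.0) -> bool:
    """Assumed semantics: pass when at least `threshold` of rows are non-null."""
    return not_null_ratio(series) >= threshold


# Hypothetical sample: 9 of 10 ages are present
ages = pd.Series([34, 51, None, 29, 62, 45, 38, 70, 19, 55])

print(passes_threshold(ages, 0.95))  # False
print(passes_threshold(ages, 0.90))  # True
```

Under this reading, `threshold=1.0` tolerates no missing values, while lower values allow a bounded share of bad rows.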

2. Healthcare-Specific Validations

# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)

# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10)  # ICD-10
validator.expect_icd_format('diagnosis_code', version=9)   # ICD-9

# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets

demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()
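
The pattern that `expect_icd_format` applies is not shown in this README. As an illustration of what a structural ICD-10 check can look like, the sketch below tests the general ICD-10-CM shape (a letter, a digit, an alphanumeric character, then an optional dot followed by one to four alphanumerics). `ICD10_SHAPE`, `looks_like_icd10`, and `icd10_valid_fraction` are hypothetical names; the check confirms shape only and does not look the code up in an official code set.

```python
import re

# Shape of an ICD-10-CM code, e.g. "E11.9" or "S72.001A".
ICD10_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def looks_like_icd10(code: str) -> bool:
    """True when `code` has the ICD-10-CM shape (not a code-set lookup)."""
    return ICD10_SHAPE.fullmatch(code.strip().upper()) is not None


def icd10_valid_fraction(codes: list[str]) -> float:
    """Share of codes with a valid shape; an empty list counts as fully valid."""
    if not codes:
        return 1.0
    return sum(looks_like_icd10(c) for c in codes) / len(codes)


# Hypothetical sample: "250.00" is an ICD-9 style code, "E11." has a dangling dot
codes = ["E11.9", "I10", "S72.001A", "250.00", "e11", "E11."]
print([looks_like_icd10(c) for c in codes])
# [True, True, True, False, True, False]
```

A fraction such as the one returned by `icd10_valid_fraction` could then be compared against a threshold in the same way as the other checks.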

3. Data Profiling

from emrvalidator import DataProfiler

profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()

# Print summary
profiler.print_summary()

# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")

# Get recommendations
for rec in profile['recommendations']:
    print(f"  - {rec}")
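
How `DataProfiler` derives its quality score and recommendations is not documented here. Purely as an illustration of the idea, the sketch below computes a completeness-based score with pandas. The function `completeness_profile`, its `null_warn` parameter, and the scoring formula are all assumptions and may differ from what the library does.

```python
import pandas as pd


def completeness_profile(df: pd.DataFrame, null_warn: float = 0.2) -> dict:
    """Hypothetical profile: per-column null rate, overall score, simple advice."""
    if df.empty:
        return {"null_rate": {}, "quality_score": 100.0, "recommendations": []}

    null_rate = df.isna().mean()  # fraction of nulls per column
    # Score = average completeness across columns, on a 0-100 scale
    score = round(float((1 - null_rate).mean()) * 100, 1)
    recommendations = [
        f"Column '{col}' is {rate:.0%} null; review the upstream source"
        for col, rate in null_rate.items()
        if rate > null_warn
    ]
    return {
        "null_rate": null_rate.to_dict(),
        "quality_score": score,
        "recommendations": recommendations,
    }


# Hypothetical sample with one sparse column
sample = pd.DataFrame({
    "mrn": ["A1", "A2", "A3", "A4"],
    "age": [34, None, 51, 29],
    "discharge_date": [None, None, "2024-01-05", None],
})

sample_profile = completeness_profile(sample)
print(sample_profile["quality_score"])  # 66.7
for rec in sample_profile["recommendations"]:
    print(f"  - {rec}")
```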

4. Report Generation

from emrvalidator import HTMLReporter, JSONReporter

# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')

# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)

5. Custom Validations

def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments in more than 95% of rows."""
    # Row-wise check; comparisons involving NaN evaluate to False (invalid)
    valid_mask = df['charge_amount'] >= df['payment_amount']
    # float() converts the NumPy scalar to a built-in type for serialization
    valid_pct = float(valid_mask.sum() / len(df)) if len(df) else 0.0

    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }

    return passed, message, details

validator.expect_custom("charge_payment_logic", validate_charge_payment)
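
Because a custom validation is a plain function that returns `(passed, message, details)`, it can be exercised directly on a small DataFrame before it is registered with a validator. The sketch below restates the rule above so that it runs on its own; the sample data is made up for illustration.

```python
import pandas as pd


def validate_charge_payment(df, **kwargs):
    """Same rule as above: charges must be >= payments in more than 95% of rows."""
    valid_mask = df["charge_amount"] >= df["payment_amount"]
    valid_pct = float(valid_mask.mean()) if len(df) else 0.0
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum()),
    }
    return valid_pct > 0.95, message, details


# Hypothetical sample: the second row has a payment larger than its charge
sample = pd.DataFrame({
    "charge_amount": [100.0, 200.0, 50.0, 80.0],
    "payment_amount": [90.0, 250.0, 50.0, 0.0],
})

passed, message, details = validate_charge_payment(sample)
print(passed)   # False
print(message)  # 75.00% have valid charge/payment relationship
print(details)  # {'valid_percentage': 75.0, 'invalid_count': 1}
```

Testing the function in isolation like this keeps rule logic separate from whatever the validator does with the result.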

6. Reusable Rule Sets

from emrvalidator import RuleSet

# Create custom rule set
financial_rules = RuleSet("Financial Validations")

def validate_positive_charges(df, **kwargs):
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "More than 98% of charges must be positive",
    validate_positive_charges
)

# Apply to any dataset
results = financial_rules.execute_all(df)

7. Expectations API

from emrvalidator import Expectation, ExpectationSuite

suite = ExpectationSuite("Data Quality Expectations")

(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)

results = suite.validate(df)

🎯 Use Cases

Healthcare Analytics

Patient demographics validation
Claims data quality checks
Clinical data validation
Revenue cycle management
Denial management analysis

Data Engineering

ETL pipeline validation
Data warehouse quality checks
Real-time data validation
Data migration validation

Business Intelligence

Report data quality
Dashboard data validation
KPI data integrity
Automated quality monitoring

📊 Real-World Example

from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd

# 1. Load data
df = pd.read_csv('patient_encounters.csv')

# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")

# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)

(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)

# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')

# 5. Check status
if validator.is_valid():
    print("✓ Data quality check passed!")
else:
    print(f"⚠️  {len(validator.get_failed_validations())} validations failed")

📦 Package Structure

emrvalidator/
├── __init__.py          # Package initialization
├── validator.py         # DataValidator class
├── profiler.py          # DataProfiler class
├── reporters.py         # Report generators
├── rules.py             # Rules and expectations
└── py.typed             # Type hints marker

examples/
├── basic_usage.py       # Comprehensive examples
└── healthcare_specific.py

tests/
├── test_validator.py
├── test_profiler.py
└── test_reporters.py

🔧 Development

Setup Development Environment

# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

Run Tests

pytest

Run Tests with Coverage

pytest --cov=emrvalidator --cov-report=html

Code Formatting

# Format code
black emrvalidator tests

# Sort imports
isort emrvalidator tests

# Check with flake8
flake8 emrvalidator tests

📝 Documentation

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

How to Contribute

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Make your changes
Add tests for your changes
Run tests (pytest)
Commit your changes (git commit -m 'Add AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors & Contributors

Rohan Desai
Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohan-desai
LinkedIn: https://www.linkedin.com/in/rohandesai07/

Vaishnavi Gadve
Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/

Acknowledgments

Created by Healthcare Analytics Hub
Inspired by the need for simpler, healthcare-focused data validation
Built for the healthcare analytics community

📧 Contact & Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: rohan.acme@gmail.com

Star History

If you find EMRValidator useful, please consider giving it a star on GitHub!

📈 Roadmap

Additional healthcare-specific validators (CPT, NDC codes)
FHIR data validation support
Integration with popular ETL tools
Cloud storage support (S3, Azure Blob)
Real-time validation streaming
Web UI for non-technical users
Validation rule marketplace

💡 Citation

If you use EMRValidator in your research or project, please cite:

@software{emrvalidator2025,
  title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
  author = {Desai, Rohan and Gadve, Vaishnavi},
  year = {2025},
  url = {https://github.com/rohandesai007/EMRV}
}

Project details

Release history

1.0.2 (Jan 26, 2026), this version
1.0.1 (Nov 18, 2025)
1.0.0 (Nov 18, 2025)

Download files

Source Distribution

emrvalidator-1.0.2.tar.gz (23.4 kB)

Uploaded Jan 26, 2026 Source

Built Distribution

emrvalidator-1.0.2-py3-none-any.whl (22.2 kB)

Uploaded Jan 26, 2026 Python 3

File details

Details for the file emrvalidator-1.0.2.tar.gz.

File metadata

Download URL: emrvalidator-1.0.2.tar.gz
Upload date: Jan 26, 2026
Size: 23.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for emrvalidator-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`f4bbe7eb6592a11a5229ba4ef30194797341e36af1d48f83718fba4168e12c8e`
MD5	`96e00d0a63ed2c383a65efdbb20b778d`
BLAKE2b-256	`ada0f68b157933055d03d129ecf30279b2580951e5b59d39ea580b1f5719ce45`

File details

Details for the file emrvalidator-1.0.2-py3-none-any.whl.

File metadata

Download URL: emrvalidator-1.0.2-py3-none-any.whl
Upload date: Jan 26, 2026
Size: 22.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.8

File hashes

Hashes for emrvalidator-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8d415769e83567ba1b98beabdafddf07e2a95d5c8fa2dcd4d5d49b0c61a393c0`
MD5	`ab5edd42eb15f2447acdb39fb6dc2130`
BLAKE2b-256	`70f386aa454dc679d48555445c4fe49ee5041663af86ec6c0a2c99f061d999fa`
