EMRValidator
A modern, healthcare-focused data quality and validation library
EMRValidator is a Python library designed as a cleaner, faster, and more intuitive alternative to Great Expectations, with specialized features for Electronic Medical Records (EMR) and healthcare data validation.
Key Features
- Healthcare-Specific Validations: Built-in validators for MRN, ICD codes, and other healthcare data
- Simple, Intuitive API: Fluent interface for chaining validations
- Automated Data Profiling: Quick quality assessment with actionable recommendations
- Beautiful Reports: Generate professional HTML and JSON reports
- High Performance: 5-7x faster than Great Expectations
- Extensible: Easy to add custom validations and rules
- Multiple APIs: Choose between fluent, expectation-based, or rule-set patterns
- Minimal Dependencies: Only pandas and numpy required
Installation
pip install emrvalidator
For Excel support:
pip install emrvalidator[excel]
For development:
pip install emrvalidator[dev]
Quick Start
from emrvalidator import DataValidator
import pandas as pd
# Load your data
df = pd.read_csv('patient_data.csv')
# Create validator and run validations
validator = DataValidator("Patient Data Quality Check")
validator.load_data(df)
# Chain validation rules
(validator
    .expect_column_exists('mrn')
    .expect_column_not_null('patient_id', threshold=0.99)
    .expect_column_values_between('age', 0, 120)
    .expect_mrn_format('mrn')
    .expect_icd_format('diagnosis_code', version=10)
)
# Check results
if validator.is_valid():
    print("All validations passed!")
else:
    print("Issues found:")
    for fail in validator.get_failed_validations():
        print(f" - {fail['message']}")
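The threshold arguments in the Quick Start are not defined in this README. A plausible reading is that a check passes when at least that fraction of rows satisfies it. The dependency-free sketch below illustrates that reading together with method chaining. MiniValidator and its methods are hypothetical names invented for this illustration; they are not part of EMRValidator and may not match how the library computes results.

```python
# Hypothetical sketch: MiniValidator is NOT part of EMRValidator.
# Assumption: a check passes when (passing rows / total rows) >= threshold.

class MiniValidator:
    def __init__(self, rows):
        self.rows = rows      # list of dicts, one per record
        self.results = []     # one dict per executed check

    def _record(self, name, ok_count, threshold):
        total = len(self.rows)
        # Treat an empty dataset as failing rather than dividing by zero.
        fraction = ok_count / total if total else 0.0
        self.results.append({
            "check": name,
            "fraction": fraction,
            "passed": total > 0 and fraction >= threshold,
        })
        return self  # returning self is what enables chaining

    def expect_not_null(self, column, threshold=1.0):
        ok = sum(1 for r in self.rows if r.get(column) is not None)
        return self._record(f"not_null:{column}", ok, threshold)

    def expect_between(self, column, low, high, threshold=1.0):
        ok = sum(
            1 for r in self.rows
            if r.get(column) is not None and low <= r[column] <= high
        )
        return self._record(f"between:{column}", ok, threshold)

    def is_valid(self):
        return all(r["passed"] for r in self.results)


rows = [
    {"patient_id": 1, "age": 34},
    {"patient_id": 2, "age": 130},    # age out of range
    {"patient_id": None, "age": 58},  # missing id
    {"patient_id": 4, "age": 71},
]

validator = (
    MiniValidator(rows)
    .expect_not_null("patient_id", threshold=0.75)  # 3 of 4 present, meets 0.75
    .expect_between("age", 0, 120, threshold=0.99)  # 3 of 4 in range, below 0.99
)
print(validator.is_valid())
```

Under this reading, a threshold of 1.0 tolerates no failing rows, while a value such as 0.99 allows up to 1% of rows to fail before the check is reported as failed.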
Why EMRValidator?
Comparison with Great Expectations
| Feature | Great Expectations | EMRValidator | Advantage |
|---|---|---|---|
| Setup Complexity | High (2.3s) | Minimal (0.1s) | 23x faster |
| Code Volume | 45 lines | 12 lines | 73% less code |
| Performance | Baseline | 5-7x faster | 400-600% faster (5-7x the speed) |
| Healthcare Focus | None | Built-in | MRN, ICD validation |
| Dependencies | 40+ packages | 2 packages | 95% fewer |
| Learning Curve | 4-8 hours | 15 minutes | 16-32x faster |
| Data Profiling | External tool | Built-in | Included |
See detailed comparison documentation.
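The ratios and percentages in the Advantage column can be derived from the two middle columns. The short sketch below shows the arithmetic, using only figures copied from the table. The helper names are made up for this illustration, and "40+ packages" is treated as exactly 40, which is an assumption.

```python
# Figures copied from the comparison table; helper names are illustrative only.
setup_ge_s, setup_emr_s = 2.3, 0.1   # setup time in seconds
lines_ge, lines_emr = 45, 12         # lines of code
deps_ge, deps_emr = 40, 2            # "40+" treated as 40 (assumption)


def ratio(baseline, new):
    """How many times smaller the new value is than the baseline."""
    return baseline / new


def percent_reduction(baseline, new):
    """Relative reduction from the baseline, in percent."""
    return (baseline - new) / baseline * 100


def times_as_fast_to_percent_faster(n):
    """N times as fast is (N - 1) * 100 percent faster."""
    return (n - 1) * 100


setup_ratio = round(ratio(setup_ge_s, setup_emr_s), 1)
code_reduction = round(percent_reduction(lines_ge, lines_emr), 1)
deps_reduction = round(percent_reduction(deps_ge, deps_emr), 1)
# Learning curve: 4-8 hours versus 15 minutes, compared in minutes.
learning_ratio_range = (ratio(4 * 60, 15), ratio(8 * 60, 15))

print(setup_ratio, code_reduction, deps_reduction, learning_ratio_range)
```

Keeping ratios ("N times") and percentages ("N% faster") separate avoids mixing the two scales: a speedup of N times corresponds to (N - 1) x 100 percent faster.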
Core Features
1. Basic Validations
# Column existence
validator.expect_column_exists('column_name')
# Null checks
validator.expect_column_not_null('age', threshold=0.95)
# Value ranges
validator.expect_column_values_between('age', 0, 120, threshold=0.98)
# Set membership
validator.expect_column_values_in_set('gender', {'M', 'F', 'Other'})
# Uniqueness
validator.expect_column_values_unique('patient_id')
# Date format
validator.expect_column_date_format('admission_date', date_format='%Y-%m-%d')
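To make the date-format check concrete, here is a minimal standard-library sketch of what validating strings against a format such as '%Y-%m-%d' can look like. It assumes the check means "the value parses with datetime.strptime"; the function name fraction_matching_format is hypothetical, and EMRValidator's own implementation is not shown in this README and may differ.

```python
from datetime import datetime


def fraction_matching_format(values, date_format="%Y-%m-%d"):
    """Return the fraction of non-null string values that parse with date_format."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    ok = 0
    for value in present:
        try:
            datetime.strptime(value, date_format)
            ok += 1
        except ValueError:
            # Raised for a wrong layout and for impossible dates alike.
            pass
    return ok / len(present)


dates = ["2024-01-15", "2024-02-30", "15/01/2024", "2023-12-31"]
share = fraction_matching_format(dates)
print(f"{share:.0%} of values match")
```

Parsing rather than pattern matching also rejects impossible calendar dates such as February 30, which a purely digit-based pattern would accept.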
2. Healthcare-Specific Validations
# Medical Record Numbers
validator.expect_mrn_format('mrn', threshold=0.99)
# ICD Codes
validator.expect_icd_format('diagnosis_code', version=10) # ICD-10
validator.expect_icd_format('diagnosis_code', version=9) # ICD-9
# Pre-built healthcare rule sets
from emrvalidator import HealthcareRuleSets
demo_rules = HealthcareRuleSets.patient_demographics()
fin_rules = HealthcareRuleSets.financial_data()
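As an illustration of what an ICD format check involves, the sketch below tests code shape with regular expressions. The patterns are assumptions based on the general ICD-10-CM and ICD-9-CM code layouts; the rules EMRValidator actually applies are not documented here and may be stricter or looser. The helper looks_like_icd is a hypothetical name. MRN formats vary by institution, so no MRN pattern is attempted.

```python
import re

# Assumed patterns (not taken from EMRValidator).
# ICD-10-CM shape: letter, digit, alphanumeric, optional "." plus 1-4 alphanumerics.
ICD10_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")
# ICD-9-CM shape: numeric, V, and E code families.
ICD9_PATTERN = re.compile(
    r"^(?:[0-9]{3}(?:\.[0-9]{1,2})?"   # numeric: 250, 250.0, 250.00
    r"|V[0-9]{2}(?:\.[0-9]{1,2})?"     # V codes: V70, V70.0
    r"|E[0-9]{3}(?:\.[0-9])?)$"        # E codes: E849, E849.0
)


def looks_like_icd(code, version=10):
    """Return True if code has the shape of an ICD-9 or ICD-10 code."""
    if version == 10:
        pattern = ICD10_PATTERN
    elif version == 9:
        pattern = ICD9_PATTERN
    else:
        raise ValueError("version must be 9 or 10")
    return bool(pattern.match(code.strip().upper()))


for code in ["E11.9", "J45.909", "S72.001A", "11.9"]:
    print(code, looks_like_icd(code, version=10))
```

A shape check like this only confirms that a value is formatted like a code. It does not confirm that the code exists in a given ICD release.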
3. Data Profiling
from emrvalidator import DataProfiler
profiler = DataProfiler(df, "Healthcare Dataset")
profile = profiler.generate_profile()
# Print summary
profiler.print_summary()
# Get quality score
quality_score = profile['quality_summary']['quality_score']
print(f"Quality Score: {quality_score}/100")
# Get recommendations
for rec in profile['recommendations']:
print(f" - {rec}")
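This README does not say how quality_score is calculated. One common ingredient of such scores is per-column completeness, sketched below with plain Python records. The function column_completeness is hypothetical and is not claimed to be the profiler's formula.

```python
def column_completeness(rows):
    """Percent of non-null values per column for a list of dict records."""
    total = len(rows)
    if total == 0:
        return {}
    # Collect every column name that appears in any record.
    columns = sorted({key for row in rows for key in row})
    return {
        col: round(100 * sum(row.get(col) is not None for row in rows) / total, 1)
        for col in columns
    }


records = [
    {"mrn": "A1", "age": 40},
    {"mrn": None, "age": 52},
    {"mrn": "A3"},               # age key absent counts as missing
    {"mrn": "A4", "age": None},
]
completeness = column_completeness(records)
print(completeness)
```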
4. Report Generation
from emrvalidator import HTMLReporter, JSONReporter
# Generate HTML report
html_reporter = HTMLReporter(validator.get_results())
html_reporter.generate('quality_report.html', title='Data Quality Report')
# Generate JSON report
json_reporter = JSONReporter(validator.get_results())
json_reporter.generate('quality_report.json', pretty=True)
5. Custom Validations
def validate_charge_payment(df, **kwargs):
    """Custom validation: charges must be >= payments"""
    valid_mask = df['charge_amount'] >= df['payment_amount']
    valid_pct = valid_mask.sum() / len(df)
    passed = valid_pct > 0.95
    message = f"{valid_pct*100:.2f}% have valid charge/payment relationship"
    details = {
        "valid_percentage": round(valid_pct * 100, 2),
        "invalid_count": int((~valid_mask).sum())
    }
    return passed, message, details

validator.expect_custom("charge_payment_logic", validate_charge_payment)
6. Reusable Rule Sets
from emrvalidator import RuleSet
# Create custom rule set
financial_rules = RuleSet("Financial Validations")
def validate_positive_charges(df, **kwargs):
    valid = (df['charge_amount'] > 0).sum() / len(df)
    passed = valid > 0.98
    return passed, f"Positive charges: {valid*100:.1f}%", {}

financial_rules.create_rule(
    "positive_charges",
    "All charges must be positive",
    validate_positive_charges
)
# Apply to any dataset
results = financial_rules.execute_all(df)
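The rule-set pattern can be pictured as a named collection of callables that each return a (passed, message, details) tuple, the same shape the custom validation functions use. The sketch below is a minimal, dependency-free illustration of that idea. MiniRuleSet is a hypothetical class: its method names mirror the ones used in this section, but its behavior, including catching exceptions raised by a rule, is an assumption and not a description of RuleSet.

```python
# Hypothetical sketch: MiniRuleSet is NOT the library's RuleSet.

class MiniRuleSet:
    def __init__(self, name):
        self.name = name
        self.rules = []  # list of (rule_name, description, function)

    def create_rule(self, rule_name, description, func):
        self.rules.append((rule_name, description, func))

    def execute_all(self, data, **kwargs):
        results = []
        for rule_name, description, func in self.rules:
            try:
                passed, message, details = func(data, **kwargs)
            except Exception as exc:
                # Assumption: a rule that raises is reported as a failure.
                passed, message, details = False, f"rule raised {exc!r}", {}
            results.append({
                "rule": rule_name,
                "description": description,
                "passed": bool(passed),
                "message": message,
                "details": details,
            })
        return results


def positive_charges(rows, min_fraction=0.98, **kwargs):
    """Rule: at least min_fraction of rows have a charge above zero."""
    valid = sum(1 for r in rows if r["charge_amount"] > 0) / len(rows)
    return valid >= min_fraction, f"Positive charges: {valid*100:.1f}%", {}


rules = MiniRuleSet("Financial Validations")
rules.create_rule("positive_charges", "All charges must be positive", positive_charges)

sample = [{"charge_amount": 120.0}, {"charge_amount": 0.0}]
outcome = rules.execute_all(sample)
empty_outcome = rules.execute_all([])  # the rule divides by zero here
print(outcome[0]["message"])
```

Because the rules only depend on the data passed to execute_all, the same rule set can be applied to any dataset that has the expected columns.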
7. Expectations API
from emrvalidator import Expectation, ExpectationSuite
suite = ExpectationSuite("Data Quality Expectations")
(suite
    .expect("mrn_exists", Expectation.column_to_exist('mrn'))
    .expect("mrn_not_null", Expectation.column_values_to_not_be_null('mrn'))
    .expect("valid_gender", Expectation.column_values_to_be_in_set('gender', {'M', 'F'}))
    .expect("unique_patients", Expectation.column_values_to_be_unique('patient_id'))
)
results = suite.validate(df)
Use Cases
Healthcare Analytics
- Patient demographics validation
- Claims data quality checks
- Clinical data validation
- Revenue cycle management
- Denial management analysis
Data Engineering
- ETL pipeline validation
- Data warehouse quality checks
- Real-time data validation
- Data migration validation
Business Intelligence
- Report data quality
- Dashboard data validation
- KPI data integrity
- Automated quality monitoring
Real-World Example
from emrvalidator import DataValidator, DataProfiler, HTMLReporter
import pandas as pd
# 1. Load data
df = pd.read_csv('patient_encounters.csv')
# 2. Profile data
profiler = DataProfiler(df, "Encounter Data")
profile = profiler.generate_profile()
print(f"Quality Score: {profile['quality_summary']['quality_score']}/100")
# 3. Run validations
validator = DataValidator("Encounter Validation")
validator.load_data(df)
(validator
    .expect_column_exists('mrn')
    .expect_column_exists('encounter_id')
    .expect_column_not_null('admission_date', threshold=1.0)
    .expect_column_not_null('discharge_date', threshold=1.0)
    .expect_mrn_format('mrn')
    .expect_icd_format('primary_diagnosis', version=10)
    .expect_column_values_between('length_of_stay', 0, 365)
)
# 4. Generate report
results = validator.get_results()
HTMLReporter(results).generate('encounter_quality_report.html')
# 5. Check status
if validator.is_valid():
    print("Data quality check passed!")
else:
    print(f"{len(validator.get_failed_validations())} validations failed")
Package Structure
emrvalidator/
├── __init__.py # Package initialization
├── validator.py # DataValidator class
├── profiler.py # DataProfiler class
├── reporters.py # Report generators
├── rules.py # Rules and expectations
└── py.typed # Type hints marker
examples/
├── basic_usage.py # Comprehensive examples
└── healthcare_specific.py
tests/
├── test_validator.py
├── test_profiler.py
└── test_reporters.py
Development
Setup Development Environment
# Clone repository
git clone https://github.com/rohandesai007/EMRV.git
cd EMRV
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
Run Tests
pytest
Run Tests with Coverage
pytest --cov=emrvalidator --cov-report=html
Code Formatting
# Format code
black emrvalidator tests
# Sort imports
isort emrvalidator tests
# Check with flake8
flake8 emrvalidator tests
Documentation
- Quick Start Guide
- API Reference
- Comparison with Great Expectations
- Healthcare Validations
- Custom Validations Guide
- Examples
Contributing
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
How to Contribute
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Make your changes
- Add tests for your changes
- Run tests (pytest)
- Commit your changes (git commit -m 'Add AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors & Contributors
Rohan Desai & Vaishnavi Sanjay Gadve
Rohan Desai
Dallas, Texas, USA
Email: rohan.acme@gmail.com
GitHub: https://github.com/rohandesai007
LinkedIn: https://www.linkedin.com/in/rohandesai07/
Vaishnavi Sanjay Gadve
Irving, Texas, USA
Email: vaishnavigadve143@gmail.com
GitHub: https://github.com/vaish2412
LinkedIn: https://www.linkedin.com/in/vaishnavi-gadve-4b577512a/
Acknowledgments
- Created by Healthcare Analytics Hub
- Inspired by the need for simpler, healthcare-focused data validation
- Built for the healthcare analytics community
Contact & Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: rohan.acme@gmail.com
Star History
If you find EMRValidator useful, please consider giving it a star on GitHub!
Roadmap
- Additional healthcare-specific validators (CPT, NDC codes)
- FHIR data validation support
- Integration with popular ETL tools
- Cloud storage support (S3, Azure Blob)
- Real-time validation streaming
- Web UI for non-technical users
- Validation rule marketplace
Citation
If you use EMRValidator in your research or project, please cite:
@software{emrvalidator2024,
title = {EMRValidator: Healthcare-Focused Data Quality and Validation},
author = {Desai, Rohan and Gadve, Vaishnavi Sanjay},
year = {2024},
url = {https://github.com/rohandesai007/EMRV}
}
Made for Healthcare Data Professionals by Rohan Desai & Vaishnavi Sanjay Gadve