Provider Dedupe
A command-line tool for deduplicating healthcare provider data using probabilistic record linkage with the Splink library.
Installation
From PyPI (when published)
pip install provider-dedupe
Development Installation
# Clone the repository
git clone https://github.com/taylor-hickman/provider_dedupe.git
cd provider_dedupe
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev,viz,excel,parquet]"
# Install pre-commit hooks (optional)
pre-commit install
Quick Start
Basic Usage Examples
# Example 1: Simple deduplication with default settings
provider-dedupe dedupe providers.csv deduplicated_providers.csv
# Example 2: Deduplication with an explicit threshold (0.95 is also the default) and HTML report
provider-dedupe dedupe providers.csv results.csv --threshold 0.95 --generate-report
# Example 3: Using a configuration file for advanced settings
provider-dedupe dedupe providers.csv output.xlsx --config config.json --generate-report
# Example 4: Analyze data quality before deduplication
provider-dedupe analyze providers.csv --output-dir quality_reports/
# Example 5: Generate visualizations from results
provider-dedupe visualize deduplicated_providers.csv --output-dir visualizations/
Sample Input Data Format
Your CSV file should look like this:
npi,firstname,lastname,address1,city,state,zipcode
1234567890,JOHN,SMITH,123 MAIN ST,NEW YORK,NY,10001
1234567890,JOHN,SMITH,123 MAIN STREET,NEW YORK,NY,10001
9876543210,JANE,DOE,456 ELM AVE,BOSTON,MA,02101
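The first two sample rows share an NPI and differ only in "ST" versus "STREET". As a rough illustration of why text normalization matters before linkage, here is a minimal sketch using only the standard library. The `SUFFIXES` table and `normalize_address` function are hypothetical and are not taken from the package's own normalization module.

```python
import csv
import io

# Sample rows copied from the input example above.
SAMPLE = """npi,firstname,lastname,address1,city,state,zipcode
1234567890,JOHN,SMITH,123 MAIN ST,NEW YORK,NY,10001
1234567890,JOHN,SMITH,123 MAIN STREET,NEW YORK,NY,10001
9876543210,JANE,DOE,456 ELM AVE,BOSTON,MA,02101
"""

# Hypothetical abbreviation table; a real one would be much larger.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def normalize_address(address: str) -> str:
    """Uppercase, collapse whitespace, and abbreviate known street suffixes."""
    tokens = address.upper().split()
    return " ".join(SUFFIXES.get(token, token) for token in tokens)


# csv.DictReader yields every value as a string, so ZIP codes keep leading zeros.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
normalized = [normalize_address(row["address1"]) for row in rows]
print(normalized)
```

After normalization the two JOHN SMITH addresses compare equal. Reading the ZIP code as a string rather than a number keeps values such as 02101 intact.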
Command Line Options
provider-dedupe dedupe --help
Options:
  --threshold FLOAT      Match threshold (0.0-1.0) [default: 0.95]
  --config PATH          Path to configuration file
  --output-format TEXT   Output format: csv, excel, json, parquet [default: csv]
  --generate-report      Generate HTML report with statistics
  --blocking-rules TEXT  Custom blocking rules (JSON format)
  --batch-size INTEGER   Batch size for processing [default: 50000]
  --help                 Show this message and exit
Python API Example
from provider_dedupe import ProviderDeduplicator
from provider_dedupe.core.config import DeduplicationConfig
# Example 1: Basic usage
deduplicator = ProviderDeduplicator()
results_df, stats = deduplicator.run_deduplication("providers.csv")
print(f"Found {stats['duplicates_found']} duplicate records")
print(f"Merged into {stats['unique_providers']} unique providers")
# Example 2: Custom configuration
config = DeduplicationConfig(
    match_threshold=0.98,
    blocking_rules=[
        {"rule": "l.npi = r.npi", "description": "Exact NPI match"},
        {"rule": "l.zipcode = r.zipcode AND l.lastname = r.lastname",
         "description": "Same ZIP and last name"},
    ],
)
deduplicator = ProviderDeduplicator(config=config)
results_df, stats = deduplicator.run_deduplication("providers.csv")
# Example 3: Save results in multiple formats
deduplicator.save_results(results_df, "output.csv", format="csv")
deduplicator.save_results(results_df, "output.xlsx", format="excel")
deduplicator.save_results(results_df, "output.json", format="json")
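To make the role of the match threshold concrete, the sketch below groups records whose pairwise match probability meets a threshold, using a small union-find. It is an illustration of the general idea only: the `cluster_pairs` function and the scores are hypothetical, and this is not presented as how the package or Splink clusters records internally.

```python
from collections import defaultdict


def cluster_pairs(record_ids, scored_pairs, threshold):
    """Group records connected by pairs scoring at or above the threshold."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, probability in scored_pairs:
        if probability >= threshold:
            parent[find(left)] = find(right)

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    return sorted(clusters.values())


# Hypothetical pairwise match probabilities.
ids = ["a", "b", "c", "d"]
pairs = [("a", "b", 0.99), ("b", "c", 0.96), ("c", "d", 0.40)]

clusters_95 = cluster_pairs(ids, pairs, threshold=0.95)
clusters_98 = cluster_pairs(ids, pairs, threshold=0.98)
print(clusters_95)
print(clusters_98)
```

Raising the threshold from 0.95 to 0.98 drops the b-c link, so the three-record group splits. A higher threshold trades fewer false merges for more missed duplicates.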
Input Data Format
The system expects CSV files with the following columns:
| Column | Description |
|---|---|
| npi | National Provider Identifier |
| firstname | Provider first name |
| lastname | Provider last name |
| address1 | Street address |
| city | City name |
| state | State code (2 letters) |
| zipcode | ZIP/postal code |
| gnpi | Group NPI |
| group_name | Organization name |
| primary_spec_desc | Specialty |
| phone | Phone number |
| address_status | Address quality |
| phone_status | Phone quality |

The sample input shown earlier uses only the first seven columns, npi through zipcode.
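A quick pre-flight check on each row can catch malformed input before deduplication. The sketch below is a hypothetical helper, not part of the package. It assumes the seven columns from the sample CSV must be non-empty (an assumption; adjust `required` to your data), that an NPI is a 10-digit number, and that the state is a 2-letter code as stated in the table.

```python
import re

# Assumption: treat the columns used in the sample CSV as mandatory.
SAMPLE_COLUMNS = [
    "npi", "firstname", "lastname", "address1", "city", "state", "zipcode",
]


def validate_row(row, required=SAMPLE_COLUMNS):
    """Return a list of human-readable problems found in one input row."""
    problems = []
    for column in required:
        if not (row.get(column) or "").strip():
            problems.append(f"missing {column}")

    npi = (row.get("npi") or "").strip()
    if npi and not re.fullmatch(r"\d{10}", npi):
        problems.append("npi is not 10 digits")

    state = (row.get("state") or "").strip()
    if state and not re.fullmatch(r"[A-Za-z]{2}", state):
        problems.append("state is not a 2-letter code")
    return problems


good = {
    "npi": "1234567890", "firstname": "JOHN", "lastname": "SMITH",
    "address1": "123 MAIN ST", "city": "NEW YORK", "state": "NY",
    "zipcode": "10001",
}
bad = dict(good, npi="12345", city="", state="NYC")

print(validate_row(good))
print(validate_row(bad))
```

Collecting all problems per row, rather than stopping at the first, makes it easier to report data quality issues in bulk.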
Configuration
Configuration File Structure
{
  "match_threshold": 0.95,
  "max_iterations": 20,
  "em_convergence": 0.001,
  "blocking_rules": [
    {
      "rule": "l.npi = r.npi",
      "description": "Exact NPI match"
    }
  ],
  "comparisons": [
    {
      "column_name": "npi",
      "comparison_type": "exact",
      "term_frequency_adjustments": false
    }
  ]
}
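If you want to sanity-check a configuration file before handing it to the tool, something like the following may help. `ConfigSketch` and `load_config` are hypothetical stand-ins written with the standard library; they are not the package's `DeduplicationConfig`, and the default values are simply copied from the example file above rather than confirmed defaults.

```python
import json
from dataclasses import dataclass, field

EXAMPLE = """
{
  "match_threshold": 0.95,
  "max_iterations": 20,
  "em_convergence": 0.001,
  "blocking_rules": [
    {"rule": "l.npi = r.npi", "description": "Exact NPI match"}
  ],
  "comparisons": [
    {"column_name": "npi", "comparison_type": "exact",
     "term_frequency_adjustments": false}
  ]
}
"""


@dataclass
class ConfigSketch:
    """Hypothetical mirror of the keys shown in the example file."""
    match_threshold: float = 0.95
    max_iterations: int = 20
    em_convergence: float = 0.001
    blocking_rules: list = field(default_factory=list)
    comparisons: list = field(default_factory=list)

    def __post_init__(self):
        if not 0.0 <= self.match_threshold <= 1.0:
            raise ValueError("match_threshold must be between 0.0 and 1.0")
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        for rule in self.blocking_rules:
            if "rule" not in rule:
                raise ValueError("every blocking rule needs a 'rule' key")


def load_config(text: str) -> ConfigSketch:
    """Parse JSON text; unknown keys raise TypeError from the dataclass."""
    return ConfigSketch(**json.loads(text))


cfg = load_config(EXAMPLE)
print(cfg.match_threshold, len(cfg.blocking_rules))
```

Validating in `__post_init__` means a bad threshold fails at load time instead of partway through a long run.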
Environment Variables
# Set via .env file or environment
PROVIDER_DEDUPE_LOG_LEVEL=INFO
PROVIDER_DEDUPE_DATA_DIR=/path/to/data
PROVIDER_DEDUPE_OUTPUT_DIR=/path/to/output
PROVIDER_DEDUPE_MAX_WORKERS=4
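For scripts that wrap the tool, the same variables can be read with the standard library. This is a sketch only: `read_settings` is a hypothetical helper, and the fallback values used when a variable is unset are assumptions, not the package's documented defaults.

```python
import os
from pathlib import Path


def read_settings(env=None):
    """Collect the PROVIDER_DEDUPE_* variables into a plain dictionary."""
    env = os.environ if env is None else env
    return {
        # Fallback values below are illustrative assumptions.
        "log_level": env.get("PROVIDER_DEDUPE_LOG_LEVEL", "INFO").upper(),
        "data_dir": Path(env.get("PROVIDER_DEDUPE_DATA_DIR", ".")),
        "output_dir": Path(env.get("PROVIDER_DEDUPE_OUTPUT_DIR", ".")),
        "max_workers": int(env.get("PROVIDER_DEDUPE_MAX_WORKERS", "1")),
    }


settings = read_settings({
    "PROVIDER_DEDUPE_LOG_LEVEL": "debug",
    "PROVIDER_DEDUPE_MAX_WORKERS": "4",
})
print(settings)
```

Passing a dictionary instead of reading `os.environ` directly keeps the helper easy to test.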
Architecture
src/provider_dedupe/
├── core/                    # Core business logic
│   ├── config.py            # Configuration management
│   ├── deduplicator.py      # Main deduplication engine
│   └── exceptions.py        # Custom exceptions
├── models/                  # Data models
│   └── provider.py          # Provider and record models
├── services/                # Service layer
│   ├── data_loader.py       # Multi-format data loading
│   ├── data_quality.py      # Quality analysis
│   └── output_generator.py  # Result output
├── utils/                   # Utilities
│   ├── logging.py           # Structured logging
│   └── normalization.py     # Text normalization
└── cli/                     # Command-line interface
    └── main.py              # CLI commands
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=provider_dedupe --cov-report=html
# Run specific test categories
pytest tests/unit/
pytest tests/integration/
# Run with verbose output
pytest -v
# Run performance tests
pytest tests/performance/ -m performance
Performance
Optimization Tips
- Use appropriate blocking rules for your data
- Adjust max_pairs_for_training based on available memory
- Enable parallel processing for large datasets
- Consider data preprocessing to improve quality
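To see why the choice of blocking rule matters, the sketch below counts how many candidate pairs would need scoring with and without blocking. The records and helper functions are hypothetical and only illustrate the arithmetic; they do not call the package or Splink.

```python
from collections import Counter
from math import comb


def pairs_without_blocking(records):
    """Every record is compared with every other record."""
    return comb(len(records), 2)


def pairs_with_blocking(records, key):
    """Only records sharing the same blocking key are compared."""
    block_sizes = Counter(key(record) for record in records)
    return sum(comb(size, 2) for size in block_sizes.values())


# Hypothetical records for illustration.
records = [
    {"lastname": "SMITH", "zipcode": "10001"},
    {"lastname": "SMITH", "zipcode": "10001"},
    {"lastname": "SMITH", "zipcode": "02101"},
    {"lastname": "DOE", "zipcode": "02101"},
    {"lastname": "DOE", "zipcode": "02101"},
    {"lastname": "LEE", "zipcode": "10001"},
]

all_pairs = pairs_without_blocking(records)
zip_pairs = pairs_with_blocking(records, key=lambda r: r["zipcode"])
zip_name_pairs = pairs_with_blocking(
    records, key=lambda r: (r["zipcode"], r["lastname"])
)
print(all_pairs, zip_pairs, zip_name_pairs)
```

Unblocked comparison grows quadratically with the number of records, so a tighter rule cuts work sharply. The trade-off is that a rule that is too strict can exclude true duplicates from ever being compared.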
Development
Code Quality Tools
# Format code
black src/ tests/
# Sort imports
isort src/ tests/
# Type checking
mypy src/
# Linting
flake8 src/ tests/
# Run all quality checks
pre-commit run --all-files
Project Structure
- src/: Source code using src layout
- tests/: Comprehensive test suite
- scripts/: Utility scripts
- .github/: CI/CD workflows
API Reference
Core Classes
ProviderDeduplicator
Main deduplication engine.
class ProviderDeduplicator:
    def __init__(
        self,
        config: Optional[DeduplicationConfig] = None,
        data_loader: Optional[DataLoader] = None,
        quality_analyzer: Optional[DataQualityAnalyzer] = None,
    ) -> None: ...

    def load_data(self, input_path: Union[str, Path]) -> pd.DataFrame: ...
    def prepare_data(self) -> pd.DataFrame: ...
    def train_model(self) -> None: ...
    def deduplicate(self, threshold: Optional[float] = None) -> Tuple[pd.DataFrame, Dict]: ...
Provider
Data model for provider information.
class Provider(BaseModel):
    npi: str
    first_name: str
    last_name: str
    address_line_1: str
    city: str
    state: str
    postal_code: str
    # ... additional fields
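The model's field names differ from the CSV column names shown earlier (for example `first_name` versus `firstname`). The sketch below shows one way such a mapping could look. The `COLUMN_TO_FIELD` table is inferred from the names alone and is an assumption, and `ProviderSketch` is a standard-library stand-in rather than the package's `Provider` model.

```python
from dataclasses import dataclass

# Assumed mapping from CSV column names to model field names.
COLUMN_TO_FIELD = {
    "npi": "npi",
    "firstname": "first_name",
    "lastname": "last_name",
    "address1": "address_line_1",
    "city": "city",
    "state": "state",
    "zipcode": "postal_code",
}


@dataclass(frozen=True)
class ProviderSketch:
    """Stand-in for the fields listed in the Provider model."""
    npi: str
    first_name: str
    last_name: str
    address_line_1: str
    city: str
    state: str
    postal_code: str


def provider_from_row(row):
    """Build a ProviderSketch from a dict keyed by CSV column names."""
    return ProviderSketch(
        **{field: row[column] for column, field in COLUMN_TO_FIELD.items()}
    )


row = {
    "npi": "9876543210", "firstname": "JANE", "lastname": "DOE",
    "address1": "456 ELM AVE", "city": "BOSTON", "state": "MA",
    "zipcode": "02101",
}
provider = provider_from_row(row)
print(provider)
```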
CLI Commands
dedupe
Main deduplication command.
provider-dedupe dedupe INPUT_FILE OUTPUT_FILE [OPTIONS]
analyze
Data quality analysis.
provider-dedupe analyze INPUT_FILE [OPTIONS]
visualize
Generate visualizations.
provider-dedupe visualize RESULTS_FILE [OPTIONS]
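When driving the CLI from a script, it can be convenient to build the argument list in one place. The helper below is a hypothetical sketch that uses only the `dedupe` options listed in this README; it builds the command but does not execute it.

```python
from typing import List, Optional


def build_dedupe_command(
    input_file: str,
    output_file: str,
    threshold: Optional[float] = None,
    config: Optional[str] = None,
    generate_report: bool = False,
) -> List[str]:
    """Assemble the argument list for `provider-dedupe dedupe`."""
    cmd = ["provider-dedupe", "dedupe", input_file, output_file]
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        cmd += ["--threshold", str(threshold)]
    if config is not None:
        cmd += ["--config", config]
    if generate_report:
        cmd.append("--generate-report")
    return cmd


cmd = build_dedupe_command(
    "providers.csv", "results.csv", threshold=0.95, generate_report=True
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.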
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
- Fork the repository
- Create a feature branch
- Install development dependencies
- Make your changes
- Run tests and quality checks
- Submit a pull request
Code Standards
- Follow PEP 8 style guidelines
- Add type hints to all functions
- Write docstrings for all public APIs
- Include unit tests for new features
- Update documentation as needed
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Acknowledgments
- Built with Splink by the UK Ministry of Justice
- Thank you to all contributors and users
Built with ❤️ for the open source healthcare data community