Provider Dedupe

A command-line tool for deduplicating healthcare provider data using probabilistic record linkage with the Splink library.

📦 Installation

From PyPI

pip install provider-dedupe

Development Installation

# Clone the repository
git clone https://github.com/taylor-hickman/provider_dedupe.git
cd provider_dedupe

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,viz,excel,parquet]"

# Install pre-commit hooks (optional)
pre-commit install

🏃‍♂️ Quick Start

Basic Usage Examples

# Example 1: Simple deduplication with default settings
provider-dedupe dedupe providers.csv deduplicated_providers.csv

# Example 2: Deduplication with custom threshold and HTML report
provider-dedupe dedupe providers.csv results.csv --threshold 0.95 --generate-report

# Example 3: Using a configuration file for advanced settings
provider-dedupe dedupe providers.csv output.xlsx --config config.json --generate-report

# Example 4: Analyze data quality before deduplication
provider-dedupe analyze providers.csv --output-dir quality_reports/

# Example 5: Generate visualizations from results
provider-dedupe visualize deduplicated_providers.csv --output-dir visualizations/

Sample Input Data Format

Your CSV file should look like this:

npi,firstname,lastname,address1,city,state,zipcode
1234567890,JOHN,SMITH,123 MAIN ST,NEW YORK,NY,10001
1234567890,JOHN,SMITH,123 MAIN STREET,NEW YORK,NY,10001
9876543210,JANE,DOE,456 ELM AVE,BOSTON,MA,02101
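
The first two sample rows differ only in "ST" versus "STREET". A minimal sketch of the kind of address normalization that lets such rows compare as equal is shown here; the `normalize_address` helper and its abbreviation table are hypothetical illustrations, not the package's own normalization code.

```python
import re

# Hypothetical abbreviation table; a real one would be far longer.
STREET_SUFFIXES = {
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "ROAD": "RD",
}


def normalize_address(address: str) -> str:
    """Uppercase, strip punctuation, collapse whitespace, abbreviate suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", address.upper())
    tokens = cleaned.split()
    return " ".join(STREET_SUFFIXES.get(token, token) for token in tokens)


print(normalize_address("123 Main Street"))  # 123 MAIN ST
print(normalize_address("123 MAIN ST."))     # 123 MAIN ST
```

Normalizing before comparison means exact and fuzzy comparisons see fewer spurious differences, so the linkage model can concentrate on real disagreements.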

Command Line Options

provider-dedupe dedupe --help

Options:
  --threshold FLOAT         Match threshold (0.0-1.0) [default: 0.95]
  --config PATH            Path to configuration file
  --output-format TEXT     Output format: csv, excel, json, parquet [default: csv]
  --generate-report        Generate HTML report with statistics
  --blocking-rules TEXT    Custom blocking rules (JSON format)
  --batch-size INTEGER     Batch size for processing [default: 50000]
  --help                  Show this message and exit
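
The `--threshold` option takes a value between 0.0 and 1.0. Assuming it is compared against a match probability from a Fellegi-Sunter style model, the model Splink is built on (how this tool applies the threshold is an assumption, not something stated here), the sketch below shows how per-field evidence could combine into such a probability. The prior and the m/u values are invented for illustration.

```python
from math import prod


def match_probability(prior: float, bayes_factors: list[float]) -> float:
    """Combine a prior match probability with per-field Bayes factors (m / u)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(bayes_factors)
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative numbers only:
# m = P(field agrees | records match), u = P(field agrees | records do not match).
factors = [
    0.99 / 0.0001,  # NPI agrees
    0.95 / 0.01,    # last name agrees
    0.90 / 0.05,    # ZIP code agrees
]
probability = match_probability(prior=0.001, bayes_factors=factors)
print(probability >= 0.95)  # True: agreement on all three fields clears the default threshold
```

A higher threshold trades recall for precision: fewer pairs are merged, but the merged ones are more certain.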

Python API Example

from provider_dedupe import ProviderDeduplicator
from provider_dedupe.core.config import DeduplicationConfig

# Example 1: Basic usage
deduplicator = ProviderDeduplicator()
results_df, stats = deduplicator.run_deduplication("providers.csv")
print(f"Found {stats['duplicates_found']} duplicate records")
print(f"Merged into {stats['unique_providers']} unique providers")

# Example 2: Custom configuration
config = DeduplicationConfig(
    match_threshold=0.98,
    blocking_rules=[
        {"rule": "l.npi = r.npi", "description": "Exact NPI match"},
        {"rule": "l.zipcode = r.zipcode AND l.lastname = r.lastname", 
         "description": "Same ZIP and last name"}
    ]
)
deduplicator = ProviderDeduplicator(config=config)
results_df, stats = deduplicator.run_deduplication("providers.csv")

# Example 3: Save results in multiple formats
deduplicator.save_results(results_df, "output.csv", format="csv")
deduplicator.save_results(results_df, "output.xlsx", format="excel")
deduplicator.save_results(results_df, "output.json", format="json")

📊 Input Data Format

The system expects CSV files with the following columns:

Column             Description                   Required
npi                National Provider Identifier  ✅
firstname          Provider first name           ✅
lastname           Provider last name            ✅
address1           Street address                ✅
city               City name                     ✅
state              State code (2 letters)        ✅
zipcode            ZIP/postal code               ✅
gnpi               Group NPI                     ❌
group_name         Organization name             ❌
primary_spec_desc  Specialty                     ❌
phone              Phone number                  ❌
address_status     Address quality               ❌
phone_status       Phone quality                 ❌
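
Before running a long job it can help to confirm that the required columns are present. The following is a small sketch using only the standard library; the `missing_required_columns` helper is hypothetical, and matching header names case-insensitively is an assumption rather than documented behavior of the tool.

```python
import csv
import io

# Required column names, as listed for the input format.
REQUIRED_COLUMNS = ("npi", "firstname", "lastname", "address1", "city", "state", "zipcode")


def missing_required_columns(csv_text: str) -> list[str]:
    """Return the required columns that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    # Assumption: header names are compared case-insensitively.
    present = {name.strip().lower() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


sample = (
    "npi,firstname,lastname,address1,city,state\n"
    "1234567890,JOHN,SMITH,123 MAIN ST,NEW YORK,NY\n"
)
print(missing_required_columns(sample))  # ['zipcode']
```

For a file on disk, the same function can be fed `Path("providers.csv").read_text()`.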

⚙️ Configuration

Configuration File Structure

{
  "match_threshold": 0.95,
  "max_iterations": 20,
  "em_convergence": 0.001,
  "blocking_rules": [
    {
      "rule": "l.npi = r.npi",
      "description": "Exact NPI match"
    }
  ],
  "comparisons": [
    {
      "column_name": "npi",
      "comparison_type": "exact",
      "term_frequency_adjustments": false
    }
  ]
}
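
A configuration file like this can be sanity-checked before it is passed to `--config`. The sketch below is a hypothetical `validate_config` helper; which keys are mandatory is an assumption made for illustration, since the tool's own validation rules are not documented here.

```python
import json


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a configuration dictionary."""
    problems = []
    threshold = config.get("match_threshold")
    if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
        problems.append("match_threshold must be a number between 0.0 and 1.0")
    for index, rule in enumerate(config.get("blocking_rules", [])):
        if not rule.get("rule"):
            problems.append(f"blocking_rules[{index}] is missing a 'rule' expression")
    for index, comparison in enumerate(config.get("comparisons", [])):
        if "column_name" not in comparison:
            problems.append(f"comparisons[{index}] is missing 'column_name'")
    return problems


config = json.loads("""
{
  "match_threshold": 0.95,
  "blocking_rules": [{"rule": "l.npi = r.npi", "description": "Exact NPI match"}],
  "comparisons": [{"column_name": "npi", "comparison_type": "exact"}]
}
""")
print(validate_config(config))  # []
```

Returning a list of problems instead of raising on the first one lets every issue in the file be reported in a single pass.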

Environment Variables

# Set via .env file or environment
PROVIDER_DEDUPE_LOG_LEVEL=INFO
PROVIDER_DEDUPE_DATA_DIR=/path/to/data
PROVIDER_DEDUPE_OUTPUT_DIR=/path/to/output
PROVIDER_DEDUPE_MAX_WORKERS=4
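
As an illustration of how these variables might be consumed, the sketch below reads them into a small settings object. The `EnvSettings` class, the `load_env_settings` function, and the fallback defaults are all hypothetical; the package's actual settings handling may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvSettings:
    log_level: str
    data_dir: str
    output_dir: str
    max_workers: int


def load_env_settings(environ=os.environ) -> EnvSettings:
    """Read the PROVIDER_DEDUPE_* variables, falling back to assumed defaults."""
    return EnvSettings(
        log_level=environ.get("PROVIDER_DEDUPE_LOG_LEVEL", "INFO"),
        data_dir=environ.get("PROVIDER_DEDUPE_DATA_DIR", "."),
        output_dir=environ.get("PROVIDER_DEDUPE_OUTPUT_DIR", "."),
        max_workers=int(environ.get("PROVIDER_DEDUPE_MAX_WORKERS", "1")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_env_settings({"PROVIDER_DEDUPE_MAX_WORKERS": "4"})
print(settings.max_workers)  # 4
```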

🏗️ Architecture

src/provider_dedupe/
├── core/                    # Core business logic
│   ├── config.py            # Configuration management
│   ├── deduplicator.py      # Main deduplication engine
│   └── exceptions.py        # Custom exceptions
├── models/                  # Data models
│   └── provider.py          # Provider and record models
├── services/                # Service layer
│   ├── data_loader.py       # Multi-format data loading
│   ├── data_quality.py      # Quality analysis
│   └── output_generator.py  # Result output
├── utils/                   # Utilities
│   ├── logging.py           # Structured logging
│   └── normalization.py     # Text normalization
└── cli/                     # Command-line interface
    └── main.py              # CLI commands

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=provider_dedupe --cov-report=html

# Run specific test categories
pytest tests/unit/
pytest tests/integration/

# Run with verbose output
pytest -v

# Run performance tests
pytest tests/performance/ -m performance

📈 Performance

Optimization Tips

  • Use appropriate blocking rules for your data
  • Adjust max_pairs_for_training based on available memory
  • Enable parallel processing for large datasets
  • Consider data preprocessing to improve quality
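
To see why blocking rules matter, the following sketch counts candidate pairs with and without a blocking key on a tiny made-up dataset. It is plain Python for illustration only and does not use the tool or Splink.

```python
from collections import Counter

# Tiny illustrative dataset: (lastname, zipcode) for each record.
records = [
    ("SMITH", "10001"),
    ("SMITH", "10001"),
    ("SMITH", "10001"),
    ("DOE", "02101"),
    ("DOE", "02101"),
    ("LEE", "60601"),
]


def pair_count(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


def blocked_pair_count(keys) -> int:
    """Pairs compared when only records sharing a blocking key are paired."""
    return sum(pair_count(size) for size in Counter(keys).values())


print(pair_count(len(records)))     # 15 comparisons without blocking
print(blocked_pair_count(records))  # 4 comparisons when blocking on lastname + zipcode
```

The unblocked count grows quadratically with the number of records, so on large files a rule that is too loose dominates runtime, while one that is too strict can hide true duplicates from comparison.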

🔧 Development

Code Quality Tools

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Type checking
mypy src/

# Linting
flake8 src/ tests/

# Run all quality checks
pre-commit run --all-files

Project Structure

  • src/: Source code using src layout
  • tests/: Comprehensive test suite
  • scripts/: Utility scripts
  • .github/: CI/CD workflows

📚 API Reference

Core Classes

ProviderDeduplicator

Main deduplication engine.

from pathlib import Path
from typing import Dict, Optional, Tuple, Union

import pandas as pd

class ProviderDeduplicator:
    def __init__(
        self,
        config: Optional[DeduplicationConfig] = None,
        data_loader: Optional[DataLoader] = None,
        quality_analyzer: Optional[DataQualityAnalyzer] = None,
    ) -> None: ...

    def load_data(self, input_path: Union[str, Path]) -> pd.DataFrame: ...
    def prepare_data(self) -> pd.DataFrame: ...
    def train_model(self) -> None: ...
    def deduplicate(self, threshold: Optional[float] = None) -> Tuple[pd.DataFrame, Dict]: ...

Provider

Data model for provider information.

class Provider(BaseModel):
    npi: str
    first_name: str
    last_name: str
    address_line_1: str
    city: str
    state: str
    postal_code: str
    # ... additional fields
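
The model stores `npi` as a string. NPIs carry a check digit computed with the Luhn algorithm over the number prefixed with 80840, so a cheap validity check is possible before linkage. The `is_valid_npi` function below is a hypothetical helper, not part of the documented API, and whether the package performs this check itself is not stated here.

```python
def is_valid_npi(npi: str) -> bool:
    """Check a 10-digit NPI using the Luhn algorithm with the 80840 prefix."""
    if len(npi) != 10 or not (npi.isascii() and npi.isdigit()):
        return False
    digits = [int(ch) for ch in "80840" + npi]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(is_valid_npi("1234567893"))  # True
print(is_valid_npi("1234567890"))  # False
```

Made-up identifiers such as 1234567890 fail the check digit, which is expected for placeholder data.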

CLI Commands

dedupe

Main deduplication command.

provider-dedupe dedupe INPUT_FILE OUTPUT_FILE [OPTIONS]

analyze

Data quality analysis.

provider-dedupe analyze INPUT_FILE [OPTIONS]

visualize

Generate visualizations.

provider-dedupe visualize RESULTS_FILE [OPTIONS]
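
When the CLI is driven from a script, the argument list can be assembled programmatically. The sketch below uses only the `dedupe` flags listed under Command Line Options; the `build_dedupe_command` helper itself is hypothetical.

```python
import shlex
from typing import Optional


def build_dedupe_command(
    input_file: str,
    output_file: str,
    threshold: Optional[float] = None,
    config: Optional[str] = None,
    generate_report: bool = False,
) -> list[str]:
    """Assemble the argument list for `provider-dedupe dedupe`."""
    command = ["provider-dedupe", "dedupe", input_file, output_file]
    if threshold is not None:
        command += ["--threshold", str(threshold)]
    if config is not None:
        command += ["--config", config]
    if generate_report:
        command.append("--generate-report")
    return command


command = build_dedupe_command(
    "providers.csv", "results.csv", threshold=0.95, generate_report=True
)
print(shlex.join(command))
# provider-dedupe dedupe providers.csv results.csv --threshold 0.95 --generate-report
```

Passing the resulting list to `subprocess.run(command, check=True)` runs the tool without going through a shell, which avoids quoting problems with unusual file names.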

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Install development dependencies
  4. Make your changes
  5. Run tests and quality checks
  6. Submit a pull request

Code Standards

  • Follow PEP 8 style guidelines
  • Add type hints to all functions
  • Write docstrings for all public APIs
  • Include unit tests for new features
  • Update documentation as needed

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Splink by the UK Ministry of Justice
  • Thank you to all contributors and users

Built with ❤️ for the open source healthcare data community
