PHIdelity

License: MIT

Overview

PHIdelity is a Python tool that intelligently anonymizes Protected Health Information (PHI) in clinical notes while preserving their contextual meaning. Unlike basic redaction methods that obscure data, this tool uses a local Large Language Model (LLM) via Ollama to identify PHI (e.g., names, dates, medical record numbers) and replace it with meaningful, generalized descriptions (e.g., [Patient Name], [Date of Visit]). This contextualized anonymization ensures the anonymized notes remain useful for research, analysis, or sharing while complying with privacy regulations like HIPAA.

The tool extracts PHI into a structured CSV file and saves the anonymized note as a text file, making it ideal for healthcare professionals, researchers, and developers handling sensitive medical data.

Key Features

  • Contextualized Anonymization: Replaces PHI with descriptive placeholders that retain the note's meaning (e.g., "John Doe" becomes [Patient Name]), enhancing usability for downstream applications.
  • Advanced PHI Detection: Leverages a local LLM (default: qwen3:4B) to identify a wide range of PHI, including names, dates, medical record numbers, and more.
  • Structured Output: Saves PHI to a CSV file with unique IDs, types, values, and descriptions for easy tracking and auditing.
  • Anonymized Note Export: Generates a text file with the anonymized clinical note, ready for secure sharing or analysis.
  • Configurable and Local: Runs on a local Ollama server, ensuring data privacy and allowing customization of the LLM model and output paths.
  • Open Source: Licensed under the MIT License, inviting community contributions and adoption.
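
The Structured Output feature can be pictured with a short sketch. This is not PHIdelity's actual code: the `assign_ids` and `write_phi_csv` helpers and the record layout are assumptions made for illustration, modeled on the ID, type, value, and description fields listed above. It relies on Python's standard `csv` module, which automatically quotes values that contain commas (such as dates).

```python
import csv
import io
from collections import defaultdict

FIELDNAMES = ["ID", "type", "value", "description"]


def assign_ids(items):
    """Give each PHI item an ID such as redacted_name_001 (hypothetical scheme)."""
    counters = defaultdict(int)  # one running counter per PHI type
    rows = []
    for item in items:
        slug = item["type"].lower().replace(" ", "_")
        counters[slug] += 1
        rows.append({"ID": f"redacted_{slug}_{counters[slug]:03d}", **item})
    return rows


def write_phi_csv(rows, handle):
    """Write PHI rows to an open text handle in CSV form."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


items = [
    {"type": "Name", "value": "John Doe", "description": "Patient Name"},
    {"type": "Date", "value": "June 11, 2025", "description": "Date of Visit"},
    {"type": "Name", "value": "Dr. Jane Smith", "description": "Attending Physician Name"},
]
rows = assign_ids(items)

# An in-memory buffer stands in for phi_data.csv so the sketch has no side effects.
buffer = io.StringIO()
write_phi_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a buffer, opening it with `newline=""` is the usual recommendation for the `csv` module.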

Why Contextualized Anonymization?

Traditional anonymization methods often replace PHI with generic markers (e.g., [REDACTED]) or random strings, which can obscure the note's meaning and reduce its value for research or clinical review. PHIdelity addresses this by:

  • Preserving Semantics: Descriptive placeholders like [Attending Physician Name] or [Medical Record Number] maintain the note's context, making it interpretable for humans and machines.
  • Supporting Use Cases: Anonymized notes remain suitable for medical research, machine learning training, or educational purposes without compromising privacy.
  • Supporting Compliance: By removing identifiable information while retaining structure, the tool helps meet strict privacy standards like HIPAA.
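
To make the contrast with generic redaction concrete, here is a minimal sketch of the substitution step. It assumes the PHI values and their descriptions are already known (in PHIdelity they come from the LLM). The function name `apply_placeholders` and the record layout are hypothetical, and plain string replacement stands in for whatever matching the tool actually performs.

```python
def apply_placeholders(note, phi_items):
    """Replace each PHI value in the note with its bracketed description."""
    # Replace longer values first so a short value that happens to be a
    # substring of a longer one cannot break the longer match.
    for item in sorted(phi_items, key=lambda i: len(i["value"]), reverse=True):
        note = note.replace(item["value"], f"[{item['description']}]")
    return note


note = (
    "Name: John Doe\n"
    "Medical Record Number: 123456\n"
    "Physician: Dr. Jane Smith"
)
phi_items = [
    {"value": "John Doe", "description": "Patient Name"},
    {"value": "123456", "description": "Medical Record Number"},
    {"value": "Dr. Jane Smith", "description": "Attending Physician Name"},
]

anonymized = apply_placeholders(note, phi_items)
print(anonymized)
```

A reader of the result can still tell which role each removed name played, which a uniform [REDACTED] marker would hide.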

Prerequisites

  • Python: Version 3.8 or higher.
  • Ollama: A running Ollama server (default: http://localhost:11434/) with the qwen3:4B model installed. See Ollama's documentation for setup.
  • Dependencies: Python packages listed in requirements.txt.

Installation

  1. Clone the Repository:

    git clone https://github.com/your-username/clinical-note-anonymizer.git
    cd clinical-note-anonymizer
    
  2. Set Up a Virtual Environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install Dependencies:

    pip install -r requirements.txt
    
  4. Install and Configure Ollama:

    • Install Ollama from ollama.ai.
    • Start the Ollama server:
      ollama serve
      
    • Pull the required model:
      ollama pull qwen3:4B
      
  5. Verify Setup: Confirm the Ollama server is running at http://localhost:11434/:

    curl http://localhost:11434/api/generate -d '{"model": "qwen3:4B", "prompt": "Test"}'
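
The same check can be run from Python, which may be convenient on systems without curl. This sketch uses only the standard library; `build_generate_request` and `ollama_is_ready` are illustrative names, not part of PHIdelity. Setting `"stream": False` asks Ollama to return a single JSON object rather than a stream of chunks.

```python
import json
import urllib.error
import urllib.request

OLLAMA_ENDPOINT = "http://localhost:11434/"  # default stated in this README
OLLAMA_MODEL = "qwen3:4B"                     # default stated in this README


def build_generate_request(endpoint, model, prompt):
    """Build a POST request for Ollama's /api/generate route."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        endpoint.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_is_ready(endpoint=OLLAMA_ENDPOINT, model=OLLAMA_MODEL, timeout=60):
    """Return True if the server answers a test prompt, False otherwise."""
    request = build_generate_request(endpoint, model, "Test")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Covers an unreachable server, HTTP errors, timeouts, and invalid JSON.
        return False
    return "response" in body
```

Calling `ollama_is_ready()` before processing a note gives an early, readable failure if the server is down or the model has not been pulled.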
    

Usage

  1. Prepare a Clinical Note: The script includes a sample clinical note in anonymizer.py. Modify the clinical_note variable or provide your own note as a string.

  2. Run the Script: Process the clinical note to detect PHI and generate outputs:

    python anonymizer.py
    
  3. Outputs:

    • PHI Data (phi_data.csv): A CSV file with columns: ID, type, value, description.
    • Anonymized Note (anonymized_note.txt): A text file with PHI replaced by contextual placeholders.
    • Console Output: Shows the LLM's JSON output, the anonymized note, and status messages.
  4. Example Output:

    • phi_data.csv:
      ID,type,value,description
      redacted_name_001,Name,John Doe,Patient Name
      redacted_date_001,Date,"June 11, 2025",Date of Visit
      redacted_medical_record_number_001,Medical Record Number,123456,Medical Record Number
      redacted_name_002,Name,Dr. Jane Smith,Attending Physician Name
      ...
      
    • anonymized_note.txt:
      Radiation Oncology Clinical Note
      Date of Visit: [Date of Visit]
      Patient Information
      
      Name: [Patient Name]
      Age: 65 years old
      Medical Record Number: [Medical Record Number]
      ...
      Physician: [Attending Physician Name]
      
  5. Customize Configuration: Edit anonymizer.py to adjust:

    • OLLAMA_ENDPOINT: Ollama server URL (default: http://localhost:11434/).
    • OLLAMA_MODEL: LLM model (default: qwen3:4B).
    • Output file paths in generate_phi_csv and anonymize_clinical_note.
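
The console output in step 3 includes the LLM's JSON. The sketch below shows one way such output could be turned into PHI records. It is hedged throughout: the JSON shape (a list of objects with `type`, `value`, and `description` keys) is an assumption, since this README does not document it, and `parse_phi_response` is a hypothetical helper rather than a function in anonymizer.py. It also allows for models that may emit a `<think>...</think>` reasoning block before the answer.

```python
import json
import re

REQUIRED_KEYS = ("type", "value", "description")


def parse_phi_response(text):
    """Extract a list of PHI records from raw model output (assumed format)."""
    # Drop any reasoning block the model may have emitted before the answer.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Take the outermost JSON array, ignoring any prose around it.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model output")
    records = json.loads(text[start:end + 1])
    # Keep only dict records in which every required key has a non-empty value.
    return [
        {key: str(record[key]) for key in REQUIRED_KEYS}
        for record in records
        if isinstance(record, dict)
        and all(record.get(key) for key in REQUIRED_KEYS)
    ]


raw = """<think>The note mentions a patient and a visit date.</think>
[
  {"type": "Name", "value": "John Doe", "description": "Patient Name"},
  {"type": "Date", "value": "June 11, 2025", "description": "Date of Visit"},
  {"type": "Name", "value": "", "description": "Empty value, dropped"}
]"""

parsed = parse_phi_response(raw)
for record in parsed:
    print(record)
```

Validating the output this way keeps malformed or empty entries out of the CSV and the anonymized note.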

File Structure

clinical-note-anonymizer/
├── anonymizer.py          # Main script for PHI detection and anonymization
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation (this file)
├── phi_data.csv           # Output CSV file (generated)
└── anonymized_note.txt    # Output anonymized note (generated)

Contributing

We welcome contributions to enhance PHIdelity, especially improvements to contextualization, LLM integration, or output formats. To contribute:

  1. Fork the Repository: Create a fork on GitHub.
  2. Create a Branch: Use a descriptive name (e.g., feature/improve-phi-detection).
  3. Make Changes: Implement and test your changes.
  4. Submit a Pull Request: Include a clear description and reference related issues.

Review the Contributing Guidelines and Code of Conduct before submitting.

Issues and Support

Encounter a problem? Please:

  • Check the Issues page for similar reports.
  • Open a new issue with details, including error messages and reproduction steps.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Powered by Ollama for secure, local LLM inference.
  • Motivated by the need for privacy-preserving tools in healthcare that balance compliance and data utility.

Contact

For inquiries or collaboration, use GitHub Issues.


Last updated: June 11, 2025
