PHIdelity

License: MIT

Overview

PHIdelity is a Python tool that anonymizes Protected Health Information (PHI) in clinical notes while preserving their contextual meaning. Unlike basic redaction, which simply blanks out data, it uses a local Large Language Model (LLM) served by Ollama to identify PHI (e.g., names, dates, medical record numbers) and replace it with meaningful, generalized descriptions (e.g., [Patient Name], [Date of Visit]). This contextualized anonymization keeps the anonymized notes useful for research, analysis, or sharing while supporting compliance with privacy regulations like HIPAA.

The tool extracts PHI into a structured CSV file and saves the anonymized note as a text file, making it ideal for healthcare professionals, researchers, and developers handling sensitive medical data.

Key Features

  • Contextualized Anonymization: Replaces PHI with descriptive placeholders that retain the note's meaning (e.g., "John Doe" becomes [Patient Name]), enhancing usability for downstream applications.
  • Advanced PHI Detection: Leverages a local LLM (default: qwen3:4B) to identify a wide range of PHI, including names, dates, medical record numbers, and more.
  • Structured Output: Saves PHI to a CSV file with unique IDs, types, values, and descriptions for easy tracking and auditing.
  • Anonymized Note Export: Generates a text file with the anonymized clinical note, ready for secure sharing or analysis.
  • Configurable and Local: Runs on a local Ollama server, ensuring data privacy and allowing customization of the LLM model and output paths.
  • Open Source: Licensed under the MIT License, inviting community contributions and adoption.
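
To make the "unique IDs" in the structured output concrete, here is a minimal sketch of one way such IDs could be assigned. It assumes the redacted_<type>_<NNN> pattern that appears in the example output later in this README; the function name assign_phi_ids and the dictionary layout are hypothetical and are not taken from anonymizer.py.

```python
from collections import defaultdict


def assign_phi_ids(entries):
    """Attach an ID such as 'redacted_name_001' to each PHI entry.

    Hypothetical helper: the ID pattern is inferred from the example
    output in this README, not from the project's source code.
    """
    counters = defaultdict(int)  # one running counter per PHI type
    result = []
    for entry in entries:
        # "Medical Record Number" -> "medical_record_number"
        slug = entry["type"].strip().lower().replace(" ", "_")
        counters[slug] += 1
        result.append({"ID": f"redacted_{slug}_{counters[slug]:03d}", **entry})
    return result


entries = [
    {"type": "Name", "value": "John Doe", "description": "Patient Name"},
    {"type": "Medical Record Number", "value": "123456",
     "description": "Medical Record Number"},
    {"type": "Name", "value": "Dr. Jane Smith",
     "description": "Attending Physician Name"},
]

for row in assign_phi_ids(entries):
    print(row["ID"], "->", row["description"])
```

Keeping a separate counter per PHI type means two names in the same note get distinct IDs (redacted_name_001, redacted_name_002) that can be traced back to the CSV during an audit.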

Why Contextualized Anonymization?

Traditional anonymization methods often replace PHI with generic markers (e.g., [REDACTED]) or random strings, which can obscure the note's meaning and reduce its value for research or clinical review. PHIdelity addresses this by:

  • Preserving Semantics: Descriptive placeholders like [Attending Physician Name] or [Medical Record Number] maintain the note's context, making it interpretable for humans and machines.
  • Supporting Use Cases: Anonymized notes remain suitable for medical research, machine learning training, or educational purposes without compromising privacy.
  • Ensuring Compliance: By removing identifiable information while retaining structure, the tool helps meet strict privacy standards like HIPAA.
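
The difference between the two approaches can be illustrated with a small sketch. It assumes the PHI has already been detected and is available as a list of value/description pairs; the anonymize function below is a hypothetical illustration using plain string replacement, not the project's actual implementation.

```python
def anonymize(note, phi_entries, contextual=True):
    """Replace each detected PHI value with a placeholder.

    Hypothetical sketch: with contextual=False every value becomes
    [REDACTED]; with contextual=True the entry's description is used.
    """
    # Replace longer values first so a short value that is a substring
    # of a longer one cannot break the longer match.
    for entry in sorted(phi_entries, key=lambda e: len(e["value"]), reverse=True):
        placeholder = f"[{entry['description']}]" if contextual else "[REDACTED]"
        note = note.replace(entry["value"], placeholder)
    return note


note = "John Doe was seen by Dr. Jane Smith on June 11, 2025."
phi = [
    {"value": "John Doe", "description": "Patient Name"},
    {"value": "Dr. Jane Smith", "description": "Attending Physician Name"},
    {"value": "June 11, 2025", "description": "Date of Visit"},
]

print(anonymize(note, phi, contextual=False))
# [REDACTED] was seen by [REDACTED] on [REDACTED].
print(anonymize(note, phi))
# [Patient Name] was seen by [Attending Physician Name] on [Date of Visit].
```

In the generic version a reader can no longer tell who is the patient and who is the physician; the contextual version keeps those roles while removing the identifying values.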

Prerequisites

  • Python: Version 3.8 or higher.
  • Ollama: A running Ollama server (default: http://localhost:11434/) with the qwen3:4B model installed. See Ollama's documentation for setup.
  • Dependencies: Python packages listed in requirements.txt.

Installation

  1. Clone the Repository:

    git clone https://github.com/your-username/clinical-note-anonymizer.git
    cd clinical-note-anonymizer
    
  2. Set Up a Virtual Environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install Dependencies:

    pip install -r requirements.txt
    
  4. Install and Configure Ollama:

    • Install Ollama from ollama.ai.
    • Start the Ollama server:
      ollama serve
      
    • Pull the required model:
      ollama pull qwen3:4B
      
  5. Verify Setup: Confirm the Ollama server is running at http://localhost:11434/:

    curl http://localhost:11434/api/generate -d '{"model": "qwen3:4B", "prompt": "Test"}'
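
The same check can be made from Python using only the standard library. This is a rough sketch, assuming the default endpoint and model named above; the helper names build_generate_request and check_ollama are hypothetical. Setting "stream" to false asks Ollama for a single JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

# Defaults taken from this README; adjust to match your setup.
OLLAMA_ENDPOINT = "http://localhost:11434/"
OLLAMA_MODEL = "qwen3:4B"


def build_generate_request(prompt, model=OLLAMA_MODEL, endpoint=OLLAMA_ENDPOINT):
    """Build a POST request for Ollama's /api/generate route (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        endpoint.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_ollama(prompt="Test", timeout=60):
    """Send the request and return the model's text reply.

    Raises urllib.error.URLError if the server is not reachable.
    """
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

Once the server is running, calling check_ollama() should return a short text reply; a connection error usually means ollama serve is not running or is listening on a different address.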
    

Usage

  1. Prepare a Clinical Note: The script includes a sample clinical note in anonymizer.py. Modify the clinical_note variable or provide your own note as a string.

  2. Run the Script: Process the clinical note to detect PHI and generate outputs:

    python anonymizer.py
    
  3. Outputs:

    • PHI Data (phi_data.csv): A CSV file with columns: ID, type, value, description.
    • Anonymized Note (anonymized_note.txt): A text file with PHI replaced by contextual placeholders.
    • Console Output: Shows the LLM's JSON output, the anonymized note, and status messages.
  4. Example Output:

    • phi_data.csv:
      ID,type,value,description
      redacted_name_001,Name,John Doe,Patient Name
      redacted_date_001,Date,"June 11, 2025",Date of Visit
      redacted_medical_record_number_001,Medical Record Number,123456,Medical Record Number
      redacted_name_002,Name,Dr. Jane Smith,Attending Physician Name
      ...
      
    • anonymized_note.txt:
      Radiation Oncology Clinical Note
      Date of Visit: [Date of Visit]
      Patient Information
      
      Name: [Patient Name]
      Age: 65 years old
      Medical Record Number: [Medical Record Number]
      ...
      Physician: [Attending Physician Name]
      
  5. Customize Configuration: Edit anonymizer.py to adjust:

    • OLLAMA_ENDPOINT: Ollama server URL (default: http://localhost:11434/).
    • OLLAMA_MODEL: LLM model (default: qwen3:4B).
    • Output file paths in generate_phi_csv and anonymize_clinical_note.
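
For readers who want to adapt the CSV output, the following sketch shows how rows with the four columns listed above could be written with Python's csv module. The README names a generate_phi_csv function but does not show it, so write_phi_csv here is a hypothetical stand-in. The example writes to an in-memory buffer so it can be run without touching the file system.

```python
import csv
import io

# Column names as listed in the Outputs step of this README.
FIELDNAMES = ["ID", "type", "value", "description"]


def write_phi_csv(rows, fileobj):
    """Write PHI rows to an open text file object (hypothetical helper)."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    {"ID": "redacted_name_001", "type": "Name",
     "value": "John Doe", "description": "Patient Name"},
    {"ID": "redacted_date_001", "type": "Date",
     "value": "June 11, 2025", "description": "Date of Visit"},
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
write_phi_csv(rows, buffer)
print(buffer.getvalue())

# Reading the data back recovers the date as a single field.
parsed = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline="")))
print(parsed[1]["value"])
```

To write a real file, pass open("phi_data.csv", "w", newline="", encoding="utf-8") instead of the buffer. The csv module quotes values that contain commas, such as dates, so they stay in one column when the file is read back.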

File Structure

clinical-note-anonymizer/
├── anonymizer.py          # Main script for PHI detection and anonymization
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation (this file)
├── phi_data.csv           # Output CSV file (generated)
└── anonymized_note.txt    # Output anonymized note (generated)

Contributing

We welcome contributions to enhance PHIdelity, especially improvements to contextualization, LLM integration, or output formats. To contribute:

  1. Fork the Repository: Create a fork on GitHub.
  2. Create a Branch: Use a descriptive name (e.g., feature/improve-phi-detection).
  3. Make Changes: Implement and test your changes.
  4. Submit a Pull Request: Include a clear description and reference related issues.

Review the Contributing Guidelines and Code of Conduct before submitting.

Issues and Support

Encounter a problem? Please:

  • Check the Issues page for similar reports.
  • Open a new issue with details, including error messages and reproduction steps.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Powered by Ollama for secure, local LLM inference.
  • Motivated by the need for privacy-preserving tools in healthcare that balance compliance and data utility.

Contact

For inquiries or collaboration, use GitHub Issues.


Last updated: June 11, 2025
