Stateful, exposure-aware de-identification for multimodal streaming healthcare data with adaptive RL policy control.
Stateful Exposure-Aware De-Identification for Multimodal Streaming Data
Research implementation of a stateful, risk-aware de-identification architecture for streaming multimodal systems.
This project demonstrates an alternative to static, document-level anonymization. Instead of treating privacy protection as a one-time preprocessing step, the system models cumulative identity exposure over time and dynamically adjusts masking strength in response to quantified re-identification risk.
Overview
Most de-identification pipelines operate per document:
detect PHI -> remove PHI -> store result
This approach assumes that risk is isolated within individual records. In practice, re-identification risk accumulates across events, modalities, and time.
A name fragment, identifier token, or cross-modal linkage that appears harmless in isolation may become identifying when repeated or combined with other signals.
This repository implements a stateful exposure-aware controller that:
- Maintains subject-level exposure state
- Computes rolling re-identification risk
- Incorporates recency and cross-modal linkage signals
- Dynamically selects masking strength
- Supports pseudonym versioning upon risk escalation
- Produces structured, reproducible audit logs
De-identification becomes a longitudinal control problem rather than a static transformation.
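The control loop above can be sketched in a few lines. This is a hypothetical illustration, not this package's actual API: the class name, decay half-life, and threshold values are assumptions chosen for the example.

```python
# Illustrative sketch of a stateful exposure controller; names and
# constants are assumptions, not this package's actual API.

DECAY_HALF_LIFE_S = 3600.0  # assumed: exposure halves every hour
# Candidate policies with their activation thresholds, strongest first.
THRESHOLDS = [(0.8, "redact"), (0.5, "pseudo"), (0.2, "weak")]

class ExposureState:
    """Per-subject cumulative exposure with recency decay."""

    def __init__(self):
        self.score = 0.0
        self.last_ts = None

    def observe(self, weight, ts):
        # Decay prior exposure by elapsed time, then add the new signal.
        if self.last_ts is not None:
            self.score *= 0.5 ** ((ts - self.last_ts) / DECAY_HALF_LIFE_S)
        self.score += weight
        self.last_ts = ts
        return self.score

def select_policy(score):
    """Pick the strongest masking policy whose threshold is met."""
    for threshold, policy in THRESHOLDS:
        if score >= threshold:
            return policy
    return "raw"

# Three PHI-bearing events one second apart escalate the policy:
state = ExposureState()
policies = [select_policy(state.observe(w, ts))
            for ts, w in enumerate([0.25, 0.3, 0.4])]
print(policies)  # ['weak', 'pseudo', 'redact']
```

Because the policy is a pure function of exposure state, decisions can be replayed deterministically for auditing.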
Architectural Characteristics
The system differs from conventional masking pipelines in several concrete ways:
Longitudinal Exposure Tracking: Identity exposure is accumulated and tracked over time at the subject level.
Risk-Governed Policy Selection: Masking strength is selected dynamically based on quantified risk thresholds.
Cross-Modal Linkage Modeling: Signals from text, ASR transcripts, image proxies, waveform headers, and audio metadata are aggregated to evaluate identity-level exposure.
Localized Retokenization: When risk increases, pseudonym tokens can be versioned forward, preserving linkage continuity without global reprocessing.
Auditability: All masking decisions are logged with structured metadata and can be reproduced deterministically from exposure state.
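Pseudonym versioning of the kind described above can be realized with deterministic keyed hashing. The following is a minimal sketch under assumed conventions; the key, token format, and function names are illustrative, not the repository's implementation.

```python
import hashlib
import hmac

# Sketch of versioned pseudonyms: the same (subject, version) pair always
# maps to the same token (reproducible for auditing), while bumping the
# version yields a fresh token unlinkable to earlier output.
# SECRET_KEY and the token format are assumptions for this sketch.

SECRET_KEY = b"demo-only-key"  # in practice, a managed secret

def pseudonym(subject_id: str, version: int) -> str:
    """Deterministic versioned pseudonym via keyed hashing (HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, f"{subject_id}:{version}".encode(),
                      hashlib.sha256).hexdigest()
    return f"SUBJ-{digest[:8]}-v{version}"

p1 = pseudonym("patient-42", 1)
p1_again = pseudonym("patient-42", 1)  # identical to p1
p2 = pseudonym("patient-42", 2)        # new version, new token
print(p1, p2)
```

Only records for the escalated subject need retokenization; everything else in the stream is untouched.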
Demonstration
The repository includes a fully synthetic streaming simulation.
Five policies are evaluated:
- raw
- weak
- pseudo
- redact
- adaptive
The adaptive controller escalates masking strength only when cumulative exposure justifies it.
Outputs include:
- policy_metrics.csv
- latency_summary.csv
- audit_log.jsonl
- EXPERIMENT_REPORT.md
- privacy_utility_curve.png
- sample_dag.png
All experiments are reproducible from source using synthetic data generated within the repository.
Run:
python -m amphi_rl_dpgraph.run_demo
Results are written to the results/ directory.
Testing
Run the test suite with verbose output:
pytest -vv
For explicit installation (recommended for notebooks/Colab):
pip install -e .
pytest -vv
To generate a machine-readable report plus a markdown summary:
pytest -vv --junitxml .artifacts/pytest.xml
python scripts/generate_test_report.py .artifacts/pytest.xml TEST_RESULTS.md
The latest checked-in summary is in TEST_RESULTS.md.
Data Description
This repository does not contain real clinical data, personal information, or protected health information.
All experiments operate on synthetically generated streams designed to simulate longitudinal healthcare data workflows. The synthetic data includes structured representations of:
- Clinical note text
- Speech transcription output
- Image proxy signals
- Waveform and monitoring features
The streams are constructed to model realistic structural properties relevant to privacy evaluation, including:
- Repeated subject mentions over time
- Identifier recurrence
- Variable disclosure frequency
- Cross-modal co-occurrence patterns
These properties allow controlled evaluation of cumulative identity exposure and adaptive masking behavior without exposing real individuals.
Synthetic data is used to ensure reproducibility, transparency, and safe public distribution of the research implementation.
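A stream with the structural properties listed above can be sketched as follows. The field names, modalities, and disclosure rate are illustrative assumptions, not the repository's actual generator.

```python
import random

# Sketch of a synthetic event stream exhibiting repeated subject mentions,
# cross-modal co-occurrence, and variable identifier disclosure.
# All names and probabilities here are assumptions for illustration.

random.seed(7)  # fixed seed for reproducibility
MODALITIES = ["note_text", "asr_transcript", "image_proxy", "waveform_header"]
SUBJECTS = [f"synthetic-{i}" for i in range(3)]

def synth_event(t):
    """One synthetic event at time t."""
    return {
        "t": t,
        "subject": random.choice(SUBJECTS),       # repeated mentions over time
        "modality": random.choice(MODALITIES),    # cross-modal co-occurrence
        "has_identifier": random.random() < 0.3,  # variable disclosure rate
    }

stream = [synth_event(t) for t in range(20)]
print(sum(e["has_identifier"] for e in stream), "identifier disclosures")
```

Seeding the generator makes each run of the simulation bit-for-bit repeatable.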
Privacy–Utility Evaluation
The demo evaluates:
- Residual PHI leakage
- Utility proxy metrics
- Latency distribution
- Adaptive escalation behavior
The objective is not to eliminate utility through maximal redaction, but to demonstrate controlled escalation based on exposure accumulation.
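The leakage and utility proxies can be made concrete with a token-level sketch. The definitions below are assumptions for illustration (leakage = fraction of PHI tokens surviving masking; utility = fraction of non-PHI tokens preserved), not necessarily the exact metrics computed by the demo.

```python
# Token-level privacy/utility proxies under assumed definitions;
# this is a sketch, not the repository's exact metric code.

def leakage_and_utility(original, masked, phi_tokens):
    """Compare token streams before and after masking."""
    phi = [tok for tok in original if tok in phi_tokens]
    non_phi = [tok for tok in original if tok not in phi_tokens]
    kept = set(masked)
    leakage = sum(tok in kept for tok in phi) / max(len(phi), 1)
    utility = sum(tok in kept for tok in non_phi) / max(len(non_phi), 1)
    return leakage, utility

original = ["John", "Smith", "admitted", "with", "chest", "pain"]
masked = ["[NAME]", "[NAME]", "admitted", "with", "chest", "pain"]
leak, util = leakage_and_utility(original, masked, {"John", "Smith"})
print(leak, util)  # 0.0 1.0
```

An ideal adaptive policy drives leakage toward zero while keeping utility high; maximal redaction trivially achieves the former at the cost of the latter.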
Intended Use
This repository is intended for:
- Research in privacy-preserving machine learning
- Streaming system design
- Exposure-aware masking strategies
- Longitudinal risk modeling
- Reproducible evaluation of privacy–utility tradeoffs
It is not a production-ready compliance system.
Security and Data Safety Policy
Data Restrictions
This repository must not contain real patient data, protected health information (PHI), or identifiable personal data.
All demonstrations and experiments run exclusively on synthetic data generated within the repository or on publicly permitted datasets.
The following must never be uploaded:
- Clinical notes derived from real individuals
- Hospital records or EHR exports
- Medical images associated with identifiable persons
- Audio recordings of patients
- Any dataset containing direct or indirect identifiers
If sensitive data is discovered, do not open a public issue. Contact the maintainer directly for immediate removal.
Ethical Scope and Research Boundaries
This project studies adaptive privacy control mechanisms for streaming and multimodal systems.
It does not collect, process, or distribute real clinical data.
The methods demonstrated here are intended to strengthen privacy protection. They are not designed to weaken safeguards or enable re-identification.
When adapting this code to real-world systems, implementers must ensure:
- Institutional and regulatory compliance
- Independent security controls
- Data governance review
- Validation under applicable legal frameworks
Privacy protection in regulated domains requires layered safeguards. This repository addresses one technical layer: exposure-aware masking.
It should not be treated as a substitute for comprehensive compliance infrastructure.
Citation
If you use this software in academic or technical work, please cite it via the included CITATION.cff file.
Title:
Stateful Exposure-Aware De-Identification for Multimodal Streaming Data
Patent notice
This repository is associated with a U.S. provisional patent application filed on 2025-07-05.
Public release (GitHub): 2026-03-02.
License
MIT License. See LICENSE.
File details
Details for the file phi_exposure_guard-0.1.0.tar.gz.
File metadata
- Download URL: phi_exposure_guard-0.1.0.tar.gz
- Upload date:
- Size: 45.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e7e877f5a5a8607ea33671bec291daa127ee5157328eee9467cee4c6acba4241 |
| MD5 | 0ae0cac32adf3e9b8b5833cd5de25367 |
| BLAKE2b-256 | baed23a0aa2bcba39365f334b7d1f752c4af613706125b9d19fdc5061d1676c4 |
File details
Details for the file phi_exposure_guard-0.1.0-py3-none-any.whl.
File metadata
- Download URL: phi_exposure_guard-0.1.0-py3-none-any.whl
- Upload date:
- Size: 47.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | aae5b805294ca8dc3318f4c6aa7cb862bc046184853165da09e5471d042af418 |
| MD5 | 338f1b41454678a11356e67935f835a3 |
| BLAKE2b-256 | 3e969bb33e5375a866f95530ed1d22dc0797d75437280b480013a88f8cb022e7 |