Composable PII anonymization pipeline for LLM agents, with a LangChain middleware and pluggable detectors (NER, regex, LLM, custom).

PIIGhost

README EN - README FR

Documentation EN - Documentation FR

piighost is a composable PII anonymization pipeline for LLM agents. It sits as a layer on top of any regex, NER, or LLM detector you plug in, so you can use a hosted LLM (GPT, Claude, Gemini) without ever sending it your users' raw data. piighost spots PII the model does not need to see (names, emails, addresses), swaps it for placeholders the LLM can still reason about (for example <<PERSON:1>>, <<EMAIL:2>>, <<LOCATION:1>>), and restores the real values for your tools and your end users. The same PII keeps the same placeholder across an entire conversation, even when it spans multiple messages or tool calls, and your agent code does not change.

On top of the core pipeline, piighost ships extra layers that harden each step: composable detectors with confidence arbitration for detection, a tolerant linker that handles typos and case variants, and output guardrails (regex or LLM-based) for the case where the LLM accidentally generates fresh PII in its response.

sequenceDiagram
    autonumber
    participant U as User
    participant M as piighost
    participant L as LLM
    participant T as Tool

    U->>M: "Email Patrick at patrick@acme.com"
    M->>L: "Email <<PERSON:1>> at <<EMAIL:1>>"
    L->>M: tool_call(send_email, to=<<EMAIL:1>>)
    M->>T: send_email(to="patrick@acme.com")
    T-->>M: "Sent."
    M-->>L: "Sent."
    L-->>M: "Done, your email to <<PERSON:1>> went out."
    M-->>U: "Done, your email to Patrick went out."

The LLM only ever sees <<PERSON:1>> and <<EMAIL:1>>. Your send_email tool still receives the real address. The end user receives a deanonymized response. Zero changes to your agent code.

Why piighost?

When you ship an LLM feature, you usually pick one of three families of providers, and each one forces a trade-off:

  • Hosted non-EU clouds (OpenAI, Anthropic, Google): the best models, but every byte of context, including raw user PII, leaves your jurisdiction.
  • EU-sovereign clouds (Mistral AI, OVHcloud, Scaleway): legal guarantees on residency, but you give up some of the state of the art.
  • Self-hosted open weights: total control, but you carry the infra cost and you stay further from the state of the art.

The only clean way to decouple the LLM from the sensitivity of the content is to anonymize upstream. Once PII never reaches the model, the choice of provider stops being a privacy decision and goes back to being a quality / cost / latency one. That's the slot piighost fills.

How piighost compares to LangChain's built-in PIIMiddleware, Microsoft Presidio, and plain regex:

  • Pluggable detectors (NER, regex, LLM, …): piighost ✅ via the AnyDetector protocol; LangChain ⚠️ regex / Presidio only; Presidio ⚠️ tied to spaCy recognizers
  • Compose multiple detectors: piighost CompositeDetector + span resolver; LangChain ❌ one strategy per instance; Presidio ⚠️ partial
  • Cross-message entity linking: piighost ThreadAnonymizationPipeline + memory
  • Tolerates case / typo variants: piighost ExactEntityLinker + FuzzyEntityResolver
  • Reversible anonymization (deanonymize): piighost ✅ cache-backed; LangChain ❌ block / mask only; Presidio ⚠️ separate API
  • LangChain / LangGraph middleware: piighost PIIAnonymizationMiddleware; LangChain PIIMiddleware
  • Per-tool deanonymize / re-anonymize: piighost awrap_tool_call
  • Async-first API: piighost; LangChain ⚠️; Presidio ⚠️
  • Bring-your-own placeholder format: piighost AnyPlaceholderFactory; LangChain ⚠️ template-only; Presidio ⚠️ template-only; regex depends

LangChain's built-in PIIMiddleware is the closest neighbour: it wires anonymization into the agent loop, but it is one-shot (block / redact / mask / hash) and can't deanonymize for end users or pass real values to tools. piighost keeps the same hook point and adds the round trip, the cross-message memory, and the swappable detection stack, so the LLM sees placeholders while the rest of the system keeps working with real data.

What makes this hard

On paper, "replace names with tokens before the LLM call" is a one-liner. In practice, four problems show up immediately, and they are what shape the pipeline:

  • Placeholder consistency. If Patrick is <<PERSON:1>> at line 1, every later mention of Patrick must stay <<PERSON:1>>. Restart the counter and the LLM thinks it's three different people.
  • Detector misses. A NER model finds the full mention Patrick Dupont but skips a standalone Patrick two paragraphs later. The bare regex you'd add to compensate misses the variants. You need a layer that re-scans for known entities, not just raw NER output.
  • Span overlaps. Run two detectors and they will sometimes claim the same offset with different labels (PERSON at confidence 0.95, ORG at 0.60). Without arbitration, span replacement walks over itself and you get garbled text.
  • Multi-message state. Anonymize each message in isolation and you hit a silent collision: Patrick becomes <<PERSON:1>> in message 1, Bob becomes <<PERSON:1>> in message 2, both decode to the same person. Conversation memory is non-optional.

Each pipeline stage in piighost exists to fix one of those four. The five-step architecture isn't theoretical: it's the smallest set of stages that makes anonymization work end-to-end on real conversations.
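The span-overlap problem above, for instance, boils down to a confidence arbitration pass. A minimal sketch (illustrative names only; the Candidate class below is not piighost's Detection model, and piighost's actual ConfidenceSpanConflictResolver may differ):

```python
# Minimal sketch of confidence arbitration over overlapping spans.
# Candidate is an illustrative stand-in, NOT piighost's Detection model.
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    label: str
    start: int
    end: int
    confidence: float


def resolve_overlaps(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the highest-confidence candidate for each overlapping region."""
    kept: list[Candidate] = []
    # Walk candidates from most to least confident; drop any that overlap a keeper.
    for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


# Two detectors claim the same offsets with different labels:
hits = [
    Candidate("Acme", "ORG", 10, 14, 0.60),
    Candidate("Acme", "PERSON", 10, 14, 0.95),
]
print(resolve_overlaps(hits))  # only the PERSON candidate survives
```

Without a pass like this, span replacement walks over itself and produces garbled text, which is why arbitration is a dedicated pipeline stage.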

Quick start

Install the cache extra (used by the pipeline):

uv add 'piighost[cache]'

Anonymize and deanonymize without downloading any model: ExactMatchDetector matches a fixed dictionary with a word-boundary regex, which makes it ideal for trying piighost in under a minute.

import asyncio
from pprint import pp

from piighost import Anonymizer, ExactMatchDetector
from piighost.pipeline import AnonymizationPipeline

detector = ExactMatchDetector([("Patrick", "PERSON"), ("Paris", "LOCATION")])
pipeline = AnonymizationPipeline(detector=detector, anonymizer=Anonymizer())


async def main() -> None:
    anonymized, entities = await pipeline.anonymize("Patrick lives in Paris.")
    print(anonymized)
    # <<PERSON:1>> lives in <<LOCATION:1>>.

    pp(entities)
    # [
    #     Entity(detections=(
    #         Detection(
    #             text='Patrick',
    #             label='PERSON',
    #             position=Span(start_pos=0, end_pos=7),
    #             confidence=1.0,
    #         ),
    #     )),
    #     Entity(detections=(
    #         Detection(
    #             text='Paris',
    #             label='LOCATION',
    #             position=Span(start_pos=17, end_pos=22),
    #             confidence=1.0,
    #         ),
    #     )),
    # ]

    original, restored = await pipeline.deanonymize(anonymized)
    print(original)
    # Patrick lives in Paris.

    pp(restored)
    # [
    #     Entity(detections=(
    #         Detection(
    #             text='Patrick',
    #             label='PERSON',
    #             position=Span(start_pos=0, end_pos=7),
    #             confidence=1.0,
    #         ),
    #     )),
    #     Entity(detections=(
    #         Detection(
    #             text='Paris',
    #             label='LOCATION',
    #             position=Span(start_pos=17, end_pos=22),
    #             confidence=1.0,
    #         ),
    #     )),
    # ]


asyncio.run(main())

Both anonymize and deanonymize return (text, list[Entity]); the entities come straight from the cache, with no re-detection. Note that an Entity does not carry its placeholder: the placeholder is regenerated on the fly by the PlaceholderFactory from the ordered list of entities. That is how counter-style factories such as LabelCounterPlaceholderFactory produce stable <<PERSON:1>>, <<PERSON:2>>, ... numbers across anonymize and deanonymize (the order is the source of truth). The text field of Detection is rendered verbatim by the standard dataclass repr; scrub it yourself before forwarding these objects to logs (see security docs).
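A counter-style factory regenerating tags from the ordered entity list can be sketched like this (a toy illustration of the mechanism, not piighost's LabelCounterPlaceholderFactory itself):

```python
# Toy sketch: regenerate <<LABEL:N>> tags from an ordered list of labels.
# The same ordered list always yields the same tags, which is why the
# entity order, not the Entity object, is the source of truth.
from collections import Counter


def placeholders(labels_in_order: list[str]) -> list[str]:
    counts: Counter[str] = Counter()
    tags: list[str] = []
    for label in labels_in_order:
        counts[label] += 1  # per-label counter, not a global one
        tags.append(f"<<{label}:{counts[label]}>>")
    return tags


print(placeholders(["PERSON", "EMAIL", "PERSON"]))
# ['<<PERSON:1>>', '<<EMAIL:1>>', '<<PERSON:2>>']
```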

How does deanonymize know the original? It does not re-run the detector. The pipeline keeps an in-memory cache (aiocache.SimpleMemoryCache by default) that maps sha256(anonymized_text) → (original_text, entities). Calling deanonymize is just a lookup. For multi-instance deployments, swap in a Redis or Memcached backend, see deployment docs.
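That lookup mechanism can be sketched with a plain dict standing in for the aiocache backend (illustrative only; the real pipeline stores richer state):

```python
# Sketch of the reversibility mechanism: a cache keyed by
# sha256(anonymized_text), with a plain dict standing in for aiocache.
import hashlib

cache: dict[str, tuple[str, list]] = {}


def key(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def store(anonymized: str, original: str, entities: list) -> None:
    cache[key(anonymized)] = (original, entities)


def lookup(anonymized: str) -> tuple[str, list]:
    return cache[key(anonymized)]  # deanonymize is just this lookup


store("<<PERSON:1>> lives in <<LOCATION:1>>.", "Patrick lives in Paris.", [])
print(lookup("<<PERSON:1>> lives in <<LOCATION:1>>.")[0])
# Patrick lives in Paris.
```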

For real workloads, plug in a NER model or your own detector below.

Advanced configuration (real NER, custom resolvers, full pipeline)
import asyncio
from gliner2 import GLiNER2

from piighost.anonymizer import Anonymizer
from piighost.detector.gliner2 import Gliner2Detector
from piighost.pipeline import AnonymizationPipeline

model = GLiNER2.from_pretrained("fastino/gliner2-multi-v1")
detector = Gliner2Detector(model=model, labels=["PERSON", "LOCATION"])
pipeline = AnonymizationPipeline(detector=detector, anonymizer=Anonymizer())


async def main() -> None:
    text = "Patrick lives in Paris. Patrick loves Paris."
    anonymized, entities = await pipeline.anonymize(text)
    print(anonymized)
    # <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.

    for entity in entities:
        print(f"  {entity.label}: {entity.detections[0].text}")
    # PERSON: Patrick
    # LOCATION: Paris

    original, _ = await pipeline.deanonymize(anonymized)
    print(original)
    # Patrick lives in Paris. Patrick loves Paris.


asyncio.run(main())

Swap Gliner2Detector for any other implementation of AnyDetector (spaCy, regex, a remote API, your own, see Bring your own detector). Same for every other stage of the pipeline.

With LangChain agent middleware

A LangChain middleware is an extension point that runs before and after every LLM call and every tool call. piighost hooks into it to intercept and transform messages, so PII anonymization is applied without changing your agent code.

from langchain.agents import create_agent
from langchain_core.tools import tool

from piighost.anonymizer import Anonymizer
from piighost.detector.gliner2 import Gliner2Detector
from piighost.pipeline import ThreadAnonymizationPipeline
from piighost.middleware import PIIAnonymizationMiddleware

from gliner2 import GLiNER2


@tool
def send_email(to: str, subject: str, body: str) -> str:
    """Send an email to a given address."""
    return f"Email successfully sent to {to}."


model = GLiNER2.from_pretrained("fastino/gliner2-multi-v1")
detector = Gliner2Detector(model=model, labels=["PERSON", "LOCATION"])
pipeline = ThreadAnonymizationPipeline(detector=detector, anonymizer=Anonymizer())
middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

graph = create_agent(
    model="openai:gpt-5.4",
    system_prompt="You are a helpful assistant.",
    tools=[send_email],
    middleware=[middleware],
)

The middleware intercepts every agent turn: the LLM only sees anonymized text, tools receive real values, and user-facing messages are deanonymized automatically.

Bring your own detector

The detection stage is just a Protocol. Anything async with a detect(text) -> list[Detection] method works. The pipeline does not care whether it's a model, a regex, or an HTTP call.

import httpx

from piighost.detector.base import AnyDetector  # protocol, structural typing
from piighost.models import Detection, Span


class RemoteNERDetector:
    """Calls a hosted NER service and maps its response to Detection objects."""

    def __init__(self, url: str, api_key: str) -> None:
        self._url, self._key = url, api_key

    async def detect(self, text: str) -> list[Detection]:
        async with httpx.AsyncClient() as client:
            r = await client.post(
                self._url,
                json={"text": text},
                headers={"Authorization": f"Bearer {self._key}"},
            )
        return [
            Detection(
                text=hit["text"],
                label=hit["label"],
                position=Span(start_pos=hit["start"], end_pos=hit["end"]),
                confidence=hit["score"],
            )
            for hit in r.json()["entities"]
        ]


# Satisfies AnyDetector by structural typing — drop it straight into the pipeline.
detector: AnyDetector = RemoteNERDetector(url="...", api_key="...")

Combine several detectors with CompositeDetector and let ConfidenceSpanConflictResolver pick a winner when their spans overlap. See extending docs for the full catalogue (spaCy, transformers, LLM-as-detector, regex with validators for IBAN / NIR / Luhn).
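As an example of the regex-with-validator pattern mentioned above, here is a hedged sketch of a Luhn-checked card-number finder (illustrative only; piighost's actual pattern catalogue lives in piighost.detector.patterns):

```python
# Sketch of "regex + validator": a regex proposes 16-digit candidates,
# a Luhn checksum rejects false positives. Illustrative, not piighost code.
import re


def luhn_ok(number: str) -> bool:
    """Standard Luhn checksum over the digits of `number`."""
    digits = [int(c) for c in number if c.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    # The regex alone would also match random digit runs; Luhn filters them.
    return [m.group() for m in re.finditer(r"\b\d{16}\b", text) if luhn_ok(m.group())]


print(find_card_numbers("valid 4532015112830366, invalid 1234567890123456"))
# ['4532015112830366']
```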

Use cases

piighost fits anywhere a third-party LLM should not see real names, identifiers, or free-text PII:

  • Customer support chatbot. A SaaS sends every ticket to GPT to generate a draft reply. With piighost, the LLM sees <<CUSTOMER:1>> reports an outage on order <<ORDER_ID:3>>, the response comes back deanonymized, and the customer email is never logged on the provider side.
  • Healthcare / clinical assistant. A nurse pastes patient notes into a triage assistant. piighost strips patient names, SSN, and addresses before the LLM call, while the medical content (symptoms, vitals, treatments) reaches the model intact, which keeps reasoning quality high while avoiding a HIPAA / GDPR incident.
  • HR agent on internal documents. A RAG agent answers questions over performance reviews and salary grids. Employee names and amounts are anonymized in retrieved chunks; the LLM never sees who got what; the final answer is reconstructed for the authorized HR user only.
  • Legal assistant. Contracts processed with client and counterparty names redacted before reaching the model.
  • Tool-enabled agents. Anonymize free-text inputs without breaking tool calls: the send_email / CRM / Jira tool still receives the real address, the LLM only ever saw <<PERSON:1>>.

How it works

Pipeline

AnonymizationPipeline runs five stages, each one a swappable protocol:

---
title: "AnonymizationPipeline.anonymize() flow"
---
flowchart LR
    classDef stage fill:#90CAF9,stroke:#1565C0,color:#000
    classDef protocol fill:#FFF9C4,stroke:#F9A825,color:#000
    classDef data fill:#A5D6A7,stroke:#2E7D32,color:#000

    INPUT(["`**Input text**
    _'Patrick lives in Paris.
    Patrick loves Paris.'_`"]):::data

    DETECT["`**1. Detect**
    _AnyDetector_`"]:::stage
    RESOLVE_SPANS["`**2. Resolve Spans**
    _AnySpanConflictResolver_`"]:::stage
    LINK["`**3. Link Entities**
    _AnyEntityLinker_`"]:::stage
    RESOLVE_ENTITIES["`**4. Resolve Entities**
    _AnyEntityConflictResolver_`"]:::stage
    ANONYMIZE["`**5. Anonymize**
    _AnyAnonymizer_`"]:::stage

    OUTPUT(["`**Output**
    _'<<PERSON:1>> lives in <<LOCATION:1>>.
    <<PERSON:1>> loves <<LOCATION:1>>.'_`"]):::data

    INPUT --> DETECT
    DETECT -- "list[Detection]" --> RESOLVE_SPANS
    RESOLVE_SPANS -- "deduplicated detections" --> LINK
    LINK -- "list[Entity]" --> RESOLVE_ENTITIES
    RESOLVE_ENTITIES -- "merged entities" --> ANONYMIZE
    ANONYMIZE --> OUTPUT

    P_DETECT["`GlinerDetector
    _(or RegexDetector, ExactMatchDetector, CompositeDetector…)_`"]:::protocol
    P_RESOLVE_SPANS["`ConfidenceSpanConflictResolver
    _(highest confidence wins)_`"]:::protocol
    P_LINK["`ExactEntityLinker
    _(word-boundary regex)_`"]:::protocol
    P_RESOLVE_ENTITIES["`MergeEntityConflictResolver
    _(union-find merge)_`"]:::protocol
    P_ANONYMIZE["`Anonymizer + LabelCounterPlaceholderFactory
    _(<<LABEL:N>> tags)_`"]:::protocol

    P_DETECT -. "implements" .-> DETECT
    P_RESOLVE_SPANS -. "implements" .-> RESOLVE_SPANS
    P_LINK -. "implements" .-> LINK
    P_RESOLVE_ENTITIES -. "implements" .-> RESOLVE_ENTITIES
    P_ANONYMIZE -. "implements" .-> ANONYMIZE

Three terms drive the pipeline: a span is a (start, end) offset, a detection is a span + label + confidence emitted by a detector, an entity is a group of detections referring to the same real-world PII (so they share the same placeholder). Full definitions in the glossary.
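In code, the relationship between the three terms looks roughly like this (the shapes mirror the repr shown in the quick start; this is a sketch, not the actual piighost.models module):

```python
# Sketch of the span / detection / entity hierarchy, modelled on the
# repr shown in the quick start (not the real piighost.models module).
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start_pos: int
    end_pos: int


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    position: Span
    confidence: float


@dataclass(frozen=True)
class Entity:
    # All detections in one Entity refer to the same real-world PII,
    # so they share the same placeholder.
    detections: tuple[Detection, ...]


patrick = Entity(detections=(
    Detection("Patrick", "PERSON", Span(0, 7), 1.0),
    Detection("Patrick", "PERSON", Span(24, 31), 1.0),  # later mention, same entity
))
print(len(patrick.detections))  # 2
```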

Middleware integration

The middleware hooks into LangChain's abefore_model, awrap_tool_call and aafter_model to anonymize, deanonymize for tools, and re-anonymize tool results. See architecture docs for the full sequence.

Installation

piighost ships as a regular wheel on PyPI. The core package has no required dependencies; install only the extras for the features you need.

Inside a uv project (recommended)

uv add piighost                 # core only (small, no model)
uv add 'piighost[cache]'        # AnonymizationPipeline (aiocache)
uv add 'piighost[gliner2]'      # Gliner2Detector
uv add 'piighost[middleware]'   # PIIAnonymizationMiddleware (langchain + aiocache)
uv add 'piighost[all]'          # everything

Standalone (pip or uv pip)

For an isolated venv, a notebook, or a script outside a uv project:

pip install piighost                          # or:  uv pip install piighost
pip install 'piighost[middleware]'

Compatibility

  • Python: >=3.10
  • LangChain (extra middleware): >=1.2
  • aiocache (extra cache): >=0.12
  • GLiNER2 (extra gliner2): >=1.2

piighost is tested on Python 3.10 through 3.14. Versions are declared in pyproject.toml.

From source (development)

git clone https://github.com/Athroniaeth/piighost.git
cd piighost
uv sync
make lint        # ruff format + check, pyrefly type-check, bandit
uv run pytest

Pipeline components

Only detector and anonymizer are required; the three middle stages (span resolver, entity linker, entity resolver) have sensible defaults you can override. The full table of defaults, roles, and what each stage protects against is in the architecture docs.

FAQ

Q: What languages are supported? That's entirely up to the detector you plug in. The pipeline itself is language-agnostic. With Gliner2Detector and a multilingual GLiNER2 model, you get ~100 languages out of the box. With SpacyDetector, anything spaCy supports. With RegexDetector, language doesn't matter.

Q: Which entities does it detect out of the box? None: piighost does not ship its own NER model, on purpose. You bring the detector. Use ExactMatchDetector for fixed dictionaries, RegexDetector with piighost.detector.patterns (FR_IBAN, FR_NIR, EU_VAT, ...), Gliner2Detector for open-set NER (PERSON, LOCATION, ORGANIZATION, EMAIL, ... whatever labels you query), or compose them.

Q: How much latency does it add? The pipeline itself is ~milliseconds (regex + dict lookups). Real cost is the detector you choose. CPU GLiNER2 on a 200-token message is typically 50-200 ms; an LLM-as-detector is hundreds of ms. The pipeline caches detection results per text hash via aiocache, so repeated content is free. Benchmarks for your own workload are recommended before sizing production traffic.

Q: Does it work fully offline? (GDPR / RGPD) Yes. With a local detector (Gliner2Detector, SpacyDetector, RegexDetector, ExactMatchDetector), no data leaves your process. The middleware only forwards already-anonymized text to the LLM. This is the main reason teams adopt piighost: keep using a hosted LLM under EU constraints without exfiltrating raw PII.

Q: What happens when the NER misses an entity? Two layers of defense:

  1. The entity linker scans the whole text (and the conversation, in ThreadAnonymizationPipeline) for word-level matches of every detected entity. So if Patrick is detected once, every other Patrick in the text gets the same placeholder, even if the NER missed them.
  2. For deterministic PII (emails, phone numbers, IBANs), combine the NER detector with a RegexDetector via CompositeDetector. NER false negatives become regex true positives.
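The word-level re-scan in point 1 boils down to a word-boundary search for each already-detected entity, roughly:

```python
# Sketch of the linker's re-scan: once "Patrick" is detected anywhere,
# a word-boundary regex finds every other occurrence in the text.
import re


def rescan(text: str, known_entity: str) -> list[tuple[int, int]]:
    pattern = re.compile(rf"\b{re.escape(known_entity)}\b")
    return [m.span() for m in pattern.finditer(text)]


text = "Patrick Dupont called. Later, Patrick emailed again."
print(rescan(text, "Patrick"))
# [(0, 7), (30, 37)]
```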

For PII the LLM generates in its response (entities never seen in the input), use a DetectorGuardRail on the output, see extending docs.

Q: Can I use it without LangChain? Yes. AnonymizationPipeline and ThreadAnonymizationPipeline are independent of any agent framework. The LangChain middleware is one integration; the pipeline itself can be called from anywhere (FastAPI handler, batch script, custom agent loop).

Q: How is reversibility (deanonymize) implemented? A SHA-256 keyed cache stores anonymized_text → (original_text, entities). pipeline.deanonymize(anonymized_text) looks up the mapping and restores the original. The cache is in-memory by default (SimpleMemoryCache), pass any aiocache backend (Redis, Memcached) for multi-instance deployments.

Limitations

piighost is not a silver bullet. Trade-offs to keep in mind before deploying:

  • Entity linking amplifies NER mistakes. If Rose is mistakenly detected as a person, every rose (the flower) is anonymized too. Mitigation: narrower detector (ExactMatchDetector, RegexDetector), or fresh thread per message.
  • Fuzzy resolution can over-merge. Jaro-Winkler on short names (Marin vs Martin) can fuse distinct people. Mitigation: raise the threshold or stick to MergeEntityConflictResolver.
  • PII generated by the LLM in its responses (never seen in input) bypasses entity linking. Add a DetectorGuardRail on the output.
  • Cache is local by default. Multi-instance deployments need a shared backend (Redis, Memcached) configured explicitly.
  • Latency overhead is detector-bound. Benchmark on your own workload before sizing production traffic.

See architecture docs, extending docs, and limitations docs for mitigation strategies.

Development

uv sync                              # install dev dependencies
make lint                            # ruff format + check, pyrefly, bandit
uv run pytest                        # run all tests
uv run pytest tests/ -k "test_name"  # run a single test

Contributing

  • Commits: Conventional Commits via Commitizen (feat:, fix:, refactor:, ...)
  • Type checking: Pyrefly (not mypy)
  • Formatting / linting: Ruff
  • Package manager: uv (not pip)
  • Python: 3.10+

See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

Ecosystem

  • piighost-api: REST API server for PII anonymization inference. Loads a piighost pipeline once server-side and exposes anonymize / deanonymize over HTTP, so clients only need a lightweight HTTP client instead of embedding the NER model.
  • piighost-chat: Demo chat app showcasing privacy-preserving AI conversations. Uses PIIAnonymizationMiddleware with LangChain to anonymize messages before the LLM and deanonymize responses transparently. Built with SvelteKit, Litestar, and Docker Compose.

Additional notes

  • All data models are frozen dataclasses, safe to share across threads.
  • Tests use ExactMatchDetector to avoid loading any heavy NER model in CI.
  • For the threat model, what piighost protects against and what it does not, and cache storage considerations, see SECURITY.md.
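The frozen-dataclass guarantee is easy to see with a Span-like model (illustrative; not the actual piighost.models code):

```python
# Frozen dataclasses reject mutation, which is what makes them safe
# to share across threads and coroutines.
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Span:
    start_pos: int
    end_pos: int


span = Span(0, 7)
try:
    span.start_pos = 99  # any attribute assignment raises
except FrozenInstanceError:
    print("immutable")
```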

Roadmap

A public roadmap (logo, latency / accuracy benchmarks on a reference corpus, GIF demo of piighost-chat, hosted live demo) lives in roadmap. Issues and discussions welcome.

Star us!

If piighost saves you a few hours, a ⭐ on GitHub helps others find it. Bug reports and PRs are even better, see CONTRIBUTING.md.
