MaskPipe is a modular, spaCy-native PII de-identification pipeline. Refine and orchestrate entity results from GLiNER, HuggingFace, or any NER source with context-aware boosting, rule-based validation, and flexible redaction.

Project description

MaskPipe

MaskPipe is a spaCy-native toolkit for detecting, refining, resolving, and redacting PII.

Use it when you want one of these workflows:

detect PII with built-in and custom rules, then redact it
take entities from another NER system, run overlap resolution, then redact them
combine both approaches in one spaCy pipeline

What MaskPipe Does

MaskPipe gives you four composable pipeline components:

recognizer: finds spans from token patterns, phrase patterns, and custom matchers
context_enhancer: boosts scores or relabels spans from nearby context
conflict_resolver: resolves overlap and filters low-confidence spans
anonymizer: writes masked output to doc._.masked

The original doc.text is never modified.
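
To illustrate that last point, masking can be done on a copy by stitching together the untouched text and the replacements. The sketch below is plain Python, not MaskPipe's implementation; mask_text and its (start, end, replacement) tuples are hypothetical names chosen for this example.

```python
def mask_text(text, spans):
    """Return a masked copy of `text`.

    `spans` holds (start, end, replacement) tuples with character offsets.
    The spans are assumed to be non-overlapping.
    """
    pieces = []
    cursor = 0
    for start, end, replacement in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(replacement)         # replacement instead of the PII
        cursor = end
    pieces.append(text[cursor:])           # tail after the last span
    return "".join(pieces)


original = "Mail info@example.com or call 0612345678"
masked = mask_text(original, [(30, 40, "[PHONE_NUMBER]"), (5, 21, "[EMAIL]")])

print(original)  # the input string is left as it was
print(masked)
```

Because strings are immutable and the function only reads from the input, the original text stays available next to the masked copy, which mirrors the doc.text / doc._.masked split described above.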

Installation

pip install maskpipe
python -m spacy download nl_core_news_sm

Requirements:

Python 3.11-3.14
spaCy 3.8+

Optional dependencies for examples and integrations:

pip install faker gliner transformers

Quick Start: Built-in Detection + Masking

This is the default workflow if you want MaskPipe to detect PII itself.

import spacy
from maskpipe import PipelineBuilder
from maskpipe import entities
from maskpipe.entities import nl

nlp = spacy.load("nl_core_news_sm", disable=["ner"])

builder = PipelineBuilder(nlp)
builder.add_entities([
    nl.BSN.replace(redactor="[BSN]"),
    entities.PHONE_NUMBER.replace(redactor="[PHONE_NUMBER]"),
    entities.EMAIL.replace(redactor="[EMAIL]"),
])

nlp = builder.build()

doc = nlp("Mijn BSN is 692015644, bel me op 0612345678 of mail naar info@example.com")

print(doc.text)
print(doc._.masked)
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score, ent._.replacement)
# Mijn BSN is 692015644, bel me op 0612345678 of mail naar info@example.com
# Mijn BSN is [BSN], bel me op [PHONE_NUMBER] of mail naar [EMAIL]
# 692015644 BSN 0.85 [BSN]
# 0612345678 PHONE_NUMBER 0.75 [PHONE_NUMBER]
# info@example.com EMAIL 1.0 [EMAIL]

Output model:

doc.text: original text
doc._.masked: masked text
doc.ents: resolved spans after conflict resolution
span._.replacement: replacement chosen by the anonymizer

If no redactor is registered for a label, MaskPipe falls back to the label in square brackets, for example [EMAIL] for EMAIL.

Quick Start: External NER + Masking

This is the right setup if another model already produced entity offsets.

import spacy
from transformers import pipeline
from maskpipe import PipelineBuilder, DocBuilder, HF_NER_MAPPER

# Load your NER model
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Set up MaskPipe to only resolve overlaps and mask (no local detection).
nlp = spacy.load("nl_core_news_sm", disable=["ner"])
builder = PipelineBuilder(nlp, disable=["recognizer", "context_enhancer"])
nlp = builder.build()

text = "Alice works at Google. Contact her at alice@example.com or 555-1234."
results = ner(text)

# With aggregation_strategy="simple", results is a list of dicts like:
# [{"word": "Alice", "start": 0, "end": 5, "entity_group": "PER", "score": 0.98}, ...]

doc = DocBuilder(nlp, text).with_entities(results, entity_mapper=HF_NER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
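# Illustrative output. dslim/bert-base-NER only emits PER, ORG, LOC, and MISC, so with the
# recognizer disabled the EMAIL and PHONE_NUMBER spans must come from another entity source.
# [PERSON] works at [ORG]. Contact her at [EMAIL] or [PHONE_NUMBER].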

Why this works:

HF_NER_MAPPER normalizes HuggingFace NER output to the canonical entity format.
with_entities() converts character offsets to spaCy spans with scores.
conflict_resolver deduplicates overlapping spans and writes clean results to doc.ents.
anonymizer reads doc.ents and generates doc._.masked.
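
The overlap-resolution step can be pictured as a greedy selection over scored character ranges. This is a minimal sketch in plain Python under stated assumptions: resolve_overlaps, the min_score threshold of 0.5, and the tie-break on span length are all hypothetical, and the actual conflict_resolver may use a different strategy.

```python
def resolve_overlaps(entities, min_score=0.5):
    """Drop low-confidence entities, then keep the best of each overlapping group."""
    candidates = [e for e in entities if e["score"] >= min_score]
    # Prefer higher scores; break ties in favour of the longer span.
    candidates.sort(key=lambda e: (-e["score"], -(e["end"] - e["start"])))

    kept = []
    for ent in candidates:
        # Two half-open ranges overlap when each starts before the other ends.
        overlaps = any(
            ent["start"] < k["end"] and k["start"] < ent["end"] for k in kept
        )
        if not overlaps:
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


entities = [
    {"start": 0, "end": 5, "label": "PERSON", "score": 0.98},
    {"start": 0, "end": 12, "label": "ORG", "score": 0.60},     # overlaps PERSON
    {"start": 38, "end": 55, "label": "EMAIL", "score": 0.99},
    {"start": 58, "end": 66, "label": "NUMBER", "score": 0.30},  # below threshold
]
resolved = resolve_overlaps(entities)
print([e["label"] for e in resolved])
```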

Built-in Entities

Available in maskpipe.entities:

CREDIT_CARD
DATE
EMAIL
IBAN
IPV4
IPV6
MAC_ADDRESS
NUMBER
PHONE_NUMBER
URL

Available in maskpipe.entities.nl:

BSN

Entity objects are immutable configs. Use .replace(...) to override one field without rebuilding the whole entity:

from maskpipe import entities

masked_email = entities.EMAIL.replace(redactor="[EMAIL]")

Creating Custom Entities

from maskpipe.entities import Entity

EMPLOYEE_ID = Entity(
    label="EMPLOYEE_ID",
    patterns=[
        # In a raw string a single backslash is enough: \d matches one digit.
        {"pattern": [{"TEXT": {"REGEX": r"EMP-\d{5}"}}], "score": 0.9, "id": "employee-id"},
    ],
    context_patterns=[
        {"pattern": [{"LOWER": "employee"}]},
        {"context_label": "STAFF_ID", "pattern": [{"LOWER": "staff"}, {"LOWER": "id"}]},
    ],
    validator=lambda span: span.text.startswith("EMP-"),
    redactor=lambda text: "EMP-XXXXX",
)
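
Validators are most useful when a pattern alone is too permissive, for example with checksummed identifiers. As an illustration, here is a sketch of the Dutch "11-test" (elfproef) for nine-digit BSN numbers. The function name is_valid_bsn is hypothetical, and whether MaskPipe's built-in nl.BSN entity validates in exactly this way is an assumption.

```python
def is_valid_bsn(text):
    """Return True if `text` is nine ASCII digits that pass the 11-test."""
    if len(text) != 9 or not (text.isascii() and text.isdigit()):
        return False
    if text == "000000000":  # choice made for this sketch: reject all zeros
        return False
    # The first eight digits are weighted 9..2; the last digit is weighted -1.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(digit) * weight for digit, weight in zip(text, weights))
    return total % 11 == 0


print(is_valid_bsn("692015644"))  # the BSN used in the quick start
print(is_valid_bsn("692015645"))  # same number with the last digit changed
```

Following the validator shape shown in the Entity example above, such a function could be wired in as validator=lambda span: is_valid_bsn(span.text).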

Supported redactors:

fixed string: "[MASK]"
zero-argument callable: lambda: "generated-value"
one-argument callable: lambda text: text[:1] + "*" * (len(text) - 1)
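
One way to picture how these three redactor shapes, plus the [LABEL] fallback, could be told apart at masking time is sketched below. apply_redactor is a hypothetical helper written for this example; the anonymizer's real dispatch logic is not documented here and may differ.

```python
import inspect


def apply_redactor(redactor, label, text):
    """Return the replacement for `text` given a redactor of any supported shape."""
    if redactor is None:
        return f"[{label}]"          # fallback when nothing is registered
    if isinstance(redactor, str):
        return redactor              # fixed string
    if len(inspect.signature(redactor).parameters) == 0:
        return redactor()            # zero-argument callable
    return redactor(text)            # one-argument callable


print(apply_redactor(None, "EMAIL", "info@example.com"))
print(apply_redactor("[MASK]", "EMAIL", "info@example.com"))
print(apply_redactor(lambda: "generated-value", "ID", "12345"))
print(apply_redactor(lambda text: text[:1] + "*" * (len(text) - 1), "PERSON", "Alice"))
```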

DocBuilder

DocBuilder converts character offsets from external NER systems into spaCy spans with scores.
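
Converting character offsets to token spans only works cleanly when the offsets fall on token boundaries. The sketch below shows the idea behind a "strict" alignment in plain Python, assuming whitespace tokenization as a simplification of spaCy's tokenizer; char_span_strict is a hypothetical helper, not part of MaskPipe or spaCy.

```python
import re


def char_span_strict(text, start, end):
    """Map character offsets to (first_token, end_token) indices, or None.

    The span is accepted only if `start` is the start of a token and `end`
    is the end of a token. Tokens are whitespace-separated in this sketch.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i + 1 for i, (_, e) in enumerate(tokens)}
    if start in starts and end in ends and starts[start] < ends[end]:
        return starts[start], ends[end]
    return None


text = "Alice works at Google"
print(char_span_strict(text, 0, 5))    # aligned with the token "Alice"
print(char_span_strict(text, 0, 4))    # ends inside a token
print(char_span_strict(text, 15, 21))  # aligned with the token "Google"
```

Offsets from an external model that cut through a token are the usual reason a span is rejected under strict alignment, so it is worth checking that the model and the spaCy pipeline see the same text.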

Basic Usage

from maskpipe import DocBuilder

# Create a doc and add entities
doc = DocBuilder(nlp, text).with_entities(
    entities=[
        {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
        {"start": 30, "end": 45, "label": "EMAIL", "score": 0.99},
    ]
).build()

doc = nlp(doc)
print(doc._.masked)

Entity Format

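with_entities() expects a list of dicts. start, end, and label are required; score is optional:

start: character offset where the entity begins (int)
end: character offset where the entity ends (int)
label: entity type (str)
score: confidence in the range [0.0, 1.0] (float, optional)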

entities = [
    {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
    {"start": 30, "end": 45, "label": "EMAIL", "score": 0.99},
]
doc = DocBuilder(nlp, text).with_entities(entities).build()
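
When entities come from several sources, it can help to check them against this format before building a doc. The helper below is a hypothetical pre-flight check in plain Python, not a MaskPipe function. It assumes that a missing score falls back to 0.6, the default_score value listed in the API reference.

```python
def normalize_entity(raw, text_length, default_score=0.6):
    """Validate one entity dict and fill in a default score if it is missing."""
    start, end, label = raw["start"], raw["end"], raw["label"]
    if not (0 <= start < end <= text_length):
        raise ValueError(f"invalid offsets: start={start}, end={end}")
    score = float(raw.get("score", default_score))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return {"start": start, "end": end, "label": label, "score": score}


text = "Alice works at Google"
checked = [
    normalize_entity(ent, len(text))
    for ent in [
        {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
        {"start": 15, "end": 21, "label": "ORG"},  # no score given
    ]
]
print(checked)
```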

Entity Mappers

Use entity_mapper to normalize different NER output formats. MaskPipe provides pre-configured mappers:

Mapper	Use For	Key Fields
`GLINER_MAPPER`	GLiNER (x-large)	`start`, `end`, `label`, `score`
`GLINER2_MAPPER`	GLiNER2 (nested format)	nested `{label: {start, end, confidence}}`
`HF_NER_MAPPER`	HuggingFace NER	`entity_group` (or `entity`), `start`, `end`, `score`
`OPENMED_MAPPER`	OpenMed (biomedical)	`start`, `end`, `label`, `confidence`

Example with GLiNER:

from maskpipe import GLINER_MAPPER, DocBuilder
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-x-large")
text = "Patient John Doe, email: john@example.com"
predictions = model.predict_entities(text, labels=["person", "email"], threshold=0.5)

doc = DocBuilder(nlp, text).with_entities(predictions, entity_mapper=GLINER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
# [PERSON], email: [EMAIL]

Example with HuggingFace NER:

from maskpipe import HF_NER_MAPPER, DocBuilder
from transformers import pipeline

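# aggregation_strategy="simple" merges word pieces into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "Contact: alice@example.com"
results = ner(text)

# nlp is a pipeline built with PipelineBuilder, as in the quick starts
doc = DocBuilder(nlp, text).with_entities(results, entity_mapper=HF_NER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
# Contact: [EMAIL]
# (illustrative: dslim/bert-base-NER has no EMAIL label, so the EMAIL span
# must come from the recognizer or from a model that emits that label)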

Custom Mappers

Create custom mappers for other NER systems:

from maskpipe import EntityMapper

# For a system that reports the label as "type" and the score as "confidence"
custom_mapper = EntityMapper(label="type", score="confidence")

# For systems with conditional label fields
fallback_mapper = EntityMapper(
    label="entity_type",
    label_fallback="category",  # use if entity_type not found
    score="conf"
)
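
Conceptually, a mapper renames fields from the source format to the canonical {start, end, label, score} format. The function below is a plain-Python sketch of that idea, including the fallback lookup; map_entity and its default_score argument are hypothetical and do not reflect how EntityMapper is implemented.

```python
def map_entity(raw, label="label", label_fallback=None, score="score", default_score=0.6):
    """Rename source fields to the canonical start/end/label/score format."""
    if label in raw:
        mapped_label = raw[label]
    elif label_fallback is not None and label_fallback in raw:
        mapped_label = raw[label_fallback]  # primary label field was absent
    else:
        raise KeyError(f"no label field found in {sorted(raw)}")
    return {
        "start": raw["start"],
        "end": raw["end"],
        "label": mapped_label,
        "score": raw.get(score, default_score),
    }


raw = {"start": 0, "end": 5, "category": "PERSON", "conf": 0.9}
mapped = map_entity(raw, label="entity_type", label_fallback="category", score="conf")
print(mapped)
```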

Batch Processing

Use build_batch() to process multiple texts at once:

docs = list(DocBuilder.build_batch(
    nlp=nlp,
    texts=["text1", "text2", "text3"],
    entities_list=[
        [{"start": 0, "end": 5, "label": "PERSON", "score": 0.9}],
        [{"start": 10, "end": 20, "label": "EMAIL", "score": 0.95}],
        [],  # no entities
    ],
))

for doc in nlp.pipe(docs):
    print(doc._.masked)

Context Words

Use with_context_words() to pass extra context terms that are not present in the text, which the context_enhancer can use when making boosting and relabeling decisions:

doc = DocBuilder(nlp, text).with_context_words(["email", "contact"]).build()

Customizing Components

PipelineBuilder

PipelineBuilder adds the default component chain in this order:

recognizer
context_enhancer
conflict_resolver
anonymizer

You can disable components you do not need:

from maskpipe import PipelineBuilder

builder = PipelineBuilder(
    nlp,
    label_mapping={"persoon": "PERSON"},
    disable=["context_enhancer"],
)

Context Enhancement

Add context patterns directly to the component:

context_enhancer = nlp.get_pipe("context_enhancer")
context_enhancer.add_patterns([
    {
        "label": "EMAIL",
        "pattern": [{"LOWER": {"IN": ["email", "mail", "e-mail"]}}],
    }
])

Important:

a context pattern applies only to spans whose label matches the pattern's label
score changes come from component config such as confidence_boost
context_label can relabel a matched span
doc._.context_words lets you add extra context terms not present in the text
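
The boosting idea in the list above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the boost_score name, the three-token window, the 0.2 boost value, and the cap at 1.0. Only the confidence_boost option name comes from the list above, and the real component matches spaCy token patterns rather than a set of lowercase words.

```python
def boost_score(tokens, span_start, span_end, score, context_terms,
                window=3, confidence_boost=0.2):
    """Raise `score` if a context term occurs within `window` tokens of the span.

    `span_start` and `span_end` are token indices (end exclusive).
    """
    lo = max(0, span_start - window)
    hi = min(len(tokens), span_end + window)
    nearby = tokens[lo:span_start] + tokens[span_end:hi]
    if any(token.lower() in context_terms for token in nearby):
        return min(1.0, score + confidence_boost)  # never exceed 1.0
    return score


tokens = "bel me op 0612345678".split()
boosted = boost_score(tokens, 3, 4, 0.75, {"bel", "telefoon"})
unchanged = boost_score(tokens, 3, 4, 0.75, {"email"})
print(boosted, unchanged)
```

Extra terms supplied through doc._.context_words could be merged into context_terms in a design like this, which is one way to read the last bullet above.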

Anonymizer

anonymizer = nlp.get_pipe("anonymizer")
anonymizer.add_redactors({
    "EMAIL": "[REDACTED]",
    "ID": lambda: "ID-000001",
    "PERSON": lambda text: text[0] + "." * (len(text) - 1),
})

The anonymizer:

leaves doc.text unchanged
stores masked output in doc._.masked
stores the chosen replacement in span._.replacement

spaCy Extensions Added by MaskPipe

Document extensions:

doc._.masked
doc._.context_words

Span extensions:

span._.score
span._.context
span._.replacement

Minimal API Reference

PipelineBuilder(nlp, label_mapping=None, disable=None)
DocBuilder(
    nlp,
    text,
    label_mapping=None,
    spans_key="sc",
    annotate_ents=False,
    default_score=0.6,
    alignment_mode="strict",
)
Entity(
    label,
    patterns=None,
    custom_matcher=None,
    validator=None,
    context_patterns=None,
    redactor=None,
)

Development

uv sync --dev
uv run pytest -q

License

MIT. See LICENSE.

Project details

Release history

0.0.14 (this version): May 25, 2026
0.0.13: May 11, 2026
0.0.12: May 7, 2026

