MaskPipe is a modular, spaCy-native PII de-identification pipeline. Refine and orchestrate entity results from GLiNER, HuggingFace, or any NER source with context-aware boosting, rule-based validation, and flexible redaction.
MaskPipe
MaskPipe is a spaCy-native toolkit for detecting, refining, resolving, and redacting PII.
Use it when you want one of these workflows:
- detect PII with built-in and custom rules, then redact it
- take entities from another NER system, run overlap resolution, then redact them
- combine both approaches in one spaCy pipeline
What MaskPipe Does
MaskPipe gives you four composable pipeline components:
- recognizer: finds spans from token patterns, phrase patterns, and custom matchers
- context_enhancer: boosts scores or relabels spans from nearby context
- conflict_resolver: resolves overlap and filters low-confidence spans
- anonymizer: writes masked output to doc._.masked
The original doc.text is never modified.
Installation
pip install maskpipe
python -m spacy download nl_core_news_sm
Requirements:
- Python 3.11-3.14
- spaCy 3.8+
Optional dependencies for examples and integrations:
pip install faker gliner transformers
Quick Start: Built-in Detection + Masking
This is the default workflow if you want MaskPipe to detect PII itself.
import spacy
from maskpipe import PipelineBuilder
from maskpipe import entities
from maskpipe.entities import nl
nlp = spacy.load("nl_core_news_sm", disable=["ner"])
builder = PipelineBuilder(nlp)
builder.add_entities([
    nl.BSN.replace(redactor="[BSN]"),
    entities.PHONE_NUMBER.replace(redactor="[PHONE_NUMBER]"),
    entities.EMAIL.replace(redactor="[EMAIL]"),
])
nlp = builder.build()
doc = nlp("Mijn BSN is 692015644, bel me op 0612345678 of mail naar info@example.com")
print(doc.text)
print(doc._.masked)
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score, ent._.replacement)
# Mijn BSN is 692015644, bel me op 0612345678 of mail naar info@example.com
# Mijn BSN is [BSN], bel me op [PHONE_NUMBER] of mail naar [EMAIL]
# 692015644 BSN 0.85 [BSN]
# 0612345678 PHONE_NUMBER 0.75 [PHONE_NUMBER]
# info@example.com EMAIL 1.0 [EMAIL]
Output model:
- doc.text: original text
- doc._.masked: masked text
- doc.ents: resolved spans after conflict resolution
- span._.replacement: replacement chosen by the anonymizer
If no redactor is registered for a label, MaskPipe uses [LABEL].
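To make the output model concrete, here is a minimal pure-Python sketch of offset-based masking with the [LABEL] fallback. It is not MaskPipe's implementation; `mask_text` and its `(start, end, label)` tuples are hypothetical names used only for illustration.

```python
def mask_text(text, spans, redactors=None):
    """Return a masked copy of `text`; the original string is not modified.

    `spans` holds (start, end, label) character offsets that are assumed to be
    non-overlapping, as they would be after conflict resolution.
    """
    redactors = redactors or {}
    masked = text
    # Replace from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        replacement = redactors.get(label, f"[{label}]")  # fall back to [LABEL]
        masked = masked[:start] + replacement + masked[end:]
    return masked


text = "Mijn BSN is 692015644, mail naar info@example.com"
bsn_start = text.index("692015644")
mail_start = text.index("info@example.com")
spans = [
    (bsn_start, bsn_start + len("692015644"), "BSN"),
    (mail_start, mail_start + len("info@example.com"), "EMAIL"),
]

# Only EMAIL has a registered redactor here, so BSN falls back to "[BSN]".
masked = mask_text(text, spans, redactors={"EMAIL": "[REDACTED]"})
print(masked)  # Mijn BSN is [BSN], mail naar [REDACTED]
```

Working right to left avoids recomputing offsets after each replacement, which is one simple way to keep the source text and the masked text separate.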
Quick Start: External NER + Masking
This is the right setup if another model already produced entity offsets.
import spacy
from transformers import pipeline
from maskpipe import PipelineBuilder, DocBuilder, HF_NER_MAPPER
# Load your NER model
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
# Set up MaskPipe to only resolve overlaps and mask (no local detection).
nlp = spacy.load("nl_core_news_sm", disable=["ner"])
builder = PipelineBuilder(nlp, disable=["recognizer", "context_enhancer"])
nlp = builder.build()
text = "Alice works at Google. Contact her at alice@example.com or 555-1234."
results = ner(text)
# results is a list of dicts like:
# [{"word": "Alice", "start": 0, "end": 5, "entity_group": "PER", "score": 0.98}, ...]
doc = DocBuilder(nlp, text).with_entities(results, entity_mapper=HF_NER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
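# [PERSON] works at [ORG]. Contact her at [EMAIL] or [PHONE_NUMBER].
# Note: dslim/bert-base-NER only predicts PER, ORG, LOC, and MISC, so the
# [EMAIL] and [PHONE_NUMBER] masks need a model that emits those labels
# (or the recognizer component left enabled).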
Why this works:
- HF_NER_MAPPER normalizes HuggingFace NER output to the canonical entity format.
- with_entities() converts character offsets to spaCy spans with scores.
- conflict_resolver deduplicates overlapping spans and writes clean results to doc.ents.
- anonymizer reads doc.ents and generates doc._.masked.
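The overlap-resolution step can be pictured with a small greedy sketch. This is an assumption about one reasonable strategy (highest score first, longer span as tie-breaker), not a description of how conflict_resolver is actually implemented; `resolve_overlaps` and `min_score` are hypothetical names.

```python
def resolve_overlaps(spans, min_score=0.0):
    """Greedy overlap resolution over (start, end, label, score) tuples."""
    # Rank by score (descending), then by span length (descending).
    ranked = sorted(spans, key=lambda s: (-s[3], -(s[1] - s[0])))
    kept = []
    for start, end, label, score in ranked:
        if score < min_score:
            continue  # filter low-confidence candidates
        # Keep the span only if it does not overlap anything already kept.
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append((start, end, label, score))
    return sorted(kept)  # back to document order


candidates = [
    (0, 5, "PERSON", 0.98),
    (0, 5, "ORG", 0.40),             # same offsets, lower score
    (38, 55, "EMAIL", 0.99),
    (38, 43, "PERSON", 0.55),        # nested inside the email span
    (59, 67, "PHONE_NUMBER", 0.30),  # below the threshold
]
resolved = resolve_overlaps(candidates, min_score=0.5)
print(resolved)  # [(0, 5, 'PERSON', 0.98), (38, 55, 'EMAIL', 0.99)]
```

Because the surviving spans never overlap, a later masking step can replace them independently.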
Built-in Entities
Available in maskpipe.entities:
- CREDIT_CARD
- DATE
- EMAIL
- IBAN
- IPV4
- IPV6
- MAC_ADDRESS
- NUMBER
- PHONE_NUMBER
- URL
Available in maskpipe.entities.nl:
- BSN
Entity objects are immutable configs. Use .replace(...) to override one field without rebuilding the whole entity:
from maskpipe import entities
masked_email = entities.EMAIL.replace(redactor="[EMAIL]")
Creating Custom Entities
from maskpipe.entities import Entity
EMPLOYEE_ID = Entity(
    label="EMPLOYEE_ID",
    patterns=[
        # In a raw string, a single backslash is enough: \d matches one digit.
        {"pattern": [{"TEXT": {"REGEX": r"EMP-\d{5}"}}], "score": 0.9, "id": "employee-id"},
    ],
    context_patterns=[
        {"pattern": [{"LOWER": "employee"}]},
        {"context_label": "STAFF_ID", "pattern": [{"LOWER": "staff"}, {"LOWER": "id"}]},
    ],
    validator=lambda span: span.text.startswith("EMP-"),
    redactor=lambda text: "EMP-XXXXX",
)
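A validator is most useful when a pattern alone is too permissive. As an illustration, the sketch below checks a Dutch BSN with the standard 11-test (elfproef). Whether the built-in nl.BSN entity performs this check is not stated here, so treat `is_valid_bsn` as a hypothetical helper rather than part of MaskPipe.

```python
def is_valid_bsn(text: str) -> bool:
    """Check a 9-digit Dutch BSN with the 11-test (elfproef)."""
    if len(text) != 9 or not (text.isascii() and text.isdigit()):
        return False
    digits = [int(ch) for ch in text]
    # Weights 9..2 for the first eight digits and -1 for the last digit.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(d * w for d, w in zip(digits, weights))
    return total % 11 == 0


print(is_valid_bsn("692015644"))  # True
print(is_valid_bsn("692015645"))  # False
```

Following the validator signature used above, it could be attached as `validator=lambda span: is_valid_bsn(span.text)`.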
Supported redactors:
- fixed string: "[MASK]"
- zero-argument callable: lambda: "generated-value"
- one-argument callable: lambda text: text[:1] + "*" * (len(text) - 1)
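One way to support all three redactor shapes is to inspect the callable's signature before calling it. The sketch below shows that idea with the standard library only; `apply_redactor` is a hypothetical helper, and MaskPipe may dispatch differently.

```python
import inspect


def apply_redactor(redactor, text):
    """Resolve a redactor (string, 0-arg callable, or 1-arg callable) to a replacement."""
    if isinstance(redactor, str):
        return redactor  # fixed string
    params = inspect.signature(redactor).parameters
    if len(params) == 0:
        return redactor()  # zero-argument callable, e.g. a fake-data generator
    return redactor(text)  # one-argument callable receives the matched text


print(apply_redactor("[MASK]", "alice"))                                         # [MASK]
print(apply_redactor(lambda: "generated-value", "alice"))                        # generated-value
print(apply_redactor(lambda text: text[:1] + "*" * (len(text) - 1), "alice"))    # a****
```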
DocBuilder
DocBuilder converts character offsets from external NER systems into spaCy spans with scores.
Basic Usage
from maskpipe import DocBuilder
# Create a doc and add entities
doc = DocBuilder(nlp, text).with_entities(
    entities=[
        {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
        {"start": 30, "end": 45, "label": "EMAIL", "score": 0.99},
    ]
).build()
doc = nlp(doc)
print(doc._.masked)
Entity Format
with_entities() expects a list of dicts with at least:
- start: character offset (int)
- end: character offset (int)
- label: entity type (str)
- score: confidence [0.0, 1.0] (float, optional)
entities = [
    {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
    {"start": 30, "end": 45, "label": "EMAIL", "score": 0.99},
]
doc = DocBuilder(nlp, text).with_entities(entities).build()
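Before handing offsets to DocBuilder, it can help to sanity-check them against the text. The sketch below validates the entity format described above and fills in a missing score; the 0.6 fallback mirrors the default_score shown in the API reference, but how DocBuilder uses that value internally is an assumption. `normalize_entities` is a hypothetical helper.

```python
def normalize_entities(text, entities, default_score=0.6):
    """Validate entity dicts against `text` and fill in missing scores."""
    normalized = []
    for ent in entities:
        start, end = ent["start"], ent["end"]
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"offsets out of range: {start}-{end}")
        score = float(ent.get("score", default_score))
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score must be in [0.0, 1.0], got {score}")
        normalized.append({
            "start": start,
            "end": end,
            "label": ent["label"],
            "score": score,
            "text": text[start:end],  # handy for eyeballing off-by-one errors
        })
    return normalized


text = "Alice can be reached at alice@example.com"
mail_start = text.index("alice@example.com")
entities = [
    {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
    {"start": mail_start, "end": len(text), "label": "EMAIL"},  # no score given
]
checked = normalize_entities(text, entities)
for ent in checked:
    print(ent["label"], repr(ent["text"]), ent["score"])
```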
Entity Mappers
Use entity_mapper to normalize different NER output formats. MaskPipe provides pre-configured mappers:
| Mapper | Use For | Key Fields |
|---|---|---|
| GLINER_MAPPER | GLiNER (x-large) | start, end, label, score |
| GLINER2_MAPPER | GLiNER2 (nested format) | nested {label: {start, end, confidence}} |
| HF_NER_MAPPER | HuggingFace NER | entity_group (or entity), start, end, score |
| OPENMED_MAPPER | OpenMed (biomedical) | start, end, label, confidence |
Example with GLiNER:
from maskpipe import GLINER_MAPPER, DocBuilder
from gliner import GLiNER
model = GLiNER.from_pretrained("knowledgator/gliner-x-large")
text = "Patient John Doe, email: john@example.com"
predictions = model.predict_entities(text, labels=["person", "email"], threshold=0.5)
doc = DocBuilder(nlp, text).with_entities(predictions, entity_mapper=GLINER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
# [PERSON], email: [EMAIL]
Example with HuggingFace NER:
from maskpipe import HF_NER_MAPPER, DocBuilder
from transformers import pipeline
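# aggregation_strategy="simple" merges word pieces into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")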
text = "Contact: alice@example.com"
results = ner(text)
doc = DocBuilder(nlp, text).with_entities(results, entity_mapper=HF_NER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
# Contact: [EMAIL]
Custom Mappers
Create custom mappers for other NER systems:
from maskpipe import EntityMapper
# For a system whose dicts use {start, end, type, confidence}
custom_mapper = EntityMapper(label="type", score="confidence")
# For systems with conditional label fields
fallback_mapper = EntityMapper(
    label="entity_type",
    label_fallback="category",  # use if entity_type not found
    score="conf",
)
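Conceptually, a mapper renames source-specific keys to the canonical start, end, label, score format. The dataclass below is a hypothetical stand-in (named `KeyMapper` to avoid confusion with the real EntityMapper) that sketches the key-renaming and fallback behaviour described above; it makes no claim about EntityMapper's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KeyMapper:
    """Hypothetical stand-in that renames keys to the canonical entity format."""
    label: str = "label"
    score: str = "score"
    label_fallback: Optional[str] = None

    def __call__(self, raw: dict) -> dict:
        label = raw.get(self.label)
        if label is None and self.label_fallback is not None:
            label = raw.get(self.label_fallback)  # second chance for the label
        if label is None:
            raise KeyError(f"no label found under {self.label!r}")
        out = {"start": raw["start"], "end": raw["end"], "label": label}
        if self.score in raw:  # score stays optional
            out["score"] = raw[self.score]
        return out


mapper = KeyMapper(label="entity_type", label_fallback="category", score="conf")
primary = mapper({"start": 0, "end": 5, "entity_type": "PERSON", "conf": 0.9})
fallback = mapper({"start": 9, "end": 26, "category": "EMAIL"})
print(primary)   # {'start': 0, 'end': 5, 'label': 'PERSON', 'score': 0.9}
print(fallback)  # {'start': 9, 'end': 26, 'label': 'EMAIL'}
```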
Batch Processing
Use build_batch() to process multiple texts at once:
docs = list(DocBuilder.build_batch(
    nlp=nlp,
    texts=["text1", "text2", "text3"],
    entities_list=[
        [{"start": 0, "end": 5, "label": "PERSON", "score": 0.9}],
        [{"start": 10, "end": 20, "label": "EMAIL", "score": 0.95}],
        [],  # no entities
    ],
))
for doc in nlp.pipe(docs):
    print(doc._.masked)
Context Words
Add context to help the context_enhancer component make relabeling decisions:
doc = DocBuilder(nlp, text).with_context_words(["email", "contact"]).build()
Customizing Components
PipelineBuilder
PipelineBuilder adds the default component chain in this order:
- recognizer
- context_enhancer
- conflict_resolver
- anonymizer
You can disable components you do not need:
from maskpipe import PipelineBuilder
builder = PipelineBuilder(
    nlp,
    label_mapping={"persoon": "PERSON"},
    disable=["context_enhancer"],
)
Context Enhancement
Add context patterns directly to the component:
context_enhancer = nlp.get_pipe("context_enhancer")
context_enhancer.add_patterns([
    {
        "label": "EMAIL",
        "pattern": [{"LOWER": {"IN": ["email", "mail", "e-mail"]}}],
    }
])
Important:
- context patterns match by label
- score changes come from component config such as confidence_boost
- context_label can relabel a matched span
- doc._.context_words lets you add extra context terms not present in the text
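The boosting idea can be sketched without spaCy: look for a context word within a few tokens of the span and raise the score, capped at 1.0. The window size and the boost value below are illustrative assumptions, as is the `boost_score` helper; the real component is configured through its own settings such as confidence_boost.

```python
def boost_score(tokens, span_start, span_end, score, context_words,
                window=3, confidence_boost=0.25):
    """Raise `score` if a context word occurs within `window` tokens of the span.

    The span covers tokens[span_start:span_end].
    """
    left = tokens[max(0, span_start - window):span_start]
    right = tokens[span_end:span_end + window]
    nearby = {tok.lower() for tok in left + right}
    if nearby & {word.lower() for word in context_words}:
        return min(1.0, score + confidence_boost)  # never exceed 1.0
    return score


tokens = "Bel me op 0612345678".split()
boosted = boost_score(tokens, 3, 4, score=0.5, context_words=["bel", "telefoon"])
unchanged = boost_score(tokens, 3, 4, score=0.5, context_words=["email"])
print(boosted, unchanged)  # 0.75 0.5
```

Extra terms supplied through doc._.context_words would, in this picture, simply be added to the set of nearby words before the comparison.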
Anonymizer
anonymizer = nlp.get_pipe("anonymizer")
anonymizer.add_redactors({
    "EMAIL": "[REDACTED]",
    "ID": lambda: "ID-000001",
    "PERSON": lambda text: text[0] + "." * (len(text) - 1),
})
The anonymizer:
- leaves doc.text unchanged
- stores masked output in doc._.masked
- stores the chosen replacement in span._.replacement
spaCy Extensions Added by MaskPipe
Document extensions:
- doc._.masked
- doc._.context_words
Span extensions:
- span._.score
- span._.context
- span._.replacement
Minimal API Reference
PipelineBuilder(nlp, label_mapping=None, disable=None)
DocBuilder(
    nlp,
    text,
    label_mapping=None,
    spans_key="sc",
    annotate_ents=False,
    default_score=0.6,
    alignment_mode="strict",
)
Entity(
    label,
    patterns=None,
    custom_matcher=None,
    validator=None,
    context_patterns=None,
    redactor=None,
)
Development
uv sync --dev
uv run pytest -q
License
MIT. See LICENSE.