MaskPipe is a modular, spaCy-native PII de-identification pipeline. Refine and orchestrate entity results from GLiNER, HuggingFace, or any NER source with context-aware boosting, rule-based validation, and flexible redaction.

MaskPipe

MaskPipe is a spaCy-native toolkit for detecting, refining, resolving, and redacting PII.

Use it when you want one of these workflows:

  • detect PII with built-in and custom rules, then redact it
  • take entities from another NER system, run overlap resolution, then redact them
  • combine both approaches in one spaCy pipeline

What MaskPipe Does

MaskPipe gives you four composable pipeline components:

  • recognizer: finds spans from token patterns, phrase patterns, and custom matchers
  • context_enhancer: boosts scores or relabels spans from nearby context
  • conflict_resolver: resolves overlap and filters low-confidence spans
  • anonymizer: writes masked output to doc._.masked

The original doc.text is never modified.

Installation

pip install maskpipe
python -m spacy download nl_core_news_sm

Requirements:

  • Python 3.11-3.14
  • spaCy 3.8+

Optional dependencies for examples and integrations:

pip install faker gliner transformers

Quick Start: Built-in Detection + Masking

This is the default workflow if you want MaskPipe to detect PII itself.

import spacy
from maskpipe import PipelineBuilder
from maskpipe import entities
from maskpipe.entities import nl

nlp = spacy.load("nl_core_news_sm", disable=["ner"])

builder = PipelineBuilder(nlp)
builder.add_entities([
    nl.BSN.replace(redactor="[BSN]"),
    entities.PHONE_NUMBER.replace(redactor="[PHONE_NUMBER]"),
    entities.EMAIL.replace(redactor="[EMAIL]"),
])

nlp = builder.build()

doc = nlp("Mijn BSN is 692015644, bel me op 0612345678 of mail naar info@example.com")

print(doc.text)
print(doc._.masked)
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.score, ent._.replacement)
# Mijn BSN is 692015644, bel me op 0612345678 of mail naar info@example.com
# Mijn BSN is [BSN], bel me op [PHONE_NUMBER] of mail naar [EMAIL]
# 692015644 BSN 0.85 [BSN]
# 0612345678 PHONE_NUMBER 0.75 [PHONE_NUMBER]
# info@example.com EMAIL 1.0 [EMAIL]

Output model:

  • doc.text: original text
  • doc._.masked: masked text
  • doc.ents: resolved spans after conflict resolution
  • span._.replacement: replacement chosen by the anonymizer

If no redactor is registered for a label, MaskPipe uses [LABEL].
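
As a rough illustration of this output model (not MaskPipe's actual implementation), the sketch below builds a masked copy from character-offset spans and leaves the original string untouched. The function mask_text and its arguments are hypothetical names; only the [LABEL] fallback comes from the description above.

```python
def mask_text(text, spans, redactors=None):
    """Return a masked copy of `text`; the input string is never modified.

    `spans` is a list of non-overlapping (start, end, label) character offsets.
    `redactors` maps a label to a fixed replacement string (hypothetical shape).
    """
    redactors = redactors or {}
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])                    # untouched text before the span
        pieces.append(redactors.get(label, f"[{label}]"))    # fall back to [LABEL]
        cursor = end
    pieces.append(text[cursor:])                             # trailing text after the last span
    return "".join(pieces)


text = "Mijn BSN is 692015644, bel me op 0612345678"
spans = [(12, 21, "BSN"), (33, 43, "PHONE_NUMBER")]

masked = mask_text(text, spans, redactors={"BSN": "<bsn>"})
print(text)
print(masked)
```

PHONE_NUMBER has no registered redactor in this sketch, so it falls back to [PHONE_NUMBER].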

Quick Start: External NER + Masking

This is the right setup if another model already produced entity offsets.

import spacy
from transformers import pipeline
from maskpipe import PipelineBuilder, DocBuilder, HF_NER_MAPPER

# Load your NER model
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Set up MaskPipe to only resolve overlaps and mask (no local detection).
nlp = spacy.load("nl_core_news_sm", disable=["ner"])
builder = PipelineBuilder(nlp, disable=["recognizer", "context_enhancer"])
nlp = builder.build()

text = "Alice works at Google. Contact her at alice@example.com or 555-1234."
results = ner(text)

# results is a list of dicts like:
# [{"word": "Alice", "start": 0, "end": 5, "entity_group": "PER", "score": 0.98}, ...]
# (with aggregation_strategy="simple", entity_group carries no B-/I- prefix)

doc = DocBuilder(nlp, text).with_entities(results, entity_mapper=HF_NER_MAPPER).build()
doc = nlp(doc)
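print(doc._.masked)
# Illustrative output. dslim/bert-base-NER only emits PER, ORG, LOC, and MISC, and the
# recognizer is disabled here, so the e-mail address and phone number are masked only
# if your NER model reports them:
# [PERSON] works at [ORG]. Contact her at [EMAIL] or [PHONE_NUMBER].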

Why this works:

  • HF_NER_MAPPER normalizes HuggingFace NER output to the canonical entity format.
  • with_entities() converts character offsets to spaCy spans with scores.
  • conflict_resolver deduplicates overlapping spans and writes clean results to doc.ents.
  • anonymizer reads doc.ents and generates doc._.masked.

Built-in Entities

Available in maskpipe.entities:

  • CREDIT_CARD
  • DATE
  • EMAIL
  • IBAN
  • IPV4
  • IPV6
  • MAC_ADDRESS
  • NUMBER
  • PHONE_NUMBER
  • URL

Available in maskpipe.entities.nl:

  • BSN

Entity objects are immutable configs. Use .replace(...) to override one field without rebuilding the whole entity:

from maskpipe import entities

masked_email = entities.EMAIL.replace(redactor="[EMAIL]")

Creating Custom Entities

from maskpipe.entities import Entity

EMPLOYEE_ID = Entity(
    label="EMPLOYEE_ID",
    patterns=[
        # In a raw string a single backslash is enough: \d matches one digit.
        {"pattern": [{"TEXT": {"REGEX": r"EMP-\d{5}"}}], "score": 0.9, "id": "employee-id"},
    ],
    context_patterns=[
        {"pattern": [{"LOWER": "employee"}]},
        {"context_label": "STAFF_ID", "pattern": [{"LOWER": "staff"}, {"LOWER": "id"}]},
    ],
    validator=lambda span: span.text.startswith("EMP-"),
    redactor=lambda text: "EMP-XXXXX",
)
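
A validator can do more than a prefix check. As an example, the sketch below implements the Dutch "elfproef" (eleven test) checksum for 9-digit BSN values. Whether nl.BSN uses this exact check internally is an assumption; is_valid_bsn is a hypothetical helper that works on plain strings.

```python
def is_valid_bsn(text: str) -> bool:
    """Return True if `text` is a 9-digit string that passes the BSN eleven test."""
    if len(text) != 9 or not (text.isascii() and text.isdigit()):
        return False
    # Weights 9..2 for the first eight digits, -1 for the last digit.
    weights = (9, 8, 7, 6, 5, 4, 3, 2, -1)
    total = sum(int(digit) * weight for digit, weight in zip(text, weights))
    return total % 11 == 0


print(is_valid_bsn("692015644"))  # the BSN from the quick start
print(is_valid_bsn("692015645"))  # last digit changed
```

Because the validator in the Entity example above receives a span, such a helper could be wired in as validator=lambda span: is_valid_bsn(span.text).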

Supported redactors:

  • fixed string: "[MASK]"
  • zero-argument callable: lambda: "generated-value"
  • one-argument callable: lambda text: text[:1] + "*" * (len(text) - 1)
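
One way to support all three redactor forms is to inspect the callable's signature before calling it. The sketch below shows that idea only; apply_redactor is a hypothetical helper, and MaskPipe's own dispatch may work differently.

```python
import inspect


def apply_redactor(redactor, text):
    """Resolve a redactor (fixed string, zero-arg or one-arg callable) to a replacement."""
    if isinstance(redactor, str):
        return redactor                      # fixed string
    params = inspect.signature(redactor).parameters
    if len(params) == 0:
        return redactor()                    # zero-argument callable
    return redactor(text)                    # one-argument callable gets the matched text


fixed = apply_redactor("[MASK]", "Alice")
generated = apply_redactor(lambda: "generated-value", "Alice")
partial = apply_redactor(lambda text: text[:1] + "*" * (len(text) - 1), "Alice")
print(fixed, generated, partial)
```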

DocBuilder

DocBuilder converts character offsets from external NER systems into spaCy spans with scores.

Basic Usage

from maskpipe import DocBuilder

# Create a doc and add entities
doc = DocBuilder(nlp, text).with_entities(
    entities=[
        {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
        {"start": 30, "end": 45, "label": "EMAIL", "score": 0.99},
    ]
).build()

doc = nlp(doc)
print(doc._.masked)

Entity Format

with_entities() expects a list of dicts with at least:

  • start: character offset (int)
  • end: character offset (int)
  • label: entity type (str)
  • score: confidence [0.0, 1.0] (float, optional)

entities = [
    {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
    {"start": 30, "end": 45, "label": "EMAIL", "score": 0.99},
]
doc = DocBuilder(nlp, text).with_entities(entities).build()
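
The DocBuilder signature in the API reference lists alignment_mode="strict" and default_score=0.6. In spaCy's own Doc.char_span, strict alignment rejects offsets that do not fall on token boundaries. The sketch below imitates that behavior with a simplified regex tokenizer instead of spaCy. That DocBuilder delegates to char_span, and that default_score fills in missing scores, are both assumptions; align_strict and token_boundaries are hypothetical names.

```python
import re


def token_boundaries(text):
    """Collect the character offsets where simplified tokens start and end."""
    starts, ends = set(), set()
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def align_strict(text, entities, default_score=0.6):
    """Keep only entities whose offsets coincide with token boundaries."""
    starts, ends = token_boundaries(text)
    aligned = []
    for ent in entities:
        start, end = ent["start"], ent["end"]
        if start < end and start in starts and end in ends:
            aligned.append({
                "text": text[start:end],
                "label": ent["label"],
                "score": ent.get("score", default_score),  # assumed fallback
            })
    return aligned


text = "Alice mailed bob@example.com today"
entities = [
    {"start": 0, "end": 5, "label": "PERSON", "score": 0.95},
    {"start": 13, "end": 28, "label": "EMAIL"},                # no score given
    {"start": 1, "end": 5, "label": "PERSON", "score": 0.90},  # starts mid-token
]

aligned = align_strict(text, entities)
for item in aligned:
    print(item)
```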

Entity Mappers

Use entity_mapper to normalize different NER output formats. MaskPipe provides pre-configured mappers:

Mapper           Use For                    Key Fields
GLINER_MAPPER    GLiNER (x-large)           start, end, label, score
GLINER2_MAPPER   GLiNER2 (nested format)    nested {label: {start, end, confidence}}
HF_NER_MAPPER    HuggingFace NER            entity_group (or entity), start, end, score
OPENMED_MAPPER   OpenMed (biomedical)       start, end, label, confidence

Example with GLiNER:

from maskpipe import GLINER_MAPPER, DocBuilder
from gliner import GLiNER

model = GLiNER.from_pretrained("knowledgator/gliner-x-large")
text = "Patient John Doe, email: john@example.com"
predictions = model.predict_entities(text, labels=["person", "email"], threshold=0.5)

doc = DocBuilder(nlp, text).with_entities(predictions, entity_mapper=GLINER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
# [PERSON], email: [EMAIL]

Example with HuggingFace NER:

from maskpipe import HF_NER_MAPPER, DocBuilder
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER")
text = "Contact: alice@example.com"
results = ner(text)

doc = DocBuilder(nlp, text).with_entities(results, entity_mapper=HF_NER_MAPPER).build()
doc = nlp(doc)
print(doc._.masked)
# Contact: [EMAIL]

Custom Mappers

Create custom mappers for other NER systems:

from maskpipe import EntityMapper

# For a system that reports the label as "type" and the score as "confidence"
custom_mapper = EntityMapper(label="type", score="confidence")

# For systems with conditional label fields
fallback_mapper = EntityMapper(
    label="entity_type",
    label_fallback="category",  # use if entity_type not found
    score="conf"
)
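
Conceptually, a mapper renames the keys of each raw prediction into the canonical start, end, label, score format. The function below is a hypothetical stand-in that mirrors the label, label_fallback, and score arguments shown above; it is not EntityMapper's real implementation.

```python
def normalize(raw, label="label", label_fallback=None, score="score"):
    """Map one raw prediction dict to the canonical {start, end, label, score} shape."""
    label_value = raw.get(label)
    if label_value is None and label_fallback is not None:
        label_value = raw.get(label_fallback)   # use the fallback field if the primary is missing
    if label_value is None:
        raise KeyError(f"no label found under {label!r} or {label_fallback!r}")
    entity = {"start": raw["start"], "end": raw["end"], "label": label_value}
    if score in raw:
        entity["score"] = float(raw[score])     # score stays optional
    return entity


raw_predictions = [
    {"start": 0, "end": 5, "entity_type": "PERSON", "conf": 0.9},
    {"start": 10, "end": 20, "category": "EMAIL", "conf": 0.5},  # only the fallback field
]
normalized = [
    normalize(p, label="entity_type", label_fallback="category", score="conf")
    for p in raw_predictions
]
print(normalized)
```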

Batch Processing

Use build_batch() to process multiple texts at once:

docs = list(DocBuilder.build_batch(
    nlp=nlp,
    texts=["text1", "text2", "text3"],
    entities_list=[
        [{"start": 0, "end": 5, "label": "PERSON", "score": 0.9}],
        [{"start": 10, "end": 20, "label": "EMAIL", "score": 0.95}],
        [],  # no entities
    ],
))

for doc in nlp.pipe(docs):
    print(doc._.masked)

Context Words

Add context words to help the context_enhancer component boost or relabel spans, including terms that do not appear in the text itself:

doc = DocBuilder(nlp, text).with_context_words(["email", "contact"]).build()

Customizing Components

PipelineBuilder

PipelineBuilder adds the default component chain in this order:

  1. recognizer
  2. context_enhancer
  3. conflict_resolver
  4. anonymizer

You can disable components you do not need:

from maskpipe import PipelineBuilder

builder = PipelineBuilder(
    nlp,
    label_mapping={"persoon": "PERSON"},
    disable=["context_enhancer"],
)

Context Enhancement

Add context patterns directly to the component:

context_enhancer = nlp.get_pipe("context_enhancer")
context_enhancer.add_patterns([
    {
        "label": "EMAIL",
        "pattern": [{"LOWER": {"IN": ["email", "mail", "e-mail"]}}],
    }
])

Important:

  • context patterns match by label
  • score changes come from component config such as confidence_boost
  • context_label can relabel a matched span
  • doc._.context_words lets you add extra context terms not present in the text
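
The points above can be sketched as a small scoring function: if a context term appears near a span, or is supplied as an extra context word, the span's score is raised by confidence_boost and capped at 1.0. The window size, the token-index span format, and the function name are assumptions for illustration; relabeling via context_label is left out.

```python
def boost_with_context(tokens, span, context_terms, extra_context=(),
                       window=3, confidence_boost=0.25):
    """Return a copy of `span` with a boosted score if a context term is nearby."""
    start, end = span["start"], span["end"]   # token indexes, end exclusive
    nearby = tokens[max(0, start - window):start] + tokens[end:end + window]
    # Extra context plays the role of doc._.context_words: terms not in the text.
    seen = {t.lower() for t in nearby} | {t.lower() for t in extra_context}
    if seen & {t.lower() for t in context_terms}:
        return {**span, "score": min(1.0, span["score"] + confidence_boost)}
    return dict(span)


email_terms = ["email", "mail", "e-mail"]

tokens = ["Stuur", "een", "mail", "naar", "info@example.com"]
span = {"start": 4, "end": 5, "label": "EMAIL", "score": 0.5}
boosted = boost_with_context(tokens, span, email_terms)

bare_span = {"start": 0, "end": 1, "label": "EMAIL", "score": 0.5}
unchanged = boost_with_context(["info@example.com"], bare_span, email_terms)
via_extra = boost_with_context(["info@example.com"], bare_span, email_terms,
                               extra_context=["email"])
print(boosted["score"], unchanged["score"], via_extra["score"])
```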

Anonymizer

anonymizer = nlp.get_pipe("anonymizer")
anonymizer.add_redactors({
    "EMAIL": "[REDACTED]",
    "ID": lambda: "ID-000001",
    "PERSON": lambda text: text[0] + "." * (len(text) - 1),
})

The anonymizer:

  • leaves doc.text unchanged
  • stores masked output in doc._.masked
  • stores the chosen replacement in span._.replacement

spaCy Extensions Added by MaskPipe

Document extensions:

  • doc._.masked
  • doc._.context_words

Span extensions:

  • span._.score
  • span._.context
  • span._.replacement

Minimal API Reference

PipelineBuilder(nlp, label_mapping=None, disable=None)
DocBuilder(
    nlp,
    text,
    label_mapping=None,
    spans_key="sc",
    annotate_ents=False,
    default_score=0.6,
    alignment_mode="strict",
)
Entity(
    label,
    patterns=None,
    custom_matcher=None,
    validator=None,
    context_patterns=None,
    redactor=None,
)

Development

uv sync --dev
uv run pytest -q

License

MIT. See LICENSE.
