Skip to main content

A Python library to deidentify and redact PII, PHI, and other sensitive information from text

Project description

Phileas (python)

A Python port of Phileas (Java) — a library to deidentify and redact PII, PHI, and other sensitive information from text.

Overview

Phileas analyzes text searching for sensitive information such as email addresses, phone numbers, SSNs, credit card numbers, and many other types of PII/PHI. When sensitive information is identified, Phileas can manipulate it in a variety of ways: the information can be redacted, masked, hashed, or replaced with a static value. The user defines how to handle each type of sensitive information through policies (YAML or JSON).

Other capabilities include referential integrity for redactions, conditional logic for redactions, and a CLI.

Phileas requires no external dependencies (e.g. no ChatGPT/etc.) and is intended to be lightweight and easy to use.

Compatibility Notes

Note that this port of Phileas is not 1:1 with the Java version. There are some differences:

  • This project includes a server that exposes redaction HTTP endpoints. For the Java version, the API server is Philter.
  • This project includes support for policies in YAML as well as JSON.
  • This project does not include all redaction strategies present in the Java version.
  • This project includes a CLI.
  • This project includes the ability to evaluate performance using precision and recall through a built-in evaluation tool.
  • This project does not include support for PDF documents which is present in the Java version.

Installation

pip install phileas-redact

Or, to install in development mode from source:

git clone https://github.com/philterd/phileas-python.git
cd phileas-python
pip install -e ".[dev]"

Quick Start

from phileas.policy.policy import Policy
from phileas.services.filter_service import FilterService

# Define a policy as a Python dict (or load from YAML)
policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "emailAddress": {
            "emailAddressFilterStrategies": [{
                "strategy": "REDACT",
                "redactionFormat": "{{{REDACTED-%t}}}"
            }]
        },
        "ssn": {
            "ssnFilterStrategies": [{
                "strategy": "REDACT",
                "redactionFormat": "{{{REDACTED-%t}}}"
            }]
        }
    }
}

policy = Policy.from_dict(policy_dict)
service = FilterService()

result = service.filter(
    policy=policy,
    context="my-context",
    document_id="doc-001",
    text="Contact john@example.com or call about SSN 123-45-6789."
)

print(result.filtered_text)
# Contact {{{REDACTED-email-address}}} or call about SSN {{{REDACTED-ssn}}}.

for span in result.spans:
    print(f"  [{span.filter_type}] '{span.text}' -> '{span.replacement}' at {span.character_start}:{span.character_end}")

Supported PII / PHI Types

Policy Key Filter Type Description
age age Age references (e.g., "35 years old", "aged 25")
emailAddress email-address Email addresses
creditCard credit-card Credit card numbers (Visa, MC, AmEx, Discover, etc.)
ssn ssn Social Security Numbers (SSNs) and TINs
phoneNumber phone-number US phone numbers
ipAddress ip-address IPv4 and IPv6 addresses
url url HTTP/HTTPS URLs
zipCode zip-code US ZIP codes (5-digit and ZIP+4)
vin vin Vehicle Identification Numbers
bitcoinAddress bitcoin-address Bitcoin addresses
bankRoutingNumber bank-routing-number US ABA bank routing numbers
date date Dates in common formats
macAddress mac-address Network MAC addresses
currency currency USD currency amounts
streetAddress street-address US street addresses
trackingNumber tracking-number UPS, FedEx, and USPS tracking numbers
driversLicense drivers-license US driver's license numbers
ibanCode iban-code International Bank Account Numbers (IBANs)
passportNumber passport-number US passport numbers
patterns user-defined Custom regex-based patterns (list of pattern filters)

Policies

A policy is a YAML (or Python dict) object that defines what sensitive information to identify and how to handle it.

Policy Structure

name: my-policy
identifiers:
  emailAddress:
    enabled: true
    emailAddressFilterStrategies:
      - strategy: REDACT
        redactionFormat: "{{{REDACTED-%t}}}"
    ignored:
      - noreply@example.com
ignored:
  - safe-term
ignoredPatterns:
  - "\\d{3}-test-\\d{4}"

Filter Strategies

Each filter type supports one or more strategies that define what to do with the identified information:

Strategy Description Example Output
REDACT Replace with a redaction tag {{{REDACTED-email-address}}}
MASK Replace each character with * ***@*******.***
STATIC_REPLACE Replace with a fixed string [REMOVED]
HASH_SHA256_REPLACE Replace with the SHA-256 hash a665a4592...
LAST_4 Mask all but the last 4 characters ****6789
SAME Leave the value unchanged (identify only) 123-45-6789
TRUNCATE Keep leading or trailing characters john@***
ABBREVIATE Abbreviate the value J. S.

Strategy Options

strategy: REDACT
redactionFormat: "{{{REDACTED-%t}}}"
staticReplacement: "[REMOVED]"
maskCharacter: "*"
maskLength: SAME
truncateLeaveCharacters: 4
truncateDirection: LEADING
condition: ""
  • %t in redactionFormat is replaced by the filter type name.

Ignored Terms

You can specify terms that should never be redacted at the policy level or per-filter level:

policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "emailAddress": {
            "emailAddressFilterStrategies": [{"strategy": "REDACT"}],
            "ignored": ["noreply@internal.com"]
        }
    },
    "ignored": ["safe-global-term"],
    "ignoredPatterns": ["\\d{3}-555-\\d{4}"]
}

Pattern-Based Filters

A policy can include a list of custom regex-based filters. Each pattern filter specifies a pattern (a regular expression) and an optional label used as the filter type in results. This is useful for identifying domain-specific PII that is not covered by the built-in filters.

policy_dict = {
    "name": "my-policy",
    "identifiers": {
        "patterns": [
            {
                "pattern": "\\d{3}-\\d{3}-\\d{3}",
                "label": "custom-id",
                "patternFilterStrategies": [{"strategy": "REDACT"}]
            }
        ]
    }
}

policy = Policy.from_dict(policy_dict)
result = service.filter(policy, "ctx", "doc1", "ID: 123-456-789")
print(result.filtered_text)  # ID: {{{REDACTED-custom-id}}}

Multiple pattern filters can be included in the same policy:

"patterns": [
    {"pattern": "\\d{3}-\\d{3}-\\d{3}", "label": "id-number"},
    {"pattern": "[A-Z]{2}\\d{6}", "label": "passport-number"}
]

Pattern Filter Options

Field Type Description
pattern str Regular expression used to identify PII
label str Filter type label used in spans (defaults to "pattern")
patternFilterStrategies list List of filter strategies (same as other filter types)
ignored list Terms that should not be redacted even if they match
enabled bool Whether the filter is active (default: true)

Contexts and Referential Integrity

Every call to FilterService.filter() takes a context name. The context is a logical grouping that ties multiple documents together — for example, all documents belonging to a single patient, user, or case.

Phileas uses the context to maintain referential integrity: once a PII token has been replaced, every subsequent occurrence of that same token in the same context receives the identical replacement. This ensures that redacted documents within a context remain internally consistent and can still be cross-referenced without revealing the underlying sensitive values.

How it works

Phileas maintains a ContextService — a map of maps with the structure:

context_name → { token → replacement }

Before applying any replacement, FilterService checks whether the token already has a stored replacement for the current context:

  • Token found — the stored replacement is used instead of generating a new one.
  • Token not found — the newly generated replacement is stored and then applied.

The default implementation is InMemoryContextService, which stores mappings in memory for the lifetime of the FilterService instance.

Using the default in-memory context service

from phileas import FilterService

service = FilterService()  # uses InMemoryContextService automatically

# Both calls operate in the same context, so 555-123-4567 always gets
# the same replacement across documents.
result1 = service.filter(policy, "patient-records", "doc1", "Call 555-123-4567 for info.")
result2 = service.filter(policy, "patient-records", "doc2", "Patient called 555-123-4567 back.")

Pre-seeding the context service

You can pre-populate the context service before filtering to force specific replacements:

from phileas import FilterService, InMemoryContextService

ctx_svc = InMemoryContextService()
ctx_svc.put("patient-records", "john@example.com", "EMAIL-001")

service = FilterService(context_service=ctx_svc)
# john@example.com will always be replaced with EMAIL-001 in the "patient-records" context

Providing a custom context service

Subclass AbstractContextService to integrate any external store (e.g. Redis, a database):

from phileas import FilterService, AbstractContextService

class RedisContextService(AbstractContextService):
    def put(self, context: str, token: str, replacement: str) -> None:
        # store in Redis
        ...

    def get(self, context: str, token: str) -> str | None:
        # retrieve from Redis, return None if not found
        ...

    def contains(self, context: str, token: str) -> bool:
        # check existence in Redis
        ...

service = FilterService(context_service=RedisContextService())

API Reference

FilterService

from phileas.services.filter_service import FilterService

service = FilterService(context_service=None)
result = service.filter(policy, context, document_id, text)

Constructor Parameters

Parameter Type Description
context_service AbstractContextService | None Context service implementation to use for referential integrity. Defaults to InMemoryContextService when None.

filter() Parameters

Parameter Type Description
policy Policy The policy to apply
context str Named context that groups documents for referential integrity (e.g., a patient ID or session name)
document_id str A unique identifier for the document being filtered
text str The text to filter

Returns FilterResult

Attribute Type Description
filtered_text str The text with sensitive information replaced
spans List[Span] Metadata about each identified piece of sensitive information
context str The context passed to filter()
document_id str The document ID passed to filter()

Span

Attribute Type Description
character_start int Start index of the span in the original text
character_end int End index of the span in the original text
filter_type str The type of PII identified (e.g., "email-address")
text str The original text of the span
replacement str The replacement value
confidence float Confidence score (0.0–1.0)
ignored bool Whether this span was marked as ignored (not replaced)
context str The context

Policy

from phileas.policy.policy import Policy

# From a dict
policy = Policy.from_dict({"name": "default", "identifiers": {...}})

# From a JSON string
policy = Policy.from_json('{"name": "default", ...}')

# To JSON
json_str = policy.to_json()

# To dict
d = policy.to_dict()

AbstractContextService

Abstract base class for context service implementations. Subclass this to provide a custom backend.

from phileas import AbstractContextService

class MyContextService(AbstractContextService):
    def put(self, context: str, token: str, replacement: str) -> None: ...
    def get(self, context: str, token: str) -> str | None: ...
    def contains(self, context: str, token: str) -> bool: ...

Methods

Method Signature Description
put (context, token, replacement) -> None Store a replacement value for a token under the given context
get (context, token) -> str | None Return the stored replacement, or None if not found
contains (context, token) -> bool Return True if a replacement exists for the token in the given context

InMemoryContextService

Default implementation of AbstractContextService backed by a dict[str, dict[str, str]]. Suitable for single-process, in-memory use.

from phileas import InMemoryContextService

ctx_svc = InMemoryContextService()
ctx_svc.put("my-context", "john@example.com", "EMAIL-001")
ctx_svc.get("my-context", "john@example.com")      # "EMAIL-001"
ctx_svc.contains("my-context", "john@example.com") # True

Examples

Mask credit card numbers

policy_dict = {
    "name": "cc-mask",
    "identifiers": {
        "creditCard": {
            "creditCardFilterStrategies": [{"strategy": "LAST_4"}]
        }
    }
}
policy = Policy.from_dict(policy_dict)
result = service.filter(policy, "ctx", "doc1", "Card: 4111111111111111")
print(result.filtered_text)  # Card: ************1111

Hash SSNs

policy_dict = {
    "name": "ssn-hash",
    "identifiers": {
        "ssn": {
            "ssnFilterStrategies": [{"strategy": "HASH_SHA256_REPLACE"}]
        }
    }
}

Disable a filter

policy_dict = {
    "name": "no-url",
    "identifiers": {
        "url": {"enabled": False}
    }
}

CLI

phileas ships a phileas command that performs redaction directly from the terminal.

Usage

phileas -p POLICY_FILE -c CONTEXT (-t TEXT | -f FILE) [options]
Argument Description
-p / --policy FILE Path to a policy file (JSON or YAML).
-c / --context CONTEXT Context name for referential integrity.
-t / --text TEXT Text to redact (mutually exclusive with --file).
-f / --file FILE Path to a file to redact (mutually exclusive with --text).
-d / --document-id ID Optional document identifier (auto-generated if omitted).
-o / --output FILE Write redacted text to a file instead of stdout.
--spans Print span metadata as JSON to stderr.
--evaluate FILE Evaluate redaction quality against a JSON ground-truth file. Prints precision, recall, and F1 metrics to stdout.

Examples

Redact a string:

phileas -p policy.yaml -c my-context -t "Contact john@example.com or call 800-555-1234."
# Contact {{{REDACTED-email-address}}} or call {{{REDACTED-phone-number}}}.

Redact a file and write output to a new file:

phileas -p policy.yaml -c my-context -f report.txt -o report_redacted.txt

View span metadata for each detected item:

phileas -p policy.yaml -c my-context -t "Email john@example.com." --spans

Evaluation Mode

Use --evaluate FILE to measure the redaction quality of a policy against a set of ground-truth annotations. Phileas runs the filter on the input text, compares the detected spans against the ground-truth spans, and prints precision, recall, and F1 metrics to stdout.

phileas -p policy.json -c my-context -t "Email john@example.com." --evaluate gt.json

The ground-truth file must be a JSON array of span objects, or a JSON object with a "spans" key. Each span must have "start" and "end" character positions; "type" is optional:

[{"start": 6, "end": 22, "type": "email-address"}]

Example output:

Email {{{REDACTED-email-address}}}.
{
  "truePositives": 1,
  "falsePositives": 0,
  "falseNegatives": 0,
  "precision": 1.0,
  "recall": 1.0,
  "f1": 1.0
}

Running Tests

pytest tests/ -v

License

Copyright 2026 Philterd, LLC.

Licensed under the Apache License, Version 2.0. See LICENSE for details.

"Phileas" and "Philter" are registered trademarks of Philterd, LLC.

This project is a Python port of Phileas, which is also Apache-2.0 licensed.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

phileas_redact-1.0.0.tar.gz (227.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

phileas_redact-1.0.0-py3-none-any.whl (239.5 kB view details)

Uploaded Python 3

File details

Details for the file phileas_redact-1.0.0.tar.gz.

File metadata

  • Download URL: phileas_redact-1.0.0.tar.gz
  • Upload date:
  • Size: 227.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for phileas_redact-1.0.0.tar.gz
Algorithm Hash digest
SHA256 d28a8d6c7b8cb7ba510afd9a454ba7ae78549920a8ce00a7c00fe12f4c39648c
MD5 572cb92ff97e907ebfeef2d5d2c12330
BLAKE2b-256 167be4dc0c3a7741d17b5f4635c3b89dcab6b8bb7f57958342c846bf7ed5990b

See more details on using hashes here.

File details

Details for the file phileas_redact-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: phileas_redact-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 239.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for phileas_redact-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7eb303f69e58988ec8be77b0b4227d0139c6f22eca4c806602fdc91aef25937a
MD5 cc9619b0ba27afe688b1408d02ebe747
BLAKE2b-256 b322e64449b130f65d5489f193e482cd8c8e1e87f84a34534436b6e298a0a797

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page