
sci-anonymizer-py

PyO3 bindings for sci-anonymizer — reversible entity anonymization for LLM round-trips.

Wheels are built per platform with maturin against abi3 (CPython's stable ABI), so a single wheel per platform covers every CPython release from 3.10 onward.
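
As a rough illustration of what the abi3 tag means in practice, the sketch below splits a wheel filename into its compatibility tags using only the standard library. The helper name parse_wheel_tags is made up for this example and is not part of sci-anonymizer or maturin; the filename is one of the wheels published for version 0.2.0.

```python
def parse_wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags.

    Wheel filenames follow: name-version[-build]-python-abi-platform.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        # A compressed platform tag lists several platforms separated by dots.
        "platforms": platform_tag.split("."),
    }


tags = parse_wheel_tags(
    "sci_anonymizer-0.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(tags["python"], tags["abi"])  # cp310 abi3
print(tags["platforms"])
```

The cp310 and abi3 pair is what lets one wheel per platform load on CPython 3.10 and later, instead of needing a separate wheel for each minor version.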

Installation (from source)

Prerequisites

  • Python 3.10+
  • Rust 1.95+
  • maturin (install via pipx install maturin or pip install maturin)

Build and install in a venv

cd core/crates/sci-anonymizer-py

# Create or activate a Python venv
python3 -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

# Install maturin if not already present
pip install maturin

# Build and install the wheel into the venv
maturin develop

# Verify installation
python -c "import sci_anonymizer; print(sci_anonymizer.SESSION_FORMAT_VERSION)"

Quick Start

from sci_anonymizer import (
    anonymize, anonymize_with_custom, deanonymize,
    Entity, EntityType,
)

# Anonymize text
text = "Email casey@example.com about the Acme project deal"
result = anonymize(text)

print(result.text)
# Example output: Email EMAIL_1 about the PROJECT_1 project deal

# Deanonymize (map the tokens in the model's reply back to the real entities)
model_reply = "I'll contact EMAIL_1 about PROJECT_1 next week"
restored = deanonymize(model_reply, result.token_map)

print(restored)
# Example output: I'll contact casey@example.com about Acme next week

# Use custom entities (domain-specific terms); these go through anonymize_with_custom
custom = [Entity("InternalCodeXYZ", EntityType.Secret)]
internal_text = "Rotate InternalCodeXYZ before emailing casey@example.com"
result = anonymize_with_custom(internal_text, custom_entities=custom)
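
The Quick Start steps can be folded into one helper for a full LLM round-trip. This is a minimal sketch, not part of the package: round_trip and call_model are hypothetical names, and the anonymize and deanonymize functions are passed in as arguments so the helper can be tried with stand-ins before the extension module is installed. It relies only on the documented anonymize(text, existing=None) and deanonymize(text, token_map) signatures and the .text and .token_map attributes of the result.

```python
def round_trip(prompt, call_model, anonymize, deanonymize, existing=None):
    """Anonymize a prompt, send it to a model, and restore entities in the reply.

    call_model is a hypothetical callable (str -> str) wrapping your LLM client.
    In real use, pass sci_anonymizer.anonymize and sci_anonymizer.deanonymize
    for the anonymize and deanonymize arguments.
    """
    result = anonymize(prompt, existing=existing)
    reply = call_model(result.text)  # the model only ever sees tokens
    restored = deanonymize(reply, result.token_map)
    # Return the map as well so the next turn can pass it back as `existing`.
    return restored, result.token_map
```

Passing the returned token map back in as existing on the next call keeps tokens stable across a multi-turn conversation.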

API Overview

Core Functions

  • anonymize(text, existing=None) — Detect entities and replace with tokens.
  • anonymize_with_custom(text, existing=None, custom_entities=None) — Same + custom entities.
  • deanonymize(text, token_map) — Reverse: tokens → entities.
  • build_token_map(entities, existing=None) — Lower-level: build a token map from entities.
  • apply_token_map(text, token_map) — Lower-level: apply substitutions to text.
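
To make the division of labour between these functions concrete, here is a toy pure-Python model of the build, apply, and reverse steps. It is an illustration only, not the library's implementation: the toy_ function names are invented, entities are plain (text, type_name) tuples rather than Entity objects, the token map is a plain dict, and the TYPE_N token format is inferred from the Quick Start output.

```python
import re
from collections import defaultdict


def toy_build_token_map(entities, existing=None):
    """Map tokens like EMAIL_1 to entity text; entities are (text, type_name) pairs."""
    token_map = dict(existing or {})
    known = {entity: token for token, entity in token_map.items()}
    counters = defaultdict(int)
    for token in token_map:  # resume numbering after any existing tokens
        prefix, _, number = token.rpartition("_")
        if number.isdigit():
            counters[prefix] = max(counters[prefix], int(number))
    for text, type_name in entities:
        if text in known:
            continue  # an entity seen before keeps its token
        prefix = type_name.upper()
        counters[prefix] += 1
        token = f"{prefix}_{counters[prefix]}"
        token_map[token] = text
        known[text] = token
    return token_map


def _substitute(text, mapping):
    """Replace every key of mapping found in text, longest keys first, in one pass."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


def toy_apply_token_map(text, token_map):
    """Entities -> tokens."""
    return _substitute(text, {entity: token for token, entity in token_map.items()})


def toy_deanonymize(text, token_map):
    """Tokens -> entities."""
    return _substitute(text, token_map)


token_map = toy_build_token_map([("casey@example.com", "Email"), ("Acme", "Project")])
masked = toy_apply_token_map("Email casey@example.com about the Acme project deal", token_map)
print(masked)  # Email EMAIL_1 about the PROJECT_1 project deal
print(toy_deanonymize("I'll contact EMAIL_1 about PROJECT_1 next week", token_map))
```

Matching longest keys first in a single pass keeps EMAIL_10 from being partially rewritten as EMAIL_1 followed by a stray 0; whether the Rust core uses the same strategy is not stated here.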

Types

  • EntityType — Enum: Person, Place, Org, Project, Email, Phone, Url, Handle, Secret, IpAddress.
  • Entity — A detected span: Entity(text, entity_type).
  • TokenMap — Bidirectional mapping. Can serialize/deserialize:
    • token_map.to_session_json() → JSON string (versioned envelope).
    • TokenMap.from_session_json(json_str) → TokenMap (raises ValueError if unsupported version).
  • AnonymizeResult — Output of anonymize* with .text, .token_map, .entity_count, .entities.

Constants

  • SESSION_FORMAT_VERSION — Current session format version (int). See session serialization contract in ../sci-anonymizer/API.md.
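
The idea of a versioned envelope can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the field names "version" and "tokens", the value of TOY_FORMAT_VERSION, and the toy_ function names are invented, and the real layout is defined by the session serialization contract, not by this snippet. The only behaviour taken from the description above is that loading an unsupported version raises ValueError.

```python
import json

# Hypothetical value; the real one is sci_anonymizer.SESSION_FORMAT_VERSION.
TOY_FORMAT_VERSION = 1


def toy_to_session_json(mapping):
    """Wrap a token -> entity dict in an envelope that records the format version."""
    envelope = {"version": TOY_FORMAT_VERSION, "tokens": mapping}
    return json.dumps(envelope, sort_keys=True)


def toy_from_session_json(json_str):
    """Unwrap the envelope, refusing versions this code does not understand."""
    envelope = json.loads(json_str)
    version = envelope.get("version")
    if version != TOY_FORMAT_VERSION:
        raise ValueError(f"unsupported session format version: {version!r}")
    return dict(envelope["tokens"])


stored = toy_to_session_json({"EMAIL_1": "casey@example.com"})
print(toy_from_session_json(stored))  # {'EMAIL_1': 'casey@example.com'}
```

Checking the version before touching the payload means a reader built for an older format fails loudly instead of misreading a newer one.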

Session Persistence

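from sci_anonymizer import TokenMap, anonymize

# Serialize a token map for storage
# (`result` is the AnonymizeResult from an earlier anonymize() call)
json_str = result.token_map.to_session_json()
# Save json_str to disk or a database

# Later, restore the map and extend it with new text
# (`new_text` is the next input to anonymize)
token_map = TokenMap.from_session_json(json_str)  # raises ValueError on an unsupported version
next_result = anonymize(new_text, existing=token_map)
# An entity seen before gets the same token as before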
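
One way to handle the "save to disk" step is a pair of small file helpers. This is a sketch under stated assumptions: save_session and load_session are hypothetical names, not package functions, and the token map class is passed in as a parameter (in real use, sci_anonymizer.TokenMap) so the helpers depend only on the documented to_session_json() and from_session_json() methods.

```python
from pathlib import Path


def save_session(token_map, path):
    """Write token_map.to_session_json() to path as UTF-8 text."""
    Path(path).write_text(token_map.to_session_json(), encoding="utf-8")


def load_session(token_map_cls, path):
    """Return a restored token map, or None if nothing usable is stored.

    token_map_cls is expected to be sci_anonymizer.TokenMap.
    """
    file = Path(path)
    if not file.exists():
        return None  # first run: no session saved yet
    try:
        return token_map_cls.from_session_json(file.read_text(encoding="utf-8"))
    except ValueError:
        return None  # unsupported session format version: start a fresh map
```

Because existing defaults to None, the result can be passed straight through, for example anonymize(new_text, existing=load_session(TokenMap, "session.json")). Returning None on ValueError discards the old mapping; re-raise instead if replies containing old tokens still need to be restored.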

Testing

Run the Python smoke test:

cd core/crates/sci-anonymizer-py
python tests/test_smoke.py

The smoke test validates:

  • Round-trip fidelity: for result = anonymize(text), deanonymize(result.text, result.token_map) == text
  • Multiple entity types detected correctly
  • Session serialization/deserialization
  • Custom entities

Limitations

This binding wraps the portable regex and CamelCase entity detection from sci-anonymizer. It does not include:

  • NLP NER (Named Entity Recognition for PERSON/PLACE/ORG): Tracked in SCI-123. The Rust port uses a CamelCase heuristic to catch compound proper nouns, but bare "John Doe" style names are not detected without an NER model.
  • Custom entity loading from identity_facts: Tracked in SCI-124. Users supply custom entities via the custom_entities parameter.

For production use with full NER, integrate with the Rust core directly or patch this layer with the SCI-123/124 implementations when available.
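
The gap between a CamelCase heuristic and real NER is easy to see with a small regular expression. The pattern below is an assumed stand-in written for this example; the Rust port's actual rule may differ. AcmeCloud is an invented name, and InternalCodeXYZ is reused from the Quick Start.

```python
import re

# Assumed pattern: one capitalised chunk followed by at least one more chunk
# that starts with an uppercase letter, e.g. "AcmeCloud".
CAMEL_CASE = re.compile(r"\b[A-Z][a-z]+(?:[A-Z][a-z0-9]*)+\b")


def find_camel_case(text):
    """Return compound CamelCase words; plain capitalised words are ignored."""
    return CAMEL_CASE.findall(text)


print(find_camel_case("Ask John Doe about AcmeCloud and InternalCodeXYZ"))
# ['AcmeCloud', 'InternalCodeXYZ']
```

"John" and "Doe" each contain a single capitalised chunk, so a pattern of this kind skips them, which is why bare personal names need an NER model or an explicit custom entity.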

License

Licensed under Apache-2.0 OR MIT, same as sci-anonymizer.

Download files

Source distribution

  • sci_anonymizer-0.2.0.tar.gz (99.4 kB)

Built distributions

  • sci_anonymizer-0.2.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (693.3 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • sci_anonymizer-0.2.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (682.2 kB): CPython 3.10+, manylinux (glibc 2.17+), ARM64
  • sci_anonymizer-0.2.0-cp310-abi3-macosx_11_0_arm64.whl (596.5 kB): CPython 3.10+, macOS 11.0+, ARM64
  • sci_anonymizer-0.2.0-cp310-abi3-macosx_10_12_x86_64.whl (622.0 kB): CPython 3.10+, macOS 10.12+, x86-64

