GDPR-compliant data privacy tool for entity redaction and de-anonymization
Project description
Inconnu
What is Inconnu?
Inconnu is a GDPR-compliant data privacy tool designed for entity redaction and de-anonymization. It provides cutting-edge NLP-based tools for anonymizing and pseudonymizing text data while maintaining data utility, ensuring your business meets stringent privacy regulations.
Why Inconnu?
-
Seamless Compliance: Inconnu simplifies the complexity of GDPR and other privacy laws, making sure your data handling practices are always in line with legal standards.
-
State-of-the-Art NLP: Utilizing advanced spaCy models and custom entity recognition, Inconnu ensures that personal identifiers are completely detected and properly handled.
-
Transparency and Trust: Complete processing documentation with timestamping, hashing, and entity mapping for full audit trails.
-
Reversible Processing: Support for both anonymization and pseudonymization with complete de-anonymization capabilities.
-
Performance Optimized: Fast processing with singleton pattern optimization and configurable text length limits.
Installation
Prerequisites
- Python 3.10 or higher
- pip (Python package manager)
Install from PyPI
# Using pip
pip install inconnu
# Using UV (Recommended)
uv add inconnu
Note: Language models are NOT included as optional dependencies. You'll need to download them separately using the inconnu-download command after installation (see below).
Download Language Models
After installing Inconnu, use the inconnu-download command to download spaCy language models:
# Download default (small) models
inconnu-download en # English
inconnu-download de # German
inconnu-download en de fr # Multiple languages
inconnu-download all # All default models
# Download specific model sizes
inconnu-download en --size large # Large English model
inconnu-download en --size transformer # Transformer model (English only)
# List available models and check what's installed
inconnu-download --list
# Upgrade existing models
inconnu-download en --upgrade
# Get help for UV environments
inconnu-download --uv-help
How Model Installation Works
- No Optional Dependencies: spaCy models are NOT included as pip/uv optional dependencies to avoid unnecessary downloads during dependency resolution
- On-Demand Downloads: The
inconnu-downloadcommand downloads only the models you need - Smart Environment Detection: Automatically detects UV environments and provides appropriate guidance
- Verification: Checks if models are already installed before downloading
Available Model Sizes
- Small (sm): Default, fast processing, ~15-50MB, good for high-volume
- Medium (md): Better accuracy, ~50-200MB, moderate speed
- Large (lg): High accuracy, ~200-600MB, slower processing
- Transformer (trf): Highest accuracy, ~400MB+, GPU-optimized (English only)
Alternative: Direct spaCy Download
You can also use spaCy directly if preferred:
python -m spacy download en_core_web_sm # English small
python -m spacy download de_core_news_lg # German large
Install from Source
-
Clone the repository:
git clone https://github.com/0xjgv/inconnu.git cd inconnu
-
Install with UV (recommended for development):
uv sync # Install dependencies inconnu-download en de # Download language models make test # Run tests
-
Or install with pip:
pip install -e . # Install in editable mode python -m spacy download en_core_web_sm
Development Commands
For development, the Makefile provides convenience targets:
# Download models using make commands
make model-en # English small
make model-de # German small
make model-it # Italian small
make model-es # Spanish small
make model-fr # French small
# Other development commands
make test # Run tests
make lint # Check code with ruff
make format # Format code
make clean # Clean cache and format code
Using Different Models in Code
To use a different model size, first download it, then specify it when initializing:
from inconnu import Inconnu
from inconnu.nlp.entity_redactor import SpacyModels
# First, download the model you want
# $ inconnu-download en --size large
# Then use it in your code
inconnu = Inconnu(
language="en",
model_name=SpacyModels.EN_CORE_WEB_LG # Use large model
)
# For highest accuracy (transformer model)
inconnu_trf = Inconnu(
language="en",
model_name=SpacyModels.EN_CORE_WEB_TRF
)
Model Selection Guide:
en_core_web_sm: Fast processing, good for high-volumeen_core_web_lg: Better accuracy, moderate speeden_core_web_trf: Highest accuracy, GPU-optimized (recommended for sensitive data)
For a complete list of supported models, run inconnu-download --list
Development Setup
Available Commands
# Development workflow
make install # Install all dependencies
make model-de # Download German spaCy model
make model-it # Download Italian spaCy model
make model-es # Download Spanish spaCy model
make model-fr # Download French spaCy model
make test # Run full test suite
make lint # Check code with ruff
make format # Format code with ruff
make fix # Auto-fix linting issues
make clean # Format, lint, fix, and clean cache
make update-deps # Update dependencies
Running Tests
# Run all tests
make test
# Run with verbose output
uv run pytest -vv
# Run specific test file
uv run pytest tests/test_inconnu.py -vv
# Run specific test class
uv run pytest tests/test_inconnu.py::TestInconnuPseudonymizer -vv
Usage Examples
Basic Text Anonymization
from inconnu import Inconnu
# Simple initialization - no Config class required!
inconnu = Inconnu() # Uses sensible defaults
# Simple anonymization - just the redacted text
text = "John Doe from New York visited Paris last summer."
redacted = inconnu.redact(text)
print(redacted)
# Output: "[PERSON] from [GPE] visited [GPE] [DATE]."
# Pseudonymization - get both redacted text and entity mapping
redacted_text, entity_map = inconnu.pseudonymize(text)
print(redacted_text)
# Output: "[PERSON_0] from [GPE_0] visited [GPE_1] [DATE_0]."
print(entity_map)
# Output: {'[PERSON_0]': 'John Doe', '[GPE_0]': 'New York', '[GPE_1]': 'Paris', '[DATE_0]': 'last summer'}
# Advanced usage with full metadata (original API)
result = inconnu(text=text)
print(result.redacted_text)
print(f"Processing time: {result.processing_time_ms:.2f}ms")
Async and Batch Processing
import asyncio
# Async processing for non-blocking operations
async def process_texts():
inconnu = Inconnu()
# Single async processing
text = "John Doe called from +1-555-123-4567"
redacted = await inconnu.redact_async(text)
print(redacted) # "[PERSON] called from [PHONE_NUMBER]"
# Batch async processing
texts = [
"Alice Smith visited Berlin",
"Bob Jones went to Tokyo",
"Carol Brown lives in Paris"
]
results = await inconnu.redact_batch_async(texts)
for result in results:
print(result)
asyncio.run(process_texts())
Customer Service Email Processing
# Process customer service email with personal data
customer_email = """
Dear SolarTech Team,
I am Max Mustermann living at Hauptstraße 50, 80331 Munich, Germany.
My phone number is +49 89 1234567 and my email is max@example.com.
I need to return my solar modules (Order: ST-78901) due to relocation.
Best regards,
Max Mustermann
"""
# Simple redaction
redacted = inconnu.redact(customer_email)
print(redacted)
# Personal identifiers are automatically detected and redacted
Multi-language Support
# German language processing - simplified!
inconnu_de = Inconnu("de") # Just specify the language
german_text = "Herr Schmidt aus München besuchte Berlin im März."
redacted = inconnu_de.redact(german_text)
print(redacted)
# Output: "[PERSON] aus [GPE] besuchte [GPE] [DATE]."
Custom Entity Recognition
from inconnu import Inconnu, NERComponent
import re
# Add custom entity recognition
custom_components = [
NERComponent(
label="CREDIT_CARD",
pattern=re.compile(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b'),
processing_function=None
)
]
# Simple initialization with custom components
inconnu_custom = Inconnu(
language="en",
custom_components=custom_components
)
# Test custom entity detection
text = "My card number is 1234 5678 9012 3456"
redacted = inconnu_custom.redact(text)
print(redacted) # "My card number is [CREDIT_CARD]"
Context Manager for Resource Management
# Automatic resource cleanup
with Inconnu() as inc:
redacted = inc.redact("Sensitive data about John Doe")
print(redacted)
# Resources automatically cleaned up
Error Handling
from inconnu import Inconnu, TextTooLongError, ProcessingError
inconnu = Inconnu(max_text_length=100) # Set small limit for demo
try:
long_text = "x" * 200 # Exceeds limit
result = inconnu.redact(long_text)
except TextTooLongError as e:
print(f"Text too long: {e}")
# Error includes helpful suggestions for resolution
except ProcessingError as e:
print(f"Processing failed: {e}")
Use Cases
1. Customer Support Systems
Automatically redact personal information from customer service emails, chat logs, and support tickets while maintaining context for analysis.
2. Legal Document Processing
Anonymize legal documents, contracts, and case files for training, analysis, or public release while ensuring GDPR compliance.
3. Medical Record Anonymization
Process medical records and research data to remove patient identifiers while preserving clinical information for research purposes.
4. Financial Transaction Analysis
Redact personal financial information from transaction logs and banking communications for fraud analysis and compliance reporting.
5. Survey and Feedback Analysis
Anonymize customer feedback, survey responses, and user-generated content for analysis while protecting respondent privacy.
6. Training Data Preparation
Prepare training datasets for machine learning models by removing personal identifiers from text data while maintaining semantic meaning.
Supported Entity Types
- Standard Entities: PERSON, GPE (locations), DATE, ORG, MONEY
- Custom Entities: EMAIL, IBAN, PHONE_NUMBER
- Enhanced Detection: Person titles (Dr, Mr, Ms), international phone numbers
- Multilingual: English, German, Italian, Spanish, and French language support
Features
- Robust Entity Detection: Advanced NLP with spaCy models and custom regex patterns
- Dual Processing Modes: Anonymization (
[PERSON]) and pseudonymization ([PERSON_0]) - Complete Audit Trail: Timestamping, hashing, and processing metadata
- Reversible Processing: Full de-anonymization capabilities with entity mapping
- Performance Optimized: Singleton pattern for model loading, configurable limits
- GDPR Compliant: Built-in data retention policies and compliance features
Contributing
We welcome contributions to Inconnu! As an open source project, we believe in the power of community collaboration to build better privacy tools.
How to Contribute
1. Bug Reports & Feature Requests
- Open an issue on GitHub with detailed descriptions
- Include code examples and expected vs actual behavior
- Tag issues appropriately (bug, enhancement, documentation)
2. Code Contributions
# Fork the repository and create a feature branch
git checkout -b feature/your-feature-name
# Make your changes and ensure tests pass
make test
make lint
# Submit a pull request with:
# - Clear description of changes
# - Test coverage for new features
# - Updated documentation if needed
3. Development Guidelines
- Follow existing code style and patterns
- Add tests for new functionality
- Update documentation for user-facing changes
- Ensure GDPR compliance considerations are addressed
4. Areas for Contribution
- Language Support: Add new language models and region-specific entity detection
- Custom Entities: Implement detection for industry-specific identifiers
- Performance: Optimize processing speed and memory usage
- Documentation: Improve examples, tutorials, and API documentation
- Testing: Expand test coverage and edge case handling
5. Code Review Process
- All contributions require code review
- Automated tests must pass
- Documentation updates are appreciated
- Maintain backward compatibility when possible
Community Guidelines
- Be Respectful: Foster an inclusive environment for all contributors
- Privacy First: Always consider privacy implications of changes
- Security Minded: Report security issues privately before public disclosure
- Quality Focused: Prioritize code quality and comprehensive testing
Getting Help
- Discussions: Use GitHub Discussions for questions and ideas
- Issues: Report bugs and request features through GitHub Issues
- Documentation: Check existing docs and contribute improvements
Thank you for helping make Inconnu a better tool for data privacy and GDPR compliance!
Publishing to PyPI
For Maintainers
To publish a new version to PyPI:
-
Configure Trusted Publisher (first time only):
- Go to https://pypi.org/manage/project/inconnu/settings/publishing/
- Add a new trusted publisher:
- Publisher: GitHub
- Organization/username:
0xjgv - Repository name:
inconnu - Workflow name:
publish.yml - Environment name:
pypi(optional but recommended)
- For Test PyPI, do the same at https://test.pypi.org with environment name:
testpypi
-
Update Version: Update the version in
pyproject.tomlandinconnu/__init__.py -
Create a Git Tag:
git tag v0.1.0 git push origin v0.1.0
-
GitHub Actions: The workflow will automatically:
- Run tests on Python 3.10, 3.11, and 3.12
- Build the package
- Publish to PyPI using Trusted Publisher (no API tokens needed!)
- Generate PEP 740 attestations for security
-
Test PyPI Publishing:
- Use workflow_dispatch to manually trigger Test PyPI publishing
- Go to Actions → Publish to PyPI → Run workflow
Manual Publishing (if needed)
# Build the package
uv build
# Check the package
twine check dist/*
# Upload to Test PyPI (requires API token)
twine upload --repository testpypi dist/*
# Upload to PyPI (requires API token)
twine upload dist/*
GitHub Environments (Recommended)
Configure GitHub environments for additional security:
- Go to Settings → Environments
- Create
pypiandtestpypienvironments - Add protection rules:
- Required reviewers
- Restrict to specific tags (e.g.,
v*) - Add deployment branch restrictions
Additional Resources
- spaCy Models Directory - Complete list of available language models
- spaCy Model Releases - GitHub repository for model updates
- pgeocode - Geographic location processing (potential future integration)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file inconnu-0.1.1.tar.gz.
File metadata
- Download URL: inconnu-0.1.1.tar.gz
- Upload date:
- Size: 137.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
79709f369a6011439f2242f1b122e871ae9eddfaffdddbc94eea9248aac81ba1
|
|
| MD5 |
384670701d96ac3ac3ee21731959353c
|
|
| BLAKE2b-256 |
7e62e392205fb136e5c3f1dbca640d87514f2472418fcdb3decacf2e3f831e91
|
Provenance
The following attestation bundles were made for inconnu-0.1.1.tar.gz:
Publisher:
publish.yml on 0xjgv/inconnu
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
inconnu-0.1.1.tar.gz -
Subject digest:
79709f369a6011439f2242f1b122e871ae9eddfaffdddbc94eea9248aac81ba1 - Sigstore transparency entry: 305754704
- Sigstore integration time:
-
Permalink:
0xjgv/inconnu@77a724a868167809ba959fdd929f3a2b1192ec19 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/0xjgv
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@77a724a868167809ba959fdd929f3a2b1192ec19 -
Trigger Event:
push
-
Statement type:
File details
Details for the file inconnu-0.1.1-py3-none-any.whl.
File metadata
- Download URL: inconnu-0.1.1-py3-none-any.whl
- Upload date:
- Size: 21.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
695d2602f571fd37fd53ae9f347eeeb7193e0f11ce619996f60eaf7579cb44b2
|
|
| MD5 |
ef949d0725c432f05bd9b916375e800c
|
|
| BLAKE2b-256 |
ee73478ba39da5ebc778054d3f49c5699bd067f078db3090159043630f8ef6a3
|
Provenance
The following attestation bundles were made for inconnu-0.1.1-py3-none-any.whl:
Publisher:
publish.yml on 0xjgv/inconnu
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
inconnu-0.1.1-py3-none-any.whl -
Subject digest:
695d2602f571fd37fd53ae9f347eeeb7193e0f11ce619996f60eaf7579cb44b2 - Sigstore transparency entry: 305754713
- Sigstore integration time:
-
Permalink:
0xjgv/inconnu@77a724a868167809ba959fdd929f3a2b1192ec19 -
Branch / Tag:
refs/tags/v0.1.1 - Owner: https://github.com/0xjgv
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@77a724a868167809ba959fdd929f3a2b1192ec19 -
Trigger Event:
push
-
Statement type: