Privacy-first text anonymization tool with enterprise-grade accuracy for removing PII from documents
🕵️ Anon - Privacy-First Text Anonymizer
A powerful, offline-first text anonymization tool that removes personally identifiable information (PII) from text while keeping all data on your machine. Built with enterprise-grade accuracy using spaCy NER models and Microsoft Presidio.
✨ Features
- 🔒 100% Offline - All processing happens on your machine
- 🎯 High Accuracy - Advanced NER using spaCy large models + Presidio
- 🔐 Secure Always-Redact - Custom sensitive terms stored securely in ~/.anonymizer
- 🖥️ Multiple Interfaces - Modern GUI, Web API, and CLI
- 🚀 Background Processing - CLIs run detached with proper logging
- 📦 Easy Installation - One-command install with automatic model setup
- 🏢 Cross-Platform - Windows, macOS, and Linux support
🚀 Quick Start
Installation
pip install simple-anonymizer
The installation will automatically download the required spaCy model (en_core_web_lg) for optimal accuracy.
GUI Application
Launch the modern GUI interface:
anon-gui
✅ The GUI runs in the background - you can close the terminal after launch
📝 Logs available at ~/.anonymizer/gui_YYYYMMDD_HHMMSS.log
Web Interface
Start the web server:
anon-web start
✅ The server runs in the background - accessible at http://127.0.0.1:8080
📝 Comprehensive logging and process management
Web Server Management
# Start server (custom host/port)
anon-web start --host 0.0.0.0 --port 5000
# Check server status
anon-web status
# View recent logs
anon-web logs
# Stop server
anon-web stop
# Clean old log files (preserves always-redact settings)
anon-web clean
Always-Redact Management
Securely manage custom sensitive terms that should always be anonymized:
# Add terms to always-redact list
anon-web add-redact "CompanyName"
anon-web add-redact "ProjectCodename"
# Remove terms from always-redact list
anon-web remove-redact "ProjectCodename"
# List all always-redacted terms
anon-web list-redact
🔐 Security Features:
- Terms stored securely in ~/.anonymizer/always_redact.txt
- Not visible in GUI or web interfaces (add/remove only)
- Persists across all anonymization operations
- Case-insensitive matching with duplicate prevention
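The storage behaviour described above (one file of terms, duplicates rejected regardless of case) can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper names `load_terms` and `add_term` are hypothetical, the one-term-per-line layout is an assumption, and the demo writes to a temporary directory instead of ~/.anonymizer.

```python
from pathlib import Path
import tempfile


def load_terms(path: Path) -> list[str]:
    """Return the stored terms, assuming one term per line; blank lines are skipped."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def add_term(path: Path, term: str) -> bool:
    """Append `term` unless an equal term (ignoring case) is already stored."""
    term = term.strip()
    if not term:
        return False
    existing = {stored.casefold() for stored in load_terms(path)}
    if term.casefold() in existing:
        return False  # duplicate prevention, case-insensitive
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(term + "\n")
    return True


# Demo against a throwaway file so the real term list is never touched.
demo_file = Path(tempfile.mkdtemp()) / "always_redact.txt"
added_first = add_term(demo_file, "ProjectCodename")
added_again = add_term(demo_file, "projectcodename")  # same term, different case
stored_terms = load_terms(demo_file)
```

Comparing with `casefold()` rather than `lower()` keeps the duplicate check stable for non-ASCII terms.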
Python API
from anonymizer_core import redact
# Basic anonymization
result = redact("John Doe works at Microsoft in Seattle.")
print(result.text)
# Output: "<REDACTED> works at <REDACTED> in <REDACTED>."
# Always-redact terms are automatically applied
# (managed via CLI commands shown above)
result = redact("Contact john@acme.com about AcmeProject details.")
print(result.text)
# Output: "Contact <REDACTED> about <REDACTED> details."
# (if "AcmeProject" was added to always-redact list)
🔐 Data Security & Privacy
Always-Redact Terms
- Secure Storage: Custom sensitive terms are stored in ~/.anonymizer/always_redact.txt
- No Shipping: The file is created locally on first use, never shipped with the package
- Privacy-First: Terms are not exposed through GUI or web interfaces
- CLI-Only Access: Terms can only be viewed via command line for security
- Persistent: Settings survive application updates and log cleanups
File Locations
# User data directory
~/.anonymizer/
├── always_redact.txt # Your custom sensitive terms
├── gui_YYYYMMDD_HHMMSS.log # GUI application logs
└── web_server_*.log # Web server logs
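For scripts that need to inspect or archive logs, the layout above suggests a simple way to separate log files from the term list. The sketch below is an assumption-laden illustration: `find_log_files` is a hypothetical helper, the glob patterns are copied from the tree above, and the file names in the demo are made up. It does not describe how `anon-web clean` is implemented.

```python
from pathlib import Path
import tempfile

# Patterns taken from the directory tree above; always_redact.txt matches neither.
LOG_PATTERNS = ("gui_*.log", "web_server_*.log")


def find_log_files(data_dir: Path) -> list[Path]:
    """Return the log files in `data_dir`, sorted by path."""
    logs: set[Path] = set()
    for pattern in LOG_PATTERNS:
        logs.update(data_dir.glob(pattern))
    return sorted(logs)


# Demo in a temporary directory that mimics the layout (file names are invented).
demo_dir = Path(tempfile.mkdtemp())
for name in ("always_redact.txt", "gui_20250101_120000.log", "web_server_1.log"):
    (demo_dir / name).touch()

log_names = [path.name for path in find_log_files(demo_dir)]
```

Pass `Path.home() / ".anonymizer"` as `data_dir` to run it against the real data directory.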
Data Flow
- Input Text → Standard PII Detection (emails, phones, etc.)
- Input Text → Always-Redact Terms (your custom words)
- Combined Results → Final Anonymized Output
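The flow above can be sketched in a few lines. This is a simplified stand-in, not the package's engine: it uses a single illustrative email regex in place of the full PII detection, and the function names (`detect_standard_pii`, `detect_always_redact`, `anonymize`) are hypothetical.

```python
import re

# Illustrative email pattern only; the real tool detects many more entity types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_standard_pii(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans for standard PII; here only email addresses."""
    return [match.span() for match in EMAIL_RE.finditer(text)]


def detect_always_redact(text: str, terms: list[str]) -> list[tuple[int, int]]:
    """Return spans of custom terms, matched as whole words and ignoring case."""
    spans = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        spans.extend(match.span() for match in pattern.finditer(text))
    return spans


def anonymize(text: str, terms: list[str], placeholder: str = "<REDACTED>") -> str:
    """Combine both detectors and replace every detected span with the placeholder."""
    spans = sorted(detect_standard_pii(text) + detect_always_redact(text, terms))
    pieces, cursor = [], 0
    for start, end in spans:
        if start >= cursor:  # a new region: keep the text before it, then redact
            pieces.append(text[cursor:start])
            pieces.append(placeholder)
        cursor = max(cursor, end)  # overlapping spans extend the redacted region
    pieces.append(text[cursor:])
    return "".join(pieces)


demo_output = anonymize(
    "Contact john@acme.com about AcmeProject details.", ["acmeproject"]
)
```

Both detectors read the original text, so neither sees placeholders left by the other; the results are merged only at the final replacement step.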
🔧 Advanced Usage
GUI Features
- Modern Interface: Clean, intuitive design with real-time processing
- Secure Term Management: Add/remove always-redact terms without exposure
- File Processing: Load and save text files directly
- Background Processing: Non-blocking anonymization with progress indicators
Web API Features
- RESTful Endpoints: Standard HTTP API for integration
- File Upload: Process text files via web interface
- JSON Response: Structured output with metadata
- Health Checks: Monitor service status programmatically
CLI Management
- Process Control: Start/stop/status for web server
- Log Management: View and clean application logs
- Term Management: Secure always-redact term administration
- Background Operation: All services run detached from terminal
🛠️ Technical Details
Anonymization Engine
- Multi-Tier Processing: Pattern-based → Always-redact → NER fallback
- Position Tracking: Prevents overlapping redactions for accuracy
- Case Insensitive: Always-redact terms match regardless of case
- Word Boundaries: Only complete words are redacted (not partial matches)
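One way to read the position-tracking bullet is that a span found by an earlier tier blocks any overlapping span from a later tier. The sketch below shows that idea under stated assumptions: the `Span` type, the tier numbering (0 = pattern, 1 = always-redact, 2 = NER), and `resolve_overlaps` are all hypothetical, and the real engine may resolve conflicts differently.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    tier: int   # assumed ordering: 0 = pattern, 1 = always-redact, 2 = NER
    label: str


def resolve_overlaps(candidates: list[Span]) -> list[Span]:
    """Keep non-overlapping spans, preferring lower tiers, then longer spans."""
    accepted: list[Span] = []
    ordered = sorted(candidates, key=lambda s: (s.tier, s.start, -(s.end - s.start)))
    for cand in ordered:
        # Accept only if the candidate lies entirely before or after every kept span.
        if all(cand.end <= kept.start or cand.start >= kept.end for kept in accepted):
            accepted.append(cand)
    return sorted(accepted, key=lambda s: s.start)


# An email found by the pattern tier wins over an NER guess inside the same region.
candidates = [
    Span(6, 16, 2, "PERSON"),  # overlaps the email span, so it is dropped
    Span(6, 25, 0, "EMAIL"),
    Span(30, 35, 2, "ORG"),    # no overlap, so it is kept
]
resolved = resolve_overlaps(candidates)
```

Resolving overlaps before replacing text means each character range is redacted at most once, which avoids nested or duplicated placeholders.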
Supported Entity Types
- Emails: john@example.com
- URLs: https://example.com
- IP Addresses: 192.168.1.1
- Phone Numbers: +1-555-123-4567
- Custom Terms: Your always-redact list
- Names: Via NER when available
- Organizations: Via NER when available
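To make the pattern-based entity types concrete, here is a small sketch with one regular expression per type. The expressions are illustrative assumptions chosen to match the sample values listed above; they are not the patterns shipped with the package, and real-world phone and URL formats vary far more than these cover.

```python
import re

# One deliberately simple pattern per entity type (assumed, not the package's own).
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://\S+"),
    "IP_ADDRESS": re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b"),
    "PHONE": re.compile(r"\+?\d{1,3}[-.\s]\d{3}[-.\s]\d{3}[-.\s]\d{4}"),
}


def find_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit in `text`."""
    found = []
    for label, pattern in PATTERNS.items():
        found.extend((label, match.group()) for match in pattern.finditer(text))
    return found


samples = ["john@example.com", "https://example.com", "192.168.1.1", "+1-555-123-4567"]
sample_labels = [find_entities(sample) for sample in samples]
```

Names and organizations are absent here on purpose: they have no fixed shape, which is why the list above assigns them to NER.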
📋 Examples & Use Cases
Basic Anonymization
from anonymizer_core import redact
text = "Please contact John Smith at john.smith@acme.com or call +1-555-0123."
result = redact(text)
print(result.text)
# Output: "Please contact <REDACTED> at <REDACTED> or call <REDACTED>."
Company-Specific Anonymization
# Set up company-specific terms
anon-web add-redact "AcmeCorp"
anon-web add-redact "ProjectTitan"
anon-web add-redact "confidential"
# Now these terms are always redacted
python -c "
from anonymizer_core import redact
text = 'AcmeCorp confidential: ProjectTitan budget is 500K'
print(redact(text).text)
"
# Output: "<REDACTED> <REDACTED>: <REDACTED> budget is 500K"
Enterprise Integration
# Configure once via CLI
# anon-web add-redact "YourCompanyName"
# anon-web add-redact "YourProduct"
# Use in your application
from anonymizer_core import redact
def process_support_ticket(ticket_text):
    """Anonymize support tickets before logging."""
    result = redact(ticket_text)
    return result.text
# All company-specific terms are automatically redacted
anonymized = process_support_ticket(
    "Customer john@email.com reported YourProduct crashed on YourCompanyName servers."
)
print(anonymized)
# Output: "Customer <REDACTED> reported <REDACTED> crashed on <REDACTED> servers."
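If tickets reach the log through Python's standard logging module, the same idea can be applied at the logging layer so that no call site forgets to anonymize. The sketch below is one possible approach, not part of the package: `RedactingFilter` is a hypothetical class that accepts any string-to-string callable, and the demo uses a stub in place of the real `redact` so that it runs without the package installed.

```python
import logging
from typing import Callable


class RedactingFilter(logging.Filter):
    """Logging filter that passes each rendered message through `redact_fn`."""

    def __init__(self, redact_fn: Callable[[str], str]) -> None:
        super().__init__()
        self._redact_fn = redact_fn

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then anonymize the result.
        record.msg = self._redact_fn(record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it


class ListHandler(logging.Handler):
    """Demo handler that collects messages in memory instead of writing them out."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def stub_redact(text: str) -> str:
    """Stand-in for the real anonymizer; hides one fixed address."""
    return text.replace("john@email.com", "<REDACTED>")


logger = logging.getLogger("tickets_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
handler.addFilter(RedactingFilter(stub_redact))
logger.addHandler(handler)

logger.info("Customer %s reported a crash", "john@email.com")
```

With the package installed, `RedactingFilter(lambda s: redact(s).text)` would plug in the `redact` function used in the examples above.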
Batch Processing
# Set up your terms once
anon-web add-redact "SensitiveTerm1"
anon-web add-redact "SensitiveTerm2"
# Process multiple files - terms persist across all operations
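for file in *.txt; do
  # Pass the file name as an argument so quotes in names cannot break the Python code
  python -c "
import sys
from anonymizer_core import redact
src = sys.argv[1]
with open(src, 'r') as f:
    content = f.read()
with open('anonymized_' + src, 'w') as f:
    f.write(redact(content).text)
" "$file"
done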
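The shell loop starts a fresh Python interpreter for every file, so for large batches a single Python process may be preferable. The sketch below is a hedged alternative, not a documented feature: `anonymize_directory` is a hypothetical helper that takes the anonymizing callable as a parameter, and the demo uses a stub so that it runs without the package installed.

```python
from pathlib import Path
from typing import Callable
import tempfile


def anonymize_directory(
    src_dir: Path, out_dir: Path, redact_fn: Callable[[str], str]
) -> list[Path]:
    """Write an anonymized copy of every .txt file in `src_dir` into `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(src_dir.glob("*.txt")):
        target = out_dir / f"anonymized_{source.name}"
        content = source.read_text(encoding="utf-8")
        target.write_text(redact_fn(content), encoding="utf-8")
        written.append(target)
    return written


# Demo with a stub anonymizer and temporary directories.
demo_src = Path(tempfile.mkdtemp())
demo_out = Path(tempfile.mkdtemp())
(demo_src / "ticket.txt").write_text("SensitiveTerm1 leaked", encoding="utf-8")

written_files = anonymize_directory(
    demo_src, demo_out, lambda text: text.replace("SensitiveTerm1", "<REDACTED>")
)
demo_result = written_files[0].read_text(encoding="utf-8")
```

With the package installed, pass `lambda text: redact(text).text` as `redact_fn`. Writing to a separate output directory also keeps a second run from picking up the `anonymized_` files as new input.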
Security Audit
# List all configured terms (CLI only for security)
anon-web list-redact
# Remove terms that are no longer sensitive
anon-web remove-redact "OldProjectName"
# Clean logs while preserving term configuration
anon-web clean
🚨 Security Best Practices
Always-Redact Configuration
- Review Regularly: Audit your always-redact terms periodically
- Principle of Least Privilege: Only add terms that truly need redaction
- Team Coordination: Ensure team members know which terms are configured
- Backup: Consider backing up ~/.anonymizer/always_redact.txt securely
Production Deployment
- Isolated Environment: Deploy in secure, isolated environments
- Log Management: Regularly clean logs with anon-web clean
- Access Control: Restrict CLI access to authorized personnel only
- Monitor Usage: Review anonymization logs for compliance
📊 CLI Command Reference
Server Management
anon-web start [--host HOST] [--port PORT] # Start web server
anon-web stop # Stop web server
anon-web status # Check server status
anon-web logs # View recent logs
anon-web clean # Clean old logs (preserve settings)
Always-Redact Management
anon-web add-redact "TERM" # Add term to always-redact list
anon-web remove-redact "TERM" # Remove term from list
anon-web list-redact # List all terms (CLI only)
GUI Launch
anon-gui # Launch GUI application
🔍 Troubleshooting
Common Issues
Terms not being redacted?
- Verify term was added: anon-web list-redact
- Check exact spelling (matching is case-insensitive, so only the spelling matters)
- Ensure word boundaries (partial matches won't work)
GUI/Web not reflecting new terms?
- This is by design for security
- Terms are automatically applied during anonymization
- Use CLI list-redact to verify configuration
Server won't start?
- Check if port is already in use: anon-web status
- Try different port: anon-web start --port 8081
- Check logs: anon-web logs
Performance issues?
- Clean old logs: anon-web clean
- For large texts, consider batch processing
- Restart services if needed: anon-web stop && anon-web start
Need help? Check the logs in ~/.anonymizer/ for detailed error information.