Allyanonimiser
Australian-focused PII detection and anonymization for the insurance industry.
Quick Start
from allyanonimiser import create_allyanonimiser
# Create the Allyanonimiser instance with default settings
ally = create_allyanonimiser()
# Analyze text
results = ally.analyze(
    text="Please reference your policy AU-12345678 for claims related to your vehicle rego XYZ123."
)
# Print results
for result in results:
    print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")
DataFrame Processing
import pandas as pd
from allyanonimiser import create_allyanonimiser
# Create a DataFrame with potentially sensitive data
df = pd.DataFrame({
    'id': range(1, 5),
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams'],
    'note': [
        'Customer called about policy POL123456. DOB: 15/07/1982.',
        'Email from jane@example.com about claim CL789012. Lives at 10 Queen St, Sydney NSW 2000.',
        'Medicare number 2123 45678 1 received for claim. Patient born 1990-03-22.',
        'TFN: 123 456 789. Contact at 42 Example St, Melbourne VIC 3000. Age: 45.'
    ]
})
# Create an Allyanonimiser instance
ally = create_allyanonimiser()
# Process a DataFrame (anonymize sensitive data)
anonymized_df = ally.anonymize_dataframe(
    df,
    'note',
    operators={
        'PERSON': 'replace',
        'EMAIL_ADDRESS': 'mask',
        'INSURANCE_POLICY_NUMBER': 'redact',
        'INSURANCE_CLAIM_NUMBER': 'redact',
        'DATE_OF_BIRTH': 'age_bracket'  # Convert DOBs to age brackets
    },
    active_entity_types=[  # Only detect these entity types
        'PERSON', 'EMAIL_ADDRESS', 'INSURANCE_POLICY_NUMBER',
        'INSURANCE_CLAIM_NUMBER', 'DATE_OF_BIRTH', 'AU_ADDRESS',
        'AU_POSTCODE', 'AU_MEDICARE', 'AU_TFN'
    ],
    age_bracket_size=5,  # Use 5-year brackets (default)
    keep_postcode=True  # Preserve postcodes in addresses (default)
)
Configuration and Pattern Management
from allyanonimiser import create_allyanonimiser
# Create with default settings
ally = create_allyanonimiser()
# Add a custom pattern
ally.add_pattern({
    "entity_type": "INTERNAL_REFERENCE",
    "patterns": [r"REF-\d{5}", r"Reference:\s*([A-Z0-9-]+)"],
    "context": ["internal", "reference", "ref"],
    "name": "Internal Reference Number"
})
# Save configuration for reuse by team members
ally.export_config("company_config.json")
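To make the roles of the `patterns` and `context` fields more concrete, here is a minimal standard-library sketch of how a regex pattern definition with context words could be scored. This is not the library's implementation: `find_matches`, the base score, the boost, and the 30-character look-behind window are all assumptions chosen for illustration.

```python
import re

# Same shape as the dictionary passed to add_pattern above (one regex kept for brevity).
pattern_def = {
    "entity_type": "INTERNAL_REFERENCE",
    "patterns": [r"REF-\d{5}"],
    "context": ["internal", "reference", "ref"],
}


def find_matches(text, definition, base_score=0.6, boost=0.25, window=30):
    """Return (entity_type, matched_text, score) tuples.

    Assumption: a match scores higher when any context word appears in the
    `window` characters immediately before it.
    """
    results = []
    for regex in definition["patterns"]:
        for match in re.finditer(regex, text):
            before = text[max(0, match.start() - window):match.start()].lower()
            has_context = any(word in before for word in definition.get("context", []))
            score = base_score + boost if has_context else base_score
            results.append((definition["entity_type"], match.group(0), round(score, 2)))
    return results


print(find_matches("Internal reference REF-12345 raised.", pattern_def))
print(find_matches("Ticket REF-54321 raised.", pattern_def))
```

The first call finds context words before the match and reports the boosted score; the second has no context words and keeps the base score. The substring check is deliberately naive, so a word such as "prefer" would also count as containing "ref".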
Sharing and Controlling Configurations
from allyanonimiser import create_allyanonimiser
# Load a shared configuration
ally = create_allyanonimiser(settings_path="company_config.json")
# Enable only specific entity types for this session
active_entities = ["PERSON", "INTERNAL_REFERENCE", "EMAIL_ADDRESS"]
result = ally.anonymize(
    text="Reference: REF-12345 submitted by John Smith (john.smith@example.com)",
    operators={"PERSON": "redact"},
    active_entity_types=active_entities  # Only these entities will be detected
)
print(result["text"])
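# Output: "Reference: <INTERNAL_REFERENCE> submitted by [REDACTED] (<EMAIL_ADDRESS>)"
# Only PERSON has an explicit operator; the other entities fall back to the default replacement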
# Get available entity types to selectively enable/disable
available_entities = ally.get_available_entity_types()
print(f"Available entity types: {list(available_entities.keys())}")
# Update an existing configuration with new patterns
ally.add_pattern({
    "entity_type": "PROJECT_CODE",
    "patterns": [r"PROJ-\d{4}"],
    "name": "Project Code"
})
# Export as a new version
ally.export_config("company_config_v2.json")
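Conceptually, sharing a configuration is a serialize-and-reload round trip. The sketch below shows that idea with the standard `json` module only. The dictionary layout and the helper names `export_config_sketch` and `load_config_sketch` are hypothetical; the file the library actually writes may be structured differently.

```python
import json
import os
import tempfile

# Hypothetical configuration layout, for illustration only.
config = {
    "active_entity_types": ["PERSON", "INTERNAL_REFERENCE", "EMAIL_ADDRESS"],
    "patterns": [
        {"entity_type": "PROJECT_CODE", "patterns": [r"PROJ-\d{4}"], "name": "Project Code"}
    ],
}


def export_config_sketch(cfg: dict, path: str) -> None:
    """Write the configuration as indented JSON so it diffs cleanly in version control."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(cfg, fh, indent=2)


def load_config_sketch(path: str) -> dict:
    """Read a configuration previously written by export_config_sketch."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "company_config_v2.json")
    export_config_sketch(config, path)
    restored = load_config_sketch(path)

print(restored == config)  # True
```

Regex backslashes survive the round trip because JSON escapes them on write and restores them on read.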
Simplified API (v1.1.0)
Version 1.1.0 introduces a simplified API with configuration objects and unified interface methods:
from allyanonimiser import create_allyanonimiser, AnalysisConfig, AnonymizationConfig
# Create instance
ally = create_allyanonimiser()
# Use configuration objects for cleaner parameter organization
analysis_config = AnalysisConfig(
    active_entity_types=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    min_score_threshold=0.8,
    expand_acronyms=True
)
anonymization_config = AnonymizationConfig(
    operators={
        "PERSON": "replace",
        "EMAIL_ADDRESS": "mask",
        "PHONE_NUMBER": "redact"
    },
    age_bracket_size=10,
    keep_postcode=True
)
# Process text with configuration objects
result = ally.process(
    text="Customer John Smith (john.smith@example.com) called from 0412 345 678.",
    analysis_config=analysis_config,
    anonymization_config=anonymization_config
)
# Unified pattern management interface
ally.manage_patterns(
    action="create_from_examples",
    entity_type="CUSTOMER_ID",
    examples=["CID-12345", "CID-67890"],
    generalization_level="medium"
)
# Unified acronym management interface
ally.manage_acronyms(
    action="add",
    data={"POL": "Policy", "CL": "Claim", "DOB": "Date of Birth"}
)
# Unified DataFrame processing interface
# (my_dataframe is a placeholder for an existing pandas DataFrame with a "notes" column)
df_result = ally.process_dataframe(
    df=my_dataframe,
    column="notes",
    operation="anonymize",
    analysis_config=analysis_config,
    anonymization_config=anonymization_config
)
See example_simplified_api.py for a complete demonstration of the simplified API.
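As a rough illustration of what grouping analysis parameters into one object buys you, the following sketch filters a list of detections by entity type and score threshold. `Detection`, `SketchAnalysisConfig`, and `filter_detections` are hypothetical stand-ins written for this example; they are not the library's `AnalysisConfig` and make no claim about how it works internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Detection:
    entity_type: str
    text: str
    score: float


@dataclass
class SketchAnalysisConfig:
    # An empty list is treated as "all entity types allowed" in this sketch.
    active_entity_types: List[str] = field(default_factory=list)
    min_score_threshold: float = 0.0


def filter_detections(detections, config):
    """Keep detections that are of an active type and meet the score threshold."""
    kept = []
    for det in detections:
        if config.active_entity_types and det.entity_type not in config.active_entity_types:
            continue
        if det.score < config.min_score_threshold:
            continue
        kept.append(det)
    return kept


detections = [
    Detection("PERSON", "John Smith", 0.85),
    Detection("EMAIL_ADDRESS", "john.smith@example.com", 0.95),
    Detection("PHONE_NUMBER", "0412 345 678", 0.70),
    Detection("AU_POSTCODE", "2000", 0.90),
]
cfg = SketchAnalysisConfig(
    active_entity_types=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    min_score_threshold=0.8,
)
for det in filter_detections(detections, cfg):
    print(det.entity_type, det.text, det.score)
```

With these inputs the phone number is dropped for scoring below 0.8 and the postcode is dropped because its type is not active.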
Features
- Australian-Specific PII Detection: Specialized recognizers for Australian TFNs, Medicare numbers, driver's licenses, and other Australian-specific identifiers.
- Insurance Industry Focus: Recognition of policy numbers, claim references, vehicle identifiers, and other insurance-specific data.
- Long Text Processing: Optimized for processing lengthy free-text fields like claim notes, medical reports, and emails.
- Custom Pattern Creation: Easy creation of custom entity recognizers for organization-specific data.
- Synthetic Data Generation: Generate realistic Australian test data for validation.
- LLM Integration: Use Language Models to create challenging datasets for testing.
- Extensible Architecture: Built on Presidio and spaCy with a modular, extensible design.
Version History
Version 1.1.0 - API Simplification
This version introduces a simplified, more consistent API while maintaining backward compatibility.
Key Features
- Unified Interface Methods:
  - Added manage_acronyms(action, data, ...) to replace multiple acronym methods
  - Added manage_patterns(action, data, ...) to replace multiple pattern methods
  - Added unified process_dataframe(operation, ...) to consolidate DataFrame methods
  - Reduced total API method count while preserving all functionality
- Configuration Objects:
  - Added AnalysisConfig for grouping analysis parameters
  - Added AnonymizationConfig for grouping anonymization parameters
  - Improved parameter organization and reduced parameter count in method signatures
  - Enhanced readability and maintainability
- Improved Developer Experience:
  - New example script (example_simplified_api.py) demonstrating the simplified API
  - Maintained backward compatibility with deprecated method support
  - Comprehensive docstrings for all new methods
  - Consolidated parameter handling for better code organization
Version 1.0.0 - Simplified API and Enhanced Features
This version introduces a simplified API with cleaner function names and adds powerful new features for more flexible anonymization. This is a breaking change from previous versions.
Key Updates
- Simplified API:
  - Removed legacy analyzer factory functions (create_au_analyzer, create_insurance_analyzer, create_au_insurance_analyzer)
  - Added cleaner create_analyzer() function as the main entry point
  - Streamlined and simplified internal implementation
- New Features:
  - Address Postcode Preservation: Preserve postcodes when anonymizing addresses with the keep_postcode parameter
  - Age Bracketing: Convert dates of birth to age brackets with configurable bracket sizes using the age_bracket operator
  - Improved error handling with informative error messages
  - Enhanced input validation for more robust operation
- Major Version Upgrade:
  - Increased major version number to 1.0.0 to indicate production readiness
  - Removed backward compatibility with 0.x versions
  - Improved documentation and examples
Version 0.3.3 - Python 3.10+ Compatibility
This version updates the package to ensure compatibility with Python 3.10 and newer versions, addressing dependency changes and improving CI/CD workflows.
Key Updates
- Python Version Requirements:
  - Updated minimum Python version to 3.10+
  - Added support for Python 3.11 and 3.12
  - Removed support for Python 3.8 and 3.9 due to dependency requirements
- Improved Compatibility:
  - Fixed circular import issues in the insurance module
  - Enhanced batch processing capabilities
  - Addressed issues with newer NumPy (2.0+) requirements
- CI/CD Improvements:
  - Updated GitHub Actions workflows to test on Python 3.10-3.12
  - Fixed failing tests and improved test coverage
  - Enhanced build process for PyPI deployment
Version 0.3.2 - Configuration Sharing and PyArrow Integration
This version adds functionality to export configuration settings to shareable files and improves DataFrame performance with optional PyArrow integration.
Key Features
- Configuration Export/Import:
  - Export settings to JSON or YAML files for sharing with team members
  - Load exported configuration in new instances
  - Include optional metadata for documentation
  - Seamless integration with existing settings management
  from allyanonimiser import create_allyanonimiser
  # Export configuration to share with team members
  ally = create_allyanonimiser()
  ally.set_acronym_dictionary({'POL': 'Policy', 'CL': 'Claim'})
  ally.settings_manager.set_entity_types(['PERSON', 'EMAIL_ADDRESS'])
  ally.export_config("my_config.json")
  # Team members can load the configuration
  shared_ally = create_allyanonimiser(settings_path="my_config.json")
- PyArrow Integration:
  - Optional PyArrow support for improved DataFrame performance
  - Graceful fallback when PyArrow isn't available
  - Configurable through settings with sensible defaults
  - Enhanced DataFrame processing speed for large datasets
Version 0.3.1 - DataFrame Processing and Batch Operations
This version adds comprehensive DataFrame processing capabilities, enabling efficient analysis and anonymization of data stored in pandas DataFrames.
Key Features
- DataFrame Processing:
  - Process pandas DataFrames with optimized memory usage
  - Batch operations for efficient handling of large datasets
  - Parallel processing for improved performance
  - Progress tracking for long-running operations
- Anonymization and Analysis:
  - Anonymize specific columns or entire DataFrames
  - Extract entities from text columns
  - Generate statistical reports on detected entities
  - Configure entity types and anonymization operators
Version 0.3.0 - Custom Pattern Support and Pattern Management
This version adds comprehensive custom pattern creation and management capabilities, enabling users to define, test, save, and load their own PII detection patterns.
Key Features
- Custom Pattern Creation:
  - Create and manage custom pattern definitions
  - Generate patterns from examples with different generalization levels
  - Save and load pattern definitions for reuse
  - Add patterns directly to analyzers or pattern registries
- Pattern Testing and Verification:
  - Test patterns against positive and negative examples
  - Validate pattern structure and components
  - Get diagnostic information for pattern matching
  - Identify and fix issues in pattern definitions
Installation
pip install allyanonimiser==1.1.0
Prerequisites
- Python 3.10 or higher
- For optimal performance, also install a spaCy model:
python -m spacy download en_core_web_lg
Usage Examples
Direct Analyzer Usage
from allyanonimiser import create_analyzer, EnhancedAnalyzer
# Use the factory function for a pre-configured analyzer
analyzer = create_analyzer()
# Analyze some text
results = analyzer.analyze(
    text="My policy number is POL-123456. My claim reference is CLM/2023/78901.",
    language="en"
)
# Print the results
for result in results:
    print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")
Text Anonymization
from allyanonimiser import create_allyanonimiser
# Create an Allyanonimiser instance
ally = create_allyanonimiser()
# Analyze and anonymize text
result = ally.anonymize(
    text="Please contact John Smith at john.smith@example.com regarding TFN: 123 456 789.",
    operators={
        "PERSON": "replace",  # Replace with <PERSON>
        "EMAIL_ADDRESS": "mask",  # Apply partial masking: j***@e******.com
        "AU_TFN": "redact",  # Fully redact the TFN: [REDACTED]
    }
)
print(result["text"])
# Output: "Please contact <PERSON> at j***@e******.com regarding TFN: [REDACTED]."
Advanced Features
from allyanonimiser import create_allyanonimiser
# Create an Allyanonimiser instance
ally = create_allyanonimiser()
# Preserve postcodes in addresses
result = ally.anonymize(
    text="The customer lives at 42 Main St, Sydney NSW 2000.",
    keep_postcode=True  # Default is True
)
print(result["text"])
# Output: "The customer lives at <AU_ADDRESS> 2000."
# Convert dates of birth to age brackets
result = ally.anonymize(
    text="Patient: Jane Doe, DOB: 15/03/1980, Medicare: 1234 56789 0",
    operators={
        "DATE_OF_BIRTH": "age_bracket",  # Convert to 5-year brackets by default
        "PERSON": "replace"
    }
)
print(result["text"])
# Output: "Patient: <PERSON>, DOB: 40-44, Medicare: <AU_MEDICARE>"
# Process multiple files from a directory
results = ally.process_files(
    file_paths=["claim1.txt", "claim2.txt"],
    output_dir="anonymized_output/",
    operators={"DATE_OF_BIRTH": "age_bracket", "AU_TFN": "redact"},
    age_bracket_size=10
)
Supported Entity Types
The package includes recognition for:
Australian-Specific Entities
- AU_TFN (Tax File Numbers)
- AU_MEDICARE (Medicare card numbers)
- AU_PASSPORT (Australian passport numbers)
- AU_DRIVERS_LICENSE (Australian driver's license numbers)
- AU_ABN (Australian Business Numbers)
- AU_ACN (Australian Company Numbers)
- AU_ADDRESS (Australian street addresses)
- AU_POSTCODE (Australian postcodes)
- AU_PHONE (Australian phone numbers)
- AU_REGO (Vehicle registration numbers)
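Identifiers such as TFNs can be recognized by format alone or by format plus a check-digit test. The sketch below contrasts the two using only the standard library. The weights are the commonly documented TFN check-digit scheme for nine-digit numbers; whether this package applies a checksum is not stated here, so treat `find_tfns` and its `require_checksum` flag as hypothetical.

```python
import re

# Commonly documented weights for 9-digit TFNs: the weighted sum must be divisible by 11.
TFN_WEIGHTS = (1, 4, 3, 7, 5, 8, 6, 9, 10)
TFN_FORMAT = re.compile(r"\b\d{3}\s?\d{3}\s?\d{3}\b")


def tfn_checksum_ok(candidate: str) -> bool:
    """Return True when the nine digits pass the weighted mod-11 check."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 9:
        return False
    return sum(d * w for d, w in zip(digits, TFN_WEIGHTS)) % 11 == 0


def find_tfns(text: str, require_checksum: bool = False):
    """Find TFN-shaped strings, optionally keeping only those that pass the checksum."""
    hits = [m.group(0) for m in TFN_FORMAT.finditer(text)]
    if require_checksum:
        hits = [h for h in hits if tfn_checksum_ok(h)]
    return hits


print(find_tfns("TFN: 123 456 789."))                         # ['123 456 789']
print(find_tfns("TFN: 123 456 789.", require_checksum=True))  # []
print(find_tfns("TFN: 123 456 782.", require_checksum=True))  # ['123 456 782']
```

The placeholder TFN used in this document's examples (123 456 789) matches the format but fails the checksum, which is why a format-only recognizer is the more forgiving choice for anonymization: it errs toward redacting too much.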
Insurance-Specific Entities
- INSURANCE_POLICY_NUMBER (Policy identifiers)
- INSURANCE_CLAIM_NUMBER (Claim reference numbers)
- INSURANCE_MEMBER_NUMBER (Membership numbers)
- INSURANCE_GROUP_NUMBER (Group policy identifiers)
- VEHICLE_IDENTIFIER (VIN, chassis, license plate)
- CASE_REFERENCE (Internal case references)
General PII Entities
- PERSON (Person names)
- EMAIL_ADDRESS (Email addresses)
- PHONE_NUMBER (Generic phone numbers)
- CREDIT_CARD (Credit card numbers)
- DATE (Calendar dates)
- URL (Web addresses)
- LOCATION (Physical locations)
- ORGANIZATION (Company and organization names)
- MONETARY_VALUE (Currency amounts)
- ID (Generic identification numbers)
Anonymization Operations
The package supports several anonymization operators:
- replace: Replace with the entity type (e.g., <PERSON>)
- redact: Fully redact the entity (e.g., [REDACTED])
- mask: Partially mask while preserving structure (e.g., j***@e****.com)
- hash: Replace with a consistent hash value
- encrypt: Encrypt with a key (recoverable)
- age_bracket: Convert dates of birth to age brackets (e.g., 40-44)
- custom: Define your own replacement function
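To show how these operators differ in what they preserve, here is a small standard-library sketch of four of them. The function names, the dispatch table, the 12-character hash truncation, and the exact masking rule are assumptions for illustration; the library's own output may differ in detail. The replace form follows the angle-bracket style seen in the example outputs elsewhere in this document.

```python
import hashlib


def replace_op(entity_type: str, value: str) -> str:
    """Swap the value for its entity type label."""
    return f"<{entity_type}>"


def redact_op(entity_type: str, value: str) -> str:
    """Drop the value entirely, leaving a fixed marker."""
    return "[REDACTED]"


def _keep_first(part: str) -> str:
    """Keep the first character and star out the rest."""
    return part[:1] + "*" * (len(part) - 1) if part else part


def mask_op(entity_type: str, value: str) -> str:
    """Mask an email-like value but keep its shape and top-level domain."""
    local, at, domain = value.partition("@")
    if not at:
        return _keep_first(value)
    name, dot, tld = domain.rpartition(".")
    if not dot:
        return f"{_keep_first(local)}@{_keep_first(domain)}"
    return f"{_keep_first(local)}@{_keep_first(name)}.{tld}"


def hash_op(entity_type: str, value: str) -> str:
    """Same input always yields the same token, so records can still be joined."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


OPERATORS = {"replace": replace_op, "redact": redact_op, "mask": mask_op, "hash": hash_op}


def anonymize_value(entity_type: str, value: str, operators: dict) -> str:
    """Apply the configured operator for this entity type, defaulting to replace."""
    name = operators.get(entity_type, "replace")
    return OPERATORS[name](entity_type, value)


ops = {"EMAIL_ADDRESS": "mask", "AU_TFN": "redact"}
print(anonymize_value("EMAIL_ADDRESS", "jane@example.com", ops))  # j***@e******.com
print(anonymize_value("AU_TFN", "123 456 789", ops))              # [REDACTED]
print(anonymize_value("PERSON", "Jane Doe", ops))                 # <PERSON>
```

An unsalted hash of a low-entropy value such as a phone number can be reversed by brute force, so a keyed hash (for example `hmac` with a secret) is the safer choice when consistency across records is needed.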
Key Features Explained
Age Bracketing
The age bracketing feature converts dates of birth to age brackets for enhanced privacy:
from allyanonimiser import create_allyanonimiser
ally = create_allyanonimiser()
result = ally.anonymize(
    text="Patient: John Smith, DOB: 15/03/1980, Medicare: 1234 56789 0",
    operators={"DATE_OF_BIRTH": "age_bracket"},
    age_bracket_size=5  # Default is 5 years
)
print(result["text"])
# Output: "Patient: <PERSON>, DOB: 40-44, Medicare: <AU_MEDICARE>"
- Supports multiple date formats (DD/MM/YYYY, YYYY-MM-DD, etc.)
- Handles direct age mentions (e.g., "Age: 42")
- Customize bracket sizes (5, 10 years, etc.)
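The arithmetic behind a bracket such as 40-44 can be sketched in a few lines of standard-library Python. The helpers `parse_dob`, `age_on`, and `age_bracket` are hypothetical names for this example rather than library functions, and the reference date is fixed so the result is reproducible; in real use the bracket depends on the date the text is processed.

```python
from datetime import date, datetime

# Two of the formats mentioned above; extend the tuple to accept more.
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d")


def parse_dob(text: str) -> date:
    """Try each known format in turn and return the first successful parse."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


def age_on(dob: date, today: date) -> int:
    """Age in whole years, one less if this year's birthday has not happened yet."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def age_bracket(age: int, size: int = 5) -> str:
    """Map an age to an inclusive bracket such as '40-44'."""
    lower = (age // size) * size
    return f"{lower}-{lower + size - 1}"


today = date(2024, 6, 1)  # fixed reference date for a reproducible example
dob = parse_dob("15/03/1980")
print(age_bracket(age_on(dob, today)))           # 40-44
print(age_bracket(age_on(dob, today), size=10))  # 40-49
print(age_bracket(42))                           # 40-44 (direct age mention)
```

Wider brackets give stronger privacy at the cost of analytical detail, which is the trade-off the bracket size setting controls.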
Postcode Preservation
Preserve postcodes when anonymizing addresses to maintain geographical context:
from allyanonimiser import create_allyanonimiser
ally = create_allyanonimiser()
result = ally.anonymize(
    text="Customer lives at 42 Main St, Sydney NSW 2000.",
    keep_postcode=True  # Default is True
)
print(result["text"])
# Output: "Customer lives at <AU_ADDRESS> 2000."
- Maintains geographic context while anonymizing specific addresses
- Works with various address formats
- Can be enabled/disabled as needed (enabled by default)
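One way to picture postcode preservation is a regex that captures the postcode separately from the rest of the address and writes it back after the placeholder. The pattern below is a deliberately narrow illustration (a handful of street types, state abbreviations, and a four-digit postcode) and `anonymize_address` is a hypothetical helper, not the library's address recognizer.

```python
import re

# Illustrative pattern only: street number, street name, street type, suburb,
# state abbreviation, then a 4-digit postcode captured in its own group.
ADDRESS_RE = re.compile(
    r"\d+\s+[A-Za-z ]+?\s(?:St|Street|Rd|Road|Ave|Avenue)\b,?\s+"
    r"[A-Za-z ]+?\s+(?:NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\s+(?P<postcode>\d{4})"
)


def anonymize_address(text: str, keep_postcode: bool = True) -> str:
    """Replace each matched address, optionally re-appending its postcode."""
    def _replace(match: re.Match) -> str:
        if keep_postcode:
            return f"<AU_ADDRESS> {match.group('postcode')}"
        return "<AU_ADDRESS>"

    return ADDRESS_RE.sub(_replace, text)


sample = "Customer lives at 42 Main St, Sydney NSW 2000."
print(anonymize_address(sample))                       # Customer lives at <AU_ADDRESS> 2000.
print(anonymize_address(sample, keep_postcode=False))  # Customer lives at <AU_ADDRESS>.
```

Keeping the postcode retains coarse location for analysis, but in sparsely populated postcodes it can narrow down individuals, so disabling it is reasonable for small datasets.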
Creating and Managing Patterns
Customize the library to detect organization-specific entities and control which patterns are active:
from allyanonimiser import create_allyanonimiser
ally = create_allyanonimiser()
# Method 1: Add a custom regex pattern directly
ally.add_pattern({
    "entity_type": "CUSTOMER_ID",
    "patterns": [r"C-\d{5}-[A-Z]{2}"],
    "context": ["customer", "id", "identifier", "number"],
    "name": "Customer ID Pattern"
})
# Method 2: Generate a pattern from examples
examples = ["PRJ-2023-001", "PRJ-2023-002", "PRJ-2023-003"]
ally.create_pattern_from_examples(
    entity_type="PROJECT_CODE",
    examples=examples,
    context=["project", "code"],
    generalization_level="medium"  # Controls pattern flexibility
)
# Method 3: Import patterns from a CSV file
ally.import_patterns_from_csv(
    csv_path="patterns/customer_patterns.csv",
    entity_column="entity_type",
    pattern_column="regex"
)
# Control which patterns are active for a specific analysis
# Get all available patterns first
available_entities = ally.get_available_entity_types()
print(f"Available patterns: {list(available_entities.keys())}")
# Selectively enable only certain patterns
result = ally.analyze(
    text="Customer ID: C-12345-AB, Project: PRJ-2023-001, Medicare: 1234 56789 0",
    active_entity_types=["CUSTOMER_ID", "PROJECT_CODE"]  # Only these will be detected
)
# Save all patterns for reuse
ally.export_config("patterns_repository.json")
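For intuition about Method 2, the sketch below derives a regex from examples by mapping each character to a class (digit, uppercase letter, lowercase letter, or literal) and collapsing runs into counted groups. This is one simple generalization strategy written for illustration; it is not the algorithm behind `create_pattern_from_examples`, and how the library's generalization levels map to regex flexibility is not covered here.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a single character to a regex fragment."""
    if ch.isdigit():
        return r"\d"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "a" <= ch <= "z":
        return "[a-z]"
    return re.escape(ch)  # punctuation and anything else stays literal


def pattern_from_example(example: str) -> str:
    """Collapse runs of the same class into counted groups, e.g. three digits -> \\d{3}."""
    parts = []
    for cls, run in groupby(char_class(ch) for ch in example):
        count = len(list(run))
        parts.append(cls if count == 1 else f"{cls}{{{count}}}")
    return "".join(parts)


def patterns_from_examples(examples):
    """Return the distinct patterns needed to cover every example."""
    return sorted({pattern_from_example(e) for e in examples})


examples = ["PRJ-2023-001", "PRJ-2023-002", "PRJ-2023-003"]
patterns = patterns_from_examples(examples)
print(patterns)
print(bool(re.fullmatch(patterns[0], "PRJ-2024-117")))  # True: same shape, new values
print(bool(re.fullmatch(patterns[0], "PRJ-24-1")))      # False: different shape
```

All three examples share one shape, so a single pattern covers them. A looser strategy could replace exact counts with ranges, and a stricter one could keep constant prefixes such as PRJ as literals.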
Running Tests
The package includes comprehensive tests:
# Run all tests
pytest
# Run specific test modules
pytest tests/test_analyzer.py tests/test_anonymizer.py
# Run with coverage report
pytest --cov=allyanonimiser
Test Coverage
Test coverage focuses on several key areas:
- Factory Functions: Tests that factory functions like create_analyzer work correctly
- Entity Detection: Tests for accurate detection of all entity types
- Anonymization: Tests for proper text replacement with different operators
- Extensibility: Tests for adding custom patterns and entity types
- Edge Cases: Tests for handling of boundary conditions and unusual inputs
Contributing
Contributions are welcome! Please check the Contributing Guide for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.