Australian-focused PII detection and anonymization for the insurance industry

These details have not been verified by PyPI

Project links

Project description

Allyanonimiser

Australian-focused PII detection and anonymization for the insurance industry.

Quick Start

from allyanonimiser import create_allyanonimiser

# Create the Allyanonimiser instance with default settings
ally = create_allyanonimiser()

# Analyze text
results = ally.analyze(
    text="Please reference your policy AU-12345678 for claims related to your vehicle rego XYZ123."
)

# Print results
for result in results:
    print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")

DataFrame Processing

import pandas as pd
from allyanonimiser import create_allyanonimiser

# Create a DataFrame with potentially sensitive data
df = pd.DataFrame({
    'id': range(1, 5),
    'name': ['John Smith', 'Jane Doe', 'Bob Johnson', 'Alice Williams'],
    'note': [
        'Customer called about policy POL123456. DOB: 15/07/1982.',
        'Email from jane@example.com about claim CL789012. Lives at 10 Queen St, Sydney NSW 2000.',
        'Medicare number 2123 45678 1 received for claim. Patient born 1990-03-22.',
        'TFN: 123 456 789. Contact at 42 Example St, Melbourne VIC 3000. Age: 45.'
    ]
})

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Process a DataFrame (anonymize sensitive data)
anonymized_df = ally.anonymize_dataframe(
    df, 
    'note',
    operators={
        'PERSON': 'replace',
        'EMAIL_ADDRESS': 'mask',
        'INSURANCE_POLICY_NUMBER': 'redact',
        'INSURANCE_CLAIM_NUMBER': 'redact',
        'DATE_OF_BIRTH': 'age_bracket'  # Convert DOBs to age brackets
    },
    active_entity_types=[   # Only detect these entity types
        'PERSON', 'EMAIL_ADDRESS', 'INSURANCE_POLICY_NUMBER', 
        'INSURANCE_CLAIM_NUMBER', 'DATE_OF_BIRTH', 'AU_ADDRESS',
        'AU_POSTCODE', 'AU_MEDICARE', 'AU_TFN'
    ],
    age_bracket_size=5,     # Use 5-year brackets (default)
    keep_postcode=True      # Preserve postcodes in addresses (default)
)

Configuration and Pattern Management

from allyanonimiser import create_allyanonimiser

# Create with default settings
ally = create_allyanonimiser()

# Add a custom pattern
ally.add_pattern({
    "entity_type": "INTERNAL_REFERENCE",
    "patterns": [r"REF-\d{5}", r"Reference:\s*([A-Z0-9-]+)"],
    "context": ["internal", "reference", "ref"],
    "name": "Internal Reference Number"
})

# Save configuration for reuse by team members
ally.export_config("company_config.json")

Sharing and Controlling Configurations

from allyanonimiser import create_allyanonimiser

# Load a shared configuration
ally = create_allyanonimiser(settings_path="company_config.json")

# Enable only specific entity types for this session
active_entities = ["PERSON", "INTERNAL_REFERENCE", "EMAIL_ADDRESS"]
result = ally.anonymize(
    text="Reference: REF-12345 submitted by John Smith (john.smith@example.com)",
    operators={"PERSON": "redact"},
    active_entity_types=active_entities  # Only these entities will be detected
)

print(result["text"])
# Output: "Reference: <INTERNAL_REFERENCE> submitted by [REDACTED] ([REDACTED])"

# Get available entity types to selectively enable/disable
available_entities = ally.get_available_entity_types()
print(f"Available entity types: {list(available_entities.keys())}")

# Update an existing configuration with new patterns
ally.add_pattern({
    "entity_type": "PROJECT_CODE",
    "patterns": [r"PROJ-\d{4}"],
    "name": "Project Code"
})

# Export as a new version
ally.export_config("company_config_v2.json")

Simplified API (v1.1.0)

Version 1.1.0 introduces a simplified API with configuration objects and unified interface methods:

from allyanonimiser import create_allyanonimiser, AnalysisConfig, AnonymizationConfig

# Create instance
ally = create_allyanonimiser()

# Use configuration objects for cleaner parameter organization
analysis_config = AnalysisConfig(
    active_entity_types=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    min_score_threshold=0.8,
    expand_acronyms=True
)

anonymization_config = AnonymizationConfig(
    operators={
        "PERSON": "replace",
        "EMAIL_ADDRESS": "mask", 
        "PHONE_NUMBER": "redact"
    },
    age_bracket_size=10,
    keep_postcode=True
)

# Process text with configuration objects
result = ally.process(
    text="Customer John Smith (john.smith@example.com) called from 0412 345 678.",
    analysis_config=analysis_config,
    anonymization_config=anonymization_config
)

# Unified pattern management interface
ally.manage_patterns(
    action="create_from_examples",
    entity_type="CUSTOMER_ID",
    examples=["CID-12345", "CID-67890"],
    generalization_level="medium"
)

# Unified acronym management interface
ally.manage_acronyms(
    action="add",
    data={"POL": "Policy", "CL": "Claim", "DOB": "Date of Birth"}
)

# Unified DataFrame processing interface
df_result = ally.process_dataframe(
    df=my_dataframe,
    column="notes",
    operation="anonymize",
    analysis_config=analysis_config,
    anonymization_config=anonymization_config
)

See example_simplified_api.py for a complete demonstration of the simplified API.

Features

Australian-Specific PII Detection: Specialized recognizers for Australian TFNs, Medicare numbers, driver's licenses, and other Australian-specific identifiers.
Insurance Industry Focus: Recognition of policy numbers, claim references, vehicle identifiers, and other insurance-specific data.
Long Text Processing: Optimized for processing lengthy free-text fields like claim notes, medical reports, and emails.
Custom Pattern Creation: Easy creation of custom entity recognizers for organization-specific data.
Synthetic Data Generation: Generate realistic Australian test data for validation.
LLM Integration: Use Language Models to create challenging datasets for testing.
Extensible Architecture: Built on Presidio and spaCy with a modular, extensible design.

Version History

Version 1.1.0 - API Simplification

This version introduces a simplified, more consistent API while maintaining backward compatibility.

Key Features

Unified Interface Methods:
- Added manage_acronyms(action, data, ...) to replace multiple acronym methods
- Added manage_patterns(action, data, ...) to replace multiple pattern methods
- Added unified process_dataframe(operation, ...) to consolidate DataFrame methods
- Reduced total API method count while preserving all functionality
Configuration Objects:
- Added AnalysisConfig for grouping analysis parameters
- Added AnonymizationConfig for grouping anonymization parameters
- Improved parameter organization and reduced parameter count in method signatures
- Enhanced readability and maintainability
Improved Developer Experience:
- New example script (example_simplified_api.py) demonstrating the simplified API
- Maintained backward compatibility with deprecated method support
- Comprehensive docstrings for all new methods
- Consolidated parameter handling for better code organization

Version 1.0.0 - Simplified API and Enhanced Features

This version introduces a simplified API with cleaner function names and adds powerful new features for more flexible anonymization. This is a breaking change from previous versions.

Key Updates

Simplified API:
- Removed legacy analyzer factory functions (create_au_analyzer, create_insurance_analyzer, create_au_insurance_analyzer)
- Added cleaner create_analyzer() function as the main entry point
- Streamlined and simplified internal implementation
New Features:
- Address Postcode Preservation: Preserve postcodes when anonymizing addresses with the keep_postcode parameter
- Age Bracketing: Convert dates of birth to age brackets with configurable bracket sizes using the age_bracket operator
- Improved error handling with informative error messages
- Enhanced input validation for more robust operation
Major Version Upgrade:
- Increased major version number to 1.0.0 to indicate production readiness
- Removed backward compatibility with 0.x versions
- Improved documentation and examples

Version 0.3.3 - Python 3.10+ Compatibility

This version updates the package to ensure compatibility with Python 3.10 and newer versions, addressing dependency changes and improving CI/CD workflows.

Key Updates

Python Version Requirements:
- Updated minimum Python version to 3.10+
- Added support for Python 3.11 and 3.12
- Removed support for Python 3.8 and 3.9 due to dependency requirements
Improved Compatibility:
- Fixed circular import issues in the insurance module
- Enhanced batch processing capabilities
- Addressed issues with newer NumPy (2.0+) requirements
CI/CD Improvements:
- Updated GitHub Actions workflows to test on Python 3.10-3.12
- Fixed failing tests and improved test coverage
- Enhanced build process for PyPI deployment

Version 0.3.2 - Configuration Sharing and PyArrow Integration

This version adds functionality to export configuration settings to shareable files and improves DataFrame performance with optional PyArrow integration.

Key Features

Configuration Export/Import:

Export settings to JSON or YAML files for sharing with team members
Load exported configuration in new instances
Include optional metadata for documentation
Seamless integration with existing settings management

from allyanonimiser import create_allyanonimiser

# Export configuration to share with team members
ally = create_allyanonimiser()
ally.set_acronym_dictionary({'POL': 'Policy', 'CL': 'Claim'})
ally.settings_manager.set_entity_types(['PERSON', 'EMAIL_ADDRESS'])
ally.export_config("my_config.json")

# Team members can load the configuration
shared_ally = create_allyanonimiser(settings_path="my_config.json")

PyArrow Integration:
- Optional PyArrow support for improved DataFrame performance
- Graceful fallback when PyArrow isn't available
- Configurable through settings with sensible defaults
- Enhanced DataFrame processing speed for large datasets

Version 0.3.1 - DataFrame Processing and Batch Operations

This version adds comprehensive DataFrame processing capabilities, enabling efficient analysis and anonymization of data stored in pandas DataFrames.

Key Features

DataFrame Processing:
- Process pandas DataFrames with optimized memory usage
- Batch operations for efficient handling of large datasets
- Parallel processing for improved performance
- Progress tracking for long-running operations
Anonymization and Analysis:
- Anonymize specific columns or entire DataFrames
- Extract entities from text columns
- Generate statistical reports on detected entities
- Configure entity types and anonymization operators

Version 0.3.0 - Custom Pattern Support and Pattern Management

This version adds comprehensive custom pattern creation and management capabilities, enabling users to define, test, save, and load their own PII detection patterns.

Key Features

Custom Pattern Creation:
- Create and manage custom pattern definitions
- Generate patterns from examples with different generalization levels
- Save and load pattern definitions for reuse
- Add patterns directly to analyzers or pattern registries
Pattern Testing and Verification:
- Test patterns against positive and negative examples
- Validate pattern structure and components
- Get diagnostic information for pattern matching
- Identify and fix issues in pattern definitions

Installation

pip install allyanonimiser==1.1.0

Prerequisites

Python 3.10 or higher
For optimal performance, also install a spaCy model:
```
python -m spacy download en_core_web_lg
```

Usage Examples

Direct Analyzer Usage

from allyanonimiser import create_analyzer, EnhancedAnalyzer

# Use the factory function for a pre-configured analyzer
analyzer = create_analyzer()

# Analyze some text
results = analyzer.analyze(
    text="My policy number is POL-123456. My claim reference is CLM/2023/78901.",
    language="en"
)

# Print the results
for result in results:
    print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")

Text Anonymization

from allyanonimiser import create_allyanonimiser

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Analyze and anonymize text
result = ally.anonymize(
    text="Please contact John Smith at john.smith@example.com regarding TFN: 123 456 789.",
    operators={
        "PERSON": "replace",  # Replace with [PERSON]
        "EMAIL_ADDRESS": "mask",  # Apply partial masking: j***@e******.com
        "AU_TFN": "redact",  # Fully redact the TFN: [REDACTED]
    }
)

print(result["text"])
# Output: "Please contact [PERSON] at j***@e******.com regarding TFN: [REDACTED]."

Advanced Features

from allyanonimiser import create_allyanonimiser

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Preserve postcodes in addresses
result = ally.anonymize(
    text="The customer lives at 42 Main St, Sydney NSW 2000.",
    keep_postcode=True  # Default is True
)

print(result["text"])
# Output: "The customer lives at <AU_ADDRESS> 2000."

# Convert dates of birth to age brackets
result = ally.anonymize(
    text="Patient: Jane Doe, DOB: 15/03/1980, Medicare: 1234 56789 0",
    operators={
        "DATE_OF_BIRTH": "age_bracket",  # Convert to 5-year brackets by default
        "PERSON": "replace"
    }
)

print(result["text"])
# Output: "Patient: <PERSON>, DOB: 40-44, Medicare: <AU_MEDICARE>"

# Process multiple files from a directory
results = ally.process_files(
    file_paths=["claim1.txt", "claim2.txt"],
    output_dir="anonymized_output/",
    operators={"DATE_OF_BIRTH": "age_bracket", "AU_TFN": "redact"},
    age_bracket_size=10
)

Supported Entity Types

The package includes recognition for:

Australian-Specific Entities

AU_TFN (Tax File Numbers)
AU_MEDICARE (Medicare card numbers)
AU_PASSPORT (Australian passport numbers)
AU_DRIVERS_LICENSE (Australian drivers license numbers)
AU_ABN (Australian Business Numbers)
AU_ACN (Australian Company Numbers)
AU_ADDRESS (Australian street addresses)
AU_POSTCODE (Australian postcodes)
AU_PHONE (Australian phone numbers)
AU_REGO (Vehicle registration numbers)

Insurance-Specific Entities

INSURANCE_POLICY_NUMBER (Policy identifiers)
INSURANCE_CLAIM_NUMBER (Claim reference numbers)
INSURANCE_MEMBER_NUMBER (Membership numbers)
INSURANCE_GROUP_NUMBER (Group policy identifiers)
VEHICLE_IDENTIFIER (VIN, chassis, license plate)
CASE_REFERENCE (Internal case references)

General PII Entities

PERSON (Person names)
EMAIL_ADDRESS (Email addresses)
PHONE_NUMBER (Generic phone numbers)
CREDIT_CARD (Credit card numbers)
DATE (Calendar dates)
URL (Web addresses)
LOCATION (Physical locations)
ORGANIZATION (Company and organization names)
MONETARY_VALUE (Currency amounts)
ID (Generic identification numbers)

Anonymization Operations

The package supports several anonymization operators:

replace: Replace with entity type (e.g., [PERSON])
redact: Fully redact the entity (e.g., [REDACTED])
mask: Partially mask while preserving structure (e.g., j***@e****.com)
hash: Replace with a consistent hash value
encrypt: Encrypt with a key (recoverable)
age_bracket: Convert dates of birth to age brackets (e.g., 40-44)
custom: Define your own replacement function

Key Features Explained

Age Bracketing

The age bracketing feature converts dates of birth to age brackets for enhanced privacy:

from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()
result = ally.anonymize(
    text="Patient: John Smith, DOB: 15/03/1980, Medicare: 1234 56789 0",
    operators={"DATE_OF_BIRTH": "age_bracket"},
    age_bracket_size=5  # Default is 5 years
)

print(result["text"])
# Output: "Patient: <PERSON>, DOB: 40-44, Medicare: <AU_MEDICARE>"

Supports multiple date formats (DD/MM/YYYY, YYYY-MM-DD, etc.)
Handles direct age mentions (e.g., "Age: 42")
Customize bracket sizes (5, 10 years, etc.)

Postcode Preservation

Preserve postcodes when anonymizing addresses to maintain geographical context:

from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()
result = ally.anonymize(
    text="Customer lives at 42 Main St, Sydney NSW 2000.",
    keep_postcode=True  # Default is True
)

print(result["text"])
# Output: "Customer lives at <AU_ADDRESS> 2000."

Maintains geographic context while anonymizing specific addresses
Works with various address formats
Can be enabled/disabled as needed (enabled by default)

Creating and Managing Patterns

Customize the library to detect organization-specific entities and control which patterns are active:

from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()

# Method 1: Add a custom regex pattern directly
ally.add_pattern({
    "entity_type": "CUSTOMER_ID",
    "patterns": [r"C-\d{5}-[A-Z]{2}"],
    "context": ["customer", "id", "identifier", "number"],
    "name": "Customer ID Pattern"
})

# Method 2: Generate a pattern from examples
examples = ["PRJ-2023-001", "PRJ-2023-002", "PRJ-2023-003"]
ally.create_pattern_from_examples(
    entity_type="PROJECT_CODE",
    examples=examples,
    context=["project", "code"],
    generalization_level="medium"  # Controls pattern flexibility
)

# Method 3: Import patterns from a CSV file
ally.import_patterns_from_csv(
    csv_path="patterns/customer_patterns.csv",
    entity_column="entity_type",
    pattern_column="regex"
)

# Control which patterns are active for a specific analysis
# Get all available patterns first
available_entities = ally.get_available_entity_types()
print(f"Available patterns: {list(available_entities.keys())}")

# Selectively enable only certain patterns
result = ally.analyze(
    text="Customer ID: C-12345-AB, Project: PRJ-2023-001, Medicare: 1234 56789 0",
    active_entity_types=["CUSTOMER_ID", "PROJECT_CODE"]  # Only these will be detected
)

# Save all patterns for reuse
ally.export_config("patterns_repository.json")

Running Tests

The package includes comprehensive tests:

# Run all tests
pytest

# Run specific test modules
pytest tests/test_analyzer.py tests/test_anonymizer.py

# Run with coverage report
pytest --cov=allyanonimiser

Test Coverage

Test coverage focuses on several key areas:

Factory Functions: Tests that the factory functions like create_analyzer work correctly
Entity Detection: Tests for accurate detection of all entity types
Anonymization: Tests for proper text replacement with different operators
Extensibility: Tests for adding custom patterns and entity types
Edge Cases: Tests for handling of boundary conditions and unusual inputs

Contributing

Contributions are welcome! Please check the Contributing Guide for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

3.3.0

Apr 15, 2026

3.2.0

Apr 15, 2026

3.1.2

Apr 15, 2026

3.1.1 yanked

Apr 15, 2026

Reason this release was yanked:

deprecated code included in release

2.5.0

Aug 14, 2025

2.4.0

Jul 25, 2025

2.3.0

Jul 23, 2025

2.2.0

Jul 21, 2025

2.1.0

Mar 3, 2025

2.0.0

Mar 3, 2025

1.2.0

Mar 3, 2025

This version

1.1.0

Mar 3, 2025

0.3.3

Mar 3, 2025

0.3.2

Feb 28, 2025

0.2.1

Feb 28, 2025

0.2.0

Feb 28, 2025

0.1.9

Feb 28, 2025

0.1.8

Feb 28, 2025

0.1.7

Feb 28, 2025

0.1.6

Feb 28, 2025

0.1.5

Feb 28, 2025

0.1.4

Feb 27, 2025

0.1.3

Feb 27, 2025

0.1.2

Feb 27, 2025

0.1.1

Feb 27, 2025

0.1.0

Feb 27, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

allyanonimiser-1.1.0.tar.gz (159.7 kB view details)

Uploaded Mar 3, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

allyanonimiser-1.1.0-py3-none-any.whl (181.3 kB view details)

Uploaded Mar 3, 2025 Python 3

File details

Details for the file allyanonimiser-1.1.0.tar.gz.

File metadata

Download URL: allyanonimiser-1.1.0.tar.gz
Upload date: Mar 3, 2025
Size: 159.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for allyanonimiser-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6a99408ac3beca1551af80adaae05361cb079338dd15e4d488ea4ad234ed0af0`
MD5	`2203d45bcbd8eb690a67ab23e728e12e`
BLAKE2b-256	`86dd6554cabb0516c00cb86db2cb576d15a2c061a44843bd7812f9fc3b3a2118`

See more details on using hashes here.

File details

Details for the file allyanonimiser-1.1.0-py3-none-any.whl.

File metadata

Download URL: allyanonimiser-1.1.0-py3-none-any.whl
Upload date: Mar 3, 2025
Size: 181.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for allyanonimiser-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ad081c2be41582d90243519d35821354382c8db6bccbd396ac5103a02e8d91b3`
MD5	`f96c43e20e76449d5341cfed1d4de32e`
BLAKE2b-256	`6dd97a4b89fb4d30c511a8981112167ee51d4e87c99b14b74b39ada0f33b8639`

See more details on using hashes here.

allyanonimiser 1.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Allyanonimiser

Quick Start

DataFrame Processing

Configuration and Pattern Management

Sharing and Controlling Configurations

Simplified API (v1.1.0)

Features

Version History

Version 1.1.0 - API Simplification

Key Features

Version 1.0.0 - Simplified API and Enhanced Features

Key Updates

Version 0.3.3 - Python 3.10+ Compatibility

Key Updates

Version 0.3.2 - Configuration Sharing and PyArrow Integration

Key Features

Version 0.3.1 - DataFrame Processing and Batch Operations

Key Features

Version 0.3.0 - Custom Pattern Support and Pattern Management

Key Features

Installation

Prerequisites

Usage Examples

Direct Analyzer Usage

Text Anonymization

Advanced Features

Supported Entity Types

Australian-Specific Entities

Insurance-Specific Entities

General PII Entities

Anonymization Operations

Key Features Explained

Age Bracketing

Postcode Preservation

Creating and Managing Patterns

Running Tests

Test Coverage

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes