High-performance postal address parser and normalizer using libpostal with Rust bindings

oxidize-postal

Python bindings for libpostal address parsing with improved performance and installation experience.

oxidize-postal provides the same address parsing capabilities as pypostal but addresses key limitations: it installs without C compilation, releases the Python GIL for true parallel processing, and offers a cleaner API. Built using Rust and libpostal-rust bindings to the libpostal C library.

Key Improvements Over pypostal

Feature              oxidize-postal                       pypostal
Installation         pip install with pre-built wheels    Requires C compilation, system dependencies
Parallel Processing  GIL released, true multithreading    GIL blocks concurrent parsing
API Design           Single module, consistent naming     Multiple imports, scattered functions
Error Handling       Structured errors with context       Basic exception messages
Platform Support     Cross-platform wheels                Complex Windows build process

Core Functionality

  • Address Parsing: Extract components (street, city, state, postal code, etc.) from address strings
  • Address Expansion: Generate normalized variations with abbreviations expanded (St. → Street)
  • Address Normalization: Standardize address formatting and component ordering
  • International Support: Handles addresses worldwide with Unicode and multiple scripts

Installation

pip install oxidize-postal

# Download language model data (one-time setup)
python -c "import oxidize_postal; oxidize_postal.download_data()"

Usage

Basic Address Parsing

import oxidize_postal

# Parse an address into components
address = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
parsed = oxidize_postal.parse_address(address)
print(parsed)
# Output: {'house_number': '781', 'road': 'franklin ave', 'suburb': 'crown heights', 
#          'city': 'brooklyn', 'state': 'ny', 'postcode': '11216', 'country': 'usa'}

# Get parsed address as JSON string
json_result = oxidize_postal.parse_address_to_json(address)
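
Because the result is a plain dictionary, downstream code can pick out fields with ordinary dict operations. The following is a minimal sketch that assumes the dictionary shape shown in the example output above. `format_label` is a hypothetical helper, not part of oxidize-postal, and it skips any component that is missing from the result.

```python
# Hypothetical helper for illustration; not part of oxidize-postal.
def format_label(parsed: dict) -> str:
    """Build a short label (street line, then locality line) from parsed components."""
    street = " ".join(parsed[k] for k in ("house_number", "road") if k in parsed)
    locality = " ".join(parsed[k] for k in ("city", "state", "postcode") if k in parsed)
    # Drop whichever line ended up empty.
    return "\n".join(line for line in (street, locality) if line)


# Sample value copied from the example output above.
parsed = {
    "house_number": "781", "road": "franklin ave", "suburb": "crown heights",
    "city": "brooklyn", "state": "ny", "postcode": "11216", "country": "usa",
}
label = format_label(parsed)
print(label)
```

In real use, `parsed` would come from `oxidize_postal.parse_address(address)`. Which keys appear depends on the input address.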

Address Expansion

# Expand address abbreviations
address = "123 Main St NYC NY"
expansions = oxidize_postal.expand_address(address)
print(expansions)
# Output: ['123 main street nyc new york', '123 main street nyc ny', ...]

# Get expansions as JSON
json_expansions = oxidize_postal.expand_address_to_json(address)
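
Expansions are commonly used for duplicate detection: two address strings are treated as candidates for the same place if their expansion lists share at least one string. The sketch below shows that idea. `likely_same_address` is a hypothetical helper, and `toy_expand` is a crude stand-in (a two-entry abbreviation table) used only so the example runs without libpostal data; in real use, pass `oxidize_postal.expand_address` instead.

```python
from typing import Callable, List


# Hypothetical helper for illustration; not part of oxidize-postal.
def likely_same_address(a: str, b: str, expand: Callable[[str], List[str]]) -> bool:
    """Return True if the two addresses share at least one expansion."""
    return bool(set(expand(a)) & set(expand(b)))


# Stand-in expander for demonstration only; far cruder than libpostal.
def toy_expand(address: str) -> List[str]:
    abbreviations = {"st": "street", "ave": "avenue"}
    tokens = address.lower().replace(".", "").split()
    return [" ".join(abbreviations.get(token, token) for token in tokens)]


same = likely_same_address("123 Main St.", "123 main street", toy_expand)
different = likely_same_address("123 Main St.", "456 Oak Ave", toy_expand)
print(same, different)
```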

Parallel Processing & Performance

A key advantage of oxidize-postal over pypostal is that parsing calls release the GIL, so other Python threads can run while an address is being parsed. How much this helps depends on the workload.

When Parallel Processing Helps

Parallel processing provides the most benefit when combined with slower I/O operations:

Great for parallel processing:

import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
import requests

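# `db` and `records` are placeholders for your own database client and input records.

def process_customer_record(record):
    # Fetch from API (50-200ms); the timeout keeps a stalled request from blocking a worker
    customer = requests.get(f"https://api.example.com/customers/{record['id']}", timeout=10).json()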
    
    # Parse address (0.3ms) - GIL released so other threads can work
    parsed = oxidize_postal.parse_address(customer['address'])
    
    # Write to database (50-200ms)
    db.update(customer['id'], parsed)
    
    return parsed

# Process many records in parallel
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(process_customer_record, records))

Limited benefit for pure address parsing:

import oxidize_postal
from concurrent.futures import ThreadPoolExecutor

# Just parsing addresses without I/O
addresses = ["123 Main St", "456 Oak Ave"] * 100

# Parallel might even be slower due to thread overhead
with ThreadPoolExecutor() as executor:
    results = list(executor.map(oxidize_postal.parse_address, addresses))
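
Whether threads help for a given batch is easiest to settle by measuring. The sketch below times the same callable sequentially and through a thread pool. `time_sequential_and_threaded` is a hypothetical helper, and `stand_in_parse` is a placeholder so the example runs without libpostal data; substitute `oxidize_postal.parse_address` to measure the real parser. No timings are shown because they vary by machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Hypothetical helper for illustration; not part of oxidize-postal.
def time_sequential_and_threaded(func, items, max_workers=4):
    """Run func over items both ways; return (results, sequential_s, threaded_s)."""
    start = time.perf_counter()
    sequential_results = [func(item) for item in items]
    sequential_seconds = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        threaded_results = list(executor.map(func, items))
    threaded_seconds = time.perf_counter() - start

    # executor.map preserves input order, so both runs must agree.
    if sequential_results != threaded_results:
        raise AssertionError("threaded results differ from sequential results")
    return sequential_results, sequential_seconds, threaded_seconds


# Placeholder parser; replace with oxidize_postal.parse_address in real use.
def stand_in_parse(address):
    return {"road": address.lower()}


addresses = ["123 Main St", "456 Oak Ave"] * 100
results, sequential_seconds, threaded_seconds = time_sequential_and_threaded(
    stand_in_parse, addresses
)
print(f"sequential: {sequential_seconds:.4f}s, threaded: {threaded_seconds:.4f}s")
```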

Real-World Use Cases

Workloads where the GIL release helps:

  1. ETL Pipelines: Reading from databases/APIs, parsing, and writing back
  2. Stream Processing: Handling Kafka/Kinesis streams with address data
  3. Web Services: API endpoints that parse addresses alongside other operations
  4. File Processing: Reading large CSV/Parquet files, parsing addresses, writing results
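
As an illustration of the file-processing case, the sketch below reads a CSV, parses the address column in a thread pool, and writes one output column per parsed component. `parse_csv_addresses`, the `address` and `input` column names, and `stand_in_parse` are all assumptions made for the example; in real use, pass `oxidize_postal.parse_address` as `parse`. The sketch loads every row into memory, so a large file would need to be processed in chunks.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


# Hypothetical helper for illustration; not part of oxidize-postal.
def parse_csv_addresses(source, sink, parse, column="address", max_workers=8):
    """Parse one CSV column in a thread pool and write the components to sink."""
    rows = list(csv.DictReader(source))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        parsed_rows = list(executor.map(lambda row: parse(row[column]), rows))

    # Different addresses can yield different components, so collect every key.
    component_names = sorted({key for parsed in parsed_rows for key in parsed})
    writer = csv.DictWriter(sink, fieldnames=["input"] + component_names, restval="")
    writer.writeheader()
    for row, parsed in zip(rows, parsed_rows):
        writer.writerow({"input": row[column], **parsed})
    return len(rows)


# Placeholder parser; replace with oxidize_postal.parse_address in real use.
def stand_in_parse(address):
    number, _, rest = address.partition(" ")
    return {"house_number": number, "road": rest.lower()}


source = io.StringIO("id,address\n1,123 Main St\n2,456 Oak Ave\n")
sink = io.StringIO()
count = parse_csv_addresses(source, sink, stand_in_parse)
output_lines = sink.getvalue().splitlines()
print(output_lines)
```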

Threading vs Multiprocessing

Because oxidize-postal releases the GIL, threading is usually preferable to multiprocessing:

import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

addresses = ["123 Main St", "456 Oak Ave"] * 100

# The guard is needed on platforms where multiprocessing spawns fresh interpreters
if __name__ == "__main__":
    # Threading - Lower overhead, shared memory
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(oxidize_postal.parse_address, addresses))

    # Multiprocessing - Higher overhead due to serialization
    # Only use if you need true CPU parallelism for other operations
    with Pool(processes=8) as pool:
        results = pool.map(oxidize_postal.parse_address, addresses)

For pure address parsing of small batches (roughly under 5,000 to 20,000 addresses, depending on your machine), threading outperforms multiprocessing by 3-5x because it avoids process startup and serialization overhead.

API Reference

Core Functions

parse_address(address: str) -> dict

Parse an address string into its component parts.

Parameters:

  • address: The address string to parse

Returns:

  • Dictionary with keys like 'house_number', 'road', 'city', 'state', 'postcode', etc.

expand_address(address: str) -> list[str]

Generate normalized variations of an address.

Parameters:

  • address: The address string to expand

Returns:

  • List of expanded address strings

download_data(force: bool = False) -> bool

Download the libpostal data files.

Parameters:

  • force: If True, re-download even if data exists

Returns:

  • True if successful, False otherwise
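
Since `download_data` reports failure through its return value rather than an exception, setup scripts may want to turn a `False` result into an error. A minimal sketch, assuming only the documented signature above; `ensure_data` is a hypothetical helper and the lambdas are stand-ins so the example runs without downloading anything:

```python
# Hypothetical helper for illustration; not part of oxidize-postal.
def ensure_data(download, force=False):
    """Call a download function and raise if it reports failure."""
    if not download(force=force):
        raise RuntimeError("libpostal data download failed")


# Stand-ins for oxidize_postal.download_data, used only for demonstration.
ensure_data(lambda force=False: True)  # success: returns without raising

try:
    ensure_data(lambda force=False: False)
    failed = False
except RuntimeError:
    failed = True
print(failed)
```

In real use the call would be `ensure_data(oxidize_postal.download_data)`.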

Additional Functions

  • parse_address_to_json(address: str) -> str: Parse and return as JSON
  • expand_address_to_json(address: str) -> str: Expand and return as JSON
  • normalize_address(address: str) -> str: Normalize an address string
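
The JSON variants are convenient when results cross a process or network boundary, since the string can be stored or sent as-is and decoded later. The sketch below decodes such a string with the standard library. It assumes the JSON is an object with the same keys as the dictionary returned by `parse_address`; that shape is an assumption, not something documented here, so check the real output before relying on it. `postcode_from_json` and `sample_payload` are hypothetical.

```python
import json
from typing import Optional


# Hypothetical helper for illustration; not part of oxidize-postal.
def postcode_from_json(payload: str) -> Optional[str]:
    """Return the 'postcode' field from a JSON object, or None if absent."""
    components = json.loads(payload)
    if not isinstance(components, dict):
        # The assumed object shape did not hold.
        return None
    return components.get("postcode")


# Hand-written sample; the real output of parse_address_to_json may differ.
sample_payload = '{"house_number": "781", "road": "franklin ave", "postcode": "11216"}'
postcode = postcode_from_json(sample_payload)
print(postcode)
```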

Constants

The module provides various constants for address components:

import oxidize_postal

# Address component constants
oxidize_postal.ADDRESS_ANY
oxidize_postal.ADDRESS_NAME
oxidize_postal.ADDRESS_HOUSE_NUMBER
oxidize_postal.ADDRESS_STREET
oxidize_postal.ADDRESS_UNIT
oxidize_postal.ADDRESS_LEVEL
oxidize_postal.ADDRESS_POSTAL_CODE
# ... and more

Requirements

  • Python 3.9+
  • libpostal data files (~2GB, downloaded separately)
  • Rust toolchain (for building from source)

Project Structure

oxidize-postal/
├── oxidize-postal/         # Rust extension module
│   ├── src/
│   │   ├── lib.rs          # PyO3 module definition
│   │   └── postal/
│   │       ├── parser.rs   # Core parsing functions
│   │       ├── python_api.rs   # Python-exposed functions
│   │       ├── error.rs    # Error types
│   │       └── constants.rs    # libpostal constants
│   ├── Cargo.toml          # Rust dependencies
│   └── pyproject.toml      # Python package config
├── tests/
│   ├── fixtures/           # Sample addresses
│   ├── unit/               # Unit tests
│   ├── integration/        # End-to-end tests
│   └── performance/        # Benchmarking tests
├── main.py                 # Usage examples
├── data_manager.py         # libpostal data downloader
├── build.sh                # Build script
└── pyproject.toml          # Root package config

Architecture

  • Stack: Python → PyO3 → Rust → libpostal-rust → libpostal C library
  • GIL Release: All parsing operations release the Python GIL for true parallel processing
  • Error Handling: Rust errors are converted to Python exceptions (ValueError, RuntimeError)
  • Data Requirements: libpostal needs ~2GB of language model data (stored in /usr/local/share/libpostal)
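
Because failures surface as ValueError or RuntimeError, batch jobs can catch both and continue with the next record. A minimal sketch of that pattern follows. `parse_or_none` is a hypothetical helper, and `stand_in_parse` raises on empty input purely to exercise the handler; which inputs the real parser rejects is not specified here.

```python
from typing import Callable, Optional


# Hypothetical helper for illustration; not part of oxidize-postal.
def parse_or_none(parse: Callable[[str], dict], address: str) -> Optional[dict]:
    """Return the parsed address, or None if the parser raised."""
    try:
        return parse(address)
    except (ValueError, RuntimeError) as exc:
        print(f"could not parse {address!r}: {exc}")
        return None


# Placeholder parser; replace with oxidize_postal.parse_address in real use.
def stand_in_parse(address: str) -> dict:
    if not address.strip():
        raise ValueError("empty address")
    return {"road": address.lower()}


good = parse_or_none(stand_in_parse, "123 Main St")
bad = parse_or_none(stand_in_parse, "   ")
```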

Build Process

  1. maturin compiles the Rust extension with PyO3 bindings
  2. Links against libpostal-rust crate
  3. Produces a Python wheel with native extension
  4. No Python runtime dependencies required

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • libpostal - The core C library for address parsing
  • libpostal-rust - Rust bindings for libpostal
  • pypostal - The original Python bindings that inspired this project

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

oxidize_postal-0.1.2-cp312-cp312-macosx_11_0_arm64.whl (1.2 MB view details)

Uploaded: CPython 3.12, macOS 11.0+ ARM64

Hashes for oxidize_postal-0.1.2-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 475ec0e98c0770b17c4606adedaa00de8f06b42ef9b9b1c53f8cf90cb68964a9
MD5 f45813c91c3c24a720f62a8364eb4ae8
BLAKE2b-256 f45e070269819484c2611bbc22447ac741b98fbfc38e2dbc402eaa4a99a9aa5c
