High-performance postal address parser and normalizer using libpostal with Rust bindings
oxidize-postal
Python bindings for libpostal address parsing with improved performance and installation experience.
oxidize-postal provides the same address parsing capabilities as pypostal but addresses key limitations: it installs without C compilation, releases the Python GIL for true parallel processing, and offers a cleaner API. Built using Rust and libpostal-rust bindings to the libpostal C library.
Key Improvements Over pypostal
| Feature | oxidize-postal | pypostal |
|---|---|---|
| Installation | pip install with pre-built wheels | Requires C compilation, system dependencies |
| Parallel Processing | GIL released, true multithreading | GIL blocks concurrent parsing |
| API Design | Single module, consistent naming | Multiple imports, scattered functions |
| Error Handling | Structured errors with context | Basic exception messages |
| Platform Support | Cross-platform wheels | Complex Windows build process |
Core Functionality
- Address Parsing: Extract components (street, city, state, postal code, etc.) from address strings
- Address Expansion: Generate normalized variations with abbreviations expanded (St. → Street)
- Address Normalization: Standardize address formatting and component ordering
- International Support: Handles addresses worldwide with Unicode and multiple scripts
Installation
pip install oxidize-postal
# Download language model data (one-time setup)
python -c "import oxidize_postal; oxidize_postal.download_data()"
Usage
Basic Address Parsing
import oxidize_postal
# Parse an address into components
address = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
parsed = oxidize_postal.parse_address(address)
print(parsed)
# Output: {'house_number': '781', 'road': 'franklin ave', 'suburb': 'crown heights',
# 'city': 'brooklyn', 'state': 'ny', 'postcode': '11216', 'country': 'usa'}
# Get parsed address as JSON string
json_result = oxidize_postal.parse_address_to_json(address)
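A parsed result presumably contains only the components found in the input, so downstream code is safer if it tolerates missing keys. The sketch below uses the sample output above as a literal so it runs without the library; `format_single_line` is a hypothetical helper, not part of oxidize-postal.

```python
# Sample result copied from the output shown above; a real call would be
# oxidize_postal.parse_address(address).
parsed = {
    "house_number": "781",
    "road": "franklin ave",
    "suburb": "crown heights",
    "city": "brooklyn",
    "state": "ny",
    "postcode": "11216",
    "country": "usa",
}

def format_single_line(components,
                       order=("house_number", "road", "city", "state", "postcode")):
    """Join the components that are present, skipping any missing keys."""
    return " ".join(components[key] for key in order if key in components)

print(format_single_line(parsed))
# 781 franklin ave brooklyn ny 11216
print(format_single_line({"road": "main st", "city": "springfield"}))
# main st springfield
```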
Address Expansion
# Expand address abbreviations
address = "123 Main St NYC NY"
expansions = oxidize_postal.expand_address(address)
print(expansions)
# Output: ['123 main street nyc new york', '123 main street nyc ny', ...]
# Get expansions as JSON
json_expansions = oxidize_postal.expand_address_to_json(address)
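One common use of expansions is duplicate detection: two address strings likely refer to the same place if their expansion lists overlap. The sketch below works on plain lists of strings. The lists are hand-written illustrations rather than actual library output, and `share_expansion` is a hypothetical helper.

```python
def share_expansion(expansions_a, expansions_b):
    """Return True if the two expansion lists have at least one string in common."""
    return not set(expansions_a).isdisjoint(expansions_b)

# Hand-written stand-ins for oxidize_postal.expand_address(...) results.
a = ["123 main street nyc new york", "123 main street nyc ny"]
b = ["123 main street nyc ny", "123 main saint nyc ny"]
c = ["456 oak avenue"]

print(share_expansion(a, b))  # True
print(share_expansion(a, c))  # False
```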
Parallel Processing & Performance
One of the key advantages of oxidize-postal over pypostal is GIL-free parallel processing. However, it's important to understand when you'll see benefits.
When Parallel Processing Helps
Parallel processing provides the most benefit when combined with slower I/O operations:
Great for parallel processing:
import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
import requests
def process_customer_record(record):
    # Fetch from API (50-200ms)
    customer = requests.get(f"https://api.example.com/customers/{record['id']}").json()
    # Parse address (0.3ms) - GIL released so other threads can work
    parsed = oxidize_postal.parse_address(customer['address'])
    # Write to database (50-200ms); `db` stands for your own database client
    db.update(customer['id'], parsed)
    return parsed
# Process many records in parallel
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(process_customer_record, records))
Limited benefit for pure address parsing:
# Just parsing addresses without I/O
addresses = ["123 Main St", "456 Oak Ave"] * 100
# Parallel might even be slower due to thread overhead
with ThreadPoolExecutor() as executor:
    results = list(executor.map(oxidize_postal.parse_address, addresses))
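Whether threads help for pure parsing depends on the machine and batch size, so it may be worth measuring before committing to either approach. Below is a minimal timing sketch. `compare` is a hypothetical helper, and `stand_in_parse` is a placeholder so the code runs without the library; pass `oxidize_postal.parse_address` instead to get meaningful numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stand_in_parse(address):
    # Placeholder for oxidize_postal.parse_address so the sketch runs anywhere.
    return {"raw": address.lower()}

def compare(parse_fn, addresses, workers=4):
    """Time sequential vs threaded parsing; return (seq_s, threaded_s, same_results)."""
    start = time.perf_counter()
    sequential = [parse_fn(a) for a in addresses]
    seq_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in input order, so the lists are comparable.
        threaded = list(executor.map(parse_fn, addresses))
    threaded_s = time.perf_counter() - start

    return seq_s, threaded_s, sequential == threaded

addresses = ["123 Main St", "456 Oak Ave"] * 100
seq_s, threaded_s, same = compare(stand_in_parse, addresses)
print(f"sequential: {seq_s:.4f}s, threaded: {threaded_s:.4f}s, same results: {same}")
```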
Real-World Use Cases
Where to use oxidize-postal's GIL release:
- ETL Pipelines: Reading from databases/APIs, parsing, and writing back
- Stream Processing: Handling Kafka/Kinesis streams with address data
- Web Services: API endpoints that parse addresses alongside other operations
- File Processing: Reading large CSV/Parquet files, parsing addresses, writing results
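As an illustration of the file-processing case, the sketch below reads a CSV, parses the address column in a thread pool, and writes the postcode alongside each row. The column names, `add_postcodes`, and `stand_in_parse` are all assumptions made for the example; in practice `parse_fn` would be `oxidize_postal.parse_address`.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def stand_in_parse(address):
    # Placeholder for oxidize_postal.parse_address; treats the last token as the postcode.
    return {"postcode": address.split()[-1]}

def add_postcodes(reader_file, writer_file, parse_fn, workers=4):
    """Read rows with 'id' and 'address' columns, parse in threads, write 'postcode' too."""
    rows = list(csv.DictReader(reader_file))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        parsed = list(executor.map(parse_fn, (row["address"] for row in rows)))
    writer = csv.DictWriter(writer_file, fieldnames=["id", "address", "postcode"])
    writer.writeheader()
    for row, components in zip(rows, parsed):
        writer.writerow({
            "id": row["id"],
            "address": row["address"],
            "postcode": components.get("postcode", ""),  # tolerate a missing component
        })
    return len(rows)

source = io.StringIO(
    "id,address\n"
    "1,781 Franklin Ave Brooklyn NY 11216\n"
    "2,123 Main St NYC NY 10001\n"
)
target = io.StringIO()
count = add_postcodes(source, target, stand_in_parse)
print(target.getvalue())
```

This version loads every row into memory before parsing, which keeps the code short; a very large file would need to be processed in chunks instead.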
Threading vs Multiprocessing
Because oxidize-postal releases the GIL, threading is usually preferable to multiprocessing:
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool
# Threading - Lower overhead, shared memory
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(oxidize_postal.parse_address, addresses))
# Multiprocessing - Higher overhead due to serialization
# Only use if you need true CPU parallelism for other operations
# On platforms that spawn workers (macOS, Windows), run this under if __name__ == "__main__":
with Pool(processes=8) as pool:
    results = pool.map(oxidize_postal.parse_address, addresses)
Threading outperforms multiprocessing by 3-5x for pure address parsing of small batches (under 5-20k addresses depending on your machine) due to lower overhead.
API Reference
Core Functions
parse_address(address: str) -> dict
Parse an address string into its component parts.
Parameters:
- address: The address string to parse
Returns:
- Dictionary with keys like 'house_number', 'road', 'city', 'state', 'postcode', etc.
expand_address(address: str) -> list[str]
Generate normalized variations of an address.
Parameters:
- address: The address string to expand
Returns:
- List of expanded address strings
download_data(force: bool = False) -> bool
Download the libpostal data files.
Parameters:
- force: If True, re-download even if data exists
Returns:
- True if successful, False otherwise
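Because `download_data` signals failure through its return value rather than an exception, setup code may want to check it explicitly. The sketch below assumes `force` can be passed as a keyword argument. `ensure_data` and the two fake download functions are hypothetical and exist only so the example runs without the library.

```python
def ensure_data(download, force=False):
    """Call a download function and raise if it reports failure."""
    if not download(force=force):
        raise RuntimeError("libpostal data download failed")
    return True

# Stand-ins for oxidize_postal.download_data so the sketch runs without the library.
def fake_success(force=False):
    return True

def fake_failure(force=False):
    return False

ensure_data(fake_success)
try:
    ensure_data(fake_failure)
except RuntimeError as exc:
    print(f"setup error: {exc}")
# setup error: libpostal data download failed
```

With the real library the call would be `ensure_data(oxidize_postal.download_data)`.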
Additional Functions
- parse_address_to_json(address: str) -> str: Parse and return as JSON
- expand_address_to_json(address: str) -> str: Expand and return as JSON
- normalize_address(address: str) -> str: Normalize an address string
Constants
The module provides various constants for address components:
import oxidize_postal
# Address component constants
oxidize_postal.ADDRESS_ANY
oxidize_postal.ADDRESS_NAME
oxidize_postal.ADDRESS_HOUSE_NUMBER
oxidize_postal.ADDRESS_STREET
oxidize_postal.ADDRESS_UNIT
oxidize_postal.ADDRESS_LEVEL
oxidize_postal.ADDRESS_POSTAL_CODE
# ... and more
Requirements
- Python 3.9+
- libpostal data files (~2GB, downloaded separately)
- Rust toolchain (for building from source)
Project Structure
oxidize-postal/
├── oxidize-postal/ # Rust extension module
│ ├── src/
│ │ ├── lib.rs # PyO3 module definition
│ │ └── postal/
│ │ ├── parser.rs # Core parsing functions
│ │ ├── python_api.rs # Python-exposed functions
│ │ ├── error.rs # Error types
│ │ └── constants.rs # libpostal constants
│ ├── Cargo.toml # Rust dependencies
│ └── pyproject.toml # Python package config
├── tests/
│ ├── fixtures/ # Sample addresses
│ ├── unit/ # Unit tests
│ ├── integration/ # End-to-end tests
│ └── performance/ # Benchmarking tests
├── main.py # Usage examples
├── data_manager.py # libpostal data downloader
├── build.sh # Build script
└── pyproject.toml # Root package config
Architecture
- Stack: Python → PyO3 → Rust → libpostal-rust → libpostal C library
- GIL Release: All parsing operations release the Python GIL for true parallel processing
- Error Handling: Rust errors are converted to Python exceptions (ValueError, RuntimeError)
- Data Requirements: libpostal needs ~2GB of language model data (stored in /usr/local/share/libpostal)
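Since failures surface as `ValueError` or `RuntimeError`, a batch job can catch just those two types and keep going. The sketch below shows one way to do that. `parse_or_none` is a hypothetical wrapper, and the empty-string check in `stand_in_parse` is an invented trigger, not documented library behavior.

```python
def parse_or_none(parse_fn, address):
    """Return (components, None) on success or (None, error message) on failure."""
    try:
        return parse_fn(address), None
    except (ValueError, RuntimeError) as exc:
        return None, f"{type(exc).__name__}: {exc}"

def stand_in_parse(address):
    # Placeholder for oxidize_postal.parse_address; raising on blank input
    # is an assumption made only to exercise the error path.
    if not address.strip():
        raise ValueError("empty address")
    return {"road": address.lower()}

print(parse_or_none(stand_in_parse, "Main St"))  # ({'road': 'main st'}, None)
print(parse_or_none(stand_in_parse, "   "))      # (None, 'ValueError: empty address')
```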
Build Process
- maturin compiles the Rust extension with PyO3 bindings
- Links against the libpostal-rust crate
- Produces a Python wheel with native extension
- No Python runtime dependencies required
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Acknowledgments
- libpostal - The core C library for address parsing
- libpostal-rust - Rust bindings for libpostal
- pypostal - The original Python bindings that inspired this project