Oxidize
High-performance data processing tools for Python, built with Rust.
Philosophy
Python data tools often force a trade-off between performance, simple installation, and parallel processing. These tools aim to avoid that trade-off by providing:
- Performance: Rust implementation with 2-10x speed improvements
- Easy installation: Pre-built wheels, no compilation required
- True parallelism: GIL release for concurrent processing
- Practical focus: Solutions for common data engineering tasks
Tools
oxidize-postal
Address parsing and normalization with international support.
```python
import oxidize_postal

parsed = oxidize_postal.parse_address("781 Franklin Ave Brooklyn NY 11216")
# {'house_number': '781', 'road': 'franklin ave', 'city': 'brooklyn', 'state': 'ny', 'postcode': '11216'}

expansions = oxidize_postal.expand_address("123 Main St NYC NY")
# ['123 main street nyc new york', '123 main street nyc ny', ...]
```
Improvements over pypostal:
- pip install with pre-built wheels (no C compilation)
- GIL released for parallel processing
- Single module API
- Cross-platform support
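Because the parser is described as releasing the GIL, a plain thread pool is one way to parse many addresses concurrently. The following is a minimal sketch, not part of the package: `parse_many` and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in so the example runs without oxidize-postal installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def parse_many(
    addresses: Iterable[str],
    parse: Callable[[str], dict],
    max_workers: int = 4,
) -> List[dict]:
    """Run `parse` over `addresses` on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the inputs were given.
        return list(pool.map(parse, addresses))


def fake_parse(address: str) -> dict:
    """Hypothetical stand-in parser; it only lowercases the input."""
    return {"raw": address.lower()}


addresses = [
    "781 Franklin Ave Brooklyn NY 11216",
    "123 Main St NYC NY",
]
results = parse_many(addresses, fake_parse)
```

With the package installed, passing `oxidize_postal.parse_address` in place of `fake_parse` should work the same way. Threads only help here if the parse call really does release the GIL, as stated above; a pure-Python parser would see little benefit.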
oxidize-xml
Streaming XML to JSON conversion for large files.
```python
import oxidize_xml

# Extract repeated elements to JSON Lines
count = oxidize_xml.parse_xml_file_to_json_file("data.xml", "book", "output.jsonl")

# Stream processing for large files
json_lines = oxidize_xml.parse_xml_file_to_json_string("export.xml", "record")
```
Improvements over lxml:
- 2-3x faster streaming parser
- Processes files larger than available RAM
- Consistent schema output for data analysis
- Built-in XML security protections
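Once the records are in a JSON Lines file, they can be consumed one line at a time with the standard library. This is a minimal sketch under the assumption that each line of the output holds one JSON object, as the JSON Lines format implies; `iter_jsonl` is a hypothetical helper, and the sample records and their field names are invented for illustration.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical sample standing in for "output.jsonl".
sample_lines = [
    '{"title": "First", "year": "2001"}',
    '{"title": "Second", "year": "2005"}',
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(sample_lines) + "\n")
    # Records are read lazily, so only one line is held in memory at a time.
    titles = [record["title"] for record in iter_jsonl(path)]
```

Reading line by line keeps the consumer side as memory-friendly as the producer side, which matters when the converted file is itself larger than RAM.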
Technical Approach
Rust + PyO3: Combines Rust's performance and memory safety with Python's ecosystem integration.
GIL Release: All compute operations release Python's Global Interpreter Lock, enabling true parallel processing in threaded environments.
Streaming Architecture: Designed for processing large datasets without loading everything into memory.
Pre-built Wheels: Cross-platform distribution eliminates compilation requirements and system dependencies.
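To illustrate the streaming idea in general terms, the sketch below uses Python's standard `xml.etree.ElementTree.iterparse` to emit one JSON line per repeated element as the file is read. It is not how the Rust tools are implemented; `iter_records` is a hypothetical name, and flattening only the direct children of each record is an assumption made to keep the example short.

```python
import io
import json
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_records(source: BinaryIO, tag: str) -> Iterator[str]:
    """Yield one JSON string per `tag` element as the parser reaches its end tag."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Flatten the record's direct children into a dict of tag -> text.
            record = {child.tag: (child.text or "").strip() for child in elem}
            yield json.dumps(record)
            # Discard the element's children and text once emitted. The emptied
            # element stays attached to the root, so a production parser would
            # prune those as well.
            elem.clear()


# Invented sample document with two repeated <book> elements.
xml_bytes = (
    b"<library>"
    b"<book><title>First</title><year>2001</year></book>"
    b"<book><title>Second</title><year>2005</year></book>"
    b"</library>"
)
lines = list(iter_records(io.BytesIO(xml_bytes), "book"))
```

The key design choice is emitting each record as soon as its closing tag is seen instead of building the full tree first, which is what keeps memory use roughly independent of file size.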
Use Cases
- ETL pipelines with address normalization
- Processing large XML exports and API responses
- Data cleaning workflows requiring parallel processing
- Web services handling structured data parsing
Future Tools
Planned additions following the same principles:
- oxidize-csv: High-performance CSV processing
- oxidize-json: Streaming JSON operations
- oxidize-regex: Parallel text processing
Contributing
Each tool has its own repository with specific contribution guidelines. General focus areas:
- Performance improvements with benchmarks
- API usability for common workflows
- Documentation and examples
- Test coverage for edge cases
License
All tools are released under the MIT License.