PyTemporal Library
High-performance bitemporal timeseries update processor
A high-performance Rust library with Python bindings for processing bitemporal timeseries data. Optimized for financial services and applications requiring immutable audit trails with both business and system time dimensions.
Features
- High Performance: 500k records processed in ~885ms with adaptive parallelization
- Zero-Copy Processing: Apache Arrow columnar data format for efficient memory usage
- Parallel Processing: Rayon-based parallelization with adaptive thresholds
- Conflation: Automatic merging of adjacent segments with identical values to reduce storage
- Flexible Schema: Dynamic ID and value column configuration
- Python Integration: Seamless PyO3 bindings for Python workflows
- Modular Architecture: Clean separation of concerns with dedicated modules
- Performance Monitoring: Integrated flamegraph generation and GitHub Pages benchmark reports
Installation
Build from source (requires Rust):
```bash
git clone <your-repository-url>
cd pytemporal
uv run maturin develop --release
```
Quick Start
```python
import pandas as pd
import pyarrow as pa
from pytemporal import compute_changes

# Convert pandas DataFrames to Arrow RecordBatches
def df_to_record_batch(df):
    table = pa.Table.from_pandas(df)
    return table.to_batches()[0]

# Current state
current_state = pd.DataFrame({
    'id': [1234, 1234],
    'field': ['test', 'fielda'],
    'mv': [300, 400],
    'price': [400, 500],
    'effective_from': pd.to_datetime(['2020-01-01', '2020-01-01']),
    'effective_to': pd.to_datetime(['2021-01-01', '2021-01-01']),
    'as_of_from': pd.to_datetime(['2025-01-01', '2025-01-01']),
    'as_of_to': pd.to_datetime(['2262-04-11', '2262-04-11']),  # Max date sentinel
    'value_hash': [0, 0]  # Will be computed automatically
})

# Updates
updates = pd.DataFrame({
    'id': [1234],
    'field': ['test'],
    'mv': [400],
    'price': [300],
    'effective_from': pd.to_datetime(['2020-06-01']),
    'effective_to': pd.to_datetime(['2020-09-01']),
    'as_of_from': pd.to_datetime(['2025-07-27']),
    'as_of_to': pd.to_datetime(['2262-04-11']),
    'value_hash': [0]
})

# Process updates
expire_indices, insert_batches = compute_changes(
    df_to_record_batch(current_state),
    df_to_record_batch(updates),
    id_columns=['id', 'field'],
    value_columns=['mv', 'price'],
    system_date='2025-07-27',
    update_mode='delta'
)

print(f"Records to expire: {len(expire_indices)}")
# insert_batches is a list of Arrow RecordBatches
print(f"Records to insert: {sum(b.num_rows for b in insert_batches)}")
```
Algorithm Explanation with Examples
Bitemporal Model
Each record tracks two time dimensions:
- Effective Time (`effective_from`, `effective_to`): When the data is valid in the real world
- As-Of Time (`as_of_from`, `as_of_to`): When the data was known to the system
Both dimensions use microsecond timestamp precision (Arrow TimestampMicrosecond).
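For reference, here is a minimal sketch of the schema implied by the Quick Start above. This is not a schema mandated by the library (ID and value columns are configurable); the column names and the integer `value_hash` dtype are taken from the example:

```python
import pyarrow as pa

# Sketch of the Quick Start's bitemporal schema. The four timestamp
# columns carry the two time dimensions at microsecond precision.
schema = pa.schema([
    ('id', pa.int64()),
    ('field', pa.string()),
    ('mv', pa.int64()),
    ('price', pa.int64()),
    ('effective_from', pa.timestamp('us')),
    ('effective_to', pa.timestamp('us')),
    ('as_of_from', pa.timestamp('us')),
    ('as_of_to', pa.timestamp('us')),
    ('value_hash', pa.int64()),
])
```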
Core Algorithm: Timeline Processing
The algorithm processes updates by creating a timeline of events and determining what should be active at each point in time.
Example 1: Simple Overwrite
Current State:
ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
Update:
ID=123, effective: [2020-06-01, 2020-09-01], as_of: [2025-07-27, max], mv=200
Timeline Processing:
1. Create Events:
   - 2020-01-01: Current starts (mv=100)
   - 2020-06-01: Update starts (mv=200)
   - 2020-09-01: Update ends
   - 2021-01-01: Current ends
2. Process Timeline:
   - [2020-01-01, 2020-06-01): Current active → emit mv=100
   - [2020-06-01, 2020-09-01): Update active → emit mv=200
   - [2020-09-01, 2021-01-01): Current active → emit mv=100
3. Result:
   - Expire: Original record (index 0)
   - Insert: Three new records covering the split timeline
Visual Representation:

Before:
  Current: |=========mv=100=========|
           2020-01-01            2021-01-01
  Update:          |==mv=200==|
                   2020-06-01 2020-09-01

After:
  New: |==mv=100==|==mv=200==|==mv=100==|
       2020-01-01 2020-06-01 2020-09-01 2021-01-01
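To make the sweep concrete, here is a small, self-contained Python sketch of the event-timeline idea from Example 1. It illustrates the algorithm described above, not the library's Rust implementation; the `(from, to, value, priority)` tuple shape is invented for the example:

```python
from datetime import date

# (from, to, value, priority) -- updates take priority over current state
current = [(date(2020, 1, 1), date(2021, 1, 1), 100, 0)]
updates = [(date(2020, 6, 1), date(2020, 9, 1), 200, 1)]
records = current + updates

# Collect every boundary as an event, then sweep left to right,
# emitting the highest-priority record active in each interval.
boundaries = sorted({d for f, t, _, _ in records for d in (f, t)})

segments = []
for start, end in zip(boundaries, boundaries[1:]):
    active = [r for r in records if r[0] <= start and end <= r[1]]
    if active:
        value = max(active, key=lambda r: r[3])[2]  # update beats current
        segments.append((start, end, value))

for start, end, value in segments:
    print(f"[{start}, {end}): mv={value}")
# [2020-01-01, 2020-06-01): mv=100
# [2020-06-01, 2020-09-01): mv=200
# [2020-09-01, 2021-01-01): mv=100
```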
Example 2: Conflation (Adjacent Identical Values)
Current State:
ID=123, effective: [2020-01-01, 2020-06-01], as_of: [2025-01-01, max], mv=100
ID=123, effective: [2020-06-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
Update:
ID=123, effective: [2020-03-01, 2020-04-01], as_of: [2025-07-27, max], mv=100
Since the update has the same value (mv=100) as the current state, the algorithm detects this as a no-change scenario and skips processing entirely.
Example 3: Complex Multi-Update
Current State:
ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100
Updates:
ID=123, effective: [2020-03-01, 2020-06-01], as_of: [2025-07-27, max], mv=200
ID=123, effective: [2020-09-01, 2020-12-01], as_of: [2025-07-27, max], mv=300
Timeline Processing:
1. Events: 2020-01-01 (current start), 2020-03-01 (update1 start), 2020-06-01 (update1 end), 2020-09-01 (update2 start), 2020-12-01 (update2 end), 2021-01-01 (current end)
2. Result:
   - [2020-01-01, 2020-03-01): mv=100 (current)
   - [2020-03-01, 2020-06-01): mv=200 (update1)
   - [2020-06-01, 2020-09-01): mv=100 (current)
   - [2020-09-01, 2020-12-01): mv=300 (update2)
   - [2020-12-01, 2021-01-01): mv=100 (current)
Post-Processing Conflation
After timeline processing, the algorithm merges adjacent segments with identical value hashes:
Before Conflation:
|--mv=100--|--mv=100--|--mv=200--|--mv=100--|--mv=100--|
After Conflation:
|--------mv=100--------|--mv=200--|--------mv=100--------|
This significantly reduces database row count while preserving temporal accuracy.
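A minimal sketch of this merge step, assuming segments are represented as `(from, to, value_hash)` tuples sorted by start time (the library operates on Arrow batches in Rust, but the idea is the same):

```python
def conflate(segments):
    """Merge adjacent segments whose value hashes match."""
    merged = []
    for start, end, value_hash in segments:
        # Extend the previous segment when it abuts this one
        # and carries the same value hash.
        if merged and merged[-1][2] == value_hash and merged[-1][1] == start:
            prev_start, _, _ = merged.pop()
            merged.append((prev_start, end, value_hash))
        else:
            merged.append((start, end, value_hash))
    return merged

# Five segments collapse to three, matching the diagram above.
segments = [(0, 1, 'a'), (1, 2, 'a'), (2, 3, 'b'), (3, 4, 'a'), (4, 5, 'a')]
print(conflate(segments))  # [(0, 2, 'a'), (2, 3, 'b'), (3, 5, 'a')]
```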
Update Modes
- Delta Mode (default): Only the provided records are treated as updates; existing state is preserved wherever it is not overlapped
- Full State Mode: The provided records represent the complete new state; all current records for matching IDs are expired (see the sketch below)
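Continuing from the Quick Start, a full-state run might look like the following. Note the mode string is an assumption here ('delta' appears in the Quick Start; the exact spelling of the full-state literal should be checked against the library's documentation):

```python
# 'full_state' is an assumed mode string -- verify against the docs.
expire_indices, insert_batches = compute_changes(
    df_to_record_batch(current_state),
    df_to_record_batch(updates),
    id_columns=['id', 'field'],
    value_columns=['mv', 'price'],
    system_date='2025-07-27',
    update_mode='full_state'  # expires all current rows for matching IDs
)
```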
Parallelization Strategy
The algorithm uses adaptive parallelization:
- Serial Processing: Small datasets (<50 ID groups AND <10k records)
- Parallel Processing: Large datasets using Rayon for CPU-bound operations
- ID Group Independence: Each ID group is processed independently, making the workload embarrassingly parallel (illustrated below)
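The independence property is easy to see in Python terms: grouping by the ID columns partitions the work into units that share no state. A rough illustration only; the library does this in Rust with Rayon, and `process_group` is a hypothetical stand-in for the per-ID timeline processing:

```python
from concurrent.futures import ProcessPoolExecutor

def process_group(group_df):
    # Hypothetical stand-in for per-ID-group timeline processing.
    ...

# current_state is the DataFrame from the Quick Start.
groups = [g for _, g in current_state.groupby(['id', 'field'])]

if len(groups) < 50 and len(current_state) < 10_000:
    results = [process_group(g) for g in groups]        # serial path
else:
    with ProcessPoolExecutor() as pool:                 # parallel path
        results = list(pool.map(process_group, groups))
```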
Performance
Benchmarked on modern hardware:
- 500k records: ~885ms processing time
- Adaptive Parallelization: Automatically uses multiple threads for large datasets
- Parallel Thresholds: >50 ID groups OR >10k total records triggers parallel processing
- Conflation Efficiency: Significant row reduction for datasets with temporal continuity
Testing
Run the test suites:
```bash
# Rust tests
cargo test

# Python tests
uv run python -m pytest tests/test_bitemporal.py -v

# Benchmarks
cargo bench
```
Development
Project Structure
Modular Architecture (274 lines total in main file, down from 1,085):
- `src/lib.rs` - Main processing function and Python bindings (274 lines)
- `src/types.rs` - Core data structures and constants (88 lines)
- `src/overlap.rs` - Overlap detection and record categorization (68 lines)
- `src/timeline.rs` - Timeline event processing algorithm (218 lines)
- `src/conflation.rs` - Record conflation and deduplication (157 lines)
- `src/batch_utils.rs` - Arrow RecordBatch utilities (122 lines)
- `tests/integration_tests.rs` - Rust integration tests (5 test scenarios)
- `tests/test_bitemporal_manual.py` - Python test suite (22 test scenarios)
- `benches/bitemporal_benchmarks.rs` - Performance benchmarks
- `CLAUDE.md` - Project context and development notes
Key Commands
```bash
# Build release version
cargo build --release

# Run benchmarks with HTML reports
cargo bench

# Build Python wheel
uv run maturin build --release

# Development install
uv run maturin develop
```
Module Responsibilities
- `types.rs` - Data structures (`BitemporalRecord`, `ChangeSet`, `UpdateMode`) and type conversions
- `overlap.rs` - Determines which records overlap in time and need timeline processing vs direct insertion
- `timeline.rs` - Core algorithm that processes overlapping records through the event timeline
- `conflation.rs` - Post-processes results to merge adjacent segments with identical values
- `batch_utils.rs` - Arrow utilities for RecordBatch creation and timestamp handling
Dependencies
- arrow (53.4) - Columnar data processing
- pyo3 (0.21) - Python bindings
- chrono (0.4) - Date/time handling
- blake3 (1.5) - Cryptographic hashing
- rayon (1.8) - Parallel processing
- criterion (0.5) - Benchmarking framework
Architecture
Rust Core
- Zero-copy Arrow array processing
- Parallel execution with Rayon
- Hash-based change detection with BLAKE3 (sketched below)
- Post-processing conflation for optimal storage
- Modular design with clear separation of concerns
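As an illustration of hash-based change detection, the sketch below uses the `blake3` Python package; the library computes hashes in Rust, and the exact serialization of value columns is an assumption:

```python
import blake3  # pip install blake3

def value_fingerprint(row, value_columns):
    # Hash the value columns so identical values compare in O(1).
    h = blake3.blake3()
    for col in value_columns:
        h.update(str(row[col]).encode())
    return h.hexdigest()

fingerprint = value_fingerprint({'mv': 300, 'price': 400}, ['mv', 'price'])
```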
Python Interface
- PyO3 bindings for seamless integration
- Arrow RecordBatch input/output
- Compatible with pandas DataFrames via conversion (see the example below)
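For example, the insert batches returned by compute_changes can be turned back into a DataFrame (assuming insert_batches is a list of RecordBatches, as the Quick Start suggests):

```python
import pyarrow as pa

# Combine the returned batches and convert back to pandas.
rows_to_insert = pa.Table.from_batches(insert_batches).to_pandas()
```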
Performance Monitoring
This project includes comprehensive performance monitoring with flamegraph analysis:
📊 Release Performance Reports
View performance metrics and flamegraphs for each release at: Release Benchmarks
Each version tag automatically generates comprehensive performance documentation with flamegraphs, creating a historical record of performance evolution across releases.
🔥 Generating Flamegraphs Locally
```bash
# Generate flamegraphs for key benchmarks
cargo bench --bench bitemporal_benchmarks medium_dataset -- --profile-time 5
cargo bench --bench bitemporal_benchmarks conflation_effectiveness -- --profile-time 5
cargo bench --bench bitemporal_benchmarks "scaling_by_dataset_size/records/500000" -- --profile-time 5

# Add flamegraph links to HTML reports
python3 scripts/add_flamegraphs_to_html.py

# View reports locally
python3 -m http.server 8000 --directory target/criterion
# Then visit: http://localhost:8000/report/
```
📈 Performance Expectations
| Dataset Size | Processing Time | Flamegraph Available |
|---|---|---|
| Small (5 records) | ~30-35 µs | ❌ |
| Medium (100 records) | ~165-170 µs | ✅ |
| Large (500k records) | ~900-950 ms | ✅ |
| Conflation test | ~28 µs | ✅ |
🎯 Key Optimization Areas (from Flamegraph Analysis)
- `process_id_timeline`: Core algorithm logic
- Rayon parallelization: Thread management overhead
- Arrow operations: Columnar data processing
- BLAKE3 hashing: Value fingerprinting for conflation
See docs/benchmark-publishing.md for complete setup details.
Contributing
- Check `CLAUDE.md` for project context and conventions
- Run tests before submitting changes
- Follow existing code style and patterns
- Update benchmarks for performance-related changes
- Use flamegraphs to validate performance improvements
- Maintain modular architecture when adding features
License
MIT License - see LICENSE file for details.
Built With
- Apache Arrow - Columnar data format
- PyO3 - Rust-Python bindings
- Rayon - Data parallelism
- Criterion - Benchmarking
- BLAKE3 - Cryptographic hashing algorithm