PyTemporal Library

High-performance bitemporal timeseries update processor

A high-performance Rust library with Python bindings for processing bitemporal timeseries data. Optimized for financial services and applications requiring immutable audit trails with both business and system time dimensions.

Features

  • High Performance: 500k records processed in ~885ms with adaptive parallelization
  • Zero-Copy Processing: Apache Arrow columnar data format for efficient memory usage
  • Parallel Processing: Rayon-based parallelization with adaptive thresholds
  • Conflation: Automatic merging of adjacent segments with identical values to reduce storage
  • Flexible Schema: Dynamic ID and value column configuration
  • Python Integration: Seamless PyO3 bindings for Python workflows
  • Modular Architecture: Clean separation of concerns with dedicated modules
  • Performance Monitoring: Integrated flamegraph generation and GitHub Pages benchmark reports

Installation

Build from source (requires Rust and uv):

git clone <your-repository-url>
cd pytemporal
uv run maturin develop --release

Quick Start

import pandas as pd
from pytemporal import compute_changes
import pyarrow as pa
from datetime import datetime

# Convert pandas DataFrames to Arrow RecordBatches
def df_to_record_batch(df):
    # Note: assumes the DataFrame fits in a single Arrow batch
    table = pa.Table.from_pandas(df)
    return table.to_batches()[0]

# Current state
current_state = pd.DataFrame({
    'id': [1234, 1234],
    'field': ['test', 'fielda'], 
    'mv': [300, 400],
    'price': [400, 500],
    'effective_from': pd.to_datetime(['2020-01-01', '2020-01-01']),
    'effective_to': pd.to_datetime(['2021-01-01', '2021-01-01']),
    'as_of_from': pd.to_datetime(['2025-01-01', '2025-01-01']),
    'as_of_to': pd.to_datetime(['2262-04-11', '2262-04-11']),  # Max date
    'value_hash': [0, 0]  # Will be computed automatically
})

# Updates
updates = pd.DataFrame({
    'id': [1234],
    'field': ['test'],
    'mv': [400], 
    'price': [300],
    'effective_from': pd.to_datetime(['2020-06-01']),
    'effective_to': pd.to_datetime(['2020-09-01']),
    'as_of_from': pd.to_datetime(['2025-07-27']),
    'as_of_to': pd.to_datetime(['2262-04-11']),
    'value_hash': [0]
})

# Process updates
expire_indices, insert_batches = compute_changes(
    df_to_record_batch(current_state),
    df_to_record_batch(updates),
    id_columns=['id', 'field'],
    value_columns=['mv', 'price'],
    system_date='2025-07-27',
    update_mode='delta'
)

print(f"Records to expire: {len(expire_indices)}")
print(f"Records to insert: {len(insert_batches)}")

Algorithm Explanation with Examples

Bitemporal Model

Each record tracks two time dimensions:

  • Effective Time (effective_from, effective_to): When the data is valid in the real world
  • As-Of Time (as_of_from, as_of_to): When the data was known to the system

Both dimensions use Arrow TimestampMicrosecond (microsecond-precision) timestamps.
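
For reference, an Arrow schema matching the Quick Start layout might look like the sketch below; the ID/value columns and the value_hash type are illustrative assumptions, not a fixed schema.

import pyarrow as pa

# Illustrative schema following the Quick Start columns (ID and value columns
# are configurable; the value_hash type here is an assumption)
schema = pa.schema([
    ('id', pa.int64()),
    ('field', pa.string()),
    ('mv', pa.float64()),                      # value column
    ('price', pa.float64()),                   # value column
    ('effective_from', pa.timestamp('us')),    # business time
    ('effective_to', pa.timestamp('us')),
    ('as_of_from', pa.timestamp('us')),        # system time
    ('as_of_to', pa.timestamp('us')),
    ('value_hash', pa.string()),               # computed by the library
])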

Core Algorithm: Timeline Processing

The algorithm processes updates by creating a timeline of events and determining what should be active at each point in time.

Example 1: Simple Overwrite

Current State:

ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100

Update:

ID=123, effective: [2020-06-01, 2020-09-01], as_of: [2025-07-27, max], mv=200

Timeline Processing:

  1. Create Events:

    • 2020-01-01: Current starts (mv=100)
    • 2020-06-01: Update starts (mv=200)
    • 2020-09-01: Update ends
    • 2021-01-01: Current ends
  2. Process Timeline:

    • [2020-01-01, 2020-06-01): Current active → emit mv=100
    • [2020-06-01, 2020-09-01): Update active → emit mv=200
    • [2020-09-01, 2021-01-01): Current active → emit mv=100
  3. Result:

    • Expire: Original record (index 0)
    • Insert: Three new records covering the split timeline

Visual Representation:

Before:
Current |=======mv=100========|
        2020-01-01      2021-01-01

Update       |==mv=200==|
             2020-06-01  2020-09-01

After:
New     |=100=|=mv=200=|=100=|
        2020   2020     2020  2021
        01-01  06-01    09-01 01-01
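
The splitting step can be illustrated with a minimal plain-Python sketch (the actual processing happens in Rust over Arrow arrays; dates stand in for the microsecond timestamps):

from datetime import date

# One current record and one overlapping update for the same ID
current = {'from': date(2020, 1, 1), 'to': date(2021, 1, 1), 'mv': 100}
update  = {'from': date(2020, 6, 1), 'to': date(2020, 9, 1), 'mv': 200}

# Every start/end becomes an event boundary on the timeline
boundaries = sorted({current['from'], current['to'], update['from'], update['to']})

for start, end in zip(boundaries, boundaries[1:]):
    if update['from'] <= start and end <= update['to']:
        print(f"[{start}, {end}): mv={update['mv']}")   # update wins where it applies
    else:
        print(f"[{start}, {end}): mv={current['mv']}")  # otherwise the current record persists

# [2020-01-01, 2020-06-01): mv=100
# [2020-06-01, 2020-09-01): mv=200
# [2020-09-01, 2021-01-01): mv=100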

Example 2: Conflation (Adjacent Identical Values)

Current State:

ID=123, effective: [2020-01-01, 2020-06-01], as_of: [2025-01-01, max], mv=100
ID=123, effective: [2020-06-01, 2021-01-01], as_of: [2025-01-01, max], mv=100  

Update:

ID=123, effective: [2020-03-01, 2020-04-01], as_of: [2025-07-27, max], mv=100

Since the update has the same value (mv=100) as the current state, the algorithm detects this as a no-change scenario and skips processing entirely.
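
Change detection is hash-based: conceptually, each record's value columns are reduced to a SHA256 digest and compared. The sketch below is only an illustration; the library computes value_hash internally, and the exact serialization it uses is an assumption here.

import hashlib

def value_hash(row, value_columns):
    # Hypothetical serialization for illustration only; the library's own
    # SHA256 hashing of the value columns happens inside the Rust core
    payload = '|'.join(str(row[col]) for col in value_columns)
    return hashlib.sha256(payload.encode()).hexdigest()

current = {'mv': 100}
update = {'mv': 100}

# Identical digests -> no-change scenario: nothing to expire, nothing to insert
print(value_hash(current, ['mv']) == value_hash(update, ['mv']))  # True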

Example 3: Complex Multi-Update

Current State:

ID=123, effective: [2020-01-01, 2021-01-01], as_of: [2025-01-01, max], mv=100

Updates:

ID=123, effective: [2020-03-01, 2020-06-01], as_of: [2025-07-27, max], mv=200
ID=123, effective: [2020-09-01, 2020-12-01], as_of: [2025-07-27, max], mv=300

Timeline Processing:

  1. Events: 2020-01-01 (current start), 2020-03-01 (update1 start), 2020-06-01 (update1 end), 2020-09-01 (update2 start), 2020-12-01 (update2 end), 2021-01-01 (current end)

  2. Result:

    • [2020-01-01, 2020-03-01): mv=100 (current)
    • [2020-03-01, 2020-06-01): mv=200 (update1)
    • [2020-06-01, 2020-09-01): mv=100 (current)
    • [2020-09-01, 2020-12-01): mv=300 (update2)
    • [2020-12-01, 2021-01-01): mv=100 (current)
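
The same boundary-walking idea generalizes to any number of updates. A sketch under the same assumptions as Example 1:

from datetime import date

current = {'from': date(2020, 1, 1), 'to': date(2021, 1, 1), 'mv': 100}
updates = [
    {'from': date(2020, 3, 1), 'to': date(2020, 6, 1), 'mv': 200},
    {'from': date(2020, 9, 1), 'to': date(2020, 12, 1), 'mv': 300},
]

boundaries = sorted({current['from'], current['to'],
                     *(u['from'] for u in updates), *(u['to'] for u in updates)})

for start, end in zip(boundaries, boundaries[1:]):
    # An update is active where it covers the interval; otherwise the current record applies
    active = next((u for u in updates if u['from'] <= start and end <= u['to']), current)
    print(f"[{start}, {end}): mv={active['mv']}")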

Post-Processing Conflation

After timeline processing, the algorithm merges adjacent segments with identical value hashes:

Before Conflation:

|--mv=100--|--mv=100--|--mv=200--|--mv=100--|--mv=100--|

After Conflation:

|--------mv=100--------|--mv=200--|--------mv=100--------|

This significantly reduces database row count while preserving temporal accuracy.
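
Conceptually, the merge pass looks like the following sketch over (start, end, value) segments; the Rust implementation compares value hashes rather than raw values:

def conflate(segments):
    # Merge adjacent segments that touch and carry identical values
    merged = []
    for start, end, value in segments:
        if merged and merged[-1][1] == start and merged[-1][2] == value:
            merged[-1] = (merged[-1][0], end, value)   # extend the previous segment
        else:
            merged.append((start, end, value))
    return merged

segments = [(1, 2, 100), (2, 3, 100), (3, 4, 200), (4, 5, 100), (5, 6, 100)]
print(conflate(segments))  # [(1, 3, 100), (3, 4, 200), (4, 6, 100)]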

Update Modes

The algorithm supports two distinct update modes that determine how updates interact with existing current state:

1. Delta Mode (default)

  • Behavior: Only provided records are treated as updates
  • Existing State: Preserved where not temporally overlapped by updates
  • Use Case: Incremental updates where you only want to modify specific time periods
  • Processing: Uses timeline-based algorithm to handle temporal overlaps

2. Full State Mode

  • Behavior: Provided records represent the complete desired state for each ID group
  • Existing State: Only preserved if identical records exist in the updates (same values and temporal ranges)
  • Value Comparison: Records are compared using SHA256 hashes of value columns
  • Processing Rules:
    • Unchanged Records: If an update has identical values and temporal range as current state, neither expire nor insert
    • Changed Records: If values differ, expire old record and insert new record
    • New Records: Insert records that don't exist in current state
    • Removed Records: Expire current records not present in updates

Full State Mode Example

Current State:

ID=1, mv=100, price=250, effective=[2020-01-01, 2020-02-01]
ID=2, mv=300, price=400, effective=[2020-01-01, 2020-02-01]  

Updates (Full State):

ID=1, mv=150, price=250, effective=[2020-01-01, 2020-02-01]  # Values changed
ID=2, mv=300, price=400, effective=[2020-01-01, 2020-02-01]  # Values identical  
ID=3, mv=500, price=600, effective=[2020-01-01, 2020-02-01]  # New record

Result:

  • Expire: ID=1 (values changed: mv 100→150)
  • Insert: ID=1 (new values), ID=3 (new record)
  • No Action: ID=2 (identical values, no change needed)
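
Invocation uses the same entry point with a different update_mode. A sketch, assuming the full state mode string is 'full_state' (only 'delta' appears in the Quick Start above, so check the documentation for the exact value):

# Hypothetical full state call; 'full_state' as the mode string is an assumption
expire_indices, insert_batches = compute_changes(
    df_to_record_batch(current_state),
    df_to_record_batch(updates),
    id_columns=['id'],
    value_columns=['mv', 'price'],
    system_date='2025-07-27',
    update_mode='full_state'
)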

Parallelization Strategy

The algorithm uses adaptive parallelization:

  • Serial Processing: Small datasets (<50 ID groups AND <10k records)
  • Parallel Processing: Large datasets using Rayon for CPU-bound operations
  • ID Group Independence: Each ID group processes independently, enabling perfect parallelization
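
Stated as pseudocode, the decision rule described above is roughly:

# Conceptual restatement of the thresholds; the actual check lives in the Rust core
def use_parallel(num_id_groups, total_records):
    return num_id_groups > 50 or total_records > 10_000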

Performance

Benchmarked on modern hardware:

  • 500k records: ~885ms processing time
  • Adaptive Parallelization: Automatically uses multiple threads for large datasets
  • Parallel Thresholds: >50 ID groups OR >10k total records triggers parallel processing
  • Conflation Efficiency: Significant row reduction for datasets with temporal continuity

Testing

Run the test suites:

# Rust tests
cargo test

# Python tests  
uv run python -m pytest tests/test_bitemporal.py -v

# Benchmarks
cargo bench

Development

Project Structure

Modular Architecture (274 lines total in main file, down from 1,085):

  • src/lib.rs - Main processing function and Python bindings (274 lines)
  • src/types.rs - Core data structures and constants (88 lines)
  • src/overlap.rs - Overlap detection and record categorization (68 lines)
  • src/timeline.rs - Timeline event processing algorithm (218 lines)
  • src/conflation.rs - Record conflation and deduplication (157 lines)
  • src/batch_utils.rs - Arrow RecordBatch utilities (122 lines)
  • tests/integration_tests.rs - Rust integration tests (5 test scenarios)
  • tests/test_bitemporal_manual.py - Python test suite (22 test scenarios)
  • benches/bitemporal_benchmarks.rs - Performance benchmarks
  • CLAUDE.md - Project context and development notes

Key Commands

# Build release version
cargo build --release

# Run benchmarks with HTML reports
cargo bench

# Build Python wheel  
uv run maturin build --release

# Development install
uv run maturin develop

Module Responsibilities

  1. types.rs - Data structures (BitemporalRecord, ChangeSet, UpdateMode) and type conversions
  2. overlap.rs - Determines which records overlap in time and need timeline processing vs direct insertion
  3. timeline.rs - Core algorithm that processes overlapping records through event timeline
  4. conflation.rs - Post-processes results to merge adjacent segments with identical values
  5. batch_utils.rs - Arrow utilities for RecordBatch creation and timestamp handling

Dependencies

  • arrow (53.4) - Columnar data processing
  • pyo3 (0.21) - Python bindings
  • chrono (0.4) - Date/time handling
  • sha2 (0.10) - SHA256 hashing for client-compatible hex digests
  • rayon (1.8) - Parallel processing
  • criterion (0.5) - Benchmarking framework

Architecture

Rust Core

  • Zero-copy Arrow array processing
  • Parallel execution with Rayon
  • Hash-based change detection with SHA256 (client-compatible hex digests)
  • Post-processing conflation for optimal storage
  • Modular design with clear separation of concerns

Python Interface

  • PyO3 bindings for seamless integration
  • Arrow RecordBatch input/output
  • Compatible with pandas DataFrames via conversion
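
For example, assuming insert_batches from the Quick Start is a non-empty list of Arrow RecordBatches, converting the results back to pandas might look like:

import pyarrow as pa

# Assemble the returned RecordBatches into one Table, then a DataFrame
# (from_batches needs at least one batch, or an explicit schema)
insert_df = pa.Table.from_batches(insert_batches).to_pandas()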

Performance Monitoring

This project includes comprehensive performance monitoring with flamegraph analysis:

📊 Release Performance Reports

View performance metrics and flamegraphs for each release at: Release Benchmarks

Each version tag automatically generates comprehensive performance documentation with flamegraphs, creating a historical record of performance evolution across releases.

🔥 Generating Flamegraphs Locally

# Generate flamegraphs for key benchmarks
cargo bench --bench bitemporal_benchmarks medium_dataset -- --profile-time 5
cargo bench --bench bitemporal_benchmarks conflation_effectiveness -- --profile-time 5 
cargo bench --bench bitemporal_benchmarks "scaling_by_dataset_size/records/500000" -- --profile-time 5

# Add flamegraph links to HTML reports  
python3 scripts/add_flamegraphs_to_html.py

# View reports locally
python3 -m http.server 8000 --directory target/criterion
# Then visit: http://localhost:8000/report/

📈 Performance Expectations

Dataset Size           Processing Time   Flamegraph Available
Small (5 records)      ~30-35 µs
Medium (100 records)   ~165-170 µs
Large (500k records)   ~900-950 ms
Conflation test        ~28 µs

🎯 Key Optimization Areas (from Flamegraph Analysis)

  • process_id_timeline: Core algorithm logic
  • Rayon parallelization: Thread management overhead
  • Arrow operations: Columnar data processing
  • SHA256 hashing: Value fingerprinting for conflation

See docs/benchmark-publishing.md for complete setup details.

Contributing

  1. Check CLAUDE.md for project context and conventions
  2. Run tests before submitting changes
  3. Follow existing code style and patterns
  4. Update benchmarks for performance-related changes
  5. Use flamegraphs to validate performance improvements
  6. Maintain modular architecture when adding features

License

MIT License - see LICENSE file for details.
