High-performance caching system using Apache Arrow and DuckDB

Arrow Cache

A high-performance caching system for data frames and tables using Apache Arrow and DuckDB with efficient memory management and persistence capabilities.

Features

  • High-performance storage - Store pandas DataFrames, GeoPandas GeoDataFrames, Parquet files, and Arrow tables with minimal overhead
  • Zero-copy data access - Fast data access using Arrow's shared memory model
  • Memory efficiency - Intelligent partitioning of large datasets to manage memory usage
  • SQL query capabilities - DuckDB-powered SQL queries against cached tables
  • Persistence - Store and load data to/from disk with atomic operations for data safety
  • Automatic cache eviction - LRU, LFU, and other eviction policies to manage memory pressure
  • Memory-aware spilling - Automatically spill partitions to disk when memory is low
  • Thread-safe operations - Proper locking for concurrent access to the cache
  • Metadata management - All metadata stored efficiently in DuckDB

Installation

pip install arrow_cache

With GeoPandas support:

pip install "arrow_cache[geo]"  # quotes keep shells such as zsh from expanding the brackets

Quick Start

from arrow_cache import ArrowCache

# Create a cache with default settings
cache = ArrowCache()

# Cache a pandas DataFrame
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
cache.put('my_dataframe', df)

# Query the cached data using SQL
result_df = cache.query('SELECT * FROM _cache_my_dataframe WHERE a > 1')

# Retrieve the data
cached_df = cache.get('my_dataframe')  # Returns a pandas DataFrame

# When finished, close the cache to clean up resources
cache.close()

Configuration

Arrow Cache offers extensive configuration options:

from arrow_cache import ArrowCache, ArrowCacheConfig

config = ArrowCacheConfig(
    # Memory management
    memory_limit=1024 * 1024 * 1024,  # 1GB cache limit
    memory_spill_threshold=0.8,       # Start spilling when 80% full
    
    # Partitioning for large datasets
    auto_partition=True,
    partition_size_rows=100_000,      # 100k rows per partition
    partition_size_bytes=50 * 1024 * 1024,  # 50MB per partition
    
    # Compression settings
    enable_compression=True,
    compression_type="lz4",           # Fast compression
    dictionary_encoding=True,         # Dictionary encoding for strings
    
    # Performance tuning
    thread_count=4,                   # Worker threads
    cache_query_plans=True,           # Cache query plans for repeated queries
    
    # Storage settings
    spill_to_disk=True,               # Allow spilling to disk
    spill_directory=".arrow_cache_spill",
    persistent_storage=True,
    storage_path=".arrow_cache_storage",
    delete_files_on_close=True,       # Clean up files when closing
)

cache = ArrowCache(config=config)
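
To make the numbers in this configuration concrete, the short sketch below works out where spilling would begin and roughly how many partitions a dataset would need. It uses only the values shown above. The helper `estimate_partitions` is hypothetical, and the reading that a partition must satisfy both the row limit and the byte limit is an assumption, not something the library documents here.

```python
import math

# Values copied from the configuration example above.
memory_limit = 1024 * 1024 * 1024        # 1 GiB
memory_spill_threshold = 0.8
partition_size_rows = 100_000
partition_size_bytes = 50 * 1024 * 1024  # 50 MiB

# Assumed interpretation: spilling starts once usage passes this many bytes.
spill_start_bytes = int(memory_limit * memory_spill_threshold)


def estimate_partitions(total_rows: int, total_bytes: int) -> int:
    """Hypothetical helper: partitions needed if both size limits must hold."""
    by_rows = math.ceil(total_rows / partition_size_rows)
    by_bytes = math.ceil(total_bytes / partition_size_bytes)
    # An empty dataset is still counted as one partition.
    return max(by_rows, by_bytes, 1)


print(spill_start_bytes)
print(estimate_partitions(1_000_000, 200 * 1024 * 1024))
```

With these settings, a one-million-row table of about 200 MiB is bounded by the row limit rather than the byte limit.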

Core Functionality

Storing and Retrieving Data

# Store data with optional TTL (time-to-live)
cache.put('my_dataframe', df, ttl=3600)  # Expire after 1 hour

# Add metadata
cache.put('my_dataframe', df, metadata={'source': 'database', 'version': 2})

# Retrieve data
df = cache.get('my_dataframe')

# Get a slice of data (efficient for large datasets)
df_slice = cache.get('my_dataframe', offset=1000, limit=100)

# Get as a specific type
arrow_table = cache.get('my_dataframe', target_type='arrow')
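
The `ttl` argument can be pictured as an expiry timestamp stored next to each entry. The following standard-library sketch shows one common way to implement that, with expired entries dropped lazily on read. `TTLStore` is a hypothetical stand-in written for illustration; how Arrow Cache actually tracks expiry is not described in this README.

```python
import time


class TTLStore:
    """Hypothetical stand-in for a cache whose entries expire after a TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._items = {}      # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        # ttl is in seconds; None means the entry never expires.
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def get(self, key, default=None):
        item = self._items.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily drop the expired entry
            return default
        return value


store = TTLStore()
store.put("my_dataframe", [1, 2, 3], ttl=3600)
print(store.get("my_dataframe"))
```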

SQL Queries

SQL queries are powered by DuckDB and can be run directly against cached tables:

# Tables are registered with the prefix '_cache_'
result = cache.query('''
    SELECT 
        a, 
        COUNT(*) as count 
    FROM _cache_my_dataframe 
    GROUP BY a
''')

# Explain the query plan
explain = cache.explain('SELECT * FROM _cache_my_dataframe WHERE a > 10')
print(explain)

Memory Management

Arrow Cache tracks its own memory usage. Inspect it through the cache status, and clear the cache manually when needed:

# Check current cache status
status = cache.status()
print(f"Cache size: {status['current_size_bytes'] / 1024 / 1024:.2f} MB")
print(f"Entries: {status['entry_count']}")
print(f"Memory usage: {status['memory']['allocated_bytes'] / 1024 / 1024:.2f} MB")

# Manually clear cache if needed
cache.clear()
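
For readers unfamiliar with eviction policies, here is a minimal sketch of LRU eviction against a byte budget, using only the standard library. `LRUByteCache` and its attributes are hypothetical names chosen for illustration. It is not Arrow Cache's implementation, which also supports other policies and spilling to disk.

```python
from collections import OrderedDict


class LRUByteCache:
    """Hypothetical sketch: evict least recently used entries over a byte budget."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.current_size_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size_bytes), oldest first

    def put(self, key, value, size_bytes):
        # Replacing a key first removes its old size from the running total.
        if key in self._entries:
            self.current_size_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, size_bytes)
        self.current_size_bytes += size_bytes
        # Evict from the least recently used end, always keeping the newest entry.
        while self.current_size_bytes > self.limit_bytes and len(self._entries) > 1:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.current_size_bytes -= evicted_size

    def get(self, key):
        value, _ = self._entries[key]
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def keys(self):
        return list(self._entries)


lru = LRUByteCache(limit_bytes=100)
lru.put("a", "table A", 40)
lru.put("b", "table B", 40)
lru.get("a")                 # "a" is now more recently used than "b"
lru.put("c", "table C", 40)  # total would be 120 bytes, so "b" is evicted
print(lru.keys())
```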

Data Persistence

Store and load data to/from disk:

# Persist a dataset to disk
cache.persist('my_dataframe', storage_dir='/path/to/storage')

# Load a persisted dataset
cache.load('my_dataframe', storage_dir='/path/to/storage')
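
The feature list mentions atomic operations for persistence. A common way to get that property is to write to a temporary file in the target directory and then rename it over the destination, so a reader sees either the old file or the complete new one. The sketch below shows this pattern with the standard library. `atomic_write_bytes` is a hypothetical helper, and whether Arrow Cache uses exactly this approach is an assumption.

```python
import os
import tempfile


def atomic_write_bytes(path: str, payload: bytes) -> None:
    """Hypothetical helper: replace `path` with `payload` in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # On any failure, remove the partial temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as storage_dir:
    target = os.path.join(storage_dir, "my_dataframe.bin")
    atomic_write_bytes(target, b"first version")
    atomic_write_bytes(target, b"second version")
    with open(target, "rb") as handle:
        print(handle.read())
```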

Advanced Features

Working with Large Datasets

Arrow Cache automatically partitions large datasets for efficient memory management:

# Large datasets are automatically partitioned
large_df = pd.read_csv('large_dataset.csv')  # Millions of rows
cache.put('large_data', large_df)

# Get just a slice without loading everything into memory
slice_df = cache.get('large_data', offset=1000000, limit=1000)
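
A slice request only has to touch the partitions that overlap the requested row range. The sketch below shows the index arithmetic for fixed-size row partitions. `partitions_for_slice` is a hypothetical helper, and fixed-size partitioning by row count is an assumption made to keep the example simple.

```python
def partitions_for_slice(total_rows, partition_size_rows, offset, limit):
    """Hypothetical helper: list (partition_index, local_start, local_stop)
    for every partition overlapping rows [offset, offset + limit)."""
    end = min(offset + limit, total_rows)
    pieces = []
    row = offset
    while row < end:
        index = row // partition_size_rows       # partition holding this row
        local_start = row % partition_size_rows  # position inside that partition
        partition_end = (index + 1) * partition_size_rows
        stop_global = min(end, partition_end)
        pieces.append((index, local_start, local_start + (stop_global - row)))
        row = stop_global                        # continue in the next partition
    return pieces


# The slice from the example above falls entirely inside one partition.
print(partitions_for_slice(5_000_000, 100_000, offset=1_000_000, limit=1000))
# A slice that straddles a partition boundary touches two partitions.
print(partitions_for_slice(5_000_000, 100_000, offset=99_500, limit=1000))
```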

Working with GeoPandas

import geopandas as gpd
from arrow_cache import ArrowCache

cache = ArrowCache()
gdf = gpd.read_file("some_geo_data.geojson")
cache.put('geo_data', gdf)

# Run spatial queries through DuckDB
result = cache.query('''
    SELECT * FROM _cache_geo_data 
    WHERE ST_Contains(geometry, ST_Point(0, 0))
''')

Thread Safety

All cache operations are thread-safe:

import concurrent.futures

def process_batch(batch_id):
    # Each thread can safely access the cache
    data = cache.get('large_data', offset=batch_id*1000, limit=1000)
    processed = process_function(data)  # process_function is a placeholder for your own logic
    return processed

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_batch, range(10)))
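
The same access pattern can be tried without the library installed. The self-contained sketch below guards a plain dictionary with a lock and reads batches from several threads. `LockedStore` is a hypothetical stand-in for the cache; it illustrates lock-protected access in general and makes no claim about Arrow Cache's actual locking strategy.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedStore:
    """Hypothetical stand-in: a dict whose reads and writes share one lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def put(self, key, rows):
        with self._lock:
            self._data[key] = list(rows)

    def get(self, key, offset=0, limit=None):
        with self._lock:
            rows = self._data[key]
            stop = None if limit is None else offset + limit
            return rows[offset:stop]  # slicing returns a new list


store = LockedStore()
store.put("large_data", range(10_000))


def process_batch(batch_id):
    # Each worker reads its own 1000-row window through the lock.
    batch = store.get("large_data", offset=batch_id * 1000, limit=1000)
    return sum(batch)


with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_batch, range(10)))

print(len(results), sum(results))
```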

Resource Management

Use the cache as a context manager, or close it explicitly, so that its resources are released:

# Use as a context manager
with ArrowCache() as cache:
    cache.put('temp_data', df)
    # Cache and resources are automatically cleaned up when exiting

# Or explicitly close when done
cache = ArrowCache()
try:
    # Use cache
    cache.put('my_data', df)
    result = cache.query('SELECT * FROM _cache_my_data')
finally:
    # This will clean up all resources, including any persisted files
    # if delete_files_on_close=True (default)
    cache.close()

How It Works

Arrow Cache combines the following techniques:

  1. Apache Arrow - For zero-copy, columnar data storage
  2. DuckDB - For metadata storage and high-performance SQL queries
  3. Partitioning - Breaking large datasets into manageable chunks
  4. Memory Tracking - Monitoring memory usage and triggering spilling/eviction
  5. Atomic Operations - Ensuring data integrity during failures
  6. Shared Memory Model - Reducing memory overhead through shared buffers
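
The term "zero-copy" in the list above means that a consumer reads from the original buffer instead of from a duplicate. Python's built-in `memoryview` shows the idea at small scale. This is an analogy written with the standard library only. Arrow's buffers and shared memory model are more elaborate, and nothing here is taken from Arrow Cache's code.

```python
buffer = bytearray(b"abcdefgh")

# A memoryview slice refers to the same underlying bytes: nothing is copied.
window = memoryview(buffer)[2:6]

# Slicing the bytearray directly creates an independent copy.
copied = bytes(buffer[2:6])

# A write to the original buffer is visible through the view but not the copy.
buffer[2] = ord("Z")

print(bytes(window))  # b'Zdef'
print(copied)         # b'cdef'
```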

License

MIT License
