High-performance caching system using Apache Arrow and DuckDB

Arrow Cache

A high-performance cache for data frames and tables, built on Apache Arrow and DuckDB, with memory management and on-disk persistence.

Features

  • High-performance storage - Store pandas DataFrames, GeoPandas GeoDataFrames, Parquet files, and Arrow tables with minimal overhead
  • Zero-copy data access - Fast data access using Arrow's shared memory model
  • Memory efficiency - Intelligent partitioning of large datasets to manage memory usage
  • SQL query capabilities - DuckDB-powered SQL queries against cached tables
  • Persistence - Store and load data to/from disk with atomic operations for data safety
  • Automatic cache eviction - LRU, LFU, and other eviction policies to manage memory pressure
  • Memory-aware spilling - Automatically spill partitions to disk when memory is low
  • Thread-safe operations - Proper locking for concurrent access to the cache
  • Metadata management - All metadata stored efficiently in DuckDB
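As an illustration of the LRU eviction policy named in the feature list, here is a minimal standard-library sketch. It is not Arrow Cache's implementation: the class `LRUByteCache` and its methods are hypothetical, and it only shows the general idea of evicting the least recently used entry once a byte budget is exceeded.

```python
from collections import OrderedDict


class LRUByteCache:
    """Toy LRU cache bounded by total payload size in bytes (illustrative only)."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._entries = OrderedDict()  # key -> bytes, oldest first

    def put(self, key: str, value: bytes) -> None:
        # Replace an existing entry so its size is not counted twice.
        if key in self._entries:
            self.current_bytes -= len(self._entries.pop(key))
        self._entries[key] = value
        self.current_bytes += len(value)
        # Evict from the least recently used end until under budget.
        while self.current_bytes > self.max_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self.current_bytes -= len(evicted)

    def get(self, key: str):
        if key not in self._entries:
            return None
        # A hit marks the entry as most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def keys(self):
        return list(self._entries)


lru = LRUByteCache(max_bytes=10)
lru.put("a", b"1234")
lru.put("b", b"1234")
lru.get("a")             # "a" is now more recently used than "b"
lru.put("c", b"1234")    # 12 bytes > 10, so "b" is evicted
print(lru.keys())        # ['a', 'c']
```

An LFU policy would differ only in the eviction choice: it would track a hit count per key and evict the entry with the lowest count.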

Installation

pip install arrow_cache

With GeoPandas support:

pip install "arrow_cache[geo]"

Quick Start

import pandas as pd

from arrow_cache import ArrowCache

# Create a cache with default settings
cache = ArrowCache()

# Cache a pandas DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
cache.put('my_dataframe', df)

# Query the cached data using SQL; the key 'my_dataframe' is
# registered as the table '_cache_my_dataframe'
result_df = cache.query('SELECT * FROM _cache_my_dataframe WHERE a > 1')

# Retrieve the data
cached_df = cache.get('my_dataframe')  # Returns a pandas DataFrame

# When finished, close the cache to clean up resources
cache.close()

Configuration

Arrow Cache offers extensive configuration options:

from arrow_cache import ArrowCache, ArrowCacheConfig

config = ArrowCacheConfig(
    # Memory management
    memory_limit=1024 * 1024 * 1024,  # 1 GiB cache limit
    memory_spill_threshold=0.8,       # Start spilling when 80% full
    
    # Partitioning for large datasets
    auto_partition=True,
    partition_size_rows=100_000,      # 100k rows per partition
    partition_size_bytes=50 * 1024 * 1024,  # 50MB per partition
    
    # Compression settings
    enable_compression=True,
    compression_type="lz4",           # Fast compression
    dictionary_encoding=True,         # Dictionary encoding for strings
    
    # Performance tuning
    thread_count=4,                   # Worker threads
    cache_query_plans=True,           # Cache query plans for repeated queries
    
    # Storage settings
    spill_to_disk=True,               # Allow spilling to disk
    spill_directory=".arrow_cache_spill",
    persistent_storage=True,
    storage_path=".arrow_cache_storage",
    delete_files_on_close=True,       # Clean up files when closing
)

cache = ArrowCache(config=config)
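To make the numbers in this configuration concrete, the following sketch works out the arithmetic using only the standard library. It assumes that spilling starts once usage reaches `memory_limit * memory_spill_threshold` and that a dataset is split until both the row limit and the byte limit are satisfied. The helper `estimate_partitions` is hypothetical, and the library's actual partitioning logic may differ.

```python
import math

# Values copied from the configuration example above.
memory_limit = 1024 * 1024 * 1024
memory_spill_threshold = 0.8
partition_size_rows = 100_000
partition_size_bytes = 50 * 1024 * 1024

# Bytes in use at which spilling would begin under the assumed rule.
spill_at_bytes = int(memory_limit * memory_spill_threshold)


def estimate_partitions(total_rows: int, total_bytes: int) -> int:
    """Hypothetical helper: partitions needed to satisfy both size limits."""
    by_rows = math.ceil(total_rows / partition_size_rows)
    by_bytes = math.ceil(total_bytes / partition_size_bytes)
    return max(1, by_rows, by_bytes)


print(spill_at_bytes)                                      # 858993459
print(estimate_partitions(2_500_000, 400 * 1024 * 1024))   # 25
```

In this example the row limit dominates: 2.5 million rows need 25 partitions of 100k rows, while 400 MiB would need only 8 partitions of 50 MiB.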

Core Functionality

Storing and Retrieving Data

# Store data with optional TTL (time-to-live)
cache.put('my_dataframe', df, ttl=3600)  # Expire after 1 hour

# Add metadata
cache.put('my_dataframe', df, metadata={'source': 'database', 'version': 2})

# Retrieve data
df = cache.get('my_dataframe')

# Get a slice of data (efficient for large datasets)
df_slice = cache.get('my_dataframe', offset=1000, limit=100)

# Get as a specific type
arrow_table = cache.get('my_dataframe', target_type='arrow')
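The `ttl` argument above implies that an entry stops being returned once its lifetime has passed. A minimal sketch of that behavior, assuming expiry is checked lazily on read, might look like the following. `TTLStore` is a hypothetical stand-in and says nothing about how Arrow Cache itself tracks expiry; the injectable clock is only there to make the example deterministic.

```python
import time


class TTLStore:
    """Toy key-value store with per-entry time-to-live (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        # ttl is in seconds; None means the entry never expires.
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        # Drop the entry on read once its deadline has passed.
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]
            return None
        return value


# Drive the store with a fake clock instead of waiting an hour.
now = [0.0]
ttl_store = TTLStore(clock=lambda: now[0])
ttl_store.put("my_dataframe", "payload", ttl=3600)
fresh = ttl_store.get("my_dataframe")     # still within the hour
now[0] = 3600.0
expired = ttl_store.get("my_dataframe")   # deadline reached, entry dropped
```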

SQL Queries

SQL queries are powered by DuckDB and can be run directly against cached tables:

# Tables are registered with the prefix '_cache_'
result = cache.query('''
    SELECT 
        a, 
        COUNT(*) as count 
    FROM _cache_my_dataframe 
    GROUP BY a
''')

# Explain the query plan
explain = cache.explain('SELECT * FROM _cache_my_dataframe WHERE a > 10')
print(explain)

Memory Management

Arrow Cache tracks its memory usage and reports it through status():

# Check current cache status
status = cache.status()
print(f"Cache size: {status['current_size_bytes'] / 1024 / 1024:.2f} MB")
print(f"Entries: {status['entry_count']}")
print(f"Memory usage: {status['memory']['allocated_bytes'] / 1024 / 1024:.2f} MB")

# Manually clear cache if needed
cache.clear()

Data Persistence

Store and load data to/from disk:

# Persist a dataset to disk
cache.persist('my_dataframe', storage_dir='/path/to/storage')

# Load a persisted dataset
cache.load('my_dataframe', storage_dir='/path/to/storage')
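The feature list mentions atomic operations for data safety. One common way to get an atomic file write, shown here as a generic sketch rather than as Arrow Cache's actual mechanism, is to write to a temporary file in the same directory and then rename it over the target with `os.replace`. The helper name `atomic_write_bytes` and the file name are assumptions made for illustration.

```python
import os
import tempfile


def atomic_write_bytes(path: str, payload: bytes) -> None:
    """Write payload so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before renaming
        os.replace(tmp_path, path)   # atomic swap of the directory entry
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as storage_dir:
    target = os.path.join(storage_dir, "my_dataframe.bin")
    atomic_write_bytes(target, b"first version")
    atomic_write_bytes(target, b"second version")
    with open(target, "rb") as fh:
        stored = fh.read()
    leftovers = [n for n in os.listdir(storage_dir) if n.endswith(".tmp")]
```

If the process dies before `os.replace`, the target still holds the previous complete version; only a stray temporary file is left behind.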

Advanced Features

Working with Large Datasets

Arrow Cache automatically partitions large datasets for efficient memory management:

# Large datasets are automatically partitioned
large_df = pd.read_csv('large_dataset.csv')  # Millions of rows
cache.put('large_data', large_df)

# Get just a slice without loading everything into memory
slice_df = cache.get('large_data', offset=1000000, limit=1000)
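To see why a slice does not require loading the whole dataset, consider how an offset and limit map onto partitions. The sketch below is a plain-Python illustration under the assumption that partitions are stored in row order with known row counts. `plan_slice` is a hypothetical helper, not part of the Arrow Cache API.

```python
def plan_slice(partition_rows, offset, limit):
    """Return (partition_index, start_row, row_count) for each partition touched."""
    plan = []
    partition_start = 0
    remaining = limit
    for index, rows in enumerate(partition_rows):
        if remaining <= 0:
            break
        partition_end = partition_start + rows
        if offset < partition_end:
            # Translate the global offset into a row index inside this partition.
            local_start = max(offset, partition_start) - partition_start
            take = min(rows - local_start, remaining)
            plan.append((index, local_start, take))
            remaining -= take
        partition_start = partition_end
    return plan


# 2.5 million rows split into 25 partitions of 100k rows each.
partitions = [100_000] * 25
print(plan_slice(partitions, offset=1_000_000, limit=1000))
# [(10, 0, 1000)]
print(plan_slice(partitions, offset=199_500, limit=1000))
# [(1, 99500, 500), (2, 0, 500)]
```

In the first call only one of the 25 partitions needs to be read; in the second, the slice straddles a partition boundary and touches two.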

Working with GeoPandas

import geopandas as gpd
from arrow_cache import ArrowCache

cache = ArrowCache()
gdf = gpd.read_file("some_geo_data.geojson")
cache.put('geo_data', gdf)

# Run spatial queries through DuckDB
result = cache.query('''
    SELECT * FROM _cache_geo_data 
    WHERE ST_Contains(geometry, ST_Point(0, 0))
''')

Thread Safety

All cache operations are thread-safe:

import concurrent.futures

def process_batch(batch_id):
    # Each thread can safely access the cache
    data = cache.get('large_data', offset=batch_id * 1000, limit=1000)
    # process_function stands in for your own per-batch logic
    processed = process_function(data)
    return processed

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_batch, range(10)))
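The feature list attributes thread safety to locking. As a generic illustration of that idea (not the library's internals), the sketch below guards shared state with a `threading.RLock` so that concurrent updates from a thread pool are not lost. `LockedHitCounter` is a hypothetical class invented for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedHitCounter:
    """Toy shared structure whose reads and writes are serialized by a lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self._hits = {}

    def record_hit(self, key):
        # The read-modify-write below must not interleave across threads.
        with self._lock:
            self._hits[key] = self._hits.get(key, 0) + 1

    def hits(self, key):
        with self._lock:
            return self._hits.get(key, 0)


hit_counter = LockedHitCounter()


def worker(worker_id):
    for _ in range(1000):
        hit_counter.record_hit("large_data")


with ThreadPoolExecutor(max_workers=4) as executor:
    list(executor.map(worker, range(8)))

print(hit_counter.hits("large_data"))  # 8000
```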

Resource Management

Close the cache when you are finished with it to release its resources, either through a context manager or with an explicit close():

# Use as a context manager
with ArrowCache() as cache:
    cache.put('temp_data', df)
    # Cache and resources are automatically cleaned up when exiting

# Or explicitly close when done
cache = ArrowCache()
try:
    # Use cache
    cache.put('my_data', df)
    result = cache.query('SELECT * FROM _cache_my_data')
finally:
    # This will clean up all resources, including any persisted files
    # if delete_files_on_close=True (default)
    cache.close()

How It Works

Arrow Cache combines the following techniques:

  1. Apache Arrow - For zero-copy, columnar data storage
  2. DuckDB - For metadata storage and high-performance SQL queries
  3. Partitioning - Breaking large datasets into manageable chunks
  4. Memory Tracking - Monitoring memory usage and triggering spilling/eviction
  5. Atomic Operations - Ensuring data integrity during failures
  6. Shared Memory Model - Reducing memory overhead through shared buffers

License

MIT License

Download files

Source distribution: arrow_cache-0.2.2.tar.gz (77.9 kB)

Built distribution: arrow_cache-0.2.2-py3-none-any.whl (82.2 kB, Python 3)
