High-performance caching system using Apache Arrow and DuckDB

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Arrow Cache

A high-performance caching system for data frames and tables using Apache Arrow and DuckDB with efficient memory management and persistence capabilities.

Features

- High-performance storage - Store pandas DataFrames, GeoPandas GeoDataFrames, Parquet files, and Arrow tables with minimal overhead
- Zero-copy data access - Fast data access using Arrow's shared memory model
- Memory efficiency - Intelligent partitioning of large datasets to manage memory usage
- SQL query capabilities - DuckDB-powered SQL queries against cached tables
- Persistence - Store and load data to/from disk with atomic operations for data safety
- Automatic cache eviction - LRU, LFU, and other eviction policies to manage memory pressure
- Memory-aware spilling - Automatically spill partitions to disk when memory is low
- Thread-safe operations - Proper locking for concurrent access to the cache
- Metadata management - All metadata stored efficiently in DuckDB
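
To make the eviction idea concrete, here is a minimal sketch of a least-recently-used (LRU) policy built on the standard library's OrderedDict. It is an illustration of the general technique only, not Arrow Cache's implementation: the class name LRUCache and the entry-count limit are assumptions made for the example (the real cache is described as limiting memory in bytes).

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache bounded by entry count (illustrative only)."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def put(self, key, value):
        # Re-inserting a key counts as a use, so move it to the "recent" end.
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        # Evict from the "oldest" end until we are back under the limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def keys(self):
        return list(self._entries)


lru = LRUCache(max_entries=2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")      # "a" is now more recent than "b"
lru.put("c", 3)   # evicts "b", the least recently used key
lru_keys = lru.keys()
```

An LFU policy would instead track a use counter per key and evict the entry with the smallest count.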

Installation

pip install arrow_cache

With GeoPandas support:

pip install "arrow_cache[geo]"

Quick Start

import pandas as pd
from arrow_cache import ArrowCache

# Create a cache with default settings
cache = ArrowCache()

# Cache a pandas DataFrame
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
cache.put('my_dataframe', df)

# Query the cached data using SQL
result_df = cache.query('SELECT * FROM _cache_my_dataframe WHERE a > 1')

# Retrieve the data
cached_df = cache.get('my_dataframe')  # Returns a pandas DataFrame

# When finished, close the cache to clean up resources
cache.close()

Configuration

Arrow Cache offers extensive configuration options:

from arrow_cache import ArrowCache, ArrowCacheConfig

config = ArrowCacheConfig(
    # Memory management
    memory_limit=1024 * 1024 * 1024,  # 1GB cache limit
    memory_spill_threshold=0.8,       # Start spilling when 80% full
    
    # Partitioning for large datasets
    auto_partition=True,
    partition_size_rows=100_000,      # 100k rows per partition
    partition_size_bytes=50 * 1024 * 1024,  # 50MB per partition
    
    # Compression settings
    enable_compression=True,
    compression_type="lz4",           # Fast compression
    dictionary_encoding=True,         # Dictionary encoding for strings
    
    # Performance tuning
    thread_count=4,                   # Worker threads
    cache_query_plans=True,           # Cache query plans for repeated queries
    
    # Storage settings
    spill_to_disk=True,               # Allow spilling to disk
    spill_directory=".arrow_cache_spill",
    persistent_storage=True,
    storage_path=".arrow_cache_storage",
    delete_files_on_close=True,       # Clean up files when closing
)

cache = ArrowCache(config=config)
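
As a rough worked example of what these numbers imply, the sketch below computes the usage level at which spilling would begin and estimates how many partitions a dataset would need. The helper names are hypothetical, and the rule that the stricter of the row and byte limits wins is an assumption made for illustration, not documented behavior.

```python
import math

# Values copied from the configuration example.
memory_limit = 1024 * 1024 * 1024        # bytes
memory_spill_threshold = 0.8             # fraction of memory_limit
partition_size_rows = 100_000
partition_size_bytes = 50 * 1024 * 1024


def spill_trigger_bytes(limit, threshold):
    """Usage level (in bytes) at which spilling would begin."""
    return int(limit * threshold)


def estimated_partitions(total_rows, total_bytes):
    """Partitions needed so that neither the row cap nor the byte cap is exceeded.

    Assumption: both caps apply and the stricter one decides.
    """
    by_rows = math.ceil(total_rows / partition_size_rows)
    by_bytes = math.ceil(total_bytes / partition_size_bytes)
    return max(1, by_rows, by_bytes)


trigger = spill_trigger_bytes(memory_limit, memory_spill_threshold)
# Hypothetical dataset: 1,000,000 rows occupying 200 MiB.
parts = estimated_partitions(1_000_000, 200 * 1024 * 1024)
print(f"spilling would start near {trigger / 1024 / 1024:.0f} MiB; about {parts} partitions")
```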

Core Functionality

Storing and Retrieving Data

# Store data with optional TTL (time-to-live)
cache.put('my_dataframe', df, ttl=3600)  # Expire after 1 hour

# Add metadata
cache.put('my_dataframe', df, metadata={'source': 'database', 'version': 2})

# Retrieve data
df = cache.get('my_dataframe')

# Get a slice of data (efficient for large datasets)
df_slice = cache.get('my_dataframe', offset=1000, limit=100)

# Get as a specific type
arrow_table = cache.get('my_dataframe', target_type='arrow')
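
The ttl argument follows the usual time-to-live pattern: each entry carries an expiry timestamp and is treated as missing once that time has passed. The sketch below shows one common way to implement this with lazy expiry on read. TTLStore and its injectable clock are hypothetical names for the example and say nothing about how Arrow Cache expires entries internally.

```python
import time


class TTLStore:
    """Dictionary-backed store whose entries expire after a per-key TTL (illustrative)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries = {}           # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._entries[key] = (value, expires_at)

    def get(self, key, default=None):
        item = self._entries.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._entries[key]   # lazily drop the expired entry on access
            return default
        return value


now = [0.0]                          # fake clock, advanced manually below
store = TTLStore(clock=lambda: now[0])
store.put("my_dataframe", "payload", ttl=3600)
before_expiry = store.get("my_dataframe")
now[0] = 3600.0                      # one hour later
after_expiry = store.get("my_dataframe")
```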

SQL Queries

SQL queries are powered by DuckDB and can be run directly against cached tables:

# Tables are registered with the prefix '_cache_'
result = cache.query('''
    SELECT 
        a, 
        COUNT(*) as count 
    FROM _cache_my_dataframe 
    GROUP BY a
''')

# Explain the query plan
explain = cache.explain('SELECT * FROM _cache_my_dataframe WHERE a > 10')
print(explain)

Memory Management

Arrow Cache tracks its own memory usage. Call status() to inspect it and clear() to empty the cache manually:

# Check current cache status
status = cache.status()
print(f"Cache size: {status['current_size_bytes'] / 1024 / 1024:.2f} MB")
print(f"Entries: {status['entry_count']}")
print(f"Memory usage: {status['memory']['allocated_bytes'] / 1024 / 1024:.2f} MB")

# Manually clear cache if needed
cache.clear()
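
If you log cache health regularly, a small formatter keeps the output consistent. The sketch below works on a plain dictionary and relies only on the three keys used in the example above; summarize_status is a hypothetical helper, and the sample numbers are made up.

```python
def summarize_status(status):
    """Format the status fields used in this section as a one-line summary."""
    mib = 1024 * 1024
    size_mb = status["current_size_bytes"] / mib
    allocated_mb = status["memory"]["allocated_bytes"] / mib
    return (
        f"{status['entry_count']} entries, "
        f"{size_mb:.2f} MB cached, "
        f"{allocated_mb:.2f} MB allocated"
    )


# Made-up numbers shaped like the dictionary returned by cache.status().
sample_status = {
    "current_size_bytes": 5 * 1024 * 1024,
    "entry_count": 3,
    "memory": {"allocated_bytes": 8 * 1024 * 1024},
}
summary = summarize_status(sample_status)
print(summary)
```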

Data Persistence

Store and load data to/from disk:

# Persist a dataset to disk
cache.persist('my_dataframe', storage_dir='/path/to/storage')

# Load a persisted dataset
cache.load('my_dataframe', storage_dir='/path/to/storage')
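
The "atomic operations" mentioned in the feature list usually mean the write-to-temp-then-rename pattern: data goes to a temporary file first and is moved into place in a single step, so a crash never leaves a half-written file under the final name. The sketch below shows that general pattern with the standard library. It is an assumption about the technique, not a description of Arrow Cache's on-disk format, and atomic_write_bytes is a hypothetical helper.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write data to path so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the target directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before the rename
        os.replace(tmp_path, path)   # atomic swap into place
    except BaseException:
        # Clean up the partial temporary file and re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as storage_dir:
    target = os.path.join(storage_dir, "my_dataframe.bin")
    atomic_write_bytes(target, b"first version")
    atomic_write_bytes(target, b"second version")
    with open(target, "rb") as fh:
        stored = fh.read()
    leftovers = [name for name in os.listdir(storage_dir) if name.endswith(".tmp")]
```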

Advanced Features

Working with Large Datasets

Arrow Cache automatically partitions large datasets for efficient memory management:

# Large datasets are automatically partitioned
large_df = pd.read_csv('large_dataset.csv')  # Millions of rows
cache.put('large_data', large_df)

# Get just a slice without loading everything into memory
slice_df = cache.get('large_data', offset=1000000, limit=1000)
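
To see why a slice does not require the whole dataset, consider how an offset and limit map onto fixed-size partitions: only the partitions that overlap the requested row range need to be read. The sketch below performs that mapping. It assumes every partition holds exactly partition_size_rows rows and that the dataset is long enough to cover the slice; partitions_for_slice is a hypothetical helper, not part of the library.

```python
def partitions_for_slice(offset, limit, partition_size_rows):
    """Return (partition_index, start_row_in_partition, row_count) per partition touched."""
    if offset < 0 or limit < 0 or partition_size_rows < 1:
        raise ValueError("offset/limit must be non-negative and partition size positive")
    pieces = []
    remaining = limit
    row = offset
    while remaining > 0:
        index = row // partition_size_rows            # which partition holds this row
        start = row % partition_size_rows             # position inside that partition
        count = min(remaining, partition_size_rows - start)
        pieces.append((index, start, count))
        row += count
        remaining -= count
    return pieces


# offset=1,000,000 and limit=1,000 with 100,000-row partitions.
single = partitions_for_slice(offset=1_000_000, limit=1000, partition_size_rows=100_000)
# A slice that straddles a partition boundary.
straddling = partitions_for_slice(offset=99_500, limit=1000, partition_size_rows=100_000)
```

With these inputs the first slice touches a single partition, while the second is split across two adjacent ones.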

Working with GeoPandas

import geopandas as gpd
from arrow_cache import ArrowCache

cache = ArrowCache()
gdf = gpd.read_file("some_geo_data.geojson")
cache.put('geo_data', gdf)

# Run spatial queries through DuckDB
result = cache.query('''
    SELECT * FROM _cache_geo_data 
    WHERE ST_Contains(geometry, ST_Point(0, 0))
''')

Thread Safety

All cache operations are thread-safe:

import concurrent.futures

def process_function(data):
    # Placeholder for your own processing logic (not part of arrow_cache)
    return len(data)

def process_batch(batch_id):
    # Each thread can safely access the shared cache
    data = cache.get('large_data', offset=batch_id * 1000, limit=1000)
    return process_function(data)

# Process ten 1,000-row batches on four worker threads
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_batch, range(10)))
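
For background, thread safety in a cache typically comes from guarding every read-modify-write sequence with a lock. The sketch below demonstrates the pattern on a plain dictionary using threading.RLock. LockedCounterCache is a hypothetical class for illustration; the locking strategy Arrow Cache actually uses is not specified here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedCounterCache:
    """Dictionary guarded by a re-entrant lock so compound updates stay consistent."""

    def __init__(self):
        self._lock = threading.RLock()
        self._data = {}

    def increment(self, key):
        # Read-modify-write must happen under one lock acquisition,
        # otherwise two threads could both read the same old value.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + 1

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


shared = LockedCounterCache()


def worker(_task_id):
    for _ in range(1000):
        shared.increment("hits")


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(worker, range(8)))   # 8 tasks, 1000 increments each

total_hits = shared.get("hits")
```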

Resource Management

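Close the cache when you are done so that it releases its resources, either by using it as a context manager or by calling close() explicitly: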

# Use as a context manager
with ArrowCache() as cache:
    cache.put('temp_data', df)
    # Cache and resources are automatically cleaned up when exiting

# Or explicitly close when done
cache = ArrowCache()
try:
    # Use cache
    cache.put('my_data', df)
    result = cache.query('SELECT * FROM _cache_my_data')
finally:
    # This will clean up all resources, including any persisted files
    # if delete_files_on_close=True (default)
    cache.close()
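
The two styles above are equivalent because a context manager simply calls the cleanup method for you, even when the body raises. The sketch below shows the protocol on a stand-in class. ManagedResource is hypothetical and only mirrors the close() and with-statement behavior described in this section.

```python
class ManagedResource:
    """Object that releases its resources on close() and supports the with statement."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Idempotent: calling close() twice is harmless.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False   # do not swallow exceptions raised inside the with block


with ManagedResource() as resource:
    inside = resource.closed       # still open inside the block

try:
    with ManagedResource() as failing:
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass                           # close() still ran before the error propagated
```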

How It Works

Arrow Cache combines the following techniques:

- Apache Arrow - For zero-copy, columnar data storage
- DuckDB - For metadata storage and high-performance SQL queries
- Partitioning - Breaking large datasets into manageable chunks
- Memory Tracking - Monitoring memory usage and triggering spilling/eviction
- Atomic Operations - Ensuring data integrity during failures
- Shared Memory Model - Reducing memory overhead through shared buffers

License

MIT License

Release history

0.2.2

Apr 16, 2025

This version

0.2.1

Apr 10, 2025

0.2.0

Apr 10, 2025

Download files

Source Distribution

arrow_cache-0.2.1.tar.gz (68.1 kB)

Uploaded Apr 10, 2025 Source

Built Distribution

arrow_cache-0.2.1-py3-none-any.whl (72.1 kB)

Uploaded Apr 10, 2025 Python 3

File details

Details for the file arrow_cache-0.2.1.tar.gz.

File metadata

Download URL: arrow_cache-0.2.1.tar.gz
Upload date: Apr 10, 2025
Size: 68.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for arrow_cache-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`2b5f4845a653ab538dcc78f461e18cb07ffce1b295b369ebb195af55f3754279`
MD5	`a8286bcf98ebe46f5ce861e28017742c`
BLAKE2b-256	`65f8d05c7e183d02b978102468dd190ab0d0b073e3137285365ba7036cea4963`

File details

Details for the file arrow_cache-0.2.1-py3-none-any.whl.

File metadata

Download URL: arrow_cache-0.2.1-py3-none-any.whl
Upload date: Apr 10, 2025
Size: 72.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.14

File hashes

Hashes for arrow_cache-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0003d7ac8ec37506279771748820e744b2d3c8e9b983b6d2a45d3594465da61d`
MD5	`11ea3a7bd5abebe282f41fdc99370497`
BLAKE2b-256	`5efa00beaf5151279f77830fe3fc571b520450d82f1d6e0b5ea8b6be91716abb`
