pystore

Fast data store for Pandas timeseries data

PyStore - Fast data store for Pandas timeseries data

PyStore is a simple (yet powerful) datastore for Pandas dataframes, and while it can store any Pandas object, it was designed with storing timeseries data in mind.

It’s built on top of Pandas, Numpy, Dask, and Parquet (via pyarrow) to provide an easy-to-use datastore for Python developers that can query millions of rows per second per client.

New in 2025 Release (PR #77):

MultiIndex Support - Store and retrieve DataFrames with Pandas MultiIndex
Complex Data Types - Full support for Timedelta, Period, Interval, Categorical dtypes
Timezone-Aware Operations - Proper handling of timezone data with UTC storage
Async/Await Support - Non-blocking I/O operations for better performance
Data Validation Framework - Extensible validation rules for data integrity
Schema Evolution - Handle schema changes over time with flexible strategies
Transaction Support - Atomic operations with rollback capabilities
Performance Optimizations - Streaming operations and memory management

Performance Enhancements (Phase 3 Release):

Streaming Operations - Memory-efficient append for datasets larger than RAM
Batch Processing - 5-10x faster parallel read/write operations
Intelligent Partitioning - Automatic time-based and size-based partitioning
Memory Management - 70-90% memory reduction with monitoring and optimization
Metadata Caching - 100x faster metadata access with TTL cache
Query Optimization - Column selection and predicate pushdown at storage level

Performance improvements include:

Append 1M rows: 3.75x faster, 90% less memory
Batch operations: 6x faster for multiple items
Column selection: 4x faster when reading subset of columns
Filtered reads: 8x faster with predicate pushdown
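
The metadata caching item above mentions a TTL cache. As a rough illustration of that idea only (this is not PyStore's code; `TTLCache` and `load_metadata` are hypothetical names), a time-to-live cache can be sketched with the standard library like this:

```python
import time


class TTLCache:
    """Tiny time-to-live cache: entries expire `ttl` seconds after being stored."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries = {}   # key -> (expiry_time, value)

    def get(self, key, loader):
        """Return the cached value, calling `loader()` only on a miss or after expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader()
        self._entries[key] = (now + self.ttl, value)
        return value


# Hypothetical loader standing in for an expensive metadata read from disk.
calls = []


def load_metadata():
    calls.append(1)
    return {"source": "yfinance"}


cache = TTLCache(ttl=60)
first = cache.get("AAPL", load_metadata)
second = cache.get("AAPL", load_metadata)  # served from memory, loader not called again
```

Repeated lookups within the TTL window skip the loader entirely, which is where the speed-up of such a cache comes from.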

==> Check out this Blog post for the reasoning and philosophy behind PyStore, as well as a detailed tutorial with code examples.

==> Follow this PyStore tutorial in Jupyter notebook format.

Quickstart

Install PyStore

Install using pip:

$ pip install pystore --upgrade --no-cache-dir

Install using conda:

$ conda install -c ranaroussi pystore

INSTALLATION NOTE: If you don’t have Snappy installed (compression/decompression library), you’ll need to install it first.

Using PyStore

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pystore
import yfinance as yf

# Set storage path (optional)
# Defaults to `~/pystore` or `PYSTORE_PATH` environment variable (if set)
pystore.set_path("~/pystore")

# List stores
pystore.list_stores()

# Connect to datastore (create it if it does not exist)
store = pystore.store('mydatastore')

# List existing collections
store.list_collections()

# Access a collection (create it if it does not exist)
collection = store.collection('NASDAQ')

# List items in collection
collection.list_items()

# Load some data from yfinance
aapl = yf.download("AAPL", multi_level_index=False)

# Store the first 100 rows of the data in the collection under "AAPL"
collection.write('AAPL', aapl[:100], metadata={'source': 'yfinance'})

# Reading the item's data
item = collection.item('AAPL')
data = item.data  # <-- Dask dataframe (see dask.pydata.org)
metadata = item.metadata
df = item.to_pandas()

# Append the rest of the rows to the "AAPL" item
collection.append('AAPL', aapl[100:])

# Reading the item's data
item = collection.item('AAPL')
data = item.data
metadata = item.metadata
df = item.to_pandas()


# --- Query functionality ---

# Query available symbols based on metadata
collection.list_items(some_key='some_value', other_key='other_value')


# --- Snapshot functionality ---

# Snapshot a collection
# (Point-in-time named reference for all current symbols in a collection)
collection.create_snapshot('snapshot_name')

# List available snapshots
collection.list_snapshots()

# Get a version of a symbol given a snapshot name
collection.item('AAPL', snapshot='snapshot_name')

# Delete a collection snapshot
collection.delete_snapshot('snapshot_name')


# ...


# Delete the item from the current version
collection.delete_item('AAPL')

# Delete the collection
store.delete_collection('NASDAQ')
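
The metadata query shown above, `collection.list_items(some_key='some_value', ...)`, selects items whose metadata matches every keyword given. A minimal sketch of that matching logic in plain Python follows; `filter_items` and the sample `catalog` are hypothetical and only illustrate the idea, not how PyStore implements it:

```python
def filter_items(metadata_by_item, **criteria):
    """Return the names of items whose metadata matches every key=value pair."""
    return sorted(
        name
        for name, metadata in metadata_by_item.items()
        if all(
            key in metadata and metadata[key] == value
            for key, value in criteria.items()
        )
    )


# Hypothetical metadata, shaped like the dict passed to collection.write(...)
catalog = {
    "AAPL": {"source": "yfinance", "exchange": "NASDAQ"},
    "MSFT": {"source": "yfinance", "exchange": "NASDAQ"},
    "TEST": {"source": "manual"},
}

matches = filter_items(catalog, source="yfinance", exchange="NASDAQ")
```

With no criteria every item matches, which mirrors calling `list_items()` without arguments.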

Advanced Features

Async Operations:

import asyncio
from pystore import async_pystore

async def async_example(df):
    async with async_pystore.store('mydatastore') as store:
        async with store.collection('NASDAQ') as collection:
            # Async write
            await collection.write('AAPL', df)
            # Async read (bound to a new name so the `df` argument is not shadowed)
            loaded = await collection.item('AAPL').to_pandas()
            return loaded

# `df` is a Pandas dataframe prepared beforehand
asyncio.run(async_example(df))

Data Validation:

from pystore import create_validator, ColumnExistsRule, RangeRule

# Create a validator
validator = create_validator([
    ColumnExistsRule(['Open', 'High', 'Low', 'Close']),
    RangeRule('Close', min_value=0)
])

# Apply validator to collection
collection.set_validator(validator)
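
To make the rule-based pattern above more concrete, here is a small standard-library sketch of how a list of rules might be applied to a table. The classes `RequireColumns` and `MinValue` and the function `validate` are hypothetical stand-ins, and the table is a plain dict of column lists rather than a dataframe; PyStore's own rule classes may behave differently.

```python
class RequireColumns:
    """Rule: every listed column must be present."""

    def __init__(self, columns):
        self.columns = list(columns)

    def check(self, table):
        return [f"missing column: {c}" for c in self.columns if c not in table]


class MinValue:
    """Rule: all values in a column must be >= min_value (skipped if column is absent)."""

    def __init__(self, column, min_value):
        self.column = column
        self.min_value = min_value

    def check(self, table):
        bad = [v for v in table.get(self.column, []) if v < self.min_value]
        if bad:
            return [f"{self.column} has {len(bad)} value(s) below {self.min_value}"]
        return []


def validate(table, rules):
    """Run every rule and collect all error messages (an empty list means valid)."""
    errors = []
    for rule in rules:
        errors.extend(rule.check(table))
    return errors


rules = [RequireColumns(["Open", "High", "Low", "Close"]), MinValue("Close", 0)]
good = {"Open": [1.0], "High": [2.0], "Low": [0.5], "Close": [1.5]}
bad = {"Open": [1.0], "Close": [-3.0]}
```

Keeping each rule as a small object with a single `check` method is what makes such a framework extensible: a new rule only needs to return a list of error messages.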

Schema Evolution:

from pystore import SchemaEvolution, EvolutionStrategy

# Enable schema evolution
evolution = collection.enable_schema_evolution(
    'AAPL',
    strategy=EvolutionStrategy.FLEXIBLE
)

# Schema changes are handled automatically during append
collection.append('AAPL', new_data_with_extra_columns)
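
What a flexible strategy does with extra columns is not spelled out here. One plausible behaviour, sketched below under that assumption, is to take the union of old and new columns and fill the gaps with missing values. The function `append_flexible` is hypothetical and works on plain lists of dicts, not on PyStore items.

```python
def append_flexible(existing_rows, new_rows):
    """Append rows while taking the union of columns; gaps are filled with None."""
    all_rows = existing_rows + new_rows
    columns = []
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    merged = [{name: row.get(name) for name in columns} for row in all_rows]
    return columns, merged


old = [{"Close": 10.0, "Volume": 100}]
new = [{"Close": 11.0, "Volume": 120, "Dividends": 0.2}]
columns, rows = append_flexible(old, new)
```

Rows written before the schema change simply carry `None` for the new `Dividends` column, so older data stays readable alongside newer data.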

Complex Data Types:

import pandas as pd

# DataFrames with Period, Interval, Categorical types
# (every column must have the same length, three rows here)
df = pd.DataFrame({
    'period': pd.period_range('2024-01', periods=3, freq='M'),
    'interval': pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (2, 3)]),
    'category': pd.Categorical(['A', 'B', 'A']),
    'nested': [{'key': 'value'}, [1, 2, 3], None]
})
collection.write('complex_data', df)

Performance Features:

import pandas as pd

# Streaming append for large datasets
def data_generator():
    for chunk in pd.read_csv('huge_file.csv', chunksize=100000):
        yield chunk

collection.append_stream('large_data', data_generator())

# Batch operations (df1, df2 and df3 are existing Pandas dataframes)
items_to_write = {
    'item1': df1,
    'item2': df2,
    'item3': df3
}
collection.write_batch(items_to_write, parallel=True)

# Read multiple items efficiently
results = collection.read_batch(['item1', 'item2', 'item3'])

# Memory-optimized reading
from pystore.memory import optimize_dataframe_memory, read_in_chunks

# Optimize DataFrame memory usage
df = collection.item('large_item').to_pandas()
df_optimized = optimize_dataframe_memory(df)  # Up to 70% memory reduction

# Read in chunks for processing
for chunk in read_in_chunks(collection, 'large_item', chunk_size=50000):
    # Process chunk - automatically garbage collected
    process(chunk)
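
The streaming and chunked-read examples above rely on one idea: never hold the whole dataset in memory, only one chunk at a time. A standard-library sketch of that chunking pattern is shown below; `iter_chunks` is a hypothetical helper and the small in-memory CSV stands in for `huge_file.csv`.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows without materialising the input."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# A small in-memory CSV stands in for a file too large to load at once.
source = io.StringIO("price,volume\n1,10\n2,20\n3,30\n4,40\n5,50\n")
reader = csv.DictReader(source)

chunk_sizes = [len(chunk) for chunk in iter_chunks(reader, chunk_size=2)]
```

Because the generator pulls rows lazily, peak memory is bounded by the chunk size rather than by the size of the file.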

Query Optimization:

# Column selection - read only what you need
item = collection.item('data')
df = item.to_pandas(columns=['price', 'volume'])  # 4x faster for subset

# Filter at storage level
df = item.to_pandas(filters=[('price', '>', 100)])  # 8x faster
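
Predicate pushdown is faster because Parquet keeps min/max statistics per row group, so a reader can skip whole groups that cannot satisfy a filter. The sketch below illustrates that pruning step with the same `(column, operator, value)` filter tuples used above. The helpers `group_may_match` and `read_filtered` and the dict-based row groups are hypothetical simplifications, not PyStore or pyarrow code.

```python
import operator

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def group_may_match(stats, column, op, value):
    """Use min/max statistics to decide whether a row group could contain matches."""
    low, high = stats[column]
    if op in (">", ">="):
        return OPS[op](high, value)
    if op in ("<", "<="):
        return OPS[op](low, value)
    if op == "==":
        return low <= value <= high
    raise ValueError(f"unsupported operator: {op}")


def read_filtered(row_groups, filters):
    """Skip row groups that cannot match, then filter rows in the remaining ones."""
    result, groups_read = [], 0
    for group in row_groups:
        if not all(group_may_match(group["stats"], *f) for f in filters):
            continue  # pruned without touching its rows
        groups_read += 1
        for row in group["rows"]:
            if all(OPS[op](row[col], val) for col, op, val in filters):
                result.append(row)
    return result, groups_read


row_groups = [
    {"stats": {"price": (10, 90)}, "rows": [{"price": 10}, {"price": 90}]},
    {"stats": {"price": (95, 150)}, "rows": [{"price": 95}, {"price": 150}]},
]
rows, groups_read = read_filtered(row_groups, [("price", ">", 100)])
```

Here the first row group is never scanned because its maximum price is below the threshold; only the second group is read and filtered row by row.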

Using Dask schedulers

PyStore supports using Dask distributed.

To use a local Dask scheduler, add this to your code:

from dask.distributed import LocalCluster
pystore.set_client(LocalCluster())

To use a distributed Dask scheduler, add this to your code:

pystore.set_client("tcp://xxx.xxx.xxx.xxx:xxxx")
pystore.set_path("/path/to/shared/volume/all/workers/can/access")

Concepts

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user or some other metric (for example, frequency: End-Of-Day, Minute Bars, etc.). Each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g., symbol).

A good practice is to create collections that may look something like this:

collection.EOD
collection.ONEMINUTE
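
Since each collection maps to a directory, the namespace structure can be pictured as nested folders. The sketch below assumes a `<root>/<store>/<collection>/<item>` layout, which is an assumption drawn from the description above and not a guarantee about PyStore's exact on-disk format; `item_path` and `list_collections` are hypothetical helpers.

```python
import tempfile
from pathlib import Path


def item_path(root, store, collection, item):
    """Build the directory for one item: <root>/<store>/<collection>/<item>."""
    return Path(root) / store / collection / item


def list_collections(root, store):
    """Under this layout, collections are the sub-directories of a store directory."""
    return sorted(p.name for p in (Path(root) / store).iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as root:
    # Create the same item under two frequency-based collections.
    for collection in ("EOD", "ONEMINUTE"):
        item_path(root, "mydatastore", collection, "AAPL").mkdir(parents=True)
    collections = list_collections(root, "mydatastore")
    relative = item_path(root, "mydatastore", "EOD", "AAPL").relative_to(root)
```

The same symbol can live in several collections without conflict because the collection name is part of the path.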

Requirements

Python >= 3.8
Pandas >= 2.0
Numpy >= 1.20
Dask >= 2023.1
PyArrow >= 10.0 (Parquet engine)
Snappy (Google’s compression/decompression library)
multitasking
pytest-asyncio (for async testing)

PyStore was tested to work on *nix-like systems, including macOS.

Dependencies:

PyStore utilizes Snappy, a fast and efficient compression/decompression library developed by Google. You’ll need to install Snappy on your system before installing PyStore.

* See the python-snappy GitHub repo for more information.

*nix Systems:

APT: sudo apt-get install libsnappy-dev
RPM: sudo yum install libsnappy-devel

macOS:

First, install Snappy’s C library using Homebrew:

$ brew install snappy

Then, install Python’s snappy using conda:

$ conda install python-snappy -c conda-forge

…or, using pip:

$ CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy

Windows:

Windows users should check out Snappy for Windows and this Stack Overflow post for help on installing Snappy and python-snappy.
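
After following the steps for your platform, it can be useful to confirm that the Python bindings are importable before installing PyStore. A small check using only the standard library might look like this (`has_module` is a hypothetical helper; python-snappy is imported under the module name `snappy`):

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# python-snappy installs a module named `snappy`
snappy_available = has_module("snappy")
print("python-snappy importable:", snappy_available)
```

This only checks that the module can be located; it does not verify that the underlying C library loads correctly.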

Current Status

Core Features:

Local filesystem support with Parquet storage
Full Pandas DataFrame compatibility, including MultiIndex
Snapshots for point-in-time data versioning
Metadata support for data organization

Advanced Features (July 2025 Release):

Complex data type serialization (Period, Interval, Categorical, nested objects)
Timezone-aware datetime handling with UTC storage
Async/await operations for non-blocking I/O
Data validation framework with extensible rules
Schema evolution for handling data structure changes
Transaction support with rollback capabilities

Performance Features:

Streaming operations for datasets larger than RAM
Batch read/write with parallel processing
Intelligent partitioning (time-based and size-based)
Memory optimization with automatic type downcasting
Metadata caching for faster access
Query optimization with column selection and predicate pushdown

Known Limitations:

MultiIndex append operations have limited support due to Dask limitations. A workaround converts the MultiIndex to regular columns, but it may not fully preserve the MultiIndex structure after an append (the corresponding test remains marked as an expected failure)
Exact index metadata is not always preserved, owing to Parquet limitations

Future Plans:

Amazon S3 support (via s3fs)
Google Cloud Storage support (via gcsfs)
Hadoop Distributed File System support (via hdfs3)

Acknowledgements

PyStore is hugely inspired by Man AHL’s Arctic, which uses MongoDB for storage and allows for versioning and other features. I highly recommend you check it out.

License

PyStore is licensed under the Apache License, Version 2.0, a copy of which is included in LICENSE.txt.

I’m very interested in your experience with PyStore. Please drop me a note with any feedback you have.

Contributions welcome!

- Ran Aroussi

Release history

Version	Release date
1.0.1	Jul 22, 2025
1.0.0	Jul 20, 2025
0.1.24	Jul 10, 2024
0.1.23	Feb 11, 2022
0.1.22	Sep 27, 2020
0.1.21	Sep 27, 2020
0.1.20	Sep 27, 2020
0.1.19	Sep 27, 2020
0.1.18	Sep 27, 2020
0.1.17	May 30, 2020
0.1.16	May 30, 2020
0.1.15	Oct 27, 2019
0.1.14	Sep 4, 2019
0.1.13	Aug 22, 2019
0.1.12	Aug 4, 2019
0.1.11	Aug 2, 2019
0.1.10	Aug 2, 2019
0.1.9	May 22, 2019
0.1.8	Apr 1, 2019
0.1.7	Apr 1, 2019
0.1.6	Mar 23, 2019
0.1.5	Sep 29, 2018
0.1.4	Sep 19, 2018
0.1.3	Sep 7, 2018
0.1.2	Sep 6, 2018
0.1.1	Sep 3, 2018
0.1.0	Jul 26, 2018
0.0.111	Jun 13, 2018
0.0.12	Jul 2, 2018
0.0.10	Jun 6, 2018
0.0.9	Jun 5, 2018
0.0.8	Jun 3, 2018
0.0.7	Jun 3, 2018
0.0.6	Jun 3, 2018
0.0.5	Jun 2, 2018
0.0.4	May 27, 2018
0.0.2	May 27, 2018
0.0.1	May 26, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pystore-1.0.1.tar.gz (65.2 kB)

Uploaded Jul 22, 2025 Source

Built Distribution

pystore-1.0.1-py3-none-any.whl (52.0 kB)

Uploaded Jul 22, 2025 Python 3

File details

Details for the file pystore-1.0.1.tar.gz.

File metadata

Download URL: pystore-1.0.1.tar.gz
Upload date: Jul 22, 2025
Size: 65.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pystore-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`0ae3e3444ed9b06702bc0bfb6d1df60dc3df93ca271d1b9d85c82b27c9e81f4b`
MD5	`1a674212067d553a6b0825f273c931b1`
BLAKE2b-256	`fc93eff9adbc02f775f48eb93f3a13c1ed19d7c9fcc27b2a23c7f0b58204e904`

File details

Details for the file pystore-1.0.1-py3-none-any.whl.

File metadata

Download URL: pystore-1.0.1-py3-none-any.whl
Upload date: Jul 22, 2025
Size: 52.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for pystore-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b40cd21d0c03dfc526119555ceed0015dd172f9e86eb877686b614bd7ff23b9b`
MD5	`4ff3c737d40d9b1cd736a213b0b20cfd`
BLAKE2b-256	`5823645d9b61d81aca7c77ac9c44203247d465e8f763121233334ef5c0f27d12`
