File Storage Made Simple

AutoStore - File Storage With Automatic Backend Detection

AutoStore provides a dictionary-like interface for reading and writing files from cloud storage and local filesystems.

AutoStore removes the overhead of managing different file formats and storage backends, so you can focus on your data and analysis rather than the mechanics of file I/O. It detects storage backends from URI prefixes (s3://, gcs://, etc.), detects file formats, infers data types, and exposes a single API for persisting data across local and cloud storage.
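
As a rough illustration of prefix-based detection, the sketch below uses only the standard library. The names `detect_backend` and `BACKENDS` are hypothetical and exist only for this example; AutoStore's internal registry may work differently.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-backend table; not AutoStore's actual registry.
BACKENDS = {"": "local", "file": "local", "s3": "s3"}

def detect_backend(uri: str) -> str:
    """Return a backend name based on the URI scheme (no scheme means a local path)."""
    # Only treat the prefix as a scheme when it is followed by "://",
    # so plain paths are never mistaken for URIs.
    scheme = urlparse(uri).scheme if "://" in uri else ""
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"Unsupported scheme: {scheme!r}") from None

print(detect_backend("./data"))                # local
print(detect_backend("s3://my-bucket/data/"))  # s3
```

Schemes missing from the table raise a `ValueError` here, which is one possible design choice for surfacing an unsupported backend early.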

Features

  • Automatically detects storage type from URI prefixes
  • Use multiple S3-compatible services (AWS, Conductor, MinIO, etc.) with different configurations
  • Access any storage backend from a single store instance using URI syntax
  • Automatically handles both individual files and multi-file datasets (parquet, CSV collections)
  • Caching system with configurable expiration reduces redundant downloads
  • Built-in support for Polars DataFrames, JSON, CSV, images, PyTorch models, NumPy arrays, and more
  • Configuration with IDE support and validation for each service

Getting Started

AutoStore requires Python 3.10+ and can be installed via pip.

pip install autostore

Basic Usage - Zero Configuration

import polars as pl
from autostore import AutoStore

# Local storage - no configuration needed
store = AutoStore("./data")

# Example DataFrame to persist
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Write data - automatically saves with appropriate extensions
store["my_dataframe"] = df           # ./data/my_dataframe.parquet
store["config"] = {"key": "value"}   # ./data/config.json
store["logs"] = [{"event": "start"}] # ./data/logs.jsonl

# Read data
df = store["my_dataframe"]           # Returns a Polars DataFrame
config = store["config"]             # Returns a dict
logs = store["logs"]                 # Returns a list of dicts
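
The extension choices in the comments above can be pictured as a lookup from the value's type to a default extension. The function below is a minimal sketch of that idea, assuming the mappings listed under Supported Data Types; `infer_extension` is a hypothetical name, not part of the AutoStore API.

```python
def infer_extension(data) -> str:
    """Pick a default file extension from the type of the value being stored."""
    # A non-empty list made up only of dicts is written as JSON Lines.
    if isinstance(data, list) and data and all(isinstance(x, dict) for x in data):
        return ".jsonl"
    # Any other dict or list is written as plain JSON.
    if isinstance(data, (dict, list)):
        return ".json"
    # Strings are written as text files.
    if isinstance(data, str):
        return ".txt"
    # Everything else falls back to pickle.
    return ".pkl"

print(infer_extension({"key": "value"}))      # .json
print(infer_extension([{"event": "start"}]))  # .jsonl
```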

Cloud Storage - Automatic Detection

from autostore import AutoStore
from autostore.s3 import S3Options

# S3 - automatically detected from s3:// prefix
store = AutoStore(
    "s3://my-bucket/data/",
    profile_name="my-profile",
    cache_enabled=True
)

# Or with explicit options
options = S3Options(
    profile_name="my-profile",
    region_name="us-east-1",
    cache_enabled=True,
    cache_expiry_hours=12
)
store = AutoStore("s3://my-bucket/data/", options=options)

# Write data to S3
store["experiment/results"] = {"accuracy": 0.95, "epochs": 100}

# Read data from S3
results = store["experiment/results"]  # Uses cache on subsequent loads

Cross-Backend Access

from autostore import AutoStore

# Create a local store as primary backend
store = AutoStore("./local-cache", cache_enabled=True)

# Access different backends using full URIs
store["local_file"] = {"type": "local"}                    # Primary backend
store["s3://bucket/remote.json"] = {"type": "s3"}          # S3 backend

# Read from any backend
local_data = store["local_file"]                           # From local
s3_data = store["s3://bucket/remote.json"]                 # From S3

Multiple S3-Compatible Services

AutoStore supports multiple S3-compatible services with different configurations:

from autostore import AutoStore
from autostore.s3 import S3Options

# Register new schemes for different S3-compatible services
AutoStore.register_scheme("minio", "autostore.s3")
AutoStore.register_scheme("digitalocean", "autostore.s3")

# Create service-specific options with different configurations
aws_options = S3Options(
    scheme="s3",
    profile_name="aws-production",
    region_name="us-east-1",
    cache_enabled=True
)

minio_options = S3Options(
    scheme="minio",
    endpoint_url="https://minio.mycompany.com",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1"
)

digitalocean_options = S3Options(
    scheme="digitalocean",
    endpoint_url="https://nyc3.digitaloceanspaces.com",
    region_name="nyc3",
    cache_enabled=True
)

# Create AutoStore with multiple backend options
store = AutoStore(
    "./local-cache",
    options=[aws_options, minio_options, digitalocean_options]
)

# Each scheme automatically uses its appropriate configuration
store["s3://aws-bucket/data.json"] = {"source": "aws"}
store["minio://my-bucket/data.json"] = {"source": "minio"}
store["digitalocean://my-space/data.json"] = {"source": "digitalocean"}

# Cross-backend data access with automatic option selection
aws_data = store["s3://aws-bucket/data.json"]
minio_data = store["minio://my-bucket/data.json"]
digitalocean_data = store["digitalocean://my-space/data.json"]
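
One way to picture "automatic option selection" is a lookup keyed by URI scheme. The sketch below assumes that model; `ServiceOptions` and `select_options` are hypothetical stand-ins and do not reflect how AutoStore stores its options internally.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class ServiceOptions:
    """Hypothetical stand-in for S3Options, reduced to two fields."""
    scheme: str
    endpoint_url: Optional[str] = None

def select_options(uri: str, options: Sequence[ServiceOptions]) -> Optional[ServiceOptions]:
    """Return the options registered for the URI's scheme, or None for plain paths."""
    scheme = uri.split("://", 1)[0] if "://" in uri else ""
    by_scheme = {opt.scheme: opt for opt in options}
    return by_scheme.get(scheme)

configured = [
    ServiceOptions(scheme="s3"),
    ServiceOptions(scheme="minio", endpoint_url="https://minio.mycompany.com"),
]
chosen = select_options("minio://my-bucket/data.json", configured)
print(chosen.endpoint_url)  # https://minio.mycompany.com
```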

Dataset Support

from autostore import AutoStore

# Automatically detects and handles datasets
# For example, if you have multiple parquet files in an S3 bucket:
# ├── weather
# │   ├── 2024
# │   │   ├── january.parquet
# │   │   ├── february.parquet
# │   │   └── march.parquet
store = AutoStore("s3://my-bucket/datasets/")

# Access parquet dataset (multiple files)
weather_data = store["weather/2024/"]  # Loads entire dataset as LazyFrame

# Access individual file
single_file = store["weather/2024/january.parquet"]

# List files in dataset
files = list(store.list_files("weather/2024/*", recursive=True))
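
The README does not spell out the rule used to decide that a directory is a dataset. The sketch below shows one plausible heuristic on the local filesystem: a directory counts as a dataset when all of its files share a single tabular extension. `looks_like_dataset` and `DATASET_EXTENSIONS` are assumptions made for this example.

```python
import tempfile
from pathlib import Path

DATASET_EXTENSIONS = {".parquet", ".csv"}  # assumption for this sketch

def looks_like_dataset(directory: Path) -> bool:
    """True if every file under the directory shares one tabular extension."""
    suffixes = {p.suffix.lower() for p in directory.rglob("*") if p.is_file()}
    return len(suffixes) == 1 and suffixes <= DATASET_EXTENSIONS

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "weather" / "2024"
    root.mkdir(parents=True)
    for month in ("january", "february", "march"):
        (root / f"{month}.parquet").touch()
    is_dataset = looks_like_dataset(root)        # only .parquet files present

    (root / "notes.txt").touch()
    is_mixed_dataset = looks_like_dataset(root)  # mixed extensions now

print(is_dataset, is_mixed_dataset)  # True False
```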

AutoPath - Path-like Interface

AutoPath provides a pathlib.Path-like interface for unified access to both local filesystem and cloud storage. It combines the familiar Path API with AutoStore's automatic backend detection and data handling capabilities.

Basic AutoPath Usage

from autostore import AutoStore, AutoPath
from autostore.s3 import S3Options

# Create a store with multiple backends
store = AutoStore(
    "./local-data",
    options=[
        S3Options(
            scheme="s3",
            profile_name="aws-prod",
            cache_enabled=True,
            cache_expiry_hours=6
        )
    ]
)

# Create AutoPath instances
local_path = AutoPath("./local-data/config.json", store=store)
s3_path = AutoPath("s3://my-bucket/data/results.parquet", store=store)

# Path operations work the same for local and cloud storage
config_exists = local_path.exists()          # True/False
results_exists = s3_path.exists()            # True/False

# Read files as text or bytes
config_text = local_path.read_text()         # Read as string
results_bytes = s3_path.read_bytes()         # Read as bytes

# Write files
local_path.write_text('{"key": "value"}')    # Write string
s3_path.write_bytes(b"binary data")          # Write bytes

Path Manipulation and Navigation

# Path joining works like pathlib.Path
data_dir = AutoPath("s3://my-bucket/datasets", store=store)
experiment_dir = data_dir / "experiment_1"
results_file = experiment_dir / "results.parquet"

print(results_file)  # s3://my-bucket/datasets/experiment_1/results.parquet

# Path properties
print(results_file.name)       # results.parquet
print(results_file.stem)       # results
print(results_file.suffix)     # .parquet
print(results_file.parent)     # s3://my-bucket/datasets/experiment_1

# Navigate parent directories
parent = results_file.parent
grandparent = parent.parent
all_parents = results_file.parents  # List of all parent directories
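
The properties above follow pathlib naming. As a point of comparison, the standard-library sketch below splits an S3 URI into scheme, bucket, and key, then applies `PurePosixPath` to the key. It illustrates the semantics AutoPath mirrors and is not AutoPath's implementation.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

uri = "s3://my-bucket/datasets/experiment_1/results.parquet"
parsed = urlparse(uri)
key = PurePosixPath(parsed.path)  # /datasets/experiment_1/results.parquet

print(key.name)    # results.parquet
print(key.stem)    # results
print(key.suffix)  # .parquet

# Rebuild the parent URI from the scheme, the bucket (netloc), and the parent key.
parent_uri = f"{parsed.scheme}://{parsed.netloc}{key.parent}"
print(parent_uri)  # s3://my-bucket/datasets/experiment_1
```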

File and Directory Operations

# File operations
if results_file.exists():
    print("File exists")

if results_file.is_file():
    print("It's a file")

if data_dir.is_dir():
    print("It's a directory")

# Directory listing
for item in data_dir.iterdir():
    print(f"Found: {item}")
    if item.is_file():
        print(f"  File size: {item.stat().size}")

# Glob patterns
for parquet_file in data_dir.glob("**/*.parquet"):
    print(f"Parquet file: {parquet_file}")

for csv_file in experiment_dir.glob("*.csv"):
    print(f"CSV file: {csv_file}")
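
Object stores expose flat keys rather than real directories, so glob patterns have to be matched against key strings. The sketch below shows one way to do that with a small glob-to-regex translator; `glob_to_regex` is a hypothetical helper and supports only `**/`, `*`, and `?`.

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple glob (supports **/, *, ?) into a compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # any number of directories, including none
            i += 3
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything except a path separator
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")       # exactly one non-separator character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")

keys = [
    "experiment_1/results.parquet",
    "experiment_1/metrics.csv",
    "experiment_2/run_a/results.parquet",
    "readme.md",
]
parquet_keys = [k for k in keys if glob_to_regex("**/*.parquet").match(k)]
top_level_md = [k for k in keys if glob_to_regex("*.md").match(k)]
print(parquet_keys)
print(top_level_md)
```

Unlike `fnmatch`, a single `*` here does not cross `/` boundaries, which matches the usual pathlib glob behavior.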

Directory Management

# For local paths, this creates real directories
local_dir = AutoPath("./data/analysis", store=store)
local_dir.mkdir(parents=True, exist_ok=True)

# Remove directories
empty_dir = AutoPath("s3://my-bucket/empty-folder", store=store)
empty_dir.rmdir()  # Remove empty directory

# Delete files or directories with contents
old_experiment = AutoPath("s3://my-bucket/old-experiment", store=store)
old_experiment.delete()  # Recursively deletes all contents

File Transfer Operations

# Copy files between any backends
local_file = AutoPath("./data/model.pt", store=store)
s3_backup = AutoPath("s3://backup-bucket/models/model.pt", store=store)

# Copy local file to S3
local_file.copy_to(s3_backup)

# Move files
temp_file = AutoPath("./temp/processing.csv", store=store)
final_location = AutoPath("s3://data-bucket/processed/final.csv", store=store)
temp_file.move_to(final_location)

# Upload from local filesystem
local_source = "./analysis/results.xlsx"
s3_destination = AutoPath("s3://reports/analysis/results.xlsx", store=store)
s3_destination.upload_from(local_source)

# Download to local filesystem
s3_source = AutoPath("s3://data/large_dataset.parquet", store=store)
local_destination = "./downloads/dataset.parquet"
s3_source.download_to(local_destination)

Data Loading with Automatic Format Detection

AutoPath integrates with AutoStore's handler system to load data in the appropriate format based on file extensions or content type. It supports Polars DataFrames, JSON, CSV, and more.

# Load data with automatic format detection
parquet_path = AutoPath("s3://data/sales.parquet", store=store)
df = parquet_path.load()  # Returns Polars DataFrame

json_path = AutoPath("s3://config/settings.json", store=store)
settings = json_path.load()  # Returns dict

# Force a specific format, e.g. when the extension does not match the contents
csv_as_parquet = AutoPath("s3://data/data.csv", store=store)
df = csv_as_parquet.load(format="parquet")  # Parse as Parquet despite the .csv extension

# Bypass cache
fresh_data = parquet_path.load(ignore_cache=True)

# Save data with automatic format detection
results = {"accuracy": 0.95, "model": "transformer"}
results_path = AutoPath("s3://experiments/run_001/results.json", store=store)
results_path.save(results)  # Automatically saves as JSON

# Save DataFrame
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
data_path = AutoPath("s3://datasets/processed.parquet", store=store)
data_path.save(df)  # Automatically saves as Parquet
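
Extension-based loading with an optional format override can be sketched as a dispatch table. The example below uses only standard-library parsers; `LOADERS` and `load_text` are hypothetical names, and AutoStore's handler system is richer than this.

```python
import csv
import io
import json
from pathlib import PurePosixPath

# Hypothetical extension-to-loader table for this sketch only.
LOADERS = {
    ".json": json.loads,
    ".jsonl": lambda text: [json.loads(line) for line in text.splitlines() if line.strip()],
    ".csv": lambda text: list(csv.DictReader(io.StringIO(text))),
}

def load_text(name, text, format=None):
    """Parse text with the loader for `format`, or for the file's extension."""
    ext = f".{format}" if format else PurePosixPath(name).suffix.lower()
    if ext not in LOADERS:
        raise ValueError(f"No loader for {ext!r}")
    return LOADERS[ext](text)

settings = load_text("settings.json", '{"debug": true}')
rows = load_text("data.txt", "a,b\n1,2\n", format="csv")  # override the extension
print(settings)
print(rows)
```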

AutoPath without Explicit Store

When no store is passed, AutoPath creates an appropriate one automatically:

# For local paths
local_path = AutoPath("./data/file.json")  # Creates local store automatically

# For S3 URIs
s3_path = AutoPath("s3://bucket/file.json")  # Creates S3 store with default options

# Path operations work the same
data = s3_path.load()
s3_path.save({"new": "data"})

Advanced Path Operations

# Path pattern matching
log_path = AutoPath("s3://logs/app.2024-01-15.log", store=store)
if log_path.match("*.log"):
    print("It's a log file")

# Relative paths
base_path = AutoPath("s3://data/experiments", store=store)
result_path = AutoPath("s3://data/experiments/run_1/results.json", store=store)
relative = result_path.relative_to(base_path)  # "run_1/results.json"

# Path transformations
config_path = AutoPath("s3://app/config.yaml", store=store)
backup_path = config_path.with_suffix(".yaml.bak")    # config.yaml.bak
renamed_path = config_path.with_name("new_config.yaml")  # new_config.yaml
stemmed_path = config_path.with_stem("production")       # production.yaml

# Absolute and URI representations
print(local_path.as_posix())   # Forward slashes
print(local_path.as_uri())     # file:// URI
print(s3_path.as_uri())        # s3:// URI
print(s3_path.is_absolute())   # True for URIs

Integration Example

from autostore import AutoStore, AutoPath
from autostore.s3 import S3Options
import polars as pl

# Setup store with caching
store = AutoStore(
    "./cache",
    options=[S3Options(
        scheme="s3",
        profile_name="production",
        cache_enabled=True,
        cache_expiry_hours=0  # Never expire
    )]
)

# Define paths
raw_data = AutoPath("s3://raw-data/sales/2024/", store=store)
processed_data = AutoPath("s3://processed/sales_summary.parquet", store=store)
local_backup = AutoPath("./backups/sales_summary.parquet", store=store)

# Process data using path-like interface
if raw_data.is_dir():
    # Load all files in directory as dataset
    df = raw_data.load()  # Loads entire directory as LazyFrame

    # Process data
    summary = df.group_by("region").agg([
        pl.col("sales").sum().alias("total_sales"),
        pl.col("sales").count().alias("transaction_count")
    ]).collect()  # Materialize the LazyFrame once so it can be saved and inspected

    # Save processed data
    processed_data.save(summary)

    # Create local backup
    processed_data.copy_to(local_backup)

    print(f"Processed {summary.height} regions")
    print(f"Backup created at: {local_backup}")

Supported Data Types

| Data Type | File Extension | Description | Library Required |
| --- | --- | --- | --- |
| Polars DataFrame/LazyFrame | .parquet, .csv | High-performance DataFrames | polars |
| Python dict/list | .json | Standard JSON serialization | built-in |
| List of dicts | .jsonl | JSON Lines format | built-in |
| Pydantic models | .pydantic.json | Structured data models | pydantic |
| Python dataclasses | .dataclass.json | Dataclass serialization | built-in |
| String data | .txt, .html, .md | Plain text files | built-in |
| NumPy arrays | .npy, .npz | Numerical data | numpy |
| SciPy sparse matrices | .sparse | Sparse matrix data | scipy |
| PyTorch tensors/models | .pt, .pth | Deep learning models | torch |
| PIL/Pillow images | .png, .jpg, etc. | Image data | Pillow |
| YAML data | .yaml, .yml | Human-readable config files | PyYAML |
| Any Python object | .pkl | Pickle fallback | built-in |

Supported Storage Backends

AutoStore automatically detects the storage backend from URI prefixes:

| Backend | URI Prefix | Options Class | Example |
| --- | --- | --- | --- |
| Local | ./path or /path | Options | ./data, /Users/name/files |
| S3 | s3:// | S3Options | s3://bucket/prefix/ |

Configuration Options

Base Options (All Backends)

from autostore import Options

base_options = Options(
    cache_enabled=True,           # Enable local caching
    cache_dir="./cache",          # Custom cache directory
    cache_expiry_hours=12,        # Cache expiration time (0 = never expire)
    timeout=30,                   # Request timeout in seconds
    max_retries=3,                # Maximum retry attempts
    retry_delay=1.0               # Delay between retries
)
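
To make `max_retries` and `retry_delay` concrete, here is a minimal fixed-delay retry loop. It is a sketch under the assumption that retries wait a constant delay between attempts; the README does not say which retry strategy AutoStore actually uses, and `with_retries` is a hypothetical helper.

```python
import time

def with_retries(operation, max_retries=3, retry_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying up to max_retries times after an OSError."""
    attempt = 0
    while True:
        try:
            return operation()
        except OSError:
            attempt += 1
            if attempt > max_retries:
                raise            # retries exhausted: surface the last error
            sleep(retry_delay)   # fixed delay between attempts

calls = {"n": 0}

def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = with_retries(flaky, max_retries=3, retry_delay=0.0)
print(result, calls["n"])  # ok 3
```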

S3Options

from autostore.s3 import S3Options

s3_options = S3Options(
    # Scheme specification for multi-backend support
    scheme="s3",                          # URI scheme this options applies to

    # Authentication
    aws_access_key_id="your-key",
    aws_secret_access_key="your-secret",
    profile_name="my-profile",            # AWS profile name

    # Configuration
    region_name="us-east-1",
    endpoint_url="custom-endpoint",       # For S3-compatible services

    # Performance
    multipart_threshold=64 * 1024 * 1024, # Files > 64MB use multipart
    multipart_chunksize=16 * 1024 * 1024, # Chunk size for uploads
    max_concurrency=10,                   # Concurrent operations

    # Inherited from Options
    cache_enabled=True,
    cache_expiry_hours=6          # 0 = never expire
)
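
The two multipart settings interact in a simple way: files above the threshold are split into chunks of the configured size. The arithmetic below is a sketch using the values from the example above; `plan_upload` is a hypothetical helper, not an AutoStore function.

```python
import math

MULTIPART_THRESHOLD = 64 * 1024 * 1024  # files larger than this use multipart
MULTIPART_CHUNKSIZE = 16 * 1024 * 1024  # size of each part

def plan_upload(size_bytes, threshold=MULTIPART_THRESHOLD, chunksize=MULTIPART_CHUNKSIZE):
    """Return how many parts an upload is split into (1 means a single request)."""
    if size_bytes <= threshold:
        return 1
    return math.ceil(size_bytes / chunksize)

parts_small = plan_upload(10 * 1024 * 1024)   # below the threshold
parts_large = plan_upload(200 * 1024 * 1024)  # 200 MB / 16 MB, rounded up
print(parts_small, parts_large)  # 1 13
```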

Usage Patterns

# Method 1: Keyword arguments
store = AutoStore("s3://bucket/", profile_name="prod", cache_enabled=True)

# Method 2: Single options object
options = S3Options(scheme="s3", profile_name="prod", cache_enabled=True)
store = AutoStore("s3://bucket/", options=options)

# Method 3: Multiple options for different services
aws_options = S3Options(scheme="s3", profile_name="aws-prod")
minio_options = S3Options(scheme="minio", endpoint_url="https://minio.example.com")
store = AutoStore("./cache", options=[aws_options, minio_options])

# Method 4: Mixed (options object + additional kwargs)
base_options = S3Options(scheme="s3", profile_name="prod")
store = AutoStore("s3://bucket/", options=base_options, cache_enabled=True)
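
Method 4 combines an options object with extra keyword arguments. One way such a merge can work is shown below with `dataclasses.replace`. This assumes keyword arguments take precedence over the options object, which the README does not state explicitly; `DemoOptions` and `merge_options` are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DemoOptions:
    """Hypothetical stand-in for an options dataclass; not autostore.Options."""
    scheme: str = ""
    profile_name: str = ""
    cache_enabled: bool = False

def merge_options(options, **overrides):
    """Return a copy of options with keyword overrides applied on top."""
    return replace(options, **overrides)

base = DemoOptions(scheme="s3", profile_name="prod")
merged = merge_options(base, cache_enabled=True)
print(merged)
```

Because the dataclass is frozen, the original `base` object is left unchanged and can be reused for other stores.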

Advanced Features

Backend Management

# Register new S3-compatible services
AutoStore.register_scheme("minio", "autostore.s3")
AutoStore.register_scheme("digitalocean", "autostore.s3")

# Check supported backends
backends = store.get_supported_backends()
print(f"Available: {backends}")  # ['s3', 'minio', 'digitalocean', 'file', '']

# View active backends
active = store.list_active_backends()
print(f"Active: {active}")  # ['primary: ./data', 's3: s3://bucket/', 'minio: minio://bucket/']

# Backend auto-loading with appropriate options
data = store["s3://bucket/file.json"]              # Uses AWS S3 options
data = store["minio://bucket/file.json"]           # Uses MinIO options
data = store["digitalocean://space/file.json"]     # Uses DigitalOcean options

Dataset Operations

# Dataset detection
is_dataset = store.primary_backend.is_dataset("path/to/data/")
is_directory = store.primary_backend.is_directory("path/")

# List dataset files
files = list(store.list_files("dataset/*", recursive=True))

# Load entire dataset (for parquet/CSV collections)
lazy_frame = store["weather_data/"]  # Loads all parquet files as one LazyFrame

Caching System

AutoStore includes caching that:

  • Stores frequently accessed files locally
  • Uses ETags for cache validation
  • Automatically expires old cache entries (or never expires if cache_expiry_hours=0)
  • Works across all backends

# Enable caching for any backend
store = AutoStore("s3://bucket/", cache_enabled=True, cache_expiry_hours=6)

# Never expire cache entries (useful for immutable data)
store = AutoStore("s3://bucket/", cache_enabled=True, cache_expiry_hours=0)

# Cache management
store.cleanup_cache()  # Remove expired cache entries

# Check cache status
metadata = store.get_metadata("large_file")
print(f"File size: {metadata.size} bytes")
print(f"ETag: {metadata.etag}")
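
The rules in the list above (ETag validation, time-based expiry, and `cache_expiry_hours=0` meaning "never expire") can be combined into a single freshness check. The following is a sketch of that logic under those assumptions; `CacheEntry` and `is_fresh` are hypothetical names.

```python
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    """Hypothetical cache record for this sketch."""
    etag: str
    cached_at: float  # seconds since the epoch

def is_fresh(entry, remote_etag, expiry_hours, now=None):
    """True if the cached copy can be reused instead of downloading again."""
    now = time.time() if now is None else now
    if entry.etag != remote_etag:
        return False               # the remote object changed
    if expiry_hours == 0:
        return True                # 0 means never expire
    age_hours = (now - entry.cached_at) / 3600
    return age_hours < expiry_hours

entry = CacheEntry(etag="abc123", cached_at=0.0)
seven_hours = 7 * 3600
print(is_fresh(entry, "abc123", expiry_hours=6, now=seven_hours))  # False
print(is_fresh(entry, "abc123", expiry_hours=0, now=seven_hours))  # True
```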

Custom Data Handlers

Add support for new data types by creating custom handlers:

import json
from pathlib import Path
from autostore.autostore import DataHandler

class CustomLogHandler(DataHandler):
    def can_handle_extension(self, extension: str) -> bool:
        return extension.lower() == ".log"

    def can_handle_data(self, data) -> bool:
        return isinstance(data, list) and all(
            isinstance(item, dict) and "timestamp" in item
            for item in data
        )

    def read_from_file(self, file_path: Path, file_extension: str):
        logs = []
        with open(file_path, 'r') as f:
            for line in f:
                if line.strip():
                    logs.append(json.loads(line))
        return logs

    def write_to_file(self, data, file_path: Path, file_extension: str):
        file_path.parent.mkdir(parents=True, exist_ok=True)
        with open(file_path, 'w') as f:
            for entry in data:
                f.write(json.dumps(entry) + '\n')

    @property
    def extensions(self):
        return [".log"]

    @property
    def priority(self):
        return 15

# Register the handler
store.register_handler(CustomLogHandler())
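
The `priority` property suggests that handlers compete when more than one claims the same extension. The sketch below assumes that the handler with the highest priority wins; the README does not confirm the ordering, and the classes here are simplified stand-ins rather than `DataHandler` subclasses.

```python
class JsonLinesHandler:
    """Simplified stand-in handler that also claims .log files."""
    extensions = [".jsonl", ".log"]
    priority = 10

class LogHandler:
    """Simplified stand-in for the custom log handler above."""
    extensions = [".log"]
    priority = 15

def pick_handler(extension, handlers):
    """Return the highest-priority handler that claims the extension, or None."""
    candidates = [h for h in handlers if extension.lower() in h.extensions]
    return max(candidates, key=lambda h: h.priority, default=None)

handlers = [JsonLinesHandler(), LogHandler()]
print(type(pick_handler(".log", handlers)).__name__)    # LogHandler
print(type(pick_handler(".jsonl", handlers)).__name__)  # JsonLinesHandler
```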

File Operations

# Check existence
if "config" in store:
    print("Config file exists")

# List all files
for key in store.keys():
    print(f"File: {key}")

# Get file metadata
metadata = store.get_metadata("large_dataset")
print(f"Size: {metadata.size} bytes")
print(f"Modified: {metadata.modified_time}")

# Copy and move files
store.copy("original", "backup")
store.move("temp_file", "permanent_file")

# Delete files
del store["old_data"]

Context Management

# Automatic cleanup of temporary files and cache
with AutoStore("./data", cache_enabled=True) as store:
    store["data"] = large_dataset
    results = store["data"]
# Temporary files are automatically cleaned up here
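
The cleanup-on-exit behavior follows the standard context manager protocol. The sketch below shows the general pattern with a temporary directory that is removed in `__exit__`; `ScratchStore` is a hypothetical class and says nothing about which files AutoStore itself removes.

```python
import shutil
import tempfile
from pathlib import Path

class ScratchStore:
    """Hypothetical store that owns a temporary directory for its lifetime."""

    def __enter__(self):
        self.tmp_dir = Path(tempfile.mkdtemp(prefix="scratch-"))
        return self

    def __exit__(self, exc_type, exc, tb):
        # Remove temporary files even if the body raised.
        shutil.rmtree(self.tmp_dir, ignore_errors=True)
        return False  # do not suppress exceptions

with ScratchStore() as scratch:
    tmp_file = scratch.tmp_dir / "partial.bin"
    tmp_file.write_bytes(b"in progress")
    existed_inside = tmp_file.exists()

existed_after = tmp_file.exists()
print(existed_inside, existed_after)  # True False
```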

Performance Considerations

Large File Handling

AutoStore automatically optimizes for large files:

  • Multipart uploads/downloads for files > 64MB
  • Configurable chunk sizes and concurrency
  • Streaming operations to minimize memory usage
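
Streaming keeps memory usage bounded by the chunk size rather than the file size. The helper below is a generic sketch of chunked copying between file-like objects; `stream_copy` is a hypothetical name and is not taken from AutoStore.

```python
import io

def stream_copy(src, dst, chunk_size=16 * 1024 * 1024):
    """Copy src to dst one chunk at a time and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:          # an empty read signals end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total

source = io.BytesIO(b"x" * 100_000)
target = io.BytesIO()
copied = stream_copy(source, target, chunk_size=32_768)
print(copied)  # 100000
```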

When to Use AutoStore

Choose AutoStore when you need:

  • Multi-cloud data access with seamless backend switching
  • Dataset processing with automatic detection of file collections
  • Zero-configuration setup for rapid prototyping and development
  • Cross-backend operations without managing multiple client libraries
  • Data science projects with mixed file types across storage systems
  • Type-safe configuration with IDE support and validation
  • Intelligent caching to optimize cloud storage costs and performance

Don't choose AutoStore when:

  • You need complex queries or relational operations (use databases)
  • You only work with one data type and one storage backend consistently
  • You need zero dependencies (use built-in libraries like shelve)
  • You require advanced database features like transactions or indexing
  • You need fine-grained control over every storage operation

Changes

  • 0.1.14 - AutoPath now has load and save methods that use the built-in handlers
  • 0.1.13 - Added AutoPath class for path-like operations with automatic backend detection
    • AutoPath supports all storage operations like read, write, upload, download, delete, etc.
    • AutoPath can be used in place of AutoStore for path-like interactions
  • 0.1.8 - Auto scheme registration enhancement
  • 0.1.7 - Cache expiry can be set to 0 to never expire cache entries.
  • 0.1.6 - Scheme-based backend detection and Options system with automatic backend detection from URI schemes
    • Unified Options dataclass system replacing separate config classes
    • Cross-backend access from single store instance
    • Dataset support with automatic multi-file detection
    • Enhanced error handling with dependency management
    • Breaking: Removed manual backend registration
    • Breaking: Replaced S3StorageConfig with S3Options
  • 0.1.5 - Added StorePath to use the AutoStore instance in path-like operations
  • 0.1.4 - parquet and csv are loaded as LazyFrames by default and sparse matrices are now saved as .sparse.npz
  • 0.1.3
    • Refactored to use different storage backends including local file system and S3.
    • Implemented S3 storage backend with basic operations
    • Added S3StorageConfig for configuration management.
    • Implemented S3Backend class for handling S3 interactions.
    • Included methods for file operations: upload, download, delete, copy, move, and list files.
    • Added support for directory-like structures in S3.
    • Implemented metadata retrieval for files.
    • Integrated error handling for common S3 exceptions.
    • Added support for multipart uploads and downloads.
    • Included utility functions for path parsing and glob pattern matching.
    • Calling store.keys() now only returns keys without extensions.
  • 0.1.2 - config, setup_logging, and load_dotenv are now imported at the module top level
  • 0.1.1 - Added config, setup_logging, and load_dotenv
  • 0.1.0 - Initial release
