Enhanced utilities and extensions for fsspec, storage_options and obstore with multi-format I/O support.
fsspeckit
Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support.
Overview
fsspeckit is a comprehensive toolkit that extends fsspec with:
- Multi-cloud storage configuration - Easy setup for AWS S3, Google Cloud Storage, Azure Storage, GitHub, and GitLab
- Enhanced caching - Improved caching filesystem with monitoring and path preservation
- Extended I/O operations - Read/write operations for JSON, CSV, Parquet with Polars/PyArrow integration
- Domain-specific packages - Organized into logical packages for better discoverability
Package Structure
fsspeckit is organized into domain-specific packages:
- fsspeckit.core - Core filesystem APIs and backend-neutral planning logic
- fsspeckit.storage_options - Multi-cloud storage configuration classes
- fsspeckit.datasets - Dataset-level operations (DuckDB & PyArrow helpers)
- fsspeckit.sql - SQL-to-filter translation helpers
- fsspeckit.common - Cross-cutting utilities (logging, parallelism, type conversion)
- fsspeckit.utils - Backwards-compatible façade that re-exports from domain packages
Note: The fsspeckit.utils module is maintained for backwards compatibility. New code should import directly from the domain packages for better discoverability.
Installation
# Basic installation
pip install fsspeckit
# Specific cloud providers
pip install "fsspeckit[aws]" # AWS S3 support
pip install "fsspeckit[gcp]" # Google Cloud Storage
pip install "fsspeckit[azure]" # Azure Storage
# Multiple cloud providers
pip install "fsspeckit[aws,gcp,azure]"
Quick Start
Basic Filesystem Operations
from fsspeckit import filesystem
# Local filesystem
fs = filesystem("file")
files = fs.ls("/path/to/data")
# S3 with caching
fs = filesystem("s3://my-bucket/", cached=True)
data = fs.cat("data/file.txt")
Storage Configuration
from fsspeckit import filesystem
from fsspeckit.storage_options import AwsStorageOptions
# Configure S3 access
options = AwsStorageOptions(
    region="us-west-2",
    access_key_id="YOUR_KEY",
    secret_access_key="YOUR_SECRET",
)
fs = filesystem("s3", storage_options=options, cached=True)
Environment-based Configuration
from fsspeckit import filesystem
from fsspeckit.storage_options import AwsStorageOptions
# Load credentials and region from environment variables
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
# For anonymous access, set AWS_S3_ANONYMOUS=true in the environment
# before calling from_env()
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
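To make the environment-based pattern concrete, here is a minimal standard-library sketch of what such a loader might do. It is not fsspeckit's implementation: the `EnvS3Options` class is hypothetical, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_DEFAULT_REGION` are the conventional AWS variable names, and `AWS_S3_ANONYMOUS` is taken from the comment above. Which variables `AwsStorageOptions.from_env()` actually reads is an assumption here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class EnvS3Options:
    """Hypothetical stand-in for an options class; not part of fsspeckit."""

    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None
    region: Optional[str] = None
    anonymous: bool = False

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "EnvS3Options":
        # Accept an explicit mapping so the loader can be exercised without
        # mutating the real process environment.
        env = os.environ if env is None else env
        truthy = {"1", "true", "yes"}
        return cls(
            access_key_id=env.get("AWS_ACCESS_KEY_ID"),
            secret_access_key=env.get("AWS_SECRET_ACCESS_KEY"),
            region=env.get("AWS_DEFAULT_REGION"),
            anonymous=env.get("AWS_S3_ANONYMOUS", "").strip().lower() in truthy,
        )


# Simulated environment: region set, anonymous access requested, no keys.
options = EnvS3Options.from_env(
    {"AWS_DEFAULT_REGION": "us-west-2", "AWS_S3_ANONYMOUS": "true"}
)
```

Passing the environment as a mapping keeps the loader easy to test; calling `EnvS3Options.from_env()` with no argument falls back to `os.environ`.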
DuckDB Parquet Maintenance
from fsspeckit.datasets import DuckDBParquetHandler
with DuckDBParquetHandler() as handler:
    # Inspect fragmentation without writing
    dry_stats = handler.compact_parquet_dataset(
        path="/data/events/",
        target_mb_per_file=256,
        dry_run=True,
    )
    # Compact tiny files and recompress with zstd
    handler.compact_parquet_dataset(
        path="/data/events/",
        target_rows_per_file=500_000,
        compression="zstd",
    )
    # Recluster partitions with z-order style ordering
    handler.optimize_parquet_dataset(
        path="/data/events/",
        zorder_columns=["user_id", "event_date"],
        partition_filter=["date=2025-11-10"],
    )
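The dry-run step above implies a planning phase that decides which small files to merge before anything is written. The following is a minimal sketch of one such plan, assuming a simple greedy grouping by file size. The `plan_compaction` function and the sample file sizes are hypothetical, and the strategy fsspeckit actually uses may differ.

```python
from typing import List, Tuple

MB = 1024 * 1024


def plan_compaction(
    files: List[Tuple[str, int]], target_bytes: int
) -> List[List[str]]:
    """Greedily group small files so each group stays at or below target_bytes."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    # Process the smallest files first so tiny fragments are merged together.
    for path, size in sorted(files, key=lambda item: item[1]):
        if size >= target_bytes:
            continue  # already at or above the target; leave untouched
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file on its own gains nothing, so drop such groups.
    return [group for group in groups if len(group) > 1]


# Hypothetical dataset listing: (path, size in bytes).
files = [
    ("a.parquet", 10 * MB),
    ("b.parquet", 20 * MB),
    ("c.parquet", 300 * MB),
    ("d.parquet", 240 * MB),
]
plan = plan_compaction(files, target_bytes=256 * MB)
```

With these sizes only `a.parquet` and `b.parquet` are grouped: `c.parquet` already exceeds the target, and `d.parquet` would end up in a group by itself.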
Multiple Cloud Providers
from fsspeckit import filesystem
from fsspeckit.storage_options import (
    AwsStorageOptions,
    GcsStorageOptions,
    GitHubStorageOptions,
)
# AWS S3
s3_fs = filesystem("s3", storage_options=AwsStorageOptions.from_env())
# Google Cloud Storage
gcs_fs = filesystem("gs", storage_options=GcsStorageOptions.from_env())
# GitHub repository
github_fs = filesystem("github", storage_options=GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    token="ghp_xxxx",
))
Storage Options
AWS S3
from fsspeckit.storage_options import AwsStorageOptions
# Basic credentials
options = AwsStorageOptions(
    access_key_id="AKIAXXXXXXXX",
    secret_access_key="SECRET",
    region="us-east-1",
)
# From AWS profile
options = AwsStorageOptions.create(profile="dev")
# S3-compatible service (MinIO)
options = AwsStorageOptions(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    allow_http=True,
)
# Anonymous access for public buckets
options = AwsStorageOptions(anonymous=True)
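Since s3fs is listed as the AWS backend, options like these eventually have to become keyword arguments for `s3fs.S3FileSystem`. The sketch below shows one plausible translation using parameters that s3fs accepts (`anon`, `key`, `secret`, `client_kwargs`). The `to_s3fs_kwargs` helper is hypothetical, and whether fsspeckit performs this exact mapping is an assumption.

```python
from typing import Any, Dict, Optional


def to_s3fs_kwargs(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    region: Optional[str] = None,
    endpoint_url: Optional[str] = None,
    anonymous: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for s3fs.S3FileSystem."""
    kwargs: Dict[str, Any] = {"anon": anonymous}
    # Credentials are irrelevant for anonymous access, so skip them.
    if not anonymous:
        if access_key_id:
            kwargs["key"] = access_key_id
        if secret_access_key:
            kwargs["secret"] = secret_access_key
    # Endpoint and region are forwarded to the underlying botocore client.
    client_kwargs: Dict[str, str] = {}
    if endpoint_url:
        client_kwargs["endpoint_url"] = endpoint_url
    if region:
        client_kwargs["region_name"] = region
    if client_kwargs:
        kwargs["client_kwargs"] = client_kwargs
    return kwargs


# MinIO-style configuration, mirroring the example above.
minio_kwargs = to_s3fs_kwargs(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
)
```

The resulting dictionary could then be passed as `s3fs.S3FileSystem(**minio_kwargs)` or `fsspec.filesystem("s3", **minio_kwargs)`.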
Google Cloud Storage
from fsspeckit.storage_options import GcsStorageOptions
# Service account
options = GcsStorageOptions(
    token="path/to/service-account.json",
    project="my-project-123",
)
# From environment
options = GcsStorageOptions.from_env()
Azure Storage
from fsspeckit.storage_options import AzureStorageOptions
# Account key
options = AzureStorageOptions(
    protocol="az",
    account_name="mystorageacct",
    account_key="key123...",
)
# Connection string
options = AzureStorageOptions(
    protocol="az",
    connection_string="DefaultEndpoints...",
)
GitHub
from fsspeckit.storage_options import GitHubStorageOptions
# Public repository
options = GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    ref="main",
)
# Private repository
options = GitHubStorageOptions(
    org="myorg",
    repo="private-repo",
    token="ghp_xxxx",
    ref="develop",
)
GitLab
from fsspeckit.storage_options import GitLabStorageOptions
# Public project
options = GitLabStorageOptions(
    project_name="group/project",
    ref="main",
)
# Private project with token
options = GitLabStorageOptions(
    project_id=12345,
    token="glpat_xxxx",
    ref="develop",
)
Enhanced Caching
from fsspeckit import filesystem
# Enable caching with monitoring
fs = filesystem(
    "s3://my-bucket/",
    cached=True,
    cache_storage="/tmp/my_cache",
    verbose=True,
)
# Cache preserves directory structure
data = fs.cat("deep/nested/path/file.txt")
# Cached at: /tmp/my_cache/deep/nested/path/file.txt
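Path preservation means the cached copy mirrors the remote layout instead of being stored under an opaque name (fsspec's built-in caching filesystems use hashed file names by default). The mapping can be sketched with the standard library as follows. The `cache_path_for` helper is hypothetical and only illustrates the idea; it is not fsspeckit's code.

```python
from pathlib import PurePosixPath


def cache_path_for(remote_path: str, cache_storage: str) -> str:
    """Hypothetical helper: mirror a remote path under the cache directory."""
    # Drop any "protocol://" prefix, then any leading slashes, so the
    # remainder can be joined as a relative path.
    _, _, without_protocol = remote_path.rpartition("://")
    relative = PurePosixPath(without_protocol.lstrip("/"))
    # Refuse paths that could escape the cache directory.
    if ".." in relative.parts:
        raise ValueError(f"unsafe path: {remote_path!r}")
    return str(PurePosixPath(cache_storage) / relative)


cached_at = cache_path_for("deep/nested/path/file.txt", "/tmp/my_cache")
```

Here `cached_at` is `/tmp/my_cache/deep/nested/path/file.txt`, matching the location shown in the example above.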
Utilities
Parallel Processing
from fsspeckit.common import run_parallel
# Run function in parallel
def process_file(path, multiplier=1):
    return len(path) * multiplier
results = run_parallel(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
    verbose=True,
)
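For readers who want to see the calling convention in isolation, the same pattern (one iterable argument, fixed keyword arguments, a worker count) can be approximated with the standard library. This sketch deliberately swaps in `concurrent.futures` threads rather than joblib, and `run_parallel_sketch` is a hypothetical name; it does not reproduce `run_parallel`'s full behaviour, such as progress output.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Any, Callable, Iterable, List


def run_parallel_sketch(
    func: Callable[..., Any],
    items: Iterable[Any],
    n_jobs: int = 4,
    **fixed_kwargs: Any,
) -> List[Any]:
    """Apply func to every item using a thread pool; results keep input order."""
    # Bind the keyword arguments that are identical for every call.
    worker = partial(func, **fixed_kwargs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(worker, items))


def process_file(path: str, multiplier: int = 1) -> int:
    return len(path) * multiplier


results = run_parallel_sketch(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
)
```

Threads suit I/O-bound work such as fetching files; CPU-bound functions would generally need a process-based backend instead.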
Type Conversion
from fsspeckit.common.types import dict_to_dataframe, to_pyarrow_table
# Convert dict to DataFrame
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = dict_to_dataframe(data)
# Convert to PyArrow table
table = to_pyarrow_table(df)
Logging
from fsspeckit.common.logging import setup_logging
# Configure logging
setup_logging(level="DEBUG", format_string="{time} | {level} | {message}")
Migration Guide
The package structure was refactored in version X.X.0 to improve discoverability and organization.
For new code, use the canonical imports from domain packages:
- Dataset operations: from fsspeckit.datasets import ...
- SQL helpers: from fsspeckit.sql import ...
- Common utilities: from fsspeckit.common import ...
For existing code, all fsspeckit.utils imports continue to work unchanged.
For detailed migration instructions, see the Migration Guide.
Dependencies
Core Dependencies
- fsspec>=2023.1.0 - Filesystem interface
- msgspec>=0.18.0 - Serialization
- pyyaml>=6.0 - YAML support
- requests>=2.25.0 - HTTP requests
- loguru>=0.7.0 - Logging
Optional Dependencies
- orjson>=3.8.0 - Fast JSON processing
- polars>=0.19.0 - Fast DataFrames
- pyarrow>=10.0.0 - Columnar data
- pandas>=1.5.0 - Data analysis
- joblib>=1.3.0 - Parallel processing
- rich>=13.0.0 - Progress bars
Cloud Provider Dependencies
- boto3>=1.26.0, s3fs>=2023.1.0 - AWS S3
- gcsfs>=2023.1.0 - Google Cloud Storage
- adlfs>=2023.1.0 - Azure Storage
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.