fsspec-utils
Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support.
Overview
fsspec-utils is a comprehensive toolkit that extends fsspec with:
- Multi-cloud storage configuration - Easy setup for AWS S3, Google Cloud Storage, Azure Storage, GitHub, and GitLab
- Enhanced caching - Improved caching filesystem with monitoring and path preservation
- Extended I/O operations - Read/write operations for JSON, CSV, Parquet with Polars/PyArrow integration
- Utility functions - Type conversion, parallel processing, and data transformation helpers
Installation
# Basic installation
pip install fsspec-utils
# With all optional dependencies
pip install "fsspec-utils[full]"
# Specific cloud providers (extras are quoted so shells such as zsh do not expand the brackets)
pip install "fsspec-utils[aws]"    # AWS S3 support
pip install "fsspec-utils[gcp]"    # Google Cloud Storage
pip install "fsspec-utils[azure]"  # Azure Storage
Quick Start
Basic Filesystem Operations
from fsspec_utils import filesystem
# Local filesystem
fs = filesystem("file")
files = fs.ls("/path/to/data")
# S3 with caching
fs = filesystem("s3://my-bucket/", cached=True)
data = fs.cat("data/file.txt")
Storage Configuration
from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions
# Configure S3 access
options = AwsStorageOptions(
    region="us-west-2",
    access_key_id="YOUR_KEY",
    secret_access_key="YOUR_SECRET",
)
fs = filesystem("s3", storage_options=options, cached=True)
Environment-based Configuration
from fsspec_utils import filesystem
from fsspec_utils.storage import AwsStorageOptions
# Load from environment variables
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
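To make environment-based loading concrete, here is a minimal sketch of how such a loader could work. It is not the package's implementation: `S3Settings` and `settings_from_env` are hypothetical names introduced for illustration, and which variables `AwsStorageOptions.from_env()` actually reads is an assumption. The only real identifiers are the standard AWS environment variables.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3Settings:
    """Hypothetical container for S3 settings (not part of fsspec-utils)."""

    access_key_id: Optional[str] = None
    secret_access_key: Optional[str] = None
    region: Optional[str] = None
    endpoint_url: Optional[str] = None


def settings_from_env(env: Optional[Mapping[str, str]] = None) -> S3Settings:
    """Build S3Settings from the standard AWS environment variables.

    Passing an explicit mapping instead of os.environ keeps the function
    easy to test without touching the real process environment.
    """
    env = os.environ if env is None else env
    return S3Settings(
        access_key_id=env.get("AWS_ACCESS_KEY_ID"),
        secret_access_key=env.get("AWS_SECRET_ACCESS_KEY"),
        # AWS tools recognise both names; prefer AWS_DEFAULT_REGION here.
        region=env.get("AWS_DEFAULT_REGION") or env.get("AWS_REGION"),
        endpoint_url=env.get("AWS_ENDPOINT_URL"),
    )


example = settings_from_env(
    {
        "AWS_ACCESS_KEY_ID": "YOUR_KEY",
        "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET",
        "AWS_REGION": "us-west-2",
    }
)
print(example.region)
```

Missing variables simply stay `None`, which leaves it to the underlying filesystem to fall back to its own credential lookup.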
Multiple Cloud Providers
from fsspec_utils import filesystem
from fsspec_utils.storage import (
    AwsStorageOptions,
    GcsStorageOptions,
    GitHubStorageOptions,
)
# AWS S3
s3_fs = filesystem("s3", storage_options=AwsStorageOptions.from_env())
# Google Cloud Storage
gcs_fs = filesystem("gs", storage_options=GcsStorageOptions.from_env())
# GitHub repository
github_fs = filesystem("github", storage_options=GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    token="ghp_xxxx",
))
Storage Options
AWS S3
from fsspec_utils.storage import AwsStorageOptions
# Basic credentials
options = AwsStorageOptions(
    access_key_id="AKIAXXXXXXXX",
    secret_access_key="SECRET",
    region="us-east-1",
)
# From AWS profile
options = AwsStorageOptions.create(profile="dev")
# S3-compatible service (MinIO)
options = AwsStorageOptions(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    allow_http=True,
)
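For orientation, the options above ultimately have to reach an S3 client. The sketch below shows one plausible translation into the keyword arguments that `s3fs.S3FileSystem` accepts (`key`, `secret`, and `client_kwargs` with `endpoint_url` and `region_name`). The helper `to_s3fs_kwargs` is hypothetical and is not claimed to be what fsspec-utils does internally; `allow_http` is left out because the source does not say how it is mapped.

```python
from typing import Any, Dict, Optional


def to_s3fs_kwargs(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    region: Optional[str] = None,
    endpoint_url: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: translate option fields into s3fs keyword arguments."""
    # Settings forwarded to the underlying botocore client.
    client_kwargs: Dict[str, Any] = {}
    if endpoint_url:
        client_kwargs["endpoint_url"] = endpoint_url
    if region:
        client_kwargs["region_name"] = region

    kwargs: Dict[str, Any] = {}
    if access_key_id:
        kwargs["key"] = access_key_id
    if secret_access_key:
        kwargs["secret"] = secret_access_key
    if client_kwargs:
        kwargs["client_kwargs"] = client_kwargs
    return kwargs


minio_kwargs = to_s3fs_kwargs(
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    endpoint_url="http://localhost:9000",
)
print(minio_kwargs)
# With s3fs installed, the result could be used as: s3fs.S3FileSystem(**minio_kwargs)
```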
Google Cloud Storage
from fsspec_utils.storage import GcsStorageOptions
# Service account
options = GcsStorageOptions(
    token="path/to/service-account.json",
    project="my-project-123",
)
# From environment
options = GcsStorageOptions.from_env()
Azure Storage
from fsspec_utils.storage import AzureStorageOptions
# Account key
options = AzureStorageOptions(
    protocol="az",
    account_name="mystorageacct",
    account_key="key123...",
)
# Connection string
options = AzureStorageOptions(
    protocol="az",
    connection_string="DefaultEndpoints...",
)
GitHub
from fsspec_utils.storage import GitHubStorageOptions
# Public repository
options = GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    ref="main",
)
# Private repository
options = GitHubStorageOptions(
    org="myorg",
    repo="private-repo",
    token="ghp_xxxx",
    ref="develop",
)
GitLab
from fsspec_utils.storage import GitLabStorageOptions
# Public project
options = GitLabStorageOptions(
    project_name="group/project",
    ref="main",
)
# Private project with token
options = GitLabStorageOptions(
    project_id=12345,
    token="glpat-xxxx",
    ref="develop",
)
Enhanced Caching
from fsspec_utils import filesystem
# Enable caching with monitoring
fs = filesystem(
    "s3://my-bucket/",
    cached=True,
    cache_storage="/tmp/my_cache",
    verbose=True,
)
# Cache preserves directory structure
data = fs.cat("deep/nested/path/file.txt")
# Cached at: /tmp/my_cache/deep/nested/path/file.txt
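The path preservation shown above can be pictured as a simple mapping from a remote path to a location under the cache directory. The following is an illustrative sketch only: `cache_path_for` is a hypothetical helper and not the package's actual caching code.

```python
from pathlib import Path, PurePosixPath


def cache_path_for(remote_path: str, cache_root: str) -> Path:
    """Map a remote path onto a local cache path, keeping its directories."""
    # Drop the root marker and any ".." segments so the result always
    # stays inside cache_root.
    parts = [
        part
        for part in PurePosixPath(remote_path).parts
        if part not in ("/", "..", ".")
    ]
    return Path(cache_root).joinpath(*parts)


local = cache_path_for("deep/nested/path/file.txt", "/tmp/my_cache")
print(local.as_posix())
```

Keeping the original directory layout makes cached files easy to find and inspect by hand, at the cost of having to guard against path segments that would escape the cache root.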
Utilities
Parallel Processing
from fsspec_utils.utils import run_parallel
# Run function in parallel
def process_file(path, multiplier=1):
    return len(path) * multiplier

results = run_parallel(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
    verbose=True,
)
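The call above passes a list that is split across workers and a keyword argument (`multiplier`) that is the same for every call. A rough standard-library stand-in for that calling pattern is sketched below. `run_parallel_sketch` is a hypothetical name, it uses threads from `concurrent.futures` rather than whatever backend the package uses (joblib is listed among the optional dependencies), and it does not reproduce options such as `verbose`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Any, Callable, Iterable, List


def run_parallel_sketch(
    func: Callable[..., Any],
    items: Iterable[Any],
    n_jobs: int = 4,
    **fixed_kwargs: Any,
) -> List[Any]:
    """Apply func to each item in parallel; fixed_kwargs are shared by all calls."""
    worker = partial(func, **fixed_kwargs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(worker, items))


def process_file(path, multiplier=1):
    return len(path) * multiplier


results = run_parallel_sketch(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
)
print(results)  # [12, 12, 12]
```

Threads suit I/O-bound work such as reading remote files; CPU-bound functions would usually call for a process-based backend instead.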
Type Conversion
from fsspec_utils.utils import dict_to_dataframe, to_pyarrow_table
# Convert dict to DataFrame
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = dict_to_dataframe(data)
# Convert to PyArrow table
table = to_pyarrow_table(df)
Logging
from fsspec_utils.utils import setup_logging
# Configure logging
setup_logging(level="DEBUG", format_string="{time} | {level} | {message}")
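The brace-style format string above matches the loguru convention (loguru is a core dependency). For readers more familiar with the standard library, a roughly equivalent setup can be sketched with `logging` and `style="{"`. This is an analogue for comparison, not the package's implementation; `setup_logging_sketch` and the logger name are hypothetical, and the standard library spells the fields `asctime` and `levelname`.

```python
import io
import logging
from typing import Optional, TextIO


def setup_logging_sketch(level: str = "INFO", stream: Optional[TextIO] = None) -> logging.Logger:
    """Configure a logger with a brace-style format string (stdlib analogue)."""
    logger = logging.getLogger("fsspec_utils_sketch")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("{asctime} | {levelname} | {message}", style="{")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = setup_logging_sketch(level="DEBUG", stream=buffer)
log.debug("cache miss for %s", "data/file.txt")
print(buffer.getvalue(), end="")
```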
Dependencies
Core Dependencies
- fsspec>=2023.1.0 - Filesystem interface
- msgspec>=0.18.0 - Serialization
- pyyaml>=6.0 - YAML support
- requests>=2.25.0 - HTTP requests
- loguru>=0.7.0 - Logging
Optional Dependencies
- orjson>=3.8.0 - Fast JSON processing
- polars>=0.19.0 - Fast DataFrames
- pyarrow>=10.0.0 - Columnar data
- pandas>=1.5.0 - Data analysis
- joblib>=1.3.0 - Parallel processing
- rich>=13.0.0 - Progress bars
Cloud Provider Dependencies
- boto3>=1.26.0, s3fs>=2023.1.0 - AWS S3
- gcsfs>=2023.1.0 - Google Cloud Storage
- adlfs>=2023.1.0 - Azure Storage
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Relationship to FlowerPower
This package was extracted from the FlowerPower workflow framework to provide standalone filesystem utilities that can be used independently or as a dependency in other projects.