Enhanced utilities and extensions for fsspec filesystems with multi-format I/O support

fsspec-utils

⚠️ DEPRECATED - This package is no longer maintained

This package has been superseded by fsspeckit.

Action Required: Please migrate to fsspeckit for:

  • Continued support and bug fixes
  • New features and improvements
  • Latest dependency updates

Migration Guide

Replace your imports:

# OLD - fsspec-utils (deprecated)
from fsspec_utils import filesystem
from fsspec_utils.storage_options import AwsStorageOptions

# NEW - fsspeckit (recommended)
from fsspeckit import filesystem
from fsspeckit.storage_options import AwsStorageOptions

Update installation:

pip uninstall fsspec-utils
pip install fsspeckit

All functionality from fsspec-utils is now available in fsspeckit with the same API for easy migration.
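For larger codebases, the import swap above can be scripted. The following is a minimal sketch, not an official migration tool: `rewrite_imports` is a hypothetical helper that only touches lines beginning with `import fsspec_utils` or `from fsspec_utils`, so attribute access such as `fsspec_utils.filesystem(...)` after a bare `import fsspec_utils` would still need manual review.

```python
import re

# Match the old package name only where it starts an import path,
# e.g. "from fsspec_utils..." or "import fsspec_utils".
_OLD_IMPORT = re.compile(r"^(\s*(?:from|import)\s+)fsspec_utils\b", re.MULTILINE)


def rewrite_imports(source: str) -> str:
    """Hypothetical helper: point fsspec_utils imports at fsspeckit."""
    return _OLD_IMPORT.sub(r"\g<1>fsspeckit", source)


old_code = (
    "from fsspec_utils import filesystem\n"
    "from fsspec_utils.storage_options import AwsStorageOptions\n"
)
new_code = rewrite_imports(old_code)
print(new_code)
```

Review the result as a diff before committing it; the regex deliberately leaves strings and comments that mention `fsspec_utils` unchanged.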



Overview

fsspec-utils is a comprehensive toolkit that extends fsspec with:

  • Multi-cloud storage configuration - Easy setup for AWS S3, Google Cloud Storage, Azure Storage, GitHub, and GitLab
  • Enhanced caching - Improved caching filesystem with monitoring and path preservation
  • Extended I/O operations - Read/write operations for JSON, CSV, Parquet with Polars/PyArrow integration
  • Utility functions - Type conversion, parallel processing, and data transformation helpers

Installation

# Basic installation
pip install fsspec-utils

# With all optional dependencies (quoted so shells such as zsh do not expand the brackets)
pip install "fsspec-utils[full]"

# Specific cloud providers
pip install "fsspec-utils[aws]"     # AWS S3 support
pip install "fsspec-utils[gcp]"     # Google Cloud Storage
pip install "fsspec-utils[azure]"   # Azure Storage

Quick Start

Basic Filesystem Operations

from fsspec_utils import filesystem

# Local filesystem
fs = filesystem("file")
files = fs.ls("/path/to/data")

# S3 with caching
fs = filesystem("s3://my-bucket/", cached=True)
data = fs.cat("data/file.txt")

Storage Configuration

from fsspec_utils import filesystem
from fsspec_utils.storage_options import AwsStorageOptions

# Configure S3 access
options = AwsStorageOptions(
    region="us-west-2",
    access_key_id="YOUR_KEY",
    secret_access_key="YOUR_SECRET"
)

fs = filesystem("s3", storage_options=options, cached=True)

Environment-based Configuration

from fsspec_utils import filesystem
from fsspec_utils.storage_options import AwsStorageOptions

# Load from environment variables
options = AwsStorageOptions.from_env()
fs = filesystem("s3", storage_options=options)
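To illustrate the idea behind environment-based configuration, here is a minimal standard-library sketch. It is not the implementation of `AwsStorageOptions.from_env()`: the helper name `aws_options_from_env` is hypothetical, and which variables the real method reads is an assumption. The variable names themselves are the standard AWS ones, and the option keys mirror the `AwsStorageOptions` fields used elsewhere in this README.

```python
import os
from typing import Mapping, Optional


def aws_options_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: collect AWS settings from environment variables."""
    env = os.environ if env is None else env
    # Standard AWS variable names mapped to illustrative option keys.
    mapping = {
        "AWS_ACCESS_KEY_ID": "access_key_id",
        "AWS_SECRET_ACCESS_KEY": "secret_access_key",
        "AWS_DEFAULT_REGION": "region",
        "AWS_ENDPOINT_URL": "endpoint_url",
    }
    # Keep only variables that are set and non-empty.
    return {key: env[var] for var, key in mapping.items() if env.get(var)}


example_env = {"AWS_ACCESS_KEY_ID": "YOUR_KEY", "AWS_DEFAULT_REGION": "us-west-2"}
env_options = aws_options_from_env(example_env)
print(env_options)
```

Passing the environment in as a mapping keeps the helper easy to test without touching the real process environment.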

Multiple Cloud Providers

from fsspec_utils import filesystem
from fsspec_utils.storage_options import (
    AwsStorageOptions,
    GcsStorageOptions,
    GitHubStorageOptions
)

# AWS S3
s3_fs = filesystem("s3", storage_options=AwsStorageOptions.from_env())

# Google Cloud Storage
gcs_fs = filesystem("gs", storage_options=GcsStorageOptions.from_env())

# GitHub repository
github_fs = filesystem("github", storage_options=GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    token="ghp_xxxx"
))

Storage Options

AWS S3

from fsspec_utils.storage_options import AwsStorageOptions

# Basic credentials
options = AwsStorageOptions(
    access_key_id="AKIAXXXXXXXX",
    secret_access_key="SECRET",
    region="us-east-1"
)

# From AWS profile
options = AwsStorageOptions.create(profile="dev")

# S3-compatible service (MinIO)
options = AwsStorageOptions(
    endpoint_url="http://localhost:9000",
    access_key_id="minioadmin",
    secret_access_key="minioadmin",
    allow_http=True
)

Google Cloud Storage

from fsspec_utils.storage_options import GcsStorageOptions

# Service account
options = GcsStorageOptions(
    token="path/to/service-account.json",
    project="my-project-123"
)

# From environment
options = GcsStorageOptions.from_env()

Azure Storage

from fsspec_utils.storage_options import AzureStorageOptions

# Account key
options = AzureStorageOptions(
    protocol="az",
    account_name="mystorageacct",
    account_key="key123..."
)

# Connection string
options = AzureStorageOptions(
    protocol="az",
    connection_string="DefaultEndpoints..."
)

GitHub

from fsspec_utils.storage_options import GitHubStorageOptions

# Public repository
options = GitHubStorageOptions(
    org="microsoft",
    repo="vscode",
    ref="main"
)

# Private repository
options = GitHubStorageOptions(
    org="myorg",
    repo="private-repo",
    token="ghp_xxxx",
    ref="develop"
)

GitLab

from fsspec_utils.storage_options import GitLabStorageOptions

# Public project
options = GitLabStorageOptions(
    project_name="group/project",
    ref="main"
)

# Private project with token
options = GitLabStorageOptions(
    project_id=12345,
    token="glpat_xxxx",
    ref="develop"
)

Enhanced Caching

from fsspec_utils import filesystem

# Enable caching with monitoring
fs = filesystem(
    "s3://my-bucket/",
    cached=True,
    cache_storage="/tmp/my_cache",
    verbose=True
)

# Cache preserves directory structure
data = fs.cat("deep/nested/path/file.txt")
# Cached at: /tmp/my_cache/deep/nested/path/file.txt
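The path-preserving behaviour shown above can be pictured as a simple join of the cache directory and the remote path. The sketch below is only an illustration of that mapping, under the assumption that the path is relative to the filesystem root; `cached_path` is a hypothetical helper, not part of the package.

```python
import posixpath


def cached_path(cache_storage: str, remote_path: str) -> str:
    """Hypothetical helper: map a remote path to its location in the cache."""
    # Strip leading slashes so the join always stays inside cache_storage.
    return posixpath.join(cache_storage, remote_path.lstrip("/"))


local = cached_path("/tmp/my_cache", "deep/nested/path/file.txt")
print(local)  # /tmp/my_cache/deep/nested/path/file.txt
```

Keeping the original directory layout makes it straightforward to inspect or clean individual cached files by hand.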

Utilities

Parallel Processing

from fsspec_utils.utils import run_parallel

# Run function in parallel
def process_file(path, multiplier=1):
    return len(path) * multiplier

results = run_parallel(
    process_file,
    ["/path1", "/path2", "/path3"],
    multiplier=2,
    n_jobs=4,
    verbose=True
)
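For readers curious about the pattern behind a call like the one above, here is a rough standard-library sketch. It is not how `run_parallel` is implemented (the package lists joblib as its parallel-processing dependency); `run_parallel_sketch` is a hypothetical stand-in that swaps in `concurrent.futures.ThreadPoolExecutor` and omits the `verbose` progress output.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def run_parallel_sketch(func, items, n_jobs=4, **kwargs):
    """Apply func to each item in a thread pool, preserving input order."""
    # Bind the shared keyword arguments once, then map over the items.
    bound = partial(func, **kwargs)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(bound, items))


def process_file(path, multiplier=1):
    return len(path) * multiplier


sketch_results = run_parallel_sketch(
    process_file, ["/path1", "/path2", "/path3"], multiplier=2
)
print(sketch_results)  # [12, 12, 12]
```

Threads suit I/O-bound work such as reading remote files; CPU-bound functions would usually need a process-based backend instead.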

Type Conversion

from fsspec_utils.utils import dict_to_dataframe, to_pyarrow_table

# Convert dict to DataFrame
data = {"col1": [1, 2, 3], "col2": [4, 5, 6]}
df = dict_to_dataframe(data)

# Convert to PyArrow table
table = to_pyarrow_table(df)

Logging

from fsspec_utils.utils import setup_logging

# Configure logging
setup_logging(level="DEBUG", format_string="{time} | {level} | {message}")

Dependencies

Core Dependencies

  • fsspec>=2023.1.0 - Filesystem interface
  • msgspec>=0.18.0 - Serialization
  • pyyaml>=6.0 - YAML support
  • requests>=2.25.0 - HTTP requests
  • loguru>=0.7.0 - Logging

Optional Dependencies

  • orjson>=3.8.0 - Fast JSON processing
  • polars>=0.19.0 - Fast DataFrames
  • pyarrow>=10.0.0 - Columnar data
  • pandas>=1.5.0 - Data analysis
  • joblib>=1.3.0 - Parallel processing
  • rich>=13.0.0 - Progress bars

Cloud Provider Dependencies

  • boto3>=1.26.0, s3fs>=2023.1.0 - AWS S3
  • gcsfs>=2023.1.0 - Google Cloud Storage
  • adlfs>=2023.1.0 - Azure Storage
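Because most of these packages are optional, it can help to check which ones are present before relying on a feature. The snippet below is a small sketch using only the standard library; `available_modules` is a hypothetical helper, and it checks importability only, not the minimum versions listed above.

```python
from importlib.util import find_spec

# Import names of the optional and cloud provider packages listed above.
OPTIONAL_MODULES = [
    "orjson", "polars", "pyarrow", "pandas", "joblib", "rich",
    "boto3", "s3fs", "gcsfs", "adlfs",
]


def available_modules(names):
    """Map each top-level module name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = available_modules(OPTIONAL_MODULES)
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```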

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Relationship to FlowerPower

This package was extracted from the FlowerPower workflow framework to provide standalone filesystem utilities that can be used independently or as a dependency in other projects.
