An fsspec implementation for lakeFS.

lakeFS-spec: An fsspec backend for lakeFS

Welcome to lakeFS-spec, a filesystem-spec backend implementation for the lakeFS data lake. Our primary goal is to streamline versioned data operations in lakeFS, enabling seamless integration with popular data science tools such as Pandas, Polars, and DuckDB directly from Python.

Highlights:

  • Simple repository operations in lakeFS
  • Easy access to underlying storage and versioning operations
  • Seamless integration with the fsspec ecosystem
  • Direct access to lakeFS objects from popular data science libraries (including Pandas, Polars, DuckDB, Hugging Face Datasets, PyArrow) with minimal code
  • Transaction support for reliable data version control
  • Smart data transfers through client-side caching (for both uploads and downloads)
  • Auto-discovery configuration

Note: We are seeking early adopters who would like to actively participate in our feedback process and shape the future of the library. If you are interested in using the library and want to get in touch with us, please reach out via GitHub Discussions.

Installation

lakeFS-spec is published on PyPI, so you can install it with your favorite package manager:

$ pip install lakefs-spec
# or, for example, with uv:
$ uv add lakefs-spec
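
To confirm that the installation succeeded, you can ask Python for the installed version. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of lakeFS-spec.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing.

    Hypothetical helper for illustration; not part of lakeFS-spec.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "lakefs-spec" is the distribution name used in the install commands above.
found = installed_version("lakefs-spec")
print(f"lakefs-spec: {found or 'not installed'}")
```

If this prints "not installed", the package most likely went into a different environment than the one running the interpreter.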

Usage

The following usage examples showcase two major ways of using lakeFS-spec: as a low-level filesystem abstraction, and through third-party (data science) libraries.

For a more thorough overview of the features and use cases for lakeFS-spec, see the user guide and tutorials sections in the documentation.

Low-level: As an fsspec filesystem

The following example shows how to upload a file, create a commit, and read back the committed data using the bare lakeFS filesystem implementation. It assumes you have already created a repository named repo and have lakectl credentials set up on your machine in ~/.lakectl.yaml (see the lakeFS quickstart guide if you are new to lakeFS and need guidance).

from pathlib import Path

from lakefs_spec import LakeFSFileSystem

REPO, BRANCH = "repo", "main"

# Prepare example local data
local_path = Path("demo.txt")
local_path.write_text("Hello, lakeFS!")

# Create the filesystem
fs = LakeFSFileSystem()  # will auto-discover config from ~/.lakectl.yaml

# Upload the file on a temporary transaction branch and create a commit
with fs.transaction(repository=REPO, base_branch=BRANCH) as tx:
    fs.put(local_path, f"{REPO}/{tx.branch.id}/{local_path.name}")
    tx.commit(message="Add demo data")

# Read back the committed file; the context manager closes the handle afterwards
with fs.open(f"{REPO}/{BRANCH}/demo.txt", "rt") as f:
    print(f.readline())  # "Hello, lakeFS!"
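
The remote paths in the example above follow the pattern repository/ref/path. When such paths are assembled in several places, a small helper can keep the slashes consistent. This is only a sketch under that assumption: lakefs_path is a hypothetical name, not a function provided by lakeFS-spec.

```python
import posixpath


def lakefs_path(repository: str, ref: str, *parts: str) -> str:
    """Join repository, ref, and object path segments into 'repo/ref/path'.

    Hypothetical helper for illustration; not part of lakeFS-spec.
    """
    if not repository or not ref:
        raise ValueError("repository and ref must be non-empty")
    # Strip stray slashes so segments never produce '//' or restart the path.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return posixpath.join(repository, ref, *cleaned)


print(lakefs_path("repo", "main", "demo.txt"))  # repo/main/demo.txt
```

Inside a transaction, the same helper could take tx.branch.id as the ref argument instead of the branch name.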

High-level: Via third-party libraries

Many widely used data science tools build on fsspec to access remote storage and can therefore work with lakeFS data lakes directly through lakeFS-spec (see the fsspec docs for details). The examples assume you have a running lakeFS instance with the quickstart repository and its sample data available.

# Pandas -- see https://pandas.pydata.org/docs/user_guide/io.html#reading-writing-remote-files
import pandas as pd

data = pd.read_parquet("lakefs://quickstart/main/lakes.parquet")
print(data.head())


# Polars -- see https://pola-rs.github.io/polars/user-guide/io/cloud-storage/
import polars as pl

data = pl.read_parquet("lakefs://quickstart/main/lakes.parquet", use_pyarrow=True)
print(data.head())


# DuckDB -- see https://duckdb.org/docs/guides/python/filesystems.html
import duckdb
import fsspec

duckdb.register_filesystem(fsspec.filesystem("lakefs"))
res = duckdb.read_parquet("lakefs://quickstart/main/lakes.parquet")
res.show()
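
The URIs in these examples have the shape lakefs://repository/ref/path. To make that structure explicit, the sketch below splits such a URI into its three components using only the standard library. It is an illustration, not a description of how lakeFS-spec parses URIs internally; the names LakeFSLocation and parse_lakefs_uri are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class LakeFSLocation(NamedTuple):
    """Components of a lakefs:// URI (hypothetical type for illustration)."""

    repository: str
    ref: str
    path: str


def parse_lakefs_uri(uri: str) -> LakeFSLocation:
    """Split 'lakefs://repository/ref/path' into its three components."""
    parts = urlsplit(uri)
    if parts.scheme != "lakefs":
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    # urlsplit places the repository in netloc and '/ref/path' in path.
    ref, _, path = parts.path.lstrip("/").partition("/")
    if not parts.netloc or not ref:
        raise ValueError(f"URI must contain a repository and a ref: {uri!r}")
    return LakeFSLocation(parts.netloc, ref, path)


loc = parse_lakefs_uri("lakefs://quickstart/main/lakes.parquet")
print(loc.repository, loc.ref, loc.path)  # quickstart main lakes.parquet
```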

Contributing

We encourage and welcome contributions from the community to enhance the project. Please check discussions or raise an issue on GitHub for any problems you encounter with the library.

For information on the general development workflow, see the contribution guide.

License

The lakeFS-spec library is distributed under the Apache License 2.0.
