Bridge HuggingFace datasets with Apache Iceberg

Faceberg

Bridge HuggingFace datasets with Apache Iceberg tables — no data copying, just metadata.

Faceberg maps HuggingFace datasets to Apache Iceberg tables. Your catalog metadata lives on HuggingFace Spaces with an auto-deployed REST API, and any Iceberg-compatible query engine can access the data.

Installation

pip install faceberg

Quick Start

export HF_TOKEN=your_huggingface_token

# Create a catalog on HuggingFace Hub
faceberg user/mycatalog init

# Add datasets
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add openai/gsm8k --config main

# Query with interactive DuckDB shell
faceberg user/mycatalog quack
-- Then, at the DuckDB prompt:
SELECT label, substr(text, 1, 100) AS preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;

How It Works

HuggingFace Hub
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  HF Datasets        │    │  HF Spaces (Catalog)    │ │
│  │  (Original Parquet) │◄───│  • Iceberg metadata     │ │
│  │                     │    │  • REST API endpoint    │ │
│  │  stanfordnlp/imdb/  │    │  • faceberg.yml         │ │
│  │   └── *.parquet     │    │                         │ │
│  └─────────────────────┘    └───────────┬─────────────┘ │
│                                         │               │
└─────────────────────────────────────────┼───────────────┘
                                          │ Iceberg REST API
                                          ▼
                              ┌─────────────────────────┐
                              │     Query Engines       │
                              │  DuckDB, Pandas, Spark  │
                              └─────────────────────────┘

No data is copied — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.
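
To make the "metadata only" idea concrete, here is a minimal, illustrative sketch. It is not Faceberg's actual implementation: `DataFileEntry` and `build_entries` are hypothetical names, the file names and row counts are placeholders, and the `hf://datasets/...` URI form is an assumption about how the original Parquet files might be referenced. The point is only that catalog entries can describe files where they already live, without rewriting or copying them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical, simplified stand-in for one data-file record in table metadata."""

    file_path: str      # URI of a Parquet file that already exists on the Hub
    record_count: int   # number of rows in that file


def build_entries(dataset_id: str, parquet_files: dict[str, int]) -> list[DataFileEntry]:
    """Create metadata records that point at existing Parquet files in place."""
    return [
        DataFileEntry(
            file_path=f"hf://datasets/{dataset_id}/{name}",  # assumed URI layout
            record_count=rows,
        )
        for name, rows in sorted(parquet_files.items())
    ]


# Placeholder file names and row counts, for illustration only.
entries = build_entries(
    "stanfordnlp/imdb",
    {"part-0.parquet": 10, "part-1.parquet": 20},
)
for entry in entries:
    print(entry.file_path, entry.record_count)
```

Nothing in this sketch reads or writes Parquet data; it only produces small records that reference it, which is the property the diagram above describes.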

Python API

import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")
df = table.scan(limit=100).to_pandas()
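
In the examples above, the dataset `stanfordnlp/imdb` is loaded as the table `stanfordnlp.imdb`, so the dataset owner appears to act as the namespace and the dataset name as the table name. The following sketch captures that mapping in a hypothetical helper, `table_identifier`, which is not part of Faceberg's API. It assumes a plain `owner/name` dataset id and does not handle names that themselves contain dots.

```python
def table_identifier(dataset_id: str) -> str:
    """Turn an 'owner/name' dataset id into a 'namespace.table' identifier.

    Hypothetical helper: the mapping is inferred from the examples in this
    README, not taken from Faceberg's source.
    """
    owner, sep, name = dataset_id.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected 'owner/name', got {dataset_id!r}")
    return f"{owner}.{name}"


print(table_identifier("stanfordnlp/imdb"))
print(table_identifier("openai/gsm8k"))
```

The result could then be passed to `cat.load_table(...)` as in the snippet above.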

Share Your Catalog

Anyone can query your catalog through its REST API endpoint, for example from DuckDB:

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)")

result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
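
The endpoint in the `ATTACH` statement follows from the catalog id used earlier: `user/mycatalog` becomes `https://user-mycatalog.hf.space`. As a small sketch, the hypothetical helper `space_url` below (not part of Faceberg) builds that URL. It assumes the Space subdomain is simply the lowercased owner and name joined by a hyphen; ids containing underscores or other special characters may map differently.

```python
def space_url(catalog_id: str) -> str:
    """Build the assumed REST endpoint URL for a catalog hosted on HF Spaces.

    Hypothetical helper: the 'owner-name.hf.space' pattern is inferred from
    the example above.
    """
    owner, sep, name = catalog_id.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected 'owner/name', got {catalog_id!r}")
    return f"https://{owner}-{name}.hf.space".lower()


print(space_url("user/mycatalog"))
```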

Documentation

Read the docs →

Development

git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .

License

Apache 2.0

Download files

Source Distribution

faceberg-0.2.1.tar.gz (80.4 kB)

Uploaded Source

Built Distribution

faceberg-0.2.1-py3-none-any.whl (88.6 kB)

Uploaded Python 3

File details

Details for the file faceberg-0.2.1.tar.gz.

File metadata

  • Download URL: faceberg-0.2.1.tar.gz
  • Upload date:
  • Size: 80.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for faceberg-0.2.1.tar.gz
Algorithm    Hash digest
SHA256       200e68927d4ae15da4624da6e8682f4c8b4a7336d435fc0439a125cd0743ecb5
MD5          7437d138a8ba1e3ee2aeb694376073a6
BLAKE2b-256  e300b4a1676608b0d9c623f7669c26ce3f219ad52e90492f084b9900583935d8

Provenance

The following attestation bundles were made for faceberg-0.2.1.tar.gz:

Publisher: main.yml on kszucs/faceberg

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file faceberg-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: faceberg-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 88.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for faceberg-0.2.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       e5db6d7c7a55c455a59ba029b78284b5389c26ebdb7c8570db5f876dcca5007c
MD5          ad1e7aa23f118e7e196002b5e0508e15
BLAKE2b-256  5b6f251f11075a2e02331d8649517edd5e18386bcc7cd6dc1f2b761655b97a91

Provenance

The following attestation bundles were made for faceberg-0.2.1-py3-none-any.whl:

Publisher: main.yml on kszucs/faceberg
