Faceberg
Bridge HuggingFace datasets with Apache Iceberg tables — no data copying, just metadata.
Faceberg maps HuggingFace datasets to Apache Iceberg tables. Your catalog metadata lives on HuggingFace Spaces with an auto-deployed REST API, and any Iceberg-compatible query engine can access the data.
Installation
pip install faceberg
Quick Start
export HF_TOKEN=your_huggingface_token
# Create a catalog on HuggingFace Hub
faceberg user/mycatalog init
# Add datasets
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add openai/gsm8k --config main
# Open an interactive DuckDB shell, then run the SQL below inside it
faceberg user/mycatalog quack
SELECT label, substr(text, 1, 100) as preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;
How It Works
                      HuggingFace Hub
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  ┌─────────────────────┐    ┌─────────────────────────┐ │
│  │  HF Datasets        │    │  HF Spaces (Catalog)    │ │
│  │  (Original Parquet) │◄───│  • Iceberg metadata     │ │
│  │                     │    │  • REST API endpoint    │ │
│  │  stanfordnlp/imdb/  │    │  • faceberg.yml         │ │
│  │   └── *.parquet     │    │                         │ │
│  └─────────────────────┘    └───────────┬─────────────┘ │
│                                         │               │
└─────────────────────────────────────────┼───────────────┘
                                          │ Iceberg REST API
                                          ▼
                              ┌─────────────────────────┐
                              │     Query Engines       │
                              │  DuckDB, Pandas, Spark  │
                              └─────────────────────────┘
No data is copied — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.
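Because the catalog is served over the Iceberg REST protocol, a generic PyIceberg client should be able to read it without Faceberg installed. The following is a minimal sketch under that assumption: the helper names `rest_catalog_properties` and `preview_table` are hypothetical, the Space URL and table identifier are the example values used elsewhere in this README, and only PyIceberg's standard `load_catalog`, `load_table`, and `scan` calls are used.

```python
def rest_catalog_properties(space_url: str) -> dict:
    """Build PyIceberg properties for an Iceberg REST catalog served at space_url."""
    # "type" and "uri" are the standard PyIceberg REST catalog properties.
    return {"type": "rest", "uri": space_url.rstrip("/")}


def preview_table(space_url: str, identifier: str, limit: int = 10):
    """Load a table through the REST catalog and return a few rows as a DataFrame."""
    # Imported lazily so the helper above works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    # The catalog name "faceberg" is only a local label for this client (assumption).
    catalog = load_catalog("faceberg", **rest_catalog_properties(space_url))
    table = catalog.load_table(identifier)
    # Only metadata comes from the catalog; the rows are read from the
    # Parquet files that the metadata points at.
    return table.scan(limit=limit).to_pandas()
```

For example, `preview_table("https://user-mycatalog.hf.space", "stanfordnlp.imdb")` would be expected to return the first ten rows, provided the Space is public and running.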
Python API
import os
from faceberg import catalog
cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")
df = table.scan(limit=100).to_pandas()
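The `scan(limit=...)` call above resembles PyIceberg's `Table.scan`, which also accepts `selected_fields` for column projection. Assuming the loaded table behaves like a PyIceberg table (not confirmed here), a small hypothetical helper such as `preview` could fetch only the columns you need; the default column names are the ones queried in the Quick Start.

```python
def preview(table, columns=("label", "text"), limit=100):
    """Return the first `limit` rows of the selected columns as a pandas DataFrame."""
    # Assumption: `table` behaves like a PyIceberg Table, whose scan()
    # accepts `selected_fields` (a tuple of column names) and `limit`.
    return table.scan(selected_fields=tuple(columns), limit=limit).to_pandas()
```

Projecting columns in the scan lets the reader skip unneeded Parquet column chunks, which matters for wide datasets.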
Share Your Catalog
Your catalog is accessible to anyone via the REST API:
import duckdb
conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)")
result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
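The endpoint in the example above appears to follow the usual HuggingFace Spaces pattern, where `user/mycatalog` is served at `https://user-mycatalog.hf.space`. The sketch below derives that URL and the matching `ATTACH` statement from a catalog id. Both function names are hypothetical, and the slug rule (lowercase, non-alphanumeric characters replaced by hyphens) is an assumption about how Spaces subdomains are formed, so check it against your Space's actual URL.

```python
import re


def space_url(catalog_id: str) -> str:
    """Derive the assumed Space endpoint for a catalog id like 'user/mycatalog'."""
    owner, name = catalog_id.split("/", 1)
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", f"{owner}-{name}".lower()).strip("-")
    return f"https://{slug}.hf.space"


def attach_statement(catalog_id: str, alias: str = "cat") -> str:
    """Build the DuckDB ATTACH statement used in the example above."""
    return f"ATTACH '{space_url(catalog_id)}' AS {alias} (TYPE ICEBERG)"


print(space_url("user/mycatalog"))  # https://user-mycatalog.hf.space
```

The resulting string can be passed to `conn.execute(...)` in place of the hard-coded `ATTACH` statement.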
Documentation
- Getting Started — Full quickstart guide
- Local Catalogs — Use local catalogs for development
- DuckDB Integration — Advanced SQL queries
- Pandas Integration — Load into DataFrames
Development
git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .
License
Apache 2.0