Bridge HuggingFace datasets with Apache Iceberg
Faceberg
Bridge HuggingFace datasets with Apache Iceberg tables.
Installation
pip install faceberg
Quick Start
# Create a catalog and add a dataset
faceberg mycatalog init
faceberg mycatalog add stanfordnlp/imdb --config plain_text
faceberg mycatalog sync
# Query the data
faceberg mycatalog scan default.imdb --limit 5
Python API:
from faceberg import catalog
cat = catalog("mycatalog")
table = cat.load_table("default.imdb")
df = table.scan().to_pandas()
print(df.head())
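Loading an entire table into pandas can be costly for large datasets. The sketch below narrows the scan first. It assumes that `load_table` returns a PyIceberg-style table whose `scan()` accepts `selected_fields` and `limit` keyword arguments, as PyIceberg's `Table.scan` does; the text above does not confirm this for Faceberg. `preview` is a hypothetical helper name, and the column names in the usage comment are assumptions about the IMDB schema.

```python
from typing import Any, Sequence


def preview(table: Any, columns: Sequence[str] = ("*",), limit: int = 5) -> Any:
    """Return up to `limit` rows of `table`, restricted to `columns`.

    Assumption: `table.scan` takes PyIceberg-style `selected_fields` and
    `limit` keyword arguments, and the scan result offers `to_pandas()`.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    scan = table.scan(selected_fields=tuple(columns), limit=limit)
    return scan.to_pandas()


# Usage (requires faceberg and a synced catalog; column names are assumed):
#   from faceberg import catalog
#   table = catalog("mycatalog").load_table("default.imdb")
#   print(preview(table, columns=("text", "label")))
```

Pushing the column selection and row limit into the scan lets the reader skip data it does not need, rather than trimming the DataFrame after everything has been loaded.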
Documentation:
- Getting Started - Quickstart guide
- Local Catalogs - Use local catalogs for testing
- DuckDB Integration - Query with SQL
- Pandas Integration - Load into DataFrames
How It Works
Faceberg creates lightweight Iceberg metadata that points to original HuggingFace dataset files:
HuggingFace Dataset          Your Catalog
┌─────────────────┐         ┌──────────────────┐
│ org/dataset     │         │ mycatalog/       │
│ ├── train.pq ◄──┼─────────┼─ default/        │
│ └── test.pq  ◄──┼─────────┼─   └── imdb/     │
└─────────────────┘         │      └── metadata/
                            └──────────────────┘
No data is copied; only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.
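The pointer idea can be illustrated with a toy model. The sketch below is not Faceberg's implementation and not the real Iceberg metadata format (which uses a `metadata.json`, manifest lists, and manifests). It only shows a table record that references existing Parquet files by URI. All class and function names are hypothetical; the `hf://datasets/...` scheme is the one used elsewhere in this README.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFileRef:
    """A pointer to an existing Parquet file; no bytes are copied."""
    uri: str
    split: str


@dataclass
class TablePointer:
    """Toy stand-in for table metadata: an identifier plus file pointers."""
    identifier: str
    files: list[DataFileRef] = field(default_factory=list)


def point_at_dataset(
    repo_id: str,
    parquet_paths: dict[str, str],
    namespace: str = "default",
) -> TablePointer:
    """Build a pointer record for a dataset repo.

    `parquet_paths` maps split name -> file path inside the repo.
    The table name is taken from the last segment of `repo_id`,
    mirroring `stanfordnlp/imdb` -> `default.imdb` in the examples.
    """
    table_name = repo_id.split("/")[-1]
    refs = [
        DataFileRef(uri=f"hf://datasets/{repo_id}/{path}", split=split)
        for split, path in sorted(parquet_paths.items())
    ]
    return TablePointer(identifier=f"{namespace}.{table_name}", files=refs)


pointer = point_at_dataset("org/dataset", {"train": "train.pq", "test": "test.pq"})
print(pointer.identifier)
for ref in pointer.files:
    print(ref.split, ref.uri)
```

Because the record stores only URIs, creating or deleting it never touches the dataset files themselves.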
Usage
CLI Commands
# Initialize catalog
faceberg mycatalog init
# Add datasets
faceberg mycatalog add openai/gsm8k --config main
# Sync datasets (creates Iceberg metadata)
faceberg mycatalog sync
# List tables
faceberg mycatalog list
# Show table info
faceberg mycatalog info default.gsm8k
# Scan data
faceberg mycatalog scan default.gsm8k --limit 10
# Start REST server
faceberg mycatalog serve --port 8181
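Once `serve` is running, an Iceberg client should be able to connect to it. The sketch below assumes the server exposes the standard Iceberg REST catalog API at the root of `http://localhost:8181`, which the text above does not spell out. It uses PyIceberg's `load_catalog` with `type` and `uri` properties; `rest_catalog_properties` and `connect` are hypothetical helper names.

```python
def rest_catalog_properties(host: str = "localhost", port: int = 8181) -> dict[str, str]:
    """Properties for a PyIceberg REST catalog pointing at the local server.

    Assumption: the server speaks the Iceberg REST protocol at the URL root.
    """
    return {"type": "rest", "uri": f"http://{host}:{port}"}


def connect(name: str = "faceberg", host: str = "localhost", port: int = 8181):
    """Open the served catalog with PyIceberg (imported lazily)."""
    from pyiceberg.catalog import load_catalog  # third-party dependency

    return load_catalog(name, **rest_catalog_properties(host, port))


# Usage (requires pyiceberg and a running `faceberg mycatalog serve --port 8181`):
#   cat = connect()
#   print(cat.list_tables("default"))
```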
Remote Catalogs on HuggingFace Hub
# Initialize remote catalog
export HF_TOKEN=your_token
faceberg org/catalog-repo init
# Add and sync datasets
faceberg org/catalog-repo add deepmind/code_contests --config default
faceberg org/catalog-repo sync
# Serve remote catalog
faceberg org/catalog-repo serve
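The same workflow can be driven from Python, for example in a scheduled job. The sketch below builds the `init`, `add`, and `sync` commands shown above and fails early when `HF_TOKEN` is missing. Treating the token as mandatory is an assumption based on the `export` line; the function names are hypothetical.

```python
import os
import subprocess
from typing import Mapping, Optional


def workflow_commands(catalog: str, dataset: str, config: str) -> list[list[str]]:
    """Return the argv lists for init, add, and sync, in order."""
    return [
        ["faceberg", catalog, "init"],
        ["faceberg", catalog, "add", dataset, "--config", config],
        ["faceberg", catalog, "sync"],
    ]


def run_remote_workflow(
    catalog: str,
    dataset: str,
    config: str,
    env: Optional[Mapping[str, str]] = None,
) -> None:
    """Run the workflow, refusing to start without HF_TOKEN in the environment."""
    env = dict(os.environ if env is None else env)
    if not env.get("HF_TOKEN"):
        raise RuntimeError("HF_TOKEN is not set; export it before using a remote catalog")
    for argv in workflow_commands(catalog, dataset, config):
        # check=True stops at the first failing command.
        subprocess.run(argv, check=True, env=env)


# Usage (requires faceberg on PATH and a valid token):
#   run_remote_workflow("org/catalog-repo", "deepmind/code_contests", "default")
```

Passing argument lists rather than a shell string avoids quoting problems with repository names.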
Query with DuckDB
import duckdb
conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs")
conn.execute("INSTALL iceberg; LOAD iceberg")
# Query local catalog
result = conn.execute("""
SELECT * FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
LIMIT 10
""").fetchall()
# Query remote catalog
result = conn.execute("""
SELECT * FROM iceberg_scan('hf://datasets/org/catalog/default/table/metadata/v1.metadata.json')
LIMIT 10
""").fetchall()
Development
git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .
License
Apache 2.0
Download files
File details
Details for the file faceberg-0.1.0.tar.gz.
File metadata
- Download URL: faceberg-0.1.0.tar.gz
- Upload date:
- Size: 82.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ed3e1dab50422021e7f4f70db6f8eb73e9715779e62ecc3700f70a423f7cb21d |
| MD5 | 00a77a278b47e659491e476737445d21 |
| BLAKE2b-256 | 25466a6db838101c548902a472410fba7d775e3dd2f4b8b67751a5b7ca26ba18 |
File details
Details for the file faceberg-0.1.0-py3-none-any.whl.
File metadata
- Download URL: faceberg-0.1.0-py3-none-any.whl
- Upload date:
- Size: 91.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1b49bdc25941a5ab14cfc7784b9c5b88209d78796e32dd3b2fa96276e656bfdf |
| MD5 | e4977a111be10911f1feb82958a2b75e |
| BLAKE2b-256 | 3a478a5c0f9c56ebff75e926e043f1b7eb88e8dc939d7904f7598278580bc73b |