Faceberg

Bridge HuggingFace datasets with Apache Iceberg tables.

Installation

pip install faceberg

Quick Start

# Create a catalog and add a dataset
faceberg mycatalog init
faceberg mycatalog add stanfordnlp/imdb --config plain_text
faceberg mycatalog sync

# Query the data
faceberg mycatalog scan default.imdb --limit 5

Python API:

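from faceberg import catalog

# Open the catalog created in the CLI steps above
cat = catalog("mycatalog")
# Load the synced table by its namespace.table identifier
table = cat.load_table("default.imdb")
# Read the scan results into a pandas DataFrame (requires pandas)
df = table.scan().to_pandas()
print(df.head())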

How It Works

Faceberg creates lightweight Iceberg metadata that points to the original HuggingFace dataset files:

HuggingFace Dataset          Your Catalog
┌─────────────────┐         ┌──────────────────┐
│ org/dataset     │         │ mycatalog/       │
│ ├── train.pq ◄──┼─────────┼─ default/        │
│ └── test.pq  ◄──┼─────────┼─   └── imdb/     │
└─────────────────┘         │       └── metadata/
                            └──────────────────┘

No data is copied; only metadata is created. Query the tables with DuckDB, PyIceberg, Spark, or any other Iceberg-compatible tool.
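
To make the "metadata only" idea concrete, here is a minimal toy model in plain Python. It is not Faceberg's actual data structure: the class names `DataFileRef` and `TableEntry` and the helper `register` are hypothetical, and the file names `train.pq` and `test.pq` are the placeholders from the diagram above. The point is only that a table entry holds references to remote files rather than the rows themselves.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFileRef:
    """Hypothetical pointer to a Parquet file that stays on the Hub."""
    split: str
    uri: str


@dataclass
class TableEntry:
    """Toy stand-in for a table's metadata: references only, no rows."""
    identifier: str
    files: list = field(default_factory=list)


def hf_uri(repo_id: str, path: str) -> str:
    """Build an hf:// URI for a file inside a dataset repository."""
    return f"hf://datasets/{repo_id}/{path}"


def register(identifier: str, repo_id: str, paths_by_split: dict) -> TableEntry:
    """Record where each split lives without downloading anything."""
    entry = TableEntry(identifier)
    for split, path in paths_by_split.items():
        entry.files.append(DataFileRef(split, hf_uri(repo_id, path)))
    return entry


# Placeholder repository and file names taken from the diagram.
entry = register(
    "default.imdb",
    "org/dataset",
    {"train": "train.pq", "test": "test.pq"},
)
for ref in entry.files:
    print(ref.split, "->", ref.uri)
```

Real Iceberg metadata carries much more (schemas, snapshots, manifests), but the data files are referenced by location in the same spirit.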

Usage

CLI Commands

# Initialize catalog
faceberg mycatalog init

# Add datasets
faceberg mycatalog add openai/gsm8k --config main

# Sync datasets (creates Iceberg metadata)
faceberg mycatalog sync

# List tables
faceberg mycatalog list

# Show table info
faceberg mycatalog info default.gsm8k

# Scan data
faceberg mycatalog scan default.gsm8k --limit 10

# Start REST server
faceberg mycatalog serve --port 8181
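
Once the server is running, an Iceberg client should be able to connect to it. The sketch below uses PyIceberg's `load_catalog` with a REST configuration. It assumes that `serve` exposes the standard Iceberg REST catalog protocol at `http://localhost:8181` without authentication, and that `pyiceberg` is installed; the catalog label `"faceberg"` and the helper names are arbitrary choices for illustration.

```python
def rest_catalog_properties(host: str = "localhost", port: int = 8181) -> dict:
    """Properties for a PyIceberg REST catalog pointing at the server."""
    return {"type": "rest", "uri": f"http://{host}:{port}"}


def load_gsm8k(host: str = "localhost", port: int = 8181):
    """Connect to a running server and load the default.gsm8k table.

    Assumes `faceberg mycatalog serve --port 8181` is running and that
    the endpoint speaks the Iceberg REST protocol (an assumption here).
    """
    # Imported lazily so the helper above works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("faceberg", **rest_catalog_properties(host, port))
    return catalog.load_table("default.gsm8k")


# Show the connection settings; call load_gsm8k() only with a live server.
print(rest_catalog_properties())
```

With the server up, `load_gsm8k().scan(limit=10).to_pandas()` would mirror the `scan --limit 10` command above.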

Remote Catalogs on HuggingFace Hub

# Initialize remote catalog
export HF_TOKEN=your_token
faceberg org/catalog-repo init

# Add and sync datasets
faceberg org/catalog-repo add deepmind/code_contests --config default
faceberg org/catalog-repo sync

# Serve remote catalog
faceberg org/catalog-repo serve
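
For automation, the same add-and-sync sequence can be driven from Python with `subprocess`. This is a sketch that assumes the `faceberg` executable is on `PATH`; the helper names `faceberg_argv` and `add_and_sync` are hypothetical, and the `HF_TOKEN` check simply mirrors the `export` step shown above for remote catalogs.

```python
import os
import shutil
import subprocess


def faceberg_argv(catalog: str, command: str, *args: str) -> list:
    """Build the argument list for `faceberg <catalog> <command> ...`."""
    return ["faceberg", catalog, command, *args]


def add_and_sync(catalog: str, dataset: str, config: str, remote: bool = False) -> None:
    """Run `add` followed by `sync`, stopping at the first failure."""
    if shutil.which("faceberg") is None:
        raise RuntimeError("faceberg CLI not found on PATH")
    if remote and "HF_TOKEN" not in os.environ:
        # Assumption: remote catalogs need a token, as in the steps above.
        raise RuntimeError("set HF_TOKEN before using a remote catalog")
    steps = [
        faceberg_argv(catalog, "add", dataset, "--config", config),
        faceberg_argv(catalog, "sync"),
    ]
    for argv in steps:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(argv, check=True)


# Print the command that would be run, without executing it.
print(" ".join(faceberg_argv("org/catalog-repo", "sync")))
```

Passing arguments as a list avoids shell quoting issues with dataset names.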

Query with DuckDB

import duckdb

conn = duckdb.connect()
conn.execute("INSTALL httpfs; LOAD httpfs")
conn.execute("INSTALL iceberg; LOAD iceberg")

# Query local catalog
result = conn.execute("""
    SELECT * FROM iceberg_scan('mycatalog/default/imdb/metadata/v1.metadata.json')
    LIMIT 10
""").fetchall()

# Query remote catalog
result = conn.execute("""
    SELECT * FROM iceberg_scan('hf://datasets/org/catalog/default/table/metadata/v1.metadata.json')
    LIMIT 10
""").fetchall()
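
Both queries follow the same path pattern, so a small helper can build them. The layout below is inferred only from the two example paths above; the function names and the `version` parameter are assumptions, and a table's current metadata file may not always be `v1`.

```python
def metadata_path(catalog: str, identifier: str, *,
                  remote: bool = False, version: int = 1) -> str:
    """Build the metadata file path for a `namespace.table` identifier."""
    namespace, table = identifier.split(".", 1)
    # Remote catalogs live in a HuggingFace dataset repository.
    root = f"hf://datasets/{catalog}" if remote else catalog
    return f"{root}/{namespace}/{table}/metadata/v{version}.metadata.json"


def scan_query(path: str, limit: int = 10) -> str:
    """Build the iceberg_scan query used in the examples above."""
    if "'" in path:
        raise ValueError("path must not contain single quotes")
    return f"SELECT * FROM iceberg_scan('{path}') LIMIT {int(limit)}"


local = metadata_path("mycatalog", "default.imdb")
remote = metadata_path("org/catalog", "default.table", remote=True)
print(scan_query(local))
print(scan_query(remote))
```

The resulting strings can be passed to `conn.execute(...)` exactly as in the DuckDB snippet.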

Development

git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .

License

Apache 2.0
