Set up AWS Glue Catalog Federation for lakeFS Iceberg REST Catalog

lakeFS Iceberg REST Catalog -> AWS Glue Catalog Federation

Query lakeFS-managed Apache Iceberg tables from Amazon Athena using AWS Glue Catalog Federation, with no data copying, no metadata sync, and real-time access to the current state of the lakeFS ref.

How it works

flowchart LR
    subgraph AWS
        Athena["Athena / Redshift / EMR"]
        Glue["Glue Data Catalog (Federated)"]
        LF["Lake Formation"]
        S3["Amazon S3 (Iceberg data & metadata)"]
    end

    subgraph External
        lakeFS["lakeFS\nIceberg REST Catalog"]
    end

    Athena -- "SQL query" --> Glue
    Glue -- "Iceberg REST API (OAuth2)" --> lakeFS
    lakeFS -. "metadata-location (s3://... paths)" .-> Glue
    LF -- "scoped S3\ncredentials" --> Athena
    Athena -- "read data files" --> S3
    lakeFS -. "data files live here" .-> S3

  1. Glue connects to the lakeFS Iceberg REST Catalog via OAuth2 and calls standard Iceberg REST API endpoints (listNamespaces, listTables, loadTable)
  2. lakeFS returns table metadata including physical S3 paths to metadata.json, manifest lists, and data files
  3. Lake Formation vends temporary, scoped S3 credentials to the query engine
  4. Athena (or Redshift/EMR) reads Iceberg data files directly from S3

The external catalog is only used for metadata discovery. All data access goes through S3 with Lake Formation-managed credentials.
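The metadata-discovery calls in step 1 map onto paths defined by the Iceberg REST specification. The sketch below only builds those URLs; it sends no requests. The helper name `catalog_endpoints`, the `/iceberg/api` base path, and the use of the specification's `/v1/oauth/tokens` token endpoint are assumptions for illustration, not something this project confirms.

```python
from urllib.parse import quote


def catalog_endpoints(base_url: str, namespace: str, table: str, prefix: str = "") -> dict[str, str]:
    """Build Iceberg REST URLs for the operations named in step 1.

    base_url is the catalog root (assumed here to be "<lakeFS URL>/iceberg/api").
    prefix is the optional catalog prefix from the Iceberg REST spec.
    """
    base = base_url.rstrip("/")
    root = base + "/v1"
    if prefix:
        root += "/" + quote(prefix, safe="")
    ns = quote(namespace, safe="")
    tbl = quote(table, safe="")
    return {
        # Token endpoint as defined by the Iceberg REST spec (assumed to apply here).
        "token": f"{base}/v1/oauth/tokens",
        "listNamespaces": f"{root}/namespaces",
        "listTables": f"{root}/namespaces/{ns}/tables",
        "loadTable": f"{root}/namespaces/{ns}/tables/{tbl}",
    }


endpoints = catalog_endpoints(
    "https://my-org.us-east-1.lakefscloud.io/iceberg/api",  # hypothetical base path
    namespace="default",
    table="my_table",
)
for operation, url in endpoints.items():
    print(f"{operation}: {url}")
```

The loadTable response is where the metadata-location (the s3:// path to metadata.json) comes from; everything after that point is plain S3 access.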

Quick start

Prerequisites

  • Python 3.11+
  • AWS credentials configured (e.g. ~/.aws/credentials, environment variables, or IAM role) with permissions for IAM, Glue, Lake Formation, and Secrets Manager
  • A lakeFS instance with the Iceberg REST Catalog enabled
  • A lakeFS service account (access key + secret key)

Install and run

Using uv (no install needed):

uvx lakefs-glue federate \
    --lakefs-url https://my-org.us-east-1.lakefscloud.io \
    --lakefs-repo my-repo \
    --lakefs-ref main \
    --lakefs-access-key-id AKIAIOSFODNN7EXAMPLE \
    --lakefs-secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
    --grant-to arn:aws:iam::123456789012:role/DataAnalysts

Using pip:

pip install lakefs-glue

lakefs-glue federate \
    --lakefs-url https://my-org.us-east-1.lakefscloud.io \
    --lakefs-repo my-repo \
    --lakefs-ref main \
    --lakefs-access-key-id AKIAIOSFODNN7EXAMPLE \
    --lakefs-secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
    --grant-to arn:aws:iam::123456789012:role/DataAnalysts

From source:

git clone https://github.com/treeverse/lakefs-glue.git
cd lakefs-glue

# Using uv
uv sync
uv run lakefs-glue --help

# Or using pip
pip install -e .
lakefs-glue --help

Query from Athena:

SELECT * FROM "lakefs-catalog"."default"."my_table" LIMIT 10;

Options

Option                       Required  Default         Description
--lakefs-url                 Yes       -               lakeFS server URL
--lakefs-repo                Yes       -               lakeFS repository name
--lakefs-ref                 No        main            lakeFS ref to expose (branch, tag, or commit ID)
--lakefs-access-key-id       Yes       -               lakeFS access key ID
--lakefs-secret-access-key   Yes       -               lakeFS secret access key
--catalog-name               No        lakefs-catalog  Name for the Glue federated catalog
--region                     No        us-east-1       AWS region
--grant-to                   No        -               IAM ARNs to grant catalog access (repeatable)

Multiple branches, tags, and repositories

Each federated catalog is scoped to a single lakeFS repository + ref (branch, tag, or commit ID). To expose multiple refs:

# Main branch
lakefs-glue federate \
    --lakefs-url https://my-org.lakefscloud.io \
    --lakefs-repo my-repo --lakefs-ref main \
    --catalog-name my-repo-main \
    --lakefs-access-key-id ... --lakefs-secret-access-key ...

# Dev branch
lakefs-glue federate \
    --lakefs-url https://my-org.lakefscloud.io \
    --lakefs-repo my-repo --lakefs-ref dev \
    --catalog-name my-repo-dev \
    --lakefs-access-key-id ... --lakefs-secret-access-key ...

# A tagged release (point-in-time snapshot)
lakefs-glue federate \
    --lakefs-url https://my-org.lakefscloud.io \
    --lakefs-repo my-repo --lakefs-ref v1.0 \
    --catalog-name my-repo-v1 \
    --lakefs-access-key-id ... --lakefs-secret-access-key ...

All catalogs appear independently in Athena and Lake Formation. This lets you query a stable tagged snapshot alongside the latest branch data, or compare across branches by joining different catalogs.
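Because each ref is a separate catalog, a cross-branch comparison is an ordinary multi-catalog query. The sketch below generates one such Athena query, a per-catalog row count combined with UNION ALL. The helper names are hypothetical, and the catalog, database, and table names are the example values used above.

```python
def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def row_count_comparison(catalogs: list[str], database: str, table: str) -> str:
    """Build a query that returns one row count per federated catalog."""
    selects = []
    for catalog in catalogs:
        label = catalog.replace("'", "''")  # escape for use as a string literal
        target = ".".join(quote_ident(part) for part in (catalog, database, table))
        selects.append(
            f"SELECT '{label}' AS catalog_name, COUNT(*) AS row_count FROM {target}"
        )
    return "\nUNION ALL\n".join(selects)


print(row_count_comparison(["my-repo-main", "my-repo-dev"], "default", "my_table"))
```

Quoting every identifier matters here because catalog names such as my-repo-main contain hyphens.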

What the script creates

AWS Resource              Name Pattern                        Purpose
Secrets Manager secret    {catalog-name}-secret               Stores lakeFS secret key for OAuth2
IAM role                  {catalog-name}-GlueConnectionRole   Assumed by Glue and Lake Formation
Glue Connection           {catalog-name}-connection           REST API bridge to lakeFS
Lake Formation resource   (registered connection)             Enables S3 credential vending
Glue Catalog              {catalog-name}                      The federated catalog visible in Athena
Lake Formation grants     (on the catalog)                    Permissions for specified principals

The script is idempotent: rerunning it with the same parameters updates resources in place, and rerunning it with changed parameters (e.g., a different branch or credentials) converges to the new state.
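Since every resource name is derived from the catalog name, the names can be computed up front, for example to audit an account or to write IAM policies. This sketch applies the patterns from the table above; the `FederationResources` type and `resource_names` function are hypothetical and not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FederationResources:
    secret: str      # Secrets Manager secret
    role: str        # IAM role assumed by Glue and Lake Formation
    connection: str  # Glue Connection
    catalog: str     # Glue federated catalog


def resource_names(catalog_name: str) -> FederationResources:
    """Derive resource names from the documented name patterns."""
    return FederationResources(
        secret=f"{catalog_name}-secret",
        role=f"{catalog_name}-GlueConnectionRole",
        connection=f"{catalog_name}-connection",
        catalog=catalog_name,
    )


print(resource_names("lakefs-catalog"))
```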

Remove

To remove a specific federated catalog and its associated resources (connection, Lake Formation registration, Secrets Manager secret, and IAM role):

lakefs-glue rm my-catalog

To discover and remove all federated catalogs in the account:

lakefs-glue rm --all

Use --yes to skip the confirmation prompt. Use --region to target a specific AWS region (default: us-east-1).

Limitations

  • Read-only: Glue Catalog Federation only supports queries (AWS docs). You cannot INSERT INTO, CREATE TABLE, or modify data through the federated catalog. Use Spark/PyIceberg/Trino connected directly to lakeFS for writes.
  • Single ref per catalog: Each federated catalog points to one lakeFS ref (branch, tag, or commit ID). Create multiple catalogs to expose multiple refs.
  • No nested namespaces: Glue Catalog Federation only supports single-level namespaces (AWS docs). Tables must follow a flat catalog.namespace.table structure. This is why each catalog must be scoped to a specific repo.ref: scoping flattens the lakeFS hierarchy, so namespaces within the ref are exposed as top-level databases.
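The flattening described in the last limitation can be illustrated with a small mapping function. This is a sketch of the idea only, not the project's implementation: it assumes a lakeFS table is addressed by the levels (repo, ref, namespace...), and the function name and error messages are hypothetical.

```python
def to_glue_identifier(
    catalog_name: str,
    scoped_repo: str,
    scoped_ref: str,
    lakefs_levels: tuple[str, ...],
    table: str,
) -> tuple[str, str, str]:
    """Map lakeFS levels (repo, ref, namespace...) to Glue (catalog, database, table)."""
    if len(lakefs_levels) < 3:
        raise ValueError("expected at least (repo, ref, namespace)")
    repo, ref, *namespace = lakefs_levels
    if (repo, ref) != (scoped_repo, scoped_ref):
        raise ValueError(f"{repo}.{ref} is outside the catalog scope {scoped_repo}.{scoped_ref}")
    if len(namespace) != 1:
        # Nested namespaces cannot be represented as a single Glue database.
        raise ValueError("only single-level namespaces are supported")
    return (catalog_name, namespace[0], table)


print(to_glue_identifier("my-repo-main", "my-repo", "main", ("my-repo", "main", "default"), "my_table"))
```

Levels are passed as a tuple rather than a dotted string because refs such as v1.0 contain dots themselves.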

AWS SDK/CLI configuration notes

The AWS Console handles these automatically, but when automating via SDK/CLI they need to be set explicitly:

  1. Single IAM role for Glue + Lake Formation: The same IAM role must be used for both the Glue connection's ROLE_ARN and the Lake Formation RegisterResource call. Using separate roles (even with identical policies) causes federation to fail without a clear error: get_databases times out with FederationSourceRetryableException, and zero HTTP requests reach the external catalog.

  2. SUPER_USER Lake Formation grant: The SUPER_USER permission must be granted on the federated catalog for it to appear in the Lake Formation console UI. Without it, the catalog works via CLI/SDK but is invisible in the console.

  3. WithPrivilegedAccess in RegisterResource: The RegisterResource call should include WithPrivilegedAccess=True to grant the registering principal full control over the federated resource.
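The three notes above can be captured as request parameters. The sketch below only builds dictionaries and checks the single-role rule; it makes no AWS calls. The parameter shapes reflect my understanding of the Lake Formation RegisterResource and GrantPermissions APIs and should be verified against the current boto3 documentation. The function names, the connection ARN format, and the "account-id:catalog-name" catalog ID format are assumptions.

```python
def check_single_role(glue_connection_role_arn: str, lake_formation_role_arn: str) -> None:
    """Note 1: the Glue connection and Lake Formation must use the same role."""
    if glue_connection_role_arn != lake_formation_role_arn:
        raise ValueError("Glue connection ROLE_ARN and RegisterResource role must be identical")


def register_resource_kwargs(connection_arn: str, role_arn: str) -> dict:
    """Note 3: register the connection with privileged access (assumed parameter shape)."""
    return {
        "ResourceArn": connection_arn,
        "RoleArn": role_arn,
        "WithFederation": True,
        "WithPrivilegedAccess": True,
    }


def super_user_grant_kwargs(account_id: str, catalog_name: str, principal_arn: str) -> dict:
    """Note 2: SUPER_USER grant on the federated catalog (assumed parameter shape)."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Catalog": {"Id": f"{account_id}:{catalog_name}"}},
        "Permissions": ["SUPER_USER"],
    }


role = "arn:aws:iam::123456789012:role/lakefs-catalog-GlueConnectionRole"
connection = "arn:aws:glue:us-east-1:123456789012:connection/lakefs-catalog-connection"  # assumed ARN format
check_single_role(role, role)
print(register_resource_kwargs(connection, role))
print(super_user_grant_kwargs("123456789012", "lakefs-catalog", "arn:aws:iam::123456789012:role/DataAnalysts"))
```

With boto3 installed, these dictionaries would be passed as keyword arguments to the `lakeformation` client's `register_resource` and `grant_permissions` methods.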

Technical details

See the annotated source in lakefs_glue_federation.py for a detailed walkthrough of the integration architecture, OAuth2 authentication flow, namespace mapping, and S3 credential vending.
