Set up AWS Glue Catalog Federation for lakeFS Iceberg REST Catalog
lakeFS Iceberg REST Catalog -> AWS Glue Catalog Federation
Query lakeFS-managed Apache Iceberg tables from Amazon Athena using AWS Glue Catalog Federation -- no data copying, no metadata sync, real-time access.
How it works
flowchart LR
subgraph AWS
Athena["Athena / Redshift / EMR"]
Glue["Glue Data Catalog (Federated)"]
LF["Lake Formation"]
S3["Amazon S3 (Iceberg data & metadata)"]
end
subgraph External
lakeFS["lakeFS\nIceberg REST Catalog"]
end
Athena -- "SQL query" --> Glue
Glue -- "Iceberg REST API (OAuth2)" --> lakeFS
lakeFS -. "metadata-location (s3://... paths)" .-> Glue
LF -- "scoped S3\ncredentials" --> Athena
Athena -- "read data files" --> S3
lakeFS -. "data files live here" .-> S3
- Glue connects to the lakeFS Iceberg REST Catalog via OAuth2 and calls standard Iceberg REST API endpoints (listNamespaces, listTables, loadTable)
- lakeFS returns table metadata including physical S3 paths to metadata.json, manifest lists, and data files
- Lake Formation vends temporary, scoped S3 credentials to the query engine
- Athena (or Redshift/EMR) reads Iceberg data files directly from S3
The external catalog is only used for metadata discovery. All data access goes through S3 with Lake Formation-managed credentials.
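To make the metadata-only role of the catalog concrete, here is a minimal sketch of what a client does with a loadTable response: it reads the metadata-location field and resolves it to an S3 bucket and key. The sample response is hypothetical and heavily trimmed (the bucket name and object key are invented), and the helper names split_s3_uri and metadata_location are illustrative, not part of lakefs-glue.

```python
import json
from urllib.parse import urlparse

# Hypothetical, trimmed loadTable response. The bucket and key are made up.
SAMPLE_LOAD_TABLE_RESPONSE = """
{
  "metadata-location": "s3://example-bucket/data/my_table/metadata/00001-abc.metadata.json",
  "metadata": {"format-version": 2, "location": "s3://example-bucket/data/my_table"},
  "config": {}
}
"""


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def metadata_location(load_table_response: str) -> tuple[str, str]:
    """Return the (bucket, key) of the table's current metadata.json."""
    body = json.loads(load_table_response)
    return split_s3_uri(body["metadata-location"])


if __name__ == "__main__":
    bucket, key = metadata_location(SAMPLE_LOAD_TABLE_RESPONSE)
    print(bucket, key)
```

Everything after this lookup (manifest lists, data files) is read from S3 directly, which is why the catalog never sits in the data path.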
Quick start
Prerequisites
- Python 3.11+
- AWS credentials configured (e.g. ~/.aws/credentials, environment variables, or IAM role) with permissions for IAM, Glue, Lake Formation, and Secrets Manager
- A lakeFS instance with the Iceberg REST Catalog enabled
- A lakeFS service account (access key + secret key)
Install and run
Using uv (no install needed):
uvx lakefs-glue federate \
--lakefs-url https://my-org.us-east-1.lakefscloud.io \
--lakefs-repo my-repo \
--lakefs-ref main \
--lakefs-access-key-id AKIAIOSFODNN7EXAMPLE \
--lakefs-secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
--grant-to arn:aws:iam::123456789012:role/DataAnalysts
Using pip:
pip install lakefs-glue
lakefs-glue federate \
--lakefs-url https://my-org.us-east-1.lakefscloud.io \
--lakefs-repo my-repo \
--lakefs-ref main \
--lakefs-access-key-id AKIAIOSFODNN7EXAMPLE \
--lakefs-secret-access-key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
--grant-to arn:aws:iam::123456789012:role/DataAnalysts
From source:
git clone https://github.com/treeverse/lakefs-glue.git
cd lakefs-glue
# Using uv
uv sync
uv run lakefs-glue --help
# Or using pip
pip install -e .
lakefs-glue --help
Query from Athena:
SELECT * FROM "lakefs-catalog"."default"."my_table" LIMIT 10;
Options
| Option | Required | Default | Description |
|---|---|---|---|
| --lakefs-url | Yes | - | lakeFS server URL |
| --lakefs-repo | Yes | - | lakeFS repository name |
| --lakefs-ref | No | main | lakeFS ref to expose (branch, tag, or commit ID) |
| --lakefs-access-key-id | Yes | - | lakeFS access key ID |
| --lakefs-secret-access-key | Yes | - | lakeFS secret access key |
| --catalog-name | No | lakefs-catalog | Name for the Glue federated catalog |
| --region | No | us-east-1 | AWS region |
| --grant-to | No | - | IAM ARNs to grant catalog access (repeatable) |
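The required flags, defaults, and the repeatable --grant-to can be read as the following argparse sketch. This is a hypothetical parser that only mirrors the table above; it is not the tool's own code, and the program name federate-sketch is invented. Modelling the repeatable flag with action="append" is an assumption about how repeats are collected.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table (not lakefs-glue's code)."""
    parser = argparse.ArgumentParser(prog="federate-sketch")
    parser.add_argument("--lakefs-url", required=True)
    parser.add_argument("--lakefs-repo", required=True)
    parser.add_argument("--lakefs-ref", default="main")
    parser.add_argument("--lakefs-access-key-id", required=True)
    parser.add_argument("--lakefs-secret-access-key", required=True)
    parser.add_argument("--catalog-name", default="lakefs-catalog")
    parser.add_argument("--region", default="us-east-1")
    # Assumption: each repeat of --grant-to adds one IAM ARN to a list.
    parser.add_argument("--grant-to", action="append", default=[])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "--lakefs-url", "https://my-org.us-east-1.lakefscloud.io",
        "--lakefs-repo", "my-repo",
        "--lakefs-access-key-id", "AKIAIOSFODNN7EXAMPLE",
        "--lakefs-secret-access-key", "placeholder-secret",
        "--grant-to", "arn:aws:iam::123456789012:role/DataAnalysts",
    ])
    print(args.lakefs_ref, args.catalog_name, args.region, args.grant_to)
```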
Multiple branches, tags, and repositories
Each federated catalog is scoped to a single lakeFS repository + ref (branch, tag, or commit ID). To expose multiple refs:
# Main branch
lakefs-glue federate \
--lakefs-url https://my-org.lakefscloud.io \
--lakefs-repo my-repo --lakefs-ref main \
--catalog-name my-repo-main \
--lakefs-access-key-id ... --lakefs-secret-access-key ...
# Dev branch
lakefs-glue federate \
--lakefs-url https://my-org.lakefscloud.io \
--lakefs-repo my-repo --lakefs-ref dev \
--catalog-name my-repo-dev \
--lakefs-access-key-id ... --lakefs-secret-access-key ...
# A tagged release (point-in-time snapshot)
lakefs-glue federate \
--lakefs-url https://my-org.lakefscloud.io \
--lakefs-repo my-repo --lakefs-ref v1.0 \
--catalog-name my-repo-v1 \
--lakefs-access-key-id ... --lakefs-secret-access-key ...
All catalogs appear independently in Athena and Lake Formation. This lets you query a stable tagged snapshot alongside the latest branch data, or compare across branches by joining different catalogs.
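When many refs need to be exposed, the per-ref commands above can be generated rather than typed. The sketch below builds one argument list per ref using only the flags documented in the options table. The naming helper catalog_name_for is hypothetical: it assumes catalog names should be limited to lowercase letters, digits, and hyphens, which is a conservative guess rather than a documented rule, so for v1.0 it produces my-repo-v1-0 instead of the hand-picked my-repo-v1 used above.

```python
import re
import shlex


def catalog_name_for(repo: str, ref: str) -> str:
    """Hypothetical naming scheme: '<repo>-<ref>', unsafe characters become '-'."""
    return re.sub(r"[^a-z0-9-]+", "-", f"{repo}-{ref}".lower()).strip("-")


def federate_argv(url: str, repo: str, ref: str,
                  access_key_id: str, secret_access_key: str) -> list[str]:
    """Build the argument list for one 'lakefs-glue federate' invocation."""
    return [
        "lakefs-glue", "federate",
        "--lakefs-url", url,
        "--lakefs-repo", repo,
        "--lakefs-ref", ref,
        "--catalog-name", catalog_name_for(repo, ref),
        "--lakefs-access-key-id", access_key_id,
        "--lakefs-secret-access-key", secret_access_key,
    ]


if __name__ == "__main__":
    for ref in ["main", "dev", "v1.0"]:
        argv = federate_argv("https://my-org.lakefscloud.io", "my-repo", ref,
                             "<access-key-id>", "<secret-access-key>")
        # Print a shell-safe command line; pass argv to subprocess.run to execute.
        print(shlex.join(argv))
```

Passing the list to subprocess.run (rather than a joined string) avoids shell quoting problems with secret keys that contain characters such as /.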
What the script creates
| AWS Resource | Name Pattern | Purpose |
|---|---|---|
| Secrets Manager secret | {catalog-name}-secret | Stores lakeFS secret key for OAuth2 |
| IAM role | {catalog-name}-GlueConnectionRole | Assumed by Glue and Lake Formation |
| Glue Connection | {catalog-name}-connection | REST API bridge to lakeFS |
| Lake Formation resource | (registered connection) | Enables S3 credential vending |
| Glue Catalog | {catalog-name} | The federated catalog visible in Athena |
| Lake Formation grants | (on the catalog) | Permissions for specified principals |
The script is idempotent -- rerunning with the same parameters updates resources in place. Rerunning with changed parameters (e.g., different branch or credentials) converges to the new state.
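Because every resource name is derived from the catalog name, the full set can be predicted before running the tool, for example to pre-check IAM policies or to audit an account. A minimal sketch of the name patterns in the table above follows; the FederationResourceNames type and resource_names function are illustrative helpers, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FederationResourceNames:
    secret: str
    iam_role: str
    connection: str
    catalog: str


def resource_names(catalog_name: str) -> FederationResourceNames:
    """Apply the documented name patterns to a catalog name."""
    return FederationResourceNames(
        secret=f"{catalog_name}-secret",
        iam_role=f"{catalog_name}-GlueConnectionRole",
        connection=f"{catalog_name}-connection",
        catalog=catalog_name,
    )


if __name__ == "__main__":
    print(resource_names("lakefs-catalog"))
```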
Remove
To remove a specific federated catalog and its associated resources (connection, Lake Formation registration, Secrets Manager secret, and IAM role):
lakefs-glue rm my-catalog
To discover and remove all federated catalogs in the account:
lakefs-glue rm --all
Use --yes to skip the confirmation prompt. Use --region to target a specific AWS region (default: us-east-1).
Limitations
- Read-only: Glue Catalog Federation only supports queries (AWS docs). You cannot INSERT INTO, CREATE TABLE, or modify data through the federated catalog. Use Spark/PyIceberg/Trino connected directly to lakeFS for writes.
- Single ref per catalog: Each federated catalog points to one lakeFS ref (branch, tag, or commit ID). Create multiple catalogs to expose multiple refs.
- No nested namespaces: Glue Catalog Federation only supports single-level namespaces (AWS docs). Tables must follow a flat catalog.namespace.table structure. This is why each catalog must be scoped to a specific repo.ref: it flattens the lakeFS hierarchy so that namespaces within the ref are exposed as top-level databases.
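The flat catalog.namespace.table rule can be checked before a query is issued. The sketch below represents a namespace as a tuple of levels, rejects anything that is not exactly one level deep, and builds a double-quoted three-part reference like the one in the Athena example. The function name athena_table_ref is hypothetical, and the quoting rule (double quotes, with embedded quotes doubled) is an assumption based on standard SQL identifier quoting.

```python
def athena_table_ref(catalog: str, namespace: tuple[str, ...], table: str) -> str:
    """Build a quoted catalog.namespace.table reference for a SELECT query."""
    if len(namespace) != 1:
        raise ValueError(
            f"expected a single-level namespace, got {len(namespace)} levels: {namespace!r}"
        )

    def quote(identifier: str) -> str:
        # Double any embedded double quotes, then wrap the identifier.
        return '"' + identifier.replace('"', '""') + '"'

    return ".".join(quote(part) for part in (catalog, namespace[0], table))


if __name__ == "__main__":
    ref = athena_table_ref("lakefs-catalog", ("default",), "my_table")
    print(f"SELECT * FROM {ref} LIMIT 10;")
```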
AWS SDK/CLI configuration notes
The AWS Console handles these automatically, but when automating via SDK/CLI they need to be set explicitly:
- Single IAM role for Glue + Lake Formation: The same IAM role must be used for both the Glue connection's ROLE_ARN and the Lake Formation RegisterResource call. Using separate roles (even with identical policies) causes federation to fail silently: get_databases times out with FederationSourceRetryableException and zero HTTP requests reach the external catalog.
- SUPER_USER Lake Formation grant: The SUPER_USER permission must be granted on the federated catalog for it to appear in the Lake Formation console UI. Without it, the catalog works via CLI/SDK but is invisible in the console.
- WithPrivilegedAccess in RegisterResource: The RegisterResource call should include WithPrivilegedAccess=True to grant the registering principal full control over the federated resource.
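The single-role requirement is easy to violate in automation, so it can help to build the request parameters in one place and verify them before any AWS call is made. The sketch below makes no AWS calls. It assumes the RegisterResource request takes ResourceArn, RoleArn, WithFederation, and WithPrivilegedAccess in boto3 naming; WithFederation and the use of the Glue connection ARN as the resource are assumptions not stated in the notes above. The helper names and example ARNs are hypothetical.

```python
def register_resource_kwargs(connection_arn: str, role_arn: str) -> dict:
    """Keyword arguments for a Lake Formation RegisterResource call (assumed shape)."""
    return {
        "ResourceArn": connection_arn,   # assumption: the Glue connection is the resource
        "RoleArn": role_arn,
        "WithFederation": True,          # assumption: not mentioned in the notes above
        "WithPrivilegedAccess": True,    # per the note on WithPrivilegedAccess
    }


def check_single_role(connection_properties: dict, register_kwargs: dict) -> None:
    """Raise if the Glue connection and RegisterResource use different roles."""
    glue_role = connection_properties.get("ROLE_ARN")
    lf_role = register_kwargs.get("RoleArn")
    if glue_role != lf_role:
        raise ValueError(
            f"role mismatch: Glue connection uses {glue_role!r}, "
            f"RegisterResource uses {lf_role!r}"
        )


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/lakefs-catalog-GlueConnectionRole"
    conn = "arn:aws:glue:us-east-1:123456789012:connection/lakefs-catalog-connection"
    kwargs = register_resource_kwargs(conn, role)
    check_single_role({"ROLE_ARN": role}, kwargs)
    print("roles match; parameters:", kwargs)
```

Running a check like this before the SDK calls turns the silent timeout described above into an immediate, readable error.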
Technical details
See the annotated source in lakefs_glue_federation.py for a detailed walkthrough of the integration architecture, OAuth2 authentication flow, namespace mapping, and S3 credential vending.