athena_bridge lets you run PySpark-style code using Python + AWS Athena, replicating PySpark’s syntax and functions. Write new code or migrate existing code without EMR or Glue, leveraging Athena to cut costs and boost performance.
🪶 athena_bridge
athena_bridge is an open-source Python library that replicates the most common PySpark functions, allowing you to execute PySpark-like code directly on AWS Athena via automatically generated SQL.
With this library, you can reuse your existing PySpark code without needing an EMR Cluster or Glue Interactive Session, leveraging Athena’s SQL backend with identical syntax to PySpark.
✨ Key Features
- Mirrors the most used `pyspark.sql.functions`, `DataFrame`, `Column`, and `Window` APIs.
- Enables migration of PySpark code to environments without Spark.
- Translates PySpark-style operations into executable Athena SQL through `awswrangler`.
- Fully compatible with Python ≥ 3.8 and AWS Athena / Glue Catalog.
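To illustrate the idea behind this translation layer, here is a deliberately simplified stand-in (not athena_bridge's actual implementation) showing how a PySpark-style filter expression could map to an Athena SQL statement. The table and column names are invented for the example:

```python
# Simplified sketch of a PySpark-to-SQL translation, mimicking the idea
# behind athena_bridge (NOT its real internals).
def filter_to_sql(table: str, column: str, op: str, value) -> str:
    """Build a SELECT with a WHERE clause, as a bridge library might."""
    # Quote string literals; leave numbers as-is.
    literal = f"'{value}'" if isinstance(value, str) else str(value)
    return f'SELECT * FROM "{table}" WHERE "{column}" {op} {literal}'

# A chain like df.filter(F.col("total_amount") > 500) would roughly become:
sql = filter_to_sql("sales", "total_amount", ">", 500)
print(sql)  # SELECT * FROM "sales" WHERE "total_amount" > 500
```

In the real library, the generated SQL is executed on Athena via `awswrangler`, and the result comes back as a DataFrame-like object.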
📦 Installation
Available on PyPI:
pip install athena_bridge
Dependencies
- awswrangler
- boto3
- pandas
⚙️ AWS Configuration
To run athena_bridge queries from Amazon SageMaker, the execution role (AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxx) must have the proper permissions for Glue, Athena, and S3.
Edit the role and attach a policy like the following (replace account IDs and bucket names with your own).
Example (anonymized) of an IAM role policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "GlueAllDatabasesAllTables",
"Effect": "Allow",
"Action": [
"glue:GetCatalogImportStatus",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:CreateDatabase",
"glue:UpdateDatabase",
"glue:DeleteDatabase",
"glue:GetTable",
"glue:GetTables",
"glue:CreateTable",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:GetPartition",
"glue:GetPartitions",
"glue:CreatePartition",
"glue:BatchCreatePartition",
"glue:UpdatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition"
],
"Resource": [
"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:catalog",
"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:database/*",
"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:table/*/*"
]
},
{
"Sid": "AthenaWorkgroupAccess",
"Effect": "Allow",
"Action": [
"athena:GetWorkGroup",
"athena:StartQueryExecution",
"athena:GetQueryExecution",
"athena:GetQueryResults",
"athena:StopQueryExecution"
],
"Resource": "arn:aws:athena:eu-central-1:__ACCOUNT_ID_HERE__:workgroup/__YOUR_WORKGROUP_HERE__"
},
{
"Sid": "AthenaS3AccessNOTE",
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx",
"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx/*",
"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__",
"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__/*"
]
}
]
}
⚠️ Note: It is recommended to restrict the `Resource` field to specific buckets, databases, and workgroups.
To ensure that all Athena queries (including UNLOAD and CTAS operations) store their results only in the designated S3 bucket, you must enable the “Enforce workgroup configuration / Override client-side settings” option in the Athena Workgroup configuration. This setting prevents clients (such as boto3 or awswrangler) from overriding the result location and guarantees that all query outputs are written to the S3 path defined in the workgroup. Without this enforcement, UNLOAD commands may write temporary files (e.g., .csv, .metadata, .manifest) into unintended locations, potentially corrupting Parquet datasets.
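If you prefer to apply this setting from code rather than the Athena console, the enforcement flag can be set with boto3's `update_work_group` call. The payload below is a sketch; the workgroup name, region, and S3 output path are placeholders you must replace with your own values:

```python
# Sketch: the ConfigurationUpdates payload for enforcing the workgroup's
# result location. Workgroup name, region, and bucket are placeholders.
config_updates = {
    # Make the workgroup settings override client-side settings.
    "EnforceWorkGroupConfiguration": True,
    "ResultConfigurationUpdates": {
        # All query results (including UNLOAD/CTAS output) land here.
        "OutputLocation": "s3://__YOUR_ATHENA_RESULTS_BUCKET__/results/",
    },
}

# With valid AWS credentials, you would apply it like this:
# import boto3
# athena = boto3.client("athena", region_name="eu-central-1")
# athena.update_work_group(
#     WorkGroup="__YOUR_ATHENA_WORKGROUP__",
#     ConfigurationUpdates=config_updates,
# )
print(config_updates["EnforceWorkGroupConfiguration"])  # True
```

The same options are available in the Athena console under the workgroup's "Settings" tab.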
🚀 Quick Start
from athena_bridge import functions as F
from athena_bridge.spark_athena_bridge import get_spark
# --- Initialize Spark-like session ---
spark = get_spark(
database_tmp="__YOUR_ATHENA_DATABASE__",
path_tmp="s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/",
workgroup="__YOUR_ATHENA_WORKGROUP__"
)
# --- Read data from S3 (CSV, Parquet, etc.) ---
df_csv = (
spark.read
.format("csv")
.option("header", True) # use True if your CSV files have a header row
.option("sep", ";") # change to "," or remove this line if it does not apply
.load("s3://__YOUR_S3_DIRECTORY_THAT_CONTAINS_CSV__/")
)
# --- Write dataset as Parquet ---
df_csv.write.format("parquet").mode("overwrite").save(
"s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)
# --- Read back the Parquet dataset ---
df = spark.read.format("parquet").load(
"s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)
# --- Simple DataFrame operations ---
df = df.withColumn("total_amount", F.lit(1000))
df.filter(F.col("total_amount") > 500).show()
# --- Stop session ---
spark.stop()
💡 Note: Make sure the “Enforce workgroup configuration / Override client-side settings” option is enabled in your Athena Workgroup, so that all queries and UNLOAD operations always write their outputs to the S3 location defined in the workgroup, preventing auxiliary files from being written outside that path.
🧠 Result: The code initializes a Spark-like session connected to Athena, reads data from S3 (for example, in CSV format), writes it back as Parquet, and allows you to perform operations using PySpark-style syntax (such as withColumn, filter, show). The computations are executed on Athena, and the results are displayed directly in the execution environment (e.g., SageMaker or a local notebook).
📘 More detailed examples
You can find more detailed notebooks in the examples/ directory:
- example_athena_bridge_using_dataproc_module.ipynb — Read & write example using the Dataproc module.
- example_athena_bridge_using_spark_module.ipynb — Read & write example using the Spark module.
- quickstart.ipynb — Minimal quickstart example.
🧰 PySpark Compatibility
athena_bridge implements a large subset of PySpark’s native functions.
You can check the complete list of implemented functions and links to the official documentation:
| Module | Available Functions | Link |
|---|---|---|
| `functions` | 100+ PySpark functions: math, string, date, and collection operations | functions.html |
| `dataframe` | DataFrame methods (`select`, `filter`, `join`, `show`, etc.) | dataframe.html |
| `column` | Column expressions and operators | column.html |
| `window` | Basic window operations (`partitionBy`, `orderBy`) | window.html |
Each link includes direct references to the official PySpark documentation for easier migration.
⚠️ Differences from PySpark
- Operations are executed on Athena, not on a distributed Spark cluster.
- Some advanced methods (e.g., `collect_set`, `rdd`, `pivot`) are not implemented yet.
- Streaming and RDD-based features are not supported.
- Performance depends on Athena query limits and execution times.
🔐 License
This project is licensed under the Apache License 2.0.
It includes parts of the public API interface from Apache Spark (PySpark) under the same license.
📜 Credits
Developed by Alvaro Del Monte
Based on the API of Apache Spark (PySpark)
Published on PyPI as athena_bridge.