
athena_bridge lets you run PySpark-style code using Python + AWS Athena, replicating PySpark’s syntax and functions. Write new code or migrate existing PySpark code without EMR or Glue, leveraging Athena to cut costs and boost performance.

Project description


🪶 athena_bridge

🇪🇸 Read in Spanish

athena_bridge is an open-source Python library that replicates the most common PySpark functions, allowing you to execute PySpark-like code directly on AWS Athena via automatically generated SQL.

With this library, you can reuse your existing PySpark code without needing an EMR Cluster or Glue Interactive Session, leveraging Athena’s SQL backend with identical syntax to PySpark.
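
For intuition, a chain of DataFrame calls maps onto a single SQL statement. The sketch below is purely illustrative (the table and column names are invented, and the exact SQL athena_bridge generates may differ):

# PySpark-style chain (hypothetical columns):
result = df.filter(F.col("country") == "ES").select("user_id", "total_amount")

# Conceptually, this is translated into an Athena SQL query such as:
# SELECT user_id, total_amount
# FROM my_table
# WHERE country = 'ES'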


✨ Key Features

  • Mirrors the most used pyspark.sql.functions, DataFrame, Column, and Window APIs.
  • Enables migration of PySpark code to environments without Spark.
  • Translates PySpark-style operations into executable Athena SQL through awswrangler.
  • Fully compatible with Python ≥ 3.8 and AWS Athena / Glue Catalog.

📦 Installation

Available on PyPI:

pip install athena_bridge

Dependencies

  • awswrangler
  • boto3
  • pandas

⚙️ AWS Configuration

To run athena_bridge queries from Amazon SageMaker, the execution role (AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxx) must have the proper permissions for Glue, Athena, and S3.

Edit the role and attach a policy like the following (replace account IDs and bucket names with your own).

Example (anonymized) of an IAM role policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "GlueAllDatabasesAllTables",
			"Effect": "Allow",
			"Action": [
				"glue:GetCatalogImportStatus",
				"glue:GetDatabase",
				"glue:GetDatabases",
				"glue:CreateDatabase",
				"glue:UpdateDatabase",
				"glue:DeleteDatabase",
				"glue:GetTable",
				"glue:GetTables",
				"glue:CreateTable",
				"glue:UpdateTable",
				"glue:DeleteTable",
				"glue:GetPartition",
				"glue:GetPartitions",
				"glue:CreatePartition",
				"glue:BatchCreatePartition",
				"glue:UpdatePartition",
				"glue:DeletePartition",
				"glue:BatchDeletePartition"
			],
			"Resource": [
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:catalog",
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:database/*",
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:table/*/*"
			]
		},
		{
			"Sid": "AthenaWorkgroupAccess",
			"Effect": "Allow",
			"Action": [
				"athena:GetWorkGroup",
				"athena:StartQueryExecution",
				"athena:GetQueryExecution",
				"athena:GetQueryResults",
				"athena:StopQueryExecution"
			],
			"Resource": "arn:aws:athena:eu-central-1:__ACCOUNT_ID_HERE__:workgroup/__YOUR_WORKGROUP_HERE_"
		},
		{
			"Sid": "AthenaS3AccessNOTE",
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetBucketLocation",
				"s3:GetObject",
				"s3:PutObject"
			],
			"Resource": [
				"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx",
				"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx/*",
				"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__",
				"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__/*"
			]
		}
	]
}

⚠️ Note: It is recommended to restrict the Resource field to specific buckets, databases, and workgroups.
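
Once the policy is attached, a quick way to sanity-check the setup is a trivial query through awswrangler, the same backend athena_bridge uses. A minimal sketch (database and workgroup names are placeholders):

import awswrangler as wr

# Succeeds only if the Athena, Glue, and S3 permissions above are in place.
df_check = wr.athena.read_sql_query(
    sql="SELECT 1 AS ok",
    database="__YOUR_ATHENA_DATABASE__",
    workgroup="__YOUR_ATHENA_WORKGROUP__",
    ctas_approach=False,  # plain query; avoids creating a temporary CTAS table
)
print(df_check)  # expect a single row with ok = 1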

To ensure that all Athena queries (including UNLOAD and CTAS operations) store their results only in the designated S3 bucket, you must enable the “Enforce workgroup configuration / Override client-side settings” option in the Athena Workgroup configuration. This setting prevents clients (such as boto3 or awswrangler) from overriding the result location and guarantees that all query outputs are written to the S3 path defined in the workgroup. Without this enforcement, UNLOAD commands may write temporary files (e.g., .csv, .metadata, .manifest) into unintended locations, potentially corrupting Parquet datasets.
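
The same enforcement can also be enabled programmatically with boto3. A sketch (the workgroup name and output path are placeholders; the region mirrors the policy example above):

import boto3

athena = boto3.client("athena", region_name="eu-central-1")

# Force every query in the workgroup to write results to the workgroup's
# configured location, ignoring any client-side override.
athena.update_work_group(
    WorkGroup="__YOUR_ATHENA_WORKGROUP__",
    ConfigurationUpdates={
        "EnforceWorkGroupConfiguration": True,
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/"
        },
    },
)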


🚀 Quick Start

from athena_bridge import functions as F
from athena_bridge.spark_athena_bridge import get_spark

# --- Initialize Spark-like session ---
spark = get_spark(
    database_tmp="__YOUR_ATHENA_DATABASE__",
    path_tmp="s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/",
    workgroup="__YOUR_ATHENA_WORKGROUP__"
)

# --- Read data from S3 (CSV, Parquet, etc.) ---
df_csv = (
    spark.read
         .format("csv")
         .option("header", True)      # usa True si tus CSV tienen cabecera
         .option("sep", ";")          # cambia a "," o elimina esta línea si no aplica
         .load("s3://__YOUR_S3_DIRECTORY_THAT_CONTAINS_CSV__/")
)

# --- Write dataset as Parquet ---
df_csv.write.format("parquet").mode("overwrite").save(
    "s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)

# --- Read back the Parquet dataset ---
df = spark.read.format("parquet").load(
    "s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)

# --- Simple DataFrame operations ---
df = df.withColumn("total_amount", F.lit(1000))
df.filter(F.col("total_amount") > 500).show()

# --- Stop session ---
spark.stop()

💡 Note: Make sure the “Enforce workgroup configuration / Override client-side settings” option is enabled in your Athena Workgroup (see ⚙️ AWS Configuration above), so that all queries and UNLOAD operations write their outputs to the S3 location defined in the workgroup and no auxiliary files land outside that path.

🧠 Result: The code initializes a Spark-like session connected to Athena, reads data from S3 (for example, in CSV format), writes it back as Parquet, and allows you to perform operations using PySpark-style syntax (such as withColumn, filter, show). The computations are executed on Athena, and the results are displayed directly in the execution environment (e.g., SageMaker or a local notebook).
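
Building on the Quick Start, here is a slightly richer sketch using join, which the compatibility table below lists among the mirrored DataFrame methods (the second dataset path and all column names are invented placeholders):

# Assumes `spark` is the session from the Quick Start.
orders = spark.read.format("parquet").load("s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/")
customers = spark.read.format("parquet").load("s3://__YOUR_CUSTOMERS_PARQUET_PATH__/")

# PySpark-style join + filter + projection, executed as Athena SQL:
enriched = (
    orders
    .join(customers, on="customer_id", how="left")
    .filter(F.col("total_amount") > 500)
    .select("customer_id", "total_amount")
)
enriched.show()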


📘 More detailed examples

You can find more detailed notebooks in the examples/ directory of the repository.


🧰 PySpark Compatibility

athena_bridge implements a large subset of PySpark’s native functions.
You can check the complete list of implemented functions and links to the official documentation:

Module     Available functions                                                     Link
functions  100+ PySpark functions: math, string, date, and collection operations  functions.html
dataframe  DataFrame methods (select, filter, join, show, etc.)                    dataframe.html
column     Column expressions and operators                                        column.html
window     Basic window operations (partitionBy, orderBy)                          window.html

Each link includes direct references to the official PySpark documentation for easier migration.
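
For instance, window operations follow the familiar partitionBy / orderBy pattern. A sketch, under the assumptions that Window is exposed in the library's window module and that row_number is among the mirrored functions (verify both against window.html and functions.html):

from athena_bridge.window import Window  # assumed import path; check window.html

# Rank rows within each customer by amount, mirroring pyspark.sql.Window:
w = Window.partitionBy("customer_id").orderBy(F.col("total_amount").desc())
df_ranked = df.withColumn("rank", F.row_number().over(w))
df_ranked.show()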


⚠️ Differences from PySpark

  • Operations are executed on Athena, not on a distributed Spark cluster.
  • Some advanced methods (e.g., collect_set, rdd, pivot) are not implemented yet.
  • Streaming and RDD-based features are not supported.
  • Performance depends on Athena query limits and execution times.

🔐 License

This project is licensed under the Apache License 2.0.

It includes parts of the public API interface from Apache Spark (PySpark) under the same license.



📜 Credits

Developed by Alvaro Del Monte
Based on the API of Apache Spark (PySpark)
Published on PyPI as athena_bridge.

Download files

Download the file for your platform.

Source Distribution

athena_bridge-0.0.4.tar.gz (54.6 kB)


Built Distribution


athena_bridge-0.0.4-py3-none-any.whl (35.5 kB)


File details

Details for the file athena_bridge-0.0.4.tar.gz.

File metadata

  • Download URL: athena_bridge-0.0.4.tar.gz
  • Size: 54.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for athena_bridge-0.0.4.tar.gz

Algorithm    Hash digest
SHA256       81e9cb03553361160189b0f58370fac556f66be136b097d7d4f0e90e54813223
MD5          08277992252ee7cc3b3f98db938e735a
BLAKE2b-256  85416024628f132a0c68309a3351a1bd4412af0c7e87ce406c374b3532b36a14
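
To verify a downloaded archive against the published SHA256, a minimal sketch (assumes the file sits in the current directory):

import hashlib

# Compare the local file's SHA256 with the published digest above.
with open("athena_bridge-0.0.4.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "81e9cb03553361160189b0f58370fac556f66be136b097d7d4f0e90e54813223"
print("OK" if digest == expected else "MISMATCH")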


File details

Details for the file athena_bridge-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: athena_bridge-0.0.4-py3-none-any.whl
  • Size: 35.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for athena_bridge-0.0.4-py3-none-any.whl

Algorithm    Hash digest
SHA256       d97b168cb7111049a0568ff0351c0b7a79da1c6fd0795fa35fa440f6370d27d9
MD5          ad0eaca81d718f92d0e4488c2e4405d1
BLAKE2b-256  aa1276c09cfe4a0b0ad5ecef1070cffd547fc61b2f61c6b2f3696fbbe6f45539

