
athena_bridge lets you run PySpark-style code using Python + AWS Athena, replicating PySpark’s syntax and functions. Write new code or migrate existing code without EMR or Glue, leveraging Athena to cut costs and boost performance.

Project description


🪶 athena_bridge

🇪🇸 Read in Spanish

athena_bridge is an open-source Python library that replicates the most common PySpark functions, allowing you to execute PySpark-like code directly on AWS Athena via automatically generated SQL.

With this library, you can reuse your existing PySpark code without needing an EMR Cluster or Glue Interactive Session, leveraging Athena’s SQL backend with identical syntax to PySpark.


✨ Key Features

  • Mirrors the most used pyspark.sql.functions, DataFrame, Column, and Window APIs.
  • Enables migration of PySpark code to environments without Spark.
  • Translates PySpark-style operations into executable Athena SQL through awswrangler.
  • Fully compatible with Python ≥ 3.8 and AWS Athena / Glue Catalog.
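As a toy illustration of the translation idea (this is not athena_bridge’s internal code, just a conceptual sketch), a PySpark-style column expression and `filter()` call can be lowered to an Athena SQL predicate roughly like this:

```python
# Toy sketch of PySpark-expression-to-SQL lowering (conceptual only;
# not athena_bridge's actual implementation).
class Col:
    """Minimal stand-in for a PySpark-like Column that records an SQL expression."""
    def __init__(self, sql: str):
        self.sql = sql

    def __gt__(self, other):
        return Col(f"{self.sql} > {other!r}")

def col(name: str) -> Col:
    return Col(name)

def filter_to_sql(table: str, condition: Col) -> str:
    """Render a filter() call as the SELECT that Athena would execute."""
    return f"SELECT * FROM {table} WHERE {condition.sql}"

print(filter_to_sql("sales", col("total_amount") > 500))
# SELECT * FROM sales WHERE total_amount > 500
```

The real library builds equivalent SQL from the full expression tree and submits it through awswrangler, but the shape of the mapping is the same.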

📦 Installation

Available on PyPI:

pip install athena_bridge

Dependencies

  • awswrangler
  • boto3
  • pandas

⚙️ AWS Configuration

To run athena_bridge queries from Amazon SageMaker, the execution role (AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxx) must have the proper permissions for Glue, Athena, and S3.

Edit the role and attach a policy like the following (replace account IDs and bucket names with your own).

Anonymized example of an IAM role policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "GlueAllDatabasesAllTables",
			"Effect": "Allow",
			"Action": [
				"glue:GetCatalogImportStatus",
				"glue:GetDatabase",
				"glue:GetDatabases",
				"glue:CreateDatabase",
				"glue:UpdateDatabase",
				"glue:DeleteDatabase",
				"glue:GetTable",
				"glue:GetTables",
				"glue:CreateTable",
				"glue:UpdateTable",
				"glue:DeleteTable",
				"glue:GetPartition",
				"glue:GetPartitions",
				"glue:CreatePartition",
				"glue:BatchCreatePartition",
				"glue:UpdatePartition",
				"glue:DeletePartition",
				"glue:BatchDeletePartition"
			],
			"Resource": [
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:catalog",
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:database/*",
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:table/*/*"
			]
		},
        {
			"Sid": "AthenaWorkgroupAccess",
			"Effect": "Allow",
			"Action": [
				"athena:GetWorkGroup",
				"athena:StartQueryExecution",
				"athena:GetQueryExecution",
				"athena:GetQueryResults",
				"athena:StopQueryExecution"
			],
			"Resource": "arn:aws:athena:eu-central-1:__ACCOUNT_ID_HERE__:workgroup/__YOUR_WORKGROUP_HERE_"
		},
		{
			"Sid": "AthenaS3AccessNOTE",
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetBucketLocation",
				"s3:GetObject",
				"s3:PutObject"
			],
			"Resource": [
				"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx",
				"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx/*",
				"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__",
				"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__/*"
			]
		}
	]
}

⚠️ Note: It is recommended to restrict the Resource field to specific buckets, databases, and workgroups.

To ensure that all Athena queries (including UNLOAD and CTAS operations) store their results only in the designated S3 bucket, you must enable the “Enforce workgroup configuration / Override client-side settings” option in the Athena Workgroup configuration. This setting prevents clients (such as boto3 or awswrangler) from overriding the result location and guarantees that all query outputs are written to the S3 path defined in the workgroup. Without this enforcement, UNLOAD commands may write temporary files (e.g., .csv, .metadata, .manifest) into unintended locations, potentially corrupting Parquet datasets.
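If you prefer to set this programmatically rather than in the console, boto3’s `update_work_group` call can push the enforcement flag and result location to an existing workgroup. A minimal sketch, assuming the workgroup name and S3 path are placeholders you substitute with your own:

```python
# Sketch: enforcing the workgroup result location with boto3.
# The S3 path and workgroup name are placeholders, not real resources.
configuration_updates = {
    "EnforceWorkGroupConfiguration": True,   # ignore client-side result-location overrides
    "ResultConfigurationUpdates": {
        "OutputLocation": "s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/",
    },
}

def apply_updates(athena_client, workgroup: str) -> None:
    """Apply the enforcement settings to an existing Athena workgroup."""
    athena_client.update_work_group(
        WorkGroup=workgroup,
        ConfigurationUpdates=configuration_updates,
    )

# Usage (requires AWS credentials):
#   import boto3
#   apply_updates(boto3.client("athena"), "__YOUR_ATHENA_WORKGROUP__")
```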


🚀 Quick Start

from athena_bridge import functions as F
from athena_bridge.spark_athena_bridge import get_spark

# --- Initialize Spark-like session ---
spark = get_spark(
    database_tmp="__YOUR_ATHENA_DATABASE__",
    path_tmp="s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/",
    workgroup="__YOUR_ATHENA_WORKGROUP__"
)

# --- Read data from S3 (CSV, Parquet, etc.) ---
df_csv = (
    spark.read
         .format("csv")
         .option("header", True)      # usa True si tus CSV tienen cabecera
         .option("sep", ";")          # cambia a "," o elimina esta línea si no aplica
         .load("s3://__YOUR_S3_DIRECTORY_THAT_CONTAINS_CSV__/")
)

# --- Write dataset as Parquet ---
df_csv.write.format("parquet").mode("overwrite").save(
    "s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)

# --- Read back the Parquet dataset ---
df = spark.read.format("parquet").load(
    "s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)

# --- Simple DataFrame operations ---
df = df.withColumn("total_amount", F.lit(1000))
df.filter(F.col("total_amount") > 500).show()

# --- Stop session ---
spark.stop()

💡 Note: Make sure the “Enforce workgroup configuration / Override client-side settings” option is enabled in your Athena Workgroup (see the AWS Configuration section above), so that all queries and UNLOAD operations write their outputs to the S3 location defined in the workgroup and no auxiliary files land outside that path.

🧠 Result: The code initializes a Spark-like session connected to Athena, reads data from S3 (for example, in CSV format), writes it back as Parquet, and allows you to perform operations using PySpark-style syntax (such as withColumn, filter, show). The computations are executed on Athena, and the results are displayed directly in the execution environment (e.g., SageMaker or a local notebook).


📘 More detailed examples

You can find more detailed notebooks in the examples/ directory.


🧰 PySpark Compatibility

athena_bridge implements a large subset of PySpark’s native functions.
You can check the complete list of implemented functions and links to the official documentation:

  • functions — 100+ PySpark functions: math, string, date, and collection operations (functions.html)
  • dataframe — DataFrame methods (select, filter, join, show, etc.) (dataframe.html)
  • column — Column expressions and operators (column.html)
  • window — Basic window operations (partitionBy, orderBy) (window.html)

Each link includes direct references to the official PySpark documentation for easier migration.
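As a rough sketch of what the window module maps onto (a toy model, not the library’s internals), a partitionBy/orderBy window spec corresponds to a SQL OVER clause in Athena:

```python
# Toy sketch: how a PySpark-style Window.partitionBy(...).orderBy(...)
# spec corresponds to a SQL OVER clause (conceptual only).
class Window:
    def __init__(self, partition=(), order=()):
        self.partition, self.order = list(partition), list(order)

    def partitionBy(self, *cols):
        return Window(self.partition + list(cols), self.order)

    def orderBy(self, *cols):
        return Window(self.partition, self.order + list(cols))

    def to_sql(self) -> str:
        """Render the window spec as the OVER clause Athena would see."""
        parts = []
        if self.partition:
            parts.append("PARTITION BY " + ", ".join(self.partition))
        if self.order:
            parts.append("ORDER BY " + ", ".join(self.order))
        return "OVER (" + " ".join(parts) + ")"

w = Window().partitionBy("department").orderBy("salary")
print(f"row_number() {w.to_sql()}")
# row_number() OVER (PARTITION BY department ORDER BY salary)
```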


⚠️ Differences from PySpark

  • Operations are executed on Athena, not on a distributed Spark cluster.
  • Some advanced methods (e.g., collect_set, rdd, pivot) are not implemented yet.
  • Streaming and RDD-based features are not supported.
  • Performance depends on Athena query limits and execution times.

🧪 Full Example (Jupyter / SageMaker)

Check out the notebook Ejemplo_finn_athena_bridge_usando_dataproc.ipynb for a complete example, including:

  • how to connect using boto3 and awswrangler,
  • how to create DataFrames from Athena results,
  • and how to combine athena_bridge with pandas seamlessly.
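The pandas combination mentioned above can be sketched with awswrangler’s `athena.read_sql_query`, which returns query results directly as a pandas DataFrame. The table and database names below are placeholders, not values from the notebook:

```python
# Sketch: pulling Athena query results into pandas via awswrangler.
# Table and database names are placeholders -- substitute your own.
sql = "SELECT department, COUNT(*) AS n FROM __YOUR_TABLE__ GROUP BY department"

# Requires AWS credentials and the awswrangler package; uncomment to run:
# import awswrangler as wr
# pdf = wr.athena.read_sql_query(sql, database="__YOUR_ATHENA_DATABASE__")
# print(pdf.head())
```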

🔐 License

This project is licensed under the Apache License 2.0.

It includes parts of the public API interface from Apache Spark (PySpark) under the same license.



📜 Credits

Developed by Alvaro Del Monte
Based on the API of Apache Spark (PySpark)
Published on PyPI as athena_bridge.



Download files

Download the file for your platform.

Source Distribution

athena_bridge-0.0.2.tar.gz (55.0 kB)

Uploaded Source

Built Distribution


athena_bridge-0.0.2-py3-none-any.whl (35.7 kB)

Uploaded Python 3

File details

Details for the file athena_bridge-0.0.2.tar.gz.

File metadata

  • Download URL: athena_bridge-0.0.2.tar.gz
  • Upload date:
  • Size: 55.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for athena_bridge-0.0.2.tar.gz:

  • SHA256: ed9cecb0b304b5489bf2553dabf6586ac6b574a291bb72f10a056ca1a192e7de
  • MD5: 047080ea39a640bfbc85927b8c052020
  • BLAKE2b-256: a8bdf611e53474284aa8d91766535295e01e2fc23200aceb02febee6a30a8b59


File details

Details for the file athena_bridge-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: athena_bridge-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 35.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for athena_bridge-0.0.2-py3-none-any.whl:

  • SHA256: f6de7cd2df23b9c6448aaa7baf5168ba7c39aa09c2c5b2a3c0cfa3157ce9d3d0
  • MD5: 43ef9d76431d07d653b31337b032513a
  • BLAKE2b-256: fb2bca976f312d5eafeb56b1ccf974858de2f5bce93930103010aabbc596012c

