
athena_bridge lets you run PySpark-style code using Python + AWS Athena, replicating PySpark’s syntax and functions. Write new code or migrate existing PySpark code without EMR or Glue, leveraging Athena to cut costs and boost performance.

Project description


🪶 athena_bridge

🇪🇸 Read in Spanish

athena_bridge is an open-source Python library that replicates the most common PySpark functions, allowing you to execute PySpark-like code directly on AWS Athena via automatically generated SQL.

With this library, you can reuse your existing PySpark code without needing an EMR Cluster or Glue Interactive Session, leveraging Athena’s SQL backend with identical syntax to PySpark.
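
For intuition, a chain of DataFrame calls maps onto a single SQL statement. The sketch below is purely illustrative (the table and column names are invented, and the exact SQL athena_bridge generates may differ):

# PySpark-style chain (hypothetical columns):
result = df.filter(F.col("country") == "ES").select("user_id", "total_amount")

# Conceptually, this is translated into an Athena SQL query such as:
# SELECT user_id, total_amount
# FROM my_table
# WHERE country = 'ES'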


✨ Key Features

  • Mirrors the most used pyspark.sql.functions, DataFrame, Column, and Window APIs.
  • Enables migration of PySpark code to environments without Spark.
  • Translates PySpark-style operations into executable Athena SQL through awswrangler.
  • Fully compatible with Python ≥ 3.8 and AWS Athena / Glue Catalog.

📦 Installation

Available on PyPI:

pip install athena_bridge

Dependencies

  • awswrangler
  • boto3
  • pandas

⚙️ AWS Configuration

To run athena_bridge queries from Amazon SageMaker, the execution role (AmazonSageMaker-ExecutionRole-xxxxxxxxxxxxx) must have the proper permissions for Glue, Athena, and S3.

Edit the role and attach a policy like the following (replace account IDs and bucket names with your own).

Example (anonymized) of an IAM role policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "GlueAllDatabasesAllTables",
			"Effect": "Allow",
			"Action": [
				"glue:GetCatalogImportStatus",
				"glue:GetDatabase",
				"glue:GetDatabases",
				"glue:CreateDatabase",
				"glue:UpdateDatabase",
				"glue:DeleteDatabase",
				"glue:GetTable",
				"glue:GetTables",
				"glue:CreateTable",
				"glue:UpdateTable",
				"glue:DeleteTable",
				"glue:GetPartition",
				"glue:GetPartitions",
				"glue:CreatePartition",
				"glue:BatchCreatePartition",
				"glue:UpdatePartition",
				"glue:DeletePartition",
				"glue:BatchDeletePartition"
			],
			"Resource": [
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:catalog",
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:database/*",
				"arn:aws:glue:eu-central-1:__ACCOUNT_ID_HERE__:table/*/*"
			]
		},
		{
			"Sid": "AthenaWorkgroupAccess",
			"Effect": "Allow",
			"Action": [
				"athena:GetWorkGroup",
				"athena:StartQueryExecution",
				"athena:GetQueryExecution",
				"athena:GetQueryResults",
				"athena:StopQueryExecution"
			],
			"Resource": "arn:aws:athena:eu-central-1:__ACCOUNT_ID_HERE__:workgroup/__YOUR_WORKGROUP_HERE_"
		},
		{
			"Sid": "AthenaS3AccessNOTE",
			"Effect": "Allow",
			"Action": [
				"s3:ListBucket",
				"s3:GetBucketLocation",
				"s3:GetObject",
				"s3:PutObject"
			],
			"Resource": [
				"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx",
				"arn:aws:s3:::sagemaker-studio-__ACCOUNT_ID_HERE__-xxxxxxxxxxx/*",
				"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__",
				"arn:aws:s3:::sagemaker-eu-central-1-__ACCOUNT_ID_HERE__/*"
			]
		}
	]
}

⚠️ Note: It is recommended to restrict the Resource field to specific buckets, databases, and workgroups.
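
Once the policy is attached, a quick way to sanity-check the setup is a trivial query through awswrangler, the same backend athena_bridge uses. A minimal sketch (database and workgroup names are placeholders):

import awswrangler as wr

# Succeeds only if the Athena, Glue, and S3 permissions above are in place.
df_check = wr.athena.read_sql_query(
    sql="SELECT 1 AS ok",
    database="__YOUR_ATHENA_DATABASE__",
    workgroup="__YOUR_ATHENA_WORKGROUP__",
    ctas_approach=False,  # plain query; avoids creating a temporary CTAS table
)
print(df_check)  # expect a single row with ok = 1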

To ensure that all Athena queries (including UNLOAD and CTAS operations) store their results only in the designated S3 bucket, you must enable the “Enforce workgroup configuration / Override client-side settings” option in the Athena Workgroup configuration. This setting prevents clients (such as boto3 or awswrangler) from overriding the result location and guarantees that all query outputs are written to the S3 path defined in the workgroup. Without this enforcement, UNLOAD commands may write temporary files (e.g., .csv, .metadata, .manifest) into unintended locations, potentially corrupting Parquet datasets.
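
The same enforcement can also be enabled programmatically with boto3. A sketch (the workgroup name and output path are placeholders; the region mirrors the policy example above):

import boto3

athena = boto3.client("athena", region_name="eu-central-1")

# Force every query in the workgroup to write results to the workgroup's
# configured location, ignoring any client-side override.
athena.update_work_group(
    WorkGroup="__YOUR_ATHENA_WORKGROUP__",
    ConfigurationUpdates={
        "EnforceWorkGroupConfiguration": True,
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/"
        },
    },
)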


🚀 Quick Start

from athena_bridge import functions as F
from athena_bridge.spark_athena_bridge import get_spark

# --- Initialize Spark-like session ---
spark = get_spark(
    database_tmp="__YOUR_ATHENA_DATABASE__",
    path_tmp="s3://__YOUR_S3_TEMP_PATH_FOR_ATHENA_BRIDGE__/",
    workgroup="__YOUR_ATHENA_WORKGROUP__"
)

# --- Read data from S3 (CSV, Parquet, etc.) ---
df_csv = (
    spark.read
         .format("csv")
         .option("header", True)      # usa True si tus CSV tienen cabecera
         .option("sep", ";")          # cambia a "," o elimina esta línea si no aplica
         .load("s3://__YOUR_S3_DIRECTORY_THAT_CONTAINS_CSV__/")
)

# --- Write dataset as Parquet ---
df_csv.write.format("parquet").mode("overwrite").save(
    "s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)

# --- Read back the Parquet dataset ---
df = spark.read.format("parquet").load(
    "s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/"
)

# --- Simple DataFrame operations ---
df = df.withColumn("total_amount", F.lit(1000))
df.filter(F.col("total_amount") > 500).show()

# --- Stop session ---
spark.stop()

💡 Note: Make sure the “Enforce workgroup configuration / Override client-side settings” option is enabled in your Athena Workgroup (see ⚙️ AWS Configuration above), so that all queries and UNLOAD operations write their outputs to the S3 location defined in the workgroup and no auxiliary files land outside that path.

🧠 Result: The code initializes a Spark-like session connected to Athena, reads data from S3 (for example, in CSV format), writes it back as Parquet, and allows you to perform operations using PySpark-style syntax (such as withColumn, filter, show). The computations are executed on Athena, and the results are displayed directly in the execution environment (e.g., SageMaker or a local notebook).
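
Building on the Quick Start, here is a slightly richer sketch using join, which the compatibility table below lists among the mirrored DataFrame methods (the second dataset path and all column names are invented placeholders):

# Assumes `spark` is the session from the Quick Start.
orders = spark.read.format("parquet").load("s3://__YOUR_S3_PARQUET_OUTPUT_PATH__/")
customers = spark.read.format("parquet").load("s3://__YOUR_CUSTOMERS_PARQUET_PATH__/")

# PySpark-style join + filter + projection, executed as Athena SQL:
enriched = (
    orders
    .join(customers, on="customer_id", how="left")
    .filter(F.col("total_amount") > 500)
    .select("customer_id", "total_amount")
)
enriched.show()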


📘 More detailed examples

You can find more detailed notebooks in the examples/ directory of the repository.


🧰 PySpark Compatibility

athena_bridge implements a large subset of PySpark’s native functions.
You can check the complete list of implemented functions and links to the official documentation:

Module     Available functions                                                     Link
functions  100+ PySpark functions: math, string, date, and collection operations  functions.html
dataframe  DataFrame methods (select, filter, join, show, etc.)                    dataframe.html
column     Column expressions and operators                                        column.html
window     Basic window operations (partitionBy, orderBy)                          window.html

Each link includes direct references to the official PySpark documentation for easier migration.
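
For instance, window operations follow the familiar partitionBy / orderBy pattern. A sketch, under the assumptions that Window is exposed in the library's window module and that row_number is among the mirrored functions (verify both against window.html and functions.html):

from athena_bridge.window import Window  # assumed import path; check window.html

# Rank rows within each customer by amount, mirroring pyspark.sql.Window:
w = Window.partitionBy("customer_id").orderBy(F.col("total_amount").desc())
df_ranked = df.withColumn("rank", F.row_number().over(w))
df_ranked.show()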


⚠️ Differences from PySpark

  • Operations are executed on Athena, not on a distributed Spark cluster.
  • Some advanced methods (e.g., collect_set, rdd, pivot) are not implemented yet.
  • Streaming and RDD-based features are not supported.
  • Performance depends on Athena query limits and execution times.

🔐 License

This project is licensed under the Apache License 2.0.

It includes parts of the public API interface from Apache Spark (PySpark) under the same license.



📜 Credits

Developed by Alvaro Del Monte
Based on the API of Apache Spark (PySpark)
Published on PyPI as athena_bridge.

Download files

Download the file for your platform.

Source Distribution

athena_bridge-0.0.4.tar.gz (54.6 kB)


Built Distribution


athena_bridge-0.0.4-py3-none-any.whl (35.5 kB)


File details

Details for the file athena_bridge-0.0.4.tar.gz.

File metadata

  • Download URL: athena_bridge-0.0.4.tar.gz
  • Size: 54.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for athena_bridge-0.0.4.tar.gz

Algorithm    Hash digest
SHA256       81e9cb03553361160189b0f58370fac556f66be136b097d7d4f0e90e54813223
MD5          08277992252ee7cc3b3f98db938e735a
BLAKE2b-256  85416024628f132a0c68309a3351a1bd4412af0c7e87ce406c374b3532b36a14
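
To verify a downloaded archive against the published SHA256, a minimal sketch (assumes the file sits in the current directory):

import hashlib

# Compare the local file's SHA256 with the published digest above.
with open("athena_bridge-0.0.4.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "81e9cb03553361160189b0f58370fac556f66be136b097d7d4f0e90e54813223"
print("OK" if digest == expected else "MISMATCH")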


File details

Details for the file athena_bridge-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: athena_bridge-0.0.4-py3-none-any.whl
  • Size: 35.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for athena_bridge-0.0.4-py3-none-any.whl

Algorithm    Hash digest
SHA256       d97b168cb7111049a0568ff0351c0b7a79da1c6fd0795fa35fa440f6370d27d9
MD5          ad0eaca81d718f92d0e4488c2e4405d1
BLAKE2b-256  aa1276c09cfe4a0b0ad5ecef1070cffd547fc61b2f61c6b2f3696fbbe6f45539

