Reusable utilities for working with Glue PySpark jobs

glue-utils

glue-utils is a Python library designed to enhance the developer experience when working with AWS Glue ETL and Python Shell jobs. It reduces boilerplate code, increases type safety, and improves IDE auto-completion, making Glue development easier and more efficient.

Usage in AWS Glue
Usage when developing jobs locally
Main Features
Other features

Usage in AWS Glue

To use glue-utils in AWS Glue, add it as an additional Python module in your Glue job.

You can do this by adding an --additional-python-modules job parameter with the value glue_utils==0.8.0. For more information about setting job parameters, see AWS Glue job parameters.
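If the job is defined programmatically rather than in the console, the same parameter goes into the job's default arguments. The following is a minimal sketch, assuming the job is created through something like boto3's Glue `create_job` call, whose `DefaultArguments` field takes a string-to-string mapping. The variable names and the second package are illustrative only.

```python
# Job parameters are plain string key/value pairs; the key keeps its leading dashes.
GLUE_UTILS_VERSION = "0.8.0"  # pin the version you have tested against

default_arguments = {
    "--additional-python-modules": f"glue_utils=={GLUE_UTILS_VERSION}",
}

# Several modules can be listed in one comma-separated value.
# "some-other-package==1.0.0" is a made-up placeholder, not a real requirement.
extra_modules = ["some-other-package==1.0.0"]
combined = ",".join(
    [default_arguments["--additional-python-modules"], *extra_modules]
)

print(default_arguments)
print(combined)
```

The resulting dictionary is what would be passed as the job's default arguments; nothing here contacts AWS.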

Usage when developing jobs locally

This library does not declare pyspark and aws-glue-libs as dependencies because they are already pre-installed in Glue's runtime environment.

For local development in your IDE, install pyspark and aws-glue-libs yourself. aws-glue-libs is not available on PyPI, so it can only be installed from its Git repository.

# Glue 4.0 uses PySpark 3.3.0
pip install pyspark==3.3.0
pip install git+https://github.com/awslabs/aws-glue-libs.git@master
pip install glue-utils
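After installing, it can be useful to confirm that the three packages are importable from the interpreter your IDE uses. The following is a small sketch using only the standard library; the helper name `check_local_modules` is hypothetical, and it assumes the import names are `pyspark`, `awsglue`, and `glue_utils`.

```python
from importlib.util import find_spec


def check_local_modules(names=("pyspark", "awsglue", "glue_utils")):
    """Return a mapping of module name to whether it can be found locally."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_local_modules().items():
        status = "found" if available else "MISSING"
        print(f"{name}: {status}")
```

This only checks that the modules can be located; it does not verify that their versions match your Glue runtime.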

Main Features

BaseOptions
- a dataclass that parses the options supplied via command-line arguments
GluePySparkContext
- a subclass of awsglue.context.GlueContext that adds convenient type-safe methods (methods that ensure the correct data types are used) for the most common connection types.
GluePySparkJob
- a convenient class that simplifies and reduces the boilerplate code needed in Glue jobs.

`BaseOptions`

BaseOptions resolves the required arguments into a dataclass so that your IDE can auto-complete field names and flag potential KeyErrors. It also lets type checkers such as pyright and mypy catch those errors at design or build time instead of at runtime.

from dataclasses import dataclass
from glue_utils import BaseOptions


@dataclass
class Options(BaseOptions):
    start_date: str
    end_date: str


args = Options.from_sys_argv()

print(f"The day partition key is: {args.start_date}")

Note: As with awsglue.utils.getResolvedOptions, all argument values are strings. A warning is issued if a field is declared with any other type. We aim to auto-cast those values in the future.
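Because every value arrives as a string, any conversion has to happen in your own code. The sketch below illustrates the idea with a plain dataclass standing in for a BaseOptions subclass so that it runs without glue-utils installed. The `from_argv` method is a hypothetical, simplified resolver written for this example and is not the library's `from_sys_argv`; whether derived properties like these suit your BaseOptions subclass is an assumption to verify.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Options:
    """Stand-in for a BaseOptions subclass; every field is a string."""

    start_date: str
    end_date: str

    @classmethod
    def from_argv(cls, argv):
        """Hypothetical resolver: pick out `--name value` pairs by field name."""
        values = {}
        for field in fields(cls):
            flag = f"--{field.name}"
            if flag not in argv:
                raise KeyError(f"missing required argument {flag}")
            index = argv.index(flag)
            if index + 1 >= len(argv):
                raise ValueError(f"no value given for {flag}")
            values[field.name] = argv[index + 1]
        return cls(**values)

    @property
    def start(self) -> date:
        # Explicit conversion from the raw string to a date.
        return date.fromisoformat(self.start_date)

    @property
    def end(self) -> date:
        return date.fromisoformat(self.end_date)


argv = ["job.py", "--start_date", "2024-08-01", "--end_date", "2024-08-08"]
args = Options.from_argv(argv)
days = (args.end - args.start).days
print(days)  # 7
```

Keeping the fields as strings and converting in properties leaves the declared field types consistent with what the resolver actually supplies.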

`GluePySparkContext`

GluePySparkContext is a subclass of awsglue.context.GlueContext with the following additional convenience methods for creating and writing DynamicFrames for the common connection types. The method signatures ensure that you are passing the right connection options and/or format options for the chosen connection type.

MySQL
- create_dynamic_frame_from_mysql
- write_dynamic_frame_to_mysql
Oracle
- create_dynamic_frame_from_oracle
- write_dynamic_frame_to_oracle
PostgreSQL
- create_dynamic_frame_from_postgresql
- write_dynamic_frame_to_postgresql
SQL Server
- create_dynamic_frame_from_sqlserver
- write_dynamic_frame_to_sqlserver
S3
- JSON
  - create_dynamic_frame_from_s3_json
  - write_dynamic_frame_to_s3_json
- CSV
  - create_dynamic_frame_from_s3_csv
  - write_dynamic_frame_to_s3_csv
- Parquet
  - create_dynamic_frame_from_s3_parquet
  - write_dynamic_frame_to_s3_parquet
- XML
  - create_dynamic_frame_from_s3_xml
  - write_dynamic_frame_to_s3_xml
DynamoDB
- create_dynamic_frame_from_dynamodb
- create_dynamic_frame_from_dynamodb_export
- write_dynamic_frame_to_dynamodb
Kinesis
- create_dynamic_frame_from_kinesis
- write_dynamic_frame_to_kinesis
Kafka
- create_dynamic_frame_from_kafka
- write_dynamic_frame_to_kafka
OpenSearch
- create_dynamic_frame_from_opensearch
- write_dynamic_frame_to_opensearch
DocumentDB
- create_dynamic_frame_from_documentdb
- write_dynamic_frame_to_documentdb
MongoDB
- create_dynamic_frame_from_mongodb
- write_dynamic_frame_to_mongodb
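For comparison, this is roughly what the generic GlueContext call looks like without typed helpers: the connection and format options are untyped dictionaries, so a misspelled key goes unnoticed until the job runs. The sketch uses the standard `create_dynamic_frame.from_options` call and checks its shape against a mock object, so it runs without a Glue runtime. The function name `read_s3_json` and the bucket path are placeholders, and the glue-utils method signatures themselves are not shown here.

```python
from unittest.mock import MagicMock


def read_s3_json(glue_context, path):
    """Read JSON files from S3 with the generic, untyped GlueContext call."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path], "recurse": True},
        format="json",
    )


# Stand-in for a real GlueContext so the call shape can be checked anywhere.
fake_context = MagicMock()
read_s3_json(fake_context, "s3://example-bucket/input/")

# Inspect the keyword arguments the function passed along.
call = fake_context.create_dynamic_frame.from_options.call_args
print(call.kwargs["connection_type"])
print(call.kwargs["connection_options"])
```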

`GluePySparkJob`

GluePySparkJob reduces boilerplate by applying reasonable defaults while still allowing customization through keyword arguments.

In its simplest form, it takes care of instantiating awsglue.context.GlueContext and initializing awsglue.job.Job.

from glue_utils.pyspark import GluePySparkJob

# Instantiate with defaults.
job = GluePySparkJob()

# This is the SparkContext object.
sc = job.sc

# This is the GluePySparkContext(GlueContext) object.
glue_context = job.glue_context

# This is the SparkSession object.
spark = job.spark

# The rest of your job's logic.

# Commit the job if necessary (e.g. when using bookmarks).
job.commit()
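For comparison, the setup that a Glue script conventionally performs by hand looks roughly like the following. The calls come from the standard awsglue and pyspark APIs; the wrapper function name is hypothetical, and this is a sketch of the usual boilerplate rather than a description of what GluePySparkJob does internally.

```python
def init_glue_job_manually(argv):
    """Set up a Glue job by hand; only works where awsglue and pyspark exist."""
    # Imported lazily so this module can be loaded outside a Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Resolve the job name, which Glue passes as --JOB_NAME.
    args = getResolvedOptions(argv, ["JOB_NAME"])

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session

    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)
    return sc, glue_context, spark, job


# In a Glue script this would be used as:
#     import sys
#     sc, glue_context, spark, job = init_glue_job_manually(sys.argv)
#     ...  # job logic
#     job.commit()
```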

`options_cls`

You may pass a subclass of BaseOptions to make the resolved options available in job.options.

from dataclasses import dataclass
from glue_utils import BaseOptions
from glue_utils.pyspark import GluePySparkJob


@dataclass
class Options(BaseOptions):
    # Specify the arguments as field names
    start_date: str
    end_date: str
    source_path: str


# Instantiate with the above Options class.
job = GluePySparkJob(options_cls=Options)

# Access the resolved values through the fields available in job.options.
print(f"The S3 path is {job.options.source_path}")

`log_level`

You may configure the logging level. It is set to GluePySparkJob.LogLevel.WARN by default.

from glue_utils.pyspark import GluePySparkJob


# Log only errors.
job = GluePySparkJob(log_level=GluePySparkJob.LogLevel.ERROR)

`spark_conf`

You may set Spark configuration values by instantiating a custom pyspark.SparkConf object to pass to GluePySparkJob.

from pyspark import SparkConf
from glue_utils.pyspark import GluePySparkJob

# Instantiate a SparkConf and set the desired config keys/values.
spark_conf = SparkConf()
spark_conf.set("spark.driver.maxResultSize", "4g")

# Instantiate with the above custom SparkConf.
job = GluePySparkJob(spark_conf=spark_conf)

`glue_context_options`

You may set options that are passed to awsglue.context.GlueContext.

from glue_utils.pyspark import GlueContextOptions, GluePySparkJob

job = GluePySparkJob(glue_context_options={
    "minPartitions": 2,
    "targetPartitions": 10,
})

# Alternatively, you can use the GlueContextOptions TypedDict.
job = GluePySparkJob(glue_context_options=GlueContextOptions(
    minPartitions=2,
    targetPartitions=10,
))

Other features

The following modules contain useful TypedDicts for defining connection options or format options to pass as arguments to various awsglue.context.GlueContext methods:

glue_utils.pyspark.connection_options
- for defining connection_options for various connection types
glue_utils.pyspark.format_options
- for defining format_options for various formats
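To illustrate why TypedDicts help here, the sketch below defines two local stand-ins and builds options with them. The class names `S3SourceOptions` and `CsvFormatOptions` are hypothetical and are not the names exported by glue-utils; the keys shown (`paths`, `recurse`, `separator`, `withHeader`) are ordinary Glue option keys chosen for the example.

```python
from typing import List, TypedDict


class S3SourceOptions(TypedDict, total=False):
    """Hypothetical stand-in for a connection-options TypedDict."""

    paths: List[str]
    recurse: bool


class CsvFormatOptions(TypedDict, total=False):
    """Hypothetical stand-in for a format-options TypedDict."""

    separator: str
    withHeader: bool


# Either the call syntax or an annotated dict literal can be used.
connection_options = S3SourceOptions(
    paths=["s3://example-bucket/input/"],
    recurse=True,
)
format_options: CsvFormatOptions = {"separator": ",", "withHeader": True}

# At runtime a TypedDict value is an ordinary dict, so it can be passed
# anywhere a plain dict of options is expected.
is_plain_dict = type(connection_options) is dict
print(is_plain_dict)
print(format_options)

# A type checker such as mypy or pyright would flag a misspelled key, e.g.:
#     bad: CsvFormatOptions = {"seperator": ","}
```

The checking happens only in the type checker or IDE; Python itself does not validate the keys at runtime.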
