DataOS-PyFlare: DataOS SDK for Apache Spark
A PySpark bridge to DataOS.
What it does:
DataOS-PyFlare is a Python library designed to simplify data operations and interactions between the DataOS platform and Apache Spark. It provides a convenient and efficient way to load, transform, and save data, abstracting away the complexity of the data flow so that users can focus on data transformations and business logic.
Features
- Streamlined Data Operations: DataOS-PyFlare streamlines data operations by offering a unified interface for data loading, transformation, and storage, reducing development complexity and time.
- Data Connector Integration: Seamlessly connect to various data connectors, including Google BigQuery, Google Cloud Storage (GCS), Snowflake, Redshift, Pulsar, and more, using the SDK's built-in capabilities.
- Customizable and Extensible: DataOS-PyFlare allows for easy customization and extension to suit your specific project requirements. It integrates with existing Python libraries and frameworks for data manipulation.
- Optimized for DataOS: DataOS-PyFlare is optimized for the DataOS platform, making it an ideal choice for managing and processing data within DataOS environments.
Steps to install
Before you begin, make sure you have Python 3.7 or later installed on your system.
You can install DataOS-PyFlare and its dependencies using pip:

```shell
pip install dataos-pyflare
```
Additionally, make sure to have a Spark environment set up with the required configurations for your specific use case.
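Connector dependencies are handed to Spark as a single comma-separated string of Maven coordinates under `spark.jars.packages`. As a minimal sketch (using the same connector coordinates as the sample code below; swap in whichever connectors your use case needs), the string can be assembled from a plain list:

```python
# Maven coordinates of the connector jars used in the sample code
# (BigQuery, GCS, and Snowflake connectors for Spark 3 / Scala 2.12).
connector_packages = [
    "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.25.1",
    "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.17",
    "net.snowflake:spark-snowflake_2.12:2.11.0-spark_3.3",
]

# Spark expects one comma-separated string for spark.jars.packages.
spark_jars_packages = ",".join(connector_packages)
print(spark_jars_packages)
```

Keeping the coordinates in a list makes it easy to add or remove connectors without editing a long single-line string.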
Getting Started
Sample Code:
This code snippet demonstrates how to configure a Dataos-PyFlare session to load data from a source, apply transformations, and save the result to a destination.
```python
from pyflare.sdk import load, save, session_builder

# Define your Spark conf params here
sparkConf = [
    ("spark.app.name", "Dataos Sdk Spark App"),
    ("spark.master", "local[*]"),
    ("spark.executor.memory", "4g"),
    ("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.25.1,"
                            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.17,"
                            "net.snowflake:spark-snowflake_2.12:2.11.0-spark_3.3"),
]

# Provide your DataOS token here
token = "bWF5YW5rLjkxYzZiNDQ3LWM3ZWYLWMzNjk3MzQ1MTQyNw=="

# Provide the DataOS fully qualified domain name
DATAOS_FQDN = "sunny-prawn.dataos.app"

# Initialize the PyFlare session
spark = session_builder.SparkSessionBuilder() \
    .with_spark_conf(sparkConf) \
    .with_user_apikey(token) \
    .with_dataos_fqdn(DATAOS_FQDN) \
    .with_depot(depot_name="icebase", acl="r") \
    .with_depot("sanitysnowflake", "rw") \
    .build_session()

# load() reads the dataset "city" from the source and returns a governed dataframe
df_city = load(name="dataos://icebase:retail/city", format="iceberg")

# Perform the required transformations as per your business logic
df_city = df_city.drop("__metadata")

# save() writes the transformed dataset to the sink
save(name="dataos://sanitysnowflake:public/city", mode="overwrite", dataframe=df_city, format="snowflake")
```
Explanation
- Importing Libraries: We import the necessary modules from the pyflare.sdk package.
- Spark Configuration: We define Spark configuration parameters such as the application name, master URL, executor memory, and the additional packages required for connectors.
- DataOS Token and FQDN: You provide your DataOS token and fully qualified domain name (FQDN) to authenticate and connect to the DataOS platform.
- PyFlare Session Initialization: We create a PyFlare session using session_builder.SparkSessionBuilder(). This session is used for all subsequent data operations.
- Loading Data: We use the load method to read data from the specified source (dataos://icebase:retail/city) in Iceberg format. The result is a governed DataFrame (df_city).
- Transformation: We transform the loaded DataFrame by dropping the __metadata column. You can customize this step to fit your business logic.
- Saving Data: Finally, we use the save method to write the transformed DataFrame to the specified destination (dataos://sanitysnowflake:public/city) in Snowflake format.
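Both load and save identify datasets by a governed address of the form dataos://&lt;depot&gt;:&lt;collection&gt;/&lt;dataset&gt;. To make that structure concrete, here is a small sketch that splits such an address into its parts; parse_dataset_address is a hypothetical helper for illustration only, not part of the PyFlare SDK:

```python
from urllib.parse import urlparse

def parse_dataset_address(address: str) -> dict:
    """Split a dataos://<depot>:<collection>/<dataset> address into its parts.

    Hypothetical helper for illustration; not part of the PyFlare SDK.
    """
    parsed = urlparse(address)
    if parsed.scheme != "dataos":
        raise ValueError(f"expected a dataos:// address, got: {address}")
    # netloc holds "<depot>:<collection>", path holds "/<dataset>"
    depot, _, collection = parsed.netloc.partition(":")
    return {"depot": depot, "collection": collection, "dataset": parsed.path.lstrip("/")}

print(parse_dataset_address("dataos://icebase:retail/city"))
# {'depot': 'icebase', 'collection': 'retail', 'dataset': 'city'}
```

Reading the sample code with this in mind: the load targets the city dataset in the retail collection of the icebase depot, and the save targets the city dataset in the public collection of the sanitysnowflake depot.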
File details
Details for the file dataos_pyflare-0.1.13.tar.gz.

File metadata
- Download URL: dataos_pyflare-0.1.13.tar.gz
- Upload date:
- Size: 660.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes
Algorithm | Hash digest
---|---
SHA256 | 3f5360b5492b6d26d8836b4e927acb0cb953e5d5ae5da57bd165a6795c7c74e5
MD5 | ed3bfb2ec001777e69e95dca738dc6e4
BLAKE2b-256 | cdbe6f59db5e4e4e8a32e93db56de7530f2ef8ffb5a83889da844a4a496ce946
File details
Details for the file dataos_pyflare-0.1.13-py3-none-any.whl.

File metadata
- Download URL: dataos_pyflare-0.1.13-py3-none-any.whl
- Upload date:
- Size: 675.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes
Algorithm | Hash digest
---|---
SHA256 | d25683ec8c68e12d10911018b018fb9d5be3338e0ab14207b6d0afd6795ff198
MD5 | 1dd5f17c61ba117b8b21a83ac172457f
BLAKE2b-256 | a06a6035020beadeae3f8ccaea31185deaa364235cb0ce62072e1637ce18d8d6