PySpark bridge to DataOS

Project description

Dataos-PyFlare: DataOS SDK for Apache Spark

What it does:

Dataos-PyFlare is a powerful Python library designed to simplify data operations and interactions with the DataOS platform and Apache Spark. It provides a convenient and efficient way to load, transform, and save data.

It abstracts away the complexity of the underlying data flow, so users can focus on data transformations and business logic.

Features

  • Streamlined Data Operations: Dataos-PyFlare streamlines data operations by offering a unified interface for data loading, transformation, and storage, reducing development complexity and time.

  • Data Connector Integration: Seamlessly connect to various data connectors, including Google BigQuery, Google Cloud Storage (GCS), Snowflake, Redshift, Pulsar, and more, using the SDK's built-in capabilities.

  • Customizable and Extensible: Dataos-PyFlare allows for easy customization and extension to suit your specific project requirements. It integrates with existing Python libraries and frameworks for data manipulation (see the sketch after this list).

  • Optimized for DataOS: Dataos-PyFlare is optimized for the DataOS platform, making it an ideal choice for managing and processing data within DataOS environments.
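
Because load() returns a governed Spark DataFrame (as the Getting Started example below shows), it plugs into the wider Python ecosystem with no extra glue. Here is a minimal sketch of that interop, reusing the dataset address from the example below; the session setup is omitted:

from pyflare.sdk import load, session_builder

# Session setup omitted for brevity -- see the Getting Started example below.

# load() returns a governed Spark DataFrame, so standard PySpark tooling applies.
df = load(name="dataos://icebase:retail/city", format="iceberg")

# Inspect it with ordinary PySpark operations...
df.printSchema()
df.show(5)

# ...or hand a sample to pandas for local analysis (standard Spark API; requires pandas).
pdf = df.limit(1000).toPandas()
print(pdf.describe())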

Steps to install

Before you begin, make sure you have Python 3 (version 3.7 or later) installed on your system.

You can install Dataos-PyFlare and its dependencies using pip:

pip install dataos-pyflare

Additionally, make sure to have a Spark environment set up with the required configurations for your specific use case.
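
To confirm the installation, you can query the installed version with Python's standard importlib.metadata (available from Python 3.8; on Python 3.7, pip show dataos-pyflare reports the same information):

from importlib.metadata import version

# Prints the installed package version, e.g. "0.1.6"
print(version("dataos-pyflare"))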

Getting Started

Sample Code:

This code snippet demonstrates how to configure a Dataos-PyFlare session to load data from a source, apply transformations, and save the result to a destination.

from pyflare.sdk import load, save, session_builder

# Define your Spark conf params here
sparkConf = [
    ("spark.app.name", "Dataos Sdk Spark App"),
    ("spark.master", "local[*]"),
    ("spark.executor.memory", "4g"),
    ("spark.jars.packages",
     "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.25.1,"
     "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.17,"
     "net.snowflake:spark-snowflake_2.12:2.11.0-spark_3.3"),
]

# Provide your DataOS API token here
token = "bWF5YW5rLjkxYzZiNDQ3LWM3ZWYLWMzNjk3MzQ1MTQyNw=="

# Provide the DataOS fully qualified domain name (FQDN)
DATAOS_FQDN = "sunny-prawn.dataos.app"

# Initialize the PyFlare session
spark = session_builder.SparkSessionBuilder() \
    .with_spark_conf(sparkConf) \
    .with_user_apikey(token) \
    .with_dataos_fqdn(DATAOS_FQDN) \
    .with_depot(depot_name="icebase", acl="r") \
    .with_depot("sanitysnowflake", "rw") \
    .build_session()

# load() reads the city dataset from the source and returns a governed DataFrame
df_city = load(name="dataos://icebase:retail/city", format="iceberg")

# Perform required transformations as per business logic
df_city = df_city.drop("__metadata")

# save() writes the transformed dataset to the sink
save(name="dataos://sanitysnowflake:public/city", mode="overwrite", dataframe=df_city, format="snowflake")
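
Optionally, you can sanity-check the write by reading the dataset back from the sink. This is a sketch under one assumption: that load() accepts format="snowflake" symmetrically to save(), which the example above does not itself confirm:

# Read the freshly written dataset back from the sink.
# Assumption: load() supports format="snowflake", mirroring save() above.
df_check = load(name="dataos://sanitysnowflake:public/city", format="snowflake")
df_check.show(5)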

Explanation

  1. Importing Libraries: We import the necessary modules from the pyflare.sdk package.

  2. Spark Configuration: We define Spark configuration parameters such as the Spark application name, master URL, executor memory, and additional packages required for connectors.

  3. DataOS Token and FQDN: You provide your DataOS token and fully qualified domain name (FQDN) to authenticate and connect to the DataOS platform.

  4. PyFlare Session Initialization: We create a PyFlare session using session_builder.SparkSessionBuilder(). This session will be used for data operations.

  5. Loading Data: We use the load method to load data from a specified source (dataos://icebase:retail/city) in Iceberg format. The result is a governed DataFrame (df_city).

  6. Transformation: We perform a transformation on the loaded DataFrame by dropping the __metadata column. You can customize this step to fit your business logic (see the sketch after this list).

  7. Saving Data: Finally, we use the save method to save the transformed DataFrame to the specified destination (dataos://sanitysnowflake:public/city) in Snowflake format.
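
As noted in step 6, df_city behaves like a regular Spark DataFrame, so any standard PySpark transformation can be chained in before save(). A minimal sketch follows; the column names country_code and loaded_at are hypothetical placeholders, not actual columns of the retail/city dataset:

from pyspark.sql import functions as F

# Hypothetical business logic -- the column names below are placeholders.
df_city = (
    df_city
    .drop("__metadata")                              # as in the example above
    .filter(F.col("country_code") == "US")           # keep a subset of rows
    .withColumn("loaded_at", F.current_timestamp())  # add an audit column
)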

