Fugue, Rapids, BlazingSQL integration

This project extends Fugue to support Rapids cuDF and BlazingSQL.

Installation

Install Rapids and BlazingSQL yourself by following their official instructions. Assuming you installed them with conda, pip install this package into the same environment:

conda run -n <your_env> pip install fugue-blazing
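
After installing, it can help to confirm that the packages are visible from the same environment. The following is a minimal sketch, not part of this project: the helper name missing_modules is hypothetical, and the module names are assumed from the imports used in this README.

```python
import importlib.util

# Module names assumed from the import statements shown in this README.
REQUIRED_MODULES = ["cudf", "fugue", "fugue_sql", "fugue_blazing"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not found in this environment:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Save it under any file name you like and run it with conda run -n <your_env> python <that_file> so that it checks the environment you installed into. It only checks that the modules can be located; it does not verify that a GPU is available.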

How To Use

As a standard Fugue extension, this package can be used in two ways: through the functional APIs or through Fugue SQL. Fugue SQL is the preferred way for this extension, because code that runs on a GPU has special requirements. Currently, transform runs on NativeExecutionEngine, which uses the CPU. For everything other than transform, Fugue relies entirely on cuDF and BlazingSQL to do the compute.

In practice, if you don't use transform, SQL may be the better choice for expressing your data pipelines.

Functional APIs

Here is an example Fugue code snippet that illustrates some of the key features of the framework. It partitions the sample data by id, sorts each partition by date, and takes the first row of each partition.

from fugue import FugueWorkflow
from fugue_blazing import CudaExecutionEngine, setup_shortcuts

# Creating sample data
data = [
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40]
]
schema = "id:str,date:date,value:double"

dag = FugueWorkflow()
dag.df(data, schema).partition_by("id", presort="date").take(1).show()

dag.run(CudaExecutionEngine)

# call setup_shortcuts to make your code more expressive
setup_shortcuts()
dag.run("blazing")
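
To make the expected result of the pipeline above concrete, here is a plain-Python sketch of what partitioning by id, presorting by date, and taking one row computes on this sample data. It is only an illustration that runs on the CPU without Fugue; the function name first_row_per_id is hypothetical, and it does not model how Fugue or cuDF actually execute the operation.

```python
from itertools import groupby
from operator import itemgetter

# Same sample data as above; ISO date strings sort chronologically.
data = [
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40],
]


def first_row_per_id(rows):
    """Sort by (id, date), then keep the earliest row of each id group."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


for row in first_row_per_id(data):
    print(row)
```

For the sample data this keeps the 2020-01-01 row of A and the 2020-01-01 row of B.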

You can also run SQL through the functional API:

from fugue import FugueWorkflow
from fugue_blazing import setup_shortcuts

setup_shortcuts()

data = [
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40]
]
schema = "id:str,date:date,value:double"

with FugueWorkflow("blazing") as dag:
    df = dag.df(data, schema)
    dag.select("* from ",df," where value>20").show()
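
If you want to check on the CPU which rows the filter above should keep, the sketch below runs an equivalent query against an in-memory SQLite table. SQLite is only a stand-in here, not what this extension uses, and its dialect may differ from BlazingSQL in other respects; the function name rows_with_value_over is hypothetical, and the ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

# Same sample data as above; None becomes SQL NULL.
data = [
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40],
]


def rows_with_value_over(rows, threshold):
    """Load rows into an in-memory table and keep those with value > threshold."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE df (id TEXT, date TEXT, value REAL)")
        conn.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT * FROM df WHERE value > ? ORDER BY id, date", (threshold,)
        )
        return cursor.fetchall()
    finally:
        conn.close()


for row in rows_with_value_over(data, 20):
    print(row)
```

Rows whose value is NULL are dropped because a comparison with NULL is not true, and the row with value 20 is dropped because the comparison is strict.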

For detailed examples, please read the Fugue Tutorials.

Fugue SQL

Programmatic Approach

from fugue_sql import fsql
from fugue_blazing import setup_shortcuts
import pandas as pd
import cudf

setup_shortcuts()

pdf = pd.DataFrame([
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40]
], columns = ["id", "date", "value"])

result = fsql("""
TAKE 1 ROW FROM df PREPARTITION BY id PRESORT date
YIELD DATAFRAME AS x
""", df=pdf).run("blazing")

# this is how you get outputs from Fugue SQL
assert isinstance(result["x"].native, cudf.DataFrame)

fsql("""
SELECT * FROM best WHERE id='A'
PRINT
SELECT id, COUNT(*) AS ct FROM orig GROUP BY id
PRINT
""", best=result["x"], orig=pdf).run("blazing")

Jupyter Notebook

Before running Jupyter, first install fugue and the notebook extension:

pip install fugue
jupyter nbextension install --sys-prefix --symlink --py fugue_notebook
jupyter nbextension enable --py fugue_notebook

In cell 1

%load_ext fugue_notebook

import pandas as pd
from fugue_blazing import setup_shortcuts
setup_shortcuts()

pdf = pd.DataFrame([
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40]
], columns = ["id", "date", "value"])

In cell 2

%%fsql blazing
TAKE 1 ROW FROM pdf PREPARTITION BY id PRESORT date
YIELD DATAFRAME AS x

In cell 3

%%fsql blazing
SELECT * FROM x WHERE id='A'
PRINT
SELECT id, COUNT(*) AS ct FROM pdf GROUP BY id
PRINT
