PyStarburst DataFrame API
PyStarburst DataFrame API allows you to query and transform data in Starburst products in a data pipeline without having to download the data locally.
Documentation
See the PyStarburst API documentation and the examples repository.
Getting started
Install pystarburst
pip install pystarburst
Connect to a Starburst server
The connection parameters are the same as those accepted by the Trino Python client.
from pystarburst import Session
connection_parameters = {
    "host": "localhost",
    "port": 8080,
    "user": "admin",
    "catalog": "tpch",
    "schema": "tiny",
}
session = Session.builder.configs(connection_parameters).create()
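In a data pipeline the connection details usually come from the environment rather than from literals in the code. The following is a minimal sketch under that assumption. The `STARBURST_*` variable names and the two helper functions are hypothetical and are not defined by PyStarburst; the only PyStarburst call used is the `Session.builder.configs(...).create()` chain shown above.

```python
import os


def connection_parameters_from_env(env=None):
    """Build the connect parameters from environment variables.

    The STARBURST_* names are hypothetical; substitute whatever
    your deployment actually defines.
    """
    env = os.environ if env is None else env
    return {
        "host": env.get("STARBURST_HOST", "localhost"),
        "port": int(env.get("STARBURST_PORT", "8080")),
        "user": env.get("STARBURST_USER", "admin"),
        "catalog": env.get("STARBURST_CATALOG", "tpch"),
        "schema": env.get("STARBURST_SCHEMA", "tiny"),
    }


def create_session(env=None):
    # Imported lazily so the helper above works without pystarburst installed.
    from pystarburst import Session

    return Session.builder.configs(connection_parameters_from_env(env)).create()
```

Keeping the dictionary construction separate from session creation makes the configuration easy to check without a running server.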
Using SQL
from pystarburst import Session
session = Session.builder.configs({ ... }).create()
session.sql("SELECT 1 as a").show()
Querying a table
from pystarburst import Session
session = Session.builder.configs({ ... }).create()
df = session.table("nation")
print(df.schema)
df.show()
Filtering a data frame
from pystarburst import Session
session = Session.builder.configs({ ... }).create()
df = session.table("nation")
df.filter(df.col("regionkey") == 0).show()
Joining data frames
from pystarburst import Session
session = Session.builder.configs({ ... }).create()
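nation = session.table("nation")
region = session.table("region")
# Join each nation to its region on the shared regionkey column
nation.join(region, nation.col("regionkey") == region.col("regionkey")).show()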
Aggregation
from pystarburst import Session
from pystarburst.functions import col
session = Session.builder.configs({ ... }).create()
df = session.table("nation")
df.agg((col("regionkey"), "max"), (col("regionkey"), "avg")).show()
Arrow spooling
When configured with Arrow encoding, DataFrame methods to_arrow_batches(), to_arrow_table() and to_pandas() use Arrow IPC spooling with parallel segment decoding for significantly faster transfer of large result sets.
pip install pystarburst[pyarrow]
from pystarburst import Session
session = Session.builder.configs({
    ...
    "encoding": "arrow-preview+zstd",
}).create()
pandas_df = session.sql("SELECT * FROM nation").to_pandas()
# or
arrow_reader = session.sql("SELECT * FROM nation").to_arrow_batches()
# or
arrow_table = session.sql("SELECT * FROM nation").to_arrow_table()
Of the three methods, to_arrow_batches() is the most memory efficient: it returns a pyarrow.RecordBatchReader that iterates over record batches without materializing the entire result set in memory.
Arrow encoding is used only for those three methods. All other operations (collect(), show(), etc.) use the default encoding.
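To illustrate the batch-at-a-time pattern, here is a small sketch that consumes the reader incrementally instead of loading the whole result. It assumes the object returned by to_arrow_batches() can be iterated directly and that each batch exposes `num_rows`, as pyarrow record batches do. The function names `count_rows` and `count_nation_rows` are illustrative and not part of PyStarburst.

```python
def count_rows(batches):
    """Sum row counts over an iterable of record batches, one batch at a time."""
    total = 0
    for batch in batches:
        # Only the current batch needs to be held in memory here.
        total += batch.num_rows
    return total


def count_nation_rows(session):
    # `session` is assumed to be a Session configured with Arrow encoding.
    reader = session.sql("SELECT * FROM nation").to_arrow_batches()
    return count_rows(reader)
```

The same loop shape applies to any per-batch processing, such as writing each batch to a file, where to_arrow_table() or to_pandas() would first hold the full result set in memory.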