Distributed Spark and Trino/Starburst adapter with SQL/JSON DAG execution, Spark writes to Trino, Iceberg and Hive, and Dataiku DSS parallel Trino reads and writes.

trino-spark-adapter

trino-spark-adapter is a Python package for building hybrid Spark and Trino/Starburst data pipelines.

It provides:

distributed Trino/Starburst reads executed from Spark executors;
Spark writes to Trino through distributed INSERT batches;
Spark writes to Iceberg tables with Iceberg partition transforms and table properties;
Spark writes to Hive-compatible tables or paths with Hive-style partitioning and bucketing;
a DAG runner that executes .sql and .json files by step number, running same-step files in parallel;
class-based logging with one logger per component;
trino_dataiku_adapter, a Dataiku DSS companion for bounded parallel Trino reads and writes through SQLExecutor2.

Installation

pip install trino-spark-adapter==1.2.1

Core concepts

A DAG is a folder of files with dot-separated names, executed by DagRunner. The naming pattern is:

{step}.{name}.{operation}.{extension}

Segment	Role
`step`	Execution order. Files with the same step run in parallel.
`name`	Logical task name. Default Spark view for `.trino_to_spark.json`.
`operation`	Engine and bridge (`spark`, `trino_to_spark`, ...).
`extension`	File type (`sql`, `json`, ...).

Example layout:

1.source.trino_to_spark.json          # step 1, default view "source"
2.enrich.spark.sql                    # step 2
2.lookup.trino_to_spark.json          # step 2, parallel with enrich, view "lookup"
2.a.trino.sql                         # step 2, short alias also works
2.b.spark.sql                         # step 2
3.export.spark_to_trino.json          # step 3

For .trino_to_spark.json, the default Spark temporary view is the name segment unless target_view or view_name is set in the JSON. Example: 2.lookup.trino_to_spark.json creates view lookup.
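The naming pattern and the default view rule can be illustrated with a small standalone parser. This is a sketch only, not the package's implementation: `DagFile`, `parse_dag_filename`, `group_by_step` and `default_view_name` are hypothetical names, and the sketch assumes every file name has exactly four segments and a numeric step.

```python
from collections import defaultdict
from typing import NamedTuple


class DagFile(NamedTuple):
    """Hypothetical container for the four segments of a DAG file name."""
    step: int
    name: str
    operation: str
    extension: str
    filename: str


def parse_dag_filename(filename: str) -> DagFile:
    """Split '{step}.{name}.{operation}.{extension}' into its segments."""
    parts = filename.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated segments, got {filename!r}")
    step, name, operation, extension = parts
    return DagFile(int(step), name, operation, extension, filename)


def group_by_step(filenames):
    """Return {step: [DagFile, ...]} with steps in ascending numeric order."""
    groups = defaultdict(list)
    for filename in filenames:
        dag_file = parse_dag_filename(filename)
        groups[dag_file.step].append(dag_file)
    return dict(sorted(groups.items()))


def default_view_name(dag_file: DagFile, config: dict) -> str:
    """An explicit target_view or view_name wins; otherwise use the name segment."""
    return config.get("target_view") or config.get("view_name") or dag_file.name


layout = [
    "1.source.trino_to_spark.json",
    "2.enrich.spark.sql",
    "2.lookup.trino_to_spark.json",
    "2.a.trino.sql",
    "2.b.spark.sql",
    "3.export.spark_to_trino.json",
]
groups = group_by_step(layout)
for step, files in groups.items():
    print(step, [f.filename for f in files])
```

Files that share a step end up in the same group, which is the unit the runner is described as executing in parallel.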

Suffix	Action
`.trino.sql`	Execute SQL statements on Trino/Starburst.
`.spark.sql`	Execute SQL statements on Spark.
`.trino_to_spark.json`	Load a Trino table or query into a Spark temporary view.
`.spark_to_trino.json`	Write a Spark view to Trino through distributed inserts.
`.spark_to_iceberg.json`	Write a Spark view to an Iceberg table with Spark Writer V2.
`.spark_to_hive.json`	Write a Spark view to a Hive-compatible table or path.
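As a rough illustration of how a suffix could be mapped to an action and how same-step files could run concurrently, the sketch below uses a thread pool. The action labels, `action_for`, `run_file` and `run_step` are hypothetical; the real runner executes engine calls where this sketch only returns a label.

```python
from concurrent.futures import ThreadPoolExecutor

# Suffix -> action label, mirroring the table above (labels are illustrative).
ACTIONS = {
    ".trino.sql": "execute_on_trino",
    ".spark.sql": "execute_on_spark",
    ".trino_to_spark.json": "load_trino_into_spark_view",
    ".spark_to_trino.json": "write_view_to_trino",
    ".spark_to_iceberg.json": "write_view_to_iceberg",
    ".spark_to_hive.json": "write_view_to_hive",
}


def action_for(filename: str) -> str:
    """Return the action label whose suffix matches the file name."""
    for suffix, action in ACTIONS.items():
        if filename.endswith(suffix):
            return action
    raise ValueError(f"unsupported DAG file: {filename!r}")


def run_file(filename: str):
    """Placeholder for the real engine call: only reports what would run."""
    return filename, action_for(filename)


def run_step(filenames, max_workers=4):
    """Run every file of one step concurrently and wait for all results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_file, filenames))


step_2 = ["2.enrich.spark.sql", "2.lookup.trino_to_spark.json", "2.a.trino.sql"]
for filename, action in run_step(step_2):
    print(f"{filename} -> {action}")
```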

Minimal DAG runner

import os

from pyspark.sql import SparkSession

from trino_spark_adapter import DagRunner, TrinoConnectionConfig

builder = SparkSession.builder.appName(
    os.environ.get("spark.app.name", "trino_spark_adapter_job")
)
for key, value in sorted(os.environ.items()):
    if key.startswith("spark."):
        builder = builder.config(key, value)
spark = builder.getOrCreate()

trino_config = TrinoConnectionConfig.from_env()

runner = DagRunner(
    spark=spark,
    trino_config=trino_config,
    params={"calculation_date": "2026-01-05"},
    reader_defaults={"fetch_size": 100_000, "num_ranges": 20, "num_partitions": 20},
)

results = runner.run_folder("dag")

Trino configuration

TrinoConnectionConfig.from_env() reads:

export TRINO_HOST="starburst.example.com"
export TRINO_USER="user"
export TRINO_PASSWORD="password"
export TRINO_ROLES="role"
export TRINO_VERIFY="false"
export TRINO_HTTP_SCHEME="https"
export TRINO_PORT="443"
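To make the mapping from environment variables to typed settings concrete, here is a minimal sketch. It does not reproduce `TrinoConnectionConfig`: the `TrinoEnvSettings` dataclass, the `settings_from_env` helper, the choice of required variables and the default values are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TrinoEnvSettings:
    """Hypothetical typed view of the TRINO_* variables listed above."""
    host: str
    user: str
    password: Optional[str]
    roles: Optional[str]
    verify: bool
    http_scheme: str
    port: int


def settings_from_env(environ: Mapping[str, str] = os.environ) -> TrinoEnvSettings:
    """Read TRINO_* variables; defaults here are assumptions, not package behavior."""
    verify_raw = environ.get("TRINO_VERIFY", "true").strip().lower()
    return TrinoEnvSettings(
        host=environ["TRINO_HOST"],  # required in this sketch
        user=environ["TRINO_USER"],  # required in this sketch
        password=environ.get("TRINO_PASSWORD"),
        roles=environ.get("TRINO_ROLES"),
        verify=verify_raw in {"1", "true", "yes"},
        http_scheme=environ.get("TRINO_HTTP_SCHEME", "https"),
        port=int(environ.get("TRINO_PORT", "443")),
    )


example_env = {
    "TRINO_HOST": "starburst.example.com",
    "TRINO_USER": "user",
    "TRINO_PASSWORD": "password",
    "TRINO_ROLES": "role",
    "TRINO_VERIFY": "false",
    "TRINO_HTTP_SCHEME": "https",
    "TRINO_PORT": "443",
}
settings = settings_from_env(example_env)
print(settings.host, settings.port, settings.verify)
```

Passing a plain dictionary instead of `os.environ` keeps the parsing testable without touching the process environment.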

Spark session

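Create a Spark session and apply any environment variable whose name starts with the spark. prefix. Bash and other POSIX shells reject dotted names in export, so set these variables with env when launching the job (your_job.py is a placeholder for your entry point), or define them in the scheduler or container environment:

env 'spark.app.name=trino_spark_adapter_job' 'spark.sql.shuffle.partitions=200' python your_job.py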

import os

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName(
    os.environ.get("spark.app.name", "trino_spark_adapter_job")
)
for key, value in sorted(os.environ.items()):
    if key.startswith("spark."):
        builder = builder.config(key, value)
spark = builder.getOrCreate()
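Dotted variable names are not valid shell identifiers, but a process environment can still carry them. One way to provide them is a small Python launcher, sketched below. The child command here only prints the spark.* names it receives; in practice it would be the job's own entry point.

```python
import os
import subprocess
import sys

# Copy the current environment and add dotted Spark settings.
child_env = dict(os.environ)
child_env["spark.app.name"] = "trino_spark_adapter_job"
child_env["spark.sql.shuffle.partitions"] = "200"

# Stand-in for the real job: list the spark.* variables visible to the child.
child_code = (
    "import os;"
    "print(sorted(k for k in os.environ if k.startswith('spark.')))"
)
completed = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
print(completed.stdout.strip())
```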

Date parameters

DateUtils can be used to compute placeholders used in DAG files.

from trino_spark_adapter import DateUtils

weekday_dates = DateUtils.generate_weekday_dates_between_start_stop(
    start_dt="2026-01-01",
    stop_dt="2026-02-01",
    weekday=0,
)

du = DateUtils(weekday_dates[0])
params = du.to_params()
params.update({"calculation_date": du.today_tiret[:10]})

Typical generated keys include today_tiret, today_slash, last_day_tiret, last_week_tiret, last_month_tiret, last_quarter_tiret, last_semester_tiret, and last_year_tiret. The suffix names the date separator: tiret is French for hyphen.
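For readers who want to see what a weekday date generator computes, here is a standard-library sketch. It is not the `DateUtils` implementation: `weekday_dates_between` is a hypothetical name, and the sketch assumes that weekday=0 means Monday (the Python convention) and that the stop date is inclusive.

```python
from datetime import date, timedelta


def weekday_dates_between(start_dt: str, stop_dt: str, weekday: int) -> list:
    """Return ISO dates of every given weekday in [start_dt, stop_dt].

    Assumes weekday follows date.weekday(): 0 is Monday, 6 is Sunday.
    """
    start = date.fromisoformat(start_dt)
    stop = date.fromisoformat(stop_dt)
    # Days to add to reach the first requested weekday on or after start.
    offset = (weekday - start.weekday()) % 7
    current = start + timedelta(days=offset)
    result = []
    while current <= stop:
        result.append(current.isoformat())
        current += timedelta(days=7)
    return result


mondays = weekday_dates_between("2026-01-01", "2026-02-01", weekday=0)
print(mondays)
```

Under these assumptions the first value is 2026-01-05, which matches the calculation_date used in the minimal DAG runner example.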

`.trino_to_spark.json`

Load a complete table:

{
  "table_fullname": "catalog.schema.source_table",
  "target_view": "source_table"
}

Distributed reads can split on any column of a supported type (DATE, TIMESTAMP, integers, decimals), not only on partition columns. Provide either num_ranges or step_ranges, not both.

Equal number of date ranges:

{
  "table_fullname": "catalog.schema.source_table",
  "colname": "event_date",
  "coltype": "DATE",
  "format": "%Y-%m-%d",
  "colname_start_value": "{start_date}",
  "colname_stop_value": "{end_date}",
  "num_ranges": 20
}
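The idea behind num_ranges can be sketched as splitting an interval into contiguous sub-ranges and generating one query per sub-range. The code below is an illustration under assumptions, not the adapter's split logic: `split_date_range` and `range_queries` are hypothetical, and the ranges are treated as half-open (start included, stop excluded).

```python
from datetime import date, timedelta


def split_date_range(start: str, stop: str, num_ranges: int):
    """Split [start, stop) into up to num_ranges contiguous half-open ranges."""
    start_d = date.fromisoformat(start)
    stop_d = date.fromisoformat(stop)
    total_days = (stop_d - start_d).days
    if total_days <= 0 or num_ranges <= 0:
        raise ValueError("need start < stop and num_ranges > 0")
    # Integer arithmetic spreads the remainder evenly across the ranges.
    bounds = [
        start_d + timedelta(days=(total_days * i) // num_ranges)
        for i in range(num_ranges + 1)
    ]
    # Drop empty ranges, which appear when num_ranges exceeds the day count.
    return [(lo, hi) for lo, hi in zip(bounds, bounds[1:]) if lo < hi]


def range_queries(table: str, column: str, ranges):
    """Build one SELECT per range with a half-open predicate."""
    return [
        f"SELECT * FROM {table} "
        f"WHERE {column} >= DATE '{lo.isoformat()}' "
        f"AND {column} < DATE '{hi.isoformat()}'"
        for lo, hi in ranges
    ]


ranges = split_date_range("2026-01-01", "2026-02-01", num_ranges=20)
queries = range_queries("catalog.schema.source_table", "event_date", ranges)
print(len(queries))
print(queries[0])
```

Half-open predicates keep neighboring queries from reading the same row twice at a shared boundary.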

Fixed interval per range (pandas offset alias):

{
  "table_fullname": "catalog.schema.source_table",
  "colname": "event_ts",
  "coltype": "TIMESTAMP",
  "format": "%Y-%m-%d %H:%M:%S",
  "colname_start_value": "2026-01-01 00:00:00",
  "colname_stop_value": "2026-02-01 00:00:00",
  "step_ranges": "7D"
}
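A fixed-interval split can be sketched in the same spirit. The example below uses a plain `timedelta` of seven days as a stand-in for the "7D" alias; it does not parse pandas offset aliases, and `split_by_step` is a hypothetical helper that assumes half-open ranges with a shorter final range when the interval does not divide evenly.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def split_by_step(start: str, stop: str, step: timedelta, fmt: str = FMT):
    """Split [start, stop) into ranges of width step; the last one may be shorter."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = datetime.strptime(start, fmt)
    end = datetime.strptime(stop, fmt)
    ranges = []
    while lo < end:
        hi = min(lo + step, end)
        ranges.append((lo.strftime(fmt), hi.strftime(fmt)))
        lo = hi
    return ranges


weekly = split_by_step(
    "2026-01-01 00:00:00",
    "2026-02-01 00:00:00",
    step=timedelta(days=7),  # stand-in for the "7D" alias
)
for lo, hi in weekly:
    print(lo, "->", hi)
```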

Optional rounding (D, H, min, S) adjusts split boundaries. Numeric columns accept num_ranges or numeric step_ranges. Omit both to run one query on the full interval.
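The rule that num_ranges and step_ranges are mutually exclusive, and that omitting both yields a single query, can be expressed as a small check. This is an illustrative sketch with hypothetical helper names (`resolve_split_mode`, `split_numeric`); the adapter's own validation and error messages may differ.

```python
def resolve_split_mode(config: dict) -> str:
    """Decide how a read config would be split, per the rules described above."""
    has_num = config.get("num_ranges") is not None
    has_step = config.get("step_ranges") is not None
    if has_num and has_step:
        raise ValueError("provide num_ranges or step_ranges, not both")
    if has_num:
        return "num_ranges"
    if has_step:
        return "step_ranges"
    return "single_query"


def split_numeric(start: int, stop: int, step: int):
    """Half-open integer ranges of width step over [start, stop)."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [(lo, min(lo + step, stop)) for lo in range(start, stop, step)]


print(resolve_split_mode({"num_ranges": 20}))
print(resolve_split_mode({}))
print(split_numeric(0, 10, 4))
```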

num_partitions controls Spark parallelism. num_ranges controls how many Trino queries are generated.

The runner creates or replaces the Spark temporary view named by target_view (or view_name). When neither is set, the view name is the second dot-separated segment of the file name. Example: 1.source.trino_to_spark.json creates view source.

`.spark_to_trino.json`

Write a Spark view to a Trino table. The target table is created automatically when it does not exist, based on the Spark schema.

{
  "source_view": "prepared_view",
  "target_table": "catalog.schema.target_table",
  "repartition_by": ["entity_id"],
  "num_partitions": 40,
  "sort_by": ["entity_id"]
}
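Writing through INSERT batches amounts to chunking rows and rendering one multi-row INSERT per chunk. The sketch below shows that idea only: `sql_literal`, `batched` and `insert_statements` are hypothetical helpers, the literal rendering covers just None, booleans, numbers and strings, and the batch size is an arbitrary value rather than an adapter default.

```python
from itertools import islice


def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (None, bool, int, float, str only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    # Strings: double any single quote to escape it.
    return "'" + str(value).replace("'", "''") + "'"


def batched(rows, batch_size: int):
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def insert_statements(table: str, columns, rows, batch_size: int):
    """Yield one multi-row INSERT statement per batch."""
    column_list = ", ".join(columns)
    for batch in batched(rows, batch_size):
        values = ", ".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({column_list}) VALUES {values}"


rows = [(1, "alice"), (2, "o'neil"), (3, None)]
statements = list(
    insert_statements("catalog.schema.target_table", ["id", "name"], rows, batch_size=2)
)
for statement in statements:
    print(statement)
```

In a distributed write, each Spark partition would run its own sequence of such batches, which is why num_partitions and repartition_by affect how the inserts are spread.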

`.spark_to_iceberg.json`

Write a Spark view to an Iceberg table through the configured Spark Iceberg catalog.

{
  "source_view": "prepared_view",
  "catalog": "iceberg_catalog",
  "schema": "analytics",
  "table": "target_table",
  "mode": "append",
  "partition_spec": [
    {"transform": "day", "column": "event_ts"},
    {"transform": "bucket", "column": "entity_id", "num_buckets": 32}
  ],
  "distribution_mode": "hash",
  "format_version": 2,
  "file_format": "PARQUET",
  "repartition_by": ["entity_id"],
  "num_partitions": 200
}

Common modes are create, replace, append, and overwrite_partitions.
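To show how the partition_spec and table options could relate to Iceberg concepts, the sketch below renders them as a Spark DDL partition clause and as Iceberg table properties. The mapping is an assumption about intent, not the adapter's code: `partition_clause` and `table_properties` are hypothetical, and the `width` key for truncate does not appear in the example above.

```python
def partition_clause(partition_spec) -> str:
    """Render a partition_spec list as an Iceberg Spark DDL PARTITIONED BY clause."""
    fields = []
    for field in partition_spec:
        transform = field.get("transform", "identity")
        column = field["column"]
        if transform == "identity":
            fields.append(column)
        elif transform in {"year", "month", "day", "hour"}:
            # Iceberg's Spark DDL accepts the plural forms years(), months(), ...
            fields.append(f"{transform}s({column})")
        elif transform == "bucket":
            fields.append(f"bucket({field['num_buckets']}, {column})")
        elif transform == "truncate":
            fields.append(f"truncate({field['width']}, {column})")  # hypothetical key
        else:
            raise ValueError(f"unsupported transform: {transform!r}")
    return "PARTITIONED BY (" + ", ".join(fields) + ")"


def table_properties(config: dict) -> dict:
    """Map config keys to Iceberg table property names (assumed mapping)."""
    properties = {}
    if "distribution_mode" in config:
        properties["write.distribution-mode"] = config["distribution_mode"]
    if "format_version" in config:
        properties["format-version"] = str(config["format_version"])
    if "file_format" in config:
        properties["write.format.default"] = config["file_format"].lower()
    return properties


config = {
    "partition_spec": [
        {"transform": "day", "column": "event_ts"},
        {"transform": "bucket", "column": "entity_id", "num_buckets": 32},
    ],
    "distribution_mode": "hash",
    "format_version": 2,
    "file_format": "PARQUET",
}
print(partition_clause(config["partition_spec"]))
print(table_properties(config))
```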

`.spark_to_hive.json`

Write a Spark view to a Hive table:

{
  "source_view": "prepared_view",
  "table": "analytics.target_table",
  "mode": "overwrite",
  "format": "parquet",
  "partition_by": ["event_date"],
  "repartition_by": ["event_date"],
  "num_partitions": 40
}

Write a Spark view to a path such as S3A:

{
  "source_view": "prepared_view",
  "path": "s3a://bucket/path/target_table",
  "mode": "overwrite",
  "format": "parquet",
  "partition_by": ["event_date"]
}

Hive bucketing is supported only with table / saveAsTable:

{
  "source_view": "prepared_view",
  "table": "analytics.bucketed_table",
  "mode": "overwrite",
  "format": "parquet",
  "bucket_by": ["entity_id"],
  "num_buckets": 32,
  "sort_by": ["entity_id"]
}
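The constraint that bucketing needs a table target can be checked before any Spark work starts. The validation below is a sketch: `validate_hive_config` is a hypothetical helper, and the rule that exactly one of table or path must be set is an assumption drawn from the three examples rather than documented behavior.

```python
def validate_hive_config(config: dict) -> str:
    """Return 'table' or 'path' for a valid config, or raise ValueError."""
    has_table = bool(config.get("table"))
    has_path = bool(config.get("path"))
    if has_table == has_path:
        # Assumption: a write targets either a table or a path, never both.
        raise ValueError("set exactly one of 'table' or 'path'")
    if config.get("bucket_by"):
        if has_path:
            # Spark supports bucketBy only together with saveAsTable.
            raise ValueError("bucket_by requires 'table'")
        if not config.get("num_buckets"):
            raise ValueError("bucket_by requires num_buckets")
    return "table" if has_table else "path"


bucketed = {
    "source_view": "prepared_view",
    "table": "analytics.bucketed_table",
    "bucket_by": ["entity_id"],
    "num_buckets": 32,
}
print(validate_hive_config(bucketed))
```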

Dataiku DSS and Trino

Inside Dataiku DSS, Trino credentials are not exposed to Python code. The supported entry point is dataiku.SQLExecutor2, created from a reference dataset that already uses the Trino connection configured in the project.

Spark executors cannot access the Dataiku API or managed connections. For Dataiku recipes and notebooks, use trino_dataiku_adapter instead of the Spark-based distributed reader.

The adapter targets Dataiku DSS v13 and keeps all SQL execution inside the current Python node.

Parallel reads with SQLExecutor2

import dataiku
from trino_dataiku_adapter import DistributedTrinoDataikuReader

dataset = dataiku.Dataset("my_trino_reference_dataset")

reader = DistributedTrinoDataikuReader(
    dataset=dataset,
    queries=[
        "SELECT * FROM catalog.schema.events WHERE event_date >= DATE '2026-01-01' AND event_date < DATE '2026-01-15'",
        "SELECT * FROM catalog.schema.events WHERE event_date >= DATE '2026-01-15' AND event_date <= DATE '2026-01-31'",
    ],
    max_workers=2,
)

while reader.status().tasks_pending > 0 or reader.status().tasks_running > 0:
    result = reader.wait_for_any(timeout=30)
    preview = result.dataframe.head()

df = reader.pandas_df()

max_workers limits real concurrent execution. With 20 queries and max_workers=2, only two queries run at the same time.
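The effect of max_workers can be observed with a plain thread pool. The sketch below is independent of Dataiku and of the adapter: `ConcurrencyProbe` is a hypothetical helper that records how many simulated queries run at once, and a short sleep stands in for query execution.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Counts how many tasks run simultaneously and remembers the peak."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0

    def run(self, query: str) -> str:
        with self._lock:
            self._running += 1
            self.peak = max(self.peak, self._running)
        try:
            time.sleep(0.01)  # stand-in for executing the query
            return f"done: {query}"
        finally:
            with self._lock:
                self._running -= 1


probe = ConcurrencyProbe()
queries = [f"SELECT {i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(probe.run, queries))

print(len(results), "queries finished, peak concurrency:", probe.peak)
```

All 20 tasks are accepted immediately, but the pool never runs more than two of them at once.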

Progressive notebook exploration

wait_for_any returns the next completed task without waiting for the full batch. This is useful when several chunks are loading and the first available result is enough to start exploring the data.

status = reader.status()
print(status.tasks_submitted, status.tasks_completed, status.tasks_pending)
print(status.avg_task_duration_seconds, status.estimated_remaining_seconds)

wait_for_all blocks until every task has finished.
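The difference between waiting for any task and waiting for all of them corresponds to a standard pattern in `concurrent.futures`. The sketch below illustrates that pattern and one plausible way to estimate remaining time; `load_chunk` and `estimated_remaining_seconds` are hypothetical, and the estimate formula is an assumption, not the adapter's calculation.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def load_chunk(index: int) -> int:
    """Stand-in for one query; later chunks take slightly longer."""
    time.sleep(0.01 * (index + 1))
    return index


def estimated_remaining_seconds(avg_task_seconds: float, tasks_left: int, max_workers: int) -> float:
    """Rough estimate: remaining tasks run in waves of max_workers."""
    if tasks_left <= 0:
        return 0.0
    waves = -(-tasks_left // max_workers)  # ceiling division
    return avg_task_seconds * waves


completed = []
with ThreadPoolExecutor(max_workers=2) as pool:
    pending = {pool.submit(load_chunk, i) for i in range(4)}
    while pending:
        # Returns as soon as at least one future finishes (or the timeout expires).
        done, pending = wait(pending, timeout=30, return_when=FIRST_COMPLETED)
        completed.extend(future.result() for future in done)
        print("completed so far:", len(completed), "pending:", len(pending))

print(estimated_remaining_seconds(10.0, tasks_left=5, max_workers=2))
```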

Range-based builder

The Dataiku builder reuses the same split logic as DistributedTrinoSparkReaderBuilder:

from trino_dataiku_adapter import DistributedTrinoDataikuReaderBuilder

builder = DistributedTrinoDataikuReaderBuilder(
    dataset=dataset,
    table_fullname="catalog.schema.events",
    colname="event_date",
    coltype="DATE",
    colname_start_value="2026-01-01",
    colname_stop_value="2026-02-01",
    num_ranges=20,
    max_workers=2,
)

first_chunk = builder.wait_for_any().dataframe
full_df = builder.pandas_df()

Result assembly options

Stay in pandas:

pandas_df = reader.pandas_df()

Concatenate all pandas chunks, then create one Spark DataFrame:

spark_df = reader.to_spark_df(spark, strategy="concat")

Create one Spark DataFrame per chunk, then union them:

spark_df = reader.to_spark_df(spark, strategy="union")

Write a pandas DataFrame to Trino

import pandas as pd
from trino_dataiku_adapter import TrinoDataikuWriter, TrinoDataikuWriterConfig

df = pd.DataFrame({"id": [1, 2], "name": ["alice", "bob"]})

writer = TrinoDataikuWriter(
    TrinoDataikuWriterConfig(
        dataset=dataset,
        target_table="catalog.schema.target_table",
        batch_size=500,
        max_workers=2,
    )
)

writer.write_dataframe(df)
writer.wait_for_all()

Write batches use post_queries=['COMMIT'], as required by the Dataiku SQL API for INSERT statements.

Logging

Every main component inherits from LogBase and exposes a class logger.

import logging
from trino_spark_adapter import DagRunner, DistributedTrinoSparkReader, SparkHiveWriter

DagRunner.logger().setLevel(logging.INFO)
DistributedTrinoSparkReader.logger().setLevel(logging.DEBUG)
SparkHiveWriter.logger().setLevel(logging.INFO)

The default formatter includes timestamp, class name and level:

[2026-01-05 09:15:12] [DagRunner] [INFO] START task type=trino_to_spark file=1.source.trino_to_spark.json
[2026-01-05 09:17:03] [DagRunner] [INFO] SUCCESS task type=trino_to_spark file=1.source.trino_to_spark.json elapsed=0:01:51

Debug-only expensive Spark actions should be guarded explicitly:

logger = DagRunner.logger()
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("row_count=%s", df.count())

Documentation

HTML documentation can be built locally:

pip install "trino-spark-adapter[docs]"
make -C docs html

Open docs/_build/html/index.html. The PyPI project page links to the online documentation URL.

Demos

Example notebooks are in demos/. They cover connection setup, distributed reads and writes, the DAG runner and date parameters. AES encryption is documented separately at the end of the demo list and in the optional AES section below.

Publishing

The source archive includes a publishing script, .pypi.sh.

chmod +x .pypi.sh
./.pypi.sh

Resuming or partially executing a DAG

DagRunner.run_folder can execute only part of a DAG folder. This is useful when a long run fails after several successful files and you want to restart from the failing task.

runner.run_folder(
    "dag",
    start_from_file="6.transform.spark.sql",
)

You can also start from a zero-based index:

runner.run_folder(
    "dag",
    start_from_index=5,
)

For short validation runs, stop after a specific file or index:

runner.run_folder(
    "dag",
    stop_after_file="6.transform.spark.sql",
)

To run only a chosen subset, provide explicit file names. The files are still executed in the DAG folder's alphanumeric order:

runner.run_folder(
    "dag",
    files=[
        "6.transform.spark.sql",
        "7.export.spark_to_iceberg.json",
    ],
)
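The selection options can be pictured as filters over an ordered file list. The sketch below is illustrative only: `select_files` and `natural_key` are hypothetical, the ordering by numeric step is an assumption, and the sketch applies start_from_index to the list that remains after any files filter, which may differ from the runner.

```python
def natural_key(filename: str):
    """Order by numeric step first, then by the rest of the name."""
    step, _, rest = filename.partition(".")
    return int(step), rest


def select_files(
    all_files,
    start_from_file=None,
    start_from_index=None,
    stop_after_file=None,
    files=None,
):
    """Return the ordered subset of files that a partial run would execute."""
    ordered = sorted(all_files, key=natural_key)
    if files is not None:
        wanted = set(files)
        missing = wanted - set(ordered)
        if missing:
            raise ValueError(f"unknown files: {sorted(missing)}")
        ordered = [f for f in ordered if f in wanted]
    start = 0
    if start_from_file is not None:
        start = ordered.index(start_from_file)
    elif start_from_index is not None:
        start = start_from_index  # zero-based
    stop = len(ordered)
    if stop_after_file is not None:
        stop = ordered.index(stop_after_file) + 1
    return ordered[start:stop]


dag = [
    "10.cleanup.trino.sql",
    "1.source.trino_to_spark.json",
    "2.enrich.spark.sql",
    "6.transform.spark.sql",
    "7.export.spark_to_iceberg.json",
]
print(select_files(dag, start_from_file="6.transform.spark.sql"))
print(select_files(dag, files=["7.export.spark_to_iceberg.json", "6.transform.spark.sql"]))
```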

Optional Spark and Trino dependencies

Install the core package:

pip install trino-spark-adapter==1.2.1

Optional extras:

pip install "trino-spark-adapter[spark]==1.2.1"
pip install "trino-spark-adapter[trino]==1.2.1"
pip install "trino-spark-adapter[all]==1.2.1"

Optional AES encryption

AES helpers are a separate feature. They are not required for Trino reads, Spark writes or the DAG runner.

SparkAESHelper can register aes_encrypt and aes_decrypt on a Spark session. The key and IV are read from the environment and stored inside the UDF closure. SQL only receives the column value.

export aes_key_str="...base64..."
export aes_iv_str="...base64..."

from trino_spark_adapter import SparkAESHelper

spark = SparkAESHelper.get_spark(register_aes=True)
spark.sql("SELECT aes_encrypt('abc') AS encrypted_value").show()

See demos/07_chiffrement_aes_spark.ipynb and docs/aes_optional.rst for more detail.

Release history

Version	Date
1.2.1 (this version)	Jul 5, 2026
1.1.6	Jul 4, 2026
1.1.5	Jul 4, 2026
1.1.4	Jun 1, 2026
1.1.3	Jun 1, 2026
1.1.2	May 31, 2026
1.1.1	May 31, 2026

