pygen-spark

A code generation library that extends pygen to generate Python User-Defined Table Functions (UDTFs) for CDF Data Models, enabling you to query CDF data directly from Spark SQL.

Latest Release: Version 0.2.0 includes improved error handling, direct REST API calls, and enhanced time series UDTF support.

Note: This document refers to packages by their PyPI names:

  • PyPI: cognite-pygen (repository: pygen)
  • PyPI: cognite-pygen-spark (repository: pygen-spark)
  • Import paths: cognite.pygen, cognite.pygen_spark

Overview

cognite.pygen_spark (PyPI: cognite-pygen-spark) is a generic Spark UDTF code generation library that works with any Spark cluster (standalone, YARN, Kubernetes, or local development). It generates strongly-typed Python UDTF functions from CDF Data Models using Jinja2 templates, allowing you to query CDF data directly from Spark SQL.

Package Purpose:

  • Generic Spark Support: Works with any Spark cluster, not limited to Databricks
  • Template-Based Generation: Uses Jinja2 templates to generate UDTF code for both Data Model UDTFs and Time Series UDTFs
  • Type Conversion Utilities: Provides TypeConverter class for converting between CDF types, PySpark DataTypes, and SQL DDL
  • Connection Configuration: Provides CDFConnectionConfig Pydantic model for managing CDF credentials from TOML/YAML files
  • Utility Functions: Helper functions for consistent UDTF naming and other generic Spark utilities

Features

  • UDTF Generation: Automatically generates Python UDTF functions for each View in a CDF Data Model
  • Time Series UDTFs: Generated UDTFs for querying CDF time series datapoints (single, multiple, latest), built with the same Jinja2 templates as the Data Model UDTFs
  • Type Safety: Leverages pygen's internal representation for strongly-typed code generation
  • Predicate Pushdown: Generated UDTFs support filter translation from Spark SQL to CDF API filters
  • Configuration File Support: Uses TOML/YAML configuration files for secure credential management
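To make the predicate-pushdown feature concrete, here is a hedged sketch of how a Spark SQL comparison predicate might be translated into a CDF-style filter dictionary. The function names and the property path used below are illustrative assumptions, not the library's actual API; the generated UDTFs use pygen's own filter translation.

```python
# Illustrative sketch only: translate_predicate and combine_and are
# hypothetical names, and the ["node", column] property path is an
# assumption -- the real path depends on the view definition.
from typing import Any


def translate_predicate(column: str, op: str, value: Any) -> dict:
    """Map a simple Spark SQL comparison to a CDF-style filter dict."""
    prop = ["node", column]  # assumed property path
    if op == "=":
        return {"equals": {"property": prop, "value": value}}
    if op == ">":
        return {"range": {"property": prop, "gt": value}}
    if op == "<":
        return {"range": {"property": prop, "lt": value}}
    raise ValueError(f"unsupported operator: {op}")


def combine_and(filters: list) -> dict:
    """AND together several pushed-down predicates."""
    return filters[0] if len(filters) == 1 else {"and": filters}
```

Pushing filters down this way lets the CDF API do the filtering server-side instead of materializing the full view in Spark.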

Using Generic Spark Utilities

pygen-spark provides generic utilities that work with any Spark cluster:

from cognite.pygen_spark import TypeConverter, CDFConnectionConfig, to_udtf_function_name

# Type conversion utilities
from cognite.client import data_modeling as dm
from pyspark.sql.types import StringType

# Convert CDF property type to PySpark DataType
spark_type = TypeConverter.cdf_to_spark(dm.Text(), is_array=False)
# Returns: StringType()

# Convert PySpark DataType to SQL DDL
sql_ddl = TypeConverter.spark_to_sql_ddl(spark_type)
# Returns: "STRING"

# Connection configuration from TOML
config = CDFConnectionConfig.from_toml("config.toml")
client = config.create_client()

# Convert view external_id to UDTF function name
udtf_name = to_udtf_function_name("MyView")
# Returns: "my_view_udtf"

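The naming convention shown above (`MyView` → `my_view_udtf`) amounts to a PascalCase-to-snake_case conversion with a `_udtf` suffix. The following is an illustrative re-implementation of that idea, not the package's actual code:

```python
import re


def to_udtf_function_name_sketch(view_external_id: str) -> str:
    """Illustrative sketch: convert a view external_id like 'MyView'
    into a snake_case UDTF name like 'my_view_udtf'."""
    # Insert an underscore before each capital letter (except the first
    # character), then lowercase the whole string.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", view_external_id).lower()
    return f"{snake}_udtf"
```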

Installation

pip install cognite-pygen-spark

Quick Start

from pathlib import Path
from cognite.client.data_classes.data_modeling.ids import DataModelId
from cognite.pygen import load_cognite_client_from_toml
from cognite.pygen_spark import SparkUDTFGenerator

# Load client from TOML file
client = load_cognite_client_from_toml("config.toml")

# Create generator
generator = SparkUDTFGenerator(
    client=client,
    output_dir=Path("./generated_udtfs"),
    data_model=DataModelId(space="sailboat", external_id="sailboat", version="1"),
    top_level_package="cognite_udtfs",
)

# Generate UDTFs for a Data Model
result = generator.generate_udtfs()

print(f"Generated {result.total_count} UDTF(s)")
for view_id, file_path in result.generated_files.items():
    print(f"  - {view_id}: {file_path}")

# Generate time series UDTFs (template-generated, same as data model UDTFs)
ts_result = generator.generate_time_series_udtfs()
print(f"Generated {ts_result.total_count} time series UDTF(s)")
for udtf_name, file_path in ts_result.generated_files.items():
    print(f"  - {udtf_name}: {file_path}")

See the User Guide for complete documentation on generating, registering, and querying UDTFs.

Architecture

cognite.pygen_spark extends cognite.pygen's architecture:

  • Reuses pygen's View parsing: Leverages pygen's internal representation of CDF Data Models
  • Custom template engine: Uses Jinja2 templates to generate UDTF Python code and SQL Views
  • Extends MultiAPIGenerator: Builds on pygen's code generation infrastructure
  • Consistent template-based generation: Both Data Model UDTFs and Time Series UDTFs use the same Jinja2 template-based generation approach for consistent behavior, error handling, and initialization patterns
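As a rough illustration of template-based generation, the sketch below renders a tiny UDTF class skeleton from a template. It uses the stdlib `string.Template` as a stand-in for Jinja2, and the template content is invented for illustration; the package's real `udtf_function.py.jinja` is far more elaborate.

```python
from string import Template

# Hypothetical, minimal stand-in for a UDTF template.
UDTF_TEMPLATE = Template('''\
class ${class_name}UDTF:
    """Query the ${view_external_id} view from Spark SQL."""
    VIEW_EXTERNAL_ID = "${view_external_id}"
''')


def render_udtf(class_name: str, view_external_id: str) -> str:
    """Fill the template with per-view values, as a generator would."""
    return UDTF_TEMPLATE.substitute(
        class_name=class_name, view_external_id=view_external_id
    )
```

Generating one such module per View in the Data Model is what keeps the UDTFs strongly typed: each file is specialized at generation time rather than dispatching dynamically at query time.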

See the Technical Plan for detailed architecture documentation.

Requirements

  • Python 3.9+
  • PySpark 3.5+ (required for UDTF support)
  • cognite-pygen (PyPI package name; import: cognite.pygen)
  • cognite-sdk-python (must be installed on all Spark worker nodes)
  • Spark cluster (standalone, YARN, Kubernetes, or local)

Package Structure

pygen-spark/
├── cognite/
│   └── pygen_spark/
│       ├── __init__.py
│       ├── generator.py          # SparkUDTFGenerator
│       ├── udtf_generator.py     # SparkMultiAPIGenerator
│       └── templates/
│           ├── udtf_function.py.jinja
│           ├── view_sql.py.jinja
│           └── udtf_init.py.jinja
├── pyproject.toml
└── README.md

Development

Setup

git clone <repository-url>
cd pygen-spark
pip install -e ".[dev]"

Running Tests

pytest tests/

Spark Cluster Compatibility

This package generates UDTF code that works with any Spark cluster:

  • Code Generation: Works on all Spark versions ✅
  • UDTF Templates: Compatible with PySpark 3.5+ ✅
  • Dependency Management: Requires cognite-sdk on all Spark worker nodes ⚠️

For standalone Spark clusters, ensure cognite-sdk is installed on all worker nodes. See the Installation Guide for details.
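One simple way to catch a missing worker dependency before registering UDTFs is an import check. The helper below is a hedged sketch (the function name is invented); in a real cluster you would run it on each worker, for example via a small Spark job.

```python
import importlib.util


def missing_dependencies(required) -> list:
    """Return the subset of required packages that cannot be imported
    in the current environment (e.g. on a Spark worker)."""
    missing = []
    for name in required:
        try:
            spec = importlib.util.find_spec(name)
        except ModuleNotFoundError:
            # Raised when a dotted name's parent package is absent.
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```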

Related Packages

  • cognite-pygen (repository: pygen)

Documentation

  • User Guide
  • Examples
  • Technical Documentation

License

[License information]

Contributing

[Contributing guidelines]
