Code generation library for creating Python UDTFs from CDF Data Models
pygen-spark
A code generation library that extends pygen to generate Python User-Defined Table Functions (UDTFs) for CDF Data Models, enabling you to query CDF data directly from Spark SQL.
Latest Release: Version 0.2.0 includes improved error handling, direct REST API calls, and enhanced time series UDTF support.
Note: This document uses PyPI package names for references:
- PyPI: cognite-pygen (repository: pygen)
- PyPI: cognite-pygen-spark (repository: pygen-spark)
- Import paths: cognite.pygen, cognite.pygen_spark
Overview
cognite.pygen_spark (PyPI: cognite-pygen-spark) is a generic Spark UDTF code generation library that works with any Spark cluster (standalone, YARN, Kubernetes, or local development). It generates strongly-typed Python UDTF functions from CDF Data Models using Jinja2 templates, allowing you to query CDF data directly from Spark SQL.
Package Purpose:
- Generic Spark Support: Works with any Spark cluster, not limited to Databricks
- Template-Based Generation: Uses Jinja2 templates to generate UDTF code for both Data Model UDTFs and Time Series UDTFs
- Type Conversion Utilities: Provides the TypeConverter class for converting between CDF types, PySpark DataTypes, and SQL DDL
- Connection Configuration: Provides the CDFConnectionConfig Pydantic model for managing CDF credentials from TOML/YAML files
- Utility Functions: Helper functions for consistent UDTF naming and other generic Spark utilities
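The template-based approach can be pictured with a toy Jinja2 template. The template below is purely illustrative and is not the package's actual udtf_function.py.jinja; it only shows how a View name is substituted into generated UDTF code.

```python
# Illustrative sketch of template-based code generation with Jinja2.
# The template string is a toy stand-in, not the package's real template.
from jinja2 import Template

template = Template(
    "class {{ view_name }}UDTF:\n"
    "    def eval(self):\n"
    "        ...  # query the '{{ view_name }}' View and yield rows\n"
)
code = template.render(view_name="Sailboat")
print(code)
```

The real templates additionally render the output schema, filter translation, and CDF client initialization for each View.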
Features
- UDTF Generation: Automatically generates Python UDTF functions for each View in a CDF Data Model
- Time Series UDTFs: Template-generated UDTFs for querying CDF time series datapoints (single, multiple, latest) using the same template-based generation as Data Model UDTFs
- Type Safety: Leverages pygen's internal representation for strongly-typed code generation
- Predicate Pushdown: Generated UDTFs support filter translation from Spark SQL to CDF API filters
- Configuration File Support: Uses TOML/YAML configuration files for secure credential management
- Generic Spark Support: Works with any Spark cluster, not limited to Databricks
- Type Conversion Utilities: TypeConverter class for converting between CDF types, PySpark DataTypes, and SQL DDL
- Connection Configuration: CDFConnectionConfig Pydantic model for managing CDF credentials
- Utility Functions: Helper functions for consistent UDTF naming and other utilities
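Predicate pushdown amounts to translating a Spark SQL WHERE predicate into the filter JSON accepted by the CDF instances API. The helper below is a hypothetical sketch of that translation, not the library's actual implementation; the `equals` and `range` filter shapes follow the CDF instances-API filter DSL.

```python
# Hypothetical sketch of predicate pushdown: a simple binary predicate
# from Spark SQL is rewritten as a CDF instances-API filter dict.
def predicate_to_cdf_filter(property_path: list, op: str, value):
    """Translate `<property> <op> <value>` into a CDF filter."""
    if op == "=":
        return {"equals": {"property": property_path, "value": value}}
    if op == ">":
        return {"range": {"property": property_path, "gt": value}}
    if op == "<":
        return {"range": {"property": property_path, "lt": value}}
    raise ValueError(f"unsupported operator: {op}")

# e.g. WHERE name = 'Osprey' on a View in space 'sailboat'
f = predicate_to_cdf_filter(["sailboat", "Sailboat/1", "name"], "=", "Osprey")
```

Predicates the UDTF cannot translate are simply left for Spark to evaluate after the CDF results are returned.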
Using Generic Spark Utilities
pygen-spark provides generic utilities that work with any Spark cluster:
from cognite.pygen_spark import TypeConverter, CDFConnectionConfig, to_udtf_function_name
# Type conversion utilities
from cognite.client import data_modeling as dm
from pyspark.sql.types import StringType
# Convert CDF property type to PySpark DataType
spark_type = TypeConverter.cdf_to_spark(dm.Text(), is_array=False)
# Returns: StringType()
# Convert PySpark DataType to SQL DDL
sql_ddl = TypeConverter.spark_to_sql_ddl(spark_type)
# Returns: "STRING"
# Connection configuration from TOML
config = CDFConnectionConfig.from_toml("config.toml")
client = config.create_client()
# Convert view external_id to UDTF function name
udtf_name = to_udtf_function_name("MyView")
# Returns: "my_view_udtf"
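The naming convention shown above can be reproduced with a small snake_case helper. This is a plausible re-implementation sketch that matches the documented example ("MyView" -> "my_view_udtf"); the real to_udtf_function_name may handle more edge cases (digits, acronyms).

```python
# Plausible re-implementation sketch of the UDTF naming convention;
# the actual library function may differ in edge-case handling.
import re

def to_udtf_function_name(external_id: str) -> str:
    # CamelCase -> snake_case, then append the "_udtf" suffix
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", external_id).lower()
    return f"{snake}_udtf"

print(to_udtf_function_name("MyView"))  # -> "my_view_udtf"
```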
These utilities are generic and work with any Spark cluster, not just Databricks.
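A config.toml consumed by CDFConnectionConfig.from_toml might look like the following. The section and key names here are assumptions based on a typical CDF OAuth client-credentials setup; check the package's Installation Guide for the exact schema.

```toml
# Hypothetical config.toml layout; key names are assumptions.
[cognite]
project = "my-project"
cdf_cluster = "westeurope-1"
tenant_id = "00000000-0000-0000-0000-000000000000"
client_id = "00000000-0000-0000-0000-000000000000"
client_secret = "<your-client-secret>"
```

Keeping credentials in a file outside version control (or injected from a secret store) avoids hard-coding them in notebooks and job definitions.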
Installation
pip install cognite-pygen-spark
Quick Start
from pathlib import Path
from cognite.client.data_classes.data_modeling.ids import DataModelId
from cognite.pygen import load_cognite_client_from_toml
from cognite.pygen_spark import SparkUDTFGenerator
# Load client from TOML file
client = load_cognite_client_from_toml("config.toml")
# Create generator
generator = SparkUDTFGenerator(
    client=client,
    output_dir=Path("./generated_udtfs"),
    data_model=DataModelId(space="sailboat", external_id="sailboat", version="1"),
    top_level_package="cognite_udtfs",
)

# Generate UDTFs for a Data Model
result = generator.generate_udtfs()
print(f"Generated {result.total_count} UDTF(s)")
for view_id, file_path in result.generated_files.items():
    print(f"  - {view_id}: {file_path}")

# Generate time series UDTFs (template-generated, same as data model UDTFs)
ts_result = generator.generate_time_series_udtfs()
print(f"Generated {ts_result.total_count} time series UDTF(s)")
for udtf_name, file_path in ts_result.generated_files.items():
    print(f"  - {udtf_name}: {file_path}")
See the User Guide for complete documentation on generating, registering, and querying UDTFs.
Architecture
cognite.pygen_spark extends cognite.pygen's architecture:
- Reuses pygen's View parsing: Leverages pygen's internal representation of CDF Data Models
- Custom template engine: Uses Jinja2 templates to generate UDTF Python code and SQL Views
- Extends MultiAPIGenerator: Builds on pygen's code generation infrastructure
- Consistent template-based generation: Both Data Model UDTFs and Time Series UDTFs use the same Jinja2 template-based generation approach for consistent behavior, error handling, and initialization patterns
See the Technical Plan for detailed architecture documentation.
Requirements
- Python 3.9+
- PySpark 3.5+ (required for UDTF support)
- cognite-pygen (PyPI package name; import: cognite.pygen)
- cognite-sdk-python (must be installed on all Spark worker nodes)
- Spark cluster (standalone, YARN, Kubernetes, or local)
Package Structure
pygen-spark/
├── cognite/
│ └── pygen_spark/
│ ├── __init__.py
│ ├── generator.py # SparkUDTFGenerator
│ ├── udtf_generator.py # SparkMultiAPIGenerator
│ └── templates/
│ ├── udtf_function.py.jinja
│ ├── view_sql.py.jinja
│ └── udtf_init.py.jinja
├── pyproject.toml
└── README.md
Development
Setup
git clone <repository-url>
cd pygen-spark
pip install -e ".[dev]"
Running Tests
pytest tests/
Spark Cluster Compatibility
This package generates UDTF code that works with any Spark cluster:
- Code Generation: Works on all Spark versions ✅
- UDTF Templates: Compatible with PySpark 3.5+ ✅
- Dependency Management: Requires cognite-sdk on all Spark worker nodes ⚠️
For standalone Spark clusters, ensure cognite-sdk is installed on all worker nodes. See the Installation Guide for details.
Related Packages
- pygen: Base code generation library for CDF Data Models
- cognite-databricks: Helper SDK for Databricks-specific features (Unity Catalog, Secret Manager)
- cognite-sdk-python: Python SDK for CDF APIs
Documentation
User Guide
- Getting Started: Complete user guide for pygen-spark
- Installation: Installation and setup instructions
- Generation: Generate UDTF code from CDF Data Models
- Registration: Register UDTFs in Spark sessions
- Querying: Query UDTFs using SQL
- Filtering: Filter data with WHERE clauses
- Joining: Join data from different UDTFs
- Troubleshooting: Common issues and solutions
Examples
- Basic Generation: Generate UDTFs from a Data Model
- Registration: Register and query UDTFs
- Querying Data: Various querying patterns
- Filtering Queries: Filter examples
- Joining UDTFs: Join examples
Technical Documentation
License
[License information]
Contributing
[Contributing guidelines]
File details
Details for the file cognite_pygen_spark-0.2.0.tar.gz.
File metadata
- Download URL: cognite_pygen_spark-0.2.0.tar.gz
- Upload date:
- Size: 38.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f781d61b0fa9243639a42ce004f9a7358665a079b2c1b5f104c0fc403d128424 |
| MD5 | b31c470f4bf94b59d1a18596722d1982 |
| BLAKE2b-256 | b4753654fcccad2baa7a46539589b07b5f5847f9092cba7b62eec38f71a51ace |
File details
Details for the file cognite_pygen_spark-0.2.0-py3-none-any.whl.
File metadata
- Download URL: cognite_pygen_spark-0.2.0-py3-none-any.whl
- Upload date:
- Size: 53.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 35d08e1f6b3152ee497fceee97fabcedc4a078d9afe0281be16e37daa5d2edac |
| MD5 | e6221b2c49db1b812778f109be28babd |
| BLAKE2b-256 | 00c2138a7bb04d3aec5b9128f5cf6ac25fe247195e4e4ae924906a0ac2f6a876 |