Zero-dependency PySpark DDL schema parser
Spark DDL Parser
A zero-dependency Python library for parsing PySpark DDL schema strings into structured Python objects.
Features
- Zero Dependencies: Only uses Python standard library
- PySpark Compatible: Parses standard PySpark DDL format
- Type Safe: Returns structured dataclasses
- Comprehensive: Supports all PySpark data types including nested structs, arrays, and maps
- Well Tested: 200+ test cases covering edge cases and performance
Installation
pip install spark-ddl-parser
Quick Start
from spark_ddl_parser import parse_ddl_schema
# Parse a simple schema
schema = parse_ddl_schema("id long, name string")
print(schema.fields[0].name) # 'id'
print(schema.fields[0].data_type.type_name) # 'long'
print(schema.fields[1].name) # 'name'
print(schema.fields[1].data_type.type_name) # 'string'
Supported Types
Simple Types
- string, int, integer, long, bigint
- double, float, short, smallint, byte, tinyint
- boolean, bool, date, timestamp, binary
Complex Types
- Arrays: array<string>, array<long>
- Maps: map<string,int>, map<string,array<long>>
- Structs: struct<name:string,age:int>
- Decimal: decimal(10,2) (with precision and scale)
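Because complex types contain commas of their own, a schema string cannot be split into fields with a plain `str.split(",")`. The following is a minimal sketch of bracket-aware splitting, not the library's actual implementation; the helper name `split_top_level` is hypothetical.

```python
from typing import List


def split_top_level(text: str, sep: str = ",") -> List[str]:
    """Split on `sep` only where it is not nested inside <...> or (...)."""
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in text:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
            if depth < 0:
                raise ValueError(f"Unbalanced brackets in: {text!r}")
        if ch == sep and depth == 0:
            # A separator outside all brackets ends the current field.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError(f"Unbalanced brackets in: {text!r}")
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


fields = split_top_level("id long, m map<string,array<long>>, price decimal(10,2)")
print(fields)
# ['id long', 'm map<string,array<long>>', 'price decimal(10,2)']
```

The commas inside `map<...>` and `decimal(...)` are left alone because the depth counter is above zero when they are seen.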
Nested Structures
# Nested structs
schema = parse_ddl_schema("""
    id long,
    address struct<
        street:string,
        city:string,
        zip:string
    >,
    tags array<string>,
    metadata map<string,string>
""")
# Access nested fields
address_field = schema.fields[1]
print(address_field.name) # 'address'
print(address_field.data_type.type_name) # 'struct'
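To list every field in a nested schema, the result can be walked recursively. The sketch below relies only on the attributes documented in the API reference (`fields`, `name`, `data_type`, `type_name`). So that it runs without the package installed, it uses `SimpleNamespace` stand-ins shaped like a parsed schema; the helper `iter_field_paths` is hypothetical and not part of the library.

```python
from types import SimpleNamespace as NS
from typing import Iterator, Tuple


def iter_field_paths(struct, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (dotted_path, type_name) for every field, descending into structs."""
    for field in struct.fields:
        path = f"{prefix}{field.name}"
        yield path, field.data_type.type_name
        if field.data_type.type_name == "struct":
            yield from iter_field_paths(field.data_type, prefix=path + ".")


# Stand-in objects shaped like the documented result for
# "id long, address struct<street:string,city:string>".
schema = NS(type_name="struct", fields=[
    NS(name="id", data_type=NS(type_name="long")),
    NS(name="address", data_type=NS(type_name="struct", fields=[
        NS(name="street", data_type=NS(type_name="string")),
        NS(name="city", data_type=NS(type_name="string")),
    ])),
])

paths = list(iter_field_paths(schema))
for path, type_name in paths:
    print(path, type_name)
```

With a real result from `parse_ddl_schema`, the same function should apply unchanged, assuming struct-typed fields expose `fields` as described for StructType.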
API Reference
parse_ddl_schema(ddl_string: str) -> StructType
Parse a DDL schema string into a structured type.
Parameters:
- ddl_string (str): DDL schema string (e.g., "id long, name string")
Returns:
- StructType: Structured type with fields
Raises:
- ValueError: If the DDL string is invalid
Example:
schema = parse_ddl_schema("id long, name string")
Type Objects
StructType
Represents a struct containing fields.
Attributes:
- type_name (str): Always "struct"
- fields (List[StructField]): List of struct fields
StructField
Represents a field in a struct.
Attributes:
- name (str): Field name
- data_type (DataType): Field data type
- nullable (bool): Whether the field is nullable (default: True)
SimpleType
Represents a simple data type.
Attributes:
- type_name (str): Type name (e.g., "string", "long", "int")
ArrayType
Represents an array type.
Attributes:
- type_name (str): Always "array"
- element_type (DataType): Type of array elements
MapType
Represents a map type.
Attributes:
- type_name (str): Always "map"
- key_type (DataType): Type of map keys
- value_type (DataType): Type of map values
DecimalType
Represents a decimal type.
Attributes:
- type_name (str): Always "decimal"
- precision (int): Decimal precision (default: 10)
- scale (int): Decimal scale (default: 0)
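The attribute lists above can be mirrored with plain dataclasses. The definitions below are illustrative stand-ins, not the library's source: the attribute names and defaults come from this reference, but the field order, constructor signatures, and the use of `object` for type annotations are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleType:
    type_name: str


@dataclass
class ArrayType:
    element_type: object
    type_name: str = "array"


@dataclass
class MapType:
    key_type: object
    value_type: object
    type_name: str = "map"


@dataclass
class DecimalType:
    precision: int = 10  # documented default
    scale: int = 0       # documented default
    type_name: str = "decimal"


@dataclass
class StructField:
    name: str
    data_type: object
    nullable: bool = True  # documented default


@dataclass
class StructType:
    fields: List[StructField] = field(default_factory=list)
    type_name: str = "struct"


# Hand-built equivalent of "price decimal(10,2), tags array<string>".
schema = StructType(fields=[
    StructField("price", DecimalType(precision=10, scale=2)),
    StructField("tags", ArrayType(element_type=SimpleType("string"))),
])
print(schema.fields[0].data_type.type_name)  # decimal
print(schema.fields[1].data_type.element_type.type_name)  # string
```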
Examples
Basic Schema
from spark_ddl_parser import parse_ddl_schema
schema = parse_ddl_schema("id long, name string, age int")
print(len(schema.fields)) # 3
Arrays and Maps
schema = parse_ddl_schema("""
    tags array<string>,
    scores array<long>,
    metadata map<string,string>,
    counts map<string,int>
""")
Nested Structs
schema = parse_ddl_schema("""
    user struct<
        id:long,
        name:string,
        address:struct<
            street:string,
            city:string
        >
    >
""")
Decimal Types
schema = parse_ddl_schema("price decimal(10,2), rate decimal(5,4)")
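As a rough illustration of how a decimal type string maps onto the `precision` and `scale` attributes of DecimalType, here is a regex-based sketch. It is not the library's code. The helper `parse_decimal` is hypothetical, and accepting a bare `decimal` or a single-argument `decimal(8)` is an assumption based on the documented defaults (10 and 0).

```python
import re
from typing import Tuple

# Matches "decimal", "decimal(p)" or "decimal(p,s)" with optional spaces.
_DECIMAL_RE = re.compile(r"decimal\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?")


def parse_decimal(type_text: str) -> Tuple[int, int]:
    """Return (precision, scale), falling back to the documented defaults."""
    match = _DECIMAL_RE.fullmatch(type_text.strip())
    if match is None:
        raise ValueError(f"Not a decimal type: {type_text!r}")
    precision = int(match.group(1)) if match.group(1) else 10
    scale = int(match.group(2)) if match.group(2) else 0
    return precision, scale


print(parse_decimal("decimal(10,2)"))  # (10, 2)
print(parse_decimal("decimal(5,4)"))   # (5, 4)
print(parse_decimal("decimal"))        # (10, 0)
```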
Format Support
The parser accepts either a space or a colon between a field name and its type:
# Space separator
schema1 = parse_ddl_schema("id long, name string")
# Colon separator
schema2 = parse_ddl_schema("id:long, name:string")
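One way to accept both separators is to split each field definition at the first whitespace or colon that follows the name. The sketch below shows the idea with a regular expression. It is an assumption about how such a split could work, not the library's implementation, and `split_name_and_type` is a hypothetical helper.

```python
import re
from typing import Tuple

# name, then a colon or whitespace separator, then the rest as the type.
# DOTALL lets the type span several lines (e.g. a multi-line struct<...>).
_FIELD_RE = re.compile(r"\s*([^\s:]+)\s*[:\s]\s*(.+?)\s*", re.DOTALL)


def split_name_and_type(field_def: str) -> Tuple[str, str]:
    """Split 'name type' or 'name:type' into (name, type_text)."""
    match = _FIELD_RE.fullmatch(field_def)
    if match is None:
        raise ValueError(f"Invalid field definition: {field_def.strip()}")
    return match.group(1), match.group(2)


print(split_name_and_type("id long"))   # ('id', 'long')
print(split_name_and_type("id:long"))   # ('id', 'long')
print(split_name_and_type("address struct<street:string,city:string>"))
# ('address', 'struct<street:string,city:string>')
```

Only the first separator is consumed, so colons inside a nested `struct<...>` stay part of the type text.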
Error Handling
For invalid DDL, the parser raises a ValueError whose message identifies the offending field definition:
try:
    schema = parse_ddl_schema("id long, name")  # Missing type
except ValueError as e:
    print(e)  # "Invalid field definition: name"
Development
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=spark_ddl_parser
License
MIT License - see LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Related Projects
- mock-spark - Uses this parser for DDL schema support