
Zero-dependency PySpark DDL schema parser


Spark DDL Parser

A zero-dependency Python library for parsing PySpark DDL schema strings into structured Python objects.

Features

  • Zero Dependencies: Only uses Python standard library
  • PySpark Compatible: Parses standard PySpark DDL format
  • Type Safe: Returns structured dataclasses
  • Comprehensive: Supports all PySpark data types including nested structs, arrays, and maps
  • Well Tested: 200+ test cases covering edge cases and performance

Installation

pip install spark-ddl-parser

Quick Start

from spark_ddl_parser import parse_ddl_schema

# Parse a simple schema
schema = parse_ddl_schema("id long, name string")

print(schema.fields[0].name)  # 'id'
print(schema.fields[0].data_type.type_name)  # 'long'
print(schema.fields[1].name)  # 'name'
print(schema.fields[1].data_type.type_name)  # 'string'

Supported Types

Simple Types

  • string, int, integer, long, bigint
  • double, float, short, smallint, byte, tinyint
  • boolean, bool, date, timestamp, binary

Complex Types

  • Arrays: array<string>, array<long>
  • Maps: map<string,int>, map<string,array<long>>
  • Structs: struct<name:string,age:int>
  • Decimal: decimal(10,2) (with precision and scale)
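
Complex types can contain commas of their own (for example `map<string,int>` or `decimal(10,2)`), so a schema string cannot be divided into fields with a plain `str.split(",")`. The sketch below shows one common way to handle this: split only on commas at bracket depth zero. `split_top_level` is a hypothetical helper written for illustration. It is not taken from the library's source, and the parser may work differently internally.

```python
from typing import List


def split_top_level(ddl: str) -> List[str]:
    """Split a DDL string on commas that are not nested inside <...> or (...).

    Illustrative only: not the library's implementation.
    """
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in ddl:
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
            if depth < 0:
                raise ValueError("Unbalanced brackets in DDL string")
        if ch == "," and depth == 0:
            # A top-level comma ends the current field definition.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("Unbalanced brackets in DDL string")
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts
```

Each returned piece is a single field definition such as `price decimal(10,2)`, with its inner commas left intact.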

Nested Structures

# Nested structs
schema = parse_ddl_schema("""
    id long,
    address struct<
        street:string,
        city:string,
        zip:string
    >,
    tags array<string>,
    metadata map<string,string>
""")

# Access nested fields
address_field = schema.fields[1]
print(address_field.name)  # 'address'
print(address_field.data_type.type_name)  # 'struct'
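
Nested structs can be explored recursively using only the attributes documented under Type Objects (`fields`, `name`, `data_type`, `type_name`). The helper below, `iter_field_paths`, is a hypothetical name and not part of the library. It is a minimal sketch that assumes a struct-typed field exposes its own `fields` list, as the StructType description suggests.

```python
def iter_field_paths(struct, prefix=""):
    """Yield (dotted_path, type_name) pairs, descending into nested structs.

    Hypothetical helper: relies only on the documented `fields`, `name`,
    `data_type`, and `type_name` attributes.
    """
    for field in struct.fields:
        path = f"{prefix}{field.name}"
        data_type = field.data_type
        yield path, data_type.type_name
        # A nested struct carries its own `fields`; other types do not.
        if getattr(data_type, "fields", None):
            yield from iter_field_paths(data_type, prefix=f"{path}.")
```

Under those assumptions, applying it to the schema above should produce dotted paths such as `address.street` alongside the top-level fields.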

API Reference

parse_ddl_schema(ddl_string: str) -> StructType

Parse a DDL schema string into a structured type.

Parameters:

  • ddl_string (str): DDL schema string (e.g., "id long, name string")

Returns:

  • StructType: Structured type with fields

Raises:

  • ValueError: If DDL string is invalid

Example:

schema = parse_ddl_schema("id long, name string")

Type Objects

StructType

Represents a struct containing fields.

Attributes:

  • type_name (str): Always "struct"
  • fields (List[StructField]): List of struct fields

StructField

Represents a field in a struct.

Attributes:

  • name (str): Field name
  • data_type (DataType): Field data type
  • nullable (bool): Whether field is nullable (default: True)

SimpleType

Represents a simple data type.

Attributes:

  • type_name (str): Type name (e.g., "string", "long", "int")

ArrayType

Represents an array type.

Attributes:

  • type_name (str): Always "array"
  • element_type (DataType): Type of array elements

MapType

Represents a map type.

Attributes:

  • type_name (str): Always "map"
  • key_type (DataType): Type of map keys
  • value_type (DataType): Type of map values

DecimalType

Represents a decimal type.

Attributes:

  • type_name (str): Always "decimal"
  • precision (int): Decimal precision (default: 10)
  • scale (int): Decimal scale (default: 0)
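
The attribute lists above map naturally onto dataclasses. The stand-in definitions below are an illustration only, not the library's source: they mirror the documented attribute names and defaults, and `to_ddl` is a hypothetical helper that renders a type object back into a DDL type string.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleType:
    type_name: str


@dataclass
class ArrayType:
    element_type: object
    type_name: str = "array"


@dataclass
class MapType:
    key_type: object
    value_type: object
    type_name: str = "map"


@dataclass
class DecimalType:
    precision: int = 10  # documented default
    scale: int = 0       # documented default
    type_name: str = "decimal"


@dataclass
class StructField:
    name: str
    data_type: object
    nullable: bool = True  # documented default


@dataclass
class StructType:
    fields: List[StructField] = field(default_factory=list)
    type_name: str = "struct"


def to_ddl(data_type) -> str:
    """Render a stand-in type object as a DDL type string (hypothetical helper)."""
    if isinstance(data_type, ArrayType):
        return f"array<{to_ddl(data_type.element_type)}>"
    if isinstance(data_type, MapType):
        return f"map<{to_ddl(data_type.key_type)},{to_ddl(data_type.value_type)}>"
    if isinstance(data_type, DecimalType):
        return f"decimal({data_type.precision},{data_type.scale})"
    if isinstance(data_type, StructType):
        inner = ",".join(f"{f.name}:{to_ddl(f.data_type)}" for f in data_type.fields)
        return f"struct<{inner}>"
    return data_type.type_name
```

Because every type carries a `type_name`, code that only needs to branch on the kind of type can compare that string instead of using `isinstance`.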

Examples

Basic Schema

from spark_ddl_parser import parse_ddl_schema

schema = parse_ddl_schema("id long, name string, age int")
print(len(schema.fields))  # 3

Arrays and Maps

schema = parse_ddl_schema("""
    tags array<string>,
    scores array<long>,
    metadata map<string,string>,
    counts map<string,int>
""")

Nested Structs

schema = parse_ddl_schema("""
    user struct<
        id:long,
        name:string,
        address:struct<
            street:string,
            city:string
        >
    >
""")

Decimal Types

schema = parse_ddl_schema("price decimal(10,2), rate decimal(5,4)")
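
To make the precision and scale defaults listed under DecimalType concrete, here is a small standalone sketch that extracts both numbers from a decimal type string. `parse_decimal_spec` and its regular expression are assumptions made for illustration. They do not describe how the library parses decimals, and whether the library accepts a bare `decimal` or a precision-only `decimal(8)` is not stated in this README.

```python
import re
from typing import Tuple

# Matches "decimal", "decimal(p)", or "decimal(p,s)" with optional whitespace.
_DECIMAL_RE = re.compile(
    r"^decimal\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?$",
    re.IGNORECASE,
)


def parse_decimal_spec(type_str: str) -> Tuple[int, int]:
    """Return (precision, scale), falling back to the documented defaults 10 and 0."""
    match = _DECIMAL_RE.match(type_str.strip())
    if match is None:
        raise ValueError(f"Not a decimal type: {type_str}")
    precision = int(match.group(1)) if match.group(1) else 10
    scale = int(match.group(2)) if match.group(2) else 0
    return precision, scale
```

For the schema above, `price` would correspond to precision 10 and scale 2, and `rate` to precision 5 and scale 4.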

Format Support

The parser accepts either a space or a colon between a field name and its type:

# Space separator
schema1 = parse_ddl_schema("id long, name string")

# Colon separator
schema2 = parse_ddl_schema("id:long, name:string")
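
One way to accept both separators is to split each field definition at the first space or colon, whichever comes first, so that colons inside a nested type such as `struct<street:string>` are left alone. The helper below is a hypothetical sketch of that idea, not the library's implementation.

```python
from typing import Tuple


def split_field(definition: str) -> Tuple[str, str]:
    """Split 'name type' or 'name:type' into (name, type_string).

    Illustrative only: splits at the first colon or whitespace character.
    """
    text = definition.strip()
    for i, ch in enumerate(text):
        if ch == ":" or ch.isspace():
            name, type_str = text[:i], text[i + 1:].strip()
            if name and type_str:
                return name, type_str
            break
    # No separator found, or one side of it is empty.
    raise ValueError(f"Invalid field definition: {text}")
```

With this approach, `"id long"` and `"id:long"` both split into the name `id` and the type string `long`.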

Error Handling

The parser raises ValueError with a descriptive message when the DDL is invalid:

try:
    schema = parse_ddl_schema("id long, name")  # Missing type
except ValueError as e:
    print(e)  # "Invalid field definition: name"
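
When validating many schema strings at once, it can be convenient to collect the error messages instead of stopping at the first failure. The function below is a hypothetical wrapper, not part of the library. It takes the parser as a parameter, so `parse_ddl_schema` can be passed in as `parse`, and it relies only on the documented behaviour of raising ValueError for invalid input.

```python
from typing import Callable, Dict


def collect_schema_errors(
    ddl_by_name: Dict[str, str],
    parse: Callable[[str], object],
) -> Dict[str, str]:
    """Return {name: error message} for every DDL string that fails to parse."""
    errors: Dict[str, str] = {}
    for name, ddl in ddl_by_name.items():
        try:
            parse(ddl)
        except ValueError as exc:
            # Keep the parser's own message so it can be reported later.
            errors[name] = str(exc)
    return errors
```

An empty result means every schema parsed successfully.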

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=spark_ddl_parser

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Related Projects

  • mock-spark - Uses this parser for DDL schema support
