PySpark schema generator
Your data IS your schema
This tiny library cuts down the boilerplate of hand-written PySpark DataFrame schemas.
How?
Shape your data as NamedTuple classes or dataclasses; the two can be mixed freely:

    from dataclasses import dataclass
    from tinsel import struct, transform
    from typing import NamedTuple, Optional, Dict, List

    @struct
    @dataclass
    class UserInfo:
        hobby: List[str]
        last_seen: Optional[int]
        pet_ages: Dict[str, int]

    @struct
    class User(NamedTuple):
        login: str
        age: int
        active: bool
        info: Optional[UserInfo]
Transform the root node (User in our case) into a schema:
schema = transform(User)
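To make the idea of deriving a schema from type hints concrete, here is a minimal sketch that uses only the standard library. It is not tinsel's implementation: the names `PRIMITIVES`, `unwrap_optional` and `describe` are hypothetical, and the result is a plain dict rather than a PySpark schema. It only illustrates how `Optional[...]` can be read as "nullable" and how nested classes can be walked recursively.

```python
from dataclasses import dataclass
from typing import (Dict, List, NamedTuple, Optional, Union,
                    get_args, get_origin, get_type_hints)

# Hypothetical mapping from Python primitives to Spark type names.
PRIMITIVES = {str: "string", int: "integer", bool: "boolean", float: "double"}


def unwrap_optional(hint):
    """Return (inner_type, nullable); Optional[X] yields (X, True)."""
    if get_origin(hint) is Union:
        args = get_args(hint)
        non_none = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(non_none) == 1:
            return non_none[0], True
    return hint, False


def describe(hint):
    """Recursively describe a type hint as a plain dict (illustration only)."""
    inner, nullable = unwrap_optional(hint)
    origin = get_origin(inner)
    if inner in PRIMITIVES:
        node = {"type": PRIMITIVES[inner]}
    elif origin is list:
        (element,) = get_args(inner)
        node = {"type": "array", "element": describe(element)}
    elif origin is dict:
        key, value = get_args(inner)
        node = {"type": "map", "key": describe(key), "value": describe(value)}
    else:
        # Assumed to be a NamedTuple or dataclass: walk its annotated fields.
        fields = get_type_hints(inner)
        node = {"type": "struct",
                "fields": {name: describe(h) for name, h in fields.items()}}
    node["nullable"] = nullable
    return node


@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]


class User(NamedTuple):
    login: str
    info: Optional[UserInfo]


tree = describe(User)
```

Here `tree["fields"]["info"]["nullable"]` is `True` because the field is declared as `Optional[UserInfo]`, while `login` comes out non-nullable.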
Create some data, if necessary:
    data = [
        User(
            login="Ben",
            age=18,
            active=False,
            info=None
        ),
        User(
            login="Tom",
            age=32,
            active=True,
            info=UserInfo(
                hobby=["pets", "flowers"],
                last_seen=16,
                pet_ages={"Jack": 2, "Sunshine": 6}
            )
        )
    ]
And… voilà!:
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master('local').getOrCreate()
    df = sc.createDataFrame(data=data, schema=schema)
    df.printSchema()
    df.show(truncate=False)
This will output:
    root
     |-- login: string (nullable = false)
     |-- age: integer (nullable = false)
     |-- active: boolean (nullable = false)
     |-- info: struct (nullable = true)
     |    |-- hobby: array (nullable = false)
     |    |    |-- element: string (containsNull = false)
     |    |-- last_seen: integer (nullable = true)
     |    |-- pet_ages: map (nullable = false)
     |    |    |-- key: string
     |    |    |-- value: integer (valueContainsNull = false)

    +-----+---+------+----------------------------------------------+
    |login|age|active|info                                          |
    +-----+---+------+----------------------------------------------+
    |Ben  |18 |false |null                                          |
    |Tom  |32 |true  |[[pets, flowers],, [Jack -> 2, Sunshine -> 6]]|
    +-----+---+------+----------------------------------------------+
Features
uses native Python types: no extra DSL, no cryptic API, just plain Python;
small and fast;
provides type shims for types that Python lacks, such as long or short;
nullable fields fit naturally into the schema definition.
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.2.0 (2018-08-28)
Added dataclasses support
0.1.0 (2018-08-28)
First release on PyPI.