PySpark schema generator
Your data IS your schema
This tiny library helps you avoid the excessive complexity of hand-written PySpark DataFrame schemas.
How?
Define your data as NamedTuple classes or dataclasses; the two can be mixed freely:
    from dataclasses import dataclass
    from tinsel import struct, transform
    from typing import NamedTuple, Optional, Dict, List


    @struct
    @dataclass
    class UserInfo:
        hobby: List[str]
        last_seen: Optional[int]
        pet_ages: Dict[str, int]


    @struct
    class User(NamedTuple):
        login: str
        age: int
        active: bool
        info: Optional[UserInfo]
Transform the root node (User in our case) into a schema:
schema = transform(User)
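To give a feel for what a transform of this kind has to do, here is a minimal sketch that uses only the standard library. It is not tinsel's implementation: the names `SCALARS`, `unwrap_optional`, `type_name` and `fields_of` are hypothetical, and the result is a plain list of `(name, type, nullable)` tuples rather than a PySpark `StructType`. It assumes Python 3.8 or newer for `typing.get_origin` and `typing.get_args`.

```python
from dataclasses import dataclass
from typing import (Dict, List, NamedTuple, Optional, Union,
                    get_args, get_origin, get_type_hints)

# Hypothetical mapping from Python scalar types to Spark SQL type names.
SCALARS = {str: "string", int: "int", bool: "boolean", float: "double"}


def unwrap_optional(tp):
    """Return (inner_type, nullable); Optional[X] is Union[X, None]."""
    if get_origin(tp) is Union:
        args = get_args(tp)
        rest = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(rest) == 1:
            return rest[0], True
    return tp, False


def type_name(tp):
    """Describe a type hint as a Spark-like type string."""
    tp, _ = unwrap_optional(tp)
    origin, args = get_origin(tp), get_args(tp)
    if origin is list:
        return "array<%s>" % type_name(args[0])
    if origin is dict:
        return "map<%s,%s>" % (type_name(args[0]), type_name(args[1]))
    if tp in SCALARS:
        return SCALARS[tp]
    # Anything else is assumed to be a record (NamedTuple or dataclass).
    inner = ",".join("%s:%s" % (n, t) for n, t, _ in fields_of(tp))
    return "struct<%s>" % inner


def fields_of(cls):
    """List (name, type, nullable) for each annotated field of cls."""
    result = []
    for name, hint in get_type_hints(cls).items():
        inner, nullable = unwrap_optional(hint)
        result.append((name, type_name(inner), nullable))
    return result


@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]


for field in fields_of(User):
    print(field)
```

The point of the sketch is that the type hints already carry everything a schema needs: field names, field order, nesting, and nullability via `Optional`.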
Create some data, if necessary:
    data = [
        User(
            login="Ben",
            age=18,
            active=False,
            info=None
        ),
        User(
            login="Tom",
            age=32,
            active=True,
            info=UserInfo(
                hobby=["pets", "flowers"],
                last_seen=16,
                pet_ages={"Jack": 2, "Sunshine": 6}
            )
        )
    ]
And… voilà!:
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master('local').getOrCreate()
    df = sc.createDataFrame(data=data, schema=schema)
    df.printSchema()
    df.show(truncate=False)
This will output:
    root
     |-- login: string (nullable = false)
     |-- age: integer (nullable = false)
     |-- active: boolean (nullable = false)
     |-- info: struct (nullable = true)
     |    |-- hobby: array (nullable = false)
     |    |    |-- element: string (containsNull = false)
     |    |-- last_seen: integer (nullable = true)
     |    |-- pet_ages: map (nullable = false)
     |    |    |-- key: string
     |    |    |-- value: integer (valueContainsNull = false)

    +-----+---+------+----------------------------------------------+
    |login|age|active|info                                          |
    +-----+---+------+----------------------------------------------+
    |Ben  |18 |false |null                                          |
    |Tom  |32 |true  |[[pets, flowers],, [Jack -> 2, Sunshine -> 6]]|
    +-----+---+------+----------------------------------------------+
Features
- uses native Python types: no extra DSL, no cryptic API, just plain Python;
- small and fast;
- provides type shims for some types absent in Python, such as long or short;
- nullable fields fit naturally into the schema definition.
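As an illustration of the type-shim idea, the sketch below uses `typing.NewType` to create distinct names for integer widths that Python itself does not distinguish. This is an assumption about how such shims could work, not tinsel's actual code: the shim definitions, the `SPARK_NAMES` table, the `Event` class and the `spark_types` helper are all hypothetical. The Spark type names (`bigint`, `smallint`, `int`) are the ones Spark SQL uses for 64-bit, 16-bit and 32-bit integers.

```python
from typing import NamedTuple, NewType, get_type_hints

# Hypothetical shims: distinct type names that still behave like int at runtime.
long = NewType("long", int)
short = NewType("short", int)

# Hypothetical lookup from a hint to a Spark SQL type name.
SPARK_NAMES = {str: "string", int: "int", long: "bigint", short: "smallint"}


class Event(NamedTuple):
    id: long
    retries: short
    source: str


def spark_types(cls):
    """Map each annotated field of cls to a Spark SQL type name."""
    return {name: SPARK_NAMES[hint] for name, hint in get_type_hints(cls).items()}


print(spark_types(Event))
```

Because a `NewType` is only a label, values stay ordinary integers at runtime; the shim only matters when the annotations are read to build the schema.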
Credits
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
History
0.2.0 (2018-08-28)
Added dataclasses support
0.1.0 (2018-08-28)
First release on PyPI.