PySpark schema generator

Your data IS your schema

This tiny library reduces the complexity of hand-written PySpark DataFrame schemas.

How?

Shape your data as NamedTuple classes or dataclasses; the two can be mixed freely:

from dataclasses import dataclass
from tinsel import struct, transform
from typing import NamedTuple, Optional, Dict, List

@struct
@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


@struct
class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]

Transform the root node (User in our case) into a schema:

schema = transform(User)
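
To give a feel for what such a transformation has to do, here is a minimal standard-library sketch. It is not tinsel's actual implementation. It walks the type hints of the classes above (declared here without the @struct decorator, so the snippet runs without tinsel installed) and records whether each field is wrapped in Optional, which is what shows up as nullable = true in the printed schema. The helper names unwrap_optional and describe are hypothetical.

```python
from dataclasses import dataclass
from typing import (Dict, List, NamedTuple, Optional, Union,
                    get_args, get_origin, get_type_hints)


@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]


def unwrap_optional(hint):
    """Return (inner_type, nullable); Optional[X] is Union[X, None]."""
    if get_origin(hint) is Union:
        args = get_args(hint)
        non_none = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(non_none) == 1:
            return non_none[0], True
    return hint, False


def describe(cls):
    """Map each annotated field of cls to a (type name, nullable) pair."""
    result = {}
    for name, hint in get_type_hints(cls).items():
        inner, nullable = unwrap_optional(hint)
        # List[str] has origin list, Dict[str, int] has origin dict;
        # plain classes such as str or UserInfo have no origin.
        kind = (get_origin(inner) or inner).__name__
        result[name] = (kind, nullable)
    return result


user_fields = describe(User)
info_fields = describe(UserInfo)
```

Only info and last_seen come out as nullable here, which matches the fields declared as Optional. A real generator would go on to map each Python type to the corresponding Spark type and recurse into nested classes and containers.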

Create some data, if necessary:

data = [
    User(
        login="Ben",
        age=18,
        active=False,
        info=None
    ),
    User(
        login="Tom",
        age=32,
        active=True,
        info=UserInfo(
            hobby=["pets", "flowers"],
            last_seen=None,
            pet_ages={"Jack": 2, "Sunshine": 6}
        )
    )
]

And… voilà!:

from pyspark.sql import SparkSession

sc = SparkSession.builder.master('local').getOrCreate()

df = sc.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

This will output:

root
 |-- login: string (nullable = false)
 |-- age: integer (nullable = false)
 |-- active: boolean (nullable = false)
 |-- info: struct (nullable = true)
 |    |-- hobby: array (nullable = false)
 |    |    |-- element: string (containsNull = false)
 |    |-- last_seen: integer (nullable = true)
 |    |-- pet_ages: map (nullable = false)
 |    |    |-- key: string
 |    |    |-- value: integer (valueContainsNull = false)


+-----+---+------+----------------------------------------------+
|login|age|active|info                                          |
+-----+---+------+----------------------------------------------+
|Ben  |18 |false |null                                          |
|Tom  |32 |true  |[[pets, flowers],, [Jack -> 2, Sunshine -> 6]]|
+-----+---+------+----------------------------------------------+

Features

  • uses native Python types: no extra DSL, no cryptic API, just plain Python;
  • small and fast;
  • provides type shims for types absent in Python, such as long or short;
  • nullable fields fit naturally into the schema definition.
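
As an illustration of the type-shim idea, the sketch below shows one way such shims could be built with typing.NewType. This is an assumption made for illustration, not necessarily how tinsel defines them: the names long and short are declared locally here, and the SPARK_NAMES lookup table is hypothetical. The Spark SQL names bigint, smallint and int are the ones Spark uses for its long, short and integer types.

```python
from typing import NamedTuple, NewType, get_type_hints

# Hypothetical shims: distinct names in annotations, plain int at runtime.
long = NewType("long", int)
short = NewType("short", int)

# Hypothetical lookup from annotation to Spark SQL type name.
SPARK_NAMES = {int: "int", long: "bigint", short: "smallint"}


class Event(NamedTuple):
    id: long
    retries: short
    code: int


event = Event(id=long(2 ** 40), retries=short(3), code=7)

# The annotations keep the shim, so a generator can pick a wider or
# narrower column type even though the values are ordinary ints.
columns = {name: SPARK_NAMES[hint]
           for name, hint in get_type_hints(Event).items()}
```

The point of a shim is that the annotation carries the extra information while the data stays plain Python, so no conversion step is needed when building rows.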

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.2.0 (2018-08-28)

  • Added dataclasses support

0.1.0 (2018-08-28)

  • First release on PyPI.
