PySpark schema generator

Your data IS your schema

This tiny library helps you avoid the excessive complexity of hand-written PySpark DataFrame schemas.

How?

Shape your data as NamedTuple classes or dataclasses; the two can be mixed freely:

from dataclasses import dataclass
from tinsel import struct, transform
from typing import NamedTuple, Optional, Dict, List

@struct
@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


@struct
class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]

Transform the root node (User in our case) into a schema:

schema = transform(User)

Create some data, if necessary:

data = [
    User(
        login="Ben",
        age=18,
        active=False,
        info=None
    ),
    User(
        login="Tom",
        age=32,
        active=True,
        info=UserInfo(
            hobby=["pets", "flowers"],
            last_seen=16,
            pet_ages={"Jack": 2, "Sunshine": 6}
        )
    )
]

And… voilà:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.master('local').getOrCreate()

df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)

This will output:

root
 |-- login: string (nullable = false)
 |-- age: integer (nullable = false)
 |-- active: boolean (nullable = false)
 |-- info: struct (nullable = true)
 |    |-- hobby: array (nullable = false)
 |    |    |-- element: string (containsNull = false)
 |    |-- last_seen: integer (nullable = true)
 |    |-- pet_ages: map (nullable = false)
 |    |    |-- key: string
 |    |    |-- value: integer (valueContainsNull = false)


+-----+---+------+----------------------------------------------+
|login|age|active|info                                          |
+-----+---+------+----------------------------------------------+
|Ben  |18 |false |null                                          |
|Tom  |32 |true  |[[pets, flowers],, [Jack -> 2, Sunshine -> 6]]|
+-----+---+------+----------------------------------------------+
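
To make the link between the annotations and the printed schema explicit, here is a minimal standard-library sketch that walks the same two classes and renders a printSchema()-style tree. It is not tinsel's implementation: the helpers unwrap_optional, is_struct, type_name and render are hypothetical names introduced for illustration, the @struct decorator is left out so the block runs without tinsel or PySpark, and only the types used in this README are handled.

```python
from dataclasses import dataclass, is_dataclass
from typing import (Dict, List, NamedTuple, Optional, Union,
                    get_args, get_origin, get_type_hints)

# Type names as printSchema() prints them, for the primitives used above.
PRIMITIVES = {str: "string", int: "integer", bool: "boolean"}


@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]


def unwrap_optional(hint):
    """Return (inner_type, nullable); Optional[X] is Union[X, None]."""
    if get_origin(hint) is Union:
        args = get_args(hint)
        inner = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(inner) == 1:
            return inner[0], True
    return hint, False


def is_struct(t):
    """True for dataclasses and NamedTuple classes."""
    is_namedtuple = (isinstance(t, type) and issubclass(t, tuple)
                     and hasattr(t, "_fields"))
    return is_dataclass(t) or is_namedtuple


def type_name(t):
    if is_struct(t):
        return "struct"
    if get_origin(t) is list:
        return "array"
    if get_origin(t) is dict:
        return "map"
    return PRIMITIVES[t]


def flag(value):
    return str(value).lower()


def render(t, indent=" "):
    """Recursively render the children of t as schema-tree lines."""
    lines = []
    nested = indent + "|    "
    if is_struct(t):
        for name, hint in get_type_hints(t).items():
            inner, nullable = unwrap_optional(hint)
            lines.append(f"{indent}|-- {name}: {type_name(inner)} "
                         f"(nullable = {flag(nullable)})")
            lines += render(inner, nested)
    elif get_origin(t) is list:
        inner, nullable = unwrap_optional(get_args(t)[0])
        lines.append(f"{indent}|-- element: {type_name(inner)} "
                     f"(containsNull = {flag(nullable)})")
        lines += render(inner, nested)
    elif get_origin(t) is dict:
        key, value = get_args(t)
        inner, nullable = unwrap_optional(value)
        lines.append(f"{indent}|-- key: {type_name(key)}")
        lines.append(f"{indent}|-- value: {type_name(inner)} "
                     f"(valueContainsNull = {flag(nullable)})")
        lines += render(inner, nested)
    return lines


tree = ["root"] + render(User)
print("\n".join(tree))
```

Under these assumptions the sketch prints the same tree as df.printSchema() above: every field is non-nullable unless its annotation is wrapped in Optional.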

Features

  • uses native Python types: no extra DSL, no cryptic API, just plain Python;

  • small and fast;

  • provides type shims for types that Python lacks, such as long or short;

  • nullable fields fit naturally into the schema definition.
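
The need for type shims comes from Python having a single int type, while Spark distinguishes integer widths. The sketch below shows one way such shims could be built with typing.NewType. The names long and short, the SPARK_NAMES table and the spark_name helper are assumptions made for illustration; this README does not show tinsel's actual shim names or import path, so check the package itself before relying on them.

```python
from typing import NewType

# Hypothetical shims: distinct annotation markers that are still plain ints
# at runtime.
long = NewType("long", int)
short = NewType("short", int)

# Assumed mapping from annotation to the type name printSchema() would show.
SPARK_NAMES = {int: "integer", long: "long", short: "short"}


def spark_name(hint):
    """Look up the Spark type name for a plain or shimmed annotation."""
    return SPARK_NAMES[hint]


big = long(2 ** 40)  # NewType returns its argument unchanged
print(spark_name(long), spark_name(short), spark_name(int), type(big).__name__)
```

Because a NewType adds no wrapper at runtime, values annotated this way stay ordinary Python integers; only the annotation tells a schema generator which width to pick.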

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.2.0 (2018-08-28)

  • Added dataclasses support

0.1.0 (2018-08-28)

  • First release on PyPI.
