PySpark schema generator

Your data IS your schema

This tiny library removes much of the boilerplate from hand-written PySpark DataFrame schemas.

How?

Describe your data with NamedTuple classes or dataclasses; the two can be mixed freely:

from dataclasses import dataclass
from tinsel import struct, transform
from typing import NamedTuple, Optional, Dict, List

@struct
@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


@struct
class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]

Transform the root node (User in our case) into a schema:

schema = transform(User)

Create some data, if necessary:

data = [
    User(
        login="Ben",
        age=18,
        active=False,
        info=None
    ),
    User(
        login="Tom",
        age=32,
        active=True,
        info=UserInfo(
            hobby=["pets", "flowers"],
            last_seen=16,
            pet_ages={"Jack": 2, "Sunshine": 6}
        )
    )
]

And voilà:

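from pyspark.sql import SparkSession

# A local session is enough for trying the example out
spark = SparkSession.builder.master('local').getOrCreate()

# Pass the generated schema explicitly instead of letting Spark infer it
df = spark.createDataFrame(data=data, schema=schema)
df.printSchema()
df.show(truncate=False)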

This will output:

root
 |-- login: string (nullable = false)
 |-- age: integer (nullable = false)
 |-- active: boolean (nullable = false)
 |-- info: struct (nullable = true)
 |    |-- hobby: array (nullable = false)
 |    |    |-- element: string (containsNull = false)
 |    |-- last_seen: integer (nullable = true)
 |    |-- pet_ages: map (nullable = false)
 |    |    |-- key: string
 |    |    |-- value: integer (valueContainsNull = false)


+-----+---+------+----------------------------------------------+
|login|age|active|info                                          |
+-----+---+------+----------------------------------------------+
|Ben  |18 |false |null                                          |
|Tom  |32 |true  |[[pets, flowers],, [Jack -> 2, Sunshine -> 6]]|
+-----+---+------+----------------------------------------------+
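The nullable flags in the printed schema follow the Optional annotations on the classes. The block below is a minimal, standard-library-only sketch of how such a mapping can be derived from type hints. It is not tinsel's implementation: the helper names (unwrap_optional, spark_type, fields) and the PRIMITIVES table are hypothetical, and the classes are declared without @struct so the sketch runs without tinsel or PySpark installed.

```python
from dataclasses import dataclass
from typing import Dict, List, NamedTuple, Optional, Union, get_args, get_origin

# Illustrative mapping to Spark's short type names; not tinsel's own table.
PRIMITIVES = {str: "string", int: "int", bool: "boolean", float: "double"}


def unwrap_optional(hint):
    """Split Optional[X] (i.e. Union[X, None]) into (X, True); else (hint, False)."""
    if get_origin(hint) is Union:
        args = get_args(hint)
        rest = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(rest) == 1:
            return rest[0], True
    return hint, False


def spark_type(tp):
    """Render a type hint as a Spark-style type string."""
    if tp in PRIMITIVES:
        return PRIMITIVES[tp]
    if get_origin(tp) is list:
        (item,) = get_args(tp)
        return f"array<{spark_type(item)}>"
    if get_origin(tp) is dict:
        key, value = get_args(tp)
        return f"map<{spark_type(key)},{spark_type(value)}>"
    if isinstance(tp, type) and getattr(tp, "__annotations__", None):
        # A NamedTuple class or dataclass becomes a nested struct.
        inner = ",".join(f"{name}:{kind}" for name, kind, _ in fields(tp))
        return f"struct<{inner}>"
    raise TypeError(f"unsupported type hint: {tp!r}")


def fields(cls):
    """Return (name, type string, nullable) for each annotated field of cls."""
    result = []
    for name, hint in cls.__annotations__.items():
        inner, nullable = unwrap_optional(hint)
        result.append((name, spark_type(inner), nullable))
    return result


@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]


for name, kind, nullable in fields(User):
    print(f"{name}: {kind} (nullable = {str(nullable).lower()})")
```

Only fields annotated with Optional come out as nullable, which mirrors the printed schema: info and last_seen are nullable, everything else is not. The sketch deliberately rejects hints it does not understand, such as Optional element types inside a list.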

Features

  • uses native Python types: no extra DSL, no cryptic API, just plain Python;

  • is small and fast;

  • provides type shims for Spark types that have no Python counterpart, such as long or short;

  • nullable fields fit naturally into the schema definition.
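To illustrate the idea of a type shim, here is a hedged sketch built on typing.NewType. It assumes nothing about how tinsel defines or exports its own shims: the names long and short below are local, hypothetical definitions, and SPARK_NAMES is an illustrative lookup table.

```python
from typing import NamedTuple, NewType

# Hypothetical shims: distinct names for annotations, plain ints at runtime.
long = NewType("long", int)
short = NewType("short", int)

# Illustrative lookup from annotation to Spark's short type name.
SPARK_NAMES = {int: "int", long: "bigint", short: "smallint"}


class Event(NamedTuple):
    id: long
    retries: short
    count: int


# A schema generator can tell the shims apart by looking at the annotations.
column_types = {name: SPARK_NAMES[hint] for name, hint in Event.__annotations__.items()}
print(column_types)

# Calling a shim returns its argument unchanged, so values stay ordinary ints.
event = Event(id=long(2 ** 40), retries=short(3), count=7)
print(type(event.id).__name__)
```

The point of the design is that Python has a single int type, so the distinction between a 16-bit, 32-bit, or 64-bit column has to live in the annotation rather than in the value.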

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

History

0.2.0 (2018-08-28)

  • Added dataclasses support

0.1.0 (2018-08-28)

  • First release on PyPI.
