dplyr for pyspark

Project description

tidypyspark

Make pyspark sing dplyr

Inspired by sparklyr and tidyverse

The tidypyspark Python package provides a minimal, pythonic wrapper around the pyspark sql dataframe API, in tidyverse flavor.

  • Through the ts accessor, apply tidypyspark methods whose input and output are, in most cases, pyspark dataframes.
  • Consistent 'verbs' (select, arrange, distinct, ...) across methods.

Also see tidypandas: A grammar of data manipulation for pandas inspired by tidyverse

Usage

# assumes a pyspark session is already active and bound to the name `spark`
from tidypyspark import ts 
import pyspark.sql.functions as F
from tidypyspark.datasets import get_penguins_path

pen = spark.read.csv(get_penguins_path(), header = True, inferSchema = True)

(pen.ts.add_row_number(order_by = 'bill_depth_mm')
    .ts.mutate({'cumsum_bl': F.sum('bill_length_mm')},
               by = 'species',
               order_by = ['bill_depth_mm', 'row_number'],
               range_between = (-float('inf'), 0)
               )
    .ts.select(['species', 'bill_length_mm', 'cumsum_bl'])
    ).show(5)
    
+--------------+-------+-------------+------------------+
|bill_length_mm|species|bill_depth_mm|         cumsum_bl|
+--------------+-------+-------------+------------------+
|          32.1| Adelie|         15.5|              32.1|
|          35.2| Adelie|         15.9| 67.30000000000001|
|          37.7| Adelie|           16|105.00000000000001|
|          36.2| Adelie|         16.1|141.20000000000002|
|          33.1| Adelie|         16.1|             174.3|
+--------------+-------+-------------+------------------+
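To make the window logic above concrete, here is a minimal pure-Python sketch of what a per-group cumulative sum computes. It is not tidypyspark or pyspark code. The helper name cumulative_sum_by_group is hypothetical, the rows are the five Adelie rows shown in the output above, and the row_number values are illustrative placeholders (only their relative order matters). It assumes the ordering keys contain no ties, in which case a frame from unbounded preceding to the current row reduces to a plain running sum.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (species, bill_depth_mm, row_number, bill_length_mm)
# Values come from the output table above; row_number is a placeholder.
rows = [
    ("Adelie", 15.5, 1, 32.1),
    ("Adelie", 15.9, 2, 35.2),
    ("Adelie", 16.0, 3, 37.7),
    ("Adelie", 16.1, 4, 36.2),
    ("Adelie", 16.1, 5, 33.1),
]


def cumulative_sum_by_group(rows):
    """Running sum of bill_length_mm within each species.

    Rows are ordered by (bill_depth_mm, row_number) inside each group,
    mirroring the `by` and `order_by` arguments in the example above.
    """
    ordered = sorted(rows, key=itemgetter(0, 1, 2))
    result = []
    for species, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        totals = accumulate(row[3] for row in group)
        result.extend((species, row[3], total) for row, total in zip(group, totals))
    return result


for species, bill_length, cumsum_bl in cumulative_sum_by_group(rows):
    print(species, bill_length, round(cumsum_bl, 1))
```

Adding row_number to the ordering breaks the tie between the two rows with bill_depth_mm of 16.1, which is why each of them gets its own running total instead of sharing one.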

Example

  • tidypyspark code:
(pen.ts.select(['species','bill_length_mm','bill_depth_mm', 'flipper_length_mm'])
 .ts.pivot_longer('species', include = False)
 ).show(5)
 
+-------+-----------------+-----+
|species|             name|value|
+-------+-----------------+-----+
| Adelie|   bill_length_mm| 39.1|
| Adelie|    bill_depth_mm| 18.7|
| Adelie|flipper_length_mm|  181|
| Adelie|   bill_length_mm| 39.5|
| Adelie|    bill_depth_mm| 17.4|
+-------+-----------------+-----+
  • equivalent pyspark code:
stack_expr = '''
             stack(3, 'bill_length_mm', `bill_length_mm`,
                      'bill_depth_mm', `bill_depth_mm`,
                      'flipper_length_mm', `flipper_length_mm`)
                      as (`name`, `value`)
             '''
pen.select('species', F.expr(stack_expr)).show(5)
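Both versions above reshape wide columns into (name, value) pairs. As a rough illustration of that reshaping only, here is a pure-Python sketch. The function pivot_longer_rows is a hypothetical helper, not part of tidypyspark or pyspark, and the single input row is the first penguin from the output above.

```python
def pivot_longer_rows(rows, id_col, value_cols):
    """Emit one {id, name, value} record per value column of each input row."""
    return [
        {id_col: row[id_col], "name": col, "value": row[col]}
        for row in rows
        for col in value_cols
    ]


# First row of the example output above, written as a wide record.
first_row = {
    "species": "Adelie",
    "bill_length_mm": 39.1,
    "bill_depth_mm": 18.7,
    "flipper_length_mm": 181,
}

long_rows = pivot_longer_rows(
    [first_row],
    id_col="species",
    value_cols=["bill_length_mm", "bill_depth_mm", "flipper_length_mm"],
)

for record in long_rows:
    print(record)
```

Each wide row becomes three long rows, one per measured column, with the species value repeated on each, matching the shape of the table printed by show(5).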

tidypyspark relies on the amazing pyspark library and spark ecosystem.

Installation

pip install tidypyspark

Source Distribution

tidypyspark-0.0.1.tar.gz (40.6 kB)

Uploaded Source

Built Distribution

tidypyspark-0.0.1-py3-none-any.whl (41.1 kB)

Uploaded Python 3

File details

Details for the file tidypyspark-0.0.1.tar.gz.

File metadata

  • Download URL: tidypyspark-0.0.1.tar.gz
  • Upload date:
  • Size: 40.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for tidypyspark-0.0.1.tar.gz
Algorithm Hash digest
SHA256 9d74f920383a04a98fc21a0591fe2ad038e04556f414f3b2172dad98cb378f30
MD5 38fb7e4bcbb64ef792df7a385e935855
BLAKE2b-256 cab9d7e1926033cfe0aac09700756cc09f16b858bdffe930128bfee9b91172d2

File details

Details for the file tidypyspark-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: tidypyspark-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 41.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.12

File hashes

Hashes for tidypyspark-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0ed22f74ef26ad291586f8d8b86b46c3d84d9c2762fe954b54743c31bd2b1297
MD5 661e2da59162b73e6da0e44c907b59a2
BLAKE2b-256 95ea9dceb4d12670256f0ffaa887bdaf315242ac237f8ac926ffb85c243f28cc
