The missing PySpark utils

Usage

To install:

pip install pyspark-utils
# It also depends on absl-py.

helper

import pyspark_utils.helper as spark_helper

# Cache the RDD, then log its count and the first 3 items.
rdd = spark_helper.cache_and_log('MyRDD', rdd, 3)
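
For orientation, a helper of this kind might look roughly like the sketch below. It is not the library's source: `cache_and_log_sketch` and the `FakeRDD` stand-in are hypothetical names, the standard `logging` module is assumed in place of absl, and only the `cache()`, `count()`, and `take()` RDD methods are relied on.

```python
import logging

logger = logging.getLogger("spark_helper_sketch")


def cache_and_log_sketch(name, rdd, show_items=1):
    """Cache the RDD, log its size and first items, and return it (hypothetical)."""
    rdd = rdd.cache()                      # keep the data around for later stages
    logger.info("%s: %d items", name, rdd.count())
    for item in rdd.take(show_items):      # log a small sample only
        logger.info("%s sample: %r", name, item)
    return rdd


class FakeRDD:
    """Tiny local stand-in exposing the three RDD methods used above."""

    def __init__(self, items):
        self._items = list(items)

    def cache(self):
        return self

    def count(self):
        return len(self._items)

    def take(self, n):
        return self._items[:n]


rdd = cache_and_log_sketch("MyRDD", FakeRDD(range(10)), 3)
```

Because the function returns the cached RDD, it can be dropped into the middle of a pipeline without changing the surrounding code.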

op

import pyspark_utils.op as spark_op

# RDD<key, value>  ->  RDD<new_key, value>
pair_rdd.map(spark_op.do_key(lambda key: new_key))

# RDD<key, value>  ->  RDD<result>
pair_rdd.map(spark_op.do_tuple(lambda key, value: result))

# RDD<key, value>  ->  RDD<value, key>
pair_rdd.map(spark_op.swap_kv())

# RDD<key, value>  ->  RDD<key, value> if func(key)
pair_rdd.filter(spark_op.filter_key(lambda key: true_or_false))

# RDD<key, value>  ->  RDD<key, value> if func(value)
pair_rdd.filter(spark_op.filter_value(lambda value: true_or_false))
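
Adapters like these are handy because Python 3 lambdas cannot unpack a tuple argument, so a function passed to `map` or `filter` receives each pair as a single value. The sketch below shows one way such adapters could be written; the bodies are assumptions for illustration, not the library's code, and plain lists with the built-in `map` and `filter` stand in for an RDD.

```python
def do_key(func):
    """(key, value) -> (func(key), value)"""
    return lambda kv: (func(kv[0]), kv[1])


def do_tuple(func):
    """(a, b, ...) -> func(a, b, ...), unpacking the tuple into arguments."""
    return lambda args: func(*args)


def swap_kv():
    """(key, value) -> (value, key)"""
    return lambda kv: (kv[1], kv[0])


def filter_key(func):
    """Predicate on a pair that looks only at the key."""
    return lambda kv: func(kv[0])


def filter_value(func):
    """Predicate on a pair that looks only at the value."""
    return lambda kv: func(kv[1])


pairs = [("a", 1), ("b", 2), ("c", 3)]

upper = list(map(do_key(str.upper), pairs))
repeated = list(map(do_tuple(lambda key, value: key * value), pairs))
swapped = list(map(swap_kv(), pairs))
not_a = list(filter(filter_key(lambda key: key != "a"), pairs))
at_least_2 = list(filter(filter_value(lambda value: value >= 2), pairs))
```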

# RDD<iterable>  ->  RDD<tuple_or_list> with transformed values.
rdd.map(spark_op.do_tuple_elems(lambda elem: new_elem))
rdd.map(spark_op.do_list_elems(lambda elem: new_elem))

# RDD<path>  ->  RDD<path> if path matches any given fnmatch-style patterns
rdd.filter(spark_op.filter_path(['*.txt', '*.csv', 'path/a.???']))
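
To make the pattern semantics concrete, here is a minimal sketch of a path filter built on the standard `fnmatch` module. Whether the library uses `fnmatch.fnmatch` internally is an assumption; the function body and the sample paths are illustrative only.

```python
import fnmatch


def filter_path(patterns):
    """Return a predicate that is true when a path matches any pattern."""
    return lambda path: any(fnmatch.fnmatch(path, p) for p in patterns)


paths = ["notes.txt", "data.csv", "path/a.png", "path/a.jpeg", "image.gif"]
keep = filter_path(["*.txt", "*.csv", "path/a.???"])

# "?" matches exactly one character, so "path/a.jpeg" is rejected.
kept = list(filter(keep, paths))
```

Note that in `fnmatch`, `*` also matches path separators, so `*.txt` accepts `dir/notes.txt` as well as `notes.txt`.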

# RDD<element>  ->  RDD<element, element>
rdd.keyBy(spark_op.identity)

# RDD<key, value>   ->   RDD<key, value> with keys in key_rdd
spark_op.filter_keys(pair_rdd, key_rdd)

# RDD<key, value>   ->   RDD<key, value> with keys in whitelist and not in blacklist
spark_op.filter_keys(pair_rdd, whitelist_key_rdd, blacklist_key_rdd)

# RDD<key, value>   ->   RDD<key, value> with keys not in key_rdd
spark_op.substract_keys(pair_rdd, key_rdd)
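
The intended semantics of the key-based filters can be illustrated on local collections. The sketch below is an assumption about behavior rather than the library's implementation: `filter_keys_local` and `substract_keys_local` are hypothetical names, and sets stand in for the key RDDs. On real RDDs, comparable results could plausibly be obtained with PySpark's `RDD.join` and `RDD.subtractByKey`.

```python
def filter_keys_local(pairs, whitelist=None, blacklist=None):
    """Keep pairs whose key is in whitelist (if given) and not in blacklist."""
    allowed = None if whitelist is None else set(whitelist)
    denied = set(blacklist or ())
    return [
        (key, value)
        for key, value in pairs
        if (allowed is None or key in allowed) and key not in denied
    ]


def substract_keys_local(pairs, keys):
    """Drop pairs whose key appears in keys (spelling follows the library)."""
    return filter_keys_local(pairs, blacklist=keys)


pairs = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]

only_listed = filter_keys_local(pairs, whitelist=["a", "b", "c"])
listed_minus_blocked = filter_keys_local(pairs, ["a", "b", "c"], ["b"])
without_ad = substract_keys_local(pairs, ["a", "d"])
```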

# RDD<element>   ->   RDD<element> where element is not None
rdd.filter(spark_op.not_none)

# RDD<key>   ->   RDD<key, value>
rdd.map(spark_op.value_by(lambda key: value))
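
The remaining helpers are small enough to sketch in a few lines. The definitions below are assumed equivalents written for illustration, not copied from the library, and a plain list stands in for the RDD. The `keyed` line mirrors what `keyBy` does with a key function: each element `x` becomes `(f(x), x)`.

```python
def identity(x):
    """Return the argument unchanged."""
    return x


def not_none(x):
    """True for every value except None (0 and '' are kept)."""
    return x is not None


def value_by(func):
    """key -> (key, func(key))"""
    return lambda key: (key, func(key))


words = ["spark", None, "rdd", ""]

present = list(filter(not_none, words))
with_length = list(map(value_by(len), present))
keyed = [(identity(w), w) for w in present]   # local analogue of keyBy(identity)
```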

Download files

Source Distribution

pyspark_utils-1.8.0.tar.gz (2.8 kB)

Uploaded Source

Built Distributions

pyspark_utils-1.8.0-py3-none-any.whl (7.4 kB)

Uploaded Python 3

pyspark_utils-1.8.0-py2-none-any.whl (4.7 kB)

Uploaded Python 2

File details

Details for the file pyspark_utils-1.8.0.tar.gz.

File metadata

  • Download URL: pyspark_utils-1.8.0.tar.gz
  • Size: 2.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for pyspark_utils-1.8.0.tar.gz
Algorithm     Hash digest
SHA256        f26abeb57d0f5948225949a13f1c0803067f37a50cb687bbe13e1176a42a2913
MD5           ba2909b59d6c3c9a5246ef5a3a016a5f
BLAKE2b-256   9a6137828595d99c84cfac144edd6aa7651c82a53aee16818ca62a4eed0ade3a

File details

Details for the file pyspark_utils-1.8.0-py3-none-any.whl.

File metadata

  • Download URL: pyspark_utils-1.8.0-py3-none-any.whl
  • Size: 7.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for pyspark_utils-1.8.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        773915ff58da41cc5a473d03565405a3f385c03d7e8fc235b94efa691aa2ff01
MD5           1a0a11bcd59875d5d7e786decda863dc
BLAKE2b-256   69497a9941937bc770331633bb58e7f09bbe363b79cc0e6673d4c8b6bee893e9

File details

Details for the file pyspark_utils-1.8.0-py2-none-any.whl.

File metadata

  • Download URL: pyspark_utils-1.8.0-py2-none-any.whl
  • Size: 4.7 kB
  • Tags: Python 2
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3

File hashes

Hashes for pyspark_utils-1.8.0-py2-none-any.whl
Algorithm     Hash digest
SHA256        bac08380d67e17df7abd260f6178a5254bf2f324f272e5bbdb11b88ed922490d
MD5           f7869616c42d421599eaf79457ea51c2
BLAKE2b-256   f10115452c594d3498b2a53aa96d80b55cc1e1329cdc28c48617c1b85f9a448c
