# pyspark-utils

The missing PySpark utils.
## Usage

To install:

```shell
pip install pyspark-utils  # It also depends on absl-py.
```
### helper

```python
import pyspark_utils.helper as spark_helper

# Nicely show the RDD's count and 3 sample items.
rdd = spark_helper.cache_and_log('MyRDD', rdd, 3)
```
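`cache_and_log` caches the RDD, logs its name, count, and a few sample items, and returns the cached RDD. A minimal plain-Python sketch of that pattern (an illustration only, not the library's actual implementation; `FakeRDD` is a stand-in defined here purely for the demo):

```python
import logging

def cache_and_log(name, rdd, sample_size=3):
    """Cache an RDD, log its count and a few sample items, then return it."""
    rdd = rdd.cache()
    logging.info('%s: %d items, e.g. %s',
                 name, rdd.count(), rdd.take(sample_size))
    return rdd

# Stand-in exposing the minimal RDD interface used above, for demonstration.
class FakeRDD:
    def __init__(self, items):
        self.items = list(items)
    def cache(self):
        return self
    def count(self):
        return len(self.items)
    def take(self, n):
        return self.items[:n]

rdd = cache_and_log('MyRDD', FakeRDD(range(10)), 3)
print(rdd.count())  # 10
```

Returning the cached RDD lets the call wrap an existing assignment without changing surrounding code.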
### op

```python
import pyspark_utils.op as spark_op

# RDD<key, value> -> RDD<new_key, value>
pair_rdd.map(spark_op.do_key(lambda key: new_key))
# RDD<key, value> -> RDD<result>
pair_rdd.map(spark_op.do_tuple(lambda key, value: result))
# RDD<key, value> -> RDD<value, key>
pair_rdd.map(spark_op.swap_kv())
# RDD<key, value> -> RDD<key, value> if func(key)
pair_rdd.filter(spark_op.filter_key(lambda key: true_or_false))
# RDD<key, value> -> RDD<key, value> if func(value)
pair_rdd.filter(spark_op.filter_value(lambda value: true_or_false))
# RDD<iterable> -> RDD<tuple_or_list> with transformed values.
rdd.map(spark_op.do_tuple_elems(lambda elem: new_elem))
rdd.map(spark_op.do_list_elems(lambda elem: new_elem))
# RDD<path> -> RDD<path> if path matches any given fnmatch-style patterns.
rdd.filter(spark_op.filter_path(['*.txt', '*.csv', 'path/a.???']))
# RDD<element> -> RDD<element, element>
rdd.keyBy(spark_op.identity)
# RDD<key, value> -> RDD<key, value> with keys in key_rdd.
spark_op.filter_keys(pair_rdd, key_rdd)
# RDD<key, value> -> RDD<key, value> with keys in whitelist and not in blacklist.
spark_op.filter_keys(pair_rdd, whitelist_key_rdd, blacklist_key_rdd)
# RDD<key, value> -> RDD<key, value> with keys not in key_rdd.
spark_op.substract_keys(pair_rdd, key_rdd)
# RDD<element> -> RDD<element> where element is not None.
rdd.filter(spark_op.not_none)
# RDD<key> -> RDD<key, value>
rdd.map(spark_op.value_by(lambda key: value))
```
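These helpers all follow one pattern: wrap a function over the key or value so it can be passed straight to `map`/`filter` on an RDD of `(key, value)` pairs. A plain-Python sketch of that pattern, runnable without Spark (the names mirror the library's API for readability, but these are illustrations, not its code):

```python
def do_key(func):
    # (key, value) -> (func(key), value)
    return lambda kv: (func(kv[0]), kv[1])

def swap_kv(kv):
    # (key, value) -> (value, key)
    return (kv[1], kv[0])

def filter_value(func):
    # Keep (key, value) pairs where func(value) is truthy.
    return lambda kv: func(kv[1])

# Applied to a plain list standing in for an RDD of pairs:
pairs = [('a', 1), ('b', 2), ('c', 3)]
print(list(map(do_key(str.upper), pairs)))  # [('A', 1), ('B', 2), ('C', 3)]
print(list(map(swap_kv, pairs)))            # [(1, 'a'), (2, 'b'), (3, 'c')]
print(list(filter(filter_value(lambda v: v > 1), pairs)))  # [('b', 2), ('c', 3)]
```

Because each wrapper returns an ordinary one-argument function, the same objects work unchanged with `rdd.map` and `rdd.filter` in PySpark.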
## File details

Details for the file pyspark_utils-1.8.0.tar.gz.

### File metadata
- Download URL: pyspark_utils-1.8.0.tar.gz
- Upload date:
- Size: 2.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | f26abeb57d0f5948225949a13f1c0803067f37a50cb687bbe13e1176a42a2913 |
| MD5 | ba2909b59d6c3c9a5246ef5a3a016a5f |
| BLAKE2b-256 | 9a6137828595d99c84cfac144edd6aa7651c82a53aee16818ca62a4eed0ade3a |
## File details

Details for the file pyspark_utils-1.8.0-py3-none-any.whl.

### File metadata
- Download URL: pyspark_utils-1.8.0-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 773915ff58da41cc5a473d03565405a3f385c03d7e8fc235b94efa691aa2ff01 |
| MD5 | 1a0a11bcd59875d5d7e786decda863dc |
| BLAKE2b-256 | 69497a9941937bc770331633bb58e7f09bbe363b79cc0e6673d4c8b6bee893e9 |
## File details

Details for the file pyspark_utils-1.8.0-py2-none-any.whl.

### File metadata
- Download URL: pyspark_utils-1.8.0-py2-none-any.whl
- Upload date:
- Size: 4.7 kB
- Tags: Python 2
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.14.0 pkginfo/1.5.0.1 requests/2.21.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.7.3
### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | bac08380d67e17df7abd260f6178a5254bf2f324f272e5bbdb11b88ed922490d |
| MD5 | f7869616c42d421599eaf79457ea51c2 |
| BLAKE2b-256 | f10115452c594d3498b2a53aa96d80b55cc1e1329cdc28c48617c1b85f9a448c |