PySpark Utils
The missing PySpark utils.
Usage
To install:
pip install pyspark-utils
helper
import pyspark_utils.helper as spark_helper
# A SparkContext named 'TestPipeline'.
spark_context = spark_helper.get_context('TestPipeline')
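A minimal end-to-end sketch, assuming get_context returns a standard SparkContext with the given app name (the sample data below is illustrative, not part of the library):

import pyspark_utils.helper as spark_helper

spark_context = spark_helper.get_context('TestPipeline')
# Tiny word count to confirm the context works.
counts = (spark_context
          .parallelize(['a', 'b', 'a'])
          .map(lambda word: (word, 1))
          .reduceByKey(lambda x, y: x + y)
          .collect())
print(counts)  # e.g. [('a', 2), ('b', 1)]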
op
import pyspark_utils.op as spark_op
# RDD<key, value> -> RDD<new_key, value>
pair_rdd.map(spark_op.do_key(lambda key: new_key))
# RDD<key, value> -> RDD<result>
pair_rdd.map(spark_op.do_tuple(lambda key, value: result))
# RDD<key, value> -> RDD<value, key>
pair_rdd.map(spark_op.swap_kv())
# RDD<key, value> -> RDD<key, value> if func(key)
pair_rdd.filter(spark_op.filter_key(lambda key: true_or_false))
# RDD<key, value> -> RDD<key, value> if func(value)
pair_rdd.filter(spark_op.filter_value(lambda value: true_or_false))
# RDD<element> -> RDD<element, element>
rdd.keyBy(spark_op.identity)
# RDD<key, value> -> RDD<key, value> with keys in key_rdd
subset_pair_rdd = spark_op.filter_keys(pair_rdd, key_rdd)
# RDD<key, value> -> RDD<key, value> with keys not in key_rdd
subset_pair_rdd = spark_op.substract_keys(pair_rdd, key_rdd)
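Putting the operators together, a hedged sketch based on the signatures above (the 'OpDemo' app name and the sample pairs are our own, and the expected outputs in comments are illustrative):

import pyspark_utils.helper as spark_helper
import pyspark_utils.op as spark_op

spark_context = spark_helper.get_context('OpDemo')
# Pair RDD of <fruit, count>.
pair_rdd = spark_context.parallelize([('apple', 3), ('banana', 5), ('cherry', 2)])

# Upper-case every key: [('APPLE', 3), ('BANANA', 5), ('CHERRY', 2)].
upper_rdd = pair_rdd.map(spark_op.do_key(lambda key: key.upper()))

# Collapse each pair into one string: ['apple=3', 'banana=5', 'cherry=2'].
desc_rdd = pair_rdd.map(spark_op.do_tuple(lambda key, value: '%s=%d' % (key, value)))

# Keep pairs whose value is at least 3: [('apple', 3), ('banana', 5)].
big_rdd = pair_rdd.filter(spark_op.filter_value(lambda value: value >= 3))

# Restrict to a whitelist of keys: [('apple', 3)].
key_rdd = spark_context.parallelize(['apple'])
subset_pair_rdd = spark_op.filter_keys(pair_rdd, key_rdd)
print(subset_pair_rdd.collect())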