Easily group pyspark data into buckets and map them to different values.
pyspark-bucketmap

`pyspark-bucketmap` is a tiny module for pyspark which allows you to bucketize DataFrame rows and map their values easily.
Install
```shell
pip install pyspark-bucketmap
```
Usage
```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame(
    [
        Row(age=12, name="Damian"),
        Row(age=15, name="Jake"),
        Row(age=18, name="Dominic"),
        Row(age=20, name="John"),
        Row(age=27, name="Jerry"),
        Row(age=101, name="Jerry's Grandpa"),
    ]
)
people.show()
```
Now, what we would like to do is map each person's age to an age category:

| age range     | life phase      |
|---------------|-----------------|
| 0 to 12       | Child           |
| 12 to 18      | Teenager        |
| 18 to 25      | Young adulthood |
| 25 to 70      | Adult           |
| 70 and beyond | Elderly         |
We can use `pyspark-bucketmap` for this. First, define the splits and mappings:
```python
from typing import Dict, List

from pyspark.sql import Column
from pyspark.sql.functions import lit

splits: List[float] = [-float("inf"), 0, 12, 18, 25, 70, float("inf")]
mapping: Dict[int, Column] = {
    0: lit("Not yet born"),
    1: lit("Child"),
    2: lit("Teenager"),
    3: lit("Young adulthood"),
    4: lit("Adult"),
    5: lit("Elderly"),
}
```
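Note that each bucket is left-inclusive and right-exclusive: a value `x` lands in bucket `i` when `splits[i] <= x < splits[i+1]` (only the very last bucket also includes its upper bound). That is why age 12 maps to "Teenager" rather than "Child". A small pure-Python sketch of this lookup rule, not the library's actual code:

```python
from bisect import bisect_right

splits = [float("-inf"), 0, 12, 18, 25, 70, float("inf")]
labels = ["Not yet born", "Child", "Teenager", "Young adulthood", "Adult", "Elderly"]

def bucket_index(value: float) -> int:
    # Bucket i covers [splits[i], splits[i + 1]); bisect_right returns the
    # insertion point *after* any split equal to the value, so a boundary
    # value like 12 falls into the next bucket.
    return bisect_right(splits, value) - 1

print(labels[bucket_index(12)])  # -> Teenager (boundary value, next bucket)
print(labels[bucket_index(11)])  # -> Child
```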
Then, apply `BucketMap.transform(df)`:
```python
from pyspark.sql import DataFrame

from pyspark_bucketmap import BucketMap

bucket_mapper = BucketMap(
    splits=splits, mapping=mapping, inputCol="age", outputCol="phase"
)
phases_actual: DataFrame = bucket_mapper.transform(people).select("name", "phase")
phases_actual.show()
```
| name            | phase           |
|-----------------|-----------------|
| Damian          | Teenager        |
| Jake            | Teenager        |
| Dominic         | Young adulthood |
| John            | Young adulthood |
| Jerry           | Adult           |
| Jerry's Grandpa | Elderly         |
Success! ✨
API
Module `pyspark_bucketmap`:
```python
from typing import Any, Dict, Optional

from pyspark.ml.feature import Bucketizer
from pyspark.sql import DataFrame
from pyspark.sql.column import Column


class BucketMap(Bucketizer):
    mapping: Dict[int, Column]

    def __init__(self, mapping: Dict[int, Column], *args, **kwargs):
        ...

    def transform(self, dataset: DataFrame, params: Optional[Any] = None) -> DataFrame:
        ...
```
Contributing
Under the hood, pyspark-bucketmap uses a combination of pyspark's `Bucketizer` and `pyspark.sql.functions.create_map`. The code is 42 lines and lives in a single file: `pyspark_bucketmap.py`. To contribute, follow your preferred setup option below.
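As a rough illustration of that combination (a plain-Python sketch using the same splits and labels as the usage example, not the module's actual code), the transform boils down to two steps: bucketize each value, then map the bucket index to a label:

```python
splits = [float("-inf"), 0, 12, 18, 25, 70, float("inf")]
mapping = {0: "Not yet born", 1: "Child", 2: "Teenager",
           3: "Young adulthood", 4: "Adult", 5: "Elderly"}

def phase_of(age: float, splits, mapping) -> str:
    # Bucketizer step: find the bucket i such that splits[i] <= age < splits[i + 1].
    for i in range(len(splits) - 1):
        if splits[i] <= age < splits[i + 1]:
            # create_map step: look the bucket index up in the mapping.
            return mapping[i]
    # The last bucket also includes its upper bound.
    return mapping[len(splits) - 2]

ages = [12, 15, 18, 20, 27, 101]
print([phase_of(a, splits, mapping) for a in ages])
# -> ['Teenager', 'Teenager', 'Young adulthood', 'Young adulthood', 'Adult', 'Elderly']
```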
Option A: using a Devcontainer (VSCode only)
If you happen to use VSCode as your editor, you can open pyspark-bucketmap in a Devcontainer. Devcontainers allow you to develop inside a Docker container, which means all dependencies and packages are automatically set up for you. First, make sure you have the Remote Development extension installed.
Then, you can do two things.

- Click the following button:

- Or, clone and open up the repo in VSCode:

  ```shell
  git clone https://github.com/dunnkers/pyspark-bucketmap.git
  code pyspark-bucketmap
  ```

  (for this to work, make sure you activated VSCode's `code` CLI)

Then, you should see the following notification:
Now you should have a fully working dev environment 🙌🏻. You can run tests, debug code, etcetera. All dependencies are automatically installed for you.
Option B: installing the dependencies manually
Clone the repo and install the deps:
```shell
git clone https://github.com/dunnkers/pyspark-bucketmap.git
cd pyspark-bucketmap
pip install -r .devcontainer/requirements.txt
pip install -r .devcontainer/requirements-dev.txt
pip install .
```
Make sure you also have the following installed:
- Python 3.9
- OpenJDK version 11
Now, you should be able to run tests 🧪:

```shell
pytest .
```
About
Created by Jeroen Overschie © 2022.
File details

Details for the file `pyspark-bucketmap-0.0.5.tar.gz`.

File metadata

- Download URL: pyspark-bucketmap-0.0.5.tar.gz
- Upload date:
- Size: 4.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.8

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 368facb190b2b74a58b3d9ff9859f42898e19466131d2f22cca0a631a14fc351 |
| MD5         | e397f6c6393ad74af19c2ca695d53512                                 |
| BLAKE2b-256 | ab524140cd1a38466398e98e77c39fedf1429688b054dc2285fe81e11333a9cc |
File details

Details for the file `pyspark_bucketmap-0.0.5-py3-none-any.whl`.

File metadata

- Download URL: pyspark_bucketmap-0.0.5-py3-none-any.whl
- Upload date:
- Size: 5.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.10.8

File hashes

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | 11527071cefc5386f74bcff08dd06561df99cae722de39ac7ab9c33d8f37116e |
| MD5         | 52d515ebc499a65fbd1026a1f662da7f                                 |
| BLAKE2b-256 | b7325f052677284ebc620656f3b75e03663696e3be06933daf3acb5b34f6f2ca |
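If you want to verify a downloaded distribution against the digests listed here, you can compute the file's SHA256 locally with Python's standard `hashlib` (a generic sketch; substitute the path of the file you actually downloaded):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (assumes the wheel listed above sits in the current directory):
# expected = "11527071cefc5386f74bcff08dd06561df99cae722de39ac7ab9c33d8f37116e"
# assert file_sha256("pyspark_bucketmap-0.0.5-py3-none-any.whl") == expected
```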