Easily group pyspark data into buckets and map them to different values.

pyspark-bucketmap


pyspark-bucketmap is a tiny module for pyspark which allows you to bucketize DataFrame rows and map their values easily.

Install

pip install pyspark-bucketmap

Usage

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

people = spark.createDataFrame(
    [
        Row(age=12, name="Damian"),
        Row(age=15, name="Jake"),
        Row(age=18, name="Dominic"),
        Row(age=20, name="John"),
        Row(age=27, name="Jerry"),
        Row(age=101, name="Jerry's Grandpa"),
    ]
)
people.show()

Now, what we would like to do, is map each person's age to an age category.

age range       life phase
-------------   ---------------
0 to 12         Child
12 to 18        Teenager
18 to 25        Young adulthood
25 to 70        Adult
70 and beyond   Elderly

We can use pyspark-bucketmap for this. First, define the splits and mappings:

from typing import Dict, List

from pyspark.sql import Column
from pyspark.sql.functions import lit

splits: List[float] = [-float("inf"), 0, 12, 18, 25, 70, float("inf")]
mapping: Dict[int, Column] = {
    0: lit("Not yet born"),
    1: lit("Child"),
    2: lit("Teenager"),
    3: lit("Young adulthood"),
    4: lit("Adult"),
    5: lit("Elderly"),
}
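To see which bucket an age lands in, note that Bucketizer assigns a value x to bucket i when splits[i] <= x < splits[i+1]. A plain-Python sketch of that lookup (using strings in place of the lit() columns; no Spark session needed):

```python
import bisect

splits = [float("-inf"), 0, 12, 18, 25, 70, float("inf")]
labels = ["Not yet born", "Child", "Teenager", "Young adulthood", "Adult", "Elderly"]

def life_phase(age: float) -> str:
    # bisect_right finds the insertion point to the right of any equal split,
    # so a value equal to a boundary falls into the higher bucket, matching
    # Bucketizer's half-open [lower, upper) intervals
    return labels[bisect.bisect_right(splits, age) - 1]

print(life_phase(12))   # Teenager
print(life_phase(101))  # Elderly
```

So Damian (age 12) is already a Teenager, since bucket boundaries are inclusive on the left.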

Then, apply BucketMap.transform(df):

from pyspark.sql import DataFrame

from pyspark_bucketmap import BucketMap

bucket_mapper = BucketMap(
    splits=splits, mapping=mapping, inputCol="age", outputCol="phase"
)
phases_actual: DataFrame = bucket_mapper.transform(people).select("name", "phase")
phases_actual.show()
name              phase
---------------   ---------------
Damian            Teenager
Jake              Teenager
Dominic           Young adulthood
John              Young adulthood
Jerry             Adult
Jerry's Grandpa   Elderly

Success!

API

Module pyspark_bucketmap:

from typing import Any, Dict, Optional

from pyspark.ml.feature import Bucketizer
from pyspark.sql import DataFrame
from pyspark.sql.column import Column

class BucketMap(Bucketizer):
    mapping: Dict[int, Column]

    def __init__(self, mapping: Dict[int, Column], *args, **kwargs):
        ...

    def transform(self, dataset: DataFrame, params: Optional[Any] = None) -> DataFrame:
        ...

Contributing

Under the hood, pyspark-bucketmap uses a combination of pyspark's Bucketizer and pyspark.sql.functions.create_map. The code is 42 lines and lives in a single file: pyspark_bucketmap.py. To contribute, follow your preferred setup option below.
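As a rough sketch of how that combination can work (an illustration, not the library's actual source): Bucketizer assigns each row a bucket index, and create_map builds a lookup column from alternating key/value arguments, so the mapping dict has to be flattened into key, value, key, value, ... first. The flattening step in plain Python (strings stand in for the lit() columns):

```python
from itertools import chain

# Bucket-index -> label mapping (plain strings instead of Column objects)
mapping = {0: "Not yet born", 1: "Child", 2: "Teenager"}

# create_map expects alternating key, value arguments; chain flattens the pairs
flat = list(chain(*mapping.items()))
print(flat)  # [0, 'Not yet born', 1, 'Child', 2, 'Teenager']
```

In pyspark this flattened list would then be wrapped in lit() and passed to create_map, and the resulting map column indexed by the Bucketizer output.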

Option A: using a Devcontainer (VSCode only)

If you happen to use VSCode as your editor, you can open pyspark-bucketmap in a Devcontainer. Devcontainers allow you to develop inside a Docker container, which means all dependencies and packages are automatically set up for you. First, make sure you have the Remote Development extension installed.

Then, do either of the following:

  1. Click the "Open in Remote - Containers" button in the repository README.

  2. Or, clone and open up the repo in VSCode:

    git clone https://github.com/dunnkers/pyspark-bucketmap.git
    code pyspark-bucketmap

    (for this to work, make sure you activated VSCode's code CLI)

    Then, you should see a notification offering to reopen the folder in a Devcontainer.

Now you should have a fully working dev environment 🙌🏻. You can run tests, debug code, etcetera. All dependencies are automatically installed for you.

Option B: installing the dependencies manually

Clone the repo and install the deps:

git clone https://github.com/dunnkers/pyspark-bucketmap.git
cd pyspark-bucketmap
pip install -r .devcontainer/requirements.txt
pip install -r .devcontainer/requirements-dev.txt
pip install .

Make sure you also have the following installed:

  • Python 3.9
  • OpenJDK version 11

Now, you should be able to run tests 🧪:

pytest .

🙌🏻

About

Created by Jeroen Overschie © 2022.
