jstark

A PySpark library for generating time-based features for machine learning. All features are calculated relative to an as at date, enabling point-in-time feature engineering over configurable time periods.

Feature period mnemonics

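Feature names end with a mnemonic describing the time window. The format is {start}{unit}{end}, where {start} and {end} are the number of units before the as at date at which the window opens and closes, and the unit is one of d (days), w (weeks), m (months), q (quarters) or y (years).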

For example, BasketCount_3m1 is the distinct count of baskets from 3 months before to 1 month before the as at date.
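
To make the mnemonic format concrete, here is a minimal sketch of how such a string could be split into its parts and, for months only, turned into a date window. The names parse_mnemonic and month_window are hypothetical helpers and are not part of jstark. The calendar-month alignment is an assumption inferred from the feature descriptions in the Features reference (3m1 as at 2022-01-01 covers 2021-10-01 to 2021-12-31); the library's actual handling of other units or mid-month as at dates may differ.

```python
import re
from datetime import date, timedelta

# {start}{unit}{end}, e.g. "3m1" or "4q4"
MNEMONIC = re.compile(r"^(\d+)([dwmqy])(\d+)$")


def parse_mnemonic(mnemonic):
    """Split a period mnemonic into (start, unit, end). Hypothetical helper."""
    match = MNEMONIC.match(mnemonic)
    if match is None:
        raise ValueError(f"Not a valid period mnemonic: {mnemonic!r}")
    start, unit, end = match.groups()
    return int(start), unit, int(end)


def month_window(as_at, start, end):
    """Assumed month semantics: first day of the month `start` months back
    through the last day of the month `end` months back."""

    def first_of_month(months_back):
        index = as_at.year * 12 + (as_at.month - 1) - months_back
        return date(index // 12, index % 12 + 1, 1)

    # The last day of a month is the day before the first day of the next one
    return first_of_month(start), first_of_month(end - 1) - timedelta(days=1)


start, unit, end = parse_mnemonic("3m1")
print(month_window(date(2022, 1, 1), start, end))
# (datetime.date(2021, 10, 1), datetime.date(2021, 12, 31))
```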

Multiple periods can be calculated in a single Spark job:

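from datetime import date
from jstark.grocery import GroceryFeatures

# input_df is assumed to be an existing DataFrame of grocery transactions
gf = GroceryFeatures(as_at=date(2022, 1, 1), feature_periods=["3m1", "6m4"])
# A single aggregation computes every feature for both periods
output_df = input_df.groupBy("Store").agg(*gf.features)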

This produces BasketCount_3m1, BasketCount_6m4, and every other feature for both periods. See the Features reference for a list of all available features.
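
Because every period adds its own set of suffixed columns, it can help to group the output column names by period. The sketch below works on a plain list of names (such as output_df.columns); columns_by_period is a hypothetical helper, not something jstark provides.

```python
import re
from collections import defaultdict

# Matches a trailing period mnemonic such as "_3m1" or "_6m4"
PERIOD_SUFFIX = re.compile(r"_(\d+[dwmqy]\d+)$")


def columns_by_period(columns):
    """Group column names by their period mnemonic; other columns are skipped."""
    grouped = defaultdict(list)
    for name in columns:
        match = PERIOD_SUFFIX.search(name)
        if match:
            grouped[match.group(1)].append(name)
    return dict(grouped)


# Stand-in for output_df.columns
columns = ["Store", "BasketCount_3m1", "BasketCount_6m4", "GrossSpend_3m1", "GrossSpend_6m4"]
print(columns_by_period(columns))
# {'3m1': ['BasketCount_3m1', 'GrossSpend_3m1'], '6m4': ['BasketCount_6m4', 'GrossSpend_6m4']}
```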

Quick start

Prerequisites: PySpark requires a Java runtime. On macOS, one way to install it is brew install openjdk@11.
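
If you are unsure whether a Java runtime is visible to Python, a quick check along these lines may help before starting Spark. This is only a sketch: java_available is a hypothetical helper, not part of jstark or PySpark, and it only looks for a java executable on PATH or under JAVA_HOME/bin without checking the Java version.

```python
import os
import shutil


def java_available():
    """Return True if a `java` executable is found on PATH or in JAVA_HOME/bin."""
    if shutil.which("java") is not None:
        return True
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        # Look only inside JAVA_HOME/bin
        return shutil.which("java", path=os.path.join(java_home, "bin")) is not None
    return False


if not java_available():
    print("No Java runtime found; PySpark is unlikely to start.")
```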

pip install jstark[faker]

The faker extra installs Faker, which is needed for the sample data generator used below. If you don't need sample data, pip install jstark is sufficient.

from datetime import date
from jstark.sample.transactions import FakeGroceryTransactions
from jstark.grocery import GroceryFeatures

input_df = FakeGroceryTransactions().df
gf = GroceryFeatures(date(2022, 1, 1), ["4q4", "3q3", "2q2", "1q1"])
output_df = input_df.groupBy("Store").agg(*gf.features)
output_df.select(
    "Store", "BasketCount_4q4", "BasketCount_3q3", "BasketCount_2q2", "BasketCount_1q1"
).show()
+-----------+---------------+---------------+---------------+---------------+
|      Store|BasketCount_4q4|BasketCount_3q3|BasketCount_2q2|BasketCount_1q1|
+-----------+---------------+---------------+---------------+---------------+
|    Staines|             47|             46|             48|             51|
| Twickenham|             55|             57|             48|             49|
|     Ealing|             52|             51|             50|             54|
|Hammersmith|             47|             40|             43|             51|
|   Richmond|             54|             40|             64|             53|
+-----------+---------------+---------------+---------------+---------------+

Feature descriptions and references

Every feature carries a description in its column metadata:

from pprint import pprint
pprint([(c.name, c.metadata["description"]) for c in output_df.schema if c.name.endswith("1q1")])
[('BasketCount_1q1',
  'Distinct count of Baskets between 2021-10-01 and 2021-12-31'),
 ...]
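
The same lookup can be wrapped in a small function that returns a name-to-description mapping. The sketch below is duck-typed: it accepts any iterable of objects with name and metadata attributes, which is how output_df.schema is iterated above. feature_descriptions is a hypothetical helper, and the SimpleNamespace objects merely stand in for Spark schema fields so the example runs without a Spark session.

```python
from types import SimpleNamespace


def feature_descriptions(fields, suffix=""):
    """Map field name -> description for fields whose name ends with `suffix`."""
    return {
        field.name: field.metadata.get("description", "")
        for field in fields
        if field.name.endswith(suffix)
    }


# Stand-in for output_df.schema
fields = [
    SimpleNamespace(name="Store", metadata={}),
    SimpleNamespace(
        name="BasketCount_1q1",
        metadata={"description": "Distinct count of Baskets between 2021-10-01 and 2021-12-31"},
    ),
]
print(feature_descriptions(fields, suffix="_1q1"))
# {'BasketCount_1q1': 'Distinct count of Baskets between 2021-10-01 and 2021-12-31'}
```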

You can also inspect what input columns each feature requires:

gf.references["BasketCount_1q1"]                   # ['Basket', 'Timestamp']
gf.references["CustomerCount_1q1"]                 # ['Customer', 'Timestamp']
gf.references["AvgGrossSpendPerBasket_1q1"]        # ['Basket', 'GrossSpend', 'Timestamp']

All features require a Timestamp column (TimestampType). Most require additional columns depending on what they measure.
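
One way to use these references is to check, before aggregating, which features an input DataFrame cannot support. The sketch below assumes gf.references behaves like a mapping from feature name to a list of column names, as the lookups above suggest; missing_columns is a hypothetical helper, and a plain dict stands in for gf.references so the example runs without Spark.

```python
def missing_columns(references, available_columns):
    """Return {feature: [required columns absent from available_columns]}."""
    available = set(available_columns)
    missing = {}
    for feature, required in references.items():
        absent = [column for column in required if column not in available]
        if absent:
            missing[feature] = absent
    return missing


# Stand-in for gf.references, using the values shown above
references = {
    "BasketCount_1q1": ["Basket", "Timestamp"],
    "CustomerCount_1q1": ["Customer", "Timestamp"],
    "AvgGrossSpendPerBasket_1q1": ["Basket", "GrossSpend", "Timestamp"],
}
# Stand-in for input_df.columns
print(missing_columns(references, ["Store", "Basket", "Timestamp"]))
# {'CustomerCount_1q1': ['Customer'], 'AvgGrossSpendPerBasket_1q1': ['GrossSpend']}
```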

Features reference

Grocery features

The following Grocery features are available when calling:

GroceryFeatures(date(2022, 1, 1), ["3m1"])

| Feature | Description |
| --- | --- |
| ApproxBasketCount_3m1 | Approximate distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| ApproxCustomerCount_3m1 | Approximate distinct count of Customers between 2021-10-01 and 2021-12-31 |
| AverageBasketsPerMonth_3m1 | Average number of baskets per month between 2021-10-01 and 2021-12-31 |
| AvgDiscountPerBasket_3m1 | Average Discount per Basket between 2021-10-01 and 2021-12-31 |
| AvgGrossSpendPerBasket_3m1 | Average GrossSpend per Basket between 2021-10-01 and 2021-12-31 |
| AvgPurchaseCycle_3m1 | Average purchase cycle between 2021-10-01 and 2021-12-31 |
| AvgQuantityPerBasket_3m1 | Average Quantity per Basket between 2021-10-01 and 2021-12-31 |
| BasketCount_3m1 | Distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| BasketMonths_3m1 | Number of months in which at least one basket was purchased between 2021-10-01 and 2021-12-31 |
| ChannelCount_3m1 | Distinct count of Channels between 2021-10-01 and 2021-12-31 |
| Count_3m1 | Count of rows between 2021-10-01 and 2021-12-31 |
| CustomerCount_3m1 | Distinct count of Customers between 2021-10-01 and 2021-12-31 |
| CyclesSinceLastPurchase_3m1 | Cycles since last purchase between 2021-10-01 and 2021-12-31 |
| Discount_3m1 | Sum of Discount between 2021-10-01 and 2021-12-31 |
| EarliestPurchaseDate_3m1 | Earliest purchase date between 2021-10-01 and 2021-12-31 |
| GrossSpend_3m1 | Sum of GrossSpend between 2021-10-01 and 2021-12-31 |
| MaxGrossPrice_3m1 | Maximum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxGrossSpend_3m1 | Maximum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MaxNetPrice_3m1 | Maximum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxNetSpend_3m1 | Maximum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MinGrossPrice_3m1 | Minimum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinGrossSpend_3m1 | Minimum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MinNetPrice_3m1 | Minimum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinNetSpend_3m1 | Minimum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MostRecentPurchaseDate_3m1 | Most recent purchase date between 2021-10-01 and 2021-12-31 |
| NetSpend_3m1 | Sum of NetSpend between 2021-10-01 and 2021-12-31 |
| ProductCount_3m1 | Distinct count of Products between 2021-10-01 and 2021-12-31 |
| Quantity_3m1 | Sum of Quantity between 2021-10-01 and 2021-12-31 |
| RecencyDays_3m1 | Minimum number of days since occurrence between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| StoreCount_3m1 | Distinct count of Stores between 2021-10-01 and 2021-12-31 |

License

jstark is distributed under the terms of the MIT license.

Why "jstark"?

The name is phonetically similar to PySpark, is a homage to comic book character Jon Stark, and contains the initials of the original contributor (j, k & t).
