jstark


A PySpark library for generating time-based features for machine learning. All features are calculated relative to an as at date, enabling point-in-time feature engineering over configurable time periods.

Feature period mnemonics

Feature names end with a mnemonic describing the time window. The format is {start}{unit}{end}, where start and end are how many units before the as at date the window begins and ends, and unit is one of d (days), w (weeks), m (months), q (quarters) or y (years).

For example, BasketCount_3m1 is the distinct count of baskets from 3 months before to 1 month before the as at date.
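To make the mnemonic concrete, here is a minimal sketch that parses a mnemonic and works out the date range for the months unit only. It is not jstark's implementation: the function names are hypothetical, and the alignment to whole calendar months is an assumption inferred from the feature descriptions in this README (3m1 with an as at date of 2022-01-01 is described as 2021-10-01 to 2021-12-31).

```python
import re
from datetime import date, timedelta


def parse_mnemonic(mnemonic: str) -> tuple[int, str, int]:
    """Split a mnemonic such as '3m1' into (start, unit, end)."""
    match = re.fullmatch(r"(\d+)([dwmqy])(\d+)", mnemonic)
    if match is None:
        raise ValueError(f"not a valid period mnemonic: {mnemonic!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))


def first_of_month(as_at: date, months_back: int) -> date:
    """First day of the calendar month `months_back` months before as_at."""
    index = as_at.year * 12 + (as_at.month - 1) - months_back
    return date(index // 12, index % 12 + 1, 1)


def month_window(as_at: date, mnemonic: str) -> tuple[date, date]:
    """Inclusive date range for a months mnemonic.

    Assumption: windows cover whole calendar months, as the README's
    example descriptions suggest. Other units are not handled here.
    """
    start, unit, end = parse_mnemonic(mnemonic)
    if unit != "m":
        raise ValueError("this sketch only handles the 'm' (months) unit")
    first_day = first_of_month(as_at, start)
    # The window ends on the last day of the month `end` months back.
    last_day = first_of_month(as_at, end - 1) - timedelta(days=1)
    return first_day, last_day


print(month_window(date(2022, 1, 1), "3m1"))
# (datetime.date(2021, 10, 1), datetime.date(2021, 12, 31))
```

The same sketch gives 2021-07-01 to 2021-09-30 for 6m4, so the two periods used in the next example do not overlap.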

Multiple periods can be calculated in a single Spark job:

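from datetime import date
from jstark.grocery import GroceryFeatures

# input_df is assumed to be an existing Spark DataFrame of grocery transactions,
# such as the sample data generated in the Quick start.
gf = GroceryFeatures(as_at=date(2022, 1, 1), feature_periods=["3m1", "6m4"])
# One aggregation computes every feature for both periods, grouped by store.
output_df = input_df.groupBy("Store").agg(*gf.features)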

This produces BasketCount_3m1, BasketCount_6m4, and every other feature for both periods. See the Features reference for a list of all available features.
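Because every output column ends with its period mnemonic, the columns can be grouped by period with plain string handling. The sketch below is illustrative only: split_feature_name is a hypothetical helper, not part of jstark, and the column list is typed in by hand rather than read from a DataFrame.

```python
def split_feature_name(column_name: str) -> tuple[str, str]:
    """Split 'BasketCount_3m1' into ('BasketCount', '3m1')."""
    feature, _, period = column_name.rpartition("_")
    if not feature or not period:
        raise ValueError(f"no period mnemonic found in {column_name!r}")
    return feature, period


# Hand-written sample of output column names from this README.
columns = ["BasketCount_3m1", "BasketCount_6m4", "GrossSpend_3m1"]

# Group feature names under the period they were calculated for.
by_period: dict[str, list[str]] = {}
for column in columns:
    feature, period = split_feature_name(column)
    by_period.setdefault(period, []).append(feature)

print(by_period)
# {'3m1': ['BasketCount', 'GrossSpend'], '6m4': ['BasketCount']}
```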

Quick start

Prerequisites: PySpark requires a Java runtime. On macOS, install one with: brew install openjdk@11.

pip install jstark[faker]

The faker extra installs Faker, which is needed for the sample data generator used below. If you don't need sample data, pip install jstark is sufficient.

from datetime import date
from jstark.sample.transactions import FakeGroceryTransactions
from jstark.grocery import GroceryFeatures

input_df = FakeGroceryTransactions().df
gf = GroceryFeatures(date(2022, 1, 1), ["4q4", "3q3", "2q2", "1q1"])
output_df = input_df.groupBy("Store").agg(*gf.features)
output_df.select(
    "Store", "BasketCount_4q4", "BasketCount_3q3", "BasketCount_2q2", "BasketCount_1q1"
).show()
+-----------+---------------+---------------+---------------+---------------+
|      Store|BasketCount_4q4|BasketCount_3q3|BasketCount_2q2|BasketCount_1q1|
+-----------+---------------+---------------+---------------+---------------+
|    Staines|             47|             46|             48|             51|
| Twickenham|             55|             57|             48|             49|
|     Ealing|             52|             51|             50|             54|
|Hammersmith|             47|             40|             43|             51|
|   Richmond|             54|             40|             64|             53|
+-----------+---------------+---------------+---------------+---------------+
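
The four quarterly mnemonics used above each cover a single quarter (start and end are equal). A list like that can be built rather than typed out. This is a small convenience sketch; single_unit_periods is a hypothetical helper, not a jstark function.

```python
VALID_UNITS = {"d", "w", "m", "q", "y"}


def single_unit_periods(unit: str, count: int) -> list[str]:
    """Mnemonics for `count` consecutive one-unit periods, oldest first."""
    if unit not in VALID_UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return [f"{n}{unit}{n}" for n in range(count, 0, -1)]


print(single_unit_periods("q", 4))
# ['4q4', '3q3', '2q2', '1q1']
```

The resulting list can be passed as the feature periods argument in place of the literal list in the example above.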

Feature descriptions and references

Every feature carries a description in its column metadata:

from pprint import pprint
pprint([(c.name, c.metadata["description"]) for c in output_df.schema if c.name.endswith("1q1")])
[('BasketCount_1q1',
  'Distinct count of Baskets between 2021-10-01 and 2021-12-31'),
 ...]

You can also inspect what input columns each feature requires:

gf.references["BasketCount_1q1"]                   # ['Basket', 'Timestamp']
gf.references["CustomerCount_1q1"]                 # ['Customer', 'Timestamp']
gf.references["AvgGrossSpendPerBasket_1q1"]        # ['Basket', 'GrossSpend', 'Timestamp']

All features require a Timestamp column (TimestampType). Most require additional columns depending on what they measure.
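
The references mapping makes it possible to check, before running a job, whether an input has the columns a set of features needs. The sketch below uses a plain dictionary as a stand-in for gf.references, filled with the three entries shown above; missing_columns is a hypothetical helper, not part of jstark.

```python
# Stand-in for gf.references, copied from the examples above.
references = {
    "BasketCount_1q1": ["Basket", "Timestamp"],
    "CustomerCount_1q1": ["Customer", "Timestamp"],
    "AvgGrossSpendPerBasket_1q1": ["Basket", "GrossSpend", "Timestamp"],
}


def missing_columns(wanted_features, available_columns, references):
    """Return the input columns the wanted features need but the input lacks."""
    required = set()
    for feature in wanted_features:
        required.update(references[feature])
    return sorted(required - set(available_columns))


# Columns of a hypothetical input; with a real DataFrame this would be input_df.columns.
available = ["Basket", "Timestamp", "Store"]
print(missing_columns(["BasketCount_1q1", "AvgGrossSpendPerBasket_1q1"], available, references))
# ['GrossSpend']
```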

Features reference

Grocery features

Calling the following makes all of the grocery features listed below available:

GroceryFeatures(date(2022, 1, 1), ["3m1"])

| Feature | Description |
| --- | --- |
| ApproxBasketCount_3m1 | Approximate distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| ApproxCustomerCount_3m1 | Approximate distinct count of Customers between 2021-10-01 and 2021-12-31 |
| AverageBasketsPerMonth_3m1 | Average number of baskets per month between 2021-10-01 and 2021-12-31 |
| AvgDiscountPerBasket_3m1 | Average Discount per Basket between 2021-10-01 and 2021-12-31 |
| AvgGrossSpendPerBasket_3m1 | Average GrossSpend per Basket between 2021-10-01 and 2021-12-31 |
| AvgPurchaseCycle_3m1 | Average purchase cycle between 2021-10-01 and 2021-12-31 |
| AvgQuantityPerBasket_3m1 | Average Quantity per Basket between 2021-10-01 and 2021-12-31 |
| BasketCount_3m1 | Distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| BasketMonths_3m1 | Number of months in which at least one basket was purchased between 2021-10-01 and 2021-12-31 |
| ChannelCount_3m1 | Distinct count of Channels between 2021-10-01 and 2021-12-31 |
| Count_3m1 | Count of rows between 2021-10-01 and 2021-12-31 |
| CustomerCount_3m1 | Distinct count of Customers between 2021-10-01 and 2021-12-31 |
| CyclesSinceLastPurchase_3m1 | Cycles since last purchase between 2021-10-01 and 2021-12-31 |
| Discount_3m1 | Sum of Discount between 2021-10-01 and 2021-12-31 |
| EarliestPurchaseDate_3m1 | Earliest purchase date between 2021-10-01 and 2021-12-31 |
| GrossSpend_3m1 | Sum of GrossSpend between 2021-10-01 and 2021-12-31 |
| MaxGrossPrice_3m1 | Maximum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxGrossSpend_3m1 | Maximum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MaxNetPrice_3m1 | Maximum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxNetSpend_3m1 | Maximum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MinGrossPrice_3m1 | Minimum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinGrossSpend_3m1 | Minimum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MinNetPrice_3m1 | Minimum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinNetSpend_3m1 | Minimum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MostRecentPurchaseDate_3m1 | Most recent purchase date between 2021-10-01 and 2021-12-31 |
| NetSpend_3m1 | Sum of NetSpend between 2021-10-01 and 2021-12-31 |
| ProductCount_3m1 | Distinct count of Products between 2021-10-01 and 2021-12-31 |
| Quantity_3m1 | Sum of Quantity between 2021-10-01 and 2021-12-31 |
| RecencyDays_3m1 | Minimum number of days since occurrence between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| StoreCount_3m1 | Distinct count of Stores between 2021-10-01 and 2021-12-31 |

License

jstark is distributed under the terms of the MIT license.

Why "jstark"?

The name is phonetically similar to PySpark, is a homage to comic book character Jon Stark, and contains the initials of the original contributor (j, k & t).
