jstark


A PySpark library for generating time-based features for machine learning. All features are calculated relative to an as at date, enabling point-in-time feature engineering over configurable time periods.

Feature period mnemonics

Feature names end with a mnemonic describing the time window. The format is {start}{unit}{end}, where start and end count how many units before the as at date the window opens and closes, and unit is one of d (days), w (weeks), m (months), q (quarters) or y (years).

For example, BasketCount_3m1 is the distinct count of baskets from 3 months before to 1 month before the as at date. With an as at date of 2022-01-01, that window runs from 2021-10-01 to 2021-12-31.
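
To make the window arithmetic concrete, here is a minimal sketch of how a month or quarter mnemonic could be turned into a date range. It is not jstark's implementation: period_window is a hypothetical helper, and it assumes that windows snap to whole calendar months and quarters, which matches the dates shown in the feature descriptions for an as at date of 2022-01-01. Days, weeks and years are left out because this README does not pin down their boundaries.

```python
import re
from datetime import date, timedelta


def period_window(as_at: date, mnemonic: str) -> tuple[date, date]:
    """Return (first_day, last_day) for a month or quarter mnemonic.

    Hypothetical helper for illustration only; assumes whole calendar
    months/quarters counted back from the period containing as_at.
    """
    match = re.fullmatch(r"(\d+)([mq])(\d+)", mnemonic)
    if match is None:
        raise ValueError(f"unsupported mnemonic: {mnemonic!r}")
    start, unit, end = int(match[1]), match[2], int(match[3])
    if start < end:
        raise ValueError("start must be at least as far back as end")

    months_per_unit = 1 if unit == "m" else 3
    # Index months as year * 12 + zero-based month so arithmetic is simple.
    base = as_at.year * 12 + (as_at.month - 1)
    if unit == "q":
        base -= (as_at.month - 1) % 3  # snap to first month of the quarter

    first = base - start * months_per_unit
    after_last = base - (end - 1) * months_per_unit  # month after the window

    first_day = date(first // 12, first % 12 + 1, 1)
    last_day = date(after_last // 12, after_last % 12 + 1, 1) - timedelta(days=1)
    return first_day, last_day


window_3m1 = period_window(date(2022, 1, 1), "3m1")
window_1q1 = period_window(date(2022, 1, 1), "1q1")
```

Under these assumptions both 3m1 and 1q1 resolve to 2021-10-01 through 2021-12-31 for this as at date, the same range that appears in the descriptions further down.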

Multiple periods can be calculated in a single Spark job:

from datetime import date
from jstark.grocery import GroceryFeatures

gf = GroceryFeatures(as_at=date(2022, 1, 1), feature_periods=["3m1", "6m4"])
output_df = input_df.groupBy("Store").agg(*gf.features)

This produces BasketCount_3m1, BasketCount_6m4, and every other feature for both periods. See the Features reference for a list of all available features.
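
When many adjacent windows are needed, the period list can be built rather than typed out. The sketch below is plain Python and does not touch Spark; consecutive_periods is a hypothetical helper, not part of jstark, and the column names it derives simply follow the {feature}_{mnemonic} naming described above.

```python
def consecutive_periods(unit: str, count: int) -> list[str]:
    """Return single-unit windows stepping back in time, e.g. ['1m1', '2m2']."""
    if unit not in {"d", "w", "m", "q", "y"}:
        raise ValueError(f"unknown unit: {unit!r}")
    if count < 1:
        raise ValueError("count must be at least 1")
    return [f"{n}{unit}{n}" for n in range(1, count + 1)]


# The four most recent whole quarters, most recent first.
periods = consecutive_periods("q", 4)

# Column names to expect for one feature across those periods.
basket_columns = [f"BasketCount_{period}" for period in periods]
```

The resulting list can be passed as feature_periods to GroceryFeatures in the same way as a hand-written list.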

Quick start

Prerequisites: PySpark requires a Java runtime. On macOS, install one with brew install openjdk@11.
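
If it is unclear whether a Java runtime is already visible to Python, a quick check along the following lines may help. This is a sketch using only the standard library; find_java is a hypothetical helper, and it assumes Spark will look at JAVA_HOME first and then fall back to java on the PATH. On Windows the executable is java.exe, which this simple JAVA_HOME check does not account for.

```python
import os
import shutil
from typing import Optional


def find_java() -> Optional[str]:
    """Return a path to a java executable, or None if none is found."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to whatever "java" resolves to on the PATH.
    return shutil.which("java")


java_path = find_java()
if java_path is None:
    print("No Java runtime found; install one before using PySpark.")
else:
    print(f"Java found at {java_path}")
```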

pip install "jstark[faker]"

The faker extra installs Faker, which is needed for the sample data generator used below. If you don't need sample data, pip install jstark is sufficient.

from datetime import date
from jstark.sample.transactions import FakeGroceryTransactions
from jstark.grocery import GroceryFeatures

input_df = FakeGroceryTransactions().df
gf = GroceryFeatures(date(2022, 1, 1), ["4q4", "3q3", "2q2", "1q1"])
output_df = input_df.groupBy("Store").agg(*gf.features)
output_df.select(
    "Store", "BasketCount_4q4", "BasketCount_3q3", "BasketCount_2q2", "BasketCount_1q1"
).show()
+-----------+---------------+---------------+---------------+---------------+
|      Store|BasketCount_4q4|BasketCount_3q3|BasketCount_2q2|BasketCount_1q1|
+-----------+---------------+---------------+---------------+---------------+
|    Staines|             47|             46|             48|             51|
| Twickenham|             55|             57|             48|             49|
|     Ealing|             52|             51|             50|             54|
|Hammersmith|             47|             40|             43|             51|
|   Richmond|             54|             40|             64|             53|
+-----------+---------------+---------------+---------------+---------------+

Feature descriptions and references

Every feature carries a description in its column metadata:

from pprint import pprint
pprint([(c.name, c.metadata["description"]) for c in output_df.schema if c.name.endswith("1q1")])
[('BasketCount_1q1',
  'Distinct count of Baskets between 2021-10-01 and 2021-12-31'),
 ...]
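
The same pattern can be wrapped up to build a small data dictionary. The sketch below relies only on each schema field exposing name and metadata, as in the snippet above; describe_features is a hypothetical helper, and the schema here is a stand-in built with SimpleNamespace so that the example runs without Spark. The guard on "description" is an assumption to cope with columns, such as a grouping column, that may carry no description.

```python
from types import SimpleNamespace


def describe_features(schema, suffix=""):
    """Map column name to description for columns ending with suffix."""
    return {
        field.name: field.metadata["description"]
        for field in schema
        if field.name.endswith(suffix) and "description" in field.metadata
    }


# Stand-in for output_df.schema; real fields would come from Spark.
fake_schema = [
    SimpleNamespace(name="Store", metadata={}),
    SimpleNamespace(
        name="BasketCount_1q1",
        metadata={
            "description": "Distinct count of Baskets between 2021-10-01 and 2021-12-31"
        },
    ),
]

descriptions = describe_features(fake_schema, "1q1")
```

With a real DataFrame, describe_features(output_df.schema, "1q1") would be the equivalent call.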

You can also inspect which input columns each feature requires:

gf.references["BasketCount_1q1"]                   # ['Basket', 'Timestamp']
gf.references["CustomerCount_1q1"]                 # ['Customer', 'Timestamp']
gf.references["AvgGrossSpendPerBasket_1q1"]        # ['Basket', 'GrossSpend', 'Timestamp']

All features require a Timestamp column (TimestampType). Most require additional columns depending on what they measure.
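
One use for these references is to check an input DataFrame before running a job. The following is a minimal sketch, assuming that references behaves like a mapping from feature name to a list of column names, as the lookups above suggest; missing_columns is a hypothetical helper, and the dictionary below just repeats the three examples shown above so that the code runs without Spark.

```python
def missing_columns(references, features, available):
    """Return the sorted input columns that the features need but are absent."""
    required = set()
    for feature in features:
        required.update(references[feature])
    return sorted(required - set(available))


# Copied from the examples above; a real job would pass gf.references.
references = {
    "BasketCount_1q1": ["Basket", "Timestamp"],
    "CustomerCount_1q1": ["Customer", "Timestamp"],
    "AvgGrossSpendPerBasket_1q1": ["Basket", "GrossSpend", "Timestamp"],
}

missing = missing_columns(
    references,
    features=["BasketCount_1q1", "AvgGrossSpendPerBasket_1q1"],
    available=["Basket", "Timestamp", "Store"],  # e.g. input_df.columns
)
```

Here missing contains only GrossSpend, the one required column that the available list lacks.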

Features reference

Grocery features

The grocery features listed below are those produced by the following call:

GroceryFeatures(date(2022, 1, 1), ["3m1"])

| Feature | Description |
| --- | --- |
| ApproxBasketCount_3m1 | Approximate distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| ApproxCustomerCount_3m1 | Approximate distinct count of Customers between 2021-10-01 and 2021-12-31 |
| AverageBasketsPerMonth_3m1 | Average number of baskets per month between 2021-10-01 and 2021-12-31 |
| AvgDiscountPerBasket_3m1 | Average Discount per Basket between 2021-10-01 and 2021-12-31 |
| AvgGrossSpendPerBasket_3m1 | Average GrossSpend per Basket between 2021-10-01 and 2021-12-31 |
| AvgPurchaseCycle_3m1 | Average purchase cycle between 2021-10-01 and 2021-12-31 |
| AvgQuantityPerBasket_3m1 | Average Quantity per Basket between 2021-10-01 and 2021-12-31 |
| BasketCount_3m1 | Distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| BasketMonths_3m1 | Number of months in which at least one basket was purchased between 2021-10-01 and 2021-12-31 |
| ChannelCount_3m1 | Distinct count of Channels between 2021-10-01 and 2021-12-31 |
| Count_3m1 | Count of rows between 2021-10-01 and 2021-12-31 |
| CustomerCount_3m1 | Distinct count of Customers between 2021-10-01 and 2021-12-31 |
| CyclesSinceLastPurchase_3m1 | Cycles since last purchase between 2021-10-01 and 2021-12-31 |
| Discount_3m1 | Sum of Discount between 2021-10-01 and 2021-12-31 |
| EarliestPurchaseDate_3m1 | Earliest purchase date between 2021-10-01 and 2021-12-31 |
| GrossSpend_3m1 | Sum of GrossSpend between 2021-10-01 and 2021-12-31 |
| MaxGrossPrice_3m1 | Maximum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxGrossSpend_3m1 | Maximum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MaxNetPrice_3m1 | Maximum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxNetSpend_3m1 | Maximum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MinGrossPrice_3m1 | Minimum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinGrossSpend_3m1 | Minimum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MinNetPrice_3m1 | Minimum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinNetSpend_3m1 | Minimum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MostRecentPurchaseDate_3m1 | Most recent purchase date between 2021-10-01 and 2021-12-31 |
| NetSpend_3m1 | Sum of NetSpend between 2021-10-01 and 2021-12-31 |
| ProductCount_3m1 | Distinct count of Products between 2021-10-01 and 2021-12-31 |
| Quantity_3m1 | Sum of Quantity between 2021-10-01 and 2021-12-31 |
| RecencyDays_3m1 | Minimum number of days since occurrence between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| StoreCount_3m1 | Distinct count of Stores between 2021-10-01 and 2021-12-31 |

License

jstark is distributed under the terms of the MIT license.

Why "jstark"?

The name is phonetically similar to PySpark, is a homage to comic book character Jon Stark, and contains the initials of the original contributor (j, k & t).

