jstark


A PySpark library for generating time-based features for machine learning. All features are calculated relative to an "as at" date, which enables point-in-time feature engineering over configurable time periods.

Feature period mnemonics

Feature names end with a mnemonic describing the time window. The format is {start}{unit}{end}, where start and end are the number of units before the "as at" date at which the window begins and ends, and the unit is one of d (days), w (weeks), m (months), q (quarters) or y (years).

For example, BasketCount_3m1 is the distinct count of baskets from 3 months before to 1 month before the "as at" date.
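As a rough illustration of how a mnemonic maps to dates, the sketch below parses a mnemonic and computes the window for month-based periods only. The helper names parse_mnemonic and month_window are hypothetical and are not part of jstark. The sketch assumes windows align to whole calendar months, which matches the 3m1 dates shown in the Features reference for an "as at" date of 2022-01-01 but is not confirmed here for other dates or units.

```python
import re
from datetime import date, timedelta


def parse_mnemonic(mnemonic):
    """Split a mnemonic such as '3m1' into (start, unit, end)."""
    match = re.fullmatch(r"(\d+)([dwmqy])(\d+)", mnemonic)
    if match is None:
        raise ValueError(f"not a feature period mnemonic: {mnemonic!r}")
    start, unit, end = match.groups()
    return int(start), unit, int(end)


def month_window(as_at, start, end):
    """Inclusive date window for a month-based period.

    Assumption (not taken from jstark): whole calendar months,
    counted back from the month that contains as_at.
    """
    def first_of_month(months_back):
        index = as_at.year * 12 + (as_at.month - 1) - months_back
        return date(index // 12, index % 12 + 1, 1)

    window_start = first_of_month(start)
    # Last day of the month that lies `end` months back.
    window_end = first_of_month(end - 1) - timedelta(days=1)
    return window_start, window_end


start, unit, end = parse_mnemonic("3m1")
print(month_window(date(2022, 1, 1), start, end))
```

For 3m1 and an "as at" date of 2022-01-01 this yields 2021-10-01 to 2021-12-31, the same range that appears in the feature descriptions.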

Multiple periods can be calculated in a single Spark job:

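from datetime import date
from jstark.grocery import GroceryFeatures

# input_df is an existing Spark DataFrame of transactions (not defined here)
gf = GroceryFeatures(as_at=date(2022, 1, 1), feature_periods=["3m1", "6m4"])
# Aggregate every feature for both periods in one pass, one row per store
output_df = input_df.groupBy("Store").agg(*gf.features)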

This produces BasketCount_3m1, BasketCount_6m4, and every other feature for both periods. See the Features reference for a list of all available features.
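The naming scheme can be sketched in plain Python: each feature name is combined with each period mnemonic. The helper expected_feature_names is hypothetical and only illustrates the {feature}_{period} pattern; the order in which jstark itself emits columns is not assumed.

```python
from itertools import product


def expected_feature_names(base_names, feature_periods):
    """Combine every base feature name with every period mnemonic."""
    return [
        f"{base}_{period}"
        for base, period in product(base_names, feature_periods)
    ]


names = expected_feature_names(["BasketCount", "CustomerCount"], ["3m1", "6m4"])
print(names)
```

A list like this can be handy for selecting a known subset of columns from the aggregated DataFrame.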

Quick start

Prerequisites: PySpark requires a Java runtime. On macOS, install one with: brew install openjdk@11.

pip install jstark[faker]

The faker extra installs Faker, which is needed for the sample data generator used below. If you don't need sample data, pip install jstark is sufficient.

from datetime import date
from jstark.sample.transactions import FakeGroceryTransactions
from jstark.grocery import GroceryFeatures

input_df = FakeGroceryTransactions().df
gf = GroceryFeatures(date(2022, 1, 1), ["4q4", "3q3", "2q2", "1q1"])
output_df = input_df.groupBy("Store").agg(*gf.features)
output_df.select(
    "Store", "BasketCount_4q4", "BasketCount_3q3", "BasketCount_2q2", "BasketCount_1q1"
).show()
+-----------+---------------+---------------+---------------+---------------+
|      Store|BasketCount_4q4|BasketCount_3q3|BasketCount_2q2|BasketCount_1q1|
+-----------+---------------+---------------+---------------+---------------+
|    Staines|             47|             46|             48|             51|
| Twickenham|             55|             57|             48|             49|
|     Ealing|             52|             51|             50|             54|
|Hammersmith|             47|             40|             43|             51|
|   Richmond|             54|             40|             64|             53|
+-----------+---------------+---------------+---------------+---------------+

Feature descriptions and references

Every feature carries a description in its column metadata:

from pprint import pprint
pprint([(c.name, c.metadata["description"]) for c in output_df.schema if c.name.endswith("1q1")])
[('BasketCount_1q1',
  'Distinct count of Baskets between 2021-10-01 and 2021-12-31'),
 ...]

You can also inspect which input columns each feature requires:

gf.references["BasketCount_1q1"]                   # ['Basket', 'Timestamp']
gf.references["CustomerCount_1q1"]                 # ['Customer', 'Timestamp']
gf.references["AvgGrossSpendPerBasket_1q1"]        # ['Basket', 'GrossSpend', 'Timestamp']

All features require a Timestamp column (TimestampType). Most require additional columns depending on what they measure.
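One way to use this information is to check an input DataFrame's columns before aggregating. The sketch below is a minimal, pure-Python illustration: missing_columns is a hypothetical helper, and a plain dict (populated with the three entries shown above) stands in for gf.references, on the assumption that it behaves like a mapping from feature name to a list of column names.

```python
def missing_columns(references, available_columns):
    """Return {feature: [absent columns]} for features that cannot be computed."""
    available = set(available_columns)
    missing = {}
    for feature, required in references.items():
        absent = sorted(set(required) - available)
        if absent:
            missing[feature] = absent
    return missing


# Stand-in for gf.references, using the values listed above.
references = {
    "BasketCount_1q1": ["Basket", "Timestamp"],
    "CustomerCount_1q1": ["Customer", "Timestamp"],
    "AvgGrossSpendPerBasket_1q1": ["Basket", "GrossSpend", "Timestamp"],
}

result = missing_columns(references, ["Store", "Basket", "Timestamp"])
print(result)
```

With a real DataFrame, the second argument would be input_df.columns.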

Features reference

Grocery features

The following Grocery features are available when calling:

GroceryFeatures(date(2022, 1, 1), ["3m1"])

| Feature | Description |
| --- | --- |
| ApproxBasketCount_3m1 | Approximate distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| ApproxCustomerCount_3m1 | Approximate distinct count of Customers between 2021-10-01 and 2021-12-31 |
| AverageBasketsPerMonth_3m1 | Average number of baskets per month between 2021-10-01 and 2021-12-31 |
| AvgDiscountPerBasket_3m1 | Average Discount per Basket between 2021-10-01 and 2021-12-31 |
| AvgGrossSpendPerBasket_3m1 | Average GrossSpend per Basket between 2021-10-01 and 2021-12-31 |
| AvgPurchaseCycle_3m1 | Average purchase cycle between 2021-10-01 and 2021-12-31 |
| AvgQuantityPerBasket_3m1 | Average Quantity per Basket between 2021-10-01 and 2021-12-31 |
| BasketCount_3m1 | Distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| BasketMonths_3m1 | Number of months in which at least one basket was purchased between 2021-10-01 and 2021-12-31 |
| ChannelCount_3m1 | Distinct count of Channels between 2021-10-01 and 2021-12-31 |
| Count_3m1 | Count of rows between 2021-10-01 and 2021-12-31 |
| CustomerCount_3m1 | Distinct count of Customers between 2021-10-01 and 2021-12-31 |
| CyclesSinceLastPurchase_3m1 | Cycles since last purchase between 2021-10-01 and 2021-12-31 |
| Discount_3m1 | Sum of Discount between 2021-10-01 and 2021-12-31 |
| EarliestPurchaseDate_3m1 | Earliest purchase date between 2021-10-01 and 2021-12-31 |
| GrossSpend_3m1 | Sum of GrossSpend between 2021-10-01 and 2021-12-31 |
| MaxGrossPrice_3m1 | Maximum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxGrossSpend_3m1 | Maximum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MaxNetPrice_3m1 | Maximum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxNetSpend_3m1 | Maximum NetSpend value between 2021-10-01 and 2021-12-31 |
| MinGrossPrice_3m1 | Minimum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinGrossSpend_3m1 | Minimum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MinNetPrice_3m1 | Minimum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinNetSpend_3m1 | Minimum NetSpend value between 2021-10-01 and 2021-12-31 |
| MostRecentPurchaseDate_3m1 | Most recent purchase date between 2021-10-01 and 2021-12-31 |
| NetSpend_3m1 | Sum of NetSpend between 2021-10-01 and 2021-12-31 |
| ProductCount_3m1 | Distinct count of Products between 2021-10-01 and 2021-12-31 |
| Quantity_3m1 | Sum of Quantity between 2021-10-01 and 2021-12-31 |
| RecencyDays_3m1 | Minimum number of days since occurrence between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| StoreCount_3m1 | Distinct count of Stores between 2021-10-01 and 2021-12-31 |

License

jstark is distributed under the terms of the MIT license.

Why "jstark"?

The name is phonetically similar to PySpark, is a homage to comic book character Jon Stark, and contains the initials of the original contributor (j, k & t).
