jstark
A PySpark library for generating time-based features for machine learning. All features are calculated relative to an as at date, enabling point-in-time feature engineering over configurable time periods.
Feature period mnemonics
Feature names end with a mnemonic describing the time window. The format is {start}{unit}{end}, where start and end are the
number of units before the as at date at which the window opens and closes, and unit is one of d (days), w (weeks), m (months), q (quarters) or y (years).
For example, BasketCount_3m1 is the distinct count of baskets from 3 months before to 1 month before the as at date.
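To make the mnemonic format concrete, here is a minimal sketch that splits a mnemonic into its three parts and works out the date window for month-based periods. It is not jstark's own code: `Period`, `parse_mnemonic` and `month_window` are hypothetical names, the rule that start must not be smaller than end is an assumption, and the calendar-month alignment is an assumption chosen because it reproduces the 3m1 example above.

```python
import re
from datetime import date, timedelta
from typing import NamedTuple, Tuple


class Period(NamedTuple):
    start: int  # units before the as at date at which the window opens
    unit: str   # one of d, w, m, q, y
    end: int    # units before the as at date at which the window closes


_MNEMONIC = re.compile(r"^(\d+)([dwmqy])(\d+)$")


def parse_mnemonic(mnemonic: str) -> Period:
    """Split a mnemonic such as '3m1' into (start, unit, end)."""
    match = _MNEMONIC.match(mnemonic)
    if match is None:
        raise ValueError(f"not a valid period mnemonic: {mnemonic!r}")
    start, unit, end = int(match.group(1)), match.group(2), int(match.group(3))
    # Assumption: the window must open no later than it closes.
    if start < end:
        raise ValueError(f"start ({start}) must not be smaller than end ({end})")
    return Period(start, unit, end)


def month_window(as_at: date, period: Period) -> Tuple[date, date]:
    """First and last day of a month-based window (assumes calendar months)."""
    if period.unit != "m":
        raise ValueError("this sketch only handles month-based periods")
    # Number months consecutively so that subtraction crosses year boundaries.
    as_at_index = as_at.year * 12 + (as_at.month - 1)
    start_index = as_at_index - period.start
    after_end_index = as_at_index - period.end + 1
    first_day = date(start_index // 12, start_index % 12 + 1, 1)
    # Last day of the closing month is the day before the next month starts.
    last_day = date(after_end_index // 12, after_end_index % 12 + 1, 1) - timedelta(days=1)
    return first_day, last_day


first, last = month_window(date(2022, 1, 1), parse_mnemonic("3m1"))
print(first, last)  # 2021-10-01 2021-12-31
```

The printed window matches the dates that the feature descriptions report for 3m1 with an as at date of 2022-01-01. Other units and as at dates that fall mid-month may be handled differently by the library.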
Multiple periods can be calculated in a single Spark job:
from datetime import date
from jstark.grocery import GroceryFeatures
gf = GroceryFeatures(as_at=date(2022, 1, 1), feature_periods=["3m1", "6m4"])
output_df = input_df.groupBy("Store").agg(*gf.features)
This produces BasketCount_3m1, BasketCount_6m4, and every other feature for both periods. See the
Features reference for a list of all available features.
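When many periods are needed, the list of mnemonics can be generated rather than typed out. The helper below is a small sketch, not part of jstark: `consecutive_periods` is a hypothetical name, and it only builds strings in the {start}{unit}{end} format described above.

```python
from typing import List


def consecutive_periods(unit: str, count: int) -> List[str]:
    """Build single-unit windows such as 4q4, 3q3, 2q2, 1q1 (oldest first)."""
    if unit not in ("d", "w", "m", "q", "y"):
        raise ValueError(f"unknown unit: {unit!r}")
    if count < 1:
        raise ValueError("count must be at least 1")
    # Each window opens and closes n units before the as at date.
    return [f"{n}{unit}{n}" for n in range(count, 0, -1)]


print(consecutive_periods("q", 4))  # ['4q4', '3q3', '2q2', '1q1']
```

The resulting list can be passed as the feature_periods argument of GroceryFeatures in place of a hand-written list.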
Quick start
Prerequisites: PySpark requires a Java runtime. On macOS, install one with brew install openjdk@11.
pip install jstark[faker]
The faker extra installs Faker, which is needed for the sample data generator used
below. If you don't need sample data, pip install jstark is sufficient.
from datetime import date
from jstark.sample.transactions import FakeGroceryTransactions
from jstark.grocery import GroceryFeatures
input_df = FakeGroceryTransactions().df
gf = GroceryFeatures(date(2022, 1, 1), ["4q4", "3q3", "2q2", "1q1"])
output_df = input_df.groupBy("Store").agg(*gf.features)
output_df.select(
"Store", "BasketCount_4q4", "BasketCount_3q3", "BasketCount_2q2", "BasketCount_1q1"
).show()
+-----------+---------------+---------------+---------------+---------------+
|      Store|BasketCount_4q4|BasketCount_3q3|BasketCount_2q2|BasketCount_1q1|
+-----------+---------------+---------------+---------------+---------------+
|    Staines|             47|             46|             48|             51|
| Twickenham|             55|             57|             48|             49|
|     Ealing|             52|             51|             50|             54|
|Hammersmith|             47|             40|             43|             51|
|   Richmond|             54|             40|             64|             53|
+-----------+---------------+---------------+---------------+---------------+
Feature descriptions and references
Every feature carries a description in its column metadata:
from pprint import pprint
pprint([(c.name, c.metadata["description"]) for c in output_df.schema if c.name.endswith("1q1")])
[('BasketCount_1q1',
'Distinct count of Baskets between 2021-10-01 and 2021-12-31'),
...]
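The same metadata can be collected into a dictionary, for example to build a data dictionary for downstream users. This is a sketch under stated assumptions: `feature_descriptions` is a hypothetical helper, it relies only on the `name` and `metadata` attributes used in the snippet above, and the `SimpleNamespace` objects are stand-ins for schema fields so that the example runs without Spark.

```python
from types import SimpleNamespace
from typing import Dict, Iterable


def feature_descriptions(schema: Iterable, suffix: str = "") -> Dict[str, str]:
    """Map column name to description for fields that carry one."""
    return {
        field.name: field.metadata["description"]
        for field in schema
        if field.name.endswith(suffix) and "description" in field.metadata
    }


# Stand-ins for schema fields; with real data pass output_df.schema instead.
fake_schema = [
    SimpleNamespace(name="Store", metadata={}),
    SimpleNamespace(
        name="BasketCount_1q1",
        metadata={"description": "Distinct count of Baskets between 2021-10-01 and 2021-12-31"},
    ),
]

print(feature_descriptions(fake_schema, suffix="1q1"))
```

Fields without a description, such as the grouping column in the stand-in schema, are skipped rather than raising a KeyError.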
You can also inspect what input columns each feature requires:
gf.references["BasketCount_1q1"] # ['Basket', 'Timestamp']
gf.references["CustomerCount_1q1"] # ['Customer', 'Timestamp']
gf.references["AvgGrossSpendPerBasket_1q1"] # ['Basket', 'GrossSpend', 'Timestamp']
All features require a Timestamp column (TimestampType). Most require additional columns depending on what they measure.
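The references can be used to check an input dataset before running the aggregation. The sketch below assumes that references behaves like a mapping from feature name to a list of column names, as the examples above suggest. `missing_input_columns` is a hypothetical helper, and the sample dictionary simply copies those examples so that the code runs without Spark.

```python
from typing import Dict, Iterable, List, Mapping


def missing_input_columns(
    references: Mapping[str, Iterable[str]], available_columns: Iterable[str]
) -> Dict[str, List[str]]:
    """Return, per feature, the required columns that the input lacks."""
    available = set(available_columns)
    return {
        feature: sorted(set(required) - available)
        for feature, required in references.items()
        if not set(required) <= available
    }


# Copied from the examples above; with real data pass gf.references instead.
sample_references = {
    "BasketCount_1q1": ["Basket", "Timestamp"],
    "CustomerCount_1q1": ["Customer", "Timestamp"],
    "AvgGrossSpendPerBasket_1q1": ["Basket", "GrossSpend", "Timestamp"],
}

# Stand-in for input_df.columns on a dataset without Customer or GrossSpend.
print(missing_input_columns(sample_references, ["Store", "Basket", "Timestamp"]))
```

An empty result means every listed feature has the columns it needs. Features that appear in the result can be dropped from the selection, or the missing columns added to the input.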
Features reference
Grocery features
The table below lists every Grocery feature produced by the following call:
GroceryFeatures(date(2022, 1, 1), ["3m1"])
| Feature | Description |
|---|---|
| ApproxBasketCount_3m1 | Approximate distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| ApproxCustomerCount_3m1 | Approximate distinct count of Customers between 2021-10-01 and 2021-12-31 |
| AverageBasketsPerMonth_3m1 | Average number of baskets per month between 2021-10-01 and 2021-12-31 |
| AvgDiscountPerBasket_3m1 | Average Discount per Basket between 2021-10-01 and 2021-12-31 |
| AvgGrossSpendPerBasket_3m1 | Average GrossSpend per Basket between 2021-10-01 and 2021-12-31 |
| AvgPurchaseCycle_3m1 | Average purchase cycle between 2021-10-01 and 2021-12-31 |
| AvgQuantityPerBasket_3m1 | Average Quantity per Basket between 2021-10-01 and 2021-12-31 |
| BasketCount_3m1 | Distinct count of Baskets between 2021-10-01 and 2021-12-31 |
| BasketMonths_3m1 | Number of months in which at least one basket was purchased between 2021-10-01 and 2021-12-31 |
| ChannelCount_3m1 | Distinct count of Channels between 2021-10-01 and 2021-12-31 |
| Count_3m1 | Count of rows between 2021-10-01 and 2021-12-31 |
| CustomerCount_3m1 | Distinct count of Customers between 2021-10-01 and 2021-12-31 |
| CyclesSinceLastPurchase_3m1 | Cycles since last purchase between 2021-10-01 and 2021-12-31 |
| Discount_3m1 | Sum of Discount between 2021-10-01 and 2021-12-31 |
| EarliestPurchaseDate_3m1 | Earliest purchase date between 2021-10-01 and 2021-12-31 |
| GrossSpend_3m1 | Sum of GrossSpend between 2021-10-01 and 2021-12-31 |
| MaxGrossPrice_3m1 | Maximum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxGrossSpend_3m1 | Maximum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MaxNetPrice_3m1 | Maximum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MaxNetSpend_3m1 | Maximum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MinGrossPrice_3m1 | Minimum of (GrossSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinGrossSpend_3m1 | Minimum GrossSpend value between 2021-10-01 and 2021-12-31 |
| MinNetPrice_3m1 | Minimum of (NetSpend / Quantity) between 2021-10-01 and 2021-12-31 |
| MinNetSpend_3m1 | Minimum of NetSpend value between 2021-10-01 and 2021-12-31 |
| MostRecentPurchaseDate_3m1 | Most recent purchase date between 2021-10-01 and 2021-12-31 |
| NetSpend_3m1 | Sum of NetSpend between 2021-10-01 and 2021-12-31 |
| ProductCount_3m1 | Distinct count of Products between 2021-10-01 and 2021-12-31 |
| Quantity_3m1 | Sum of Quantity between 2021-10-01 and 2021-12-31 |
| RecencyDays_3m1 | Minimum number of days since occurrence between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedApproxBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the approximate number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths90_3m1 | Exponentially weighted moving average, with smoothing factor of 0.9, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths95_3m1 | Exponentially weighted moving average, with smoothing factor of 0.95, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| RecencyWeightedBasketMonths99_3m1 | Exponentially weighted moving average, with smoothing factor of 0.99, of the number of baskets per month between 2021-10-01 and 2021-12-31 |
| StoreCount_3m1 | Distinct count of Stores between 2021-10-01 and 2021-12-31 |
License
jstark is distributed under the terms of the MIT license.
Why "jstark"?
The name is phonetically similar to PySpark, is a homage to comic book character Jon Stark, and contains the initials of the original contributor (j, k & t).