No project description provided

# Quickstart

If you haven't already, install the package via pip install anatomy, preferably in a new Anaconda environment with Python 3.9.

The anatomy package uses a simple workflow. An Anatomy object is initially estimated on your forecasting setup (using your data and your models), is then stored to disk, and can then be loaded at any future time without requiring re-estimation.

After initial estimation, an Anatomy can anatomize:

• forecasts produced by any combination of your models
• any loss or gain function applied to your forecasts
• an arbitrary subset of your forecasts

all of which requires no additional computational time.

### General structure

You may already have trained your models before you create the Anatomy, and the aggregate of all your models at all periods may be too large to fit into your RAM. During estimation, the Anatomy will therefore ask you for the specific model and dataset it needs at a given iteration by calling your mapping function:

from anatomy import *

def my_map(key: AnatomyModelProvider.PeriodKey) -> \
AnatomyModelProvider.PeriodValue:

train, test, model = ...  # load from somewhere or generate here

return AnatomyModelProvider.PeriodValue(train, test, model)


You wrap the mapping function in an AnatomyModelProvider alongside information about the forecasting application:

my_provider = AnatomyModelProvider(
n_periods=..., n_features=..., model_names=[...],
y_name=..., provider_fn=my_map
)


and finally create the Anatomy:

my_anatomy = Anatomy(provider=my_provider, n_iterations=...).precompute(
n_jobs=16, save_path="my_anatomy.bin"
)


After running the above, the Anatomy is estimated and stored in your working directory as my_anatomy.bin.

## Example:

To get started, we need a forecasting application. We use a linear DGP to generate our dataset consisting of 500 observations of the three predictors x_{0,1,2} and our target y:

# set random seed for reproducibility:
np.random.seed(1338)

xy = pd.DataFrame(np.random.normal(0, 1, (500, 3)), columns=["x_0", "x_1", "x_2"])
xy["y"] = xy.sum(axis=1) + np.random.normal(0, 1, 500)

# set a unique and monotonically increasing index (default index would suffice):
xy.index = pd.date_range("2021-04-19", "2022-08-31").map(lambda x: x.date())


For convenience, the AnatomySubsets includes a generator that splits your dataset into training and test sets according to your forecasting scheme. Here, we include 100 periods in our first training set, forecast the target of the next period with no gap between the training set and the forecast, extend our training set by one period, and repeat until we reach the end of our data:

subsets = AnatomySubsets.generate(
index=xy.index,
initial_window=100,
estimation_type=AnatomySubsets.EstimationType.EXPANDING,
periods=1,
gap=0
)


In this example, we have not yet trained our models. We thus do so directly in our mapping function:

def mapper(key: AnatomyModelProvider.PeriodKey) -> \
AnatomyModelProvider.PeriodValue:

train = xy.iloc[subsets.get_train_subset(key.period)]
test = xy.iloc[subsets.get_test_subset(key.period)]

if key.model_name == "ols":
model = train_ols(train.drop("y", axis=1), train["y"])
elif key.model_name == "rf":
model = train_rf(train.drop("y", axis=1), train["y"])

return AnatomyModelProvider.PeriodValue(train, test, model)


using train_ols and train_rf, which train a model and yield its prediction function wrapped in an AnatomyModel:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def train_ols(x_train: pd.DataFrame, y_train: pd.Series) -> AnatomyModel:
ols_model = LinearRegression().fit(x_train, y_train)

def pred_fn_ols(xs: np.ndarray) -> np.ndarray:
xs_df = pd.DataFrame(xs, columns=x_train.columns)
return np.array(ols_model.predict(xs_df)).flatten()

return AnatomyModel(pred_fn_ols)

def train_rf(x_train: pd.DataFrame, y_train: pd.Series) -> AnatomyModel:
rf_model = RandomForestRegressor(random_state=1338).fit(x_train, y_train)

def pred_fn_rf(xs: np.ndarray) -> np.ndarray:
xs_df = pd.DataFrame(xs, columns=x_train.columns)
return np.array(rf_model.predict(xs_df)).flatten()

return AnatomyModel(pred_fn_rf)


We now have all we need to train the models and estimate the Anatomy:

provider = AnatomyModelProvider(
n_periods=subsets.n_periods,
n_features=xy.shape[1]-1,
model_names=["ols", "rf"],
y_name="y",
provider_fn=mapper
)

anatomy = Anatomy(provider=provider, n_iterations=10).precompute(
n_jobs=16, save_path="anatomy.bin"
)


At this point, the Anatomy is stored as anatomy.bin in our working directory. We can load it at any later point using anatomy = Anatomy.load("anatomy.bin").

## Anatomizing

We can now use our estimated Anatomy to dismantle our forecasts. In this example, we are interested in the two models, rf and ols, as well as the equal-weighted combination of the two:

groups = {
"rf": ["rf"],
"ols": ["ols"],
"ols+rf": ["ols", "rf"]
}


### Anatomize the out-of-sample RÂ² of the forecasts:

To decompose the out-of-sample RÂ² of our forecasts produced by the two models and the combination, we compute the unconditional forecasts and provide a function transforming forecasts into out-of-sample RÂ² to the Anatomy:

prevailing_mean = np.array([
xy.iloc[subsets.get_train_subset(period=i)]["y"].mean()
for i in range(subsets.n_periods)
])

def transform(y_hat, y):
return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - prevailing_mean) ** 2)

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


This yields the change in out-of-sample RÂ² attributable to each predictor:

>>> df
base_contribution       x_0       x_1       x_2
rf     2021-07-28 -> 2022-08-31           0.000759  0.257240  0.238458  0.215062
ols    2021-07-28 -> 2022-08-31           0.000000  0.285329  0.267940  0.249870
ols+rf 2021-07-28 -> 2022-08-31           0.000383  0.279073  0.259201  0.240548


### ... the RMSE of the forecasts:

def transform(y_hat, y):
return np.sqrt(np.mean((y - y_hat) ** 2))

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


which yields the change in root mean squared error attributable to each predictor:

>>> df
base_contribution       x_0       x_1       x_2
rf     2021-07-28 -> 2022-08-31           2.105513 -0.351430 -0.326065 -0.296708
ols    2021-07-28 -> 2022-08-31           2.106313 -0.414488 -0.386125 -0.371149
ols+rf 2021-07-28 -> 2022-08-31           2.105910 -0.397903 -0.368193 -0.350080


### ... the MAE:

def transform(y_hat, y):
return np.mean(np.abs(y - y_hat))

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


which yields the change in mean absolute error attributable to each predictor:

>>> df
base_contribution       x_0       x_1       x_2
rf     2021-07-28 -> 2022-08-31           1.679359 -0.299382 -0.249591 -0.221651
ols    2021-07-28 -> 2022-08-31           1.679946 -0.345583 -0.300613 -0.288960
ols+rf 2021-07-28 -> 2022-08-31           1.679652 -0.330303 -0.283129 -0.262270


### ... the SE:

Note that the transform function in this case does not aggregate and returns a vector instead of a scalar. The Anatomy thus yields one decomposition per forecast, which is also known as a local (as opposed to global) decomposition.

def transform(y_hat, y):
return (y - y_hat) ** 2

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


which yields the change in squared error attributable to each predictor for each forecast:

>>> df
base_contribution       x_0       x_1       x_2
rf     2021-07-28           0.026485 -0.063311 -0.115985  0.163743
2021-07-29           0.021582  0.238569 -0.370448  0.694773
2021-07-30           2.451742 -2.702365  1.660915 -1.407192
...                              ...       ...       ...       ...


### ... or just the raw forecasts:

def transform(y_hat):
return y_hat

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


which yields the change in the forecast attributable to each predictor:

>>> df
base_contribution       x_0       x_1       x_2
rf     2021-07-28           0.070861  0.222932  0.360182 -0.315820
2021-07-29           0.070354  0.250389 -0.617779  0.984992
2021-07-30           0.071163 -1.514547  0.839562 -0.835136
...                              ...       ...       ...       ...


### The Efficiency property

During estimation the Anatomy checks that the individual attributions of the predictors to the forecasts sum up exactly to the forecasts produced by the models. The estimation would be aborted if efficiency does not hold.

Due to the efficiency property of Shapley values, summing the individual attributions yields the decomposed value(s) exactly. We can check that the Anatomy is consistet, for instance, by computing the RMSE from the decomposed raw forecasts, and by comparing that to the RMSE that we can compute from the decomposed RMSE.

Let's begin by decomposing the RMSE directly:

def transform(y_hat, y):
return np.sqrt(np.mean((y - y_hat) ** 2))

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


The RMSE is the sum of the individual attributions, rmse_a = df.sum(axis=1):

>>> rmse_a
rf      2021-07-28 -> 2022-08-31    1.131310
ols     2021-07-28 -> 2022-08-31    0.934551
ols+rf  2021-07-28 -> 2022-08-31    0.989733


Let's next decompose the raw forecasts and compute the RMSE for these:

def transform(y_hat):
return y_hat

df = anatomy.explain(
model_sets=AnatomyModelCombination(groups=groups),
transformer=AnatomyModelOutputTransformer(transform=transform)
)


The forecasts are recovered as the sum of the individual attributions, y_hat = df.sum(axis=1):

>>> y_hat
rf      2021-07-28    0.338154
2021-07-29    0.687956
2021-07-30   -1.438958
...


From the forecasts, we can compute the RMSE:

y_true = np.hstack([
xy.iloc[subsets.get_test_subset(period=i)]["y"]
for i in range(subsets.n_periods)
])

rmse_b = pd.Series({
key: np.sqrt(np.mean((y_true - y_hat.xs(key)) ** 2))
for key in groups.keys()
})


This yields the same RMSE as the sum of the attributions from the RMSE decomposition:

>>> rmse_b
rf        1.131310
ols       0.934551
ols+rf    0.989733


## Project details

Uploaded source