Python replication of Stata's shapley2: Shapley-Owen decomposition for regression fit statistics
pyshapley2
Python replication of Stata's shapley2 command (Chavez Juarez, 2013).
Computes the Shapley-Owen decomposition of any regression fit statistic (R², adjusted R², log-likelihood, AIC, …) across independent variables or user-defined variable groups, with optional parallel computation support.
Installation
# Core (serial only)
pip install pyshapley2
# With parallel support (recommended)
pip install "pyshapley2[parallel]"
# With all optional features
pip install "pyshapley2[all]"
Optional extras:
| Extra | Installs | Needed for |
|---|---|---|
| parallel | joblib | n_jobs != 1 |
| plot | matplotlib | .plot() |
| progress | tqdm | verbose=1 |
| all | all of above | everything |
| dev | above + pytest, ruff | development |
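Because the extras are optional, it can help to check which of them are importable before relying on n_jobs, .plot(), or verbose=1. The sketch below uses only the standard library. The helper name available_extras is hypothetical and not part of pyshapley2, and the mapping simply restates the table above.

```python
from importlib.util import find_spec

# Extra name -> module it installs (restating the table above).
EXTRA_MODULES = {"parallel": "joblib", "plot": "matplotlib", "progress": "tqdm"}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra: True/False} depending on whether its module is importable."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:9s} {'installed' if ok else 'missing'}")
```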
Quick Start
import pandas as pd
from pyshapley2 import shapley2
# Sample data
df = pd.read_csv("your_data.csv")
# Basic R² decomposition
result = shapley2(df, depvar="wage", indepvars=["edu", "exp", "tenure"])
result.summary()
Output (1:1 replica of Stata's table format):
Shapley-Owen decomposition | depvar: wage | stat: r2 | command: ols
Observations: 500 | Subsets: 8 | K=3
Factor │ Shapley value │ Per cent │Shapley value │ Per cent
│ (estimate) │(estimate) │ (normalized) │(normalized)
───────────┼───────────────┼───────────┼──────────────┼─────────────
edu │ 0.35420 │ 51.23 % │ 0.31876 │ 46.12 %
exp │ 0.27816 │ 40.25 % │ 0.25034 │ 36.21 %
tenure │ 0.05918 │ 8.56 % │ 0.05326 │ 7.70 %
───────────┼───────────────┼───────────┼──────────────┼─────────────
Residual │ -0.00204 │ -0.04 % │ │
───────────┼───────────────┼───────────┼──────────────┼─────────────
TOTAL │ 0.68954 │ 100.00 % │ 0.68954 │ 100.00 %
───────────┼───────────────┼───────────┼──────────────┼─────────────
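The Quick Start reads your_data.csv, which is a placeholder. To try the call without real data, a synthetic file with the same column names can be generated as sketched below. The helper make_sample_csv, the coefficients, and the noise level are all made up for illustration, so the resulting decomposition will not reproduce the numbers in the table above.

```python
import csv
import random


def make_sample_csv(path, n=500, seed=0):
    """Write n synthetic rows with columns wage, edu, exp, tenure."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["wage", "edu", "exp", "tenure"])
        for _ in range(n):
            edu = rng.uniform(8, 20)       # years of schooling
            exp = rng.uniform(0, 30)       # years of experience
            tenure = rng.uniform(0, exp)   # tenure cannot exceed experience
            # Arbitrary linear wage equation plus Gaussian noise.
            wage = 2.0 + 0.8 * edu + 0.3 * exp + 0.1 * tenure + rng.gauss(0, 2)
            writer.writerow(
                [round(wage, 3), round(edu, 3), round(exp, 3), round(tenure, 3)]
            )


if __name__ == "__main__":
    make_sample_csv("your_data.csv")
```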
Features
All stat options
| stat= | Meaning | Stata equivalent |
|---|---|---|
| "r2" | R² | e(r2) |
| "r2_a" | Adjusted R² | e(r2_a) |
| "ll" | Log-likelihood | e(ll) |
| "aic" | AIC | computed |
| "bic" | BIC | computed |
| "rmse" | Root MSE | computed |
Custom extractor via stat_func:
result = shapley2(df, "y", ["x1", "x2", "x3"], stat_func=lambda r: r.rsquared)
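A named function can stand in for the lambda when the statistic needs more than one attribute. The sketch below computes McFadden's pseudo-R². It assumes the object passed to stat_func exposes statsmodels-style llf and llnull attributes; the r.rsquared example above suggests a statsmodels result, but the source does not confirm this, so treat the attribute names as an assumption.

```python
from types import SimpleNamespace


def mcfadden_r2(res):
    """McFadden pseudo-R²: 1 - llf / llnull (assumes statsmodels-style attributes)."""
    return 1.0 - res.llf / res.llnull


# Stand-in for a fitted result, so the function can be checked in isolation.
fake_result = SimpleNamespace(llf=-50.0, llnull=-100.0)
print(mcfadden_r2(fake_result))  # 0.5
```

If the assumption holds, the function would be passed as stat_func=mcfadden_r2 together with command="logit".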
All command options
| command= | Model | Stata equivalent |
|---|---|---|
| "ols" / "reg" | OLS | regress |
| "logit" | Logit | logit |
| "probit" | Probit | probit |
| "poisson" | Poisson | poisson |
| "glm" | GLM | glm |
| callable | Custom | any e() command |
Group decomposition (Stata group() option)
result = shapley2(
    df, "wage", ["edu", "exp", "tenure", "age"],
    stat="r2",
    groups={
        "Human Capital": ["edu", "exp"],
        "Job Tenure": ["tenure"],
        "Demographics": ["age"],
    },
)
result.summary()
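With groups, K counts groups rather than variables, so the four variables above in three groups need 2^3 = 8 regressions instead of 2^4 = 16. The sketch below shows how group subsets might expand into regressor lists. The helper group_subsets is hypothetical and is not taken from the package internals.

```python
from itertools import combinations


def group_subsets(groups):
    """Yield (group_names, variables) for every subset of the groups."""
    names = list(groups)
    for size in range(len(names) + 1):
        for combo in combinations(names, size):
            # A group enters or leaves the model with all of its variables.
            variables = [v for g in combo for v in groups[g]]
            yield combo, variables


groups = {
    "Human Capital": ["edu", "exp"],
    "Job Tenure": ["tenure"],
    "Demographics": ["age"],
}
subsets = list(group_subsets(groups))
print(len(subsets))  # 8 group subsets, versus 2**4 = 16 without grouping
```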
Parallel computation
# Use all available CPU cores
result = shapley2(
    df, "wage", ["x1", "x2", "x3", "x4", "x5"],
    stat="r2",
    n_jobs=-1,        # -1 = all cores; N = exactly N processes
    backend="loky",   # "loky" (default) | "threading" | "multiprocessing"
    verbose=1,        # show progress bar (requires tqdm)
)
When to use parallel?
Parallel execution pays off when K ≥ 10 (2^10 = 1,024 or more regressions).
For small K (≤ 8, at most 256 regressions), the process-spawning overhead outweighs the benefit.
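The number of regressions doubles with every added variable or group, which is why the threshold matters. The sketch below tabulates 2^K and applies the rule of thumb above. The helper names are hypothetical, and K = 9 is not covered by the guidance, so treating it as serial here is an assumption.

```python
def n_regressions(k):
    """Number of subsets (and therefore regressions) for K variables or groups."""
    return 2 ** k


def suggested_n_jobs(k, threshold=10):
    """Return -1 (all cores) at or above the threshold, otherwise 1 (serial)."""
    return -1 if k >= threshold else 1


for k in (3, 8, 10, 15, 20):
    print(k, n_regressions(k), suggested_n_jobs(k))
# 3 8 1
# 8 256 1
# 10 1024 -1
# 15 32768 -1
# 20 1048576 -1
```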
Visualization
fig, ax = result.plot(
    kind="norm_pct",  # "pct" | "norm_pct" | "shapley" | "norm"
    figsize=(8, 5),
)
fig.savefig("shapley_decomp.pdf", dpi=300)
Stata → Python mapping
| Stata | Python |
|---|---|
| shapley2, stat(r2) | shapley2(df, "y", ["x1","x2"], stat="r2") |
| shapley2, stat(ll) command(logit) | shapley2(..., stat="ll", command="logit") |
| shapley2, stat(r2) group(x1 x2, x3) | shapley2(..., groups={"G1":["x1","x2"],"G2":["x3"]}) |
| shapley2, stat(r2) force | shapley2(..., force=True) |
| (not available in Stata) | shapley2(..., n_jobs=-1) |
Result object attributes
result.table # pd.DataFrame: shapley, shapley_pct, shapley_norm, shapley_norm_pct
result.full_stat # float: full-model stat (e.g. R²)
result.residual # float: full_stat − sum(shapley)
result.K # int: number of variables/groups
result.runs # int: number of regressions run (2^K)
result.n_obs # int: number of observations used
result.summary() # prints Stata-style table, returns str
result.plot() # matplotlib bar chart
result.to_dict() # serializable dict
Algorithm
pyshapley2 implements the Shapley-Owen regression decomposition (also known as the LMG method):
- Enumerate all 2^K subsets of K variables/groups.
- Regress the outcome on each subset; record the fit statistic.
- Regress the vector of fit statistics on the binary inclusion indicators by OLS (with intercept); the slope coefficients are the Shapley values.
- Normalize: rescale the estimates so that they sum to the full-model statistic, and report four output forms (raw, relative %, normalized, normalized %).
This is a 1:1 algorithmic replication of Stata's shapley2 v1.1.
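The first three steps can be sketched in pure Python as follows. This is an illustration of the steps as listed, not the package's code: the function names are hypothetical, and a made-up lookup table of R² values stands in for the real regressions. Because every subset appears exactly once, each OLS slope here equals the mean statistic over subsets that include the variable minus the mean over subsets that exclude it.

```python
from itertools import product


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def inclusion_slopes(names, stat):
    """Regress stat(subset) on inclusion indicators over all 2^K subsets."""
    k = len(names)
    rows, y = [], []
    for flags in product([0, 1], repeat=k):           # step 1: enumerate subsets
        subset = frozenset(n for n, f in zip(names, flags) if f)
        rows.append([1.0, *map(float, flags)])        # intercept + indicators
        y.append(stat(subset))                        # step 2: record the statistic
    # Step 3: OLS via the normal equations (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k + 1)]
           for i in range(k + 1)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k + 1)]
    beta = solve_linear(xtx, xty)
    return dict(zip(names, beta[1:]))                 # drop the intercept


# Hypothetical R² values for each subset of two regressors (not real data).
toy_r2 = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.4,
    frozenset({"b"}): 0.3,
    frozenset({"a", "b"}): 0.6,
}
slopes = inclusion_slopes(["a", "b"], lambda s: toy_r2[s])
print({name: round(value, 4) for name, value in slopes.items()})
# approximately a: 0.35, b: 0.25
```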
Validation against Stata
Results are verified to match Stata's shapley2 (v1.1) output to ≥ 5 decimal places on two public benchmark datasets.
Test 1 — mtcars (individual variables)
Data: Motor Trend Cars Road Tests (1974), N = 32
Model: regress mpg hp wt disp
Stata: reg mpg hp wt disp → shapley2, stat(r2)
| Variable | Shapley (est.) | % (est.) | Shapley (norm.) | % (norm.) |
|---|---|---|---|---|
| hp | 0.18805 | 22.74% | 0.22511 | 27.23% |
| wt | 0.27959 | 33.81% | 0.33469 | 40.48% |
| disp | 0.22307 | 26.98% | 0.26704 | 32.30% |
| Residual | 0.13612 | 16.46% | — | — |
| TOTAL | 0.82684 | 100% | 0.82684 | 100% |
import pandas as pd
from pyshapley2 import shapley2
df = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv")
result = shapley2(df, "mpg", ["hp", "wt", "disp"], stat="r2")
result.summary()
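The four output columns in the Test 1 table are tied together by simple arithmetic, which can be checked without running any regression. The sketch below starts from the published estimate column and full-model R² and recomputes the other columns. The relationship (normalized = estimate × full_stat / sum of estimates) is inferred from the published numbers rather than from the package source, and agreement is only up to the rounding of the 5-decimal inputs.

```python
# Published values from the Test 1 table (rounded to 5 decimals).
full_stat = 0.82684
shapley = {"hp": 0.18805, "wt": 0.27959, "disp": 0.22307}


def normalize(shapley, full_stat):
    """Recompute pct, normalized, and normalized pct columns plus the residual."""
    total = sum(shapley.values())
    scale = full_stat / total          # rescale so the values sum to full_stat
    rows = {}
    for name, value in shapley.items():
        rows[name] = {
            "pct": 100 * value / full_stat,
            "norm": value * scale,
            "norm_pct": 100 * value * scale / full_stat,
        }
    residual = full_stat - total       # matches result.residual's definition
    return rows, residual


rows, residual = normalize(shapley, full_stat)
for name, cols in rows.items():
    print(name, {key: round(val, 5) for key, val in cols.items()})
print("residual", round(residual, 5))
```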
Test 2 — Boston Housing (grouped variables)
Data: Boston Housing (Harrison & Rubinfeld, 1978), N = 506
Model: regress medv lstat rm dis ptratio
Stata: reg medv lstat rm dis ptratio → shapley2, stat(r2) group(lstat,rm,dis ptratio)
| Group | Variables | Shapley (est.) | % (est.) | Shapley (norm.) | % (norm.) |
|---|---|---|---|---|---|
| Group 1 | lstat | 0.29427 | 42.63% | 0.31257 | 45.28% |
| Group 2 | rm | 0.23205 | 33.61% | 0.24648 | 35.71% |
| Group 3 | dis, ptratio | 0.12358 | 17.90% | 0.13126 | 19.01% |
| Residual | — | 0.04041 | 5.85% | — | — |
| TOTAL | — | 0.69031 | 100% | 0.69031 | 100% |
import pandas as pd
from pyshapley2 import shapley2
df = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/Boston.csv")
result = shapley2(
    df, "medv", ["lstat", "rm", "dis", "ptratio"],
    stat="r2",
    groups={
        "lstat": ["lstat"],
        "rm": ["rm"],
        "dis_ptratio": ["dis", "ptratio"],
    },
)
result.summary()
References
- Chavez Juarez, F. (2013). shapley2: Stata module to compute Shapley values from regressions. Statistical Software Components S457543, Boston College.
- Shapley, L. S. (1953). A value for n-person games. Contributions to the Theory of Games, 2, 307–317.
- Owen, G. (1977). Values of games with a priori unions. Essays in Mathematical Economics and Game Theory, 76–88.
- Kruskal, W. (1987). Relative importance by averaging over orderings. American Statistician, 41(1), 6–10.
License
MIT © 2026 luzhiyu-econ