Sparse factor model estimation with sign-constrained LASSO, prior-centered regularisation, and hierarchical group LASSO (HCGL)
Project description
🚀 Sparse Factor Model Estimation: factorlasso
factorlassopackage implements sign-constrained LASSO, prior-centered regularisation, and hierarchical group LASSO (HCGL) for sparse multi-output factor model estimation with integrated factor covariance assembly
| 📊 Metric | 🔢 Value |
|---|---|
| PyPI Version | |
| Python Versions | |
| License |
📈 Package Statistics
| 📊 Metric | 🔢 Value |
|---|---|
| Total Downloads | |
| CI Status | |
| Coverage | |
| GitHub Stars | |
| GitHub Forks |
The Problem
In many applications — portfolio construction, genomics, macro-econometrics — you need to estimate a factor model
$$Y_t = \alpha + \beta X_t + \varepsilon_t$$
where $Y_t \in \mathbb{R}^{N}$ are response variables (asset returns, gene expressions), $X_t \in \mathbb{R}^{M}$ are factors, $\beta \in \mathbb{R}^{N \times M}$ are sparse factor loadings, and $\alpha \in \mathbb{R}^{N}$ is the intercept.
In practice, you face several challenges that standard LASSO packages don't handle:
- Domain knowledge constrains coefficient signs — equity assets should have non-negative equity beta; government bonds should not load on commodity factors. Standard LASSO ignores this.
- You have prior estimates and want to shrink toward them, not toward zero — the penalty should be $|\beta - \beta_0|$ not $|\beta|$.
- Variables have different history lengths — some assets start trading later than others. Dropping rows with any NaN discards valid data for all other variables.
- You need a consistent covariance matrix — the factor covariance $\Sigma_y = \beta \Sigma_x \beta^\top + D$ must use the same $\beta$ from estimation, not a separate estimate.
- Data is non-stationary — recent observations should carry more weight (EWMA weighting).
factorlasso solves all five in a single fit() call. The implementation follows scikit-learn conventions (fit / predict / score / coef_ / intercept_).
The methodology is based on the Hierarchical Clustering Group LASSO (HCGL) framework introduced in:
Sepp A., Ossa I., Kastenholz M. (2026), "Robust Optimization of Strategic and Tactical Asset Allocation for Multi-Asset Portfolios", The Journal of Portfolio Management, 52(4), 86–120. Paper link
and the Capital Market Assumptions framework in the companion paper:
Sepp A., Hansen E., Kastenholz M. (2026), "Capital Market Assumptions and Strategic Asset Allocation Using Multi-Asset Tradable Factors", Under revision at the Journal of Portfolio Management.
Installation
Install using
pip install factorlasso
Upgrade using
pip install --upgrade factorlasso
Clone using
git clone https://github.com/ArturSepp/factorlasso.git
Core dependencies:
numpy, pandas, scipy, cvxpy, openpyxl
Table of Contents
- Quick Start
- Convention: Paper vs Code
- Sign Constraints
- Prior-Centered Regularisation
- Hierarchical Clustering Group LASSO (HCGL)
- NaN-Aware Estimation
- Factor Covariance Assembly
- API Summary
- Estimation Methods
- Applications
- Related Packages
- References
- Citation
Quick Start
import numpy as np, pandas as pd
from factorlasso import LassoModel, LassoModelType
# Simulate Y_t = β X_t + noise (code uses row-major: Y = X @ β' + noise)
np.random.seed(42)
T, M, N = 200, 3, 5
X = pd.DataFrame(np.random.randn(T, M), columns=['f0', 'f1', 'f2'])
beta_true = np.array([[1, 0, .5], [0, 1, 0], [.3, 0, 0], [0, .8, .2], [1, .5, 0]])
Y = pd.DataFrame(X.values @ beta_true.T + .1*np.random.randn(T, N),
columns=[f'y{i}' for i in range(N)])
# Fit sparse factor model
model = LassoModel(model_type=LassoModelType.LASSO, reg_lambda=1e-4)
model.fit(x=X, y=Y)
print(model.coef_.round(2)) # β (N × M)
print(model.intercept_.round(4)) # α (N,)
# Predict and score (scikit-learn compatible)
y_hat = model.predict(X) # Ŷ_t = α + β X_t (code: X @ β' + α)
r2 = model.score(X, Y) # mean R² across response variables
Convention: Paper vs Code
The factor model in the paper uses column vectors:
$$Y_t = \alpha + \beta, X_t + \varepsilon_t, \qquad \beta \in \mathbb{R}^{N \times M}$$
where $Y_t \in \mathbb{R}^{N \times 1}$ and $X_t \in \mathbb{R}^{M \times 1}$.
In Python, pandas DataFrames store observations as rows. The code works with the row-major equivalent:
| Symbol | Paper (column-vector) | Code (row-major, pandas) |
|---|---|---|
| $Y$ | $(N \times T)$ | y: DataFrame $(T \times N)$ |
| $X$ | $(M \times T)$ | x: DataFrame $(T \times M)$ |
| $\beta$ | $(N \times M)$ | coef_: DataFrame $(N \times M)$ — same as paper |
| $\alpha$ | $(N \times 1)$ | intercept_: Series $(N,)$ |
The coefficient matrix coef_ is stored in the paper convention $(N \times M)$.
The prediction Y = X @ β' + α in code is the row-major form of the paper's Y_t = α + β X_t.
Sign Constraints
Enforce domain knowledge on coefficient signs using a constraint matrix where
1 = non-negative, -1 = non-positive, 0 = constrained to zero, NaN = free:
signs = pd.DataFrame([[1, np.nan, 1], [np.nan, 1, 0], [1, 0, np.nan],
[np.nan, 1, 1], [1, 1, np.nan]],
index=Y.columns, columns=X.columns)
model = LassoModel(reg_lambda=1e-4, factors_beta_loading_signs=signs)
model.fit(x=X, y=Y)
# All constrained coefficients satisfy their sign requirements by construction
Prior-Centred Regularisation
Shrink toward a non-zero prior instead of zero. When you have prior estimates $\beta_0$ (e.g., from a previous estimation period or theoretical model), the penalty becomes $|\beta - \beta_0|$ instead of $|\beta|$:
beta_prior = pd.DataFrame(beta_true, index=Y.columns, columns=X.columns)
model = LassoModel(reg_lambda=1e-2, factors_beta_prior=beta_prior)
model.fit(x=X, y=Y) # shrinks toward beta_prior instead of zero
Hierarchical Clustering Group LASSO (HCGL)
Automatically discover group structure among response variables via hierarchical clustering on their correlation matrix (Ward's method), then apply Group LASSO with group-adaptive penalties:
model = LassoModel(
model_type=LassoModelType.GROUP_LASSO_CLUSTERS,
reg_lambda=1e-5, span=52,
)
model.fit(x=X, y=Y)
print(model.clusters) # auto-discovered groups
NaN-Aware Estimation
Variables with different history lengths are handled naturally.
Instead of dropping any row containing a NaN (which discards valid observations
for all other variables), factorlasso applies a binary validity mask that
zeros out the contribution of missing observations per variable while
preserving all available data:
Y_with_gaps = Y.copy()
Y_with_gaps.iloc[:50, 3] = np.nan # variable y3 starts 50 periods later
Y_with_gaps.iloc[:100, 4] = np.nan # variable y4 starts 100 periods later
model = LassoModel(reg_lambda=1e-4)
model.fit(x=X, y=Y_with_gaps)
# All 5 variables estimated using their full available history
# No data discarded for y0, y1, y2 despite gaps in y3, y4
Factor Covariance Assembly
After estimation, assemble the consistent factor covariance decomposition $\Sigma_y = \beta \Sigma_x \beta^\top + D$ where $\beta$ is the same matrix from the LASSO estimation — guaranteed consistency:
from factorlasso import CurrentFactorCovarData, VarianceColumns
sigma_y = CurrentFactorCovarData(
x_covar=factor_covariance, # Σ_x (M × M)
y_betas=model.coef_, # β (N × M) from estimation
y_variances=diagnostics_df, # residual variances D
).get_y_covar()
# sigma_y is (N × N) positive semi-definite by construction
API Summary
The API follows scikit-learn conventions: fit / predict / score.
| Method | Description |
|---|---|
model.fit(x, y) |
Estimate α, β — returns self |
model.predict(x) |
Return Ŷ_t = α + β X_t (row-major: X @ β' + α) |
model.score(x, y) |
Return mean R² |
| Fitted attribute | Shape | Description |
|---|---|---|
coef_ |
(N, M) | Factor loadings β |
intercept_ |
(N,) | Intercept α |
estimated_betas |
(N, M) | Alias for coef_ (backward compat) |
clusters_ |
(N,) | HCGL cluster labels |
estimation_result_ |
— | Full diagnostics (r2, ss_res, ss_total) |
| Parameter | Type | Default | Description |
|---|---|---|---|
model_type |
LassoModelType |
LASSO |
Estimation method |
reg_lambda |
float |
1e-5 |
Regularisation strength |
span |
int |
None |
EWMA span for observation weighting |
factors_beta_loading_signs |
DataFrame |
None |
Sign constraint matrix (N × M) |
factors_beta_prior |
DataFrame |
None |
Prior β₀ matrix (N × M) |
group_data |
Series |
None |
Group labels (required for GROUP_LASSO) |
demean |
bool |
True |
Subtract (rolling) mean before estimation |
solver |
str |
'CLARABEL' |
CVXPY solver name |
warmup_period |
int |
12 |
Min observations before including a variable |
Estimation Methods
| Method | LassoModelType |
Penalty |
|---|---|---|
| LASSO | LASSO |
$\lambda|\beta - \beta_0|_1$ |
| Group LASSO | GROUP_LASSO |
$\sum_g \lambda\sqrt{ |
| HCGL | GROUP_LASSO_CLUSTERS |
Same as Group LASSO with auto-clustering |
All methods support sign constraints, prior-centered shrinkage, EWMA weighting, and NaN-aware estimation.
Applications
The methodology is domain-agnostic. Examples are provided for:
examples/finance_factor_model.py— Multi-asset factor models with sign-constrained betas and consistent covariance estimationexamples/genomics_factor_model.py— Gene expression driven by pathway activity factors with biological sign priors
The same estimation problem (sparse factor loadings with sign priors and consistent covariance) appears in macro-econometrics, signal processing, and multi-task learning.
Illustration: multi-asset factor model with HCGL
from factorlasso import LassoModel, LassoModelType
model = LassoModel(
model_type=LassoModelType.GROUP_LASSO_CLUSTERS,
reg_lambda=1e-5,
span=52, # 1-year EWMA half-life (weekly data)
factors_beta_loading_signs=sign_matrix, # domain-knowledge constraints
factors_beta_prior=prior_betas, # shrink toward prior, not zero
)
model.fit(x=factor_returns, y=asset_returns)
# Inspect results
print(model.coef_) # sparse factor loadings (N × M)
print(model.intercept_) # intercept α (N,)
print(model.clusters_) # auto-discovered asset groups
print(model.score(factor_returns, asset_returns)) # mean R²
Related Packages
| Package | Key Difference |
|---|---|
scikit-learn Lasso |
No sign constraints, no multi-output Group LASSO |
| skglm | No sign constraints, no prior-centered shrinkage |
| abess | Best-subset selection (L0), not L1/Group L2 |
| group-lasso | No sign constraints, no EWMA, no prior-centered |
factorlasso is the only package that combines sign-constrained penalised regression, prior-centered shrinkage, HCGL clustering, NaN-aware estimation, and integrated factor covariance assembly.
References
-
Sepp A., Ossa I., Kastenholz M. (2026), "Robust Optimization of Strategic and Tactical Asset Allocation for Multi-Asset Portfolios", The Journal of Portfolio Management, 52(4), 86–120. Paper link
-
Sepp A., Hansen E., Kastenholz M. (2026), "Capital Market Assumptions and Strategic Asset Allocation Using Multi-Asset Tradable Factors", Under revision at the Journal of Portfolio Management.
Citation
If you use factorlasso in your research, please cite the software and the underlying papers:
@software{sepp2026factorlasso,
author = {Sepp, Artur},
title = {factorlasso: Sparse Factor Model Estimation with Constrained LASSO in Python},
year = {2026},
url = {https://github.com/ArturSepp/factorlasso}
}
@article{seppossa2026,
author = {Sepp, Artur and Ossa, Ivan and Kastenholz, Mika},
title = {Robust Optimization of Strategic and Tactical Asset Allocation for Multi-Asset Portfolios},
journal = {The Journal of Portfolio Management},
volume = {52},
number = {4},
pages = {86--120},
year = {2026}
}
@article{sepphansen2026,
author = {Sepp, Artur and Hansen, Emilie and Kastenholz, Mika},
title = {Capital Market Assumptions and Strategic Asset Allocation Using Multi-Asset Tradable Factors},
journal = {Under revision at the Journal of Portfolio Management},
year = {2026}
}
Disclaimer
factorlasso package is distributed FREE & WITHOUT ANY WARRANTY under the MIT License.
See LICENSE for details.
Please report any bugs or suggestions by opening an issue.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file factorlasso-0.1.9.tar.gz.
File metadata
- Download URL: factorlasso-0.1.9.tar.gz
- Upload date:
- Size: 46.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2e20d3c17a48d9a4a77b48784b2141b20929e1f70872679a8b5aa22e60701f3d
|
|
| MD5 |
99d45d5fd1c4e75c5a262b2904ddcc33
|
|
| BLAKE2b-256 |
eadcdedb47d0e87e021828cf3a08b57f9874f2856f23174b416173cf32974727
|
File details
Details for the file factorlasso-0.1.9-py3-none-any.whl.
File metadata
- Download URL: factorlasso-0.1.9-py3-none-any.whl
- Upload date:
- Size: 26.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6b519a88a7273f46007403f45915c885f2ff3b96c7c136b8a2b0ae433481372a
|
|
| MD5 |
cfc37bbd22aeef8775f03e180eb515f2
|
|
| BLAKE2b-256 |
69106252253e3e851be436c4b0ab4f10012a67258a6544cdb6b59cde67f07a05
|