Regress Out Covariates

regressout

regressout removes the linear effect of observed covariates from a feature matrix. It provides RegressOutCovariates, a scikit-learn-style estimator that residualizes each feature column against a covariate matrix.

Why it exists

Some modeling workflows require that variation explained by known covariates be removed from the features first. For example, a feature matrix may need to be adjusted for observed variables such as age, sex, ethnicity, batch, site, or other metadata before downstream analysis. This package fits those adjustments and returns the residual feature matrix.

How it works

RegressOutCovariates uses scikit-learn naming, but with domain-specific meaning:

  • X is the covariate or observation matrix: the variables to regress out.
  • y is the feature matrix to residualize.

On fit(X=covariates, y=features), it fits one sklearn.linear_model.LinearRegression model per feature column:

feature_j ~ covariates

On predict(X=covariates, y=features), it predicts the covariate contribution for each feature and returns:

feature_j - predicted_feature_j

If y is a pandas DataFrame, the returned residuals are also a DataFrame with the same index and columns. Otherwise, residuals are returned as a NumPy array.
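The arithmetic behind these two steps can be sketched with plain NumPy. This is an illustration only, not the package's code: it swaps in np.linalg.lstsq for sklearn.linear_model.LinearRegression and assumes an intercept is fitted (the LinearRegression default). The variable names are made up for the example.

```python
import numpy as np

# Covariates to regress out (age, sex_M) and the features to residualize.
covariates = np.array([[25, 1], [49, 0], [60, 1], [50, 0]], dtype=float)
features = np.array([[1.2, 0.4], [2.5, 0.7], [2.9, 1.4], [3.1, 1.6]])

# Assumption: an intercept is fitted, so prepend a column of ones.
design = np.column_stack([np.ones(len(covariates)), covariates])

# Ordinary least squares; with a 2-D target, each feature column is
# solved independently, i.e. one regression per feature.
coef, *_ = np.linalg.lstsq(design, features, rcond=None)

# Covariate contribution for each feature, then the residuals.
predicted = design @ coef
residuals = features - predicted
```

When the same rows are used for fitting and prediction, the residuals are orthogonal to every covariate column and to the intercept, which is what "removing the linear effect" means in practice.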

Installation

pip install regressout

For local development from this repository:

pip install -r requirements_dev.txt
pip install -e .

The runtime dependencies declared by the package are numpy, pandas, and scikit-learn; Python 3.8 or newer is required.

Usage

import pandas as pd
from regressout import RegressOutCovariates

covariates = pd.DataFrame(
    {
        "age": [25, 49, 60, 50],
        "sex_M": [1, 0, 1, 0],
    },
    index=["sample1", "sample2", "sample3", "sample4"],
)

features = pd.DataFrame(
    {
        "feat1": [1.2, 2.5, 2.9, 3.1],
        "feat2": [0.4, 0.7, 1.4, 1.6],
    },
    index=covariates.index,
)

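# Fit one linear model per feature column (feature_j ~ covariates).
residualizer = RegressOutCovariates()
residualizer.fit(X=covariates, y=features)

# predict returns the residuals: each feature minus its covariate-predicted part.
residualized_features = residualizer.predict(X=covariates, y=features)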

When covariates need preprocessing, put the preprocessing steps before RegressOutCovariates in a scikit-learn pipeline. The tests show this pattern with categorical encoding, column matching, scaling, and then residualization.
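As a rough sketch of the preprocessing half of such a pipeline, the snippet below turns a raw covariate table with a categorical column into a purely numeric matrix using standard scikit-learn components. The column names and transformer labels are assumptions for illustration, and it does not reproduce the package's tests. RegressOutCovariates itself is left out so the snippet runs with scikit-learn and pandas alone.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Raw covariates: one numeric column and one categorical column.
raw_covariates = pd.DataFrame(
    {"age": [25, 49, 60, 50], "sex": ["M", "F", "M", "F"]},
    index=["sample1", "sample2", "sample3", "sample4"],
)

# Scale the numeric column and one-hot encode the categorical one.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("scale_age", StandardScaler(), ["age"]),
        ("encode_sex", OneHotEncoder(drop="first"), ["sex"]),
    ],
    sparse_threshold=0,
)

# Columns of the result: scaled age, indicator for sex == "M".
numeric_covariates = preprocess.fit_transform(raw_covariates)
```

In a pipeline, a step like preprocess would sit before the RegressOutCovariates step, so that only numeric covariates reach the residualizer.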

Important behavior and limitations

  • Covariates must already be numeric when they reach RegressOutCovariates. Encode categorical variables, impute missing values, or scale covariates in earlier pipeline steps as needed.
  • The estimator performs independent linear regression for each feature column; it does not model nonlinear effects unless you add nonlinear covariate features before fitting.
  • When fitted with pandas DataFrames, it validates row indexes and column order on later predictions where that metadata is available.
  • The number of rows in X and y must match. The number and order of covariate and feature columns must match what was seen during fit.
  • Unlike a standard scikit-learn estimator, both fit and predict take two arguments, for example predict(X=covariates, y=features); a single-argument predict(X) call will not work. The class exposes predict rather than transform, so it is a predictor and not a scikit-learn transformer.
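The point about nonlinear effects can be illustrated with synthetic data. The sketch below is not the package's code: residualize is a hypothetical helper built on np.linalg.lstsq with an intercept, and the data are simulated. It shows that a quadratic age effect survives linear residualization but is removed once age squared is added as an extra covariate column.

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)

# Simulated feature with a quadratic dependence on age plus small noise.
feature = 0.01 * (age - 50) ** 2 + rng.normal(scale=0.1, size=200)


def residualize(covariates, y):
    """Hypothetical helper: OLS residuals of y on covariates plus an intercept."""
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef


# Linear covariate only: the curved age effect remains in the residuals.
linear_resid = residualize(age[:, None], feature)

# Adding age**2 as a covariate lets the linear model absorb the curvature.
quad_resid = residualize(np.column_stack([age, age**2]), feature)
```

Comparing linear_resid.std() with quad_resid.std() shows how much age-related variation the purely linear adjustment leaves behind.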

Development

make test
make lint
make docs

The package is MIT licensed.

Changelog

0.0.1

  • First release on PyPI.
