Random Forest-inspired correlation/dependence methods
rfcorr - Random Forest-based "Correlation" measures
This library records an open research agenda related to alternative conceptions of correlation based on tree-based ensembles, i.e., "random forests."
Author: Michael Bommarito
Project Homepage: GitHub
Original Announcement
PyPI
INSTALL
$ pip install rfcorr
USE
import rfcorr.random_forest
# df = pandas.DataFrame of data with features/variables in columns
rfcorr.random_forest.get_pairwise_corr(
    df.values,
    num_trees=100,         # number of trees in forest - bigger => tighter estimates
    lag=0,                 # lag feature-target variable => allows for asymmetric R(x,y) != R(y,x)
    method="regression",   # estimate using regression or classification task
    use_permutation=True,  # permutation- or impurity-based importance estimates
)
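The use_permutation flag chooses between permutation-based and impurity-based importances. As a rough illustration of the permutation idea only (this is not rfcorr's implementation, and the helper names and toy model below are hypothetical), the sketch shuffles one feature column and measures how much a model's R^2 drops:

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, column, n_repeats=10, seed=0):
    """Mean drop in R^2 when one feature column is shuffled.

    `predict` stands in for any fitted model's prediction function
    (hypothetical here); X is a list of feature rows, y the targets.
    """
    rng = random.Random(seed)  # fixed seed => reproducible shuffles
    baseline = r2_score(y, [predict(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        X_perm = [row[:column] + [v] + row[column + 1:]
                  for row, v in zip(X, shuffled)]
        drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
    return sum(drops) / len(drops)


def predict(row):
    """Toy stand-in for a fitted model: the target depends only on feature 0."""
    return 2.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(50)]
y = [2.0 * row[0] for row in X]

imp_used = permutation_importance(predict, X, y, column=0)
imp_unused = permutation_importance(predict, X, y, column=1)
```

Shuffling the unused column leaves the predictions untouched, so its importance is exactly zero. Shuffling the column the model relies on can push R^2 below zero, so under this definition the drop is not bounded by 1.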
WHY?
Countless tasks rely on conceptions and formalizations of "correlation." But in two decades of working in areas where correlation is central to everyday operations, e.g., finance, I have found that few practitioners measure what their own words suggest they intuitively mean.
While others have offered alternative measures of correlation or dependence more generally, this is my contribution: an approach based on tree-based ensembles that natively supports lagged correlation.
Tree-based ensembles have a number of highly-favorable properties:
- Support for categorical features or mixed feature sets
- Intrinsic uncertainty estimation and intervals through ensemble construction
- Support for overfitting protection
- High degree of flexibility
- Estimation produces inferential models that can be used to make predictions
Additionally, this library supports lagging inputs against targets in the supervised training process, enabling asymmetric correlation estimates from the same data.
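To see why lagging makes the estimate asymmetric, consider how the training pairs are built. The sketch below is an assumption about the general idea, not rfcorr's internal code: align_with_lag is a hypothetical helper, and the direction of the lag convention is chosen here for illustration.

```python
def align_with_lag(feature, target, lag=0):
    """Pair feature[t] with target[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative in this sketch")
    if lag == 0:
        return list(feature), list(target)
    # Drop the last `lag` feature values and the first `lag` target values.
    return list(feature[:-lag]), list(target[lag:])


x = [0, 1, 2, 3, 4, 5]
y = [10, 11, 12, 13, 14, 15]

# x as the feature: x[t] is paired with y[t + 1]
x_feat, y_targ = align_with_lag(x, y, lag=1)
# y as the feature: y[t] is paired with x[t + 1]
y_feat, x_targ = align_with_lag(y, x, lag=1)
```

The two calls produce different (feature, target) pairs from the same series, so a model trained on one set need not agree with a model trained on the other. That is the source of R(x,y) != R(y,x).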
There are, however, downsides:
- Slower estimation than other correlation methods
- Stochastic estimates (though this library supports fixing RNG)
- Open questions around interpreting sign or directionality
- More complex interpretation than linear correlation measures
- Unclear how best to scale permutation-based estimates
- Unclear how to estimate covariance in asymmetric contexts
FUNCTIONALITY
- Random Forest (rfcorr.random_forest)
  - get_corr_classification: Correlation from classification task
  - get_corr_regression: Correlation from regression task
  - get_corr: Convenience handler including lag support for (x, y)
  - get_pairwise_corr: Convenience handler including lag support for full matrix X
  - Support for impurity-based or permutation-based importances (use_permutation=True)
- Extra Trees (rfcorr.extra_trees)
  - get_corr_classification: Correlation from classification task
  - get_corr_regression: Correlation from regression task
  - get_corr: Convenience handler including lag support for (x, y)
  - get_pairwise_corr: Convenience handler including lag support for full matrix X
  - Support for impurity-based or permutation-based importances (use_permutation=True)
- CatBoost (rfcorr.cat) (WIP)
  - get_corr_classification: Correlation from classification task
  - get_corr_regression: Correlation from regression task
  - get_corr: Convenience handler including lag support for (x, y)
  - get_pairwise_corr: Convenience handler including lag support for full matrix X
  - Support for GPU training and a limited subset of catboost training parameters
- xgboost (rfcorr.xgboost) (WIP)
  - get_corr_classification: Correlation from classification task
  - get_corr_regression: Correlation from regression task
  - get_corr: Convenience handler including lag support for (x, y)
  - get_pairwise_corr: Convenience handler including lag support for full matrix X
- TODO: Histogram-based Gradient Boosting Trees
- TODO: Gradient-Boosting Trees
- TODO: Support exposing intervals (std, range) from permutation-based estimates
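Until intervals are exposed by the library, repeated estimates (for example, from several runs with different seeds) could be summarized by hand. This is a minimal sketch using only the standard library; summarize_estimates is a hypothetical helper and the input values are made-up placeholders, not output from rfcorr.

```python
import statistics


def summarize_estimates(estimates):
    """Return mean, sample standard deviation, and range of repeated estimates."""
    return {
        "mean": statistics.fmean(estimates),
        "std": statistics.stdev(estimates),
        "min": min(estimates),
        "max": max(estimates),
    }


# Made-up placeholder values standing in for five repeated estimates.
repeats = [0.62, 0.65, 0.60, 0.66, 0.63]
summary = summarize_estimates(repeats)
```

A wide std or range relative to the mean suggests using more trees or more repeats before reading much into the point estimate.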
EXAMPLE USE
There are sample notebooks in the notebooks/ directory, including:
- notebooks/test_sector_etf.ipynb: "Correlation" and eigenvalue/spectral representations for SPDR Sector ETFs and SPY
- notebooks/test_sector_etf_ts.ipynb: Rolling "Correlation" time series for SPDR Sector ETFs and SPY
- notebooks/test_periodic_pathological.ipynb: Test of periodic (sin(x)) data with pathological results for Pearson/Spearman
Sample usage looks like this:
import numpy
import pandas
import rfcorr.random_forest

# create sample data
x = numpy.arange(0, 8*numpy.pi, 0.1)
y1 = numpy.sqrt(x)
y2 = numpy.sin(x)

# collect the series into a DataFrame, one variable per column
df = pandas.DataFrame({"x": x, "y1": y1, "y2": y2})

# fix random state/RNG
rs = numpy.random.RandomState(42)

# estimate the pairwise matrix and label rows/columns with the variable names
pandas.DataFrame(rfcorr.random_forest.get_pairwise_corr(df.values,
                                                        num_trees=1000,
                                                        lag=0,
                                                        method="regression",
                                                        use_permutation=True,
                                                        random_state=rs),
                 columns=["x", "y1", "y2"],
                 index=["x", "y1", "y2"])
"""
x y1 y2
x 1.000000 1.919737 0.001276
y1 1.965436 1.000000 0.003697
y2 0.649579 0.628396 1.000000
"""
# NB: ~0 correlation for x~y2 and y1~y2

# compare with pearson
df.corr(method="pearson")
"""
x y1 y2
x 1.000000 0.978639 -0.194091
y1 0.978639 1.000000 -0.206973
y2 -0.194091 -0.206973 1.000000
"""
# compare with spearman
df.corr(method="spearman")
"""
x y1 y2
x 1.000000 1.000000 -0.186751
y1 1.000000 1.000000 -0.186751
y2 -0.186751 -0.186751 1.000000
"""
HISTORY
- 0.1.0, 2022-02-22: Initial PyPI release
- 0.1.1, 2022-05-02: catboost support (available on GH as of 2022-03-28)
- 0.1.2, 2022-05-02: xgboost support
LICENSE
Apache 2.0
COLLABORATION
I'm currently working on a brief research note that should be on arXiv by March 2022. I'd love to collaborate with anyone interested in the topic, especially to bring a broader perspective to backtesting, portfolio construction, and regime detection/timing applications.