Random Forest-inspired correlation/dependence methods

rfcorr - Random Forest-based "Correlation" measures

This library records an open research agenda related to alternative conceptions of correlation based on tree-based ensembles, i.e., "random forests."

Author: Michael Bommarito
Project Homepage: GitHub
Original Announcement
PyPI

INSTALL

$ pip install rfcorr

USE

import rfcorr.random_forest

# df = pandas.DataFrame of data with features/variables in columns
rfcorr.random_forest.get_pairwise_corr(df.values,
                                       num_trees=100, # number of trees in forest - bigger => tighter estimates
                                       lag=0, # lag feature-target variable => allows for asymmetric R(x,y) != R(y,x)
                                       method="regression", # estimate using regression or classification task
                                       use_permutation=True # permutation- or impurity-based importance estimates
)

WHY?

Countless tasks rely on conceptions and formalizations of "correlation." But in two decades of working in fields where correlation is central to everyday operations, such as finance, I have found that few practitioners actually measure what their own words suggest they intuitively mean.

While others have offered alternative measures of correlation or dependence generally, this is my contribution: an approach based on tree-based ensembles that natively supports lagged correlation.

Tree-based ensembles have a number of highly favorable properties:

  • Support for categorical features or mixed feature sets
  • Intrinsic uncertainty estimation and intervals through ensemble construction
  • Support for overfitting protection
  • High degree of flexibility
  • Estimation produces inferential models that can be used to make predictions

Additionally, this library supports lagging inputs against targets in the supervised training process, enabling asymmetric correlation estimates from the same data.
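To make the lagging idea concrete, here is a minimal sketch of how a feature series might be aligned against a lagged target before fitting. It is not taken from the library's source: lag_pair is a hypothetical helper, and the convention that x[t] is paired with y[t + lag] is an assumption made for illustration.

```python
import numpy as np

def lag_pair(x, y, lag=0):
    """Hypothetical helper: pair x[t] with y[t + lag], dropping unmatched ends."""
    x = np.asarray(x)
    y = np.asarray(y)
    if lag < 0:
        raise ValueError("lag must be non-negative in this sketch")
    if lag == 0:
        return x, y
    # the last `lag` features have no future target; the first `lag` targets have no past feature
    return x[:-lag], y[lag:]

x = np.arange(6)
y = x * 10

# x as feature, y as target two steps later
features, targets = lag_pair(x, y, lag=2)
# y as feature, x as target two steps later: a different set of training pairs
features_rev, targets_rev = lag_pair(y, x, lag=2)
```

Because swapping the roles of x and y changes which observations are paired, a model fit on (x, lagged y) sees different training data than one fit on (y, lagged x), which is one way R(x, y) can differ from R(y, x).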

There are, however, downsides:

  • Slower estimation than other correlation methods
  • Stochastic estimates (though this library supports fixing RNG)
  • Open questions around the interpretation of sign or directionality
  • More complex interpretation than linear correlation measures
  • Open question of how to scale permutation-based estimates
  • Open question of how to estimate covariance in asymmetric contexts
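
The points about stochastic estimates and permutation-based estimates can be illustrated with a small experiment. The sketch below is not rfcorr's implementation. It assumes scikit-learn's RandomForestRegressor and permutation_importance with default scoring (R² for regressors), and permutation_score is a hypothetical helper name. With a single feature that predicts the target almost perfectly, shuffling that feature can push R² from near 1 toward -1, so the raw importance is not bounded by 1; fixing the seed makes repeated runs agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

def permutation_score(x, y, num_trees=100, seed=42):
    """Hypothetical helper: mean permutation importance of x for predicting y."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)  # single-feature design matrix
    model = RandomForestRegressor(n_estimators=num_trees, random_state=seed)
    model.fit(X, y)
    # default scoring is the estimator's own score(), i.e. R^2 for a regressor
    result = permutation_importance(model, X, y, n_repeats=10, random_state=seed)
    return result.importances_mean[0]

x = np.arange(0, 8 * np.pi, 0.1)
y = np.sqrt(x)

first = permutation_score(x, y, seed=0)
second = permutation_score(x, y, seed=0)  # same seed, same estimate
print(first, second)
```

Under these assumptions the score measures a drop in R² rather than a bounded coefficient, which is why some rescaling convention is needed before comparing it with Pearson-style values.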

FUNCTIONALITY

  • Random Forest (rfcorr.random_forest)
    • get_corr_classification: Correlation from classification task
    • get_corr_regression: Correlation from regression task
    • get_corr: Convenience handler including lag support for (x, y)
    • get_pairwise_corr: Convenience handler including lag support for full matrix X
    • Support for impurity-based or permutation-based importances (use_permutation=True)
  • Extra Trees (rfcorr.extra_trees)
    • get_corr_classification: Correlation from classification task
    • get_corr_regression: Correlation from regression task
    • get_corr: Convenience handler including lag support for (x, y)
    • get_pairwise_corr: Convenience handler including lag support for full matrix X
    • Support for impurity-based or permutation-based importances (use_permutation=True)
  • CatBoost (rfcorr.cat) (WIP)
    • get_corr_classification: Correlation from classification task
    • get_corr_regression: Correlation from regression task
    • get_corr: Convenience handler including lag support for (x, y)
    • get_pairwise_corr: Convenience handler including lag support for full matrix X
    • Support for GPU training and limited subset of catboost training parameters
  • xgboost (rfcorr.xgboost) (WIP)
    • get_corr_classification: Correlation from classification task
    • get_corr_regression: Correlation from regression task
    • get_corr: Convenience handler including lag support for (x, y)
    • get_pairwise_corr: Convenience handler including lag support for full matrix X
  • TODO: Histogram-based Gradient Boosting Trees
  • TODO: Gradient-Boosting Trees
  • TODO: Support exposing intervals (std, range) from permutation-based estimates
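
The get_pairwise_corr handlers are described above as operating on a full matrix X. As a rough illustration of why such a matrix need not be symmetric, the sketch below scores every ordered (feature, target) pair of columns. It deliberately swaps the forest for a much simpler stand-in, a binned-mean R² (a correlation ratio), so that it runs with NumPy alone. The names pairwise_matrix and binned_r2 are hypothetical, and the rows-as-targets orientation is an assumption of this sketch, not a documented property of the library.

```python
import numpy as np

def binned_r2(feature, target, bins=20):
    """Stand-in pair score: share of target variance explained by the mean
    of target within equal-width bins of feature (not a random forest)."""
    edges = np.linspace(feature.min(), feature.max(), bins + 1)
    idx = np.digitize(feature, edges[1:-1])  # bin index in 0..bins-1
    pred = np.zeros_like(target, dtype=float)
    for b in np.unique(idx):
        mask = idx == b
        pred[mask] = target[mask].mean()
    residual = np.sum((target - pred) ** 2)
    total = np.sum((target - target.mean()) ** 2)
    return 1.0 - residual / total

def pairwise_matrix(X, pair_fn):
    """Score every ordered column pair; M[i, j] treats column j as the
    feature and column i as the target (orientation assumed here)."""
    n_cols = X.shape[1]
    M = np.ones((n_cols, n_cols))
    for i in range(n_cols):
        for j in range(n_cols):
            if i != j:
                M[i, j] = pair_fn(X[:, j], X[:, i])
    return M

x = np.arange(0, 8 * np.pi, 0.1)
X = np.column_stack([x, np.sqrt(x), np.sin(x)])
M = pairwise_matrix(X, binned_r2)
# M[2, 0] scores sin(x) as a target of x; M[0, 2] scores x as a target of sin(x)
print(np.round(M, 3))
```

Since sin(x) is a function of x but x cannot be recovered from sin(x), the two off-diagonal entries for that pair differ sharply, which is the kind of asymmetry a predictive "correlation" can expose and a symmetric coefficient cannot.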

EXAMPLE USE

There are sample notebooks in the notebooks/ directory, including:

  • notebooks/test_sector_etf.ipynb: "Correlation" and eigenvalue/spectral representations for SPDR Sector ETFs and SPY
  • notebooks/test_sector_etf_ts.ipynb: Rolling "Correlation" time series for SPDR Sector ETFs and SPY
  • notebooks/test_periodic_pathological.ipynb: Test of periodic (sin(x)) data with pathological results for Pearson/Spearman

Sample usage looks like this:

import numpy
import pandas
import rfcorr.random_forest

# create sample data
x = numpy.arange(0, 8*numpy.pi, 0.1)
y1 = numpy.sqrt(x)
y2 = numpy.sin(x)

# build the DataFrame first; it is used in the call below
df = pandas.DataFrame({"x": x, "y1": y1, "y2": y2})

# fix random state/RNG
rs = numpy.random.RandomState(42)
pandas.DataFrame(rfcorr.random_forest.get_pairwise_corr(df.values,
                                                      num_trees=1000,
                                                      lag=0,
                                                      method="regression",
                                                      use_permutation=True,
                                                      random_state=rs),
                 columns=["x", "y1", "y2"],
                 index=["x", "y1", "y2"])
"""
	x	y1	y2
x	1.000000	1.919737	0.001276
y1	1.965436	1.000000	0.003697
y2	0.649579	0.628396	1.000000
"""
#NB: ~0 correlation for x~y2 and y1~y2

# compare with pearson
df = pandas.DataFrame(zip(x, y1, y2), columns=["x", "y1", "y2"])
df.corr(method="pearson")

"""
	x	y1	y2
x	1.000000	0.978639	-0.194091
y1	0.978639	1.000000	-0.206973
y2	-0.194091	-0.206973	1.000000
"""

# compare with spearman
df.corr(method="spearman")
"""
	x	y1	y2
x	1.000000	1.000000	-0.186751
y1	1.000000	1.000000	-0.186751
y2	-0.186751	-0.186751	1.000000
"""

HISTORY

  • 0.1.0, 2022-02-22: Initial PyPI release
  • 0.1.1, 2022-05-02: catboost support (available on GH as of 2022-03-28)
  • 0.1.2, 2022-05-02: xgboost support

LICENSE

Apache 2.0

COLLABORATION

I'm currently working on a brief research note that should be on arxiv by March 2022. I'd love to collaborate with anyone interested on the topic, especially to bring broader perspective to backtesting, portfolio construction, and regime detection/timing applications.
