Linear regression utility with inference tests, residual analysis, outlier visualization, multicollinearity test, and other features

mlr (pip install mlr)

A lightweight, easy-to-use Python package that combines a simple scikit-learn-like API with the statistical inference tests, visual residual analysis, outlier visualization, and multicollinearity tests found in packages such as statsmodels and in the R language.

Authored and maintained by Dr. Tirthajyoti Sarkar

Useful regression metrics,

  • MSE, SSE, SST
  • R^2, Adjusted R^2
  • AIC (Akaike Information Criterion), and BIC (Bayesian Information Criterion)

Inferential statistics,

  • Standard errors
  • Confidence intervals
  • p-values
  • t-test values
  • F-statistic

Visual residual analysis,

  • Plots of fitted vs. features
  • Plot of fitted vs. residuals
  • Histogram of standardized residuals
  • Q-Q plot of standardized residuals

Outlier detection

  • Influence plot
  • Cook's distance plot

Multicollinearity

  • Pairplot
  • Variance inflation factors (VIF)
  • Covariance matrix
  • Correlation matrix
  • Correlation matrix heatmap
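
As a rough illustration of what a variance inflation factor measures, the sketch below regresses each feature on the remaining ones and reports 1 / (1 - R^2). The function name `vif` and the demo data are hypothetical; this uses plain NumPy and is not the package's own API or implementation.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Demo: the third column is almost a copy of the first, so both get a large VIF.
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
x2 = x0 + 0.01 * rng.normal(size=200)
vif_demo = vif(np.column_stack([x0, x1, x2]))
```

A value near 1 suggests the feature is not explained by the others; large values point to multicollinearity.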

Requirements

  • numpy (pip install numpy)
  • pandas (pip install pandas)
  • matplotlib (pip install matplotlib)
  • seaborn (pip install seaborn)
  • scipy (pip install scipy)
  • statsmodels (pip install statsmodels)

Install

On Linux and Windows, you can use pip:

pip install mlr

On macOS, first install pip:

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Then proceed as above.


Quick Start

Import the MyLinearRegression class,

from MLR import MyLinearRegression as mlr
import numpy as np

Generate some random data

num_samples=40
num_dim = 5
X = 10*np.random.random(size=(num_samples,num_dim))
coeff = np.array([2,-3.5,1.2,4.1,-2.5])
y = np.dot(coeff,X.T)+10*np.random.randn(num_samples)

Make a model instance,

model = mlr()

Ingest the data

model.ingest_data(X,y)

Fit,

model.fit()
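
For intuition, an ordinary least-squares fit with an intercept can be sketched with NumPy alone. The helper `ols_fit` and the `*_demo` names are hypothetical, and this is not necessarily how the package computes its fit; the intercept column is an assumption based on the sample outputs further down, which list six values for five features.

```python
import numpy as np

def ols_fit(X, y):
    """Return [intercept, b1, ..., bp] from a least-squares fit (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(X.shape[0]), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Noise-free demo data, so the true coefficients are recovered exactly.
rng = np.random.default_rng(0)
X_demo = 10 * rng.random(size=(40, 5))
coeff_demo = np.array([2, -3.5, 1.2, 4.1, -2.5])
y_demo = 1.5 + X_demo @ coeff_demo
beta_demo = ols_fit(X_demo, y_demo)
```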

Directly read from a Pandas DataFrame

You can read directly from a Pandas DataFrame. Just give the features/predictors' column names as a list and the target column name as a string to the fit_dataframe method.

At this point, only numerical features and targets are supported; support for categorical variables is planned for future releases.

import pandas as pd

# ... obtain a Pandas DataFrame by some processing
df = pd.DataFrame(...)
feature_cols = ['X1','X2','X3']
target_col = 'output'

model = mlr()
model.fit_dataframe(X=feature_cols,y = target_col,dataframe=df)

Metrics

So far, it looks similar to the linear regression estimator of Scikit-Learn, doesn't it?
Here comes the difference,

Print all kinds of regression model metrics, one by one,

print("R-squared: ", model.r_squared())
print("Adjusted R-squared: ", model.adj_r_squared())
print("MSE: ", model.mse())

>> R-squared:  0.8344327025902752
   Adjusted R-squared:  0.8100845706182569
   MSE:  72.2107655649954

Or, print all the metrics at once!

model.print_metrics()

>> sse:     2888.4306
   sst:     17445.6591
   mse:     72.2108
   r^2:     0.8344
   adj_r^2: 0.8101
   AIC:     296.6986
   BIC:     306.8319
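
The printed values are mutually consistent under the usual textbook definitions, which the sketch below reproduces from SSE and SST alone. It assumes n = 40 samples, p = 5 features plus an intercept, MSE = SSE / n, and a Gaussian log-likelihood for AIC and BIC. The function `regression_metrics` is hypothetical and only illustrates the arithmetic; it is not the package's code.

```python
import math

def regression_metrics(sse, sst, n, p):
    """Derive summary metrics from SSE and SST (hypothetical helper)."""
    mse = sse / n
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    # Gaussian log-likelihood evaluated at the least-squares solution.
    llf = -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1.0)
    k = p + 1  # number of estimated coefficients, including the intercept
    aic = -2.0 * llf + 2.0 * k
    bic = -2.0 * llf + k * math.log(n)
    return {"mse": mse, "r2": r2, "adj_r2": adj_r2, "aic": aic, "bic": bic}

m = regression_metrics(sse=2888.4306, sst=17445.6591, n=40, p=5)
```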

Correlation matrix, heatmap, covariance

We can build the correlation matrix right after ingesting the data. This matrix indicates how much multicollinearity is present among the features/predictors.

Correlation matrix

model.ingest_data(X,y)
model.corrcoef()

>> array([[ 1.        ,  0.18424447, -0.00207883,  0.144186  ,  0.08678109],
       [ 0.18424447,  1.        , -0.08098705, -0.05782733,  0.19119872],
       [-0.00207883, -0.08098705,  1.        ,  0.03602977, -0.17560097],
       [ 0.144186  , -0.05782733,  0.03602977,  1.        ,  0.05216212],
       [ 0.08678109,  0.19119872, -0.17560097,  0.05216212,  1.        ]])

Covariance

model.covar()

>> array([[10.28752086,  1.51237819, -0.01770701,  1.47414685,  0.79121778],
       [ 1.51237819,  6.54969628, -0.5504233 , -0.47174359,  1.39094876],
       [-0.01770701, -0.5504233 ,  7.05247111,  0.30499622, -1.32560195],
       [ 1.47414685, -0.47174359,  0.30499622, 10.16072256,  0.47264283],
       [ 0.79121778,  1.39094876, -1.32560195,  0.47264283,  8.08036806]])
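
The two matrices are related: each correlation is the covariance divided by the product of the two standard deviations. The sketch below applies that to the covariance values printed above; `cov_to_corr` is a hypothetical helper, not part of the package.

```python
import math

# Covariance matrix copied from the sample output above.
cov = [
    [10.28752086, 1.51237819, -0.01770701, 1.47414685, 0.79121778],
    [1.51237819, 6.54969628, -0.5504233, -0.47174359, 1.39094876],
    [-0.01770701, -0.5504233, 7.05247111, 0.30499622, -1.32560195],
    [1.47414685, -0.47174359, 0.30499622, 10.16072256, 0.47264283],
    [0.79121778, 1.39094876, -1.32560195, 0.47264283, 8.08036806],
]

def cov_to_corr(cov):
    """Normalize a covariance matrix into a correlation matrix."""
    size = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(size)] for i in range(size)]

corr = cov_to_corr(cov)
```

The resulting entries agree with the correlation matrix shown earlier, for example `corr[0][1]` is about 0.1842.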

Correlation heatmap

model.corrplot(cmap='inferno',annot=True)

Statistical inference

Perform the F-test of overall significance

It returns the F-statistic and the p-value of the test.

If the p-value is small, you can reject the null hypothesis that all the regression coefficients are zero. In other words, a small p-value (generally < 0.01) indicates that the overall regression is statistically significant.

model.ftest()

>> (34.270912591948814, 2.3986657277649282e-12)
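
The F-statistic can be recovered from R-squared alone, which makes for a handy sanity check. The sketch below assumes n = 40 samples and p = 5 features, as in the Quick Start data; `f_statistic` is a hypothetical helper rather than a package function.

```python
def f_statistic(r2, n, p):
    """Overall-significance F-statistic from R^2, sample count n, and p predictors."""
    explained = r2 / p                        # explained variation per model degree of freedom
    unexplained = (1.0 - r2) / (n - p - 1)    # residual variation per residual degree of freedom
    return explained / unexplained

F = f_statistic(r2=0.8344327025902752, n=40, p=5)

# With SciPy installed, the p-value is the upper tail of the F distribution:
#   from scipy import stats
#   p_value = stats.f.sf(F, 5, 40 - 5 - 1)
```

The result is about 34.27, in line with the first element of the tuple above.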

How about p-values, t-test statistics, and standard errors of the coefficients?

Standard errors and the corresponding t-tests give us the p-value for each regression coefficient, which tells us whether that particular coefficient is statistically significant (based on the given data).

print("P-values:",model.pvalues())
print("t-test values:",model.tvalues())
print("Standard errors:",model.std_err())

>> P-values: [8.33674608e-01 3.27039586e-03 3.80572234e-05 2.59322037e-01 9.95094748e-11 2.82226752e-06]
   t-test values: [ 0.21161008  3.1641696  -4.73263963  1.14716519  9.18010412 -5.60342256]
   Standard errors: [5.69360847 0.47462621 0.59980706 0.56580141 0.47081187 0.5381103 ]

Confidence intervals

model.conf_int()

>> array([[-10.36597959,  12.77562953],
       [  0.53724132,   2.46635435],
       [ -4.05762528,  -1.61971606],
       [ -0.50077913,   1.79891449],
       [  3.36529718,   5.27890687],
       [ -4.10883113,  -1.92168771]])
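
The intervals, standard errors, and t-values above are tied together by simple arithmetic: each interval is centered on the coefficient estimate, the t-value is the estimate divided by its standard error, and the half-width is a fixed multiple of the standard error. The sketch below checks this on the printed numbers. Reading the multiplier of roughly 2.032 as the critical value of a 95% interval with 34 degrees of freedom is an inference from the output, not something the package documents here.

```python
# Values copied from the sample outputs above.
conf_int = [
    (-10.36597959, 12.77562953),
    (0.53724132, 2.46635435),
    (-4.05762528, -1.61971606),
    (-0.50077913, 1.79891449),
    (3.36529718, 5.27890687),
    (-4.10883113, -1.92168771),
]
std_err = [5.69360847, 0.47462621, 0.59980706, 0.56580141, 0.47081187, 0.5381103]

# Each interval is symmetric, so its midpoint is the coefficient estimate.
estimates = [(low + high) / 2 for low, high in conf_int]

# t-value = estimate / standard error.
t_values = [b / se for b, se in zip(estimates, std_err)]

# Half-width / standard error gives the critical value used for the interval.
multipliers = [(high - low) / 2 / se for (low, high), se in zip(conf_int, std_err)]
```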

Visual analysis of the residuals

Residual analysis is crucial for checking the assumptions of a linear regression model. mlr helps you check those assumptions easily by providing straightforward visual analysis methods for the residuals.

Fitted vs. residuals plot

Check the assumptions of constant variance (homoscedasticity) and independence of the error terms with this plot:

model.fitted_vs_residual()

Fitted vs features plot

Check the assumption of linearity with this plot

model.fitted_vs_features()

Histogram and Q-Q plot of standardized residuals

Check the normality assumption of the error terms using these plots,

model.histogram_resid()

model.qqplot_resid()
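
For reference, one common definition of standardized residuals divides each raw residual by its estimated standard deviation, using the leverage from the hat matrix. The sketch below follows that definition with NumPy; the helper name and demo data are hypothetical, and the package may standardize its residuals differently.

```python
import numpy as np

def standardized_residuals(X, y):
    """Return (standardized residuals, leverages) for an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Diagonal of the hat matrix H = A (A^T A)^-1 A^T gives the leverages.
    leverage = np.diag(A @ np.linalg.pinv(A))
    # Unbiased estimate of the error variance.
    s2 = (resid @ resid) / (n - A.shape[1])
    return resid / np.sqrt(s2 * (1.0 - leverage)), leverage

rng = np.random.default_rng(1)
X_res = 10 * rng.random(size=(40, 5))
y_res = X_res @ np.array([2, -3.5, 1.2, 4.1, -2.5]) + 10 * rng.normal(size=40)
std_resid, leverage = standardized_residuals(X_res, y_res)
```

If the normality assumption holds, most of these values should fall roughly between -2 and 2.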

Do more

Do more fun stuff with your regression model. More features will be added in future releases!

  • Outlier detection and plots
  • Multicollinearity checks
