Skip to main content

A Python package for sklearn to produce linear regression summary and diagnostic plots similar to those made in R with summary.lm and plot.lm

Project description

pyplotlm - R style linear regression summary and diagnostic plots for sklearn

This package is a reproduction of the summary.lm and plot.lm function in R but for a python environment and is meant to support the sklearn library by adding model summary and diagnostic plots for linear regression. In the R environment, we can fit a linear model and generate a model summary and diagnostic plots by doing the following:

> fit = lm(y ~ ., data=data)

> summary(fit)


Call:
lm(formula = y ~ ., data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-155.829  -38.534   -0.227   37.806  151.355

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  152.133      2.576  59.061  < 2e-16 ***
X0           -10.012     59.749  -0.168 0.867000    
X1          -239.819     61.222  -3.917 0.000104 ***
X2           519.840     66.534   7.813 4.30e-14 ***
X3           324.390     65.422   4.958 1.02e-06 ***
X4          -792.184    416.684  -1.901 0.057947 .  
X5           476.746    339.035   1.406 0.160389    
X6           101.045    212.533   0.475 0.634721    
X7           177.064    161.476   1.097 0.273456    
X8           751.279    171.902   4.370 1.56e-05 ***
X9            67.625     65.984   1.025 0.305998    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 54.15 on 431 degrees of freedom
Multiple R-squared:  0.5177,	Adjusted R-squared:  0.5066
F-statistic: 46.27 on 10 and 431 DF,  p-value: < 2.2e-16


> par(mfrow=c(2,2))
> plot(fit)


The goal of this package is to make this process as simple as it is in R for a sklearn LinearRegression object.

Install

pip install pyplotlm

Introduction

There are two core functionalities:

A. generate a R style regression model summary (R summary.lm)

B. plot six available diagnostic plots (R plot.lm):
1. Residuals vs Fitted
2. Normal Q-Q
3. Scale-Location
4. Cook's Distance
5. Residuals vs Leverage
6. Cook's Distance vs Leverage / (1-Leverage)

Usage

Below is how you would produce the summary and diagnostic plots in Python:

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import LinearRegression
>>> import matplotlib.pyplot as plt
>>> from pyplotlm import *

>>> X, y = load_diabetes(return_X_y=True)

>>> reg = LinearRegression().fit(X, y)

>>> obj = PyPlotLm(reg, X, y, intercept=False)
>>> obj.summary() # or summary(obj)
Residuals:
       Min        1Q   Median       3Q       Max
 -155.8290  -38.5339  -0.2269  37.8061  151.3550

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)     
(Intercept)   152.1335     2.5759  59.0614   0.0000  ***
X0            -10.0122    59.7492  -0.1676   0.8670     
X1           -239.8191    61.2223  -3.9172   0.0001  ***
X2            519.8398    66.5336   7.8132   0.0000  ***
X3            324.3904    65.4219   4.9584   0.0000  ***
X4           -792.1842   416.6839  -1.9012   0.0579  .  
X5            476.7458   339.0345   1.4062   0.1604     
X6            101.0446   212.5326   0.4754   0.6347     
X7            177.0642   161.4756   1.0965   0.2735     
X8            751.2793   171.9020   4.3704   0.0000  ***
X9             67.6254    65.9842   1.0249   0.3060     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.154 on 431 degrees of freedom
Multiple R-squared: 0.5177,     Adjusted R-squared: 0.5066
F-statistic: 46.27 on 10 and 431 DF,  p-value: 1.11e-16

>>> obj.plot() or plot(obj)
>>> plt.show()

This will produce the same set of diagnostic plots:

References:

  1. Regression Deletion Diagnostics (R)
    https://stat.ethz.ch/R-manual/R-devel/library/stats/html/influence.measures.html
    https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm
    https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/plot.lm

  2. Residuals and Influence in Regression
    https://conservancy.umn.edu/handle/11299/37076
    https://en.wikipedia.org/wiki/Leverage_(statistics)
    https://en.wikipedia.org/wiki/Studentized_residual

  3. Cook's Distance
    https://en.wikipedia.org/wiki/Cook%27s_distance

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyplotlm-0.1.4.tar.gz (9.0 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page