Skip to main content

A better generalized linear regression implementation!

Project description

lr_cd

A better implementation of linear regression in Python!

CI/CD codecov Documentation Status License: MIT version Python 3.9.0 release

Project Summary

We implement linear regression using the coordinate descent (CD) algorithm in this Python package. For additional details about the coordinate descent (CD) algorithm, please refer to the link.

Our package consists of three major components:

  1. Simulated data generation
  2. Coordinate descent algorithm
  3. Visualization of data and the fitted linear regression line

Functions

There are three major functions in this package:

  • generate_data_lr(n, n_features, theta, noise=0.2, random_seed=123): this function generates many random data points based on the theta coefficients, which will later be used for model fitting.
  • coordinate_descent(X, y, ϵ=1e-6, max_iterations=1000): this function performs coordinate descent to minimize the mean squared error of linear regression and therefore outputs the optimized intercept and coefficients vector.
  • plot_lr(X, y, intercept, coef): this function returns a scatter plot of the observed data points overlayed with a regression with optimized intercept and coefficients vector.

Common Parameters

  • n (integer): Number of data points.
  • n_features (integer): Number of features to generate, excluding the intercept.
  • theta (ndarray): True scalar intercept and coefficient weights vector. The first element should always be the intercept.
  • noise (float): Standard deviation of a normal distribution added to the generated target y array as noise.
  • random_seed (integer): Random seed to ensure reproducibility.
  • X (ndarray): Feature data matrix, the independent variable.
  • y (ndarray): Response data vector, the dependent variable. Both X and y should have the same number of observations.
  • ϵ (float, optional): Stop the algorithm if the change in weights is smaller than the value (default is 1e-6).
  • max_iterations (int, optional): Maximum number of iterations (default is 1000).
  • intercept (float): Optimized intercept. It will be used to calculate the estimated values using observed data X.
  • coef (ndarray): Optimized coefficient weights vector. It will be used to calculate the estimated values using observed data X.

Python Ecosystem Context

lr_cd establishes itself as a valuable enhancement to the Python ecosystem. The LinearRegression in the Python package scikit-learn has similar functionality, but our implementation uses a different algorithm, which we believe is superior. sklearn.linear_model.LinearRegression contains a few optimization functions: scipy.linalg.lstsq, scipy.sparse.linalg.lsqr, and scipy.optimize.nnls, which rely on the singular value decomposition of feature matrix X.

  • Beginner-Friendly : lr_cd is easy to use for beginners in Python and statistics. The well-written functions allow users to play with various simulated data and generate beautiful plots with little effort.

  • Reliable-Alternative : The coordinate descent algorithm is known for fast convergence in various convex optimization problems, making this Python package a reliable alternative to existed packages. The package can be easily extended to a list of statistical models such as Ridge Regression and Lasso Regression.

Installation

Prerequisites

Make sure Miniconda or Anaconda is installed on your system

Step 1: Clone the Repository

git clone git@github.com:UBC-MDS/lr_cd.git
cd lr_cd  # Navigate to the cloned repository directory

Step 2: Create and Activate the Conda Environment

# Method 1: create Conda Environment from the environment.yml file
conda env create -f environment.yml  # Create Conda environment
conda activate lr_cd  # Activate the Conda environment

# Method 2: create Conda Environment 
conda create --name lr_cd python=3.9 -y
conda activate lr_cd

Step 3: Install the Package Using Poetry

Ensure the Conda environment is activated (you should see (lr_cd) in the terminal prompt)

poetry install  # Install the package using Poetry

Step 4: Get the coverage

# Check line coverage
pytest --cov=lr_cd

# Check branch coverage
pytest --cov-branch --cov=lr_cd
poetry run pytest --cov-branch --cov=src
poetry run pytest --cov-branch --cov=lr_cd --cov-report html

Troubleshooting

  1. Environment Creation Issues: Ensure environment.yml is in the correct directory and you have the correct Conda version

  2. Poetry Installation Issues: Verify Poetry is correctly installed in the Conda environment and your pyproject.toml file is properly configured

Usage

Use this package to find the optimized intercept and coefficients vector of linear regression. In the following example, we generate a simulated data set with a feature matrix and response first. By the coordinate descent algorithm, we obtain the optimized intercept and coefficients. Finally, we visualize both the simulated data and fitted line in one figure.

Example usage:

>>> from lr_cd.lr_data_generation import generate_data_lr
>>> import numpy as np
>>> theta = np.array([4, 3])
>>> X, y = generate_data_lr(n=10, n_features=1, theta=theta)
>>> print('Generated X:')
>>> print(X)
>>> print('Generated y:')
>>> print(y)
Generated X:
[[0.69646919]
 [0.28613933]
 [0.22685145]
 [0.55131477]
 [0.71946897]
 [0.42310646]
 [0.9807642 ]
 [0.68482974]
 [0.4809319 ]
 [0.39211752]]
Generated y:
[[6.34259481]
 [4.68506992]
 [4.54477713]
 [5.63500251]
 [6.45668483]
 [5.14153898]
 [6.8534962 ]
 [5.96761896]
 [5.88398172]
 [5.61370977]]
>>> from lr_cd.lr_cd import coordinate_descent
>>> intercept, coef, _ = coordinate_descent(X, y)
>>> print(f"lr_cd Intercept for example: {intercept}")
>>> print(f"lr_cd Coefficients for example: {coef}")
lr_cd Intercept for example: 4.0240072117306145
lr_cd Coefficients for example: [[3.10261496]]
>>> from lr_cd.lr_plotting import plot_lr
>>> plot_lr(X, y, intercept, coef)

Documentations

Online documentation is available readthedocs.

Publishing on TestPyPi and PyPi.

Contributors

Andy Zhang for algorithm, Sam Fo for data generation, and Jing Wen for visualization.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

lr_cd was created by Sam Fo, Jing Wen, Andy Zhang. It is licensed under the terms of the MIT license.

Credits

lr_cd was created with cookiecutter and the py-pkgs-cookiecutter template.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lr_cd-0.3.6.tar.gz (7.1 kB view hashes)

Uploaded Source

Built Distribution

lr_cd-0.3.6-py3-none-any.whl (8.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page