
Statistical computations and models for use with SciPy

Project description

What it is

Statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics and estimation and inference for statistical models.

Documentation for the 0.4 version is currently at

Main Features

  • linear regression models: Generalized least squares (including weighted least squares and least squares with autoregressive errors), ordinary least squares.

  • glm: Generalized linear models with support for all of the one-parameter exponential family distributions.

  • discrete: regression with discrete dependent variables, including Logit, Probit, MNLogit, Poisson, based on maximum likelihood estimators

  • rlm: Robust linear models with support for several M-estimators.

  • tsa: models for time series analysis

    • univariate time series analysis: AR, ARIMA

    • vector autoregressive models, VAR and structural VAR

    • descriptive statistics and process models for time series analysis

  • nonparametric: (univariate) kernel density estimators

  • datasets: Datasets to be distributed and used for examples and in testing.

  • stats: a wide range of statistical tests

    • diagnostics and specification tests

    • goodness-of-fit and normality tests

    • functions for multiple testing

    • various additional statistical tests

  • iolib

    • tools for reading Stata .dta files into numpy arrays

    • printing table output to ASCII, LaTeX, and HTML

  • miscellaneous models

  • sandbox: statsmodels contains a sandbox folder with code in various stages of development and testing which is not considered “production ready”. This covers, among others, mixed (repeated measures) models, GARCH models, generalized method of moments (GMM) estimators, kernel regression, various extensions to scipy.stats.distributions, panel data models, generalized additive models, and information theoretic measures.
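To make the least-squares items in the feature list concrete, here is a minimal pure-Python sketch of weighted least squares for a single regressor plus intercept. It is an illustration only, not statsmodels code: the function name `wls_fit` and its signature are assumptions made for this example. With equal weights it reduces to ordinary least squares.

```python
def wls_fit(x, y, weights=None):
    """Fit y = intercept + slope * x by weighted least squares.

    Hypothetical helper for illustration; statsmodels exposes its own
    model classes for this. Equal weights give the OLS solution.
    """
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(weights)
    # Weighted means of the regressor and the response.
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    # Weighted sums of squares and cross-products around the means.
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

Giving an observation a weight of zero removes its influence on the fit, which is the basic idea behind downweighting noisy observations.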

Where to get it

The master branch on GitHub has the most up-to-date code.

Source downloads of release tags are available on GitHub.

Binaries and source distributions are available from PyPI.

Installation from sources

See INSTALL.txt for requirements or see the documentation


Modified BSD (3-clause)


The official documentation is hosted on SourceForge

Windows Help

The source distribution for Windows includes an HTML Help file (statsmodels.chm). This can be opened from the Python interpreter:

>>> import statsmodels.api as sm
>>> sm.open_help()

Discussion and Development

Discussions take place on our mailing list.

We are very interested in feedback about usability and suggestions for improvements.

Bug Reports

Bug reports can be submitted to the issue tracker at

Release History


Main Changes and Additions

  • Added pandas dependency.

  • Cython source is built automatically if Cython and a compiler are present

  • Support use of dates in timeseries models

  • Improved plots

    • Violin plots

    • Bean plots

    • QQ plots

  • Added lowess function

  • Support for pandas Series and DataFrame objects. Results instances return pandas objects if the models are fit using pandas objects.

  • Full Python 3 compatibility

  • Fix bugs in genfromdta. Convert Stata .dta format to structured array preserving all types. Conversion is much faster now.

  • Improved documentation

  • Models and results are pickleable via save/load, optionally saving the model data.

  • Kernel Density Estimation now uses Cython and is considerably faster.

  • Diagnostics for outlier and influence statistics in OLS

  • Added El Nino Sea Surface Temperatures dataset

  • Numerous bug fixes

  • Internal code refactoring

  • Improved documentation including examples as part of HTML
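The save/load item above, with its option to keep or drop the model data, can be sketched with the standard pickle module. This is a hedged illustration, not the statsmodels implementation: `save_results`, `load_results`, the `remove_data` flag, and the dictionary layout are all assumptions made for this example.

```python
import pickle


def save_results(results, path, remove_data=False):
    """Pickle a results dictionary, optionally dropping the bulky data.

    Hypothetical helper: `results` is assumed to be a dict with a
    "params" entry and a "data" entry.
    """
    to_store = dict(results)  # shallow copy so the caller's dict is untouched
    if remove_data:
        to_store["data"] = None
    with open(path, "wb") as handle:
        pickle.dump(to_store, handle)


def load_results(path):
    """Load a results dictionary previously written by save_results."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Dropping the data keeps the pickle small when only the estimated parameters are needed later.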

Changes that break backwards compatibility

  • Deprecated scikits namespace. The recommended import is now:

    import statsmodels.api as sm

  • The signature of model.predict methods is now (params, exog, …), where before it assumed that the model had been fit and omitted the params argument.

  • For consistency with other multi-equation models, the parameters of MNLogit are now transposed.

  • -> distributions.ECDF

  • -> distributions.monotone_fn_inverter

  • -> distributions.StepFunction
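The change to the predict signature listed above can be illustrated with a toy linear model. This is only a sketch under stated assumptions: the class `ToyLinearModel` is hypothetical and is not part of statsmodels. The point is that predict receives the parameters explicitly rather than relying on a previous fit.

```python
class ToyLinearModel:
    """Hypothetical model used only to illustrate predict(params, exog)."""

    def __init__(self, exog):
        # Design matrix as a list of rows.
        self.exog = exog

    def predict(self, params, exog=None):
        """Return the linear prediction for each row of exog.

        params is passed explicitly, so the model does not need to have
        been fit. If exog is omitted, the model's own design is used.
        """
        if exog is None:
            exog = self.exog
        return [sum(p * value for p, value in zip(params, row))
                for row in exog]
```

For example, `ToyLinearModel([[1.0, 0.0]]).predict([1.0, 2.0], [[1.0, 3.0]])` evaluates 1.0 * 1.0 + 2.0 * 3.0 for the single new row.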


  • Removed academic-only WFS dataset.

  • Fix easy_install issue on Windows.


Changes that break backwards compatibility

Added for importing. So the new convention for importing is:

import statsmodels.api as sm

Importing from modules directly now avoids unnecessary imports and increases the import speed if a library or user only needs specific functions.

  • sandbox/ -> iolib/

  • lib/ -> iolib/ (Now contains Stata .dta format reader)

  • family -> families

  • families.links.inverse -> families.links.inverse_power

  • Datasets’ Load class is now a load function.

  • -> regression/

  • -> discrete/

  • -> robust/

  • -> genmod/

  • -> base/

  • t() method -> tvalues attribute (t() still exists but raises a warning)
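The last item, where t() keeps working but raises a warning in favor of the tvalues attribute, follows a common deprecation pattern. A minimal sketch, assuming a hypothetical `ToyResults` class and warning text (neither is taken from statsmodels):

```python
import warnings


class ToyResults:
    """Hypothetical results class illustrating a method-to-attribute rename."""

    def __init__(self, params, bse):
        self.params = params  # parameter estimates
        self.bse = bse        # standard errors of the estimates

    @property
    def tvalues(self):
        """t statistics: each estimate divided by its standard error."""
        return [p / s for p, s in zip(self.params, self.bse)]

    def t(self):
        """Old spelling, kept for compatibility; warns and delegates."""
        warnings.warn(
            "t() is deprecated; use the tvalues attribute instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.tvalues
```

Keeping the old method as a thin wrapper lets existing code run unchanged while pointing users to the new attribute.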

Main changes and additions

  • Numerous bugfixes.

  • Time Series Analysis model (tsa)

    • Vector Autoregression Models VAR (tsa.VAR)

    • Autoregressive Models AR (tsa.AR)

    • Autoregressive Moving Average Models ARMA (tsa.ARMA); optionally uses Cython for Kalman filtering (install with the option --with-cython)

    • Baxter-King band-pass filter (tsa.filters.bkfilter)

    • Hodrick-Prescott filter (tsa.filters.hpfilter)

    • Christiano-Fitzgerald filter (tsa.filters.cffilter)

  • Improved maximum likelihood framework uses all available scipy.optimize solvers

  • Refactor of the datasets sub-package.

  • Added more datasets for examples.

  • Removed RPy dependency for running the test suite.

  • Refactored the test suite.

  • Refactored codebase/directory structure.

  • Support for offset and exposure in GLM.

  • Removed data_weights argument for Binomial models.

  • New statistical tests, especially diagnostic and specification tests

  • Multiple test correction

  • Generalized method of moments (GMM) framework in sandbox

  • Improved documentation

  • and other additions
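As an illustration of what the "multiple test correction" item refers to, the sketch below adjusts a list of p-values with the Bonferroni and Holm procedures. It is a pure-Python example written for this description, not the statsmodels implementation; the function names are assumptions.

```python
def bonferroni_adjust(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm_adjust(pvalues):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

Holm's adjusted p-values are never larger than Bonferroni's, so it rejects at least as many hypotheses at the same level.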


Main changes

  • renames for more consistency

    • RLM.fitted_values -> RLM.fittedvalues

    • GLMResults.resid_dev -> GLMResults.resid_deviance

  • GLMResults, RegressionResults: lazy calculations, convert attributes to properties with _cache

  • fix tests to run without rpy

  • expanded examples in examples directory

  • add PyDTA to – functions for reading Stata .dta binary files and converting them to numpy arrays

  • made tools.categorical much more robust

  • add_constant now takes a prepend argument

  • fix GLS to work with only a one column design
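The "lazy calculations" item above, where attributes become properties backed by a _cache, can be sketched with a small descriptor. This is an assumption-laden illustration rather than the statsmodels code: `cached_attribute` and `DemoResults` are hypothetical names chosen for this example.

```python
class cached_attribute:
    """Descriptor that computes a value on first access and caches it."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Each instance carries its own _cache dictionary.
        cache = obj.__dict__.setdefault("_cache", {})
        if self.name not in cache:
            cache[self.name] = self.func(obj)
        return cache[self.name]


class DemoResults:
    """Hypothetical results class with one lazily computed attribute."""

    def __init__(self, resid):
        self.resid = resid
        self.calls = 0  # counts how often the computation actually runs

    @cached_attribute
    def ssr(self):
        """Sum of squared residuals, computed only when first requested."""
        self.calls += 1
        return sum(r * r for r in self.resid)
```

Results objects often expose many derived statistics; computing them lazily avoids paying for the ones a user never looks at.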


  • add four new datasets

    • A dataset from the American National Election Studies (1996)

    • Grunfeld (1950) investment data

    • Spector and Mazzeo (1980) program effectiveness data

    • A US macroeconomic dataset

  • add four new Maximum Likelihood Estimators for models with discrete dependent variables, with examples

    • Logit

    • Probit

    • MNLogit (multinomial logit)

    • Poisson
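To show what maximum likelihood estimation means for the simplest of these models, here is a hedged pure-Python sketch of a Logit fit with an intercept and one regressor, using Newton-Raphson steps. It is not the statsmodels estimator; `fit_logit`, its arguments, and the fixed iteration count are assumptions made for this illustration, and it presumes the data are not perfectly separable.

```python
import math


def fit_logit(x, y, iterations=25):
    """Estimate (intercept, slope) of a logit model by Newton-Raphson.

    Hypothetical helper for illustration: x is a list of floats and y a
    list of 0/1 outcomes. Assumes the outcomes are not perfectly
    separated by x, otherwise the estimates diverge.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (information matrix)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

At the maximum the gradient is zero, so the fitted probabilities sum to the number of observed successes.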


  • add qqplot in

  • add sandbox.tsa (time series analysis) and sandbox.regression (anova)

  • add principal component analysis in

  • add Seemingly Unrelated Regression (SUR) and Two-Stage Least Squares for systems of equations in sandbox.sysreg.Sem2SLS

  • add restricted least squares (RLS)


  • initial release
