Statistical factor analysis in Python
Introduction
Prince is a library for doing factor analysis. This includes a variety of methods including principal component analysis (PCA) and correspondence analysis (CA). The goal is to provide an efficient implementation for each algorithm along with a nice API.
Installation
:warning: Prince is only compatible with Python 3.
:snake: Although it isn't a requirement, using Anaconda is highly recommended.
Via PyPI
>>> pip install prince # doctest: +SKIP
Via GitHub for the latest development version
>>> pip install git+https://github.com/MaxHalford/Prince # doctest: +SKIP
Prince doesn't have any extra dependencies apart from the usual suspects (sklearn, pandas, matplotlib), which are all included with Anaconda.
Usage
Guidelines
Each estimator provided by prince extends scikit-learn's TransformerMixin. This means that each estimator implements a fit and a transform method, which makes them usable in a transformation pipeline. The transform method is actually an alias for the row_principal_coordinates method, which returns the row principal coordinates. You can also access the column principal coordinates with the column_principal_coordinates method.
Under the hood Prince uses a randomised version of the SVD. This is much faster than the more common full approach. However, the results may have a small inherent randomness. For most applications this doesn't matter and you shouldn't have to worry about it. However, if you want reproducible results then you should set the random_state parameter.
The randomised version of the SVD is an iterative method. Because each of Prince's algorithms uses the SVD, they all possess an n_iter parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher n_iter is, the more precise the results will be. On the other hand, increasing n_iter increases the computation time. In general the algorithm converges very quickly, so using a low n_iter (which is the default behaviour) is recommended.
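Because each estimator implements fit and transform, you can drop one straight into a scikit-learn pipeline. The following is a minimal sketch; the KMeans step and the pipeline wiring are our own illustration, not something Prince itself provides:
>>> import pandas as pd
>>> import prince
>>> from sklearn import cluster, datasets, pipeline
>>> X, _ = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(X)
>>> pipe = pipeline.make_pipeline(
...     prince.PCA(n_components=2, n_iter=3, random_state=42),  # random_state makes the SVD reproducible
...     cluster.KMeans(n_clusters=3, random_state=42)  # illustrative downstream step
... )
>>> labels = pipe.fit_predict(X)  # PCA's transform output feeds the clustering step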
The following papers give a good overview of the field of factor analysis if you want to go deeper:
- A Tutorial on Principal Component Analysis
- Theory of Correspondence Analysis
- Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- Computation of Multiple Correspondence Analysis, with code in R
- Singular Value Decomposition Tutorial
Principal component analysis (PCA)
PCA assumes you have a dataframe of continuous numerical variables. In this example we're going to use the Iris flower dataset.
>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(data=X, columns=['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
>>> y = pd.Series(y).map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'})
>>> X.head()
Sepal length Sepal width Petal length Petal width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
The PCA class implements scikit-learn's fit/transform API. Its parameters have to be passed at initialisation, before calling the fit method.
>>> pca = prince.PCA(
... n_components=2,
... n_iter=3,
... rescale_with_mean=True,
... rescale_with_std=True,
... copy=True,
... engine='auto',
... random_state=42
... )
>>> pca = pca.fit(X)
The available parameters are:
- n_components: the number of components that are computed. You only need two if your intention is to make a chart.
- n_iter: the number of iterations used for computing the SVD.
- rescale_with_mean: whether to subtract each column's mean (see the sketch after this list).
- rescale_with_std: whether to divide each column by its standard deviation.
- copy: if False then the computations will be done in place, which can have side effects on the input data.
- engine: which SVD engine to use (should be one of ['auto', 'fbpca', 'sklearn']).
- random_state: controls the randomness of the SVD results.
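For intuition, the two rescale_with_* parameters together amount to standardising each column before computing the SVD. A rough scikit-learn equivalent, shown purely to illustrate what the defaults do (this is not how Prince is implemented internally):
>>> from sklearn import preprocessing
>>> Z = preprocessing.scale(X)  # subtract each column's mean and divide by its standard deviation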
Once the PCA has been fitted, it can be used to extract the row principal coordinates like so:
>>> pca.transform(X).head() # Same as pca.row_principal_coordinates(X).head()
0 1
0 -2.264542 0.505704
1 -2.086426 -0.655405
2 -2.367950 -0.318477
3 -2.304197 -0.575368
4 -2.388777 0.674767
Each column stands for a principal component whilst each row stands for a row in the original dataset. You can display these projections with the plot_row_principal_coordinates method:
>>> ax = pca.plot_row_principal_coordinates(
... X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... labels=None,
... group_labels=y,
... ellipse_outline=False,
... ellipse_fill=True,
... show_points=True
... )
>>> ax.get_figure().savefig('images/pca_row_principal_coordinates.png')
Each principal component explains part of the underlying variance of the distribution. You can see by how much by accessing the explained_inertia_ property:
>>> pca.explained_inertia_ # doctest: +ELLIPSIS
[0.727704..., 0.230305...]
The explained inertia represents the percentage of the inertia each principal component contributes. It sums up to 1 if the n_components property is equal to the number of columns in the original dataset. The explained inertia is obtained by dividing the eigenvalues obtained with the SVD by the total inertia, both of which are also accessible.
>>> pca.eigenvalues_ # doctest: +ELLIPSIS
[436.622712..., 138.183139...]
>>> pca.total_inertia_
600.0
>>> pca.explained_inertia_ # doctest: +ELLIPSIS
[0.727704..., 0.230305...]
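As a quick sanity check, you can recompute the explained inertia from these two quantities yourself:
>>> [eig / pca.total_inertia_ for eig in pca.eigenvalues_] # doctest: +ELLIPSIS
[0.727704..., 0.230305...]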
You can also obtain the correlations between the original variables and the principal components.
>>> pca.column_correlations(X)
0 1
Sepal length 0.891224 0.357352
Sepal width -0.449313 0.888351
Petal length 0.991684 0.020247
Petal width 0.964996 0.062786
You may also want to know how much each observation contributes to each principal component. This can be done with the row_component_contributions method.
>>> pca.row_component_contributions(X).head()
0 1
0 0.011745 0.001851
1 0.009970 0.003109
2 0.012842 0.000734
3 0.012160 0.002396
4 0.013069 0.003295
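By definition the contributions to a given principal component should sum to 1 across all the rows; if in doubt you can verify this yourself (output omitted):
>>> pca.row_component_contributions(X).sum() # doctest: +SKIP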
Correspondence analysis (CA)
You should use correspondence analysis when you want to analyse a contingency table; in other words, when you want to analyse the dependencies between two categorical variables. The following example comes from section 17.2.3 of this textbook. It shows the number of occurrences of each hair color and eye color combination.
>>> import pandas as pd
>>> pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
>>> X = pd.DataFrame(
... data=[
... [326, 38, 241, 110, 3],
... [688, 116, 584, 188, 4],
... [343, 84, 909, 412, 26],
... [98, 48, 403, 681, 85]
... ],
... columns=pd.Series(['Fair', 'Red', 'Medium', 'Dark', 'Black']),
... index=pd.Series(['Blue', 'Light', 'Medium', 'Dark'])
... )
>>> X
Fair Red Medium Dark Black
Blue 326 38 241 110 3
Light 688 116 584 188 4
Medium 343 84 909 412 26
Dark 98 48 403 681 85
Unlike the PCA class, the CA class only exposes scikit-learn's fit method.
>>> import prince
>>> ca = prince.CA(
... n_components=2,
... n_iter=3,
... copy=True,
... engine='auto',
... random_state=42
... )
>>> X.columns.rename('Hair color', inplace=True)
>>> X.index.rename('Eye color', inplace=True)
>>> ca = ca.fit(X)
The parameters and methods overlap with those proposed by the PCA class.
>>> ca.row_principal_coordinates(X)
0 1
Blue -0.400300 -0.165411
Light -0.440708 -0.088463
Medium 0.033614 0.245002
Dark 0.702739 -0.133914
>>> ca.column_principal_coordinates(X)
0 1
Fair -0.543995 -0.173844
Red -0.233261 -0.048279
Medium -0.042024 0.208304
Dark 0.588709 -0.103950
Black 1.094388 -0.286437
You can plot both sets of principal coordinates with the plot_principal_coordinates method.
>>> ax = ca.plot_principal_coordinates(
... X=X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... show_row_labels=True,
... show_col_labels=True
... )
>>> ax.get_figure().savefig('images/ca_principal_coordinates.png')
As with the PCA class, you can access the inertia contribution of each principal component as well as the eigenvalues and the total inertia.
>>> ca.eigenvalues_ # doctest: +ELLIPSIS
[0.199244..., 0.030086...]
>>> ca.total_inertia_ # doctest: +ELLIPSIS
0.230191...
>>> ca.explained_inertia_ # doctest: +ELLIPSIS
[0.865562..., 0.130703...]
Multiple correspondence analysis (MCA)
Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is simply to compute the one-hot encoded version of a dataset and apply CA to it. As an example we're going to use the balloons dataset taken from the UCI datasets website.
>>> import pandas as pd
>>> X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
>>> X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
>>> X.head()
Color Size Action Age Inflated
0 YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH CHILD F
2 YELLOW SMALL DIP ADULT F
3 YELLOW SMALL DIP CHILD F
4 YELLOW LARGE STRETCH ADULT T
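To make the one-hot encoding idea concrete, here is the expanded representation that MCA effectively decomposes. Prince performs this encoding internally, so the following pandas snippet is only an illustration:
>>> one_hot = pd.get_dummies(X)  # one binary column per category
>>> list(one_hot.columns)[:5]
['Color_PURPLE', 'Color_YELLOW', 'Size_LARGE', 'Size_SMALL', 'Action_DIP']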
The MCA class also implements the fit and transform methods.
>>> import prince
>>> mca = prince.MCA(
... n_components=2,
... n_iter=3,
... copy=True,
... engine='auto',
... random_state=42
... )
>>> mca = mca.fit(X)
As usual you can retrieve the row and column principal components via their respective methods.
>>> mca.row_principal_coordinates(X).head()
0 1
0 0.705387 0.000000
1 -0.386586 0.000000
2 -0.386586 0.000000
3 -0.852014 0.000000
4 0.783539 -0.633333
>>> mca.column_principal_coordinates(X).head()
0 1
Color_PURPLE 0.117308 0.689202
Color_YELLOW -0.130342 -0.765780
Size_LARGE 0.117308 -0.689202
Size_SMALL -0.130342 0.765780
Action_DIP -0.853864 -0.000000
Like the CA class, the MCA class also has a plot_principal_coordinates method.
>>> ax = mca.plot_principal_coordinates(
... X=X,
... ax=None,
... figsize=(6, 6),
... show_row_points=True,
... row_points_size=10,
... show_row_labels=False,
... show_column_points=True,
... column_points_size=30,
... show_column_labels=False,
... legend_n_cols=1
... )
>>> ax.get_figure().savefig('images/mca_principal_coordinates.png')
The eigenvalues and inertia values are also accessible.
>>> mca.eigenvalues_ # doctest: +ELLIPSIS
[0.401656..., 0.211111...]
>>> mca.total_inertia_
1.0
>>> mca.explained_inertia_ # doctest: +ELLIPSIS
[0.401656..., 0.211111...]
Going faster
By default prince uses sklearn's randomised SVD implementation (the one used under the hood for TruncatedSVD). One of the goals of Prince is to make it possible to use different SVD backends. For now, the only other supported backend is Facebook's randomised SVD implementation, called fbpca. You can use it by setting the engine parameter to 'fbpca':
>>> import prince
>>> pca = prince.PCA(engine='fbpca')
If you are using Anaconda then you should be able to install fbpca without any pain by running pip install fbpca.
Incoming features
I've got a lot on my hands aside from prince, so feel free to give me a hand!
- Factor Analysis of Mixed Data (FAMD)
- Generalized Procrustes Analysis (GPA)
- Multiple Factor Analysis (MFA)
License
The MIT License (MIT). Please see the license file for more information.