Statistical factor analysis in Python
Prince is a library for factor analysis. It covers a variety of methods, including principal component analysis (PCA) and correspondence analysis (CA). The goal is to provide an efficient implementation of each algorithm along with a scikit-learn API.
Installation
:warning: Prince is only compatible with Python 3.
:snake: Although it isn't a requirement, using Anaconda is highly recommended.
Via PyPI
>>> pip install prince # doctest: +SKIP
Via GitHub for the latest development version
>>> pip install git+https://github.com/MaxHalford/Prince # doctest: +SKIP
Prince doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `matplotlib`), which are included with Anaconda.
Usage
>>> import numpy as np; np.random.seed(42) # This is for doctest reproducibility
Guidelines
Each estimator provided by `prince` extends scikit-learn's `TransformerMixin`. This means that each estimator implements a `fit` and a `transform` method, which makes them usable in a transformation pipeline. The `transform` method is an alias for the `row_coordinates` method, which returns the row principal coordinates. Estimators that also compute column principal coordinates, such as `CA`, expose them through the `column_coordinates` method.
Under the hood Prince uses a randomised version of SVD. This is much faster than the more common full approach. However, the results may have a small inherent randomness. For most applications this doesn't matter and you shouldn't have to worry about it. If you want reproducible results, set the `random_state` parameter.
The randomised version of SVD is an iterative method. Because each of Prince's algorithms uses SVD, they all possess an `n_iter` parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher `n_iter` is, the more precise the results will be. On the other hand, increasing `n_iter` increases the computation time. In general the algorithm converges very quickly, so using a low `n_iter` (which is the default behaviour) is recommended.
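To make the roles of `n_iter` and `random_state` concrete, here is a minimal NumPy sketch of a randomised SVD in the spirit of the "Finding structure with randomness" paper listed further down. It is an illustration under stated assumptions, not Prince's actual code: the function name `randomized_svd_sketch` and the oversampling value are hypothetical choices made for this example.

```python
import numpy as np


def randomized_svd_sketch(A, n_components, n_iter=3, random_state=None):
    """Approximate the leading singular triplets of A (illustrative only)."""
    rng = np.random.default_rng(random_state)
    n_oversamples = 10  # hypothetical choice: a few extra random directions
    k = min(n_components + n_oversamples, min(A.shape))

    # Sample the range of A with a random Gaussian test matrix.
    Q = np.linalg.qr(A @ rng.normal(size=(A.shape[1], k)))[0]

    # Each power iteration sharpens the estimate of the dominant subspace.
    # Re-orthonormalising with QR keeps the computation numerically stable.
    for _ in range(n_iter):
        Q = np.linalg.qr(A @ (A.T @ Q))[0]

    # Solve a small exact SVD in the reduced space and map back.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :n_components], s[:n_components], Vt[:n_components]


# A low-rank test matrix plus a little noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 30))
A += 0.01 * rng.normal(size=A.shape)

_, s_approx, _ = randomized_svd_sketch(A, n_components=2, random_state=42)
_, s_again, _ = randomized_svd_sketch(A, n_components=2, random_state=42)
s_exact = np.linalg.svd(A, compute_uv=False)[:2]

print(s_approx, s_exact)                   # approximate vs. exact singular values
print(np.array_equal(s_approx, s_again))   # a fixed seed gives identical results
```

Only a handful of power iterations are needed here because the singular values of the test matrix decay quickly, which is the situation the low default `n_iter` is aimed at.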
Which method to use depends on your situation:
- All your variables are numeric: use principal component analysis (`prince.PCA`)
- You have a contingency table: use correspondence analysis (`prince.CA`)
- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`prince.MCA`)
- You have groups of categorical or numerical variables: use multiple factor analysis (`prince.MFA`)
- You have both categorical and numerical variables: use factor analysis of mixed data (`prince.FAMD`)
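The guidelines above can be written down as a small decision helper. This is only a sketch: `suggest_method` and its arguments are hypothetical names made up for this example and are not part of Prince.

```python
def suggest_method(kinds, is_contingency_table=False, has_groups=False):
    """Return the name of the Prince estimator suggested by the guidelines.

    `kinds` lists the type of every variable: 'numeric' or 'categorical'.
    """
    if not kinds:
        raise ValueError('at least one variable is required')
    if is_contingency_table:
        return 'CA'
    if has_groups:
        return 'MFA'
    unique_kinds = set(kinds)
    if unique_kinds == {'numeric'}:
        return 'PCA'
    if unique_kinds == {'categorical'}:
        # Two categorical variables can be cross-tabulated and analysed with CA.
        return 'MCA' if len(kinds) > 2 else 'CA'
    return 'FAMD'


print(suggest_method(['numeric'] * 4))                         # PCA
print(suggest_method(['categorical'] * 5))                     # MCA
print(suggest_method(['categorical', 'numeric', 'numeric']))   # FAMD
```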
The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
- A Tutorial on Principal Component Analysis
- Theory of Correspondence Analysis
- Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- Computation of Multiple Correspondence Analysis, with code in R
- Singular Value Decomposition Tutorial
- Multiple Factor Analysis
Principal component analysis (PCA)
If you're using PCA it is assumed you have a dataframe consisting of numerical continuous variables. In this example we're going to be using the Iris flower dataset.
>>> import pandas as pd
>>> import prince
>>> from sklearn import datasets
>>> X, y = datasets.load_iris(return_X_y=True)
>>> X = pd.DataFrame(data=X, columns=['Sepal length', 'Sepal width', 'Petal length', 'Petal width'])
>>> y = pd.Series(y).map({0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'})
>>> X.head()
Sepal length Sepal width Petal length Petal width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
The `PCA` class implements scikit-learn's `fit`/`transform` API. Its parameters have to be passed at initialisation, before calling the `fit` method.
>>> pca = prince.PCA(
... n_components=2,
... n_iter=3,
... rescale_with_mean=True,
... rescale_with_std=True,
... copy=True,
... check_input=True,
... engine='auto',
... random_state=42
... )
>>> pca = pca.fit(X)
The available parameters are:
- `n_components`: the number of components that are computed. You only need two if your intention is to make a chart.
- `n_iter`: the number of iterations used for computing the SVD.
- `rescale_with_mean`: whether to subtract each column's mean.
- `rescale_with_std`: whether to divide each column by its standard deviation.
- `copy`: if `False`, the computations are done in place, which can have side effects on the input data.
- `engine`: which SVD engine to use (one of `['auto', 'fbpca', 'sklearn']`).
- `random_state`: controls the randomness of the SVD results.
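As a rough illustration of what `rescale_with_mean` and `rescale_with_std` do, the sketch below rescales a small made-up array with NumPy. The helper name `rescale` is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption rather than something the text above specifies.

```python
import numpy as np


def rescale(X, with_mean=True, with_std=True):
    """Centre and/or scale each column, returning a new array."""
    X = np.asarray(X, dtype=float)
    if with_mean:
        X = X - X.mean(axis=0)   # subtract each column's mean
    if with_std:
        X = X / X.std(axis=0)    # divide by each column's standard deviation
    return X


# Made-up measurements, three rows and two columns.
data = np.array([[5.1, 3.5],
                 [4.9, 3.0],
                 [4.7, 3.2]])
Z = rescale(data)

print(Z.mean(axis=0))   # each column is centred around zero
print(Z.std(axis=0))    # each column has unit standard deviation
```

Because the helper builds new arrays, `data` is left untouched, which mirrors the behaviour described for `copy=True`.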
Once the `PCA` has been fitted, it can be used to extract the row principal coordinates like so:
>>> pca.transform(X).head() # Same as pca.row_coordinates(X).head()
0 1
0 2.264703 0.480027
1 2.080961 0.674134
2 2.364229 0.341908
3 2.299384 0.597395
4 2.389842 0.646835
Each column stands for a principal component, whilst each row stands for a row in the original dataset. You can display these projections with the `plot_row_coordinates` method:
>>> ax = pca.plot_row_coordinates(
... X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... labels=None,
... color_labels=y,
... ellipse_outline=False,
... ellipse_fill=True,
... show_points=True
... )
>>> ax.get_figure().savefig('images/pca_row_coordinates.png')
Each principal component explains part of the underlying distribution. You can see by how much by accessing the `explained_inertia_` property:
>>> pca.explained_inertia_ # doctest: +ELLIPSIS
[0.729624..., 0.228507...]
The explained inertia represents the proportion of the total inertia that each principal component contributes. It sums up to 1 if the `n_components` property is equal to the number of columns in the original dataset. The explained inertia is obtained by dividing the eigenvalues obtained with the SVD by the total inertia, both of which are also accessible.
>>> pca.eigenvalues_ # doctest: +ELLIPSIS
[437.774672..., 137.104570...]
>>> pca.total_inertia_ # doctest: +ELLIPSIS
600.0...
>>> pca.explained_inertia_ # doctest: +ELLIPSIS
[0.729624..., 0.228507...]
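The relation between eigenvalues, total inertia and explained inertia can be reproduced with plain NumPy. The sketch below uses made-up data and assumes that the eigenvalues are the squared singular values of the standardised matrix. That assumption is inferred from the total inertia of 600 reported above, which equals 150 rows times 4 columns, and not from Prince's source.

```python
import numpy as np

# Made-up data with correlated columns, same shape as the Iris dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))

# Standardise each column (population standard deviation).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

singular_values = np.linalg.svd(Z, compute_uv=False)
eigenvalues = singular_values ** 2
total_inertia = (Z ** 2).sum()   # n_rows * n_columns for standardised data
explained_inertia = eigenvalues / total_inertia

print(total_inertia)             # close to 600
print(explained_inertia.sum())   # close to 1 when all components are kept

# The same division applied to the values printed above.
print(437.774672 / 600.0, 137.104570 / 600.0)
```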
You can also obtain the correlations between the original variables and the principal components.
>>> pca.column_correlations(X)
0 1
Petal length 0.991555 0.023415
Petal width 0.964979 0.064000
Sepal length 0.890169 0.360830
Sepal width 0.460143 0.882716
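For intuition, the correlations between variables and components can be computed by hand. The sketch below is an assumption-laden illustration on made-up data, not Prince's implementation: it correlates each original column with each column of row coordinates and checks the result against the closed form that holds for a standardised PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))   # made-up data
n_rows = X.shape[0]

# Standardised PCA through a full SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
row_coords = U[:, :2] * s[:2]   # row principal coordinates on two components

# Pearson correlation between every variable and every component.
correlations = np.array([
    [np.corrcoef(X[:, j], row_coords[:, k])[0, 1] for k in range(2)]
    for j in range(X.shape[1])
])

# Closed form for standardised data: loading * singular value / sqrt(n).
closed_form = Vt[:2].T * s[:2] / np.sqrt(n_rows)

print(correlations)
print(np.allclose(correlations, closed_form))
```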
You may also want to know how much each observation contributes to each principal component. This can be done with the `row_contributions` method.
>>> pca.row_contributions(X).head()
0 1
0 0.011716 0.001681
1 0.009892 0.003315
2 0.012768 0.000853
3 0.012077 0.002603
4 0.013046 0.003052
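The numbers shown in this section are consistent with a simple relation: a row's contribution to a component is its squared coordinate divided by that component's eigenvalue. This is inferred from the printed values rather than taken from Prince's source, so treat the sketch below as a sanity check only. It reuses the first two rows of coordinates and the eigenvalues printed earlier.

```python
# Row coordinates of the first two observations, as printed above.
coordinates = [
    (2.264703, 0.480027),
    (2.080961, 0.674134),
]
# Eigenvalues of the two principal components, as printed above.
eigenvalues = (437.774672, 137.104570)

contributions = [
    tuple(coord ** 2 / eigenvalue for coord, eigenvalue in zip(row, eigenvalues))
    for row in coordinates
]

for row in contributions:
    print(['{:.6f}'.format(value) for value in row])
```

Because the coordinates are squared, the sign of a coordinate has no effect on the contribution.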
Correspondence analysis (CA)
You should be using correspondence analysis when you want to analyse a contingency table. In other words, you want to analyse the dependencies between two categorical variables. The following example comes from section 17.2.3 of this textbook. It shows the number of co-occurrences of each combination of hair and eye color.
>>> import pandas as pd
>>> pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
>>> X = pd.DataFrame(
... data=[
... [326, 38, 241, 110, 3],
... [688, 116, 584, 188, 4],
... [343, 84, 909, 412, 26],
... [98, 48, 403, 681, 85]
... ],
... columns=pd.Series(['Fair', 'Red', 'Medium', 'Dark', 'Black']),
... index=pd.Series(['Blue', 'Light', 'Medium', 'Dark'])
... )
>>> X
Fair Red Medium Dark Black
Blue 326 38 241 110 3
Light 688 116 584 188 4
Medium 343 84 909 412 26
Dark 98 48 403 681 85
Unlike the `PCA` class, the `CA` only exposes scikit-learn's `fit` method.
>>> import prince
>>> ca = prince.CA(
... n_components=2,
... n_iter=3,
... copy=True,
... check_input=True,
... engine='auto',
... random_state=42
... )
>>> X.columns.rename('Hair color', inplace=True)
>>> X.index.rename('Eye color', inplace=True)
>>> ca = ca.fit(X)
The parameters and methods overlap with those offered by the `PCA` class.
>>> ca.row_coordinates(X)
0 1
Blue 0.400300 0.165411
Light 0.440708 0.088463
Medium 0.033614 0.245002
Dark 0.702739 0.133914
>>> ca.column_coordinates(X)
0 1
Fair 0.543995 0.173844
Red 0.233261 0.048279
Medium 0.042024 0.208304
Dark 0.588709 0.103950
Black 1.094388 0.286437
You can plot both sets of principal coordinates with the `plot_coordinates` method.
>>> ax = ca.plot_coordinates(
... X=X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... show_row_labels=True,
... show_col_labels=True
... )
>>> ax.get_figure().savefig('images/ca_coordinates.png')
As with the `PCA`, you can access the inertia contribution of each principal component as well as the eigenvalues and the total inertia.
>>> ca.eigenvalues_ # doctest: +ELLIPSIS
[0.199244..., 0.030086...]
>>> ca.total_inertia_ # doctest: +ELLIPSIS
0.230191...
>>> ca.explained_inertia_ # doctest: +ELLIPSIS
[0.865562..., 0.130703...]
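For readers who want to see where these quantities come from, here is a minimal NumPy sketch of the textbook formulation of correspondence analysis applied to the same contingency table. It illustrates the standard method (SVD of the standardised residuals) and is not a copy of Prince's implementation, so small numerical differences with the randomised SVD are possible.

```python
import numpy as np

# The hair and eye color contingency table used above.
X = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
], dtype=float)

P = X / X.sum()      # correspondence matrix
r = P.sum(axis=1)    # row masses
c = P.sum(axis=0)    # column masses

# Standardised residuals with respect to the independence model.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

eigenvalues = s ** 2
total_inertia = eigenvalues.sum()   # chi-squared statistic divided by X.sum()
explained_inertia = eigenvalues / total_inertia

# Principal coordinates of rows and columns on the first two components.
row_coords = (U[:, :2] * s[:2]) / np.sqrt(r)[:, None]
col_coords = (Vt[:2].T * s[:2]) / np.sqrt(c)[:, None]

print(eigenvalues[:2], total_inertia, explained_inertia[:2])
```

The printed values should be close to the `eigenvalues_`, `total_inertia_` and `explained_inertia_` shown above.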
Multiple correspondence analysis (MCA)
Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is simply to compute the one-hot encoded version of a dataset and apply CA to it. As an example we're going to use the balloons dataset taken from the UCI datasets website.
>>> import pandas as pd
>>> X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
>>> X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
>>> X.head()
Color Size Action Age Inflated
0 YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH CHILD F
2 YELLOW SMALL DIP ADULT F
3 YELLOW SMALL DIP CHILD F
4 YELLOW LARGE STRETCH ADULT T
The `MCA` also implements the `fit` and `transform` methods.
>>> import prince
>>> mca = prince.MCA(
... n_components=2,
... n_iter=3,
... copy=True,
... check_input=True,
... engine='auto',
... random_state=42
... )
>>> mca = mca.fit(X)
Like the `CA` class, the `MCA` class also has a `plot_coordinates` method.
>>> ax = mca.plot_coordinates(
... X=X,
... ax=None,
... figsize=(6, 6),
... show_row_points=True,
... row_points_size=10,
... show_row_labels=False,
... show_column_points=True,
... column_points_size=30,
... show_column_labels=False,
... legend_n_cols=1
... )
>>> ax.get_figure().savefig('images/mca_coordinates.png')
The eigenvalues and inertia values are also accessible.
>>> mca.eigenvalues_ # doctest: +ELLIPSIS
[0.401656..., 0.211111...]
>>> mca.total_inertia_
1.0
>>> mca.explained_inertia_ # doctest: +ELLIPSIS
[0.401656..., 0.211111...]
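The "one-hot encode, then apply CA" idea can be sketched in a few lines of NumPy. The data below is made up for the example and the helper names are hypothetical. The sketch also checks a known property of CA on an indicator matrix: the total inertia equals J / K - 1, where J is the total number of categories and K the number of variables. With five two-level variables, as in the balloons dataset, that gives 10 / 5 - 1 = 1, which matches the `total_inertia_` shown above.

```python
import numpy as np

# Made-up categorical data: three variables with 3, 2 and 2 categories.
rows = [
    ('YELLOW', 'SMALL', 'STRETCH'),
    ('YELLOW', 'LARGE', 'DIP'),
    ('PURPLE', 'SMALL', 'DIP'),
    ('PURPLE', 'LARGE', 'STRETCH'),
    ('RED', 'SMALL', 'STRETCH'),
    ('RED', 'LARGE', 'DIP'),
]


def one_hot(rows):
    """Build the indicator matrix, one block of columns per variable."""
    blocks = []
    for values in zip(*rows):
        categories = sorted(set(values))
        blocks.append(np.array(
            [[float(value == category) for category in categories]
             for value in values]
        ))
    return np.hstack(blocks)


def ca_eigenvalues(table):
    """Eigenvalues of a plain correspondence analysis of `table`."""
    P = table / table.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2


indicator = one_hot(rows)
n_variables = len(rows[0])          # K
n_categories = indicator.shape[1]   # J
total_inertia = ca_eigenvalues(indicator).sum()

print(total_inertia, n_categories / n_variables - 1)
```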
Multiple factor analysis (MFA)
Multiple factor analysis (MFA) is meant to be used when you have groups of variables. In practice it builds a PCA on each group, or an MCA, depending on the types of the group's variables. It then constructs a global PCA on the results of the so-called partial PCAs (or MCAs). The dataset used in the following examples comes from this paper. In the dataset, three experts give their opinion on six different wines. Each opinion for each wine is recorded as a variable. We thus want to consider the separate opinions of each expert whilst also having a global overview of each wine. MFA is the perfect fit for this kind of situation.
First of all let's copy the data used in the paper.
>>> import pandas as pd
>>> X = pd.DataFrame(
... data=[
... [1, 6, 7, 2, 5, 7, 6, 3, 6, 7],
... [5, 3, 2, 4, 4, 4, 2, 4, 4, 3],
... [6, 1, 1, 5, 2, 1, 1, 7, 1, 1],
... [7, 1, 2, 7, 2, 1, 2, 2, 2, 2],
... [2, 5, 4, 3, 5, 6, 5, 2, 6, 6],
... [3, 4, 4, 3, 5, 4, 5, 1, 7, 5]
... ],
... columns=['E1 fruity', 'E1 woody', 'E1 coffee',
... 'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
... 'E3 fruity', 'E3 butter', 'E3 woody'],
... index=['Wine {}'.format(i+1) for i in range(6)]
... )
>>> X['Oak type'] = [1, 2, 2, 2, 1, 1]
The groups are passed as a dictionary to the `MFA` class.
>>> groups = {
... 'Expert #{}'.format(no+1): [c for c in X.columns if c.startswith('E{}'.format(no+1))]
... for no in range(3)
... }
>>> import pprint
>>> pprint.PrettyPrinter().pprint(groups)
{'Expert #1': ['E1 fruity', 'E1 woody', 'E1 coffee'],
'Expert #2': ['E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody'],
'Expert #3': ['E3 fruity', 'E3 butter', 'E3 woody']}
Now we can fit an `MFA`.
>>> import prince
>>> mfa = prince.MFA(
... groups=groups,
... n_components=2,
... n_iter=3,
... copy=True,
... check_input=True,
... engine='auto',
... random_state=42
... )
>>> mfa = mfa.fit(X)
The `MFA` inherits from the `PCA` class, which entails that you have access to all its methods and properties. The `row_coordinates` method will return the global coordinates of each wine.
>>> mfa.row_coordinates(X)
0 1
Wine 1 2.172155 0.508596
Wine 2 0.557017 0.197408
Wine 3 2.317663 0.830259
Wine 4 1.832557 0.905046
Wine 5 1.403787 0.054977
Wine 6 1.131296 0.576241
Just as with the `PCA`, you can plot the row coordinates with the `plot_row_coordinates` method.
>>> ax = mfa.plot_row_coordinates(
... X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... labels=X.index,
... color_labels=['Oak type {}'.format(t) for t in X['Oak type']],
... ellipse_outline=False,
... ellipse_fill=True,
... show_points=True
... )
>>> ax.get_figure().savefig('images/mfa_row_coordinates.png')
You can also obtain the row coordinates inside each group. The `partial_row_coordinates` method returns a `pandas.DataFrame` where the set of columns is a `pandas.MultiIndex`. The first level of indexing corresponds to each specified group, whilst the nested level indicates the coordinates inside each group.
>>> mfa.partial_row_coordinates(X) # doctest: +NORMALIZE_WHITESPACE
Expert #1 Expert #2 Expert #3
0 1 0 1 0 1
Wine 1 2.764432 1.104812 2.213928 0.863519 1.538106 0.442545
Wine 2 0.773034 0.298919 0.284247 0.132135 0.613771 0.759009
Wine 3 1.991398 0.805893 2.111508 0.499718 2.850084 3.796390
Wine 4 1.981456 0.927187 2.393009 1.227146 1.123206 0.560803
Wine 5 1.292834 0.620661 1.492114 0.488088 1.426414 1.273679
Wine 6 0.688623 0.306527 1.082723 0.243122 1.622541 2.278372
Likewise, you can visualize the partial row coordinates with the `plot_partial_row_coordinates` method.
>>> ax = mfa.plot_partial_row_coordinates(
... X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... color_labels=['Oak type {}'.format(t) for t in X['Oak type']]
... )
>>> ax.get_figure().savefig('images/mfa_partial_row_coordinates.png')
As usual you have access to inertia information.
>>> mfa.eigenvalues_ # doctest: +ELLIPSIS
[2.834800..., 0.356859...]
>>> mfa.total_inertia_ # doctest: +ELLIPSIS
3.353004...
>>> mfa.explained_inertia_ # doctest: +ELLIPSIS
[0.845450..., 0.106429...]
You can also access information concerning each partial factor analysis via the `partial_factor_analysis_` attribute.
>>> for name, fa in sorted(mfa.partial_factor_analysis_.items()): # doctest: +ELLIPSIS
... print('{} eigenvalues: {}'.format(name, fa.eigenvalues_))
Expert #1 eigenvalues: [2.862595..., 0.119836...]
Expert #2 eigenvalues: [3.651083..., 0.194159...]
Expert #3 eigenvalues: [2.480488..., 0.441195...]
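To show how the partial analyses feed into the global one, here is a NumPy sketch of a common formulation of MFA for numerical groups, applied to the wine ratings above. Each column is centred and scaled to unit norm, each group is divided by its first singular value, and a global SVD is run on the concatenated result. This is an assumption about the method that happens to be consistent with the numbers printed above, not a transcription of Prince's code, and the variable names are made up for the example.

```python
import numpy as np

# The expert ratings used above, one row per wine.
X = np.array([
    [1, 6, 7, 2, 5, 7, 6, 3, 6, 7],
    [5, 3, 2, 4, 4, 4, 2, 4, 4, 3],
    [6, 1, 1, 5, 2, 1, 1, 7, 1, 1],
    [7, 1, 2, 7, 2, 1, 2, 2, 2, 2],
    [2, 5, 4, 3, 5, 6, 5, 2, 6, 6],
    [3, 4, 4, 3, 5, 4, 5, 1, 7, 5],
], dtype=float)

# Column positions of each expert's ratings.
groups = {
    'Expert #1': [0, 1, 2],
    'Expert #2': [3, 4, 5, 6],
    'Expert #3': [7, 8, 9],
}


def normalise(block):
    """Centre each column and scale it to unit norm."""
    centred = block - block.mean(axis=0)
    return centred / np.sqrt((centred ** 2).sum(axis=0))


partial_eigenvalues = {}
weighted_blocks = []
for name, columns in groups.items():
    Z = normalise(X[:, columns])
    first_singular_value = np.linalg.svd(Z, compute_uv=False)[0]
    partial_eigenvalues[name] = first_singular_value ** 2
    # Dividing by the first singular value stops any group from dominating.
    weighted_blocks.append(Z / first_singular_value)

Z_global = np.hstack(weighted_blocks)
eigenvalues = np.linalg.svd(Z_global, compute_uv=False) ** 2
total_inertia = (Z_global ** 2).sum()

print(partial_eigenvalues)
print(eigenvalues[:2], total_inertia)
```

After the weighting step every group has a first eigenvalue of 1, so the first global eigenvalue lies between 1 and the number of groups.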
Factor analysis of mixed data (FAMD)
A description is on its way. This section is empty because I have to refactor the documentation a bit.
>>> import pandas as pd
>>> X = pd.DataFrame(
... data=[
... ['A', 'A', 'A', 2, 5, 7, 6, 3, 6, 7],
... ['A', 'A', 'A', 4, 4, 4, 2, 4, 4, 3],
... ['B', 'A', 'B', 5, 2, 1, 1, 7, 1, 1],
... ['B', 'A', 'B', 7, 2, 1, 2, 2, 2, 2],
... ['B', 'B', 'B', 3, 5, 6, 5, 2, 6, 6],
... ['B', 'B', 'A', 3, 5, 4, 5, 1, 7, 5]
... ],
... columns=['E1 fruity', 'E1 woody', 'E1 coffee',
... 'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
... 'E3 fruity', 'E3 butter', 'E3 woody'],
... index=['Wine {}'.format(i+1) for i in range(6)]
... )
>>> X['Oak type'] = [1, 2, 2, 2, 1, 1]
Now we can fit an `FAMD`.
>>> import prince
>>> famd = prince.FAMD(
... n_components=2,
... n_iter=3,
... copy=True,
... check_input=True,
... engine='auto',
... random_state=42
... )
>>> famd = famd.fit(X.drop('Oak type', axis='columns')) # No need for 'Oak type'
The `FAMD` inherits from the `MFA` class, which entails that you have access to all its methods and properties. The `row_coordinates` method will return the global coordinates of each wine.
>>> famd.row_coordinates(X)
0 1
Wine 1 3.351475 4.278852
Wine 2 3.396873 4.135743
Wine 3 4.777638 1.643254
Wine 4 4.769714 1.665251
Wine 5 3.779385 3.053543
Wine 6 3.465413 0.304409
Just as with the `MFA`, you can plot the row coordinates with the `plot_row_coordinates` method.
>>> ax = famd.plot_row_coordinates(
... X,
... ax=None,
... figsize=(6, 6),
... x_component=0,
... y_component=1,
... labels=X.index,
... color_labels=['Oak type {}'.format(t) for t in X['Oak type']],
... ellipse_outline=False,
... ellipse_fill=True,
... show_points=True
... )
>>> ax.get_figure().savefig('images/famd_row_coordinates.png')
Going faster
By default `prince` uses `sklearn`'s randomized SVD implementation (the one used under the hood for `TruncatedSVD`). One of the goals of Prince is to make it possible to use a different SVD backend. For now, the only other supported backend is Facebook's randomized SVD implementation, called fbpca. You can use it by setting the `engine` parameter to `'fbpca'`:
>>> import prince
>>> pca = prince.PCA(engine='fbpca')
If you are using Anaconda then you should be able to install `fbpca` without any pain by running `pip install fbpca`.
License
The MIT License (MIT). Please see the license file for more information.