Light Factor Analysis of Mixed Data
Light_FAMD
Light_FAMD is a library for performing factor analysis of mixed data. This includes a variety of methods, including principal component analysis (PCA) and multiple correspondence analysis (MCA). The goal is to provide an efficient and lightweight implementation of each algorithm along with a scikit-learn-style API.
Light_FAMD doesn't have any extra dependencies apart from the usual suspects (sklearn, pandas, numpy), which are included with Anaconda.
Guidelines
Each base estimator (CA, PCA) provided by Light_FAMD extends scikit-learn's TransformerMixin and BaseEstimator, which means you can directly use the fit_transform, set_params and get_params methods.
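For instance, here is a minimal sketch (using random data) of that scikit-learn-style workflow:

>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import PCA
>>> X = pd.DataFrame(np.random.random(size=(10, 3)), columns=list('ABC'))
>>> pca = PCA(n_components=2)
>>> Z = pca.fit_transform(X)        # fit and transform in one call
>>> params = pca.get_params()       # plain dict of constructor arguments
>>> pca = pca.set_params(n_iter=5)  # returns the estimator itself, as in scikit-learn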
Under the hood Light_FAMD uses a randomised version of SVD. This algorithm finds a (usually very good) approximate truncated singular value decomposition using randomisation to speed up the computations. It is particularly fast on large matrices from which you only wish to extract a small number of components. To obtain a further speed-up, n_iter can be set to 2 or less, at the cost of some precision. If you want reproducible results, set the random_state parameter.
The randomised version of SVD is an iterative method. Because each of Light_FAMD's algorithms uses SVD, they all possess an n_iter parameter which controls the number of iterations used for computing the SVD. The higher n_iter is, the more precise the results will be, but the longer the computation takes. In general the algorithm converges very quickly, so using a low n_iter (the default behaviour) is recommended.
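In practice this trade-off is just two constructor arguments; the same parameters exist on every estimator in the package, and fixing random_state makes the randomised SVD reproducible across runs:

>>> from light_famd import PCA
>>> fast_pca = PCA(n_components=2, n_iter=2, random_state=42)      # quicker, slightly less precise
>>> precise_pca = PCA(n_components=2, n_iter=10, random_state=42)  # slower, more precise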
In this package the inheritance relationships are as shown below (A -> B means A is the superclass of B):
- PCA -> MFA -> FAMD
- CA -> MCA
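This hierarchy is visible at runtime (a small check based on the relationships above; it assumes all five classes are importable from the top-level package, as in the examples below):

>>> from light_famd import PCA, CA, MCA, MFA, FAMD
>>> issubclass(MFA, PCA), issubclass(FAMD, MFA)
(True, True)
>>> issubclass(MCA, CA)
True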
You should choose a method depending on your situation (a hypothetical dispatch helper following this rule is sketched after the list):

- All your variables are numeric: use principal component analysis (PCA)
- You have a contingency table: use correspondence analysis (CA)
- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (MCA)
- You have groups of categorical or numerical variables: use multiple factor analysis (MFA)
- You have both categorical and numerical variables: use factor analysis of mixed data (FAMD)
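The sketch below turns this decision rule into code. choose_estimator is a hypothetical helper written for illustration, not part of Light_FAMD; the contingency-table case for CA is left out, since it cannot be detected from dtypes alone:

>>> import pandas as pd
>>> from light_famd import PCA, MCA, FAMD
>>> def choose_estimator(df, **kwargs):
...     n_numeric = df.select_dtypes(include='number').shape[1]
...     if n_numeric == df.shape[1]:
...         return PCA(**kwargs)   # all columns numeric
...     if n_numeric == 0:
...         return MCA(**kwargs)   # all columns categorical
...     return FAMD(**kwargs)      # mixed numeric and categorical
...
>>> choose_estimator(pd.DataFrame({'A': [1.0, 2.0], 'B': ['x', 'y']}), n_components=2)  # returns a FAMD instance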
The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:
- A Tutorial on Principal Component Analysis
- Theory of Correspondence Analysis
- Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions
- Computation of Multiple Correspondence Analysis, with code in R
- Singular Value Decomposition Tutorial
- Multiple Factor Analysis
Note that Light_FAMD does not support sparse input; see Truncated_FAMD for an alternative that handles sparse and big data.
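So if your data arrives as a scipy sparse matrix, it has to be densified before fitting. A sketch (only feasible when the dense copy fits in memory; otherwise Truncated_FAMD is the intended alternative):

>>> import pandas as pd
>>> from scipy import sparse
>>> from light_famd import PCA
>>> S = sparse.random(10, 3, density=0.5, format='csr')
>>> X = pd.DataFrame(S.toarray(), columns=list('ABC'))  # dense copy of the sparse matrix
>>> pca = PCA(n_components=2).fit(X)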
Principal-Component-Analysis: PCA
PCA(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3, copy=True, check_input=True, random_state=None, engine='auto'):
Args:
- rescale_with_mean (bool): Whether to subtract each column's mean or not.
- rescale_with_std (bool): Whether to divide each column by its standard deviation or not.
- n_components (int): The number of principal components to compute.
- n_iter (int): The number of iterations used for computing the SVD.
- copy (bool): Whether to perform the computations in place or not.
- check_input (bool): Whether to check the consistency of the inputs or not.
- engine (string): "auto" for scikit-learn's randomized_svd, or "fbpca" for Facebook's randomized SVD implementation.
- random_state (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)  # for doctest reproducibility
>>> from light_famd import PCA
>>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
random_state=None, rescale_with_mean=True, rescale_with_std=True)
>>> print(pca.explained_variance_)
[20.20385109 8.48246239]
>>> print(pca.explained_variance_ratio_)
[0.6734617029875277, 0.28274874633810754]
>>> print(pca.column_correlation(X))  # Pearson correlation between each component and each original column; entries with p-value >= 0.05 are NaN
0 1
A -0.953482 NaN
B 0.907314 NaN
C NaN 0.84211
>>> print(pca.transform(X))
[[-0.82262005 0.11730656]
[ 0.05359079 1.62298683]
[ 1.03052849 0.79973099]
[-0.24313366 0.25651395]
[-0.94630387 -1.04943025]
[-0.70591749 -0.01282583]
[-0.39948373 -1.52612436]
[ 2.70164194 0.38048482]
[-2.49373351 0.53655273]
[ 1.8254311 -1.12519545]]
>>> print(pca.fit_transform(X))
[[-0.82262005 0.11730656]
[ 0.05359079 1.62298683]
[ 1.03052849 0.79973099]
[-0.24313366 0.25651395]
[-0.94630387 -1.04943025]
[-0.70591749 -0.01282583]
[-0.39948373 -1.52612436]
[ 2.70164194 0.38048482]
[-2.49373351 0.53655273]
[ 1.8254311 -1.12519545]]
Correspondence-Analysis: CA
CA(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None, engine='auto'):
Args:
- n_components (int): The number of principal components to compute.
- n_iter (int): The number of iterations used for computing the SVD.
- copy (bool): Whether to perform the computations in place or not.
- check_input (bool): Whether to check the consistency of the inputs or not.
- engine (string): "auto" for scikit-learn's randomized_svd, or "fbpca" for Facebook's randomized SVD implementation.
- random_state (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns: ndarray of shape (M, k), where M is the number of samples and k is the number of components.
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import CA
>>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> ca = CA(n_components=2, n_iter=2)
>>> ca.fit(X)
CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
random_state=None)
>>> print(ca.explained_variance_)
[0.16892141 0.0746376 ]
>>> print(ca.explained_variance_ratio_)
[0.5650580210934917, 0.2496697790527281]
>>> print(ca.transform(X))
[[ 0.23150854 -0.39167802]
[ 0.36006095 0.00301414]
[-0.48192602 -0.13002647]
[-0.06333533 -0.21475652]
[-0.16438708 -0.10418312]
[-0.38129126 -0.16515196]
[ 0.2721296 0.46923757]
[ 0.82953753 0.20638333]
[-0.500007 0.36897935]
[ 0.57932474 -0.1023383 ]]
>>> print(ca.fit_transform(X))
[[ 0.23150854 -0.39167802]
[ 0.36006095 0.00301414]
[-0.48192602 -0.13002647]
[-0.06333533 -0.21475652]
[-0.16438708 -0.10418312]
[-0.38129126 -0.16515196]
[ 0.2721296 0.46923757]
[ 0.82953753 0.20638333]
[-0.500007 0.36897935]
[ 0.57932474 -0.1023383 ]]
Multiple-Correspondence-Analysis: MCA
MCA class inherits from CA class.
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import MCA
>>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
>>> print(X)
A B C D
0 d e a d
1 e d b b
2 e d a e
3 b b e d
4 b d b b
5 c b a e
6 e d b a
7 d c d d
8 b c d a
9 a e c c
>>> mca = MCA(n_components=2)
>>> mca.fit(X)
MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
random_state=None)
>>> print(mca.explained_variance_)
[0.90150495 0.76979456]
>>> print(mca.explained_variance_ratio_)
[0.24040131974598467, 0.20527854948955893]
>>> print(mca.transform(X))
[[ 0.55603013 0.7016272 ]
[-0.73558629 -1.17559462]
[-0.44972794 -0.4973024 ]
[-0.16248444 0.95706908]
[-0.66969377 -0.79951057]
[-0.21267777 0.39953562]
[-0.67921667 -0.8707747 ]
[ 0.05058625 1.34573057]
[-0.31952341 0.77285922]
[ 2.62229391 -0.83363941]]
>>> print(mca.fit_transform(X))
[[ 0.55603013 0.7016272 ]
[-0.73558629 -1.17559462]
[-0.44972794 -0.4973024 ]
[-0.16248444 0.95706908]
[-0.66969377 -0.79951057]
[-0.21267777 0.39953562]
[-0.67921667 -0.8707747 ]
[ 0.05058625 1.34573057]
[-0.31952341 0.77285922]
[ 2.62229391 -0.83363941]]
Multiple-Factor-Analysis: MFA
MFA class inherits from PCA class.
Since the FAMD class inherits from MFA, and the only difference from its superclass MFA is that FAMD determines the groups parameter itself, we skip this chapter and go directly to FAMD; a minimal MFA sketch follows for completeness.
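The format of groups shown here (a mapping from group name to the list of column names it contains) is an assumption, since this README does not spell it out; check the signature of MFA in your installed version:

>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import MFA
>>> X = pd.DataFrame(np.random.random(size=(10, 4)), columns=list('ABCD'))
>>> groups = {'first': ['A', 'B'], 'second': ['C', 'D']}  # assumed format: group name -> columns
>>> mfa = MFA(groups=groups, n_components=2)
>>> Z = mfa.fit_transform(X)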
Factor-Analysis-of-Mixed-Data: FAMD
The FAMD class inherits from the MFA class, which means you have access to all the methods and properties of the MFA class.
>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import FAMD
>>> X_n = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
>>> X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('CDEF'))
>>> X = pd.concat([X_n, X_c], axis=1)
>>> print(X)
A B C D E F
0 96 19 b d b e
1 11 46 b d a e
2 0 89 a a a c
3 13 63 c a e d
4 37 36 d b e c
5 10 99 a b d c
6 76 2 c a d e
7 32 5 c a e d
8 49 9 c e e e
9 4 22 c c b d
>>> famd = FAMD(n_components=2)
>>> famd.fit(X)
MCA PROCESS ELIMINATED 0 COLUMNS SINCE THEIR MISS_RATES >= 99%
FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
random_state=None)
>>> print(famd.explained_variance_)
[17.40871219 9.73440949]
>>> print(famd.explained_variance_ratio_)
[0.32596621039327284, 0.1822701494502082]
>>> print(famd.column_correlation(X))
0 1
A NaN NaN
B NaN NaN
C_a NaN NaN
C_b NaN 0.824458
C_c 0.922220 NaN
C_d NaN NaN
D_a NaN NaN
D_b NaN NaN
D_c NaN NaN
D_d NaN 0.824458
D_e NaN NaN
E_a NaN NaN
E_b NaN NaN
E_d NaN NaN
E_e NaN NaN
F_c NaN -0.714447
F_d 0.673375 NaN
F_e NaN 0.839324
>>> print(famd.transform(X))
[[ 2.23848136 5.75809647]
[ 2.0845175 4.78930072]
[ 2.6682068 -2.78991262]
[ 6.2962962 -1.57451325]
[ 2.52140085 -3.28279729]
[ 1.58256681 -3.73135011]
[ 5.19476759 1.18333717]
[ 6.35288446 -1.33186723]
[ 5.02971134 1.6216402 ]
[ 4.05754963 0.69620997]]
>>> print(famd.fit_transform(X))
MCA PROCESS HAVE ELIMINATE 0 COLUMNS SINCE ITS MISSING RATE >= 99%
[[ 2.23848136 5.75809647]
[ 2.0845175 4.78930072]
[ 2.6682068 -2.78991262]
[ 6.2962962 -1.57451325]
[ 2.52140085 -3.28279729]
[ 1.58256681 -3.73135011]
[ 5.19476759 1.18333717]
[ 6.35288446 -1.33186723]
[ 5.02971134 1.6216402 ]
[ 4.05754963 0.69620997]]
Going faster
By default light_famd uses sklearn's randomized SVD implementation. One of the goals of Light_FAMD is to make it possible to use different SVD backends. For the time being, the only other supported backend is Facebook's randomized SVD implementation, called fbpca. You can use it by setting the engine parameter to 'fbpca'. Alternatively, see Truncated_FAMD for automatic selection of the svd_solver depending on the structure of the input:
>>> import light_famd
>>> pca = light_famd.PCA(engine='fbpca')