Light Factor Analysis of Mixed Data

Light_FAMD

Light_FAMD is a library for performing factor analysis of mixed data. It covers a variety of methods, including principal component analysis (PCA) and multiple correspondence analysis (MCA). The goal is to provide an efficient and light implementation of each algorithm along with a scikit-learn API.

Usage

Light_FAMD doesn't have any extra dependencies apart from the usual suspects (sklearn, pandas, numpy) which are included with Anaconda.

Guidelines

Each base estimator (CA, PCA) provided by Light_FAMD extends scikit-learn's TransformerMixin and BaseEstimator, which means you can directly use the fit_transform, set_params and get_params methods.
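To illustrate what extending these two scikit-learn base classes provides, here is a minimal sketch. ColumnScaler is a hypothetical transformer written only for this example and is not part of Light_FAMD; it only defines fit and transform, and the other methods come from the mixins.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnScaler(TransformerMixin, BaseEstimator):
    """Hypothetical transformer for illustration; not part of Light_FAMD."""

    def __init__(self, with_std=True):
        self.with_std = with_std

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # Fall back to a scale of 1 when standardisation is switched off.
        self.scale_ = X.std(axis=0) if self.with_std else np.ones(X.shape[1])
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
scaler = ColumnScaler()
params = scaler.get_params()        # provided by BaseEstimator
Z = scaler.fit_transform(X)         # provided by TransformerMixin
scaler.set_params(with_std=False)   # provided by BaseEstimator
```

The same three calls should be available on the Light_FAMD estimators, since they share these base classes.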

Under the hood, Light_FAMD uses a randomised version of SVD. This algorithm finds a (usually very good) approximate truncated singular value decomposition, using randomization to speed up the computations. It is particularly fast on large matrices from which you wish to extract only a small number of components. To obtain a further speed-up, n_iter can be set to 2 or less (at the cost of some precision). Because the algorithm is randomized, set the random_state parameter if you want reproducible results.
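As a small sketch of the reproducibility point, the snippet below calls scikit-learn's randomized_svd directly (the function the default engine is described as using) on made-up data. Passing the same random_state twice gives the same decomposition; the matrix size and seed values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 50))  # made-up data for illustration

# Same seed, same n_iter: the two approximations are identical.
U1, s1, Vt1 = randomized_svd(X, n_components=3, n_iter=2, random_state=42)
U2, s2, Vt2 = randomized_svd(X, n_components=3, n_iter=2, random_state=42)

print(np.allclose(s1, s2))
```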

The randomised version of SVD is an iterative method. Because each of light_famd's algorithms uses SVD, they all have an n_iter parameter which controls the number of iterations used for computing the SVD. On the one hand, the higher n_iter is, the more precise the results will be. On the other hand, increasing n_iter increases the computation time. In general the algorithm converges very quickly, so using a low n_iter (which is the default behaviour) is recommended.
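To make the role of n_iter concrete, here is a NumPy-only sketch of a randomized SVD with power iterations. It is a textbook-style illustration, not the code used by sklearn or Light_FAMD; the function name, the oversampling value and the test matrix are all assumptions made for this example.

```python
import numpy as np


def randomized_svd_sketch(A, n_components, n_iter=3, n_oversamples=10, seed=0):
    """Illustrative randomized SVD; not the library's implementation."""
    rng = np.random.default_rng(seed)
    k = n_components + n_oversamples
    # Project A onto a random subspace to capture its dominant range.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], k)))
    # Each power iteration sharpens the subspace at the cost of two products.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :n_components], s[:n_components], Vt[:n_components]


# Made-up matrix: rank-5 signal plus a little noise.
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((200, 5)))
V0, _ = np.linalg.qr(rng.standard_normal((100, 5)))
A = U0 @ np.diag([10.0, 9.0, 8.0, 7.0, 6.0]) @ V0.T
A = A + 0.01 * rng.standard_normal((200, 100))

s_exact = np.linalg.svd(A, compute_uv=False)[:5]
for n_iter in (0, 2, 4):
    _, s_approx, _ = randomized_svd_sketch(A, n_components=5, n_iter=n_iter)
    # Largest relative error on the leading singular values.
    print(n_iter, np.max(np.abs(s_approx - s_exact) / s_exact))
```

On a matrix like this one, with a clear gap after the leading singular values, a handful of iterations is typically enough, which is in line with the recommendation to keep n_iter low.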

The inheritance relationships in this package are shown below (A -> B means A is the superclass of B):

PCA -> MFA -> FAMD
CA -> MCA

Which method to use depends on your situation:

All your variables are numeric: use principal component analysis (PCA)
You have a contingency table: use correspondence analysis (CA)
You have more than 2 variables and they are all categorical: use multiple correspondence analysis (MCA)
You have groups of categorical or numerical variables: use multiple factor analysis (MFA)
You have both categorical and numerical variables: use factor analysis of mixed data (FAMD)
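The rules above can be sketched as a small helper. suggest_method is a hypothetical function written for this example and is not part of Light_FAMD. Whether a table is a contingency table, or whether its columns form groups, cannot be read from the dtypes, so those two cases are passed in as flags.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_method(df, is_contingency_table=False, has_groups=False):
    """Hypothetical helper mapping a DataFrame to a method name."""
    if is_contingency_table:
        return "CA"
    if has_groups:
        return "MFA"
    numeric = [is_numeric_dtype(dtype) for dtype in df.dtypes]
    if all(numeric):
        return "PCA"
    if not any(numeric):
        return "MCA"
    return "FAMD"


numeric_df = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})
categorical_df = pd.DataFrame({"C": list("abc"), "D": list("xyz"), "E": list("uvw")})
mixed_df = pd.concat([numeric_df, categorical_df], axis=1)

print(suggest_method(numeric_df))      # PCA
print(suggest_method(categorical_df))  # MCA
print(suggest_method(mixed_df))        # FAMD
```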

The next subsections give an overview of each method along with usage information. The following papers give a good overview of the field of factor analysis if you want to go deeper:

Note that Light_FAMD doesn't support sparse input; see Truncated_FAMD for an alternative that handles sparse and large data.

Principal-Component-Analysis: PCA

PCA(rescale_with_mean=True, rescale_with_std=True, n_components=2, n_iter=3, copy=True, check_input=True, random_state=None, engine='auto'):

Args:

rescale_with_mean (bool): Whether to subtract each column's mean or not.
rescale_with_std (bool): Whether to divide each column by its standard deviation or not.
n_components (int): The number of principal components to compute.
n_iter (int): The number of iterations used for computing the SVD.
copy (bool): Whether to work on a copy of the input (True) or to perform the computations in place (False).
check_input (bool): Whether to check the consistency of the inputs or not.
engine (string): "auto" uses sklearn's randomized_svd; "fbpca" uses Facebook's randomized SVD implementation.
random_state (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator used by the randomized SVD. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Return: ndarray of shape (M, K), where M is the number of samples and K is the number of components.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(42)  # This is for doctests reproducibility

>>> from light_famd import PCA
>>> X = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)), columns=list('ABC'))
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=3,
  random_state=None, rescale_with_mean=True, rescale_with_std=True)

>>> print(pca.explained_variance_)
[20.20385109  8.48246239]

>>> print(pca.explained_variance_ratio_)
[0.6734617029875277, 0.28274874633810754]
>>> # Pearson correlation between each component and each original column;
>>> # correlations whose p-value is >= 0.05 are reported as NaN.
>>> print(pca.column_correlation(X))
          0        1
A -0.953482      NaN
B  0.907314      NaN
C       NaN  0.84211

>>> print(pca.transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
>>> print(pca.fit_transform(X))
[[-0.82262005  0.11730656]
 [ 0.05359079  1.62298683]
 [ 1.03052849  0.79973099]
 [-0.24313366  0.25651395]
 [-0.94630387 -1.04943025]
 [-0.70591749 -0.01282583]
 [-0.39948373 -1.52612436]
 [ 2.70164194  0.38048482]
 [-2.49373351  0.53655273]
 [ 1.8254311  -1.12519545]]
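For readers who want to see what the rescaling options and the explained variance correspond to, here is a NumPy-only sketch of PCA through an exact SVD. It is an illustration under assumptions, not Light_FAMD's code: the data are made up and the exact SVD stands in for the randomized one. With both rescaling options on and a population standard deviation, the squared singular values sum to (number of rows) x (number of columns), which appears consistent with the output printed above, where each ratio equals the explained variance divided by 30 = 10 x 3.

```python
import numpy as np

# Made-up data: 6 samples, 3 numeric columns.
X = np.array([[2, 7, 1],
              [5, 3, 8],
              [9, 1, 4],
              [4, 6, 6],
              [7, 2, 3],
              [1, 8, 9]], dtype=float)

# rescale_with_mean / rescale_with_std: centre and scale each column
# (population standard deviation, ddof=0, is an assumption here).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Exact SVD used in place of the randomized one.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

explained_variance = s ** 2
total_inertia = Z.shape[0] * Z.shape[1]
explained_variance_ratio = explained_variance / total_inertia

# Row coordinates on the components (equal to Z projected on the axes).
scores = U * s

print(explained_variance_ratio)
```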

Correspondence-Analysis: CA

CA(n_components=2, n_iter=10, copy=True, check_input=True, random_state=None, engine='auto'):

Args:

n_components (int): The number of principal components to compute.
n_iter (int): The number of iterations used for computing the SVD.
copy (bool): Whether to work on a copy of the input (True) or to perform the computations in place (False).
check_input (bool): Whether to check the consistency of the inputs or not.
engine (string): "auto" uses sklearn's randomized_svd; "fbpca" uses Facebook's randomized SVD implementation.
random_state (int, RandomState instance or None, optional, default=None): The seed of the pseudo-random number generator used by the randomized SVD. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Return: ndarray of shape (M, K), where M is the number of samples and K is the number of components.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import CA
>>> X = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> ca = CA(n_components=2, n_iter=2)
>>> ca.fit(X)
CA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=2,
  random_state=None)

>>> print(ca.explained_variance_)
[0.16892141 0.0746376 ]

>>> print(ca.explained_variance_ratio_)
[0.5650580210934917, 0.2496697790527281]

>>> print(ca.transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]

>>> print(ca.fit_transform(X))
[[ 0.23150854 -0.39167802]
 [ 0.36006095  0.00301414]
 [-0.48192602 -0.13002647]
 [-0.06333533 -0.21475652]
 [-0.16438708 -0.10418312]
 [-0.38129126 -0.16515196]
 [ 0.2721296   0.46923757]
 [ 0.82953753  0.20638333]
 [-0.500007    0.36897935]
 [ 0.57932474 -0.1023383 ]]
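As background for the numbers above, here is a NumPy-only sketch of textbook correspondence analysis on a small made-up contingency table. It follows the standard formulation (standardised residuals followed by an SVD) and uses an exact SVD; Light_FAMD's implementation may differ in details such as scaling conventions.

```python
import numpy as np

# Made-up contingency table (4 rows, 3 columns of counts).
N = np.array([[20, 5, 10],
              [8, 15, 12],
              [3, 9, 25],
              [14, 14, 6]], dtype=float)

P = N / N.sum()        # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardised residuals from the independence model r * c^T.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, s, Vt = np.linalg.svd(S, full_matrices=False)
eigenvalues = s ** 2                # inertia carried by each axis
total_inertia = eigenvalues.sum()   # equals chi-square statistic / grand total

# Row principal coordinates.
row_coords = (U * s) / np.sqrt(r)[:, None]

print(eigenvalues / total_inertia)
```

In this formulation the inertia of an axis is the mass-weighted sum of squared row coordinates, so the coordinates themselves are larger than the eigenvalues suggest.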

Multiple-Correspondence-Analysis: MCA

The MCA class inherits from the CA class.

>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import MCA
>>> X = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('ABCD'))
>>> print(X)
   A  B  C  D
0  d  e  a  d
1  e  d  b  b
2  e  d  a  e
3  b  b  e  d
4  b  d  b  b
5  c  b  a  e
6  e  d  b  a
7  d  c  d  d
8  b  c  d  a
9  a  e  c  c
>>> mca = MCA(n_components=2)
>>> mca.fit(X)
MCA(check_input=True, copy=True, engine='auto', n_components=2, n_iter=10,
  random_state=None)

>>> print(mca.explained_variance_)
[0.90150495 0.76979456]

>>> print(mca.explained_variance_ratio_)
[0.24040131974598467, 0.20527854948955893]

>>> print(mca.transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]

>>> print(mca.fit_transform(X))
[[ 0.55603013  0.7016272 ]
 [-0.73558629 -1.17559462]
 [-0.44972794 -0.4973024 ]
 [-0.16248444  0.95706908]
 [-0.66969377 -0.79951057]
 [-0.21267777  0.39953562]
 [-0.67921667 -0.8707747 ]
 [ 0.05058625  1.34573057]
 [-0.31952341  0.77285922]
 [ 2.62229391 -0.83363941]]
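Since MCA builds on CA, one common way to describe it is as CA applied to the one-hot (indicator) matrix of the categorical columns. The sketch below shows that textbook view with pandas and NumPy on made-up data; it is an assumption that Light_FAMD proceeds the same way internally. A known property of this formulation is that the total inertia equals J/Q - 1, with J the total number of categories and Q the number of variables. The output above appears consistent with it: 19 categories over 4 variables gives 3.75, which is the ratio between the printed variances and their ratios.

```python
import numpy as np
import pandas as pd

# Made-up categorical data: Q = 3 variables, J = 8 categories in total.
X = pd.DataFrame({"A": list("ddebb"),
                  "B": list("xyxyy"),
                  "C": list("pqrpq")})

# Indicator matrix: one column per (variable, category) pair.
Z = pd.get_dummies(X).to_numpy(dtype=float)

# Correspondence analysis of the indicator matrix.
P = Z / Z.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
s = np.linalg.svd(S, compute_uv=False)

total_inertia = (s ** 2).sum()
J, Q = Z.shape[1], X.shape[1]
print(total_inertia, J / Q - 1)
```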

Multiple-Factor-Analysis: MFA

The MFA class inherits from the PCA class. FAMD in turn inherits from MFA, and the only thing FAMD adds to its superclass is determining the groups parameter. We therefore skip this section and go directly to FAMD.
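As a rough sketch of what determining the groups could look like, the hypothetical helper below splits the columns of a DataFrame into a numerical and a categorical group based on their dtypes. The function name, the dictionary keys and the format of the result are assumptions made for illustration; the actual groups format expected by MFA is not documented here.

```python
import pandas as pd


def infer_groups(df):
    """Hypothetical helper: split columns into two groups by dtype."""
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return {"Numerical": numerical, "Categorical": categorical}


X = pd.DataFrame({"A": [96, 11, 0],
                  "B": [19, 46, 89],
                  "C": list("bba"),
                  "D": list("dda")})
groups = infer_groups(X)
print(groups)
```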

Factor-Analysis-of-Mixed-Data: FAMD

The FAMD class inherits from the MFA class, which means you have access to all of MFA's methods and properties.

>>> import numpy as np
>>> import pandas as pd
>>> from light_famd import FAMD
>>> X_n = pd.DataFrame(data=np.random.randint(0, 100, size=(10, 2)), columns=list('AB'))
>>> X_c = pd.DataFrame(np.random.choice(list('abcde'), size=(10, 4), replace=True), columns=list('CDEF'))
>>> X = pd.concat([X_n, X_c], axis=1)
>>> print(X)
    A   B  C  D  E  F
0  96  19  b  d  b  e
1  11  46  b  d  a  e
2   0  89  a  a  a  c
3  13  63  c  a  e  d
4  37  36  d  b  e  c
5  10  99  a  b  d  c
6  76   2  c  a  d  e
7  32   5  c  a  e  d
8  49   9  c  e  e  e
9   4  22  c  c  b  d

>>> famd = FAMD(n_components=2)
>>> famd.fit(X)
MCA PROCESS MCA PROCESS ELIMINATED 0  COLUMNS SINCE THEIR MISS_RATES >= 99%
Out:
FAMD(check_input=True, copy=False, engine='auto', n_components=2, n_iter=2,
     random_state=None)

>>> print(famd.explained_variance_)
[17.40871219  9.73440949]

>>> print(famd.explained_variance_ratio_)
[0.32596621039327284, 0.1822701494502082]

>>> print(famd.column_correlation(X))
             0         1
A         NaN       NaN
B         NaN       NaN
C_a       NaN       NaN
C_b       NaN  0.824458
C_c  0.922220       NaN
C_d       NaN       NaN
D_a       NaN       NaN
D_b       NaN       NaN
D_c       NaN       NaN
D_d       NaN  0.824458
D_e       NaN       NaN
E_a       NaN       NaN
E_b       NaN       NaN
E_d       NaN       NaN
E_e       NaN       NaN
F_c       NaN -0.714447
F_d  0.673375       NaN
F_e       NaN  0.839324



>>> print(famd.transform(X))
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]

>>> print(famd.fit_transform(X))
MCA PROCESS HAVE ELIMINATE 0  COLUMNS SINCE ITS MISSING RATE >= 99%
[[ 2.23848136  5.75809647]
 [ 2.0845175   4.78930072]
 [ 2.6682068  -2.78991262]
 [ 6.2962962  -1.57451325]
 [ 2.52140085 -3.28279729]
 [ 1.58256681 -3.73135011]
 [ 5.19476759  1.18333717]
 [ 6.35288446 -1.33186723]
 [ 5.02971134  1.6216402 ]
 [ 4.05754963  0.69620997]]

Going faster

By default, light_famd uses sklearn's randomized SVD implementation. One of the goals of Light_FAMD is to make it possible to use a different SVD backend. For the time being, the only other supported backend is Facebook's randomized SVD implementation, called fbpca. Alternatively, see Truncated_FAMD, which selects the svd_solver automatically depending on the structure of the input. To use fbpca, set the engine parameter to 'fbpca':

>>> import light_famd
>>> pca = light_famd.PCA(engine='fbpca')
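For illustration, a backend switch of this kind could be sketched as below. compute_svd is a hypothetical function, not Light_FAMD's actual code. The 'auto' branch calls sklearn's randomized_svd; the 'fbpca' branch assumes the optional fbpca package is installed and imports it only when requested.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd


def compute_svd(X, n_components, n_iter, random_state=None, engine="auto"):
    """Hypothetical dispatcher between two randomized SVD backends."""
    if engine == "auto":
        return randomized_svd(X, n_components=n_components,
                              n_iter=n_iter, random_state=random_state)
    if engine == "fbpca":
        import fbpca  # optional dependency, imported lazily
        # raw=True skips fbpca's own centring of the input.
        return fbpca.pca(X, k=n_components, raw=True, n_iter=n_iter)
    raise ValueError("engine must be 'auto' or 'fbpca'")


rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))  # made-up data
U, s, Vt = compute_svd(X, n_components=2, n_iter=3, random_state=0)
print(U.shape, s.shape, Vt.shape)
```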

Project details

This version

0.0.3

Sep 29, 2019

Download files

Source Distribution

light_famd-0.0.3.tar.gz (15.9 kB)

Uploaded Sep 29, 2019 Source

Built Distribution

light_famd-0.0.3-py2.py3-none-any.whl (14.1 kB)

Uploaded Sep 29, 2019 Python 2, Python 3

File details

Details for the file light_famd-0.0.3.tar.gz.

File metadata

Download URL: light_famd-0.0.3.tar.gz
Upload date: Sep 29, 2019
Size: 15.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for light_famd-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`0d640659de578a572ec513f3741e6c2f5eeaf841884579d15e8c3eb834853b81`
MD5	`bc6e6e28443acc65cd49a6f6778d1018`
BLAKE2b-256	`9ef060e56c2e3c00e33cfeab5d54dfdb917fa960fd8d178fb57be1320af7010b`

File details

Details for the file light_famd-0.0.3-py2.py3-none-any.whl.

File metadata

Download URL: light_famd-0.0.3-py2.py3-none-any.whl
Upload date: Sep 29, 2019
Size: 14.1 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.7.3

File hashes

Hashes for light_famd-0.0.3-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`63abe8762ca98f32736b239a94a4b849dd082ad40de5a36be84118136470686a`
MD5	`976d1d7a40d36335123229be8919fbce`
BLAKE2b-256	`6c406678217385426fe2d7791df6a56c866e8676f9d75afa26e188d9a2a291a3`
