Supervised, nonlinear feature selection method for highdimensional datasets.
Project description
pyHSICLasso is a package of the Hilbert Schmidt Independence Criterion Lasso (HSIC Lasso), which is a nonlinear feature selection method considering the nonlinear input and output relationship.
Advantage of HSIC Lasso
Can find nonlinearly related features efficiently.
Can obtain a globally optimal solution.
Can deal with both regression and classification problems through kernels.
Feature Selection
The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. By using this, you can supplement the dependence of nonlinear input and output and you can calculate the optimal solution efficiently for high dimensional problem. The effectiveness are demonstrated through feature selection experiments for classification and regression with thousands of features. Finding a subset of features in highdimensional supervised learning is an important problem with many real world applications such as gene selection from microarray data, document categorization, and prosthesis control.
Install
$ pip install r requirements.txt
$ python setup.py install
or
$ pip install pyHSICLasso
Usage
First, pyHSICLasso provides the single entry point as class HSICLasso()
This class has the following methods.
input
regression
classification
dump
plot_path
plot_dendrogram
plot_heatmap
get_features
get_features_neighbors
get_index
get_index_score
get_index_neighbors
get_index_neighbors_score
The input format corresponds to the following formats.
MATLAB file (.mat)
.csv
.tsv
python’s list
numpy’s ndarray
Input file
When using .mat, .csv, .tsv, we support pandas dataframe. The rows of the dataframe are sample number. The output variable should have class tag. If you wish to use your own tag, you need to specify the output variables by list (output_list=['tag']) The remaining columns are values of each features. The following is a sample data (csv format).
class,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10 1,2,0,0,0,2,0,2,0,2,0 1,2,2,0,0,2,0,0,0,2,0 ...
When using python’s list or numpy’s ndarray, Let each index be sample number, let values of each features for X[ind] and classification value for Y[ind].
Example
>>> from pyHSICLasso import HSICLasso
>>> hsic_lasso = HSICLasso()
>>> hsic_lasso.input("data.mat")
>>> hsic_lasso.input("data.csv")
>>> hsic_lasso.input("data.tsv")
>>> hsic_lasso.input([[1, 1, 1], [2, 2, 2]], [0, 1])
>>> hsic_lasso.input(np.array([[1, 1, 1], [2, 2, 2]]), np.array([0, 1]))
You can specify the number of subset of feature selections with arguments regression andclassification.
>>> hsic_lasso.regression(5)
>>> hsic_lasso.classification(10)
About output method, it is possible to select plots on the graph, details of the analysis result, output of the feature index.
>>> hsic_lasso.plot()
# plot the graph
>>> hsic_lasso.dump()
============================================== HSICLasso : Result ==================================================
 Order  Feature  Score  Top5 Related Feature (Relatedness Score) 
 1  1100  1.000  100 (0.979), 385 (0.104), 1762 (0.098), 762 (0.098), 1385 (0.097)
 2  100  0.537  1100 (0.979), 385 (0.100), 1762 (0.095), 762 (0.094), 1385 (0.092)
 3  200  0.336  1200 (0.979), 264 (0.094), 1482 (0.094), 1264 (0.093), 482 (0.091)
 4  1300  0.140  300 (0.984), 1041 (0.107), 1450 (0.104), 1869 (0.102), 41 (0.101)
 5  300  0.033  1300 (0.984), 1041 (0.110), 41 (0.106), 1450 (0.100), 1869 (0.099)
>>> hsic_lasso.get_index()
[1099, 99, 199, 1299, 299]
>>> hsic_lasso.get_index_score()
array([0.09723658, 0.05218047, 0.03264885, 0.01360242, 0.00319763])
>>> hsic_lasso.get_features()
['1100', '100', '200', '1300', '300']
>>> hsic_lasso.get_index_neighbors(feat_index=0,num_neighbors=5)
[99, 384, 1761, 761, 1384]
>>> hsic_lasso.get_features_neighbors(feat_index=0,num_neighbors=5)
['100', '385', '1762', '762', '1385']
>>> hsic_lasso.get_index_neighbors_score(feat_index=0,num_neighbors=5)
array([0.9789888 , 0.10350618, 0.09757666, 0.09751763, 0.09678892])
Contributors
Developers
Name : Makoto Yamada, Héctor ClimenteGonzález
Email : makoto.yamada@riken.jp
Distributor
Name : Hirotaka Suetake
Email : hirotaka.suetake@riken.jp
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Hashes for pyHSICLasso1.4.2py2.py3noneany.whl
Algorithm  Hash digest  

SHA256  6a27b7a10fd602925fbcb9cc82a69041b51c9a3c003aa119a82daef85980cb72 

MD5  3adecfd96e2b66bec83366437016a3fa 

BLAKE2b256  f0dd54ea99918b8940e49c0fc7763df3d4ddd6b68bb26197475661c96de64f72 