Use fast FFT-based mutual information screening and an accelerated gradient method to screen variables and solve nonconvex sparse learning problems on large CSV files or large genetic bed/bim/fam files. Multiprocessing is available.
Project description
fastHDMI -- fast High-Dimensional Mutual Information estimation
Kai Yang
kai.yang2@mail.mcgill.ca
This package uses FFT-based mutual information screening and an accelerated gradient method to select important variables from (potentially very) high-dimensional large datasets. A `share_memory` option is available for multiprocess computing. The package can process large .csv data in parallel in a memory-efficient manner, and it uses FFT-based kernel density estimation (KDE) to estimate mutual information very quickly. A tqdm progress bar is included, which is useful on cloud computing platforms. The `verbose` option takes the values 0, 1, or 2: 2 is the most verbose, 1 shows only the progress bar, and 0 prints nothing. The corresponding paper by Yang et al. is coming soon...
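To make the idea of FFT-based KDE for mutual information concrete, here is a minimal sketch using only NumPy. It is not the fastHDMI implementation: the function name `mi_fft_kde`, the Scott's-rule bandwidth, and the grid padding are assumptions chosen for illustration. The sample is binned on an `N`-by-`N` grid, smoothed with a Gaussian kernel by multiplying in the frequency domain, and the mutual information is then summed over the grid cells.

```python
import numpy as np

def mi_fft_kde(x, y, N=128):
    """Plug-in mutual information from a Gaussian KDE computed via FFT (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))      # drop pairs with a missing value
    x, y = x[keep], y[keep]
    n = x.size
    # Assumption: Scott's rule bandwidth for two dimensions, per axis.
    hx = x.std() * n ** (-1 / 6)
    hy = y.std() * n ** (-1 / 6)
    # Pad the grid by four bandwidths so circular wrap-around is negligible.
    x_edges = np.linspace(x.min() - 4 * hx, x.max() + 4 * hx, N + 1)
    y_edges = np.linspace(y.min() - 4 * hy, y.max() + 4 * hy, N + 1)
    dx = x_edges[1] - x_edges[0]
    dy = y_edges[1] - y_edges[0]
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    prob = counts / n
    # Fourier transform of a Gaussian kernel with standard deviations (hx, hy).
    fx = np.fft.fftfreq(N, d=dx)
    fy = np.fft.fftfreq(N, d=dy)
    kernel_hat = np.exp(-2 * np.pi ** 2 * ((hx * fx)[:, None] ** 2 + (hy * fy)[None, :] ** 2))
    # Convolution theorem: smoothing is a product in the frequency domain.
    smooth = np.fft.ifft2(np.fft.fft2(prob) * kernel_hat).real
    smooth = np.clip(smooth, 0.0, None)      # remove tiny negative round-off
    smooth /= smooth.sum()                   # cell probabilities of the joint
    px = smooth.sum(axis=1, keepdims=True)   # marginal of x
    py = smooth.sum(axis=0, keepdims=True)   # marginal of y
    mask = smooth > 0
    log_px = np.log(np.broadcast_to(px, smooth.shape)[mask])
    log_py = np.log(np.broadcast_to(py, smooth.shape)[mask])
    return float(np.sum(smooth[mask] * (np.log(smooth[mask]) - log_px - log_py)))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y_indep = rng.normal(size=2000)               # unrelated to x
y_dep = x + 0.3 * rng.normal(size=2000)       # strongly related to x
mi_indep = mi_fft_kde(x, y_indep)
mi_dep = mi_fft_kde(x, y_dep)
print(mi_indep, mi_dep)
```

The dependent pair should score far higher than the independent pair, which is the property a screening step relies on when ranking variables. The cost of the smoothing step depends on the grid size rather than on the sample size, which is why an FFT-based KDE scales well to many bivariate calculations.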
The available functions are:
- `continuous_screening_plink` calculates the mutual information between a continuous outcome and a biallelic SNP using FFT. Missing data in the input variables is acceptable and is removed per bivariate calculation. The arguments are: `bed_file`, `bim_file`, `fam_file` are the locations of the plink1 files; `outcome`, `outcome_iid` are the outcome values and the iids for the outcome. For genetic data, the order of the SNP iids and the outcome iids usually doesn't match. While the SNP iids can be obtained from the plink1 files, the outcome iids must be declared separately. `outcome_iid` should be a list of strings or a one-dimensional numpy string array. `N=500` is the default grid size for FFT.
- `binary_screening_plink` works similarly.
- `continuous_screening_plink_parallel` and `binary_screening_plink_parallel` are the multiprocessing versions of the above two functions; `core_num` declares the number of cores to use for multiprocessing.
- `MI_continuous_continuous` and `MI_binary_continuous` calculate the mutual information between two continuous variables and between a binary and a continuous variable, respectively. `MI_binary_012` and `MI_012_012` are `jit`-compiled functions; the latter can be used for clumping on very large genetic datasets.
- `binary_screening_csv`, `continuous_screening_csv`, `binary_screening_csv_parallel`, and `continuous_screening_csv_parallel` work on large CSV files directly in a memory-efficient manner. The leftmost column is assumed to be the outcome; if it is not, use `_usecols` and set its first element to the outcome column label. `_usecols` is a list of the column labels to be used, and its first element should be the outcome. The returned mutual information results match `_usecols`. `Pearson_screening_csv_parallel` calculates Pearson's correlation between the outcome and the covariates only, in a similar manner; by contrast, `pandas.DataFrame.corr` calculates pairwise Pearson's correlations for the entire dataframe. `csv_engine` can be `dask` for low-memory situations, one of `pandas`'s `read_csv` engines, or the `fastparquet` engine, which reads a previously created `parquet` file for faster speed. If `fastparquet` is chosen, declare `parquet_file` as the filepath to the parquet file; if `dask` is chosen to read a very large CSV, a larger `sample` may need to be specified.
- `continuous_skMIscreening_csv_parallel` uses the MI calculation from `sklearn.feature_selection.mutual_info_regression` to carry out the screening instead.
- `clump_plink_parallel` and `clump_continuous_csv_parallel` carry out fast mutual information based clumping in parallel.
- `UAG_LM_SCAD_MCP`, `UAG_logistic_SCAD_MCP`: these functions find a local minimizer for the SCAD/MCP penalized linear models/logistic models. The arguments are:
  - `design_matrix`: the design matrix input; should be a two-dimensional numpy array;
  - `outcome`: the outcome; should be a one-dimensional numpy array, continuous for the linear model and binary for the logistic model;
  - `beta_0`: starting value; optional. If not declared, it is calculated based on the Gauss-Markov theory estimators of $\beta$;
  - `tol`: tolerance parameter, applied to the uniform norm of the difference between two successive iterations;
  - `maxit`: maximum number of iterations allowed;
  - `_lambda`: the value of $\lambda$;
  - `penalty`: either `"SCAD"` or `"MCP"`;
  - `a=3.7`, `gamma=2`: `a` for SCAD and `gamma` for MCP; it is recommended to set `a` to $3.7$;
  - `L_convex`: the L-smoothness constant for the convex component; if not declared, it is calculated automatically;
  - `add_intercept_column`: boolean; whether the function should add an intercept column.
- `solution_path_LM`, `solution_path_logistic`: calculate the solution path for linear/logistic models; the only difference from above is that `lambda_` is now a one-dimensional numpy array of the values of $\lambda$ to be used.
- `UAG_LM_SCAD_MCP_strongrule`, `UAG_logistic_SCAD_MCP_strongrule` work just like `UAG_LM_SCAD_MCP`, `UAG_logistic_SCAD_MCP`, except that they use the strong rule to screen out many covariates before carrying out the optimization step. The same holds for `solution_path_LM_strongrule` and `solution_path_logistic_strongrule`. The strong rule increases the computational speed dramatically.
- `SNP_UAG_LM_SCAD_MCP` and `SNP_UAG_logistic_SCAD_MCP` work similarly to `UAG_LM_SCAD_MCP` and `UAG_logistic_SCAD_MCP`, and `SNP_solution_path_LM` and `SNP_solution_path_logistic` work similarly to `solution_path_LM`, `solution_path_logistic`, except that they take plink1 files and are therefore more memory-efficient. Since PCA adjustment is usually used to adjust for population structure, principal components can be passed to `pca` as a 2-d array in which each column is one principal component. The PCA versions are `SNP_UAG_LM_SCAD_MCP_PCA` and `SNP_UAG_logistic_SCAD_MCP_PCA`.
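For readers unfamiliar with the two penalties named in the list above, the following sketch writes out the standard textbook SCAD and MCP penalty functions for a single coefficient. It is not taken from the fastHDMI source; the function names `scad_penalty` and `mcp_penalty` are hypothetical, and the defaults `a=3.7` and `gamma=2` simply mirror the defaults listed above.

```python
def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty for one coefficient (standard definition, requires a > 2)."""
    t = abs(beta)
    if t <= lam:
        return lam * t                                    # behaves like the lasso near zero
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))  # quadratic transition
    return lam * lam * (a + 1) / 2                        # constant: no extra shrinkage

def mcp_penalty(beta, lam, gamma=2.0):
    """MCP penalty for one coefficient (standard definition, requires gamma > 1)."""
    t = abs(beta)
    if t <= gamma * lam:
        return lam * t - t * t / (2 * gamma)              # penalization rate decays linearly
    return gamma * lam * lam / 2                          # constant beyond gamma * lam

for b in (0.0, 0.5, 2.0, 10.0):
    print(b, scad_penalty(b, lam=1.0), mcp_penalty(b, lam=1.0))
```

Both penalties flatten out for large coefficients, which reduces the bias that the lasso places on strong signals; that flattening is also what makes the objective nonconvex, so a solver can only be expected to return a local minimizer.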
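The mutual information between two SNPs coded as 0/1/2 can be computed exactly from a contingency table, with no density estimation. The sketch below uses only the standard library and shows that plug-in calculation; the name `mi_discrete` is hypothetical, and treating `None` as the missing-value marker is an assumption made for this example, not the package's convention.

```python
from collections import Counter
from math import log

def mi_discrete(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    # Keep only the pairs where both values are observed.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    joint = Counter(pairs)                    # counts of each (a, b) combination
    count_x = Counter(a for a, _ in pairs)    # marginal counts of x
    count_y = Counter(b for _, b in pairs)    # marginal counts of y
    # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written in terms of counts.
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )

snp_a = [0, 1, 2, 0, 1, 2, 0, 1, 2]
snp_b = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(mi_discrete(snp_a, snp_a))   # a SNP compared with itself
print(mi_discrete(snp_a, snp_b))   # two SNPs with no association in this sample
```

A pairwise score of this kind is the sort of quantity a clumping step can compare against a threshold to decide whether two SNPs carry largely redundant information.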
Download files
File details
Details for the file fastHDMI-1.18.10.tar.gz.
File metadata
- Download URL: fastHDMI-1.18.10.tar.gz
- Upload date:
- Size: 61.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9ca6a2dc19ef38b1d84c86154014a4fa4211096c36c8c513d5dbcacc3bc47cff |
| MD5 | 7d636b36bd4ba608fd1ece76c8cd1c0d |
| BLAKE2b-256 | 65deef6da17f49bce9725e00bb7eacd1676e3cc045c147227b5986a1ace4103b |
File details
Details for the file fastHDMI-1.18.10-py3-none-any.whl.
File metadata
- Download URL: fastHDMI-1.18.10-py3-none-any.whl
- Upload date:
- Size: 45.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.16
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b2691ad84c2133519976af237a45632a8de307adaf6b18b49ff6d0ed504160d7 |
| MD5 | 2240eba5caa9b5a7036eafe18b00b0bb |
| BLAKE2b-256 | 4eae5a84700f9735cbafc11c1fc7f09e79bda8bd7dc60bf4ade65edccd6442ad |