Use fast FFT-based mutual information screening for large datasets. Works well on MRI brain imaging data. Developed by Kai Yang, [GPG Public key Fingerprint: CC02CF153594774CF956691492B2600D18170329](https://keys.openpgp.org/vks/v1/by-fingerprint/CC02CF153594774CF956691492B2600D18170329)
fastHDMI -- fast High-Dimensional Mutual Information estimation
Kai Yang
<kai.yang2 "at" mail.mcgill.ca>
License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
GPG Public key Fingerprint: CC02CF153594774CF956691492B2600D18170329
This package uses FFT-based mutual information screening and an accelerated gradient method to screen for important variables in (potentially very) high-dimensional large datasets. Version 1.23.23 only updates the README file to illustrate the functions more clearly.
Considering the sizes of the data files, the most commonly used functions are those that run in parallel. All functions running in parallel have the `_parallel` suffix, and they all take the following arguments:
- `core_num`: number of CPU cores used for multiprocessing; the default is to use all available cores, since the job is most likely running on a server rather than a PC
- `multp`: job multiplier; the job to be run in parallel is first divided into `core_num * multp` sub-jobs, as equal in size as possible, and each core then takes one sub-job at a time
- `verbose`: how verbose the function is, with `0` being the least verbose; verbosity increases with the number declared here
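The splitting rule described for `multp` can be sketched in plain Python. This is only an illustration of dividing work into `core_num * multp` near-equal sub-jobs; the helper name `split_jobs` is hypothetical and is not part of the package.

```python
def split_jobs(n_items, core_num, multp):
    """Split range(n_items) into at most core_num * multp contiguous sub-jobs
    whose sizes differ by at most one (hypothetical helper, for illustration)."""
    n_jobs = max(1, min(n_items, core_num * multp))
    base, extra = divmod(n_items, n_jobs)
    jobs = []
    start = 0
    for k in range(n_jobs):
        # The first `extra` sub-jobs receive one additional item.
        size = base + (1 if k < extra else 0)
        jobs.append(range(start, start + size))
        start += size
    return jobs


# Example: 10 covariates, 2 cores, multiplier 2 gives 4 sub-jobs.
jobs = split_jobs(10, core_num=2, multp=2)
```

A larger `multp` produces more, smaller sub-jobs, which tends to balance the load better when some covariates take longer than others, at the cost of more scheduling overhead.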
The functions implementing our proposed FFT-based mutual information estimation have the following arguments:
- `N`: the grid size for 1-D FFT; with `N=500` as the default value
- `a_N`, `a_N`: similar to above, the grid size for 2-D FFT; with `300` as the default values
- `kernel` and `bw` specify the kernel and bandwidth used for KDE
- `norm` is the norm used for KDE -- this option only takes effect for 2-D KDE
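To make the role of the grid size and bandwidth concrete, here is a minimal sketch of the general idea for a binary outcome: a binned Gaussian KDE evaluated on an evenly spaced grid of `N` points via FFT convolution, plugged into the mutual information formula. It is not the package's implementation; the function names, the normal-reference bandwidth rule, and the grid padding are all assumptions made for illustration.

```python
import numpy as np


def fft_kde_1d(x, grid, bw):
    """Binned Gaussian KDE on an evenly spaced grid, computed by FFT convolution."""
    n_grid = grid.size
    dx = grid[1] - grid[0]
    # Bin the sample: one bin centred on each grid point.
    edges = np.concatenate(([grid[0] - dx / 2], grid + dx / 2))
    counts, _ = np.histogram(x, bins=edges)
    # Gaussian kernel sampled at every possible grid offset.
    offsets = dx * np.arange(-(n_grid - 1), n_grid)
    kernel = np.exp(-0.5 * (offsets / bw) ** 2) / (bw * np.sqrt(2 * np.pi))
    # Linear convolution via zero-padded real FFTs.
    size = counts.size + kernel.size - 1
    conv = np.fft.irfft(np.fft.rfft(counts, size) * np.fft.rfft(kernel, size), size)
    # Keep the part aligned with the grid; clip tiny negative rounding errors.
    density = conv[n_grid - 1:2 * n_grid - 1] / x.size
    return np.clip(density, 0.0, None)


def mi_binary_outcome(x, y, N=500, bw=None):
    """Plug-in estimate (in nats) of I(X; Y) for continuous x and discrete y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    if bw is None:
        # Assumption: normal-reference rule of thumb for the bandwidth.
        bw = 1.06 * x.std() * x.size ** (-0.2)
    # Assumption: pad the grid by three bandwidths on each side.
    grid = np.linspace(x.min() - 3 * bw, x.max() + 3 * bw, N)
    dx = grid[1] - grid[0]
    f_x = fft_kde_1d(x, grid, bw)
    mi = 0.0
    for label in np.unique(y):
        mask = y == label
        f_x_given_y = fft_kde_1d(x[mask], grid, bw)
        ok = (f_x_given_y > 0) & (f_x > 0)
        # Riemann sum of p(y) * f(x|y) * log(f(x|y) / f(x)) over the grid.
        mi += mask.mean() * np.sum(
            f_x_given_y[ok] * np.log(f_x_given_y[ok] / f_x[ok])) * dx
    return mi


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
x_informative = 3.0 * y + rng.normal(size=2000)  # depends on y
x_noise = rng.normal(size=2000)                  # independent of y
mi_informative = mi_binary_outcome(x_informative, y)
mi_noise = mi_binary_outcome(x_noise, y)
```

Because the convolution is done with FFTs, the cost per covariate grows with the grid size rather than with the square of the sample size, which is what makes this style of estimator attractive for screening many covariates.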
The screening functions and their arguments:
- For `plink` files:
  - arguments `bed_file`, `bim_file`, `fam_file` are the locations of the plink files;
  - arguments `outcome`, `outcome_iid` are the outcome values and the iids for the outcome. For genetic data, the order of the SNP iids and the outcome iids usually doesn't match. While the SNP iids can be obtained from the plink1 files, the outcome iids must be declared separately. `outcome_iid` should be a list of strings or a one-dimensional numpy string array.
  - `continuous_screening_plink`, `continuous_screening_plink_parallel` for screening on continuous outcomes with continuous covariates
  - `binary_screening_plink`, `binary_screening_plink_parallel` for screening on binary outcomes with continuous covariates
  - `clump_plink_parallel` for clumping -- starting from the first covariate (i.e., the leftmost column of the data file), clumping removes every subsequent covariate whose mutual information with the covariate currently being examined is higher than the declared `clumping_threshold`
- For `csv` files:
  - argument `_usecols` is a list of column labels to be used; the first element should be the outcome. The returned mutual information results match the order of `_usecols`.
  - Note that the leftmost column is assumed to be the outcome; if it is not, use `_usecols` to set the first element to the outcome column label.
  - `csv_engine` can use `dask` for low-memory situations, `pandas`'s `read_csv` engines, or the `fastparquet` engine with a previously created `parquet` file for faster speed. If `fastparquet` is chosen, declare `parquet_file` as the file path to the parquet file; if `dask` is chosen to read a very large CSV, a larger `sample` may need to be specified.
  - `binary_screening_csv`, `binary_screening_csv_parallel` for screening on binary outcomes with continuous covariates
  - `binary_skMI_screening_csv_parallel`, `continuous_skMI_screening_csv_parallel` for screening using the mutual information estimation provided by `sklearn`, i.e., `sklearn.metrics.mutual_info_score`, `sklearn.feature_selection.mutual_info_classif`
  - `Pearson_screening_csv_parallel` for screening using Pearson correlation
  - `continuous_screening_csv`, `continuous_screening_csv_parallel` for screening on continuous outcomes with continuous covariates
  - `clump_continuous_csv_parallel` similar to above
A `share_memory` option is available for multiprocess computing. The screening can be applied to large `.csv` data in parallel in a memory-efficient manner, using FFT for KDE to estimate the mutual information very quickly. A `tqdm` progress bar is also included, which is useful on cloud computing platforms. The `verbose` option takes the values `0`, `1`, or `2`: `2` is the most verbose, `1` shows only the progress bar, and `0` prints nothing.
- For DataFrame files:
  - `binary_screening_dataframe`, `binary_screening_dataframe_parallel` for screening on binary outcomes with continuous covariates
  - `binary_skMI_screening_dataframe_parallel`, `continuous_skMI_screening_dataframe_parallel` for screening using the mutual information estimation provided by `sklearn`, i.e., `sklearn.metrics.mutual_info_score`, `sklearn.feature_selection.mutual_info_classif`
  - `Pearson_screening_dataframe_parallel` for screening using Pearson correlation
  - `continuous_screening_dataframe`, `continuous_screening_dataframe_parallel` for screening on continuous outcomes with continuous covariates
  - `clump_continuous_dataframe_parallel` similar to above
- For `numpy` arrays:
  - `binary_screening_array`, `binary_screening_array_parallel` for screening on binary outcomes with continuous covariates
  - `continuous_screening_array`, `continuous_screening_array_parallel` for screening on continuous outcomes with continuous covariates
  - `binary_skMI_array_parallel`, `continuous_skMI_array_parallel` for screening using the mutual information estimation provided by `sklearn`, i.e., `sklearn.metrics.mutual_info_score`, `sklearn.feature_selection.mutual_info_classif`
  - `continuous_Pearson_array_parallel` for screening using Pearson correlation
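The clumping rule described for the `clump_*` functions can be sketched as a greedy pass over the covariates. This is an illustration only: the function name `clump`, the `association` callback, and the choice to skip already-removed covariates as anchors are assumptions, and the toy association values below stand in for pairwise mutual information.

```python
def clump(n_covariates, association, clumping_threshold):
    """Greedy clumping sketch: walk the covariates from left to right and drop
    every later covariate whose association with the current one exceeds the
    threshold. Returns the indices of the covariates that are kept."""
    kept = []
    removed = set()
    for i in range(n_covariates):
        if i in removed:
            # Assumption: a removed covariate is not used as an anchor.
            continue
        kept.append(i)
        for j in range(i + 1, n_covariates):
            if j not in removed and association(i, j) > clumping_threshold:
                removed.add(j)
    return kept


# Toy symmetric association matrix standing in for pairwise mutual information.
toy = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.8, 0.1],
    [0.1, 0.8, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
kept = clump(4, lambda i, j: toy[i][j], clumping_threshold=0.5)
```

Here covariate 1 is dropped because of its high association with covariate 0, and covariate 3 is dropped because of covariate 2; a higher threshold keeps more covariates.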
File details
Details for the file fastHDMI-1.23.28.tar.gz.
File metadata
- Download URL: fastHDMI-1.23.28.tar.gz
- Upload date:
- Size: 210.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 58ed56893b62c104fd06882bad577ffd95a773238aab3023d901f54e508b534f |
| MD5 | 1601cd47c1d9c64fb2f32b7caa6a904c |
| BLAKE2b-256 | dde5e46be8364b8d2bfab7e8d7983ea3d525b7d90cf43b5d91063840177752af |
File details
Details for the file fastHDMI-1.23.28-cp39-cp39-macosx_11_0_arm64.whl.
File metadata
- Download URL: fastHDMI-1.23.28-cp39-cp39-macosx_11_0_arm64.whl
- Upload date:
- Size: 275.2 kB
- Tags: CPython 3.9, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.9.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b7cffa180f8bbbc2e66de635bc9f7362933d0aa6719c74bfe929cec32d564e05 |
| MD5 | 904eda85c60317c0aaf00aadeaff55b3 |
| BLAKE2b-256 | f67d03b6f5148e4c0e687aa5be7c9efb246100def62511d3f1f2bea56e13a4f2 |