Use fast FFT-based mutual information screening for large datasets. Works well on MRI brain imaging data. Developed by Kai Yang, [GPG Public key Fingerprint: CC02CF153594774CF956691492B2600D18170329](https://keys.openpgp.org/vks/v1/by-fingerprint/CC02CF153594774CF956691492B2600D18170329)

fastHDMI -- fast High-Dimensional Mutual Information estimation

Kai Yang

<kai.yang2 "at" mail.mcgill.ca>

License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)

GPG Public key Fingerprint: CC02CF153594774CF956691492B2600D18170329

This package uses FFT-based mutual information screening and an accelerated gradient method to identify important variables in (potentially very) high-dimensional, large datasets. Version 1.23.23 updates only the README file, to illustrate the functions more clearly.

Given the size of typical data files, the most commonly used functions are the ones that run in parallel. All parallel functions have the _parallel suffix and share the following arguments:

core_num: the number of CPU cores used for multiprocessing; the default is to use all available cores, since the job is most likely running on a server rather than a PC
multp: the job multiplier; the job to be run in parallel is first divided into core_num * multp sub-jobs, as equal in size as possible, and each core then takes one sub-job at a time
verbose: how verbose the function is; 0 is the least verbose, and the amount of output increases with the value declared here
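
The division into core_num * multp sub-jobs can be sketched as follows. This is a minimal illustration of the idea described above, not the package's own code; the helper name `split_jobs` and its `n_items` argument are hypothetical.

```python
def split_jobs(n_items, core_num, multp):
    """Split range(n_items) into core_num * multp near-equal contiguous chunks.

    Hypothetical helper for illustration only; not part of fastHDMI.
    """
    if n_items == 0:
        return []
    n_chunks = min(n_items, core_num * multp)  # never create empty chunks
    base, extra = divmod(n_items, n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)  # the first `extra` chunks get one more item
        chunks.append(range(start, start + size))
        start += size
    return chunks


# 1000 covariates, 4 cores, multiplier 10: 40 sub-jobs of 25 columns each.
chunks = split_jobs(n_items=1000, core_num=4, multp=10)
sizes = [len(c) for c in chunks]
```

A larger multp gives smaller sub-jobs, which tends to balance the load better when some sub-jobs finish sooner than others, at the cost of more scheduling overhead.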

The functions implementing our proposed FFT-based mutual information estimation have the following arguments:

N: the grid size for 1-D FFT; with N=500 as the default value
a_N, b_N: the grid sizes for the 2-D FFT, analogous to N above; both default to 300
kernel, bw: the kernel and bandwidth used for KDE
norm: the norm used for KDE; this option only takes effect for 2-D KDE
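
To make the role of the grid size and bandwidth concrete, here is a minimal sketch of FFT-based KDE on a grid of N points, used to form a plug-in mutual information estimate between a binary outcome and one continuous covariate. It is an illustration under stated assumptions, not the package's implementation: the Gaussian kernel, Silverman's rule for the bandwidth, and the names `silverman_bw`, `fft_kde_1d`, `entropy`, and `mi_binary_continuous` are all choices made for this example.

```python
import numpy as np


def silverman_bw(x):
    """Silverman's rule-of-thumb bandwidth for a 1-D sample (assumed choice)."""
    q75, q25 = np.percentile(x, [75, 25])
    scale = min(np.std(x, ddof=1), (q75 - q25) / 1.34)
    return 0.9 * scale * x.size ** (-0.2)


def fft_kde_1d(x, grid, bw):
    """Gaussian KDE of x on an evenly spaced grid, computed by FFT convolution."""
    n = grid.size
    dx = grid[1] - grid[0]
    # Bin the sample: one bin centred on each grid point.
    edges = np.append(grid - dx / 2, grid[-1] + dx / 2)
    counts, _ = np.histogram(x, bins=edges)
    # Kernel sampled at every possible grid offset, from -(n-1)*dx to (n-1)*dx.
    offsets = np.arange(-(n - 1), n) * dx
    kernel = np.exp(-0.5 * (offsets / bw) ** 2) / (bw * np.sqrt(2 * np.pi))
    # The zero-padded FFT product equals the linear convolution of counts and kernel.
    size = counts.size + kernel.size - 1
    conv = np.fft.irfft(np.fft.rfft(counts, size) * np.fft.rfft(kernel, size), size)
    density = np.maximum(conv[n - 1:2 * n - 1], 0.0) / x.size
    return density / (density.sum() * dx)  # renormalise to integrate to 1


def entropy(density, dx):
    """Plug-in differential entropy (in nats) of a density sampled on a grid."""
    p = density[density > 0]
    return -np.sum(p * np.log(p)) * dx


def mi_binary_continuous(x, y, N=500):
    """Plug-in estimate of I(X; Y) = H(X) - sum_k P(Y=k) H(X | Y=k), with Y in {0, 1}."""
    bw = silverman_bw(x)
    grid = np.linspace(x.min() - 4 * bw, x.max() + 4 * bw, N)
    dx = grid[1] - grid[0]
    mi = entropy(fft_kde_1d(x, grid, bw), dx)
    for label in (0, 1):
        xs = x[y == label]
        mi -= xs.size / x.size * entropy(fft_kde_1d(xs, grid, silverman_bw(xs)), dx)
    return max(mi, 0.0)  # clip small negative values caused by estimation error


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
x_signal = 2.0 * y + rng.normal(size=2000)  # depends on the outcome
x_noise = rng.normal(size=2000)             # independent of the outcome
mi_signal = mi_binary_continuous(x_signal, y)
mi_noise = mi_binary_continuous(x_noise, y)
```

In this sketch the cost of evaluating the density is governed by the grid size N rather than by the number of pairwise kernel evaluations, which is why a fixed grid keeps the estimate cheap on large samples.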

The screening functions and their arguments:

For plink files:

bed_file, bim_file, fam_file: the locations of the plink files
outcome, outcome_iid: the outcome values and the iids for the outcome. For genetic data, the order of the SNP iids and the outcome iids usually does not match. While the SNP iids can be obtained from the plink1 files, the outcome iids must be declared separately. outcome_iid should be a list of strings or a one-dimensional numpy string array.
continuous_screening_plink, continuous_screening_plink_parallel for screening on continuous outcomes with continuous covariates
binary_screening_plink, binary_screening_plink_parallel for screening on binary outcomes with continuous covariates
clump_plink_parallel for clumping: starting from the first covariate (i.e., the left-most column of the data file), clumping removes every subsequent covariate whose mutual information with the covariate currently being examined is higher than clumping_threshold
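
The clumping rule described above can be sketched as a greedy loop. This is one reading of that description, not the package's code: it assumes that after the first covariate is processed, the procedure moves on to the next surviving covariate. The function names are hypothetical, and `gaussian_mi` is a stand-in dependence measure (the mutual information of a bivariate Gaussian with the sample correlation), used here only so that the example is self-contained.

```python
import numpy as np


def gaussian_mi(a, b):
    """Stand-in dependence measure: MI of a bivariate Gaussian, -0.5 * log(1 - rho^2)."""
    rho2 = min(np.corrcoef(a, b)[0, 1] ** 2, 1.0 - 1e-12)  # avoid log(0)
    return -0.5 * np.log(1.0 - rho2)


def greedy_clump(X, dependence, clumping_threshold):
    """Return the indices of the columns of X kept after greedy clumping."""
    remaining = list(range(X.shape[1]))
    kept = []
    while remaining:
        lead = remaining.pop(0)  # left-most surviving column
        kept.append(lead)
        # Drop every later column that depends too strongly on the lead column.
        remaining = [j for j in remaining
                     if dependence(X[:, lead], X[:, j]) <= clumping_threshold]
    return kept


rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))
# Columns 1 and 3 are near-duplicates of columns 0 and 2.
X = np.column_stack([a, a + 0.1 * rng.normal(size=500),
                     b, b + 0.1 * rng.normal(size=500), c])
kept = greedy_clump(X, gaussian_mi, clumping_threshold=0.5)
```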

For csv files:

_usecols: a list of the column labels to use; the first element should be the outcome. The returned mutual information results correspond to the columns in _usecols.
By default, the left-most column is assumed to be the outcome; if it is not, use _usecols to put the outcome column label first.
csv_engine: how the file is read. It can be dask for low-memory situations, one of the pandas read_csv engines, or fastparquet to read a previously created parquet file for faster speed. If fastparquet is chosen, set parquet_file to the file path of the parquet file; if dask is chosen to read a very large CSV, you may need to specify a larger sample.
binary_screening_csv, binary_screening_csv_parallel for screening on binary outcomes with continuous covariates
binary_skMI_screening_csv_parallel, continuous_skMI_screening_csv_parallel for screening using the mutual information estimation provided by scikit-learn, i.e., sklearn.metrics.mutual_info_score, sklearn.feature_selection.mutual_info_classif
Pearson_screening_csv_parallel for screening using Pearson correlation
continuous_screening_csv, continuous_screening_csv_parallel for screening on continuous outcomes with continuous covariates
clump_continuous_csv_parallel for clumping, as described for clump_plink_parallel
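
The outcome-first convention for _usecols can be illustrated with plain pandas. This sketch does not call fastHDMI; the column labels and data are made up, and it only shows how a label list with the outcome first maps onto the columns that are read. Note that pandas read_csv ignores the order of usecols, so the columns are re-indexed afterwards to enforce it.

```python
import io

import pandas as pd

# Made-up data: the outcome is NOT the left-most column of the file.
csv_text = "id,age,outcome,snp1,snp2\n1,40,0.5,0,1\n2,35,1.5,1,2\n3,52,0.7,2,0\n"
usecols = ["outcome", "snp1", "snp2"]  # outcome label first, then the covariates

# read_csv returns columns in file order, so select them again in the wanted order.
df = pd.read_csv(io.StringIO(csv_text), usecols=usecols)[usecols]

outcome = df.iloc[:, 0].to_numpy()      # first column: the outcome
covariates = df.iloc[:, 1:].to_numpy()  # remaining columns: the covariates
```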

A share_memory option is available for multiprocess computing. The package can process large .csv data in parallel in a memory-efficient manner, using FFT-based KDE to estimate the mutual information quickly. A tqdm progress bar is included, which is useful on cloud computing platforms. The verbose option takes the values 0, 1, or 2: 0 prints nothing, 1 shows only the progress bar, and 2 is the most verbose.

For DataFrames:

binary_screening_dataframe, binary_screening_dataframe_parallel for screening on binary outcomes with continuous covariates
binary_skMI_screening_dataframe_parallel, continuous_skMI_screening_dataframe_parallel for screening using the mutual information estimation provided by scikit-learn, i.e., sklearn.metrics.mutual_info_score, sklearn.feature_selection.mutual_info_classif
Pearson_screening_dataframe_parallel for screening using Pearson correlation
continuous_screening_dataframe, continuous_screening_dataframe_parallel for screening on continuous outcomes with continuous covariates
clump_continuous_dataframe_parallel for clumping, as described for clump_plink_parallel

For numpy arrays:

binary_screening_array, binary_screening_array_parallel for screening on binary outcomes with continuous covariates
continuous_screening_array, continuous_screening_array_parallel for screening on continuous outcomes with continuous covariates
binary_skMI_array_parallel, continuous_skMI_array_parallel for screening using the mutual information estimation provided by scikit-learn, i.e., sklearn.metrics.mutual_info_score, sklearn.feature_selection.mutual_info_classif
continuous_Pearson_array_parallel for screening using Pearson correlation
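
For comparison with the mutual information screens, Pearson-correlation screening on a numpy array can be sketched in a few lines. This is a generic illustration, not the package's code; the function name `pearson_screen` and the choice to pass the covariates X and the outcome y as separate arrays are assumptions made for this example.

```python
import numpy as np


def pearson_screen(X, y):
    """Absolute Pearson correlation between y and every column of X."""
    Xc = X - X.mean(axis=0)  # centre each covariate
    yc = y - y.mean()        # centre the outcome
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 2] + rng.normal(size=300)  # only column 2 drives the outcome
scores = pearson_screen(X, y)
ranking = np.argsort(scores)[::-1]        # covariate indices, strongest first
```

Pearson correlation only captures linear association, whereas mutual information can also detect nonlinear dependence between a covariate and the outcome.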
