
Fast FFT-based mutual information screening and accelerated gradient methods for nonconvex sparse learning on large CSV files or large genetic bed/bim/fam files. Multiprocessing is supported.

Project description

fastHDMI -- fast High-Dimensional Mutual Information estimation

Kai Yang

kai.yang2@mail.mcgill.ca

This package uses mutual information and an accelerated gradient method to screen for important variables in (potentially very) high-dimensional large datasets. It can be applied to large .csv data in parallel in a memory-efficient manner, and it uses FFT-based KDE to estimate mutual information extremely fast. A tqdm progress bar is included, which is useful on cloud computing platforms. The verbose option takes values 0, 1, or 2: 2 is the most verbose, 1 shows only the progress bar, and 0 produces no output at all. The corresponding paper by Yang et al. is coming soon...
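The FFT-accelerated KDE estimate of mutual information can be sketched in plain numpy. This is a simplified illustration of the technique, not the package's actual implementation; the function name fft_kde_mi, the grid range of [-4, 4] after standardization, and the bandwidth bw are all assumptions made for the sketch:

```python
import numpy as np

def fft_kde_mi(x, y, N=64, bw=0.25):
    """Rough MI estimate between two continuous variables:
    bin on an N x N grid, smooth with a Gaussian kernel applied
    in the frequency domain, then plug into the MI formula."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    H, _, _ = np.histogram2d(x, y, bins=N, range=[[-4, 4], [-4, 4]])
    # Fourier transform of a Gaussian kernel with bandwidth bw
    f = np.fft.fftfreq(N, d=8.0 / N)
    g = np.exp(-2.0 * (np.pi * bw * f) ** 2)
    # smoothing as elementwise multiplication in the frequency domain:
    # O(N^2 log N) via FFT instead of a direct O(N^4) 2-D convolution
    P = np.real(np.fft.ifft2(np.fft.fft2(H) * np.outer(g, g)))
    P = np.clip(P, 0.0, None)
    P /= P.sum()
    Px = P.sum(axis=1, keepdims=True)
    Py = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float(np.sum(P[nz] * np.log(P[nz] / (Px @ Py)[nz])))
```

Because the smoothing is a single FFT pass per variable pair, each bivariate MI costs little, which is what makes screening millions of covariates feasible.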

The available functions are:

  • continuous_screening_plink calculates the mutual information between a continuous outcome and a biallelic SNP using FFT. Missing data in the input variables is acceptable and is removed per bivariate calculation. The arguments are:

    • bed_file, bim_file, fam_file are the locations of the plink1 files;
    • outcome, outcome_iid are the outcome values and the iids for the outcome. For genetic data, the order of the SNP iids and the outcome iids usually does not match. While the SNP iids can be obtained from the plink1 files, the outcome iids must be declared separately. outcome_iid should be a list of strings or a one-dimensional numpy string array.
    • N=500 is the default grid size for FFT.
  • binary_screening_plink works similarly.

  • continuous_screening_plink_parallel and binary_screening_plink_parallel are the multiprocessing versions of the above two functions; core_num declares the number of cores to use for multiprocessing.

  • MI_continuous_continuous and MI_binary_continuous calculate the mutual information between two continuous variables and between a binary and a continuous variable, respectively. MI_binary_012 and MI_012_012 are JIT-compiled functions -- the latter can be used for clumping on very large genetic datasets.

  • binary_screening_csv, continuous_screening_csv, binary_screening_csv_parallel, and continuous_screening_csv_parallel work on large CSV files directly in a memory-efficient manner. Note that the first column is assumed to be the outcome; if not, use _usecols to set the first element to the outcome column label.

    • _usecols is a list of column labels to be used; the first element should be the outcome. The returned mutual information results match the order of _usecols.
    • Pearson_screening_csv_parallel calculates Pearson's correlation between the outcome and the covariates only, in a similar manner -- since pandas.DataFrame.corr calculates pairwise Pearson's correlation for the entire dataframe.
    • csv_engine can be dask for low-memory situations, one of pandas's read_csv engines, or fastparquet to read a previously created parquet file for faster speed. If fastparquet is chosen, declare parquet_file as the filepath to the parquet file; if dask is chosen to read a very large CSV, it might be necessary to specify a larger sample.
  • continuous_skMIscreening_csv_parallel uses the MI calculation from sklearn.feature_selection.mutual_info_regression to carry out the screening process instead.

  • clump_plink_parallel and clump_continuous_csv_parallel carry out mutual-information-based clumping in parallel very quickly.

  • UAG_LM_SCAD_MCP, UAG_logistic_SCAD_MCP: these functions find a local minimizer for SCAD/MCP-penalized linear models/logistic models. The arguments are:

    • design_matrix: the design matrix input, should be a two-dimensional numpy array;
    • outcome: the outcome, should be a one-dimensional numpy array; continuous for the linear model, binary for the logistic model;
    • beta_0: starting value; optional -- if not declared, it is calculated based on the Gauss-Markov theory estimator of $\beta$;
    • tol: tolerance parameter; convergence is declared when the uniform norm of the difference between two consecutive iterations falls below tol;
    • maxit: maximum number of iterations allowed;
    • _lambda: the penalty level $\lambda$;
    • penalty: could be "SCAD" or "MCP";
    • a=3.7, gamma=2: a for SCAD and gamma for MCP; setting a to $3.7$ is recommended;
    • L_convex: the L-smoothness constant for the convex component; if not declared, it is computed automatically;
    • add_intercept_column: boolean -- should the function add an intercept column?
  • solution_path_LM, solution_path_logistic: calculate the solution path for linear/logistic models; the only difference from the above is that lambda_ is now a one-dimensional numpy array of the $\lambda$ values to be used.

  • UAG_LM_SCAD_MCP_strongrule and UAG_logistic_SCAD_MCP_strongrule work just like UAG_LM_SCAD_MCP and UAG_logistic_SCAD_MCP -- except they use the strong rule to screen out many covariates before carrying out the optimization step. The same applies to solution_path_LM_strongrule and solution_path_logistic_strongrule. The strong rule increases computational speed dramatically.

  • SNP_UAG_LM_SCAD_MCP and SNP_UAG_logistic_SCAD_MCP work similarly to UAG_LM_SCAD_MCP and UAG_logistic_SCAD_MCP, and SNP_solution_path_LM and SNP_solution_path_logistic work similarly to solution_path_LM and solution_path_logistic -- except that they take plink1 files, which makes them more memory-efficient. Since PCA adjustment is commonly used to adjust for population structure, principal components can be supplied via pca as a 2-d array, with each column being one principal component. The PCA versions are SNP_UAG_LM_SCAD_MCP_PCA and SNP_UAG_logistic_SCAD_MCP_PCA.
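As a rough illustration of the optimization these solvers perform, here is a plain (non-accelerated) proximal-gradient sketch for the SCAD-penalized linear model, built on the standard SCAD thresholding operator. The function names, the lam/L scaling of the threshold, and the absence of Nesterov acceleration are simplifications assumed for this sketch, not the package's actual UAG implementation:

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding operator (Fan & Li, 2001) applied elementwise:
    soft-thresholding for small |z|, a blended region up to a*lam,
    and no shrinkage (unbiasedness) for large |z|."""
    absz = np.abs(z)
    soft = np.sign(z) * np.maximum(absz - lam, 0.0)
    mid = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(absz <= 2.0 * lam, soft,
                    np.where(absz <= a * lam, mid, z))

def prox_grad_scad_lm(X, y, lam, a=3.7, maxit=500, tol=1e-6):
    """Proximal-gradient iterations for SCAD-penalized least squares."""
    n, p = X.shape
    # L-smoothness constant of the least-squares loss
    L = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(maxit):
        grad = X.T @ (X @ beta - y) / n
        beta_new = scad_threshold(beta - grad / L, lam / L, a)
        if np.max(np.abs(beta_new - beta)) < tol:  # uniform-norm stopping rule
            return beta_new
        beta = beta_new
    return beta
```

The stopping rule mirrors the tol argument described above: iteration stops once the uniform norm of the change between consecutive iterates falls below tol.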
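The strong rule used by the _strongrule variants can be sketched as follows. This is the generic sequential strong rule for a penalized linear model; the function name and interface here are illustrative assumptions, not the package's API:

```python
import numpy as np

def strong_rule_keep(X, y, lam_new, lam_prev, beta_prev):
    """Sequential strong rule: moving from penalty level lam_prev to lam_new,
    discard predictor j when its absolute gradient at the previous solution
    falls below 2*lam_new - lam_prev; keep the rest for the actual fit."""
    n = X.shape[0]
    grad = np.abs(X.T @ (y - X @ beta_prev)) / n
    return grad >= 2.0 * lam_new - lam_prev  # boolean mask of kept predictors
```

The rule is heuristic, so screened-out predictors are typically re-checked against the optimality conditions after the fit and added back if violated; in practice it discards the vast majority of covariates, which is what makes the speedup so dramatic.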
