Attention-based BiLSTM model for Olfactory Analysis

Project description

This package contains code to train an attention-based deep learning model that predicts activation status for given ligands and receptors.
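The package is published on PyPI as AttentionOdorify (see the Built Distribution section below) and installs with pip. Note that the import path in the sketch below is an assumption inferred from the package name, not something stated in this description; adjust it to the actual module name inside the installed wheel.

    # Install from PyPI:
    #   pip install AttentionOdorify
    # Hypothetical import path, inferred from the PyPI package name:
    from AttentionOdorify import (split_data, train, test, grid_search,
                                  generate_result_matrix)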

Detailed description of functions:

split_data:

This function takes a pandas DataFrame as input and returns preprocessed, model-ready data. Its parameters include val_ratio, which splits the data into train and validation sets, and a test flag, which enables test_ratio to split the data into train, validation, and test sets. The function also oversamples the minority class (the class with fewer samples) in the training data to avoid bias during training. Please note that the ligands column should be named SMILES, the sequences column should be named Final_Sequence, and the activation-status column should be named Activation_Status; a preparation sketch follows the examples below.

  • Basic code required for the split_data function -

      X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, Y_val = split_data(dataframe)
    
  • Provide a validation split ratio explicitly -

      X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, Y_val = split_data(dataframe, val_ratio=0.2)
    
  • Perform test and validation splitting with given ratios -

      X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, Y_val, X_smile_onehot_test, X_seq_onehot_test, Y_test = split_data(dataframe, val_ratio=0.2, test=True, test_ratio=0.1)
    
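Because split_data expects those exact column names, rename any differently named columns before calling it. A minimal preparation sketch (the input file name and the original column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("odorant_receptor_pairs.csv")  # hypothetical input file
    # split_data expects these exact column names:
    df = df.rename(columns={
        "ligand_smiles": "SMILES",          # hypothetical original name
        "receptor_seq": "Final_Sequence",   # hypothetical original name
        "label": "Activation_Status",       # hypothetical original name
    })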

train:

This function trains a model on the given training and validation data. You can modify the LSTM model parameters along with basic hyperparameters like learning rate, dropout, batch size, and number of training epochs (these parameters are optional; the code runs with default values as well). The Adam optimizer is used to train the entire model. This function also saves the loss plot for your trained model in the current directory; a custom filename for the saved plot can be provided via the filename parameter.

  • Basic code required for the train function -

      model = train(X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val)
    
  • Updating hyperparameters like learning rate, dropout, batch size and epochs -

      model = train(X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, learning_rate=0.0002, dropout=0.5, batch_size=16, epochs=50, filename="train_val_loss")
    
  • Update the model architecture by changing the number of recurrent layers and hidden states (update carefully to avoid memory overflow) -

      model = train(X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, smile_h=100, smile_l=1, seq_h=100, seq_l=1)
    

test:

This function computes standard metrics such as accuracy, precision, recall, AUC, and kappa on any data for a trained model. It also generates a ROC curve and saves it in the current directory; a custom filename for the saved plot can be provided via the filename parameter.

  • The code snippet below generates a ROC curve for the model -

      test(model, X_smile_onehot_test, X_seq_onehot_test, Y_test, filename = "test")
    
  • The snippet below generates a class-wise ROC curve -

      test(model, X_smile_onehot_test, X_seq_onehot_test, Y_test, filename="test", flag=1)
    

grid_search:

This method helps find optimal model hyperparameters by training a group of models and selecting the best one based on the highest validation accuracy after the same number of epochs. Please be careful while using this function to avoid exhausting the CUDA memory of your system. Conceptually, the search behaves like the nested-loop sketch after the examples below.

  • Basic code required for the grid_search function with default options -

      grid_search(X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, Y_val)
    
  • Try various combinations of learning_rate, batch_size and dropout -

      grid_search(X_smile_onehot_train, X_seq_onehot_train, Y_train, X_smile_onehot_val, X_seq_onehot_val, Y_val, batch_size=[16, 32], learning_rate=[1e-4, 1.2e-4], dropout=[0.2, 0.7], epochs=10)
    
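For intuition, grid_search trains one candidate model per hyperparameter combination, each for the same number of epochs, and keeps the one with the highest validation accuracy. The sketch below illustrates only that selection rule; it is not the package's implementation, and run_grid_search, train_one, and evaluate are hypothetical stand-ins:

    from itertools import product

    def run_grid_search(train_one, evaluate, grid):
        # Train one candidate per combination; keep the best validation accuracy.
        best_acc, best_model, best_params = float("-inf"), None, None
        for values in product(*grid.values()):
            kwargs = dict(zip(grid.keys(), values))
            model = train_one(**kwargs)   # same epoch budget for every candidate
            acc = evaluate(model)         # validation accuracy of this candidate
            if acc > best_acc:
                best_acc, best_model, best_params = acc, model, kwargs
        return best_model, best_params

    # A grid mirroring the call above:
    grid = {"learning_rate": [1e-4, 1.2e-4], "batch_size": [16, 32], "dropout": [0.2, 0.7]}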

generate_result_matrix:

This method gives the prediction and probability matrices for the given sequences and ligands; a sketch of how the returned matrices might be inspected follows the example below.

pred_matrix, prob_matrix = generate_result_matrix(model, smiles, seqs)
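The layout of the returned matrices is not documented above; assuming one entry per (sequence, ligand) pair, with rows indexing sequences and columns indexing ligands, they could be inspected along these lines:

    import numpy as np

    pred_matrix, prob_matrix = generate_result_matrix(model, smiles, seqs)

    # Assumption: rows index sequences and columns index ligands.
    i, j = np.unravel_index(np.argmax(prob_matrix), np.shape(prob_matrix))
    print(f"Highest-probability pair: sequence {i}, ligand {j} "
          f"(prediction {pred_matrix[i][j]}, probability {prob_matrix[i][j]})")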

interpretebility:

This method saves the plots for ligand and receptor interpretability, along with molecular-structure interpretability, in the provided path. It takes a trained model and a save path, along with a single SMILES string and sequence for which interpretability is computed.

interpretebilty(model, user_smile, user_seq, path = "./")
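
Putting the functions together, a typical end-to-end run might look like the sketch below; the input file name and hyperparameter values are illustrative, and the imports follow the assumed path from the setup sketch above.

    import pandas as pd

    # Hypothetical input file containing the required SMILES, Final_Sequence
    # and Activation_Status columns.
    df = pd.read_csv("ligand_receptor_pairs.csv")

    # Split into train/validation/test; the minority class is oversampled in train.
    (X_smile_onehot_train, X_seq_onehot_train, Y_train,
     X_smile_onehot_val, X_seq_onehot_val, Y_val,
     X_smile_onehot_test, X_seq_onehot_test, Y_test) = split_data(
        df, val_ratio=0.2, test=True, test_ratio=0.1)

    # Train with explicit hyperparameters, then evaluate on the held-out test split.
    model = train(X_smile_onehot_train, X_seq_onehot_train, Y_train,
                  X_smile_onehot_val, X_seq_onehot_val,
                  learning_rate=0.0002, dropout=0.5, batch_size=16, epochs=50)
    test(model, X_smile_onehot_test, X_seq_onehot_test, Y_test, filename="roc_test")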

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distribution

AttentionOdorify-3.0.1-py3-none-any.whl (10.3 kB)


File details

Details for the file AttentionOdorify-3.0.1-py3-none-any.whl.

File metadata

  • Download URL: AttentionOdorify-3.0.1-py3-none-any.whl
  • Upload date:
  • Size: 10.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/4.5.0 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.0 CPython/3.6.5

File hashes

Hashes for AttentionOdorify-3.0.1-py3-none-any.whl:

  • SHA256: 6cf30b8e0a104bc07a1325e045eb0dc6249657c2279952058ae78a9c246caccf
  • MD5: 4343d1ee1f0a2c7cfed68cb1d9f464ab
  • BLAKE2b-256: 8f6ce910e10e7a3f28647f520a8fdd4668b904055c5de922077306a012bbe260
