A GP-GCN framework for genomics
This is a Python package for genomics studies built on the GP-GCN (Gapped Pattern Graph Convolutional Networks) framework.
Getting started
Prerequisites
- cython
- numpy
- Biopython
- editdistance
- pytorch 1.7.1
- pytorch_geometric 1.7.0
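Before installing, it can help to see which of the prerequisites above are already importable. The following is a minimal sketch using only the standard library. The module names on the right are the usual import names for these packages, and `missing_prerequisites` is a hypothetical helper, not part of GCNFrame. It does not check the pinned versions.

```python
from importlib.util import find_spec

# Map each prerequisite listed above to the module name it is imported under.
PREREQUISITES = {
    "cython": "Cython",
    "numpy": "numpy",
    "Biopython": "Bio",
    "editdistance": "editdistance",
    "pytorch": "torch",
    "pytorch_geometric": "torch_geometric",
}


def missing_prerequisites(modules=PREREQUISITES):
    """Return the listed package names whose module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


missing = missing_prerequisites()
print("Missing:", ", ".join(missing) if missing else "none")
```

Any name printed after "Missing:" would need to be installed before building or importing the package.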
Install
pip install GCNFrame
Or
git clone https://github.com/deepomicslab/GCNFrame.git
cd GCNFrame/GCNFrame
python setup.py build_ext --inplace
cd ../
Examples
The framework lets you train a customized model in a few lines of code. The example data can be downloaded from Google Drive.
# Example: train a two-class model.
from GCNFrame import Biodata, GCNmodel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the sequences, their labels, and the optional extra features.
data = Biodata(fasta_file="example_data/nature_2017.fasta",
               label_file="example_data/lifestyle_label.txt",
               feature_file="example_data/CDD_protein_feature.txt")
# Encode the sequences using 20 threads.
dataset = data.encode(thread=20)
# other_feature_dim is the dimension of the extra features (0 if there are none).
model = GCNmodel.model(label_num=2, other_feature_dim=206).to(device)
GCNmodel.train(dataset, model, weighted_sampling=True)
# Run the saved model on a FASTA file.
GCNmodel.test(model_name="GCN_model.pt",
              fasta_file="example_data/nature_2017.fasta",
              feature_file="example_data/CDD_protein_feature.txt")
The output is shown below:
Encoding sequences...
Epoch 0| Loss: 0.6335| Train accuracy: 0.7480| Validation accuracy: 0.8839
Epoch 1| Loss: 0.5605| Train accuracy: 0.8165| Validation accuracy: 0.7032
Epoch 2| Loss: 0.5042| Train accuracy: 0.8469| Validation accuracy: 0.8065
Epoch 3| Loss: 0.4873| Train accuracy: 0.8344| Validation accuracy: 0.7677
Epoch 4| Loss: 0.4559| Train accuracy: 0.8703| Validation accuracy: 0.8194
Epoch 5| Loss: 0.4533| Train accuracy: 0.8763| Validation accuracy: 0.7806
Epoch 6| Loss: 0.4372| Train accuracy: 0.8931| Validation accuracy: 0.8387
Epoch 7| Loss: 0.4409| Train accuracy: 0.8842| Validation accuracy: 0.8581
Epoch 8| Loss: 0.4357| Train accuracy: 0.8858| Validation accuracy: 0.8516
Epoch 9| Loss: 0.4314| Train accuracy: 0.8987| Validation accuracy: 0.8387
Epoch 10| Loss: 0.4246| Train accuracy: 0.8992| Validation accuracy: 0.8581
Epoch 11| Loss: 0.4085| Train accuracy: 0.9180| Validation accuracy: 0.8839
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161
The model with the best validation accuracy is saved as GCN_model.pt.
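To see which epoch produced the saved model, the training log can be scanned for the highest validation accuracy. The sketch below assumes the log format shown above. `best_epoch` is a hypothetical helper, not part of GCNFrame, and on ties it keeps the earliest epoch, which may or may not match what the package does.

```python
import re

# Matches one line of the training log shown above.
LOG_LINE = re.compile(
    r"Epoch (\d+)\| Loss: ([\d.]+)\| Train accuracy: ([\d.]+)"
    r"\| Validation accuracy: ([\d.]+)"
)


def best_epoch(log_text):
    """Return (epoch, validation_accuracy) for the highest validation accuracy.

    Ties keep the earliest epoch (an assumption, not confirmed by the source).
    Returns None if no log line is found.
    """
    best = None
    for match in LOG_LINE.finditer(log_text):
        epoch, val_acc = int(match.group(1)), float(match.group(4))
        if best is None or val_acc > best[1]:
            best = (epoch, val_acc)
    return best


log = """\
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161
"""
print(best_epoch(log))  # (15, 0.9161)
```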
The package also provides functions to identify the gapped patterns or motifs that contribute most to a prediction task.
# The pattern_contribution_score function returns a list of contribution scores for the 4,096 gapped patterns.
score_list = pattern_contribution_score(fasta_file="example_data/nature_2017.fasta",
                                        label_file="example_data/lifestyle_label.txt",
                                        feature_file="example_data/CDD_protein_feature.txt")
The scores for the gapped patterns are also saved to a text file (pattern_contribution_score.txt by default).
# The pattern_group_contribution_score function groups similar gapped patterns and analyzes the occurrence and scores of each group.
pattern_group_contribution_score(fasta_file="example_data/nature_2017.fasta",
                                 label_file="example_data/lifestyle_label.txt",
                                 score_list=score_list)
The results are saved as figures.
# The motif_contribution_score function calculates the contribution score for a given motif.
score = motif_contribution_score(fasta_file="example_data/nature_2017.fasta",
                                 label_file="example_data/lifestyle_label.txt",
                                 motif="AAAAAATTCG",
                                 feature_file="example_data/CDD_protein_feature.txt")
print("The contribution score for AAAAAATTCG is %s." % score)
Parameters
class Biodata.Biodata
- fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
- label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
- feature_file: Other features (such as gene density) of the DNA sequences used for training and evaluation (must be in the same order as fasta_file) (default=None).
- K: The length of the K-mers used for encoding (default=3).
- d: The number of spaced distances used for encoding (default=3).
- thread: The number of threads used for encoding (default=10).
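Because label_file has to follow the order of fasta_file, a quick count check before encoding can catch obvious mismatches. The sketch below assumes the label file holds one label per non-empty line, which the source does not state. The helper names are hypothetical and not part of GCNFrame. It verifies counts only, not the order itself.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def count_label_lines(path):
    """Count non-empty lines; assumes one label per line (an assumption)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def check_counts(fasta_file, label_file):
    """Return True if the number of sequences equals the number of labels."""
    return count_fasta_records(fasta_file) == count_label_lines(label_file)


# Small self-contained demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "demo.fasta")
    labels = os.path.join(tmp, "demo_label.txt")
    with open(fasta, "w") as f:
        f.write(">seq1\nACGTACGT\n>seq2\nTTGACCA\n")
    with open(labels, "w") as f:
        f.write("0\n1\n")
    counts_match = check_counts(fasta, labels)

print(counts_match)  # True
```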
class GCNmodel.model
- label_num: The number of labels.
- other_feature_dim: The dimension of the other features; 0 if not available.
- K: The length of the K-mers used for encoding (default=3).
- d: The number of spaced distances used for encoding (default=3).
- node_hidden_dim: The size of the K-mer nodes after transformation (default=3).
- gcn_dim: The output size of SAGEConv (default=128).
- gcn_layer_num: The number of SAGEConv layers (default=4).
- cnn_dim: The output size of the convolutional layers (default=64).
- cnn_layer_num: The number of convolutional layers (default=3).
- cnn_kernel_size: The kernel size of the convolutional layers (default=8).
- fc_dim: The number of neurons in the fully connected layers (default=100).
- dropout_rate: The dropout rate (default=0.2).
- pnode_nn: Whether to transform primary nodes (default=True).
- fnode_nn: Whether to transform target nodes (default=True).
GCNmodel.train
- learning_rate: The learning rate for training (default=1e-4).
- batch_size: The batch size for training (default=64).
- epoch_n: The number of training epochs (default=20).
- random_seed: The random seed for the train-validation split (default=111).
- val_split: The validation size (default=0.1).
- weighted_sampling: Whether to use weighted sampling for training (default=True).
- model_name: The saved model name (default="GCN_model.pt").
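To illustrate what random_seed, val_split, and weighted_sampling control, here is a minimal standard-library sketch of a seeded train-validation split followed by inverse-class-frequency sampling weights. It assumes val_split is a fraction of the data and shows one common way to implement weighted sampling. It is not GCNFrame's actual implementation, and the function names are hypothetical.

```python
import random
from collections import Counter


def train_val_split(n_samples, val_split=0.1, random_seed=111):
    """Shuffle indices with a fixed seed and hold out a fraction for validation."""
    indices = list(range(n_samples))
    random.Random(random_seed).shuffle(indices)
    n_val = int(n_samples * val_split)
    return indices[n_val:], indices[:n_val]


def sampling_weights(labels):
    """Weight each sample by the inverse frequency of its class."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [0] * 8 + [1] * 2  # imbalanced toy labels
train_idx, val_idx = train_val_split(len(labels), val_split=0.2)
weights = sampling_weights([labels[i] for i in train_idx])

# Draw a batch in which rare-class samples are picked more often
# than they would be under uniform sampling.
rng = random.Random(0)
batch = rng.choices(train_idx, weights=weights, k=4)
print(len(train_idx), len(val_idx))  # 8 2
```

With a fixed seed the split is reproducible, so repeated runs validate on the same sequences.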
GCNmodel.test
- fasta_file: The DNA sequences used for testing, in FASTA format.
- model_name: The saved model name (default="GCN_model.pt").
- feature_file: Other features (such as gene density) of the DNA sequences used for testing (must be in the same order as fasta_file) (default=None).
- output_file: The output file name (default="test_output.txt").
- thread: The number of threads used for encoding (default=10).
- K: The length of the K-mers used for encoding (default=3).
- d: The number of spaced distances used for encoding (default=3).
pattern_contribution_score
- fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
- label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
- target_label: The label of the class being analyzed (default=1).
- model_name: The saved model name (default="GCN_model.pt").
- feature_file: Other features (such as gene density) of the DNA sequences used for training and evaluation (must be in the same order as fasta_file) (default=None).
- output_file: The output file name (default="pattern_contribution_score.txt").
- thread: The number of threads used for encoding (default=10).
- K: The length of the K-mers used for encoding (default=3).
- d: The number of spaced distances used for encoding (default=3).
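The figure of 4,096 gapped patterns is consistent with ordered pairs of 3-mers over the DNA alphabet (4^3 × 4^3 = 4,096). The sketch below works under that reading and assumes a gapped pattern is two K-mers separated by a fixed number of bases. This definition, the `gap` parameter, and the function names are illustrative assumptions, not GCNFrame's encoder.

```python
from collections import Counter
from itertools import product


def all_gapped_patterns(K=3):
    """Enumerate every ordered pair of K-mers over the DNA alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=K)]
    return [(a, b) for a in kmers for b in kmers]


def count_gapped_patterns(seq, K=3, gap=1):
    """Count K-mer pairs separated by `gap` bases (hypothetical definition)."""
    counts = Counter()
    span = 2 * K + gap  # total window covered by one pattern
    for i in range(len(seq) - span + 1):
        first = seq[i:i + K]
        second = seq[i + K + gap:i + span]
        counts[(first, second)] += 1
    return counts


print(len(all_gapped_patterns(K=3)))  # 4096
counts = count_gapped_patterns("ACGTACGTAC", K=3, gap=1)
print(counts[("ACG", "ACG")])  # 1
```

A score list indexed in the same order as `all_gapped_patterns` would have one entry per pattern, though the ordering the package uses is not stated in the source.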
motif_contribution_score
- fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
- label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
- motif: The motif to be analyzed.
- target_label: The label of the class being analyzed (default=1).
- model_name: The saved model name (default="GCN_model.pt").
- feature_file: Other features (such as gene density) of the DNA sequences used for training and evaluation (must be in the same order as fasta_file) (default=None).
- thread: The number of threads used for encoding (default=10).
- K: The length of the K-mers used for encoding (default=3).
- d: The number of spaced distances used for encoding (default=3).
pattern_group_contribution_score
- fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
- label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
- score_list: The contribution scores of the 4,096 gapped patterns.
- target_label: The label of the class being analyzed (default=1).
- d: The number of spaced distances used for encoding (default=3).
Version history
- v0.1.1: Add contribution score functions.
- v0.0.1: Initial version.
Maintainer
WANG Ruohan ruohawang2-c@my.cityu.edu.hk