A Python package for genomics studies built on a GCN framework.

A GP-GCN framework for genomics

This is a Python package for genomics studies built on the GP-GCN (Gapped Pattern Graph Convolutional Networks) framework.

Getting started

Prerequisites

  • cython
  • numpy
  • Biopython
  • editdistance
  • pytorch 1.7.1
  • pytorch_geometric 1.7.0
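
Before installing, it can help to confirm which prerequisites are already importable. The snippet below is a minimal sketch using only the standard library. The import names (Cython, Bio, torch_geometric, and so on) are the usual module names for these packages, and missing_prerequisites is a hypothetical helper, not part of GCNFrame. It does not check the pinned versions.

```python
from importlib.util import find_spec

# Prerequisites listed above, mapped to the module name each one installs.
PREREQUISITES = {
    "cython": "Cython",
    "numpy": "numpy",
    "Biopython": "Bio",
    "editdistance": "editdistance",
    "pytorch": "torch",
    "pytorch_geometric": "torch_geometric",
}

def missing_prerequisites(modules=PREREQUISITES):
    """Return the names of prerequisites whose module cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_prerequisites()
    print("Missing:", ", ".join(missing) if missing else "none")
```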

Install

pip install GCNFrame

Or build from source:

git clone https://github.com/deepomicslab/GCNFrame.git
cd GCNFrame/GCNFrame
python setup.py build_ext --inplace
cd ../

Examples

The framework makes it easy to train customized models with a few lines of code. The example data can be downloaded from Google Drive.

# Example: train a two-class model.
from GCNFrame import Biodata, GCNmodel
import torch

# Use the GPU when one is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the sequences, labels, and extra features, then encode them.
data = Biodata(fasta_file="example_data/nature_2017.fasta",
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")
dataset = data.encode(thread=20)

# other_feature_dim is the dimension of the extra features (206 in this example).
model = GCNmodel.model(label_num=2, other_feature_dim=206).to(device)
GCNmodel.train(dataset, model, weighted_sampling=True)

# Run prediction with the saved model.
GCNmodel.test(model_name="GCN_model.pt", fasta_file="example_data/nature_2017.fasta", feature_file="example_data/CDD_protein_feature.txt")

The output is shown below:

Encoding sequences...
Epoch 0| Loss: 0.6335| Train accuracy: 0.7480| Validation accuracy: 0.8839
Epoch 1| Loss: 0.5605| Train accuracy: 0.8165| Validation accuracy: 0.7032
Epoch 2| Loss: 0.5042| Train accuracy: 0.8469| Validation accuracy: 0.8065
Epoch 3| Loss: 0.4873| Train accuracy: 0.8344| Validation accuracy: 0.7677
Epoch 4| Loss: 0.4559| Train accuracy: 0.8703| Validation accuracy: 0.8194
Epoch 5| Loss: 0.4533| Train accuracy: 0.8763| Validation accuracy: 0.7806
Epoch 6| Loss: 0.4372| Train accuracy: 0.8931| Validation accuracy: 0.8387
Epoch 7| Loss: 0.4409| Train accuracy: 0.8842| Validation accuracy: 0.8581
Epoch 8| Loss: 0.4357| Train accuracy: 0.8858| Validation accuracy: 0.8516
Epoch 9| Loss: 0.4314| Train accuracy: 0.8987| Validation accuracy: 0.8387
Epoch 10| Loss: 0.4246| Train accuracy: 0.8992| Validation accuracy: 0.8581
Epoch 11| Loss: 0.4085| Train accuracy: 0.9180| Validation accuracy: 0.8839
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161

The model with the best validation accuracy is saved as GCN_model.pt.
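
To see which epoch that checkpoint corresponds to, the training log can be parsed directly. The sketch below assumes the log lines have exactly the format shown above. best_epoch is a hypothetical helper, not a GCNFrame function, and it resolves ties in favor of the earliest epoch, which may differ from how the package itself breaks ties.

```python
import re

# Matches lines such as:
# Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161
LOG_LINE = re.compile(
    r"Epoch (\d+)\| Loss: ([\d.]+)\| Train accuracy: ([\d.]+)\| Validation accuracy: ([\d.]+)"
)

def best_epoch(log_text):
    """Return (epoch, validation_accuracy) for the highest validation accuracy."""
    records = [
        (int(m.group(1)), float(m.group(4)))
        for m in LOG_LINE.finditer(log_text)
    ]
    if not records:
        raise ValueError("no epoch lines found")
    # max() keeps the first maximal item, so ties go to the earliest epoch.
    return max(records, key=lambda rec: rec[1])

log = """\
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161
"""
print(best_epoch(log))  # (15, 0.9161)
```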

The package also provides functions for mining the gapped patterns or motifs that most influence a prediction task.

# The pattern_contribution_score function returns a list of contribution scores for the 4,096 gapped patterns.
score_list = pattern_contribution_score(fasta_file="example_data/nature_2017.fasta",
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")

The scores for the gapped patterns are also saved to a text file.

# The pattern_group_contribution_score function groups similar gapped patterns and analyzes the occurrence and scores of each group.
pattern_group_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", score_list=score_list)

The results are saved as figures.

# The motif_contribution_score function calculates the contribution score for a given motif.
score = motif_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", motif="AAAAAATTCG", feature_file="example_data/CDD_protein_feature.txt")
print("The contribution score for AAAAAATTCG is %s."%score)

Parameters

class Biodata.Biodata

  • fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
  • label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
  • feature_file: Other features (such as gene density) of the DNA sequences used for training and evaluation (must be in the same order as fasta_file) (default=None).
  • K: The K-mer length used for encoding (default=3).
  • d: The number of spaced distances used for encoding (default=3).
  • thread: The number of threads used for encoding (default=10).
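
Because labels and features are matched to sequences by position, a mismatch in record counts is an easy mistake to make. The following sketch counts FASTA records and non-empty lines in each companion file. It assumes the label and feature files hold one entry per line with no header row, which this description does not state explicitly, and count_fasta_records, count_entries, and check_counts are hypothetical helpers rather than part of GCNFrame.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

def count_entries(path):
    """Count non-empty lines (assumed: one entry per sequence, no header row)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())

def check_counts(fasta_file, *companion_files):
    """Raise ValueError if a companion file disagrees with the FASTA record count."""
    n_seq = count_fasta_records(fasta_file)
    for path in companion_files:
        n = count_entries(path)
        if n != n_seq:
            raise ValueError(
                f"{path}: {n} entries, but {fasta_file} has {n_seq} sequences"
            )
    return n_seq
```

Under these assumptions, check_counts("example_data/nature_2017.fasta", "example_data/lifestyle_label.txt") could be called before constructing Biodata. Matching counts do not prove matching order, so this catches only the most obvious mismatches.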

class GCNmodel.model

  • label_num: The number of labels.
  • other_feature_dim: The dimension of the other features; 0 if none are available.
  • K: The K-mer length used for encoding (default=3).
  • d: The number of spaced distances used for encoding (default=3).
  • node_hidden_dim: The size of the K-mer nodes after transformation (default=3).
  • gcn_dim: The output size of the SAGEConv layers (default=128).
  • gcn_layer_num: The number of SAGEConv layers (default=4).
  • cnn_dim: The output size of the convolutional layers (default=64).
  • cnn_layer_num: The number of convolutional layers (default=3).
  • cnn_kernel_size: The kernel size of the convolutional layers (default=8).
  • fc_dim: The number of neurons in the fully connected layers (default=100).
  • dropout_rate: The dropout rate (default=0.2).
  • pnode_nn: Whether to transform primary nodes (default=True).
  • fnode_nn: Whether to transform target nodes (default=True).

GCNmodel.train

  • learning_rate: The learning rate for training (default=1e-4).
  • batch_size: The batch size for training (default=64).
  • epoch_n: The number of training epochs (default=20).
  • random_seed: The random seed for the train-validation split (default=111).
  • val_split: The fraction of the data held out for validation (default=0.1).
  • weighted_sampling: Whether to use weighted sampling for training (default=True).
  • model_name: The file name of the saved model (default="GCN_model.pt").
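
The val_split, random_seed, and weighted_sampling options can be pictured with a small standard-library sketch. It is not GCNFrame's implementation: it assumes the split is a seeded shuffle and that weighted sampling draws each training example with probability inversely proportional to the size of its class, a common scheme for imbalanced labels. The helper names are hypothetical.

```python
import random
from collections import Counter

def split_indices(n, val_split=0.1, random_seed=111):
    """Shuffle indices 0..n-1 with a fixed seed and hold out a validation fraction."""
    indices = list(range(n))
    random.Random(random_seed).shuffle(indices)
    n_val = int(n * val_split)
    return indices[n_val:], indices[:n_val]  # (train, validation)

def inverse_frequency_weights(labels):
    """Give each example a weight of 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

labels = [0] * 90 + [1] * 10  # imbalanced toy labels
train_idx, val_idx = split_indices(len(labels))
train_labels = [labels[i] for i in train_idx]
weights = inverse_frequency_weights(train_labels)

# Draw one batch with replacement. Each class present in the training set
# carries the same total weight, so classes are drawn about equally often.
rng = random.Random(0)
batch = rng.choices(train_labels, weights=weights, k=64)
print(len(train_idx), len(val_idx), Counter(batch))
```

Fixing the seed makes the split reproducible across runs, which is what allows validation accuracies from different runs to be compared.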

GCNmodel.test

  • fasta_file: The DNA sequences used for testing, in FASTA format.
  • model_name: The file name of the saved model (default="GCN_model.pt").
  • feature_file: Other features (such as gene density) of the DNA sequences used for testing (must be in the same order as fasta_file) (default=None).
  • output_file: The output file name (default="test_output.txt").
  • thread: The number of threads used for encoding (default=10).
  • K: The K-mer length used for encoding (default=3).
  • d: The number of spaced distances used for encoding (default=3).

pattern_contribution_score

  • fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
  • label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
  • target_label: The label of the class being analyzed (default=1).
  • model_name: The file name of the saved model (default="GCN_model.pt").
  • feature_file: Other features (such as gene density) of the DNA sequences used for training and evaluation (must be in the same order as fasta_file) (default=None).
  • output_file: The output file name (default="pattern_contribution_score.txt").
  • thread: The number of threads used for encoding (default=10).
  • K: The K-mer length used for encoding (default=3).
  • d: The number of spaced distances used for encoding (default=3).

motif_contribution_score

  • fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
  • label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
  • motif: The motif to be analyzed.
  • target_label: The label of the class being analyzed (default=1).
  • model_name: The file name of the saved model (default="GCN_model.pt").
  • feature_file: Other features (such as gene density) of the DNA sequences used for training and evaluation (must be in the same order as fasta_file) (default=None).
  • thread: The number of threads used for encoding (default=10).
  • K: The K-mer length used for encoding (default=3).
  • d: The number of spaced distances used for encoding (default=3).

pattern_group_contribution_score

  • fasta_file: The DNA sequences used for training and evaluation, in FASTA format.
  • label_file: The labels of the DNA sequences used for training and evaluation (must be in the same order as fasta_file).
  • score_list: The contribution scores of the 4,096 gapped patterns.
  • target_label: The label of the class being analyzed (default=1).
  • d: The number of spaced distances used for encoding (default=3).
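
The figure of 4,096 is consistent with the default K=3: an ordered pair of 3-mers over the DNA alphabet gives 4^3 x 4^3 = 4,096 combinations. The sketch below enumerates such pairs under the assumption that a gapped pattern is a pair of K-mers separated by a gap of 0 to d-1 bases. The exact definition used by GCNFrame is not given here, so both that definition and the helper name gapped_pairs should be read as illustrative.

```python
from itertools import product

ALPHABET = "ACGT"

def all_kmers(k):
    """All 4**k K-mers over the DNA alphabet."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]

def gapped_pairs(sequence, k=3, d=3):
    """Yield (first_kmer, second_kmer, gap) for gaps 0..d-1 (assumed definition)."""
    for gap in range(d):
        span = 2 * k + gap  # total window: two K-mers plus the gap between them
        for start in range(len(sequence) - span + 1):
            first = sequence[start:start + k]
            second = sequence[start + k + gap:start + span]
            yield first, second, gap

print(len(all_kmers(3)) ** 2)  # 4096
print(list(gapped_pairs("AAATTTC", d=2)))
```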

Version history

  • v0.1.1: Add contribution score functions.
  • v0.0.1: Initial version.

Maintainer

WANG Ruohan ruohawang2-c@my.cityu.edu.hk

Source Distribution

GCNFrame-0.1.2.tar.gz (295.8 kB view details)

Uploaded Source

File details

Details for the file GCNFrame-0.1.2.tar.gz.

File metadata

  • Download URL: GCNFrame-0.1.2.tar.gz
  • Upload date:
  • Size: 295.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.8.3

File hashes

Hashes for GCNFrame-0.1.2.tar.gz
Algorithm Hash digest
SHA256 3464572c6e9f12180738d489e5964d98dc5eeac488979675d38ebafbb22bb7fa
MD5 99dc9c7fb42ffbed3dc85113332772b7
BLAKE2b-256 90ba01669edc730fc5e5a000d72dca7d98ccd584575d667bdf668eb540ac0492
