This is a Python package for genomics studies with a GCN framework.


a GP-GCN framework for genomics

This is a Python package for genomics studies built on a GP-GCN (Gapped Pattern Graph Convolutional Networks) framework.

Getting started

Prerequisite

cython
numpy
Biopython
editdistance
pytorch 1.7.1
pytorch_geometric 1.7.0

Install

Install the released package from PyPI:

pip install GCNFrame

Alternatively, build from source:

git clone https://github.com/deepomicslab/GCNFrame.git
cd GCNFrame/GCNFrame
python setup.py build_ext --inplace
cd ../

Examples

The framework lets you train customized models with a few lines of code. The example data can be downloaded from Google Drive.

# Example: train a two-class model.
from GCNFrame import Biodata, GCNmodel
import torch

# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the sequences, their labels, and the optional extra features.
data = Biodata(fasta_file="example_data/nature_2017.fasta",
               label_file="example_data/lifestyle_label.txt",
               feature_file="example_data/CDD_protein_feature.txt")
# Encode the sequences using 20 threads.
dataset = data.encode(thread=20)
# other_feature_dim is the dimension of the extra features (206 in this example).
model = GCNmodel.model(label_num=2, other_feature_dim=206).to(device)
GCNmodel.train(dataset, model, weighted_sampling=True)
# Run prediction with the saved model.
GCNmodel.test(model_name="GCN_model.pt",
              fasta_file="example_data/nature_2017.fasta",
              feature_file="example_data/CDD_protein_feature.txt")

The output is shown below:

Encoding sequences...
Epoch 0| Loss: 0.6335| Train accuracy: 0.7480| Validation accuracy: 0.8839
Epoch 1| Loss: 0.5605| Train accuracy: 0.8165| Validation accuracy: 0.7032
Epoch 2| Loss: 0.5042| Train accuracy: 0.8469| Validation accuracy: 0.8065
Epoch 3| Loss: 0.4873| Train accuracy: 0.8344| Validation accuracy: 0.7677
Epoch 4| Loss: 0.4559| Train accuracy: 0.8703| Validation accuracy: 0.8194
Epoch 5| Loss: 0.4533| Train accuracy: 0.8763| Validation accuracy: 0.7806
Epoch 6| Loss: 0.4372| Train accuracy: 0.8931| Validation accuracy: 0.8387
Epoch 7| Loss: 0.4409| Train accuracy: 0.8842| Validation accuracy: 0.8581
Epoch 8| Loss: 0.4357| Train accuracy: 0.8858| Validation accuracy: 0.8516
Epoch 9| Loss: 0.4314| Train accuracy: 0.8987| Validation accuracy: 0.8387
Epoch 10| Loss: 0.4246| Train accuracy: 0.8992| Validation accuracy: 0.8581
Epoch 11| Loss: 0.4085| Train accuracy: 0.9180| Validation accuracy: 0.8839
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161

The model with the best validation accuracy is saved as GCN_model.pt.
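The log format shown above can be post-processed to find which epoch produced the saved model. This is a minimal sketch, assuming the training output has been captured as text in exactly that format; `best_epoch` is a hypothetical helper for illustration and not part of GCNFrame.

```python
import re

# Matches lines such as:
# "Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161"
LOG_LINE = re.compile(
    r"Epoch (\d+)\| Loss: ([\d.]+)\| Train accuracy: ([\d.]+)\| Validation accuracy: ([\d.]+)"
)

def best_epoch(log_text):
    """Return (epoch, validation_accuracy) for the highest validation accuracy.

    Hypothetical helper; ties go to the earliest epoch, and None is
    returned when no epoch line is found.
    """
    best = None
    for match in LOG_LINE.finditer(log_text):
        epoch = int(match.group(1))
        val_acc = float(match.group(4))
        if best is None or val_acc > best[1]:
            best = (epoch, val_acc)
    return best

# Three lines copied from the log above.
sample_log = """Encoding sequences...
Epoch 0| Loss: 0.6335| Train accuracy: 0.7480| Validation accuracy: 0.8839
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161
"""
print(best_epoch(sample_log))
```

In the full log above, epoch 15 has the highest validation accuracy (0.9161).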

The package also provides functions to identify the gapped patterns or motifs that contribute most to a prediction task.

# pattern_contribution_score returns a list of contribution scores, one for each of the 4,096 gapped patterns.
score_list = pattern_contribution_score(fasta_file="example_data/nature_2017.fasta",
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")

The scores for the gapped patterns are also saved to a text file (pattern_contribution_score.txt by default).

# pattern_group_contribution_score groups similar gapped patterns and analyzes the occurrence and scores of each group.
pattern_group_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", score_list=score_list)

The results are saved as figures.

# motif_contribution_score calculates the contribution score for a given motif.
score = motif_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", motif="AAAAAATTCG", feature_file="example_data/CDD_protein_feature.txt")
print("The contribution score for AAAAAATTCG is %s."%score)

Parameters

`class Biodata.Biodata`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).

K: The length of K-mer for encoding (default=3).

d: The number of spacing distances used for encoding (default=3).

thread: The number of threads used for encoding (default=10).
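Because `label_file` and `feature_file` must follow the order of `fasta_file`, a quick count check before encoding can catch truncated or mismatched inputs. The sketch below assumes one label per non-empty line, which this page does not specify; `count_fasta_records`, `count_rows`, and `check_alignment` are hypothetical helpers, not GCNFrame functions. It compares counts only and cannot detect rows that are present but in the wrong order.

```python
def count_fasta_records(fasta_lines):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_lines if line.startswith(">"))

def count_rows(lines):
    """Count non-empty lines (assumption: one label or feature row per line)."""
    return sum(1 for line in lines if line.strip())

def check_alignment(fasta_lines, label_lines):
    """Return the record count, or raise ValueError when the counts differ."""
    n_seq = count_fasta_records(fasta_lines)
    n_label = count_rows(label_lines)
    if n_seq != n_label:
        raise ValueError(f"{n_seq} sequences but {n_label} labels")
    return n_seq

# Toy input: two records, the second one wrapped over two lines.
fasta = [">seq1", "ACGTACGT", ">seq2", "TTGACA", "GGCC"]
labels = ["0", "1"]
print(check_alignment(fasta, labels))
```

With real files, the same functions accept open file handles, since they only iterate over lines.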

`class GCNmodel.model`

label_num: The number of labels.

other_feature_dim: The dimension for other features, 0 if not available.

K: The length of K-mer for encoding (default=3).

d: The number of spacing distances used for encoding (default=3).

node_hidden_dim: The size of the k-mer nodes after transformation (default=3).

gcn_dim: The size of output of SAGEConv (default=128).

gcn_layer_num: The number of SAGEConv layers (default=4).

cnn_dim: The size of output of convolutional layers (default=64).

cnn_layer_num: The number of convolutional layers (default=3).

cnn_kernel_size: The kernel size of convolutional layers (default=8).

fc_dim: The number of neurons for the fully connected layers (default=100).

dropout_rate: The dropout rate (default=0.2).

pnode_nn: Whether to transform primary nodes (default=True).

fnode_nn: Whether to transform target nodes (default=True).
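The `gcn_dim` and `gcn_layer_num` parameters refer to SAGEConv layers, which in PyTorch Geometric combine each node's own features with the mean of its neighbors' features by default. The sketch below illustrates that mean-aggregation step only; it uses scalar features and scalar weights to stay dependency-free, whereas a real layer uses weight matrices, so it is a simplified illustration and not how GCNFrame builds its model.

```python
def sage_mean_layer(features, neighbors, w_self, w_neigh):
    """One mean-aggregation step per node.

    out[i] = w_self * x[i] + w_neigh * mean(x[j] for j in neighbors[i])
    Nodes without neighbors receive a zero neighbor term.
    """
    out = []
    for i, x_i in enumerate(features):
        neigh = neighbors.get(i, [])
        mean = sum(features[j] for j in neigh) / len(neigh) if neigh else 0.0
        out.append(w_self * x_i + w_neigh * mean)
    return out

# Toy graph with three nodes; neighbors[i] lists the nodes feeding into node i.
features = [1.0, 2.0, 4.0]
neighbors = {0: [1, 2], 1: [0], 2: []}
print(sage_mean_layer(features, neighbors, w_self=1.0, w_neigh=0.5))
```

Stacking several such layers lets information travel over longer paths in the graph, which is what `gcn_layer_num` controls.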

`GCNmodel.train`

learning_rate: The learning rate for training (default=1e-4).

batch_size: The batch size for training (default=64).

epoch_n: The number of training epochs (default=20).

random_seed: The random seed for the train-validation split (default=111).

val_split: The fraction of the data held out for validation (default=0.1).

weighted_sampling: Whether to use weighted sampling for training (default=True).

model_name: The saved model name (default="GCN_model.pt").
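To make `random_seed`, `val_split`, and `weighted_sampling` concrete, here is a minimal sketch of a seeded train-validation split followed by inverse-class-frequency sampling. It uses only the standard library; `train_val_split` and `sample_weights` are hypothetical helpers, and the weighting scheme is an assumption, since this page does not describe how GCNFrame implements either step.

```python
import random
from collections import Counter

def train_val_split(indices, val_split=0.1, random_seed=111):
    """Shuffle indices with a fixed seed and hold out a validation fraction."""
    rng = random.Random(random_seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_split)
    return shuffled[n_val:], shuffled[:n_val]

def sample_weights(labels):
    """Weight each sample by the inverse frequency of its class (assumed scheme)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

labels = [0] * 90 + [1] * 10  # imbalanced toy labels
train_idx, val_idx = train_val_split(range(len(labels)))
weights = sample_weights([labels[i] for i in train_idx])

# Draw one batch with replacement; each class has the same total weight,
# so samples from the rare class are drawn more often individually.
rng = random.Random(0)
batch = rng.choices(train_idx, weights=weights, k=64)
print(len(train_idx), len(val_idx), len(batch))
```

Fixing the seed makes the split reproducible across runs, which keeps validation accuracies comparable when other settings change.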

`GCNmodel.test`

fasta_file: The DNA sequences used for test in fasta format.

model_name: The saved model name (default="GCN_model.pt").

feature_file: Other features (like gene density) for the DNA sequences for test (should have the same order as fasta_file) (default=None).

output_file: The output file name (default="test_output.txt").

thread: The number of threads used for encoding (default=10).

K: The length of K-mer for encoding (default=3).

d: The number of spacing distances used for encoding (default=3).

`pattern_contribution_score`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

target_label: The label of the class being analyzed (default=1).

model_name: The saved model name (default="GCN_model.pt").

feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).

output_file: The output file name (default="pattern_contribution_score.txt").

thread: The number of threads used for encoding (default=10).

K: The length of K-mer for encoding (default=3).

d: The number of spacing distances used for encoding (default=3).

`motif_contribution_score`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

motif: The motif to be analyzed.

target_label: The label of the class being analyzed (default=1).

model_name: The saved model name (default="GCN_model.pt").

feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).

thread: The number of threads used for encoding (default=10).

K: The length of K-mer for encoding (default=3).

d: The number of spacing distances used for encoding (default=3).

`pattern_group_contribution_score`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

score_list: The contribution scores of the 4,096 gapped patterns.

target_label: The label of the class being analyzed (default=1).

d: The number of spacing distances used for encoding (default=3).
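The figure of 4,096 patterns is consistent with pairs of 3-mers over the DNA alphabet, since 4^3 × 4^3 = 4,096. The sketch below enumerates those pairs and counts their occurrences in a sequence. It assumes a gapped pattern is two K-mers separated by a gap of 0 to d-1 bases; this page does not spell out the exact definition, so the gap convention and the helper names are assumptions, not GCNFrame's implementation.

```python
from collections import Counter
from itertools import product

def all_gapped_patterns(K=3, alphabet="ACGT"):
    """Enumerate every (left K-mer, right K-mer) pair."""
    kmers = ["".join(p) for p in product(alphabet, repeat=K)]
    return [(a, b) for a in kmers for b in kmers]

def count_gapped_patterns(seq, K=3, d=3):
    """Count (left, gap, right) occurrences for gaps 0..d-1 (assumed convention)."""
    counts = Counter()
    for gap in range(d):
        span = 2 * K + gap  # total window length for this gap
        for i in range(len(seq) - span + 1):
            left = seq[i:i + K]
            right = seq[i + K + gap:i + span]
            counts[(left, gap, right)] += 1
    return counts

print(len(all_gapped_patterns()))  # number of K-mer pairs for K=3

# The motif used in the example above.
counts = count_gapped_patterns("AAAAAATTCG")
print(counts[("AAA", 0, "AAA")], counts[("AAA", 1, "TTC")])
```

Under this convention, a 10-base motif such as AAAAAATTCG yields 12 pattern occurrences across the three gap sizes.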

Version history

v0.1.1: Add contribution score functions.
v0.0.1: Initial version.

Maintainer

WANG Ruohan ruohawang2-c@my.cityu.edu.hk

Release history

0.1.2 (this version): May 17, 2023
0.1.1: Dec 29, 2022
0.0.4: Oct 11, 2022
0.0.3: Oct 11, 2022
0.0.2: Oct 10, 2022
0.0.1: Apr 9, 2022

