Prediction tool to identify the class of unknown sequences assembled from metagenomes

Project description

TetraPredX

Microbial sequence predictor using short DNA features.

TetraPredX can be used to predict the origin of unknown sequences assembled from metagenomic or metatranscriptomic datasets. It can also be used in combination with the UnXplore framework.

Dependencies:

This tool requires the following to be installed:

  • Python 3.6 or higher
  • BioPython
  • joblib
  • scikit-learn (imported as sklearn)
  • seaborn
  • pathos

Note: When the package is installed from PyPI, these dependencies are installed automatically.

OR

Use the tetrapredx.yml file to create a new environment with all dependencies.

Installation and Usage:

# clone the data repository
git clone https://github.com/sejmodha/TetraPredX.git

# move into the cloned repository, where tetrapredx.yml is expected to be
cd TetraPredX

# create conda environment
conda env create -f tetrapredx.yml

# activate new environment
conda activate tetrapredx

# run predictions
python predict.py -i test_zetavirus.fa -o test_out

TetraPredX also supports training new models and using them for predictions. The following example outlines the major steps required to train and save new models.

Note: This process may require substantial computing power and could take a long time depending on the input data size.

Step 1

Extract features and their frequencies using the FeatureExtractor.py script.

# generate feature output file
python FeatureExtractor.py -i mysequences.fasta -o output_prefix

The output file generated in Step 1 can be used as input for the next step.
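To illustrate what "features and their frequencies" can mean for short DNA words, here is a minimal sketch that counts k-mers on both strands of a sequence and normalises them to relative frequencies. It is not the code in FeatureExtractor.py: the function names, the default k=4 (suggested only by the "Tetra" in the tool's name), the skipping of ambiguous bases, and the normalisation are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_frequencies(seq, k=4):
    """Return relative frequencies of all 4**k k-mers, counted on both strands.

    Hypothetical helper for illustration; not part of TetraPredX.
    """
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


freqs = kmer_frequencies("ATGCGCGTANACGT", k=4)
print(len(freqs))  # 256 possible tetranucleotides
```

Returning a fixed-length vector (one entry per possible k-mer) keeps every sequence in the same feature space, which is what a tabular classifier needs.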

Step 2

Train and save the models using train.py.

# train new models
python train.py -i input_csv_with_features_and_label -o output_prefix

FeatureExtractor.py and TrainModels.py contain a range of functions that can be used by importing them as standard Python modules, e.g.:

import FeatureExtractor as ft

# generate a feature table
df = ft.get_feature_table(....)

Further details on functions:

Help on module FeatureExtractor:

NAME
    FeatureExtractor - Created on Thu 30 Apr 11:48:35 BST 2020

DESCRIPTION
    @author: sejmodha

FUNCTIONS
    batch_iterator(iterator, batch_size)
        Return lists of length batch_size.

        This can be used on any iterator, for example to batch up
        SeqRecord objects from Bio.SeqIO.parse(...), or to batch
        Alignment objects from Bio.AlignIO.parse(...), or simply
        lines from a file handle.

        This is a generator function, and it returns lists of the
        entries from the supplied iterator.  Each list will have
        batch_size entries, although the final list may be shorter.

        Taken from: https://biopython.org/wiki/Split_large_file

    extract_feat(infasta, tax_label, kmer, cpu, chunk)
        Extract k-mer features from a given FASTA file.

        Returns a dataframe with indexes, features and
        sequences labels (when known).

    generate_list_for_record(record, k)
        Generate a list of seq and revcomp seq kmers.

    generate_primer_ngrams(k, n)
        Generate n-grams of words.

    generate_primers(length)
        Generate primers.

    get_feature_table(infasta, out, tax_label, kmer, cpu, chunk)
        Convert feature table to a .csv file.

    get_kmers(dna, k)
        Extract k-mers of defined size k. Returns a list of k-mers.

    is_fasta(filename)
        Check the validity of FASTA file.

    main()
        Run the module as a script.

    set_vars()
        Set variables for the module.
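The batching behaviour described for batch_iterator can be sketched with the standard library alone. This is a simplified variant written for illustration, not the module's own code or the exact Biopython wiki recipe; the record names are stand-ins for SeqRecord objects.

```python
from itertools import islice


def batch_iterator(iterable, batch_size):
    """Yield successive lists of up to batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in records; in practice these could be SeqRecord objects
# from Bio.SeqIO.parse(...).
records = [f"seq{i}" for i in range(7)]
batches = list(batch_iterator(records, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Processing a large FASTA file in batches like this keeps memory use bounded and gives natural units of work to hand to parallel workers, which is presumably the role of the cpu and chunk arguments listed above.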

Help on module TrainModels:

NAME
    TrainModels - Created on Wed 24 Mar 15:37:43 GMT 2021.

DESCRIPTION
    @author: sejmodha

FUNCTIONS
    get_best_model(X, y, cpu)
        Run GridSearchCV to identify the best model parameters.

    get_train_test(input_df, label_col, test_size, k, n_features)
        Generate train/test set for each class.

    main()
        Run the module as a script.

    set_vars()
        Set var_list required for the module.

    train_models_rfc(data_dict, out, path, cpu, cv)
        Run the random forest classifier and save the models.
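As a rough illustration of the workflow these functions describe (split the data, search for good parameters with GridSearchCV, fit a random forest, save the model with joblib), here is a minimal sketch using scikit-learn directly. It does not call TrainModels. The synthetic data, the parameter grid, the file name and the split settings are all assumptions, and the grid actually used by TetraPredX may differ.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a feature table (rows: sequences, columns: features).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical parameter grid, chosen small so the example runs quickly.
param_grid = {"n_estimators": [10, 25], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    n_jobs=1,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print(search.best_params_)
print(best_model.score(X_test, y_test))

# Persist the fitted model so it can be reloaded for prediction later.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "rfc_model.joblib")  # hypothetical file name
    joblib.dump(best_model, model_path)
    reloaded = joblib.load(model_path)
```

Holding out a stratified test set before the grid search means the reported score is not inflated by the parameter selection itself.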

This version

1.1

Download files

Source Distribution

TetraPredX-1.1.tar.gz (3.5 kB view details)

Uploaded Source

File details

Details for the file TetraPredX-1.1.tar.gz.

File metadata

  • Download URL: TetraPredX-1.1.tar.gz
  • Size: 3.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for TetraPredX-1.1.tar.gz
Algorithm    Hash digest
SHA256       cdc71e3baf07875fba74f2b570d85d1a83cdb8b2ca75347734283d8fc0a6d60c
MD5          fc50be75da2685d7e21aa9c3d67ab7a3
BLAKE2b-256  57247fc6a1600375ac940aad22633c19291d672790edf2bb05d2d8c7b2a3e66b
