Prediction tool to identify the class of unknown sequences assembled from metagenomes

Intended Audience
- Science/Research
Programming Language
- Python :: 3.6
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

TetraPredX

Microbial sequence predictor using short DNA features.

TetraPredX can be used to predict the origin of unknown sequences assembled from metagenomic or metatranscriptomic datasets. It can also be used in combination with the UnXplore framework.

Dependencies:

This tool requires the following to be installed:

Python 3.6 or higher
BioPython
joblib
scikit-learn (imported as sklearn)
seaborn
pathos

Note: Installing the package from PyPI with pip handles these dependencies automatically.

Use the tetrapredx.yml file to create a new environment with all dependencies.

Installation and Usage:

# clone the repository
git clone https://github.com/sejmodha/TetraPredX.git

cd TetraPredX

# create conda environment
conda env create -f tetrapredx.yml

# activate new environment
conda activate tetrapredx

# run predictions
python predict.py -i test_zetavirus.fa -o test_out

TetraPredX also supports training new models and using them for predictions. The pseudo-example below shows the major steps required to train and save new models.

Note: This process may require substantial computing power and could take a long time depending on the input data size.

Step 1

Extract features and their frequencies using the FeatureExtractor.py script.

# generate the feature output file
python FeatureExtractor.py -i mysequences.fasta -o output_prefix

The output file generated in Step 1 can be used as input for the next step.
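To illustrate the kind of features Step 1 refers to, here is a minimal sketch of tetranucleotide (k = 4) frequency counting over a sequence and its reverse complement. It is an independent illustration, not the code in FeatureExtractor.py; the function names `reverse_complement` and `kmer_frequencies`, and the choice to skip k-mers containing ambiguous bases, are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of every possible k-mer.

    K-mers are counted on both strands; k-mers containing characters
    other than A, C, G or T are skipped (an assumption of this sketch).
    """
    seq = seq.upper()
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    # Report all 4**k possible k-mers so every sequence yields the same columns.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


freqs = kmer_frequencies("ATGCGATACGCTTGA")
```

Emitting all 256 tetranucleotides, including those with zero counts, keeps the feature vector the same length for every sequence, which is what a tabular classifier expects.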

Step 2

Train and save the models using train.py.

# train new models
python train.py -i input_csv_with_features_and_label -o output_prefix
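The module help further down mentions GridSearchCV and a random forest classifier, so the training step can be pictured roughly as follows. This is a hedged sketch on synthetic data using scikit-learn and joblib directly; the toy feature matrix, the labels "virus" and "bacteria", the parameter grid and the file name `toy_model.joblib` are all assumptions for illustration and do not reflect the tool's actual settings.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy stand-in for a feature table: 60 sequences x 8 features, two classes.
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = np.array(["virus"] * 30 + ["bacteria"] * 30)
X[:30] += 2.0  # shift one class so the toy data are clearly separable

# Hold out a stratified test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Search a small, illustrative hyperparameter grid with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [None, 5]},
    cv=3,
    n_jobs=1,
)
search.fit(X_train, y_train)
model = search.best_estimator_
accuracy = model.score(X_test, y_test)

# Save the fitted model and load it back, as one would before predicting.
model_path = os.path.join(tempfile.mkdtemp(), "toy_model.joblib")
joblib.dump(model, model_path)
reloaded = joblib.load(model_path)
```

On real feature tables the grid search is the expensive part, which is consistent with the note above about computing power and run time.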

FeatureExtractor.py and TrainModels.py contain a range of functions that can be used by importing them as standard Python modules, e.g.:

import FeatureExtractor as ft

# generate a feature table
df = ft.get_feature_table(....)

Further details on functions:

Help on module FeatureExtractor:

NAME
    FeatureExtractor - Created on Thu 30 Apr 11:48:35 BST 2020

DESCRIPTION
    @author: sejmodha

FUNCTIONS
    batch_iterator(iterator, batch_size)
        Return lists of length batch_size.

        This can be used on any iterator, for example to batch up
        SeqRecord objects from Bio.SeqIO.parse(...), or to batch
        Alignment objects from Bio.AlignIO.parse(...), or simply
        lines from a file handle.

        This is a generator function, and it returns lists of the
        entries from the supplied iterator.  Each list will have
        batch_size entries, although the final list may be shorter.

        Taken from: https://biopython.org/wiki/Split_large_file

    extract_feat(infasta, tax_label, kmer, cpu, chunk)
        Extract k-mer features from a given FASTA file.

        Returns a dataframe with indexes, features and
        sequences labels (when known).

    generate_list_for_record(record, k)
        Generate a list of seq and revcomp seq kmers.

    generate_primer_ngrams(k, n)
        Generate n-grams of words.

    generate_primers(length)
        Generate primers.

    get_feature_table(infasta, out, tax_label, kmer, cpu, chunk)
        Convert feature table to a .csv file.

    get_kmers(dna, k)
        Extract k-mers of defined size k. Returns a list of k-mers.

    is_fasta(filename)
        Check the validity of FASTA file.

    main()
        Run the module as a script.

    set_vars()
        Set variables for the module.
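
The batching behaviour described for batch_iterator above can be sketched in a few lines. This is a simplified re-implementation based only on the docstring, written with `itertools.islice`; it is not the module's own code.

```python
from itertools import islice


def batch_iterator(iterator, batch_size):
    """Yield lists of up to batch_size entries from any iterable."""
    iterator = iter(iterator)
    while True:
        # Pull the next batch_size items; the last batch may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batch_iterator(range(7), 3))
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

Processing records in batches like this keeps memory use bounded when a FASTA file is too large to load at once, and gives natural units of work to hand to multiple CPUs.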

Help on module TrainModels:

NAME
    TrainModels - Created on Wed 24 Mar 15:37:43 GMT 2021.

DESCRIPTION
    @author: sejmodha

FUNCTIONS
    get_best_model(X, y, cpu)
        Run GridSearchCV to identify the best model parameters.

    get_train_test(input_df, label_col, test_size, k, n_features)
        Generate train/test set for each class.

    main()
        Run the module as a script.

    set_vars()
        Set var_list required for the module.

    train_models_rfc(data_dict, out, path, cpu, cv)
        Run the random forest classifier and save the models.

Release history

This version

1.1

Aug 17, 2021

Download files

Source Distribution

TetraPredX-1.1.tar.gz (3.5 kB view details)

Uploaded Aug 17, 2021 Source

File details

Details for the file TetraPredX-1.1.tar.gz.

File metadata

Download URL: TetraPredX-1.1.tar.gz
Upload date: Aug 17, 2021
Size: 3.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.22.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.7.6

File hashes

Hashes for TetraPredX-1.1.tar.gz
Algorithm	Hash digest
SHA256	`cdc71e3baf07875fba74f2b570d85d1a83cdb8b2ca75347734283d8fc0a6d60c`
MD5	`fc50be75da2685d7e21aa9c3d67ab7a3`
BLAKE2b-256	`57247fc6a1600375ac940aad22633c19291d672790edf2bb05d2d8c7b2a3e66b`
