A light-weight pipeline for searching and assembling homologous genes utilizing the NCBI SRA database.

# HomologFishing

HomologFishing, or HF, is a lightweight Python pipeline for finding and assembling target homologous genes in species with poor genome assemblies or no genome assembly at all. It runs adequately on an ordinary personal computer, without high-end hardware or large amounts of locally stored sequencing data.

Implementation

PyPI

The HF pipeline is written in Python and can be installed with a single command:

pip install HF

In addition, two third-party software packages, the NCBI SRA Toolkit and Trinity, are required at run time. N.B. Trinity does not provide pre-compiled binaries for Windows.
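Before the first run it can help to confirm that the third-party tools are reachable on your PATH. The snippet below is a minimal sketch, not part of HF. The executable names in REQUIRED_TOOLS are assumptions (Trinity's launcher, plus two programs shipped with the SRA Toolkit) and may not match what HF actually calls, so adjust the tuple to your installation.

```python
import shutil

# Assumed executable names, not taken from HF itself; edit as needed.
REQUIRED_TOOLS = ("Trinity", "blastn_vdb", "fastq-dump")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

If anything is reported missing, the Docker image described in the next section avoids the manual setup.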

Docker (recommended)

We also built a Docker image with all required software packages installed and configured, which can be pulled with a single command:

docker pull yangwu91/hf:latest

This is recommended as it is compatible with most common operating systems including Linux, Windows and macOS.

Usage

After installation, detailed usage will be printed by the command:

HF --help

Or:

docker run -it --dns 8.8.8.8 -v /dir/to/your/folder:/opt/data yangwu91/hf:latest --help

In this command, the option -v /dir/to/your/folder:/opt/data mounts your folder /dir/to/your/folder into the Docker container at /opt/data, so files in that folder are visible to HF under /opt/data.
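Because HF runs inside the container, paths passed to it must be container paths, not host paths. The following sketch shows the translation for the mount used above. The function name to_container_path is hypothetical, and the code assumes POSIX-style host paths (Linux or macOS).

```python
from pathlib import PurePosixPath


def to_container_path(host_path,
                      host_dir="/dir/to/your/folder",
                      mount_point="/opt/data"):
    """Translate a host path under `host_dir` to its path inside the container.

    Raises ValueError if `host_path` is not located under `host_dir`,
    because such a file would not be visible inside the container.
    """
    relative = PurePosixPath(host_path).relative_to(host_dir)
    return str(PurePosixPath(mount_point) / relative)


print(to_container_path("/dir/to/your/folder/query.fasta"))
# prints: /opt/data/query.fasta
```

The same rule applies to the output directory: a directory given to -o as /opt/data/results appears on the host as /dir/to/your/folder/results.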

Detailed usage:

Optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Print the version.
  -v, --verbose         Print detailed log.
  -r [INT], --retry [INT]
                        Number of times to retry, default: 5. Enabling it
                        without any numbers will force it to keep retrying.
  -o DIR, --outdir DIR  Specify an output directory.

NCBI options:
  -s SRA, --sra SRA     Choose SRA accessions (comma-separated without blank
                        space), usually whose prefix is "SRX" (e.g.
                        SRX4977164).
  -q SEQUENCE, --query SEQUENCE
                        Submit either a FASTA file or nucleotide sequences.
  -p BLAST, --program BLAST
                        Specify a blast program: blastn, tblastn, or tblastx,
                        default: blastn.
  -m INT, --max_num_seq INT
                        Maximum number of aligned sequences to retrieve (the
                        actual number of alignments may be greater than this),
                        default: 1000.
  -e FLOAT, --evalue FLOAT
                        Expected number of chance matches in a random model,
                        default: 1e-3.
  -c FRAGMENT,OVERLAP, --cut FRAGMENT,OVERLAP
                        Cut sequences and query them respectively to prevent
                        weaker matches from being ignored.

Trinity options:
  -t INT, --CPU INT     Number of CPU threads to use, default: 36.
  --max_memory RAM      Suggest max Gb of memory to use by Trinity, default: "
                        --max_memory 5G"
  --min_contig_length INT
                        Minimum assembled contig length to report, default:
                        150.
  -k INT, --KMER_SIZE INT
                        K-mer size for Trinity, maximum: 32, default: 25.
  --full_cleanup        Only retain the assembled contig file in FASTA format.
  --trim [Trimmomatic parameters]
                        Run Trimmomatic to qualify and trim reads, default:
                        disabled. Using this option without any parameters
                        will trigger preset settings in Trinity for
                        Trimmomatic. See Trinity for more help.
  --stage {no_trinity,jellyfish,inchworm,chrysalis,butterfly}
                        Stop Trinity after the stage you chose, default:
                        butterfly (the final stage)
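The -s and -c options both take comma-separated values, which are easy to mistype. The sketch below is a hypothetical pre-flight check written for this document, not HF's own argument handling. The accession pattern (SRX, ERX or DRX followed by digits) is an assumption based on the "usually SRX" hint in the help text, so unmatched names are only reported, not rejected.

```python
import re

# Assumption: experiment accessions look like SRX/ERX/DRX plus digits.
EXPERIMENT = re.compile(r"[SED]RX\d+")


def parse_accessions(value):
    """Split a -s value such as 'SRX4977164,SRX885420' into a list."""
    if any(ch.isspace() for ch in value):
        raise ValueError("accessions must be comma-separated without blank space")
    accessions = value.split(",")
    if "" in accessions:
        raise ValueError("empty accession in list")
    return accessions


def unusual_accessions(accessions):
    """Return accessions that do not look like experiment accessions."""
    return [a for a in accessions if not EXPERIMENT.fullmatch(a)]


def parse_cut(value):
    """Parse a -c value such as '80,50' into (fragment, overlap)."""
    fragment, overlap = (int(part) for part in value.split(","))
    if not 0 <= overlap < fragment:
        raise ValueError("overlap must be non-negative and smaller than fragment")
    return fragment, overlap


print(parse_accessions("SRX4977164,SRX885420"))
print(parse_cut("80,50"))
```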

An example: finding the "nonexistent" S6K gene in a mosquito species

We applied the HF pipeline to search for the S6K gene (AAEL018120 from Aedes aegypti) in the Aedes albopictus SRA experiment SRX885420 (https://www.ncbi.nlm.nih.gov/sra/SRX885420) using blastn. The detailed workflow is as follows:

Picking a "lure"

Download the nucleotide or protein sequence of Aedes aegypti S6K from VectorBase, Ensembl, NCBI or another online database, and save it, for example, as /dir/to/your/folder/AAEL018120-RE.S6K.fasta (visible inside the Docker container as /opt/data/AAEL018120-RE.S6K.fasta).

Selecting a "fishing spot"

Select a suitable SRA experiment for Aedes albopictus (e.g. SRX885420). Some genes are expressed only in specific tissues or at specific times, so make sure the gene you are interested in is actually expressed in the SRA experiment(s) you select.

"Casting" and "Fishing"

Run the HF pipeline. Here, we cut the query (AAEL018120-RE.S6K.fasta) into 80-base fragments that overlap by 50 bases. The command line is as follows:

HF -o /dir/to/your/folder/S6K_q-aae_s-SRX885420_c-80.50_p-blastn -s SRX885420 -q /dir/to/your/folder/AAEL018120-RE.S6K.fasta --cut 80,50 -p blastn

Or:

docker run -it --dns 8.8.8.8 -v /dir/to/your/folder:/opt/data yangwu91/hf:latest -o /opt/data/S6K_q-aae_s-SRX885420_c-80.50_p-blastn -s SRX885420 -q /opt/data/AAEL018120-RE.S6K.fasta --cut 80,50 -p blastn
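To make the effect of --cut 80,50 concrete, the sketch below cuts a sequence into 80-base windows in which each window shares 50 bases with the previous one, so the window start advances by 30 bases. It illustrates the idea only: the function cut_query is hypothetical, and how HF itself treats the trailing, possibly shorter fragment is an assumption here.

```python
def cut_query(sequence, fragment=80, overlap=50):
    """Cut `sequence` into windows of `fragment` bases sharing `overlap` bases.

    The last window may be shorter than `fragment` (an assumption of
    this sketch, not necessarily what HF does).
    """
    if not 0 <= overlap < fragment:
        raise ValueError("overlap must be non-negative and smaller than fragment")
    if len(sequence) <= fragment:
        return [sequence]
    step = fragment - overlap  # 30 bases for --cut 80,50
    return [sequence[start:start + fragment]
            for start in range(0, len(sequence) - overlap, step)]


# A 200-base toy sequence yields windows starting at 0, 30, 60, 90 and 120.
toy = "ACGT" * 50
pieces = cut_query(toy)
print(len(pieces))                        # prints: 5
print(pieces[0][30:] == pieces[1][:50])   # prints: True
```

Each short fragment is then queried separately, which is how, according to the option's help text, weaker matches are kept from being ignored.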

"Harvesting"

The predicted Aedes albopictus S6K sequence, in FASTA format, is written to the output directory given by -o. With the Docker command above, that is /dir/to/your/folder/S6K_q-aae_s-SRX885420_c-80.50_p-blastn/ on the host.
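To inspect the result, a small FASTA reader is enough to list the assembled contigs by length. This is a minimal sketch using only the standard library. The name of the FASTA file inside the output folder is not stated here, so pass whichever file you find there; the helper names are hypothetical.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; store the previous one, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def longest_first(records):
    """Sort (header, sequence) tuples by sequence length, longest first."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)
```

For example, `for name, seq in longest_first(read_fasta(path)): print(name, len(seq))` prints one line per contig, after which the longest candidates can be compared against the original query.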
