
A small python package to trace orthology neighborhood across feature files



Vicinator

What is Vicinator for?

Vicinator visualizes the microsynteny of grouped proteins (e.g. orthologs) across a large collection of genomes. As input, it requires a mapping of the genomes' proteins to the respective protein groups and a directory containing the genomes' feature files, i.e. files of the format *.gff or *_feature_table.txt.


What is Vicinator not for?

As stated above, Vicinator relies on a pre-computed grouping of proteins across genomes. It cannot compute these groups for you.

Installation

Vicinator requires Python 3.6 or later.

It is recommended to install Vicinator inside a virtual environment, e.g. with venv:

python3 -m venv myenv

This creates a new environment called myenv. Activate it, e.g. on Linux or macOS:

source myenv/bin/activate

While the environment is active, you can install Vicinator via pip. The following command installs the latest version and all unmet requirements automatically.

pip install --upgrade vicinator

Requirements:

  • ansi2html>=1.5.2
  • colorama>=0.4.4
  • ete3>=3.1.2
  • pandas>=1.1.3
  • importlib-metadata>=3.1.1
  • setuptools-scm>=5.0.1

Options

python3 vicinator/vicinator.py --help

usage: Vicinator [-h] --tabular-ortholog-groups <orthology_table>
                 --feat-tables-dir <dir_path> --reference <file_path>
                 --centerprotein-accession <str> --extension-size <int>
                 [--tree <newick_tree_file_path>] [--outdir <dir_path>]
                 [--prefix <str>] [--outputlabel-map <file_path>]
                 [--nprocs <int>] [--force] [--version]

Track Microsynteny of target proteins and its orthologs across genomes.

required arguments:
  --tabular-ortholog-groups <orthology_table>
                        path to mapping file with format
                        ortholog_group_id<tab>genome_id<tab>protein_seq_id
  --feat-tables-dir <dir_path>
                        path to directory of *.feature_tables.txt or *.gff3
                        files that shall be screen

required arguments (neighborhood):
  --reference <file_path>
                        path to a ncbi style feature table file that acts as a
                        reference
  --centerprotein-accession <str>
                        unique identifier of the central gene of the window
  --extension-size <int>
                        defines the #features that are co-checked to the left
                        and right of the centerprotein

optional arguments (output):
  --tree <newick_tree_file_path>
                        path to newick tree that includes all taxa to be
                        screened
  --outdir <dir_path>   path to desired output directory
  --prefix <str>        if option is set, shows intergenic distances of genes
                        surrounding the center gene
  --outputlabel-map <file_path>
                        Attempts to replace genome accessions in the outputs
                        with a replacement string. Requires a two-column map
                        file formatted like so: 'genome file accession' <tab>
                        'replacement string'

optional arguments (run):
  --nprocs <int>        Number of CPUs for parallel processing of genomes.
                        Default: Number of CPUs-1
  --force               if option is set, existing ortholog databases in the
                        output dir are ignored and will be overwritten

Input: Required Arguments


--tabular-ortholog-groups <orthology_table>

Vicinator requires a tab-separated three-column mapping of orthologs that is formatted like so:

group_id <tab> genome_id <tab> protein_id (see the example mapping file)
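To check a mapping file before running Vicinator, the three-column layout described above can be parsed with the standard library. This is a minimal sketch, not Vicinator's own parser: the function name read_ortholog_groups and the example rows are hypothetical, and it assumes a file without a header line.

```python
import csv
import io
from collections import defaultdict


def read_ortholog_groups(handle):
    """Parse group_id<tab>genome_id<tab>protein_id rows into a nested dict.

    Hypothetical helper for illustration; assumes no header line.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip completely empty lines
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(row)}")
        group_id, genome_id, protein_id = (field.strip() for field in row)
        groups[group_id][genome_id].append(protein_id)
    return groups


# Made-up example rows in the documented format.
example = io.StringIO(
    "OG_1\tgenomeA\tprotein_A001\n"
    "OG_2\tgenomeB\tprotein_X011\n"
)
mapping = read_ortholog_groups(example)
print(mapping["OG_2"]["genomeB"])
```

A row with fewer or more than three tab-separated fields raises a ValueError that names the offending line, which makes formatting problems easy to locate.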


--feat-tables-dir <dir_path>

Vicinator expects the path to a directory containing the *.gff or *_feature_table.txt files of all genomes in which you want to trace the microsynteny.

A recommended source for these files is NCBI RefSeq. In order for the mapping to work, the filenames should correspond to the genome_ids specified in the mapping file:

For example, line 7 of the mapping file: OG_2    genomeB    protein_X011
triggers a search for a feature file named genomeB.gff, genomeB_genomic.gff or genomeB_feature_table.txt in the directory specified with --feat-tables-dir. Vicinator then tries to locate protein_X011 in that feature file.
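The filename lookup described above can be sketched as follows. This only illustrates the naming convention and is not Vicinator's actual implementation: the helper names and the order in which the three candidates are tried are assumptions.

```python
from pathlib import Path

# The three filename patterns named in the text; the search order is an assumption.
SUFFIXES = (".gff", "_genomic.gff", "_feature_table.txt")


def candidate_feature_files(genome_id, feat_dir):
    """Return the paths a genome_id could map to inside feat_dir."""
    return [Path(feat_dir) / f"{genome_id}{suffix}" for suffix in SUFFIXES]


def find_feature_file(genome_id, feat_dir):
    """Return the first existing candidate file, or None if none exists."""
    for candidate in candidate_feature_files(genome_id, feat_dir):
        if candidate.is_file():
            return candidate
    return None


print([path.name for path in candidate_feature_files("genomeB", "./gff_dir")])
```

Running a check like this over every genome_id in the mapping file is a quick way to spot genomes whose feature files are missing or misnamed.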


--reference <file_path>

The path to the feature file of the reference genome. The center-protein accession must be present in this file.


--centerprotein-accession & --extension-size <int>

Together, these two options define the neighborhood window in the reference genome: the center protein plus --extension-size features to its left and right. This window is then traced across the other genomes.
Vicinator Window in Reference Genome
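The window definition can be illustrated with a plain list of features in genomic order. This is a sketch under stated assumptions, not Vicinator's code: the function name and feature labels are hypothetical, and clipping the window at the start or end of the feature list is an assumption about edge handling.

```python
def neighborhood_window(features, center, extension_size):
    """Return the center feature plus extension_size neighbors on each side.

    features: feature identifiers in genomic order (hypothetical input).
    The window is clipped at the ends of the list (an assumption).
    """
    idx = features.index(center)  # raises ValueError if center is absent
    start = max(0, idx - extension_size)
    return features[start : idx + extension_size + 1]


# Made-up feature order on a reference contig.
features = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9"]
window = neighborhood_window(features, "P5", 3)
print(window)  # ['P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8']
```

With --extension-size 3 the window therefore holds up to seven features: the center protein and three neighbors on either side.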


Example Basic Usage

vicinator --tabular-ortholog-groups orthogenome_map.tsv --feat-tables-dir ./gff_dir --outdir ./results --reference gff_dir/MUSMU@10090@1.gff --centerprotein-accession XP_006539605.1 --extension-size 3

Example Advanced Usage

When Vicinator receives a phylogenetic tree (with genome_ids as leaf labels), it traces the microsynteny in order of increasing phylogenetic distance to the specified reference genome.

vicinator --tabular-ortholog-groups orthogenome_map.tsv --feat-tables-dir ./gff_dir --outdir ./results --reference gff_dir/MUSMU@10090@1.gff --centerprotein-accession XP_006539605.1 --extension-size 3 --tree phylogeny.nwk

