Python library for plotting synteny diagrams for phage and bacterial sequences.

synphage

Pipeline to create phage genome synteny graphics from genbank files

This library provides an intuitive tool to create synteny graphics highlighting the conserved genes between multiple genome sequences.
This tool is primarily designed to work with phage genomes or other short sequences of interest, although it works with bacterial genomes as well.

Although numerous synteny tools exist, this one was developed because none of them can visualise gene conservation across multiple sequences at a glance: cross-links are typically drawn only between two consecutive sequences for better readability.

As a result, synphage was born.

In addition to showing conserved genes across multiple sequences, this library is original in that, when working on the same set of genomes, the initial blast and computation need to be run only once. Multiple graphics can then be generated from these data, comparing either all the genomes or only a subset of the analysed dataset. Moreover, the generated data is also available to the user as a table, where individual genes or groups of genes can easily be checked by name for conservation or uniqueness in the other genomes.

Install

synphage is available via pip or as a Docker image.

Via pip

pip install synphage

Via docker

docker pull vestalisvirginis/synphage:latest

Additional dependencies

synphage relies on one non-Python dependency that needs to be installed manually when synphage is installed with pip:

Blast+ >= 2.12.0

apt update
apt install -y ncbi-blast+
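
Since synphage expects Blast+ 2.12.0 or newer, it can be useful to confirm the installed version before starting a run. The following is a minimal sketch and not part of synphage: the helper names `parse_blast_version` and `blast_is_recent_enough` are hypothetical, and the code assumes that `blastn -version` prints a line such as `blastn: 2.12.0+`.

```python
import re
import shutil
import subprocess

MIN_BLAST_VERSION = (2, 12, 0)  # minimum version stated in this README


def parse_blast_version(version_output: str) -> tuple:
    """Extract (major, minor, patch) from the text printed by `blastn -version`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return tuple(int(part) for part in match.groups())


def blast_is_recent_enough(executable: str = "blastn") -> bool:
    """Return True if `executable` is on PATH and meets MIN_BLAST_VERSION."""
    if shutil.which(executable) is None:
        return False  # Blast+ is not installed or not on PATH
    result = subprocess.run(
        [executable, "-version"], capture_output=True, text=True, check=True
    )
    return parse_blast_version(result.stdout) >= MIN_BLAST_VERSION
```

Comparing versions as integer tuples avoids the pitfall of string comparison, where "2.9.0" would sort after "2.12.0".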

Usage

Setup

synphage requires:

a path to a data folder, which will contain the genbank folder and where the generated data will be stored;
a genbank folder populated with genbank files (the .gb and .gbk extensions are accepted);
a sequences.csv file containing the file name and orientation of the sequences to plot.

Warning: Genbank file names should not contain spaces.
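
The requirements on the genbank folder can be checked before launching any job. This is a minimal sketch under stated assumptions, not a synphage feature: `check_data_dir` is a hypothetical helper, and it only tests the rules listed above (a `genbank` subfolder, `.gb` or `.gbk` files, no spaces in file names).

```python
from pathlib import Path

GENBANK_EXTENSIONS = {".gb", ".gbk"}  # extensions accepted according to this README


def check_data_dir(data_dir: str) -> list:
    """Return a list of human-readable problems found in the data folder."""
    genbank_dir = Path(data_dir) / "genbank"
    if not genbank_dir.is_dir():
        return [f"missing folder: {genbank_dir}"]

    problems = []
    # Sort for a stable, reproducible report order.
    genbank_files = sorted(
        p for p in genbank_dir.iterdir() if p.suffix in GENBANK_EXTENSIONS
    )
    if not genbank_files:
        problems.append("no .gb or .gbk files in the genbank folder")
    for path in genbank_files:
        if " " in path.name:
            problems.append(f"file name contains a space: {path.name}")
    return problems
```

An empty list means that the folder passes these basic checks; it says nothing about whether the files themselves are valid genbank records.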

Path setup

export DATA_DIR=<path_to_data_folder>

Note: For docker users, this path is defaulted to /data.

CSV file

genome_1.gb,0
genome_2.gb,1
genome_3.gb,0
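
A file in this format can also be generated from Python. The sketch below is illustrative only: `write_sequences_csv` is a hypothetical helper, and restricting the orientation to 0 or 1 is an assumption based on the example rows above.

```python
import csv
from pathlib import Path


def write_sequences_csv(csv_path: str, sequences: list) -> None:
    """Write one `file_name,orientation` row per sequence, without a header."""
    for file_name, orientation in sequences:
        if " " in file_name:
            raise ValueError(f"file name contains a space: {file_name!r}")
        if orientation not in (0, 1):  # assumption: only 0 and 1 are valid
            raise ValueError(f"orientation must be 0 or 1, got {orientation!r}")
    # newline="" lets the csv module control line endings itself.
    with Path(csv_path).open("w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(sequences)
```

For example, `write_sequences_csv("sequences.csv", [("genome_1.gb", 0), ("genome_2.gb", 1)])` produces the first two rows shown above.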

Running Synphage

synphage uses Dagster. To run synphage jobs, you need to start Dagster first.

Starting Dagster

Set the environment variable DAGSTER_HOME to keep a record of your previous runs. For more information, see the Dagster documentation.

export DAGSTER_HOME=<dagster_home_directory>

dagster dev -h 0.0.0.0 -p 3000 -m synphage

Running the jobs

The software is currently structured in four jobs: the three listed here and ncbi_download_job, described further down.

blasting_job : runs blastn of each sequence against each sequence (results -> gene_identity folder)
transform : creates three tables from the blastn results and genbank files (results -> tables folder)
synteny_job : creates the synteny graph (results -> synteny folder)

Note: Different synteny plots can be generated from the same set of genomes. In this case, the first two jobs only need to be run once, and the third job (synteny_job) can be triggered separately for each graph.

[2024-01-11] ✨ New feature to simplify the addition of new sequences to the genbank folder:

ncbi_download_job : download genomes to be analysed from the NCBI database

Output

synphage's output consists of three main parquet files and the synteny graph. However, all the data generated by the synphage pipeline are made available in your working directory.

Generated data architecture

.
├── <path_to_data_folder>/
│   ├── download/
│   ├── genbank/
│   ├── fs/
│   ├── gene_identity/
│   │   ├── fasta/
│   │   ├── blastn_database/
│   │   └── blastn/
│   ├── tables/
│   │   ├── blastn.parquet
│   │   ├── locus_and_gene.parquet
│   │   └── uniqueness.parquet
│   └── synteny/
│      ├── colour_table.parquet
│      └── synteny_graph.svg
└── ...
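
After a full run, the presence of the main output files can be verified against this layout. The following is a rough sketch, not part of synphage: `missing_outputs` is a hypothetical helper, the relative paths are copied from the tree above, and the plot file name may differ if the `title` option has been changed.

```python
from pathlib import Path

# Relative paths taken from the tree above.
EXPECTED_OUTPUTS = [
    "tables/blastn.parquet",
    "tables/locus_and_gene.parquet",
    "tables/uniqueness.parquet",
    "synteny/colour_table.parquet",
    "synteny/synteny_graph.svg",  # assumption: default plot name, as in the tree
]


def missing_outputs(data_dir: str) -> list:
    """Return the expected output files that are not present under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

If only the first two jobs have been run, the two entries under `synteny/` are expected to be reported as missing.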

Tables

The tables folder contains the three main parquet files generated by the transform job of synphage.

blastn.parquet contains the best match for each locus tag/gene against each genome. The percentage of identity between two genes/loci is then used to calculate the plot cross-links between the sequences.
locus_and_gene.parquet contains the full list of locus tags and the corresponding gene names, when available, for all the genomes in the genbank folder. If the genbank file only contains CDS, the locus tag and gene values are replaced by the protein identifier protein_id.
uniqueness.parquet combines the two previous tables into one, allowing the user to quickly see how many matches their gene(s) of interest retrieved. These data are then used to compute the colour code used for the synteny plot. The result of the computation is recorded in colour_table.parquet. This file is overwritten at each synteny_job run.
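
These tables can be opened with any parquet reader. The sketch below uses pandas as one possible choice (an assumption, it also needs pyarrow or fastparquet installed); `table_path` and `load_table` are hypothetical helpers, and since the column names are not documented here, the usage example only lists them.

```python
from pathlib import Path

TABLE_NAMES = ("blastn", "locus_and_gene", "uniqueness")  # files listed above


def table_path(data_dir: str, name: str) -> Path:
    """Build the path of one of the three main tables inside the data folder."""
    if name not in TABLE_NAMES:
        raise ValueError(f"unknown table {name!r}, expected one of {TABLE_NAMES}")
    return Path(data_dir) / "tables" / f"{name}.parquet"


def load_table(data_dir: str, name: str):
    """Load one table as a pandas DataFrame."""
    import pandas as pd  # imported lazily so table_path works without pandas

    return pd.read_parquet(table_path(data_dir, name))


# Example usage (commented out because it needs real pipeline output):
# df = load_table("/data", "uniqueness")
# print(df.columns.tolist())  # inspect the available columns first
# print(df.head())
```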

Synteny plot

The synteny plot is generated as an .svg file and a .png file, and contains the sequences listed in the sequences.csv file. The genes are colour-coded according to their abundance (percentage) among the plotted sequences. The cross-links between consecutive sequences indicate the percentage of similarity between those two sequences.

Plotting config options

Field Name	Description	Default Value
`title`	Generated plot file title	synteny_plot
`colours`	Gene identity colour bar	["#fde725", "#90d743", "#35b779", "#21918c", "#31688e", "#443983", "#440154"]
`gradient`	Nucleotide identity colour bar	#B22222
`graph_shape`	Linear or circular representation	linear
`graph_pagesize`	Output document format	A4
`graph_fragments`	Number of fragments	1
`graph_start`	Sequence start	1
`graph_end`	Sequence end	length of the longest genome
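
The defaults in this table can be modelled as a plain dictionary to prepare and sanity-check a set of overrides. This is only a sketch: `build_plot_config` is a hypothetical helper, this README does not show how the options are passed to synteny_job at run time, and representing the `graph_end` default as `None` is an assumption.

```python
from copy import deepcopy

# Default values copied from the table above.
DEFAULT_PLOT_CONFIG = {
    "title": "synteny_plot",
    "colours": ["#fde725", "#90d743", "#35b779", "#21918c",
                "#31688e", "#443983", "#440154"],
    "gradient": "#B22222",
    "graph_shape": "linear",
    "graph_pagesize": "A4",
    "graph_fragments": 1,
    "graph_start": 1,
    "graph_end": None,  # assumption: None stands for "length of the longest genome"
}


def build_plot_config(**overrides) -> dict:
    """Return the defaults updated with overrides, rejecting unknown fields."""
    unknown = sorted(set(overrides) - set(DEFAULT_PLOT_CONFIG))
    if unknown:
        raise KeyError(f"unknown plotting option(s): {unknown}")
    config = deepcopy(DEFAULT_PLOT_CONFIG)  # keep the defaults untouched
    config.update(overrides)
    if config["graph_shape"] not in ("linear", "circular"):
        raise ValueError("graph_shape must be 'linear' or 'circular'")
    return config
```

For instance, `build_plot_config(title="my_phages", graph_shape="circular")` keeps every other field at its default value.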

Genbank file download

The ncbi_download_job downloads sequences of interest into the genbank folder, where they can subsequently be processed by the software.

Requirement

Connecting to the NCBI databases requires the user's email and API key.

export EMAIL=user.email@email.com
export API_KEY=UserApiKey
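
Before triggering the download job, it may help to confirm that both variables are set. The following is a minimal sketch rather than synphage code: `ncbi_credentials` is a hypothetical helper that only reads the two environment variables named above.

```python
import os


def ncbi_credentials(environ=None) -> dict:
    """Return the NCBI email and API key, or raise if either one is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in ("EMAIL", "API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variable(s): " + ", ".join(missing))
    return {"email": env["EMAIL"], "api_key": env["API_KEY"]}
```

Passing a dictionary as `environ` makes the check easy to try without touching the real environment.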

Query config options

Field Name	Description	Default Value
`search_key`	Keyword(s) for NCBI query	Myoalterovirus
`database`	Database identifier	nuccore

Roadmap

~~[x] create config options for the plot at run time~~
~~[x] integrate the NCBI search~~
create possibility to add ref sequence with special colour coding
create interactive plot
Help us in a discussion?

Contributing

We accept different types of contributions, including some that don't require you to write a single line of code. For detailed instructions on how to get started with our project, see the CONTRIBUTING file.

Authors

vestalisvirginis / Virginie Grosboillot / 🇫🇷

License

Apache License 2.0. Free for commercial use, modification, distribution, patent use, private use. Just preserve the copyright and license.

Made with ❤️ in Ljubljana 🇸🇮
