TAtouScan

TAtouScan is a command-line tool designed to identify toxin-antitoxin (TA) systems in genomes and metagenomes.

Installation

Option 1: Install with pip

  1. Clone the repository:
git clone https://github.com/JeanMainguy/TAtouScan.git
cd TAtouScan
  2. Create and activate a virtual environment:
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Linux/macOS:
source venv/bin/activate
  3. Install TAtouScan:
pip install .

Option 2: Install using conda

If you prefer using conda, you can create a dedicated environment as follows:

# Create a new conda environment with Python
conda create -n tatouscan python=3.12

# Activate the environment
conda activate tatouscan

# Clone the repository
git clone https://github.com/JeanMainguy/TAtouScan.git
cd TAtouScan

# Install TAtouScan
pip install -e .

Note: TAtouScan is not yet available via bioconda. The method above uses conda for environment management and pip for installation.

Download the TAtouScan Database

TAtouScan requires a database directory containing HMM profiles and reference statistics.

Download the database and extract it with:

wget https://zenodo.org/records/20059258/files/tatouscan_db.tar.gz
tar -xzf tatouscan_db.tar.gz

The database directory must contain the following four files:

tatouscan_db/
  ta.hmm                 # HMM profiles (HMMER3 format)
  hmm_info.tsv           # profile metadata (name, type, source)
  family_statistics.tsv  # per-family reference statistics for scoring
  known_pairs.tsv        # known toxin–antitoxin family co-occurrences
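
Before a long run, it can be worth checking that the extracted directory is complete. The following is a minimal sketch, not part of TAtouScan: the helper name `missing_db_files` is hypothetical, and only the four file names come from the listing above.

```python
from pathlib import Path

# File names taken from the directory listing above.
REQUIRED_DB_FILES = (
    "ta.hmm",
    "hmm_info.tsv",
    "family_statistics.tsv",
    "known_pairs.tsv",
)


def missing_db_files(db_dir):
    """Return the required file names that are absent from db_dir.

    Hypothetical helper for illustration; TAtouScan may perform its own checks.
    """
    db_path = Path(db_dir)
    return [name for name in REQUIRED_DB_FILES if not (db_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_db_files("tatouscan_db")
    if missing:
        print("Missing database files:", ", ".join(missing))
    else:
        print("Database directory looks complete.")
```

An empty list means all four files are present; the check does not validate their contents.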

Usage

After installing TAtouScan and downloading the database, run it with three inputs:

  • a GFF file with gene annotations
  • a FAA file with the corresponding protein sequences
  • the database directory downloaded above

tatouscan --gff <genes.gff> --faa <proteins.faa> --db tatouscan_db/

By default, results are written to a directory called tatouscan_results/. Use --outdir to specify a different location:

tatouscan --gff <genes.gff> --faa <proteins.faa> --db tatouscan_db/ --outdir my_results/
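
When TAtouScan is run over many genomes, the command can be assembled from Python. This is a sketch under stated assumptions: the function `build_tatouscan_command` is hypothetical, the file names are placeholders, and only the flags documented in this README (`--gff`, `--faa`, `--db`, `--outdir`, `--detailed`) are used.

```python
import shlex
import subprocess


def build_tatouscan_command(gff, faa, db, outdir=None, detailed=False):
    """Build the argument list for one tatouscan run (hypothetical helper)."""
    cmd = ["tatouscan", "--gff", str(gff), "--faa", str(faa), "--db", str(db)]
    if outdir is not None:
        # Without --outdir, results go to tatouscan_results/ by default.
        cmd += ["--outdir", str(outdir)]
    if detailed:
        cmd.append("--detailed")
    return cmd


# Placeholder input names, for illustration only.
cmd = build_tatouscan_command(
    "genes.gff", "proteins.faa", "tatouscan_db/", outdir="my_results/"
)
print(shlex.join(cmd))

# To actually run it (requires TAtouScan to be installed and on PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.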

Two TSV files are produced inside the output directory:

| File | Description |
|---|---|
| tatouscan_results.tsv | One row per predicted toxin or antitoxin gene |
| tatouscan_results_pairs.tsv | One row per predicted TA pair (two-gene systems only) |

HMM Database Composition

The HMM database used by TAtouScan is composed of profiles collected from multiple sources, including curated databases and literature. The file hmm_info.tsv provides metadata for each profile, indicating its origin and whether it corresponds to a toxin or an antitoxin.

Breakdown of the database:

  • 682 profiles were obtained from the TASmania project:

    Akarsu H, Bordes P, Mansour M, Bigot D-J, Genevaux P, Falquet L (2019). TASmania: A bacterial Toxin-Antitoxin Systems database. PLoS Comput Biol 15(4): e1006946.
    https://doi.org/10.1371/journal.pcbi.1006946

  • 3,168 profiles were generated from sequences in the TADB 3.0 database.
    The sequences were first clustered, each cluster was aligned by multiple sequence alignment, and an HMM profile was built from each resulting alignment.

    Guan J, Chen Y, Goh YX, Wang M, Tai C, Deng Z, Song J, Ou HY (2024).
    TADB 3.0: an updated database of bacterial toxin-antitoxin loci and associated mobile genetic elements.
    Nucleic Acids Research, 52(D1): D784–D790.
    https://doi.org/10.1093/nar/gkad962

  • Additional HMM profiles were manually collected from other sources in the literature.

Output

TAtouScan writes two TSV files into the output directory.

By default, only the most informative columns are written. Add --detailed to include per-source HMM breakdowns and raw Z-score columns.

tatouscan_results.tsv — per-gene results

One row per predicted toxin or antitoxin gene.

| Column | Description |
|---|---|
| contig_name | Contig where the gene is located |
| gene_id | Gene identifier (from the input GFF) |
| start / end | Genomic coordinates |
| strand | + or - |
| length_aa | Protein length in amino acids |
| product | Predicted gene product (if available) |
| ta_system_id | ID shared by both genes of a pair (None for single-gene predictions) |
| is_single_gene | True if no paired partner was found |
| gene_type | Toxin or Antitoxin |
| hmm_name / hmm_score / hmm_evalue | Best HMM hit across all database sources |
| hmm_source | Database the best hit comes from (TADB3, TASmania, or other) |
| hmm_description | Profile description |
| pair_is_known | 1 if this toxin–antitoxin family combination is known in TADB3, 0 if not, None if family could not be identified |
| score | Unified match score in (0, 1] (see Scoring) |

Scoring columns are None for single-gene predictions.
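
The per-gene file can be read with the Python standard library. The sketch below assumes that the file has a header row using the column names listed above and that missing values are written as the literal string `None`; neither detail is confirmed here, and the sample rows are invented for illustration.

```python
import csv
import io


def load_gene_rows(handle):
    """Read per-gene rows, converting the literal string 'None' to Python None."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {key: (None if value == "None" else value) for key, value in row.items()}
        for row in reader
    ]


def group_by_system(rows):
    """Map each ta_system_id to its gene IDs, skipping single-gene predictions."""
    systems = {}
    for row in rows:
        if row["ta_system_id"] is not None:
            systems.setdefault(row["ta_system_id"], []).append(row["gene_id"])
    return systems


# Invented example content with a subset of the documented columns.
example = (
    "contig_name\tgene_id\tta_system_id\tis_single_gene\tgene_type\tscore\n"
    "contig_1\tgene_10\tTA_1\tFalse\tToxin\t0.82\n"
    "contig_1\tgene_11\tTA_1\tFalse\tAntitoxin\t0.82\n"
    "contig_2\tgene_57\tNone\tTrue\tToxin\tNone\n"
)
rows = load_gene_rows(io.StringIO(example))
print(group_by_system(rows))
```

For a real run, replace the `io.StringIO` object with an open handle on `tatouscan_results.tsv` in the output directory.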

tatouscan_results_pairs.tsv — per-pair results

One row per predicted toxin–antitoxin pair. For systems with more than one toxin or antitoxin, all valid combinations are written as separate rows.

| Column | Description |
|---|---|
| ta_system_id | Shared system ID (matches the per-gene file) |
| contig_name | Contig where the pair is located |
| toxin_gene_id | Toxin gene identifier |
| toxin_strand | + or - |
| toxin_product | Predicted gene product |
| toxin_length_aa | Toxin protein length in amino acids |
| toxin_hmm_name / _score / _evalue / _source / _description | Best HMM hit for the toxin |
| antitoxin_gene_id | Antitoxin gene identifier |
| antitoxin_strand | + or - |
| antitoxin_product | Predicted gene product |
| antitoxin_length_aa | Antitoxin protein length in amino acids |
| antitoxin_hmm_name / _score / _evalue / _source / _description | Best HMM hit for the antitoxin |
| intergenic_distance | Distance in nucleotides between the two genes (negative = overlap) |
| pair_is_known | 1 / 0 / None (see above) |
| score | Unified match score in (0, 1] |
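
To make the sign convention of intergenic_distance concrete, here is one way such a distance could be computed from gene coordinates. It assumes 1-based inclusive coordinates as used in GFF files; the exact formula used by TAtouScan is not stated in this README, so treat the function as an illustrative assumption.

```python
def intergenic_distance(start_a, end_a, start_b, end_b):
    """Nucleotides separating two genes, given 1-based inclusive coordinates.

    Assumed convention: 0 means the genes are directly adjacent, and a
    negative value means they overlap. For partially overlapping genes the
    absolute value is the overlap length; for a gene nested inside another
    the value is still negative but is not the overlap length.
    """
    # Order the two genes by start position so the argument order does not matter.
    (_, upstream_end), (downstream_start, _) = sorted(
        [(start_a, end_a), (start_b, end_b)]
    )
    return downstream_start - upstream_end - 1


print(intergenic_distance(100, 400, 411, 700))  # 10 nucleotides between the genes
print(intergenic_distance(100, 400, 397, 700))  # -4: the genes share 4 nucleotides
```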

Detailed output

With --detailed, the following additional columns are written to both files:

  • Per-source HMM hits: TASmania_hmm_name/score/evalue/description, TADB3_hmm_name/score/evalue/description, Other_hmm_name/score/evalue/description (prefixed with toxin_ / antitoxin_ in the pairs file)
  • Raw Z-scores and scoring context: toxin_size_z, at_size_z, intergenic_distance_z, plus the matched_family used as reference and its n_reference_pairs

The pairs file also adds toxin_start/end and antitoxin_start/end in detailed mode.


Scoring

Every predicted TA pair is compared against reference statistics derived from known TADB3 type-II systems. The score measures how closely the predicted pair resembles a genuine TA system of its family.

What is compared

Three structural features are measured for each predicted pair and compared against the reference distribution for the matched family:

| Feature | Definition |
|---|---|
| toxin_size | Toxin protein length (amino acids) |
| at_size | Antitoxin protein length (amino acids) |
| intergenic_distance | Distance in nucleotides between the two genes (negative = overlap) |

The toxin family is determined from its best TADB3 HMM hit. If no TADB3 hit exists or the family has fewer than 20 reference pairs, global statistics computed across all families are used as a fallback.
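
The fallback rule can be sketched as follows. The dictionary layout, the function name `select_reference`, and the family names are hypothetical; only the threshold of 20 reference pairs and the two fallback conditions come from the paragraph above. Treating a family that is absent from the statistics like a missing hit is an additional assumption.

```python
MIN_REFERENCE_PAIRS = 20  # threshold stated in the text


def select_reference(family, family_stats, global_stats):
    """Pick the reference statistics for a predicted pair (illustrative sketch).

    family: toxin family from the best TADB3 HMM hit, or None if there is no hit.
    family_stats: mapping of family name to a dict with 'n_reference_pairs'.
    global_stats: statistics computed across all families.
    """
    stats = family_stats.get(family) if family is not None else None
    if stats is None or stats["n_reference_pairs"] < MIN_REFERENCE_PAIRS:
        return global_stats
    return stats


# Invented values, for illustration only.
family_stats = {
    "famA": {"n_reference_pairs": 150},
    "famB": {"n_reference_pairs": 12},
}
global_stats = {"n_reference_pairs": 3000}

print(select_reference("famA", family_stats, global_stats))  # family-specific
print(select_reference("famB", family_stats, global_stats))  # too few pairs: global
print(select_reference(None, family_stats, global_stats))    # no TADB3 hit: global
```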

Robust Z-scores

For each feature, a Z-score measures how far the predicted value deviates from the family reference:

$$z = \frac{x - \text{median}}{\text{MAD} / 0.6745}$$

Median and MAD (median absolute deviation) are used instead of mean and standard deviation because size distributions in TA families are often skewed. This makes the scores robust to outliers.
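
As a worked illustration of the formula, the sketch below computes a robust Z-score from a small list of reference values. The reference lengths are invented, the function name `robust_z` is hypothetical, and the handling of a zero MAD is an assumption, since this README does not say how TAtouScan treats that case.

```python
from statistics import median


def robust_z(x, reference_values):
    """Robust Z-score of x against reference values, using median and MAD."""
    med = median(reference_values)
    mad = median(abs(value - med) for value in reference_values)
    if mad == 0:
        # Assumption: refuse to score rather than divide by zero.
        raise ValueError("MAD is zero; the robust Z-score is undefined")
    # MAD / 0.6745 estimates the standard deviation for normally distributed data.
    return (x - med) / (mad / 0.6745)


# Invented toxin lengths (amino acids) for one family: median 100, MAD 5.
reference_lengths = [90, 95, 100, 105, 110]
print(round(robust_z(110, reference_lengths), 3))  # 1.349
print(round(robust_z(100, reference_lengths), 3))  # 0.0
```

Here a 110-residue toxin sits two MADs above the median, which corresponds to a Z-score of about 1.35.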

Unified score

All Z-scores are combined into a single score in the range $(0, 1]$:

$$\text{score} = \exp\!\left(-\frac{1}{n}\sum_i |z_i|\right)$$

The mean is taken over all available terms: the three structural Z-scores plus a compatibility term ($z_{\text{compat}}$) based on whether this toxin–antitoxin family combination has been observed in TADB3:

  • pair_is_known = 1 → $z_{\text{compat}} = 0$ (no penalty)
  • pair_is_known = 0 → $z_{\text{compat}} = 2$ (unknown combination lowers the score)
  • pair_is_known = None → compatibility term excluded from the mean
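
The formula and the three compatibility cases can be combined in a few lines. This is a sketch of the calculation as described above, not TAtouScan's own code; the function name `unified_score` and the example Z-scores are invented.

```python
import math


def unified_score(structural_z, pair_is_known):
    """Combine Z-scores into a score in (0, 1] following the formula above.

    structural_z: the available structural Z-scores (up to three).
    pair_is_known: 1, 0, or None, as in the pair_is_known column.
    """
    terms = [abs(z) for z in structural_z]
    if pair_is_known == 1:
        terms.append(0.0)  # known combination: no penalty
    elif pair_is_known == 0:
        terms.append(2.0)  # unknown combination: fixed penalty
    # pair_is_known is None: the compatibility term is left out of the mean.
    if not terms:
        raise ValueError("at least one term is required")
    return math.exp(-sum(terms) / len(terms))


z_scores = [0.3, -0.5, 0.4]  # invented structural Z-scores
print(round(unified_score(z_scores, 1), 3))     # 0.741
print(round(unified_score(z_scores, 0), 3))     # 0.449
print(round(unified_score(z_scores, None), 3))  # 0.67
```

With identical structural Z-scores, an unobserved family combination lowers the score from about 0.74 to about 0.45.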

Score interpretation:

| Score | Meaning |
|---|---|
| ~1.0 | Features match the family reference almost exactly, known combination |
| ~0.7 | Moderate structural match, known combination |
| ~0.4 | Moderate structural match, but family combination not seen in TADB3 |
| < 0.2 | Large structural deviations or unknown combination; treat with caution |

A high score supports a genuine TA pair; a low score does not exclude it, but suggests the prediction should be reviewed.

License

This project is licensed under the MIT License.
