Markadoros - fast barcode assembly and identification from raw sequencing data

    ▗▖  ▗▖▗▞▀▜▌ ▄▄▄ █  ▄ ▗▞▀▜▌▐▌ ▄▄▄   ▄▄▄ ▄▄▄   ▄▄▄
    ▐▛▚▞▜▌▝▚▄▟▌█    █▄▀  ▝▚▄▟▌▐▌█   █ █   █   █ ▀▄▄
    ▐▌  ▐▌     █    █ ▀▄   ▗▞▀▜▌▀▄▄▄▀ █   ▀▄▄▄▀ ▄▄▄▀
    ▐▌  ▐▌          █  █   ▝▚▄▟▌

Introduction

Markadoros is a Python tool for the identification and assembly of barcode genes in raw sequencing data, using MMSeqs2 for searching and either SPAdes or Hifiasm for assembly. Given an MMSeqs2 database of marker gene sequences, markadoros first searches the input reads (supplied in either FASTX or CRAM format) to quickly pre-filter those that map to a target marker gene. Reads that match any barcode sequence in the database are extracted and assembled with an appropriate assembler. The assembled contigs are then searched against the same database with more sensitive thresholds to identify the marker genes present in the input dataset.

The name markadoros comes from the Greek word for a marker pen, μαρκαδόρος.

Requirements

System Dependencies

You will need the following external tools installed to run Markadoros:

  • MMSeqs2
  • SPAdes
  • Hifiasm

Python Dependencies

Markadoros requires Python 3.11+ and the following packages:

  • biopython (≥1.86)
  • click (≥8.3.1)
  • jsonschema (≥4.26.0)
  • pandas (≥2.3.3)
  • pymmseqs (≥1.0.5)
  • pysam (≥0.23.3)
  • scikit-learn (≥1.8.0)

These are automatically installed when you install Markadoros.

Installation

From source

pip install -e .

Dependencies

Install the required external tools with Conda:

conda install -c bioconda mmseqs2 spades hifiasm

Quick Start

# 1. Build a database from a BOLD release
markadoros database -x bold --prefix BOLD --cluster --outdir db/ <bold_release.fasta.gz>

# 2. Search your reads
markadoros search -x illumina -n 100000 --index db/db.json --db BOLD.COI <reads.fq.gz>

# 3. Check results
less reads.COI.summary.json

Detailed Usage

Database Preparation

Use the database command to prepare marker gene sequences for searching:

markadoros database -x bold \
    --prefix BOLD \
    --outdir db/ \
    /path/to/bold/release.fasta.gz

Required options:

  • -x, --header-type <type> - Use a preset header processor: bold, unite, silva_lsu, silva_ssu, generic.
  • --prefix <name> - Prefix for output database names

Additional options:

  • --marker <name> - Only used when --header-type is generic: the name of the marker gene in the marker field of the FASTA header.
  • --min-length <N> - Minimum sequence length to retain (default: 200)
  • --deduplicate/--no-deduplicate - Deduplicate identical sequences, keeping the first record for each identical sequence. (default: False)
  • --cluster/--no-cluster - Cluster sequences using MMSeqs2's linear clustering algorithm (default: no cluster)
  • --cluster_min_seq_id <float> - Minimum sequence identity, as a fraction between 0 and 1, at which sequences are clustered (default: 0.99)
  • --cluster_coverage <float> - Fraction of overlap between two sequences required for clustering (default: 0.8)
  • --create-index/--no-create-index - Create an MMSeqs2 index for each marker database. This may improve speed for larger databases, but can cause IO issues if multiple processes access the same database. (default: False)
  • --exclude-file <path> - Newline-separated list of regular expressions. Any sequence whose header matches one of them is skipped.
  • -o, --outdir <path> - Output directory (default: ./markadoros.db)
  • --cleanup/--no-cleanup - Clean up temporary files (default: cleanup)
  • -t, --threads <N> - Number of threads for MMSeqs2 (default: 1)
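
To make the combined effect of --min-length, --exclude-file and --deduplicate concrete, here is a minimal Python sketch. It is not Markadoros's own code: the function name, the order in which the checks run, the use of re.search rather than a full match, and the case-insensitive duplicate comparison are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

Record = tuple[str, str]  # (header, sequence)


def filter_records(
    records: Iterable[Record],
    exclude_patterns: Iterable[str],
    min_length: int = 200,
    deduplicate: bool = False,
) -> Iterator[Record]:
    """Yield records that survive the length, exclusion and duplicate checks."""
    # Blank lines in the pattern list are ignored (an assumption).
    compiled = [re.compile(p.strip()) for p in exclude_patterns if p.strip()]
    seen: set[str] = set()
    for header, sequence in records:
        if len(sequence) < min_length:
            continue  # shorter than --min-length
        if any(rx.search(header) for rx in compiled):
            continue  # header matched an exclusion pattern
        if deduplicate:
            key = sequence.upper()
            if key in seen:
                continue  # only the first record per identical sequence is kept
            seen.add(key)
        yield header, sequence


# Made-up records in the generic header layout.
records = [
    ("id1|COI|Genus alpha|lineage", "ACGT" * 60),
    ("id2|COI|Genus beta|lineage", "ACGT" * 60),   # same sequence as id1
    ("id3|COI|Unidentified sp.|lineage", "ACGT" * 70),
    ("id4|COI|Genus gamma|lineage", "ACGT" * 10),  # 40 bp, below the default
]
kept = list(filter_records(records, ["(?i)unidentified"], deduplicate=True))
print([header for header, _ in kept])
```

With deduplication enabled only the first record remains; without it, id2 is kept as well.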

Header types:

If your FASTA release does not conform to the above presets, set -x to generic. Your input FASTA headers must then be formatted as:

><unique_id>|<marker>|<taxon_name>|<lineage>
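
Before building a generic database it can help to check that every header really has these four pipe-separated fields. The following Python sketch does that. The function and class names are hypothetical, the lineage is treated as an opaque string because its internal format is not specified here, and the example header (including its semicolon-separated lineage) is made up.

```python
from typing import NamedTuple


class GenericHeader(NamedTuple):
    unique_id: str
    marker: str
    taxon_name: str
    lineage: str


def parse_generic_header(line: str) -> GenericHeader:
    """Split a '>id|marker|taxon|lineage' header into its four fields."""
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    fields = [field.strip() for field in text.split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}: {line!r}")
    if not all(fields[:3]):
        raise ValueError(f"empty id, marker or taxon name in header: {line!r}")
    return GenericHeader(*fields)


# Illustrative header; the lineage layout is an assumption.
header = parse_generic_header(
    ">SAMPLE001|COI|Halyzia sedecimguttata|Animalia;Arthropoda;Insecta"
)
print(header.marker, header.taxon_name)
```

Running a check like this over the whole FASTA file before calling the database command surfaces malformed headers early.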

Output files:

  • db.json - Index of available databases and their parameters
  • <prefix>_<marker>/db* - MMSeqs2 database files
  • <prefix>_<marker>/taxon.json.gz - JSON file counting the number of available sequences per taxon.

If you build additional databases pointing to the same output directory, the existing index file will be updated to include the new entries.

Building a database from the BOLD TSV release

Alternatively, you can use the bold-coi-from-tsv command to build a BOLD COI database from the BOLD TSV release. The BOLD TSV release records the BIN, or cluster, of each sequence, so it can be used to create a database with a single representative sequence per cluster instead of running your own clustering. Currently this approach is only supported for the COI marker.

Run it as follows:

markadoros bold-coi-from-tsv --prefix BOLD --threads 16 --outdir . BOLD_Public.27-Mar-2026.tsv.gz
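
The idea of one representative sequence per BIN can be sketched in a few lines of Python. This is only an illustration, not what bold-coi-from-tsv actually does: the column names (processid, bin_uri, nuc) are assumptions about the TSV layout, and choosing the longest sequence in each BIN is an assumed selection rule.

```python
import csv
import io
from typing import TextIO


def pick_bin_representatives(
    tsv_handle: TextIO, min_length: int = 200
) -> dict[str, tuple[str, str]]:
    """Map each BIN to the (record id, sequence) of its longest sequence."""
    best: dict[str, tuple[str, str]] = {}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        bin_id = (row.get("bin_uri") or "").strip()
        # Alignment gaps are dropped before measuring length (an assumption).
        sequence = (row.get("nuc") or "").replace("-", "").strip()
        if not bin_id or len(sequence) < min_length:
            continue  # no BIN assigned, or sequence too short
        current = best.get(bin_id)
        if current is None or len(sequence) > len(current[1]):
            best[bin_id] = (row["processid"], sequence)
    return best


# Tiny made-up table; min_length is lowered so the toy sequences pass.
demo = io.StringIO(
    "processid\tbin_uri\tnuc\n"
    "P1\tBOLD:AAA0001\tACGTAC\n"
    "P2\tBOLD:AAA0001\tACGTACGTAC\n"
    "P3\tBOLD:AAA0002\tACGTACGT\n"
    "P4\t\tACGTACGTACGT\n"
)
reps = pick_bin_representatives(demo, min_length=4)
print(reps)
```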

Options such as --exclude-file and --min-length work the same as in the standard database command.

Searching for Barcodes

Use the search command to identify barcode genes in a set of reads or pre-assembled contigs:

markadoros search -x illumina --index db/db.json reads.fq.gz

You can supply the binomial name of the taxon you expect the data to come from with --expected_taxon. In this case, markadoros checks the results and reports how many sequences for that taxon are available in the database and how many hits were found for it. If the expected taxon is found, the top hit for that taxon is reported as the best hit instead of the overall top hit.

Sometimes the expected taxon has synonyms that you also want to check. To accommodate these, you can either pass the --find-goat-synonyms flag, which pulls the names of all synonyms from GoaT, or supply known synonyms as a comma-separated list with --synonyms. If a hit is for a synonym, it is marked as such in the output results.
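
The selection logic described above can be sketched roughly as follows. This is an illustration, not the tool's implementation: the Hit class, the function names and the case-insensitive name comparison are hypothetical, ranking by bitscore follows the output description further down, and treating synonym hits as matches when choosing the best hit is an assumption. The taxon names are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    taxon: str
    bitscore: float


def parse_synonyms(value: str) -> list[str]:
    """Split a comma-separated --synonyms value into clean names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def label_hit(hit: Hit, expected: str, synonyms: list[str]) -> str:
    """Classify a hit as 'expected', 'synonym' or 'other'."""
    name = hit.taxon.casefold()
    if name == expected.casefold():
        return "expected"
    if name in {s.casefold() for s in synonyms}:
        return "synonym"
    return "other"


def pick_top_hit(hits: list[Hit], expected: str, synonyms: list[str]) -> Optional[Hit]:
    """Prefer the best hit for the expected taxon, else the best hit overall."""
    if not hits:
        return None
    matching = [h for h in hits if label_hit(h, expected, synonyms) != "other"]
    return max(matching or hits, key=lambda h: h.bitscore)


synonyms = parse_synonyms("Examplea prima, Examplea antiqua")
hits = [
    Hit("Otherus maximus", 900.0),
    Hit("Examplea nova", 850.0),
    Hit("Examplea prima", 700.0),
]
top = pick_top_hit(hits, "Examplea nova", synonyms)
print(top, [label_hit(h, "Examplea nova", synonyms) for h in hits])
```

Here the overall best bitscore belongs to a different taxon, but the hit for the expected taxon is still the one reported.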

Required options:

  • -x, --type <type> - Input data type (see table below)
  • -i, --index <path> - Path to database index JSON

Input type aliases:

Platform                  Accepted values
Illumina / short reads    sr, short, illumina
Short read RNA-seq        rnaseq
PacBio HiFi               pb, pacbio, pacbio_hifi
Oxford Nanopore           ont, nanopore, oxford_nanopore
Pre-assembled contigs     contigs
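
If you drive markadoros from a script, it can be convenient to validate the -x value before launching a run. The sketch below encodes the alias table above as a lookup. The canonical names on the left (short_read, pacbio_hifi and so on) are hypothetical labels chosen for this example, not identifiers used by Markadoros.

```python
# Aliases are taken from the table above; the canonical keys are illustrative.
INPUT_TYPE_ALIASES = {
    "short_read": ("sr", "short", "illumina"),
    "rnaseq": ("rnaseq",),
    "pacbio_hifi": ("pb", "pacbio", "pacbio_hifi"),
    "nanopore": ("ont", "nanopore", "oxford_nanopore"),
    "contigs": ("contigs",),
}

# Invert the table so each accepted alias points at its platform.
_ALIAS_LOOKUP = {
    alias: canonical
    for canonical, aliases in INPUT_TYPE_ALIASES.items()
    for alias in aliases
}


def normalise_input_type(value: str) -> str:
    """Return the platform label for an accepted -x value."""
    key = value.strip().lower()
    if key not in _ALIAS_LOOKUP:
        accepted = ", ".join(sorted(_ALIAS_LOOKUP))
        raise ValueError(f"unknown input type {value!r}; accepted values: {accepted}")
    return _ALIAS_LOOKUP[key]


print(normalise_input_type("pb"), normalise_input_type("illumina"))
```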

Additional options:

  • --db <name> - Search a specific database only (default: search all databases in index)
  • --expected_taxon <name> - Expected taxon binomial name for validation
  • -s, --find-goat-synonyms - Get the synonyms for the provided expected taxon from GoaT
  • --synonyms - Comma-separated list of synonyms. Overrides --find-goat-synonyms if provided.
  • -n, --nreads <N> - Limit to first N reads
  • -m, --min_seq_id <float> - Minimum sequence identity for hits (default: 0.96)
  • -l, --min_aln_len <int> - Minimum alignment length for hits (default: 450)
  • --cleanup/--no-cleanup - Clean up temporary files (default: cleanup)
  • --db-to-tmpdir/--no-db-to-tmpdir - Temporarily copy the database to the temporary directory. Can reduce IO and improve speed if multiple processes would access the database simultaneously. Not suggested if the databases include indexes. (default: True)
  • -t, --threads <N> - Number of threads (default: 1)
  • -p, --prefix <name> - Output file prefix (default: input filename)
  • -o, --outdir <path> - Output directory (default: current directory)
  • --save-contigs - Write the assembled contigs to disk in the output directory.
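
As an illustration of how the --min_seq_id and --min_aln_len thresholds interact, the sketch below keeps only alignments that satisfy both. It is a simplified model rather than the tool's code: the Alignment class is hypothetical, and treating both thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query: str
    target: str
    seq_id: float   # fraction of identical positions, 0 to 1
    aln_len: int    # alignment length in bases


def passes_thresholds(
    aln: Alignment, min_seq_id: float = 0.96, min_aln_len: int = 450
) -> bool:
    """A hit must meet both the identity and the length threshold."""
    return aln.seq_id >= min_seq_id and aln.aln_len >= min_aln_len


alignments = [
    Alignment("contig_1", "ref_A", 0.99, 650),  # passes both
    Alignment("contig_1", "ref_B", 0.99, 300),  # too short
    Alignment("contig_2", "ref_C", 0.90, 650),  # identity too low
]
hits = [aln for aln in alignments if passes_thresholds(aln)]
print([aln.target for aln in hits])
```

Lowering --min_aln_len may be worth considering for markers shorter than the default of 450, since no alignment to such a marker could otherwise pass.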

Output files:

  • <prefix>.<marker>.summary.json - Results in JSON format

Output JSON file

The search command writes a JSON file summarising the search. It has the following structure:

{
  "input": {
      "file": <path>, // input file path
      "n_reads": <int>, // number of reads searched
      "n_aligned_reads": <int>, // number of reads aligned to database
      "marker": <string>, // marker gene in database
      "database": <path>, // path to mmseqs database
      "contig_stats": <dict>, // number, total length and n50 of assembled contigs
      "expected_taxon": <dict>, // name of asserted taxon and number of records present in database, including synonyms
  },
  "summary": {
      "n_contigs_with_hits": <int>, // number of assembled contigs with search hits
      "n_expected_taxon_hits": <int>, // number of hits for the asserted taxon
      "n_synonym_hits": <int>, // number of hits for synonyms
      "top_result": <dict>, // top result (highest bitscore) for asserted taxon if found or overall if not
      "taxon_summary": <dict>, // per-taxon summary of results - number of hits, min and max %ID and alignment lengths, and sequence of top hit
  },
  "results": <list>, // for each contig with results, a list of all hits
  "run_info": <dict> // tool information
}
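
A summary file with this structure can be inspected with a few lines of Python. The sketch below uses only the top-level keys documented above and prints top_result as raw JSON, because its inner fields are not described here. The describe function and the mock report (including the taxon key inside top_result) are made up for illustration.

```python
import json


def describe(report: dict) -> str:
    """Build a short, human-readable digest of a search summary."""
    run = report["input"]
    summary = report["summary"]
    n_reads = run["n_reads"]
    aligned = run["n_aligned_reads"]
    fraction = aligned / n_reads if n_reads else 0.0
    lines = [
        f"marker: {run['marker']}",
        f"reads aligned: {aligned}/{n_reads} ({fraction:.2%})",
        f"contigs with hits: {summary['n_contigs_with_hits']}",
        f"expected-taxon hits: {summary.get('n_expected_taxon_hits', 0)}",
        "top result: " + json.dumps(summary.get("top_result"), sort_keys=True),
    ]
    return "\n".join(lines)


# Mock report; for a real run, load it with json.load() on the
# <prefix>.<marker>.summary.json file instead.
mock_report = {
    "input": {"marker": "COI", "n_reads": 100000, "n_aligned_reads": 250},
    "summary": {
        "n_contigs_with_hits": 2,
        "n_expected_taxon_hits": 1,
        "top_result": {"taxon": "Genus alpha"},
    },
}
digest = describe(mock_report)
print(digest)
```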

Example Workflows

Building a marker database from the BOLD FASTA release:

markadoros database \
    --header-type bold \
    --outdir db/ \
    --prefix BOLD \
    --cluster \
    --threads 16 \
    BOLD_Public.20-Feb-2026.fasta.gz

Searching PacBio HiFi reads with an expected taxon:

markadoros search -x pb \
    --index db/db.json \
    --db BOLD.COI \
    --expected_taxon "Halyzia sedecimguttata" \
    --threads 16 \
    --nreads 20000 \
    pacbio_reads.fasta.gz
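
When the same search has to be run over many read sets, the command line can be assembled in Python and handed to subprocess. The helper below is a hypothetical convenience wrapper, not part of Markadoros. It uses only flags listed in this document and reproduces the PacBio example above.

```python
import shlex
from typing import Optional


def build_search_command(
    reads: str,
    input_type: str,
    index: str,
    db: Optional[str] = None,
    expected_taxon: Optional[str] = None,
    threads: int = 1,
    nreads: Optional[int] = None,
) -> list[str]:
    """Assemble an argument list for 'markadoros search'."""
    cmd = ["markadoros", "search", "-x", input_type, "--index", index]
    if db is not None:
        cmd += ["--db", db]
    if expected_taxon is not None:
        cmd += ["--expected_taxon", expected_taxon]
    cmd += ["--threads", str(threads)]
    if nreads is not None:
        cmd += ["--nreads", str(nreads)]
    cmd.append(reads)
    return cmd


cmd = build_search_command(
    "pacbio_reads.fasta.gz",
    "pb",
    "db/db.json",
    db="BOLD.COI",
    expected_taxon="Halyzia sedecimguttata",
    threads=16,
    nreads=20000,
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with taxon names that contain spaces.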

Author

  • Jim Downie
