Markadoros is a tool for the identification and assembly of barcode genes in raw sequencing data.

Markadoros - fast barcode assembly and identification from raw sequencing data

    ▗▖  ▗▖▗▞▀▜▌ ▄▄▄ █  ▄ ▗▞▀▜▌▐▌ ▄▄▄   ▄▄▄ ▄▄▄   ▄▄▄
    ▐▛▚▞▜▌▝▚▄▟▌█    █▄▀  ▝▚▄▟▌▐▌█   █ █   █   █ ▀▄▄
    ▐▌  ▐▌     █    █ ▀▄   ▗▞▀▜▌▀▄▄▄▀ █   ▀▄▄▄▀ ▄▄▄▀
    ▐▌  ▐▌          █  █   ▝▚▄▟▌

Introduction

Markadoros is a Python tool for identifying and assembling barcode genes in raw sequencing data. It uses MMseqs2 for searching and either SPAdes or Hifiasm for assembly. Given an MMseqs2 database of marker gene sequences, markadoros first searches the input reads (supplied in FASTX or CRAM format) to quickly pre-filter those that map to the target marker gene. Reads that match any barcode sequence in the database are extracted and assembled using an appropriate assembler. The assembled contigs are then searched against the same database with more sensitive thresholds to identify the marker genes present in the input dataset.

The name markadoros comes from the Greek word for a marker pen, μαρκαδόρος.

Requirements

System Dependencies

You will need the following external tools installed to run Markadoros:

MMseqs2
SPAdes
Hifiasm

Python Dependencies

Markadoros requires Python 3.11+ and the following packages:

biopython (≥1.86)
click (≥8.3.1)
jsonschema (≥4.26.0)
pandas (≥2.3.3)
pymmseqs (≥1.0.5)
pysam (≥0.23.3)
scikit-learn (≥1.8.0)

These are automatically installed when you install Markadoros.

Installation

From source

pip install -e .

Dependencies

Install the required external tools with Conda:

conda install -c bioconda mmseqs2 spades hifiasm

Quick Start

# 1. Build a database from a BOLD release
markadoros database -x bold --prefix BOLD --cluster --outdir db/ <bold_release.fasta.gz>

# 2. Search your reads
markadoros search -x illumina -n 100000 --index db/db.json --db BOLD.COI <reads.fq.gz>

# 3. Check results
less reads.COI.summary.json

Detailed Usage

Database Preparation

Use the database command to prepare marker gene sequences for searching:

markadoros database -x bold \
    --prefix BOLD \
    --outdir db/ \
    /path/to/bold/release.fasta.gz

Required options:

-x, --header-type <type> - Use a preset header processor: bold, unite, silva_lsu, silva_ssu, generic.
--prefix <name> - Prefix for output database names

Additional options:

--marker <name> - Only used when --header-type is generic: the name of the marker gene as it appears in the marker field of the FASTA header.
--min-length <N> - Minimum sequence length to retain (default: 200)
--deduplicate/--no-deduplicate - Deduplicate identical sequences. The first record for each identical sequence is kept. (default: false)
--cluster/--no-cluster - Cluster sequences using MMseqs2's linear clustering algorithm (default: no cluster)
--cluster_min_seq_id - Sequence identity threshold for clustering, given as a fraction (default: 0.99)
--cluster_coverage - Fraction of overlap between two sequences required for clustering (default: 0.8)
--create-index/--no-create-index - Create an MMseqs2 index for each marker database. This may improve speed for larger databases, but can cause IO issues if multiple processes access the same database. (default: False)
--exclude-file - Newline-separated list of regular expressions. Any record whose header matches one of them is skipped.
-o, --outdir <path> - Output directory (default: ./markadoros.db)
--cleanup/--no-cleanup - Clean up temporary files (default: cleanup)
-t, --threads <N> - Number of threads for MMseqs2 (default: 1)
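The combined effect of --min-length and --exclude-file can be pictured with a short Python sketch. This is an illustration only, not Markadoros's own code: the function names are hypothetical, and it assumes a pattern may match anywhere in the header (re.search), which the option description does not specify.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def keep_record(header, sequence, patterns, min_length=200):
    """Return True if a record passes the length and exclusion filters."""
    if len(sequence) < min_length:
        return False
    # Assumption: a pattern may match anywhere in the header.
    return not any(p.search(header) for p in patterns)


# Hypothetical exclusion patterns and records, for illustration only.
patterns = load_patterns([r"sp\.$", r"^UNKNOWN", ""])
records = [
    ("ID1|COI|Halyzia sedecimguttata", "A" * 650),
    ("ID2|COI|Halyzia sp.", "A" * 650),             # excluded by pattern
    ("ID3|COI|Halyzia sedecimguttata", "A" * 120),  # shorter than 200
]
kept = [header for header, seq in records if keep_record(header, seq, patterns)]
print(kept)
```

Testing a new exclusion file against a handful of real headers in this way is a cheap check before rebuilding a large database.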

Header types:

bold - BOLD Systems general FASTA release
unite - UNITE general FASTA release
silva_lsu - SILVA LSU Release
silva_ssu - SILVA SSU Release

If your FASTA release does not conform to the above presets, set -x to generic. Your input FASTA headers must then be formatted as:

><unique_id>|<marker>|<taxon_name>|<lineage>

Output files:

db.json - Index of available databases and their parameters
<prefix>_<marker>/db* - MMseqs2 database files
<prefix>_<marker>/taxon.json.gz - JSON file counting the number of available sequences per taxon.

If you build additional databases pointing to the same output directory, the existing index file will be updated to include the new entries.
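When preparing a FASTA file for the generic header type described above, it can help to check the headers before building the database. The following is a minimal sketch, not part of Markadoros: the function and class names are hypothetical, the semicolon-separated lineage in the example is an invented placeholder, and the lineage is treated as an opaque string because its internal format is not specified here.

```python
from typing import NamedTuple


class GenericHeader(NamedTuple):
    unique_id: str
    marker: str
    taxon_name: str
    lineage: str


def parse_generic_header(line: str) -> GenericHeader:
    """Split a '>id|marker|taxon|lineage' header into its four fields."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    # Assumption: only the first three pipes are separators, so any
    # further pipes are left inside the lineage field.
    fields = [f.strip() for f in line[1:].rstrip("\n").split("|", 3)]
    if len(fields) != 4 or not all(fields[:3]):
        raise ValueError(f"expected 4 pipe-separated fields: {line!r}")
    return GenericHeader(*fields)


# Hypothetical header; the lineage separator is a placeholder.
header = parse_generic_header(
    ">SAMPLE001|COI|Halyzia sedecimguttata|Animalia;Arthropoda;Insecta"
)
print(header.marker, header.taxon_name)
```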

Building a database from the BOLD TSV release

Alternatively, you can use the bold-coi-from-tsv command to build a BOLD COI database from the BOLD TSV release. The TSV release records the BIN (cluster) of each sequence, so it can be used to create a database with a single representative sequence per BIN instead of running your own clustering. Currently this approach is only supported for the COI marker.

Run it as follows:

markadoros bold-coi-from-tsv --prefix BOLD --threads 16 --outdir . BOLD_Public.27-Mar-2026.tsv.gz

Options such as --exclude-file and --min-length work the same as in the standard database command.
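The idea of keeping one representative per BIN can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the column names (processid, bin_uri, marker_code, nuc), the COI-5P marker code, and the rule of keeping the longest sequence in each BIN. How bold-coi-from-tsv actually selects its representative is not described above.

```python
import csv
import io

# Tiny invented TSV; a real release would be read from the .tsv.gz file.
SAMPLE_TSV = (
    "processid\tbin_uri\tmarker_code\tnuc\n"
    "P1\tBOLD:AAA0001\tCOI-5P\tACGTACGTAC\n"
    "P2\tBOLD:AAA0001\tCOI-5P\tACGTACGTACGTAC\n"
    "P3\tBOLD:AAA0002\tCOI-5P\tTTGACCA\n"
    "P4\t\tCOI-5P\tGGGCCC\n"
)


def representatives(handle, marker="COI-5P"):
    """Keep the longest sequence per BIN (an illustrative rule only)."""
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        bin_id = row["bin_uri"]
        # Skip other markers and records that have no BIN assigned.
        if row["marker_code"] != marker or not bin_id:
            continue
        current = best.get(bin_id)
        if current is None or len(row["nuc"]) > len(current["nuc"]):
            best[bin_id] = row
    return best


reps = representatives(io.StringIO(SAMPLE_TSV))
for bin_id, row in sorted(reps.items()):
    print(bin_id, row["processid"], len(row["nuc"]))
```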

Searching for Barcodes

Use the search command to identify barcode genes in a set of reads or pre-assembled contigs:

markadoros search -x illumina --index db/db.json reads.fq.gz

You can supply the binomial name of the taxon you expect the data to come from with --expected_taxon. Markadoros then checks the results and reports how many sequences for that taxon are available in the database and how many hits to that taxon were found. If the expected taxon is found, its top hit is reported as the best hit instead of the overall top hit.

Sometimes the expected taxon has synonyms that you also want to check. To accommodate these, either pass the --find-goat-synonyms flag, which retrieves the names of all synonyms from GoaT, or supply known synonyms as a comma-separated list with --synonyms. A hit to a synonym is marked as such in the output results.
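The expected-taxon and synonym logic can be illustrated with a small Python sketch. This is not the tool's implementation: the function names are hypothetical, the case-insensitive comparison is an assumption, and the synonym in the example is a placeholder.

```python
def parse_synonyms(value):
    """Turn a comma-separated string into a set of normalised names."""
    return {name.strip().lower() for name in value.split(",") if name.strip()}


def classify_hit(hit_taxon, expected_taxon, synonyms):
    """Label a hit as 'expected', 'synonym' or 'other'."""
    name = hit_taxon.strip().lower()
    if name == expected_taxon.strip().lower():
        return "expected"
    if name in synonyms:
        return "synonym"
    return "other"


expected = "Halyzia sedecimguttata"
# Placeholder synonym list, as it might be passed to --synonyms.
synonyms = parse_synonyms("Coccinella sedecimguttata, ")

for taxon in ("Halyzia sedecimguttata",
              "Coccinella sedecimguttata",
              "Adalia bipunctata"):
    print(taxon, "->", classify_hit(taxon, expected, synonyms))
```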

Required options:

-x, --type <type> - Input data type (see table below)
-i, --index <path> - Path to database index JSON

Input type aliases:

Platform	Accepted values
Illumina / short reads	`sr`, `short`, `illumina`
Short read RNA-seq	`rnaseq`
PacBio HiFi	`pb`, `pacbio`, `pacbio_hifi`
Oxford Nanopore	`ont`, `nanopore`, `oxford_nanopore`
Pre-assembled contigs	`contigs`
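When driving markadoros from a pipeline, it may be convenient to validate the -x value before launching a job. The sketch below simply encodes the table above as a lookup. The canonical platform keys and the function name are hypothetical labels chosen for this example, and matching is kept case-sensitive because the table does not say otherwise.

```python
# Accepted -x values from the table above, grouped under hypothetical keys.
INPUT_TYPE_ALIASES = {
    "illumina": ("sr", "short", "illumina"),
    "rnaseq": ("rnaseq",),
    "pacbio_hifi": ("pb", "pacbio", "pacbio_hifi"),
    "oxford_nanopore": ("ont", "nanopore", "oxford_nanopore"),
    "contigs": ("contigs",),
}

# Invert the table so that each alias maps to its platform key.
ALIAS_TO_PLATFORM = {
    alias: platform
    for platform, aliases in INPUT_TYPE_ALIASES.items()
    for alias in aliases
}


def normalise_input_type(value):
    """Return the platform key for an accepted -x value."""
    try:
        return ALIAS_TO_PLATFORM[value]
    except KeyError:
        accepted = ", ".join(sorted(ALIAS_TO_PLATFORM))
        raise ValueError(
            f"unknown input type {value!r}; accepted: {accepted}"
        ) from None


print(normalise_input_type("pb"))
print(normalise_input_type("ont"))
```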

Additional options:

--db <name> - Search a specific database only (default: search all databases in index)
--expected_taxon <name> - Expected taxon binomial name for validation
-s, --find-goat-synonyms - Get the synonyms for the provided expected taxon from GoaT
--synonyms - A comma-separated list of synonyms. Overrides --find-goat-synonyms if provided.
-n, --nreads <N> - Limit to first N reads
-m, --min_seq_id <float> - Minimum sequence identity for hits (default: 0.96)
-l, --min_aln_len <int> - Minimum alignment length for hits (default: 450)
--cleanup/--no-cleanup - Clean up temporary files (default: cleanup)
--db-to-tmpdir/--no-db-to-tmpdir - Temporarily copy the database to the temporary directory. Can reduce IO and improve speed if multiple processes would access the database simultaneously. Not suggested if the databases include indexes. (default: True)
-t, --threads <N> - Number of threads (default: 1)
-p, --prefix <name> - Output file prefix (default: input filename)
-o, --outdir <path> - Output directory (default: current directory)
--save-contigs - Write the assembled contigs to disk in the output directory.
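To make the --min_seq_id and --min_aln_len thresholds concrete, the following sketch filters a list of hits with the default values and ranks the survivors by bitscore. It is an illustration under assumptions: the dictionary keys (fident, alnlen, bits) and the hit records are hypothetical, and treating the thresholds as inclusive is a guess.

```python
def filter_hits(hits, min_seq_id=0.96, min_aln_len=450):
    """Keep hits that meet both thresholds, best bitscore first."""
    passing = [
        hit for hit in hits
        if hit["fident"] >= min_seq_id and hit["alnlen"] >= min_aln_len
    ]
    return sorted(passing, key=lambda hit: hit["bits"], reverse=True)


# Invented hits for illustration.
hits = [
    {"target": "A", "fident": 0.99, "alnlen": 640, "bits": 1150},
    {"target": "B", "fident": 0.95, "alnlen": 650, "bits": 1010},  # identity too low
    {"target": "C", "fident": 0.98, "alnlen": 300, "bits": 520},   # alignment too short
    {"target": "D", "fident": 1.00, "alnlen": 655, "bits": 1210},
]
ranked = filter_hits(hits)
print([hit["target"] for hit in ranked])
```

Lowering either threshold admits more distant or more fragmentary matches, which may be useful for poorly represented taxa at the cost of less specific hits.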

Output files:

<prefix>.<marker>.summary.json - Results in JSON format

Output JSON file

The search subtool outputs a JSON file summarising the search. It has the following format:

{
  "input": {
      "file": <path>, // input file path
      "n_reads": <int>, // number of reads searched
      "n_aligned_reads": <int>, // number of reads aligned to database
      "marker": <string>, // marker gene in database
      "database": <path>, // path to mmseqs database
      "contig_stats": <dict>, // number, total length and n50 of assembled contigs
      "expected_taxon": <dict>, // name of asserted taxon and number of records present in database, including synonyms
  },
  "summary": {
      "n_contigs_with_hits": <int>, // number of assembled contigs with search hits
      "n_expected_taxon_hits": <int>, // number of hits for the asserted taxon
      "n_synonym_hits": <int>, // number of hits for synonyms
      "top_result": <dict>, // top result (highest bitscore) for asserted taxon if found or overall if not
      "taxon_summary": <dict>, // per-taxon summary of results - number of hits, min and max %ID and alignment lengths, and sequence of top hit
  },
  "results": <list>, // for each contig with results, a list of all hits
  "run_info": <dict> // tool information
}
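A summary file with this layout can be read with Python's standard json module. The sketch below writes a small invented report to a temporary directory and extracts a few headline values from it. The top-level and second-level keys follow the format above, while the contents of top_result and all of the numbers are placeholders, since the fields inside the nested dictionaries are not listed here.

```python
import json
import tempfile
from pathlib import Path


def summarise(path):
    """Pull a few headline values out of a markadoros summary JSON."""
    with open(path) as handle:
        report = json.load(handle)
    info = report["input"]
    summary = report["summary"]
    n_reads = info["n_reads"]
    return {
        "marker": info["marker"],
        "aligned_fraction": info["n_aligned_reads"] / n_reads if n_reads else 0.0,
        "n_contigs_with_hits": summary["n_contigs_with_hits"],
        # Defensive: treat a missing or null count as zero hits.
        "expected_taxon_found": (summary.get("n_expected_taxon_hits") or 0) > 0,
        "top_result": summary["top_result"],
    }


# Invented report for illustration; the top_result fields are placeholders.
example = {
    "input": {"file": "reads.fq.gz", "n_reads": 100000,
              "n_aligned_reads": 250, "marker": "COI"},
    "summary": {"n_contigs_with_hits": 2, "n_expected_taxon_hits": 1,
                "n_synonym_hits": 0,
                "top_result": {"taxon": "Halyzia sedecimguttata"}},
    "results": [],
    "run_info": {},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reads.COI.summary.json"
    path.write_text(json.dumps(example))
    overview = summarise(path)

print(overview)
```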

Example Workflows

Building a marker database from the BOLD FASTA release:

markadoros database \
    --header-type bold \
    --outdir db/ \
    --prefix BOLD \
    --cluster \
    --threads 16 \
    BOLD_Public.20-Feb-2026.fasta.gz

Searching PacBio HiFi reads with expected taxon:

markadoros search -x pb \
    --index db/db.json \
    --db BOLD.COI \
    --expected_taxon "Halyzia sedecimguttata" \
    --threads 16 \
    --nreads 20000 \
    pacbio_reads.fasta.gz
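To run the same search over several samples, the command line can be assembled in Python. This sketch only builds and prints the commands, using flags that appear in the examples above. The helper name, the sample file names and the sample-to-taxon mapping are hypothetical.

```python
import shlex


def build_search_command(reads, input_type, index, db=None,
                         expected_taxon=None, threads=1, nreads=None):
    """Assemble a 'markadoros search' argument list from documented flags."""
    cmd = ["markadoros", "search", "-x", input_type, "--index", index]
    if db:
        cmd += ["--db", db]
    if expected_taxon:
        cmd += ["--expected_taxon", expected_taxon]
    if nreads is not None:
        cmd += ["--nreads", str(nreads)]
    cmd += ["--threads", str(threads), reads]
    return cmd


# Hypothetical samples mapped to their expected taxa.
samples = {
    "sample_a.fasta.gz": "Halyzia sedecimguttata",
    "sample_b.fasta.gz": None,
}

commands = [
    build_search_command(reads, "pb", "db/db.json", db="BOLD.COI",
                         expected_taxon=taxon, threads=16, nreads=20000)
    for reads, taxon in samples.items()
]
for cmd in commands:
    # To execute instead of printing: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems with taxon names that contain spaces.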

Author

Jim Downie

Release history

1.1.0 (May 5, 2026)
1.0.0 (Apr 23, 2026)
