Markadoros is a tool for the identification and assembly of barcode genes in raw sequencing data.
Markadoros - fast barcode assembly and identification from raw sequencing data
▗▖ ▗▖▗▞▀▜▌ ▄▄▄ █ ▄ ▗▞▀▜▌▐▌ ▄▄▄ ▄▄▄ ▄▄▄ ▄▄▄
▐▛▚▞▜▌▝▚▄▟▌█ █▄▀ ▝▚▄▟▌▐▌█ █ █ █ █ ▀▄▄
▐▌ ▐▌ █ █ ▀▄ ▗▞▀▜▌▀▄▄▄▀ █ ▀▄▄▄▀ ▄▄▄▀
▐▌ ▐▌ █ █ ▝▚▄▟▌
Introduction
Markadoros is a Python tool for the identification and assembly of barcode genes in raw sequencing data, using MMSeqs2 for searching and either SPAdes or Hifiasm for assembly. Using an MMSeqs2 database of marker gene sequences, markadoros searches a set of input reads (providable in either FASTX or CRAM format) to quickly pre-filter reads that map to the target marker gene. Reads that match any barcode sequence in the database are extracted and assembled using an appropriate assembler. The resulting assembled contigs are then searched again using the same database, using more sensitive thresholds, to identify the marker genes from the input dataset.
The name markadoros comes from the Greek word for a marker pen, μαρκαδόρος.
Requirements
System Dependencies
You will need the following external tools installed to run Markadoros:
Installation
From Conda
markadoros is on Bioconda. Install it as follows:
conda install -c bioconda markadoros
With uv
Install markadoros with uv. You will need to install the external dependencies independently.
uv tool install markadoros
From source
To install markadoros from source, clone the repository and run the following inside the source directory:
pip install -e .
Quick Start
# 1. Build a database from a BOLD release
markadoros database -x bold --prefix BOLD --cluster --outdir db/ <bold_release.fasta.gz>
# 2. Search your reads
markadoros search -x illumina -n 100000 --index db/db.json --db BOLD.COI <reads.fq.gz>
# 3. Check results
less reads.COI.summary.json
Detailed Usage
Database Preparation
Use the database command to prepare marker gene sequences for searching:
markadoros database -x bold \
--prefix BOLD \
--outdir db/ \
/path/to/bold/release.fasta.gz
Required options:
- -x, --header-type <type> - Use a preset header processor: bold, unite, silva_lsu, silva_ssu, generic.
- --prefix <name> - Prefix for output database names
Additional options:
- --marker <name> - Only when --header-type is generic: the name of the marker gene in the marker field of the FASTA header.
- --min-length <N> - Minimum sequence length to retain (default: 200)
- --deduplicate/--no-deduplicate - Deduplicate identical sequences. The first record for each identical sequence is kept. (default: false)
- --cluster/--no-cluster - Cluster sequences using MMSeqs2's linear clustering algorithm (default: no cluster)
- --cluster_min_seq_id - Cluster sequences at this percentage identity threshold (default: 0.99)
- --cluster_coverage - Overlap between two sequences required for clustering (default: 0.8)
- --create-index/--no-create-index - Create an MMSeqs2 index for each marker database. This may improve speed for larger databases, but can cause IO issues if multiple processes access the same database. (default: False)
- --exclude-file - File containing a newline-separated list of regular expressions. Any header that matches one of them is skipped.
- -o, --outdir <path> - Output directory (default: ./markadoros.db)
- --cleanup/--no-cleanup - Clean up temporary files (default: cleanup)
- -t, --threads <N> - Number of threads for MMSeqs2 (default: 1)
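The effect of --min-length, --deduplicate and --exclude-file on the input records can be pictured with a small Python sketch. This is an illustration of the documented behaviour, not markadoros's own code: the function name `filter_records` is hypothetical, and it assumes that exclusion patterns are matched anywhere in the header and that deduplication compares sequences exactly.

```python
import re


def filter_records(records, min_length=200, deduplicate=False, exclude_patterns=()):
    """Return the (header, sequence) pairs that survive the three filters."""
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    seen = set()
    kept = []
    for header, seq in records:
        # Assumption: a pattern may match anywhere in the header.
        if any(rx.search(header) for rx in compiled):
            continue
        # Sequences shorter than the minimum length are dropped.
        if len(seq) < min_length:
            continue
        # Only the first record for each identical sequence is kept.
        if deduplicate:
            if seq in seen:
                continue
            seen.add(seq)
        kept.append((header, seq))
    return kept


example = [
    ("a|COI|Genus one|lineage", "A" * 250),
    ("b|COI|Genus one|lineage", "A" * 250),      # identical to record a
    ("c|COI|Genus two|lineage", "C" * 100),      # too short
    ("d_bad|COI|Genus three|lineage", "G" * 300),  # excluded by pattern
]
print(filter_records(example, deduplicate=True, exclude_patterns=[r"_bad"]))
```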
Header types:
- bold - BOLD Systems general FASTA release
- unite - UNITE general FASTA release
- silva_lsu - SILVA LSU Release
- silva_ssu - SILVA SSU Release
If your FASTA release does not conform to the above presets, set -x to generic. Your input FASTA headers must then be formatted as:
><unique_id>|<marker>|<taxon_name>|<lineage>
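If you need to convert your own records into this layout, a small helper along the following lines may be useful. It is a sketch only: `GenericHeader`, `parse_generic_header` and `format_generic_header` are hypothetical names, the semicolon-separated lineage in the example is an assumed format (the lineage syntax is not specified here), and the parser assumes that only the lineage field may itself contain `|`.

```python
from typing import NamedTuple


class GenericHeader(NamedTuple):
    unique_id: str
    marker: str
    taxon_name: str
    lineage: str


def parse_generic_header(line: str) -> GenericHeader:
    """Split a '>id|marker|taxon|lineage' header into its four fields."""
    # maxsplit=3 keeps any further '|' characters inside the lineage field.
    fields = line.strip().lstrip(">").split("|", 3)
    if len(fields) != 4 or not all(fields):
        raise ValueError(f"expected 4 non-empty '|'-separated fields: {line!r}")
    return GenericHeader(*fields)


def format_generic_header(header: GenericHeader) -> str:
    """Render the four fields back into a FASTA header line."""
    return ">" + "|".join(header)


# The lineage string below is a made-up example value.
parsed = parse_generic_header(
    ">seq001|COI|Halyzia sedecimguttata|Animalia;Arthropoda;Insecta"
)
print(parsed.marker, parsed.taxon_name)
```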
Output files:
- db.json - Index of available databases and their parameters
- <prefix>_<marker>/db* - MMSeqs2 database files
- <prefix>_<marker>/taxon.json.gz - JSON file counting the number of available sequences per taxon.
If you build additional databases pointing to the same output directory, the existing index file will be updated to include the new entries.
Building a database from the BOLD TSV release
Alternatively, you can use the bold-coi-from-tsv command to build a BOLD COI database from the BOLD TSV release. The BOLD TSV release records the BIN, or cluster, of each sequence, so it can be used to create a database with a single representative sequence per cluster instead of running your own clustering. Currently this approach is only supported for the COI marker.
Run it as follows:
markadoros bold-coi-from-tsv --prefix BOLD --threads 16 --outdir . BOLD_Public.27-Mar-2026.tsv.gz
Options such as --exclude-file and --min-length work the same as in the standard database command.
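To make the idea of one representative sequence per BIN concrete, here is a rough Python sketch. It does not reproduce how bold-coi-from-tsv chooses its representative: picking the longest sequence is an assumption made for illustration, as are the column names (`processid`, `bin_uri`, `marker_code`, `nuc`) and the `COI-5P` marker code, which should be checked against the header of your TSV release.

```python
import csv
import io


def representatives_per_bin(tsv_handle, marker="COI-5P", min_length=200):
    """Map each BIN to the (record id, sequence) of its longest sequence."""
    best = {}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        bin_id = row.get("bin_uri") or ""
        # Remove alignment gaps before measuring length (assumption).
        seq = (row.get("nuc") or "").replace("-", "")
        if row.get("marker_code") != marker or not bin_id:
            continue
        if len(seq) < min_length:
            continue
        current = best.get(bin_id)
        # Hypothetical rule: the longest sequence represents the BIN.
        if current is None or len(seq) > len(current[1]):
            best[bin_id] = (row["processid"], seq)
    return best


# Tiny in-memory stand-in for a TSV release; all values are made up.
sample = (
    "processid\tbin_uri\tmarker_code\tnuc\n"
    "P1\tBOLD:AAA0001\tCOI-5P\tACGTACGT\n"
    "P2\tBOLD:AAA0001\tCOI-5P\tACGTACGTACGT\n"
    "P3\tBOLD:AAA0002\tCOI-5P\tACGTAC\n"
    "P4\t\tCOI-5P\tACGTACGTACGTACGT\n"
)
print(representatives_per_bin(io.StringIO(sample), min_length=5))
```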
Searching for Barcodes
Use the search command to identify barcode genes in a set of reads or pre-assembled contigs:
markadoros search -x illumina --index db/db.json reads.fq.gz
You can supply the binomial name of the taxon you expect the data to come from with --expected_taxon. In this case, markadoros will check the results and tell you how many sequences for that taxon are available in the database to match against and how many hits were found for it. If the expected taxon is found, markadoros also reports the top hit for that taxon as the best hit instead of the overall top hit.
Sometimes there are synonyms for the expected taxon that you may also wish to check. To accommodate these, you can supply either the --find-goat-synonyms flag, which pulls the names of all synonyms from GoaT, or a comma-separated list of known synonyms with --synonyms. If a hit is for a synonym, it is marked as such in the output results.
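The selection rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the tool's implementation: the function name `pick_top_result` and the hit fields `taxon`, `bitscore` and `is_synonym` are hypothetical, and the sketch assumes that a synonym hit counts as a match for the expected taxon when choosing the best hit.

```python
def pick_top_result(hits, expected_taxon=None, synonyms=()):
    """Return the best hit, preferring the expected taxon when it is present."""
    synonym_set = set(synonyms)
    annotated = []
    for hit in hits:
        hit = dict(hit)  # copy so the caller's data is left untouched
        hit["is_synonym"] = hit["taxon"] in synonym_set
        annotated.append(hit)

    # Hits for the expected taxon (or, by assumption, one of its synonyms).
    preferred = [
        h for h in annotated
        if h["taxon"] == expected_taxon or h["is_synonym"]
    ]
    # Fall back to all hits when the expected taxon was not found.
    pool = preferred or annotated
    if not pool:
        return None
    return max(pool, key=lambda h: h["bitscore"])


# Made-up hits; "Examplea synonyma" is a placeholder synonym.
hits = [
    {"taxon": "Othera speciesa", "bitscore": 900},
    {"taxon": "Halyzia sedecimguttata", "bitscore": 850},
    {"taxon": "Examplea synonyma", "bitscore": 870},
]
print(pick_top_result(hits, "Halyzia sedecimguttata"))
```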
Required options:
- -x, --type <type> - Input data type (see table below)
- -i, --index <path> - Path to database index JSON
Input type aliases:
| Platform | Accepted values |
|---|---|
| Illumina / short reads | sr, short, illumina |
| Short read RNA-seq | rnaseq |
| PacBio HiFi | pb, pacbio, pacbio_hifi |
| Oxford Nanopore | ont, nanopore, oxford_nanopore |
| Pre-assembled contigs | contigs |
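When wrapping markadoros in a script, it can help to validate the value passed to -x before launching a run. The lookup below mirrors the table above; the canonical platform names and the function `normalise_input_type` are hypothetical, and lowercasing the input is an assumption (the CLI itself may be case-sensitive).

```python
# Accepted -x values, grouped by platform as in the table above.
INPUT_TYPE_ALIASES = {
    "illumina": ("sr", "short", "illumina"),
    "rnaseq": ("rnaseq",),
    "pacbio_hifi": ("pb", "pacbio", "pacbio_hifi"),
    "oxford_nanopore": ("ont", "nanopore", "oxford_nanopore"),
    "contigs": ("contigs",),
}

# Reverse lookup: alias -> platform name used in this sketch.
_ALIAS_TO_PLATFORM = {
    alias: platform
    for platform, aliases in INPUT_TYPE_ALIASES.items()
    for alias in aliases
}


def normalise_input_type(value: str) -> str:
    """Map any accepted alias to a single platform name, or raise ValueError."""
    key = value.strip().lower()
    if key not in _ALIAS_TO_PLATFORM:
        accepted = ", ".join(sorted(_ALIAS_TO_PLATFORM))
        raise ValueError(f"unknown input type {value!r}; accepted: {accepted}")
    return _ALIAS_TO_PLATFORM[key]


print(normalise_input_type("pb"))
```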
Additional options:
- --db <name> - Search a specific database only (default: search all databases in index)
- --expected_taxon <name> - Expected taxon binomial name for validation
- -s, --find-goat-synonyms - Get the synonyms for the provided expected taxon from GoaT
- --synonyms - A comma-separated list of synonyms. Overrides --find-goat-synonyms if provided.
- -n, --nreads <N> - Limit to first N reads
- -m, --min_seq_id <float> - Minimum sequence identity for hits (default: 0.96)
- -l, --min_aln_len <int> - Minimum alignment length for hits (default: 450)
- --cleanup/--no-cleanup - Clean up temporary files (default: cleanup)
- --db-to-tmpdir/--no-db-to-tmpdir - Temporarily copy the database to the temporary directory. Can reduce IO and improve speed if multiple processes would access the database simultaneously. Not suggested if the databases include indexes. (default: True)
- -t, --threads <N> - Number of threads (default: 1)
- -p, --prefix <name> - Output file prefix (default: input filename)
- -o, --outdir <path> - Output directory (default: current directory)
- --save-contigs - Write the assembled contigs to disk in the output directory.
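As a rough picture of what --min_seq_id and --min_aln_len do, the sketch below keeps only hits that meet both thresholds. The field names `seq_id` and `aln_len` and the function names are hypothetical, and treating both thresholds as inclusive is an assumption.

```python
def passes_thresholds(hit, min_seq_id=0.96, min_aln_len=450):
    """True when a hit meets both the identity and alignment-length cutoffs."""
    # Assumption: both comparisons are inclusive.
    return hit["seq_id"] >= min_seq_id and hit["aln_len"] >= min_aln_len


def filter_hits(hits, min_seq_id=0.96, min_aln_len=450):
    """Keep the hits that pass both thresholds, preserving their order."""
    return [h for h in hits if passes_thresholds(h, min_seq_id, min_aln_len)]


# Made-up hits for illustration.
candidate_hits = [
    {"target": "t1", "seq_id": 0.99, "aln_len": 650},
    {"target": "t2", "seq_id": 0.95, "aln_len": 650},  # identity too low
    {"target": "t3", "seq_id": 0.98, "aln_len": 300},  # alignment too short
]
print(filter_hits(candidate_hits))
```

Lowering either cutoff lets more distant or shorter matches through, at the cost of less specific identifications.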
Output files:
- <prefix>.<marker>.summary.json - Results in JSON format
Output JSON file
The search subtool outputs a JSON file summarising the search. It has the following format:
{
    "input": {
        "file": <path>, // input file path
        "n_reads": <int>, // number of reads searched
        "n_aligned_reads": <int>, // number of reads aligned to database
        "marker": <string>, // marker gene in database
        "database": <path>, // path to mmseqs database
        "contig_stats": <dict>, // number, total length and n50 of assembled contigs
        "expected_taxon": <dict>, // name of asserted taxon and number of records present in database, including synonyms
    },
    "summary": {
        "n_contigs_with_hits": <int>, // number of assembled contigs with search hits
        "n_expected_taxon_hits": <int>, // number of hits for the asserted taxon
        "n_synonym_hits": <int>, // number of hits for synonyms
        "top_result": <dict>, // top result (highest bitscore) for asserted taxon if found or overall if not
        "taxon_summary": <dict>, // per-taxon summary of results - number of hits, min and max %ID and alignment lengths, and sequence of top hit
    },
    "results": <list>, // for each contig with results, a list of all hits
    "run_info": <dict> // tool information
}
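A summary file can be inspected from Python using only the top-level keys listed above. The helper below is a minimal sketch: the function name `summarise` is hypothetical, and because the inner layout of `top_result` is not documented here it is passed through unchanged rather than unpacked.

```python
import json
from pathlib import Path


def summarise(path):
    """Pull a few headline numbers out of a <prefix>.<marker>.summary.json file."""
    report = json.loads(Path(path).read_text())
    run_input = report["input"]
    summary = report["summary"]
    return {
        "marker": run_input["marker"],
        "n_reads": run_input["n_reads"],
        "n_aligned_reads": run_input["n_aligned_reads"],
        "n_contigs_with_hits": summary["n_contigs_with_hits"],
        # Structure of top_result is not assumed; returned as-is.
        "top_result": summary.get("top_result"),
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarise.py reads.COI.summary.json
    if len(sys.argv) > 1:
        print(json.dumps(summarise(sys.argv[1]), indent=2))
```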
Example Workflows
Building a marker database from the BOLD FASTA release:
markadoros database \
--header-type bold \
--outdir db/ \
--prefix BOLD \
--cluster \
--threads 16 \
BOLD_Public.20-Feb-2026.fasta.gz
Searching PacBio HiFi reads with expected taxon:
markadoros search -x pb \
--index db/db.json \
--db BOLD.COI \
--expected_taxon "Halyzia sedecimguttata" \
--threads 16 \
--nreads 20000 \
pacbio_reads.fasta.gz
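If you run many samples, the same search can be driven from Python. The sketch below only assembles the argument list, using the flags shown in the example above; the function name `build_search_command` is hypothetical, and the `subprocess.run` call is left commented out because it requires markadoros and its external dependencies to be installed.

```python
import shlex


def build_search_command(reads, index, input_type, db=None,
                         expected_taxon=None, threads=1, nreads=None):
    """Assemble the argv list for a 'markadoros search' run."""
    cmd = ["markadoros", "search", "-x", input_type, "--index", str(index)]
    if db:
        cmd += ["--db", db]
    if expected_taxon:
        cmd += ["--expected_taxon", expected_taxon]
    cmd += ["--threads", str(threads)]
    if nreads is not None:
        cmd += ["--nreads", str(nreads)]
    cmd.append(str(reads))  # input file goes last, as in the example above
    return cmd


cmd = build_search_command(
    "pacbio_reads.fasta.gz",
    index="db/db.json",
    input_type="pb",
    db="BOLD.COI",
    expected_taxon="Halyzia sedecimguttata",
    threads=16,
    nreads=20000,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# import subprocess
# subprocess.run(cmd, check=True)  # run only where markadoros is installed
```

Passing the arguments as a list avoids shell quoting problems with taxon names that contain spaces.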
Author
- Jim Downie