
Mutable registry and workflow bookkeeping for large ZipStrain/SRA profiling projects.


MetaTrawl

MetaTrawl is a mutable DuckDB project store for SRA-scale ZipStrain projects. It tracks run IDs, imports completed ZipStrain/Sylph outputs into real database tables, coordinates shared genome cache preparation, and builds ZipStrain matrix stores from selected samples.

The core idea is simple: many SRA workers can run in parallel, but one cache owner prepares genome and Prodigal outputs for shared accessions. Workers create per-sample concatenated references in scratch space, profile the sample, import the final tables into DuckDB, and then delete scratch files.

Install For Development

pip install -e ".[test]"

MetaTrawl checks for the external tools used by the full workflow: zipstrain, sylph, samtools, bowtie2, prefetch, fasterq-dump, datasets, and prodigal.

metatrawl test

Use the strict checker before long jobs. It exits non-zero if anything required is missing:

metatrawl check
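To illustrate what such a check involves, here is a minimal Python sketch that looks up the tool names listed above on PATH. It is not MetaTrawl's own implementation; the function name `missing_tools` is hypothetical.

```python
import shutil

# Tool names copied from the list above.
REQUIRED_TOOLS = [
    "zipstrain", "sylph", "samtools", "bowtie2",
    "prefetch", "fasterq-dump", "datasets", "prodigal",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


# Report rather than exit, so the sketch is safe to run anywhere.
print("missing:", missing_tools())
```

A wrapper script could exit non-zero when the returned list is not empty, mirroring the strict behaviour described above.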

Database Workflow

Initialize a project database:

metatrawl init --db metatrawl.duckdb

Add SRA run IDs:

metatrawl runs add --db metatrawl.duckdb SRR000001 SRR000002

Export only runs that are not yet complete:

metatrawl profiles remaining \
  --db metatrawl.duckdb \
  --output-file remaining_runs.csv

The CSV contains one column:

run_id
SRR000001
SRR000002
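If a scheduler or wrapper script needs the remaining run IDs, the file can be read with the standard library. This is a small sketch that assumes only the single `run_id` column shown above; `read_run_ids` is a hypothetical helper name.

```python
import csv
import io


def read_run_ids(handle):
    """Return non-empty run IDs from a CSV with a run_id header."""
    reader = csv.DictReader(handle)
    return [
        (row.get("run_id") or "").strip()
        for row in reader
        if (row.get("run_id") or "").strip()
    ]


# Inline example; in practice pass open("remaining_runs.csv", newline="").
example = io.StringIO("run_id\nSRR000001\nSRR000002\n")
print(read_run_ids(example))
```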

Run the high-level sync. It reads the remaining runs from DuckDB, runs the SRA profiling lifecycle for each one, finds the completed profile outputs, imports them into DuckDB, deletes the imported per-sample files, and logs each step:

metatrawl sync \
  --db metatrawl.duckdb \
  --cache-dir cache \
  --scratch-dir scratch \
  --sylph-db gtdb-r220-c200-dbv1.syldb \
  --output-dir outputs \
  --threads 16

MetaTrawl runs sylph profile for each sample, saves the abundance table as SRR000001.sylph.tsv, extracts nonzero GCF_.../GCA_... accessions from it, and asks the shared cache to prepare those genomes.
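The accession extraction step can be sketched in Python as follows. This is an illustration under stated assumptions, not MetaTrawl's code: the column names `Genome_file` and `Sequence_abundance` are assumed from Sylph's profile TSV, and the regular expression is one plausible way to recognise versioned GCF_/GCA_ accessions.

```python
import csv
import io
import re

# Assumed pattern for versioned accessions such as GCF_000269965.1.
ACCESSION_RE = re.compile(r"GC[AF]_\d+\.\d+")


def nonzero_accessions(handle, genome_col="Genome_file",
                       abundance_col="Sequence_abundance"):
    """Return unique accessions whose abundance is greater than zero."""
    found = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            abundance = float(row[abundance_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable abundance value
        match = ACCESSION_RE.search(row.get(genome_col) or "")
        if abundance > 0 and match and match.group(0) not in found:
            found.append(match.group(0))
    return found


# Made-up two-row table used only to exercise the function.
table = io.StringIO(
    "Sample_file\tGenome_file\tSequence_abundance\n"
    "SRR1.fastq\tdb/GCF_000269965.1_ASM26996v1_genomic.fna.gz\t12.5\n"
    "SRR1.fastq\tdb/GCA_000001405.15_x.fna.gz\t0\n"
)
print(nonzero_accessions(table))
```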

--accessions-dir is still available as a manual override. It should contain one accession list per run, produced by Sylph or another genome preselection step:

accessions/SRR000001.accessions.txt
accessions/SRR000002.accessions.csv

Each file can be a plain one-accession-per-line text file or a CSV with an accession column.
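A reader for both accepted layouts might look like the sketch below. It assumes the layout is chosen by file extension, which the text above does not state; `parse_accessions` and `read_accessions` are hypothetical names.

```python
import csv
from pathlib import Path


def parse_accessions(text, is_csv):
    """Parse accession text: CSV with an accession column, or one per line."""
    if is_csv:
        rows = csv.DictReader(text.splitlines())
        values = [(row.get("accession") or "") for row in rows]
    else:
        values = text.splitlines()
    return [value.strip() for value in values if value.strip()]


def read_accessions(path):
    """Read an accession list, treating .csv files as CSV (an assumption)."""
    path = Path(path)
    return parse_accessions(path.read_text(), path.suffix.lower() == ".csv")
```

For example, `read_accessions("accessions/SRR000001.accessions.txt")` would return the listed accessions with blank lines skipped.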

sync expects per-run outputs in --output-dir using these conventional names:

SRR000001.profile.parquet
SRR000001.genome_stats.parquet
SRR000001.gene_stats.parquet      # optional
SRR000001.sylph.csv               # may be .csv, .tsv, or .parquet

After a successful import, these per-sample outputs are removed because DuckDB is the durable project store. The durable cache is left intact. Use --keep-profile-outputs only when debugging a failed or suspicious run.
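To preview which runs in an output directory look ready for import, the naming convention above can be checked directly. This sketch is not how sync itself discovers files; it assumes the Sylph table may end in `.sylph.csv`, `.sylph.tsv`, or `.sylph.parquet`, and `importable_runs` is a hypothetical name.

```python
# Suffixes taken from the conventional names listed above.
REQUIRED_SUFFIXES = (".profile.parquet", ".genome_stats.parquet")
SYLPH_SUFFIXES = (".sylph.csv", ".sylph.tsv", ".sylph.parquet")  # assumed set


def importable_runs(filenames):
    """Return run IDs that have every required per-run output file."""
    names = set(filenames)
    marker = ".profile.parquet"
    run_ids = {name[: -len(marker)] for name in names if name.endswith(marker)}
    ready = []
    for run_id in sorted(run_ids):
        has_required = all(run_id + suffix in names for suffix in REQUIRED_SUFFIXES)
        has_sylph = any(run_id + suffix in names for suffix in SYLPH_SUFFIXES)
        if has_required and has_sylph:
            ready.append(run_id)
    return ready


# Usage with a real directory:
#   from pathlib import Path
#   importable_runs(p.name for p in Path("outputs").iterdir())
```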

After profiling, import completed outputs into DuckDB tables:

metatrawl profiles import \
  --db metatrawl.duckdb \
  --run-id SRR000001 \
  --profile-file outputs/SRR000001.profile.parquet \
  --genome-stats-file outputs/SRR000001.genome_stats.parquet \
  --gene-stats-file outputs/SRR000001.gene_stats.parquet \
  --sylph-abundance-file outputs/SRR000001.sylph.csv

Or import many samples from a manifest:

metatrawl profiles add \
  --db metatrawl.duckdb \
  --manifest completed_profiles.csv

Manifest columns:

run_id,profile_file,genome_stats_file,gene_stats_file,sylph_abundance_file
SRR000001,/path/profile.parquet,/path/genome_stats.parquet,/path/gene_stats.parquet,/path/sylph.csv

gene_stats_file is optional. A run is complete after profile positions, genome stats, and Sylph abundance have been imported.
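A manifest with these columns can be generated with the standard library. In this sketch a missing gene_stats_file is written as an empty cell; whether MetaTrawl accepts an empty cell for the optional column is an assumption, and `manifest_text` is a hypothetical helper.

```python
import csv
import io

# Column order copied from the manifest header above.
MANIFEST_COLUMNS = [
    "run_id", "profile_file", "genome_stats_file",
    "gene_stats_file", "sylph_abundance_file",
]


def manifest_text(rows):
    """Render manifest rows (dicts) as CSV text, blank for missing keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANIFEST_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row.get(col, "") for col in MANIFEST_COLUMNS})
    return buffer.getvalue()


print(manifest_text([{
    "run_id": "SRR000001",
    "profile_file": "/path/profile.parquet",
    "genome_stats_file": "/path/genome_stats.parquet",
    "sylph_abundance_file": "/path/sylph.csv",
}]))
```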

Cache Workflow

Prepare one sample reference from accessions using a shared cache:

metatrawl cache prepare \
  --cache-dir cache \
  --accessions accessions.csv \
  --output-dir scratch/SRR000001/reference

For parallel workers, start a local cache server:

metatrawl cache serve \
  --cache-dir cache \
  --host 127.0.0.1 \
  --port 8765

The cache keeps only durable per-accession files:

cache/genomes/GCF_xxx.fna
cache/genes/GCF_xxx.genes.fna

Per-sample concatenated references are scratch outputs and should be deleted after import.
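Given that layout, a script can work out which accessions still need preparing before a worker starts. The sketch below assumes file names are exactly `<accession>.fna` and `<accession>.genes.fna`, following the pattern shown above; both function names are hypothetical.

```python
from pathlib import Path


def cache_paths(cache_dir, accession):
    """Return the (genome, genes) paths for one accession in the cache."""
    cache_dir = Path(cache_dir)
    return (
        cache_dir / "genomes" / f"{accession}.fna",
        cache_dir / "genes" / f"{accession}.genes.fna",
    )


def missing_from_cache(cache_dir, accessions):
    """Return accessions lacking either the genome or the genes file."""
    return [
        accession
        for accession in accessions
        if not all(path.is_file() for path in cache_paths(cache_dir, accession))
    ]
```

This only inspects the filesystem. It does not coordinate with the cache server, so it should be treated as a read-only diagnostic.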

SRA Worker Lifecycle

profile-sra runs the worker lifecycle for each remaining run and cleans up scratch space afterwards:

metatrawl profile-sra \
  --db metatrawl.duckdb \
  --remaining-csv remaining_runs.csv \
  --cache-dir cache \
  --scratch-dir scratch \
  --threads 8

Long-running steps emit compact cluster-friendly logs:

METATRAWL sample=SRR123 step=sylph status=done genomes=12 elapsed=4.2s
METATRAWL sample=SRR123 step=cache status=done accessions=10 elapsed=28.9s
METATRAWL sample=SRR123 step=cleanup status=done removed=scratch/SRR123
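Lines in this shape are easy to parse when summarising cluster logs. The following sketch assumes every field is a space-separated `key=value` pair with no spaces inside values, as in the examples above; `parse_log_line` is a hypothetical helper.

```python
def parse_log_line(line):
    """Parse a METATRAWL log line into a dict, or return None if it is not one."""
    parts = line.split()
    if not parts or parts[0] != "METATRAWL":
        return None
    fields = {}
    for part in parts[1:]:
        key, sep, value = part.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


print(parse_log_line(
    "METATRAWL sample=SRR123 step=cache status=done accessions=10 elapsed=28.9s"
))
```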

Matrix Workflow

Build a ZipStrain matrix from complete DuckDB samples. Thresholds are applied before the temporary profile Parquet files are exported:

metatrawl matrix build \
  --db metatrawl.duckdb \
  --genome GCF_000269965.1_ASM26996v1_genomic.fna \
  --bed-file reference/genomes.bed \
  --stb-file reference/genomes.stb \
  --output-file matrices/binfantis.h5 \
  --min-coverage 1 \
  --min-breadth 0.2 \
  --min-ber 0.77 \
  --min-sylph-abundance 0.001 \
  --sparse

Append newly imported complete samples to a registered matrix:

metatrawl matrix append \
  --db metatrawl.duckdb \
  --matrix-id binfantis

Run comparisons on a registered matrix and write the results to a DuckDB file:

metatrawl matrix compare \
  --db metatrawl.duckdb \
  --matrix-id binfantis \
  --output-file compares/binfantis.duckdb \
  --calculate all

