Mutable registry and workflow bookkeeping for large ZipStrain/SRA profiling projects.

MetaTrawl

MetaTrawl is a mutable DuckDB project store for SRA-scale ZipStrain projects. It tracks run IDs, imports completed ZipStrain/Sylph outputs into real database tables, coordinates shared genome cache preparation, and builds ZipStrain matrix stores from selected samples.

The core idea is simple: many SRA workers can run in parallel, but one cache owner prepares genome and Prodigal outputs for shared accessions. Workers create per-sample concatenated references in scratch space, profile the sample, import the final tables into DuckDB, and then delete scratch files.

Install For Development

pip install -e ".[test]"

MetaTrawl can check for the external tools used by the full workflow: zipstrain, sylph, samtools, bowtie2, prefetch, fasterq-dump, datasets, and prodigal. Run the check with:

metatrawl test

Use the strict checker before long jobs. It exits non-zero if anything required is missing:

metatrawl check
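
For scripts that want a similar pre-flight check without calling the CLI, here is a minimal sketch using only the Python standard library. The tool names come from the list above; the function `missing_tools` is a hypothetical helper for illustration and is not how MetaTrawl itself performs the check.

```python
import shutil

# Tool names taken from the list above. The helper below is a
# hypothetical illustration, not MetaTrawl's own implementation.
REQUIRED_TOOLS = [
    "zipstrain", "sylph", "samtools", "bowtie2",
    "prefetch", "fasterq-dump", "datasets", "prodigal",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH, in input order."""
    return [tool for tool in tools if which(tool) is None]
```

A job script could call `missing_tools()` and stop early if the returned list is not empty, which mirrors the non-zero exit of the strict checker.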

Database Workflow

Initialize a project database:

metatrawl init --db metatrawl.duckdb

Add SRA run IDs:

metatrawl runs add --db metatrawl.duckdb SRR000001 SRR000002

Export only runs that are not yet complete:

metatrawl profiles remaining \
  --db metatrawl.duckdb \
  --output-file remaining_runs.csv

The CSV contains one column:

run_id
SRR000001
SRR000002
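
A downstream script can read this file with the standard `csv` module. The sketch below assumes only the single `run_id` header shown above; `read_run_ids` is a hypothetical helper name.

```python
import csv
import io


def read_run_ids(handle):
    """Read run IDs from a CSV that has a single `run_id` header column."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or "run_id" not in reader.fieldnames:
        raise ValueError("expected a 'run_id' column")
    # Strip stray whitespace and drop empty values.
    ids = (row["run_id"].strip() for row in reader if row["run_id"])
    return [run_id for run_id in ids if run_id]


# In-memory stand-in for remaining_runs.csv.
example = io.StringIO("run_id\nSRR000001\nSRR000002\n")
run_ids = read_run_ids(example)
print(run_ids)
```

For a real file, pass `open("remaining_runs.csv", newline="")` instead of the in-memory example.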

Run the high-level sync. This gets remaining runs from DuckDB, runs the SRA profiling lifecycle, finds completed profile outputs, imports them into DuckDB, deletes the imported per-sample files, and logs each step:

metatrawl sync \
  --db metatrawl.duckdb \
  --cache-dir cache \
  --scratch-dir scratch \
  --output-dir outputs \
  --threads 16

sync expects per-run outputs in --output-dir using these conventional names:

SRR000001.profile.parquet
SRR000001.genome_stats.parquet
SRR000001.gene_stats.parquet      # optional
SRR000001.sylph.csv               # csv, tsv, or parquet

After a successful import, these per-sample outputs are removed because DuckDB is the durable project store. The durable cache is left intact. Use --keep-profile-outputs only when debugging a failed or suspicious run.
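
To see which runs have a full set of outputs before an import, the naming convention above can be checked with `pathlib`. This is a sketch under assumptions: only `.sylph.csv` appears in the listing, so the `.sylph.tsv` and `.sylph.parquet` names are guesses, and `importable_runs` is a hypothetical helper rather than part of MetaTrawl.

```python
import tempfile
from pathlib import Path

REQUIRED_SUFFIXES = (".profile.parquet", ".genome_stats.parquet")
# Only ".sylph.csv" is shown in the listing; the other two names are assumed.
SYLPH_SUFFIXES = (".sylph.csv", ".sylph.tsv", ".sylph.parquet")


def importable_runs(output_dir, run_ids):
    """Return run IDs whose required output files all exist in output_dir."""
    output_dir = Path(output_dir)
    ready = []
    for run_id in run_ids:
        has_required = all(
            (output_dir / f"{run_id}{suffix}").is_file()
            for suffix in REQUIRED_SUFFIXES
        )
        has_sylph = any(
            (output_dir / f"{run_id}{suffix}").is_file()
            for suffix in SYLPH_SUFFIXES
        )
        # gene_stats is optional, so it is not checked here.
        if has_required and has_sylph:
            ready.append(run_id)
    return ready


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "SRR000001.profile.parquet",
        "SRR000001.genome_stats.parquet",
        "SRR000001.sylph.csv",
        "SRR000002.profile.parquet",  # incomplete run
    ):
        (Path(tmp) / name).touch()
    ready = importable_runs(tmp, ["SRR000001", "SRR000002"])
print(ready)
```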

After profiling, import one run's completed outputs into DuckDB tables:

metatrawl profiles import \
  --db metatrawl.duckdb \
  --run-id SRR000001 \
  --profile-file outputs/SRR000001.profile.parquet \
  --genome-stats-file outputs/SRR000001.genome_stats.parquet \
  --gene-stats-file outputs/SRR000001.gene_stats.parquet \
  --sylph-abundance-file outputs/SRR000001.sylph.csv

Or import many samples from a manifest:

metatrawl profiles add \
  --db metatrawl.duckdb \
  --manifest completed_profiles.csv

Manifest columns:

run_id,profile_file,genome_stats_file,gene_stats_file,sylph_abundance_file
SRR000001,/path/profile.parquet,/path/genome_stats.parquet,/path/gene_stats.parquet,/path/sylph.csv

gene_stats_file is optional. A run is complete after profile positions, genome stats, and Sylph abundance have been imported.
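
A manifest with these columns can be generated with `csv.DictWriter`. In this sketch a missing `gene_stats_file` is written as an empty cell; that is an assumption about how the optional column is expressed, and `write_manifest` is a hypothetical helper.

```python
import csv
import io

MANIFEST_COLUMNS = [
    "run_id",
    "profile_file",
    "genome_stats_file",
    "gene_stats_file",
    "sylph_abundance_file",
]


def write_manifest(handle, rows):
    """Write manifest rows; keys missing from a row become empty cells."""
    writer = csv.DictWriter(
        handle, fieldnames=MANIFEST_COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_manifest(buffer, [{
    "run_id": "SRR000001",
    "profile_file": "/path/profile.parquet",
    "genome_stats_file": "/path/genome_stats.parquet",
    # gene_stats_file omitted: assumed to be expressed as an empty cell
    "sylph_abundance_file": "/path/sylph.csv",
}])
manifest_text = buffer.getvalue()
print(manifest_text)
```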

Cache Workflow

Prepare one sample reference from accessions using a shared cache:

metatrawl cache prepare \
  --cache-dir cache \
  --accessions accessions.csv \
  --output-dir scratch/SRR000001/reference

For parallel workers, start a local cache server:

metatrawl cache serve \
  --cache-dir cache \
  --host 127.0.0.1 \
  --port 8765
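
The reason for a single cache owner is that several workers may request the same accession at once, and the expensive preparation should run only once. The sketch below shows that idea with threads inside one process. It does not reproduce the cache server or its protocol; `SingleOwnerCache`, `fake_prepare`, and the accession names are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SingleOwnerCache:
    """Run an expensive prepare step at most once per accession."""

    def __init__(self, prepare):
        self._prepare = prepare
        self._guard = threading.Lock()  # protects the two dicts below
        self._locks = {}
        self._results = {}

    def get(self, accession):
        with self._guard:
            lock = self._locks.setdefault(accession, threading.Lock())
        # Only one caller per accession gets past this lock at a time.
        with lock:
            with self._guard:
                done = accession in self._results
            if not done:
                result = self._prepare(accession)
                with self._guard:
                    self._results[accession] = result
            with self._guard:
                return self._results[accession]


calls = []


def fake_prepare(accession):
    # Stand-in for downloading a genome and calling genes on it.
    calls.append(accession)
    return f"cache/genomes/{accession}.fna"


cache = SingleOwnerCache(fake_prepare)
requests = ["GCF_A", "GCF_B", "GCF_A", "GCF_A", "GCF_B"]
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(cache.get, requests))
print(sorted(calls))
```

Each accession is prepared once even though it is requested several times. Across separate worker processes the same guarantee needs a shared coordinator, which is the role the cache server plays.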

The cache keeps only durable per-accession files:

cache/genomes/GCF_xxx.fna
cache/genes/GCF_xxx.genes.fna

Per-sample concatenated references are scratch outputs and should be deleted after import.
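
Given that layout, a script can work out which accessions still need preparation. The sketch assumes the file names are the accession followed by `.fna` and `.genes.fna`, as the listing suggests; `cache_paths`, `uncached`, and the `GCF_example*` names are hypothetical.

```python
import tempfile
from pathlib import Path


def cache_paths(cache_dir, accession):
    """Return the (genome, genes) paths for one accession (assumed naming)."""
    cache_dir = Path(cache_dir)
    return (
        cache_dir / "genomes" / f"{accession}.fna",
        cache_dir / "genes" / f"{accession}.genes.fna",
    )


def uncached(cache_dir, accessions):
    """Return accessions that lack the genome file, the genes file, or both."""
    return [
        accession
        for accession in accessions
        if not all(path.is_file() for path in cache_paths(cache_dir, accession))
    ]


# Demo with placeholder files in a temporary cache directory.
with tempfile.TemporaryDirectory() as tmp:
    genome, genes = cache_paths(tmp, "GCF_example1")
    genome.parent.mkdir(parents=True)
    genes.parent.mkdir(parents=True)
    genome.touch()
    genes.touch()
    # GCF_example2 has a genome but no gene calls yet.
    cache_paths(tmp, "GCF_example2")[0].touch()
    pending = uncached(tmp, ["GCF_example1", "GCF_example2", "GCF_example3"])
print(pending)
```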

SRA Worker Lifecycle

profile-sra runs the worker lifecycle for the runs listed in the remaining-runs CSV and cleans up scratch space afterwards:

metatrawl profile-sra \
  --db metatrawl.duckdb \
  --remaining-csv remaining_runs.csv \
  --cache-dir cache \
  --scratch-dir scratch \
  --threads 8

Long-running steps emit compact cluster-friendly logs:

METATRAWL sample=SRR123 step=sylph status=done genomes=12 elapsed=4.2s
METATRAWL sample=SRR123 step=cache status=done accessions=10 elapsed=28.9s
METATRAWL sample=SRR123 step=cleanup status=done removed=scratch/SRR123
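
Because each line is a `METATRAWL` prefix followed by `key=value` pairs, the logs are easy to parse when summarizing a cluster job. A minimal sketch, assuming values never contain spaces (true of the examples above, not otherwise confirmed); `parse_log_line` is a hypothetical helper.

```python
def parse_log_line(line):
    """Parse one 'METATRAWL key=value ...' line into a dict of strings.

    Returns None for lines that do not start with the METATRAWL prefix.
    """
    prefix, _, rest = line.strip().partition(" ")
    if prefix != "METATRAWL":
        return None
    fields = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


record = parse_log_line(
    "METATRAWL sample=SRR123 step=sylph status=done genomes=12 elapsed=4.2s"
)
print(record["step"], record["elapsed"])
```

Values are kept as strings; converting `elapsed` to seconds or `genomes` to an integer is left to the caller.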

Matrix Workflow

Build a ZipStrain matrix from complete DuckDB samples. Thresholds are applied before temporary profile parquets are exported:

metatrawl matrix build \
  --db metatrawl.duckdb \
  --genome GCF_000269965.1_ASM26996v1_genomic.fna \
  --bed-file reference/genomes.bed \
  --stb-file reference/genomes.stb \
  --output-file matrices/binfantis.h5 \
  --min-coverage 1 \
  --min-breadth 0.2 \
  --min-ber 0.77 \
  --min-sylph-abundance 0.001 \
  --sparse

Append newly imported complete samples to a registered matrix:

metatrawl matrix append \
  --db metatrawl.duckdb \
  --matrix-id binfantis

Compare a registered matrix:

metatrawl matrix compare \
  --db metatrawl.duckdb \
  --matrix-id binfantis \
  --output-file compares/binfantis.duckdb \
  --calculate all
