Mutable registry and workflow bookkeeping for large ZipStrain/SRA profiling projects.
MetaTrawl
MetaTrawl is a mutable DuckDB project store for SRA-scale ZipStrain projects. It tracks run IDs, imports completed ZipStrain/Sylph outputs into real database tables, coordinates shared genome cache preparation, and builds ZipStrain matrix stores from selected samples.
The core idea is simple: many SRA workers can run in parallel, but one cache owner prepares genome and Prodigal outputs for shared accessions. Workers create per-sample concatenated references in scratch space, profile the sample, import the final tables into DuckDB, and then delete scratch files.
Install For Development
pip install -e ".[test]"
MetaTrawl checks for the external tools used by the full workflow:
zipstrain, sylph, samtools, bowtie2, prefetch, fasterq-dump,
datasets, and prodigal.
metatrawl test
Use the strict checker before long jobs. It exits non-zero if anything required is missing:
metatrawl check
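The same tool list can be checked from Python before submitting jobs. This is a minimal sketch, not MetaTrawl's own checker: it only tests whether each executable is on PATH, and the function name `missing_tools` is made up for illustration.

```python
import shutil

# Tool names copied from the list above.
REQUIRED_TOOLS = (
    "zipstrain", "sylph", "samtools", "bowtie2",
    "prefetch", "fasterq-dump", "datasets", "prodigal",
)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty result from `missing_tools()` means every executable was found. It says nothing about tool versions, so `metatrawl check` remains the authoritative test.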
Database Workflow
Initialize a project database:
metatrawl init --db metatrawl.duckdb
Add SRA run IDs:
metatrawl runs add --db metatrawl.duckdb SRR000001 SRR000002
Export only runs that are not yet complete:
metatrawl profiles remaining \
--db metatrawl.duckdb \
--output-file remaining_runs.csv
The CSV contains one column:
run_id
SRR000001
SRR000002
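If a scheduler script needs the run IDs, the CSV can be read with the standard library. A minimal sketch, assuming only the single `run_id` column shown above; `read_run_ids` is an illustrative name, not part of MetaTrawl.

```python
import csv


def read_run_ids(csv_path):
    """Read the run_id column from a remaining-runs CSV, skipping blank values."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if "run_id" not in (reader.fieldnames or []):
            raise ValueError(f"{csv_path} has no run_id column")
        return [
            row["run_id"].strip()
            for row in reader
            if row["run_id"] and row["run_id"].strip()
        ]
```

The returned list can be split into chunks for array jobs, one chunk per worker.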
Run the high-level sync. It reads the remaining runs from DuckDB, runs the SRA profiling lifecycle for each one, finds the completed profile outputs, imports them into DuckDB, deletes the imported per-sample files, and logs each step:
metatrawl sync \
--db metatrawl.duckdb \
--cache-dir cache \
--scratch-dir scratch \
--sylph-db /path/to/gtdb-r220-c200-dbv1.syldb \
--output-dir outputs \
--threads 16
MetaTrawl runs sylph profile for each sample, saves the abundance table as
SRR000001.sylph.tsv, extracts nonzero GCF_.../GCA_... accessions from it,
and asks the shared cache to prepare those genomes.
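The accession extraction step can be approximated as follows. This is a sketch under stated assumptions, not MetaTrawl's implementation: it assumes the Sylph table is tab-separated with a `Genome_file` column holding a path that contains the accession and a `Sequence_abundance` column holding a number. Both column names are parameters so they can be adjusted to the actual table.

```python
import csv
import re

# Assembly accession such as GCF_000269965.1 (GCF_ or GCA_ prefix).
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def nonzero_accessions(tsv_path, genome_col="Genome_file",
                       abundance_col="Sequence_abundance"):
    """Collect unique accessions from rows whose abundance is above zero.

    The column names are assumptions about the Sylph output layout.
    """
    found = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            match = ACCESSION_RE.search(row.get(genome_col) or "")
            try:
                abundance = float(row.get(abundance_col) or 0)
            except ValueError:
                continue  # skip rows with a non-numeric abundance
            if match and abundance > 0 and match.group(0) not in found:
                found.append(match.group(0))
    return found
```

Accessions are returned in first-seen order, which keeps the result stable between runs on the same table.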
Use an absolute --sylph-db path when possible. MetaTrawl validates the file
before launching SRA workers.
--accessions-dir is still available as a manual override. It should contain
one accession list per run, produced by Sylph or another genome preselection
step:
accessions/SRR000001.accessions.txt
accessions/SRR000002.accessions.csv
Each file can be a plain one-accession-per-line text file or a CSV with an
accession column.
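Both accepted file shapes can be read with one small helper. A minimal sketch, assuming the format is chosen by file extension (the text above does not say how MetaTrawl detects it); `read_accessions` is an illustrative name.

```python
import csv
from pathlib import Path


def read_accessions(path):
    """Read accessions from a one-per-line text file or a CSV with an
    'accession' column. Assumption: '.csv' files are CSV, anything else is
    plain text."""
    path = Path(path)
    if path.suffix.lower() == ".csv":
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            if "accession" not in (reader.fieldnames or []):
                raise ValueError(f"{path} has no accession column")
            values = [row["accession"] for row in reader]
    else:
        values = path.read_text().splitlines()
    # Drop blank entries and surrounding whitespace.
    return [value.strip() for value in values if value and value.strip()]
```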
sync expects per-run outputs in --output-dir using these conventional names:
SRR000001.profile.parquet
SRR000001.genome_stats.parquet
SRR000001.gene_stats.parquet # optional
SRR000001.sylph.csv # csv, tsv, or parquet
After a successful import, these per-sample outputs are removed because DuckDB is
the durable project store. The durable cache is left intact. Use
--keep-profile-outputs only when debugging a failed or suspicious run.
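Before a sync it can be useful to see which runs already have a full set of outputs under the conventional names. The helper below is a sketch of that lookup, not the code sync uses; the function names and the returned dictionary keys are made up for illustration.

```python
from pathlib import Path

# The Sylph table may be csv, tsv, or parquet, as noted above.
SYLPH_SUFFIXES = (".csv", ".tsv", ".parquet")


def find_run_outputs(output_dir, run_id):
    """Map each conventional output to its path, or None when it is absent."""
    base = Path(output_dir)

    def existing(name):
        path = base / name
        return path if path.is_file() else None

    sylph = None
    for suffix in SYLPH_SUFFIXES:
        sylph = existing(f"{run_id}.sylph{suffix}")
        if sylph is not None:
            break
    return {
        "profile": existing(f"{run_id}.profile.parquet"),
        "genome_stats": existing(f"{run_id}.genome_stats.parquet"),
        "gene_stats": existing(f"{run_id}.gene_stats.parquet"),  # optional
        "sylph": sylph,
    }


def is_importable(outputs):
    """True when every non-optional output exists."""
    return all(outputs[key] is not None
               for key in ("profile", "genome_stats", "sylph"))
```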
After profiling, import completed outputs into DuckDB tables:
metatrawl profiles import \
--db metatrawl.duckdb \
--run-id SRR000001 \
--profile-file outputs/SRR000001.profile.parquet \
--genome-stats-file outputs/SRR000001.genome_stats.parquet \
--gene-stats-file outputs/SRR000001.gene_stats.parquet \
--sylph-abundance-file outputs/SRR000001.sylph.csv
Or import many samples from a manifest:
metatrawl profiles add \
--db metatrawl.duckdb \
--manifest completed_profiles.csv
Manifest columns:
run_id,profile_file,genome_stats_file,gene_stats_file,sylph_abundance_file
SRR000001,/path/profile.parquet,/path/genome_stats.parquet,/path/gene_stats.parquet,/path/sylph.csv
gene_stats_file is optional. A run is complete after profile positions, genome
stats, and Sylph abundance have been imported.
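A manifest can be sanity-checked before import by looking for rows that lack a required value. This sketch only applies the rule stated above (gene_stats_file may be empty, the other columns may not); it does not check that the files exist, and `incomplete_manifest_rows` is an illustrative name.

```python
import csv

# Every manifest column except the optional gene_stats_file.
REQUIRED_COLUMNS = (
    "run_id", "profile_file", "genome_stats_file", "sylph_abundance_file",
)


def incomplete_manifest_rows(manifest_path):
    """Return (line_number, missing_columns) for rows lacking a required value."""
    problems = []
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            missing = [column for column in REQUIRED_COLUMNS
                       if not (row.get(column) or "").strip()]
            if missing:
                # line_num counts physical lines, so the header is line 1.
                problems.append((reader.line_num, missing))
    return problems
```

An empty list means every row carries the values needed for a run to count as complete.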
Cache Workflow
Prepare one sample reference from accessions using a shared cache:
metatrawl cache prepare \
--cache-dir cache \
--accessions accessions.csv \
--output-dir scratch/SRR000001/reference
For parallel workers, start a local cache server:
metatrawl cache serve \
--cache-dir cache \
--host 127.0.0.1 \
--port 8765
The cache keeps only durable per-accession files:
cache/genomes/GCF_xxx.fna
cache/genes/GCF_xxx.genes.fna
Per-sample concatenated references are scratch outputs and should be deleted after import.
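To illustrate the split between durable cache files and scratch references, the sketch below concatenates cached genomes into one per-sample FASTA. It is not MetaTrawl's implementation: it assumes each cached genome is stored as `<cache-dir>/genomes/<accession>.fna`, following the layout shown above, and the function name is hypothetical.

```python
from pathlib import Path


def concatenate_reference(cache_dir, accessions, output_fasta):
    """Concatenate cached per-accession genomes into one scratch FASTA.

    Assumes files are named <accession>.fna under <cache_dir>/genomes.
    """
    genomes = Path(cache_dir) / "genomes"
    sources = [genomes / f"{accession}.fna" for accession in accessions]
    missing = [str(path) for path in sources if not path.is_file()]
    if missing:
        raise FileNotFoundError("not in cache: " + ", ".join(missing))

    output_fasta = Path(output_fasta)
    output_fasta.parent.mkdir(parents=True, exist_ok=True)
    with output_fasta.open("wb") as out:
        for source in sources:
            data = source.read_bytes()
            out.write(data)
            # Keep the next header on its own line if a file lacks a final newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return output_fasta
```

The cache files are only read, never modified, so several workers can build their scratch references from the same cache at once.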
SRA Worker Lifecycle
profile-sra runs the worker lifecycle for the runs listed in the remaining-runs
CSV and cleans up scratch space afterwards:
metatrawl profile-sra \
--db metatrawl.duckdb \
--remaining-csv remaining_runs.csv \
--cache-dir cache \
--scratch-dir scratch \
--threads 8
Long-running steps emit compact cluster-friendly logs:
METATRAWL sample=SRR123 step=sylph status=done genomes=12 elapsed=4.2s
METATRAWL sample=SRR123 step=cache status=done accessions=10 elapsed=28.9s
METATRAWL sample=SRR123 step=cleanup status=done removed=scratch/SRR123
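Because each log line is a prefix followed by space-separated key=value pairs, it is easy to parse when summarizing cluster logs. A minimal sketch based only on the three example lines above; values containing spaces are not handled.

```python
def parse_log_line(line, prefix="METATRAWL"):
    """Split a 'METATRAWL key=value ...' line into a dict.

    Returns None for lines that do not start with the prefix.
    """
    parts = line.split()
    if not parts or parts[0] != prefix:
        return None
    fields = {}
    for token in parts[1:]:
        key, separator, value = token.partition("=")
        if separator:  # ignore tokens without '='
            fields[key] = value
    return fields
```

All values stay as strings, so a field such as `elapsed=4.2s` needs its unit stripped before any arithmetic.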
Matrix Workflow
Build a ZipStrain matrix from samples that are marked complete in DuckDB. The thresholds are applied before the temporary profile parquet files are exported:
metatrawl matrix build \
--db metatrawl.duckdb \
--genome GCF_000269965.1_ASM26996v1_genomic.fna \
--bed-file reference/genomes.bed \
--stb-file reference/genomes.stb \
--output-file matrices/binfantis.h5 \
--min-coverage 1 \
--min-breadth 0.2 \
--min-ber 0.77 \
--min-sylph-abundance 0.001 \
--sparse
Append newly imported complete samples to a registered matrix:
metatrawl matrix append \
--db metatrawl.duckdb \
--matrix-id binfantis
Compare a registered matrix:
metatrawl matrix compare \
--db metatrawl.duckdb \
--matrix-id binfantis \
--output-file compares/binfantis.duckdb \
--calculate all
File details
Details for the file metatrawl-0.1.2.tar.gz.
File metadata
- Download URL: metatrawl-0.1.2.tar.gz
- Upload date:
- Size: 28.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.4 CPython/3.14.0 Darwin/24.5.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8219a2a79009770c362d03994ae0934c12cc62ade298428c34bca95d4ae8da3e |
| MD5 | 701f37cea197449d6fb0e614c07529d8 |
| BLAKE2b-256 | 99cc082921e64fc0ad272f5b37516374623733cf28125e1db5a01f50c188784f |
File details
Details for the file metatrawl-0.1.2-py3-none-any.whl.
File metadata
- Download URL: metatrawl-0.1.2-py3-none-any.whl
- Upload date:
- Size: 23.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.4 CPython/3.14.0 Darwin/24.5.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7624c2c401dbb8549aec43f23b29f6db2a684e4f37b49ed04234bcdd2c3fcac6 |
| MD5 | 0c0fd4462d018a3a57112f53bff6f89c |
| BLAKE2b-256 | f2205d4d3eee117a80ede43a98c0e113aa70300788882e98379dc03d6c68d27a |