CHIMERA (Configurable Hybrid In-silico Metagenome Emulator for Read Analysis): download genomes from NCBI or use your own, build simulated metagenome FASTAs for training classifiers.

CHIMERA

Configurable Hybrid In-silico Metagenome Emulator for Read Analysis — build reproducible, ground-truth-labelled synthetic metagenomes from NCBI RefSeq (or your own genomes) for training and benchmarking sequence classifiers.

CHIMERA fragments reference genomes into fixed-length reads or variable-length contigs, optionally applies sequencing-error and mutation models, and writes a single FASTA or FASTQ metagenome file whose read IDs remain traceable to their source genome. It is designed around reproducibility (date-stamped accession snapshots, deterministic seeds), train/test hygiene (temporal splits by NCBI CreateDate, BLAST-based similarity filtering), and extensibility (drop-in support for your own in-house genomes).

Highlights
Installation
Quick start
How CHIMERA stays reproducible
Pre-built snapshots and viral reference
Use cases at a glance
Choosing workflows
Extended usage
Command reference
Capabilities summary
Documentation
Development
Contributing
Citation
License

Highlights

End-to-end pipeline. Download NCBI RefSeq genomes (bacteria, virus, archaea, plasmid) and produce a ready-to-train metagenome FASTA/FASTQ with a single command.
Bring your own genomes. Drop FASTA files into bacteria/, virus/, archaea/, or plasmid/ folders and CHIMERA treats them identically to NCBI downloads — mix both if you want.
Ground-truth labels. Every read ID encodes its source category and accession (virus_NC_001798.2_read_17, bacteria_NC_000913.3_contig_3, _seg{N} for multi-record FASTAs); optional _abundance.txt gives per-genome read counts and proportions.
Reproducible by default. Date-stamped accession snapshots (snapshots/accession_snapshot_YYYY-MM-DD.json) freeze NCBI's catalog; combine with --sample-seed and --seed and two runs months apart produce the same bytes.
Train/test hygiene. Temporal splits by NCBI CreateDate and BLAST-based similarity filtering (filter-test-against-train) prevent train→test leakage from near-identical strains.
Realistic reads. Fixed- or variable-length chunking, single-end or paired-end reads (--paired, --insert-size), multi-length output in one run (--multi-length 300,500,1000,3000), coverage-depth model with log-normal inter-genome variability (--coverage, --coverage-cv), Illumina + long-read error models (--error-model {illumina,nanopore,pacbio-hifi,pacbio-clr} with homopolymer-aware nanopore indels), library-prep artefacts (--chimera-rate, --pcr-duplicate-rate), uniform or exponential per-genome abundance, and per-base Phred qualities in FASTQ mode.
Gold-standard labels in headers. --embed-taxonomy writes tax=<group> into every record description so supervised trainers can read labels straight from the FASTA/FASTQ.
EVE / prophage exclusion. BLAST non-viral genomes against a viral reference and exclude hits when chunking, so endogenous viral elements don't leak into the "bacteria" label.
Structured benchmarks. benchmark-recipe generates N replicates of fixed per-category size from one snapshot, selecting genome sets that are maximally diverse (genome-level BLAST scoring).
Pre-built assets. Up-to-date snapshots ship in snapshots/; viral-reference BLAST DBs ship as GitHub Release assets with SHA-256 manifests.
Scales to large studies. genome-pool prepare + materialize share one heavy download across many experiments via symlinks.

Installation

From PyPI (recommended)

pip install chimera-metagenome-generator

From source

git clone https://github.com/Alexander-Mitrofanov/MetagenomeGenerator.git
cd MetagenomeGenerator
pip install -e .

With BLAST+ (needed for EVE removal and similarity filtering)

conda env create -f environment.yml
conda activate metagenome-simulator
pip install -e .

After install, the CLI is available as:

metagenome-generator --help

Requirements

Python 3.8+
Biopython ≥ 1.83
BLAST+ (optional; for EVE removal and train/test similarity filtering)

NCBI Entrez credentials (required for any command that talks to NCBI):

export ENTREZ_EMAIL="your_email@example.com"
export ENTREZ_API_KEY="your_ncbi_api_key"   # optional, for higher rate limits

Quick start

Generate a metagenome in one command:

metagenome-generator pipeline \
  --num-bacteria 10 \
  --num-virus 10 \
  --output-dir output \
  --output metagenome.fasta \
  --sequence-length 250 \
  --reads-per-organism 1000

Result: genomes in output/downloaded/, metagenome FASTA in output/metagenome.fasta.

Biome-preset one-liner:

metagenome-generator biome-metagenome \
  --biome-profile marine \
  --output-dir output_biome

Applies practical defaults for marine, soil, or gut and runs the standard pipeline. Override any flag (--reads-per-organism, --sequence-length, --num-bacteria, --num-virus, etc.) and add --accessions-file <snapshot> for reproducibility.

Two-step (download, then chunk):

metagenome-generator download --num-bacteria 10 --num-virus 10 --output-dir output
metagenome-generator chunk \
  --input output/downloaded \
  --output metagenome.fasta \
  --output-dir output \
  --sequence-length 250 \
  --reads-per-organism 1000

Equal reads per genome: add --balanced to chunk.

Full walkthroughs (including the 100/25 temporal benchmark, read budgets, and similarity filtering) live in the User Guide.

How CHIMERA stays reproducible

NCBI's RefSeq catalog changes continuously (new submissions, retractions, taxonomy updates). Searching "N bacterial and N viral genomes" on two different dates will return different accessions — experiments built that way are not reproducible.

CHIMERA solves this with three layered primitives:

Accession snapshots. metagenome-generator snapshot records the current NCBI catalog (all matching accessions, with CreateDate and title) to snapshots/accession_snapshot_YYYY-MM-DD.json without downloading any sequences. This is a frozen accession list.
--accessions-file + --max-* --sample-seed. Downstream commands (download, pipeline, benchmark-recipe, genome-pool, etc.) consume a snapshot and optionally take a deterministic random subset so the same snapshot + seed always produces the same genome set.
Deterministic seeds everywhere. --seed fixes randomness for variable-length chunking, read-cap sampling, mutation, abundance assignment, and train/test splitting, so the final metagenome file is byte-stable across re-runs.
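
To illustrate the second primitive, here is a minimal Python sketch of a seeded per-category subset. It assumes a hypothetical snapshot layout (a dict mapping each category to a list of accession strings); the real snapshot schema and CHIMERA's own sampling routine may differ. The function name `sample_accessions` is illustrative only.

```python
import random

def sample_accessions(snapshot, max_per_category, sample_seed=42):
    """Return a deterministic per-category subset of a snapshot.

    `snapshot` is assumed (hypothetically) to map category -> list of accessions.
    """
    subset = {}
    for category in sorted(snapshot):
        accessions = sorted(snapshot[category])   # canonical order first
        limit = max_per_category.get(category)
        if limit is None or limit >= len(accessions):
            subset[category] = accessions         # no cap, or fewer than the cap
            continue
        # One generator per category, so adding a category does not
        # change the sample drawn for the others.
        rng = random.Random(f"{sample_seed}:{category}")
        subset[category] = sorted(rng.sample(accessions, limit))
    return subset
```

Sorting before sampling is the design choice that matters here: the subset then depends only on the snapshot contents and the seed, not on the order in which accessions happen to be listed.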

Creating your own snapshot (optional, slow):

metagenome-generator snapshot
# writes snapshots/accession_snapshot_YYYY-MM-DD.json
# add --no-metadata for ID-only, or --complete-only to exclude WGS/drafts

The snapshot command queries NCBI for every RefSeq accession matching each category; it can take tens of minutes to hours depending on catalog size and rate limits. Use a pre-built snapshot from this repo whenever possible — only run snapshot for a custom date or catalog refresh.

Replicating past experiments. Keep the exact snapshot file alongside the code (or in version control). With that file and the same --sample-seed / --seed, the genome set and metagenome regenerate byte-for-byte.
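
One way to check the byte-for-byte claim on your own runs is to compare SHA-256 digests of the two output files. The sketch below uses only the Python standard library; the helper names are illustrative and are not part of CHIMERA.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def same_bytes(path_a, path_b):
    """True when both files have identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling `same_bytes("run1/metagenome.fasta", "run2/metagenome.fasta")` (hypothetical paths) should return `True` when both runs used the same snapshot file, `--sample-seed`, and `--seed`.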

Pre-built snapshots and viral reference

You do not need to build your own snapshot or viral reference DB. The repository ships both and keeps them up to date.

Accession snapshots live in snapshots/. Pass any accession_snapshot_YYYY-MM-DD.json to --accessions-file for reproducible downloads and pipelines. New snapshots are added as RefSeq is refreshed.
Viral reference BLAST databases (for EVE / prophage detection) are published as release assets on GitHub Releases. Download the latest viral_db_YYYY-MM-DD.tar.gz, extract it, and pass the BLAST DB prefix to blastn-filter --viral-db (under …/blastn_db/viral_db inside the archive). Each archive includes viral_db_manifest.json with per-file SHA-256 checksums plus an aggregate DB fingerprint. For strict reproducibility, also pass --viral-db-manifest and/or --require-viral-db-sha256.

Release-asset automation. When a GitHub Release is published, the workflow .github/workflows/release-assets.yml does two things:

Uploads every file listed in scripts/release_assets_manifest.txt (paths must exist in the tagged commit). Update that manifest when the canonical snapshot file changes.
Looks for viral_db*.tar.gz on older releases (newest first, excluding the release just published) and uploads the same tarball to the new release, so every version ships a viral DB without rebuilding. When publishing a new DB, attach viral_db_YYYY-MM-DD.tar.gz locally with gh release upload TAG viral_db_YYYY-MM-DD.tar.gz; subsequent releases propagate it until you replace it.

Use cases at a glance

Use case	Objective	Command or flow
Single metagenome	Generate one synthetic metagenome FASTA (fixed or variable read length) for classifier training or method benchmarking, with controlled genome counts and read parameters.	`pipeline --num-bacteria N --num-virus N --output-dir out --output metagenome.fasta --sequence-length 250 --reads-per-organism 1000`
In-house genome set	Use your own genome FASTAs (isolates, assemblies, phages) instead of NCBI: place them in `bacteria/`, `virus/`, etc., then run `chunk` with `--input` pointing at that directory.	Create folder layout → drop FASTAs in the right category folders → `chunk --input my_genomes --output metagenome.fasta --output-dir out --sequence-length 250 --reads-per-organism 1000`
Reproducible genome set	Freeze the set of genomes used across runs and machines: record the catalog once with `snapshot`, then download and chunk from that list. Optionally take a subset with `--max-bacteria`, `--max-virus`, `--sample-seed`.	`snapshot` → save JSON; then `download --accessions-file <json>` (and chunk) or use it in `pipeline`. For a subset: `--max-bacteria N --max-virus M --sample-seed 42`.
Temporal train/test	Evaluate generalization to "future" genomes: train on accessions submitted before a cutoff date and test on accessions on/after, with BLAST-based removal of test reads similar to train.	One shot: `temporal-pipeline --accessions-file <snap> --split-date YYYY-MM-DD --output-dir <dir>`. Or step-by-step: `temporal-split-search` → `temporal-split` → download/chunk train and test → `filter-test-against-train`.
Single-dataset train/test	Split one synthetic metagenome into train and test fractions (e.g. 80/20) with automatic removal of test reads ≥ threshold similar to train, for quick evaluation without a temporal split.	`pipeline --train-test-split 80`, or `chunk` (one FASTA) then `split-metagenome-train-test --input … --train-test-split 80` (`chunk` itself does not take `--train-test-split`).
Easy biome-like metagenome	One biome-like dataset in one command with preset defaults (`marine`, `soil`, `gut`), with optional overrides and optional snapshot-based reproducibility.	`biome-metagenome --biome-profile marine --output-dir out` (optionally add `--accessions-file <snap>` or `--genome-dir <dir>`).
Structured benchmark	Produce multiple replicate datasets with fixed N genomes per category (e.g. 50 bacterial, 50 viral per replicate) sampled from a snapshot; replicates are selected to be maximally diverse and each is split into train/test reads with similarity filtering.	`snapshot` → `benchmark-recipe --accessions-file <snap> --output-dir out --per-category 50 --replicates 5 --train-test-split 80`

For detailed walkthroughs see the User Guide.

Choosing workflows (how the main tools differ)

Use this section when deciding which commands to chain for reproducible benchmarks versus quick one-off runs.

Approach	What it does	When to use it
`pipeline`	Download (or `--genome-dir`) → optional BLASTN EVE (`--run-blastn-filter`) → read generation. Can write train/test FASTAs in one go via `--train-test-split`.	Single end-to-end run from counts or an existing genome directory; simplest path if you don't need to reuse the same metagenome with several split seeds.
`download` + `blastn-filter` + `chunk` + `split-metagenome-train-test`	Explicit steps: fetch genomes, build `eve_intervals.json`, chunk once to one metagenome FASTA, then split (possibly many times with different `--seed`).	Same genomes and same chunked metagenome, multiple train/test splits (e.g. several read lengths or shuffle seeds). The `chunk` subcommand does not accept `--train-test-split` — use `split-metagenome-train-test` on the metagenome file instead.
`benchmark-recipe`	Samples a fixed N per category from a snapshot for R replicates; picks diverse genomes (genome-level BLAST scoring); each replicate gets `{stem}_train.*` / `{stem}_test.*` with similarity filtering.	Published-style structured benchmark with named replicates under `replicate_XXX/`.
`genome-pool prepare` + `genome-pool materialize`	Prepare: download up to `--max-*` accessions into a shared pool (one heavy download). Materialize: symlink (or copy) a reproducible subset into a per-run genome directory.	Large studies where many runs share the same underlying download; avoids re-fetching genomes for each experiment. Ordinary `download` writes straight to one output tree and does not build a pool.
`temporal-pipeline` (or manual `temporal-split` + downloads + chunk + `filter-test-against-train`)	Splits accessions by NCBI submission date; builds separate train/test metagenomes; removes test reads similar to train.	Time-based generalization ("train on past, test on future"), not a random 80/20 split of one metagenome.
`biome-metagenome` / `biome-dataset-pipeline`	Preset biome-like defaults or contig-fetch + chunk in fewer steps.	Convenience over manual tuning of every `pipeline` flag.

BLAST / EVE: Run blastn-filter once per genome directory (optionally with --eve-query-store pointing at a shared directory) so per-genome EVE results are reused across reruns. Use --force-recompute to ignore that store.

Extended usage

Download genomes

Obtain RefSeq genomes by category (bacteria, virus, archaea, plasmid) from NCBI Nucleotide. You specify the number of bacterial and viral genomes separately; optionally add archaea and plasmids as extra negative samples. Each genome is saved as {accession}.fasta (e.g. NC_000001.1.fasta) in the corresponding category folder. Output layout: bacteria/, virus/, archaea/, plasmid/ under the output directory.

metagenome-generator download --num-bacteria 10 --num-virus 10 --output-dir output

To use a reproducible subset from an existing snapshot (e.g. 50 bacterial + 50 viral) instead of downloading the whole file:

metagenome-generator download \
  --accessions-file snapshots/accession_snapshot_2026-03-10.json \
  --max-bacteria 50 --max-virus 50 \
  --sample-seed 42 \
  --output-dir output/downloaded

Option	Use
`--num-bacteria`	Number of bacteria genomes. How many RefSeq bacterial genomes to fetch via NCBI search. Use for negative (non-viral) samples in viral vs. prokaryotic classifiers. Ignored when `--accessions-file` is set — use `--max-bacteria` to cap the per-category count in that mode.
`--num-virus`	Number of virus genomes. How many RefSeq viral genomes to fetch via NCBI search. Use for positive (viral) samples. Ignored when `--accessions-file` is set — use `--max-virus` instead.
`--num-archaea`	Number of archaea genomes. Optional; default 0. Archaea are additional negative samples (non-viral). Use to broaden the diversity of non-viral sequences (e.g. for phage vs. bacteria + archaea). Ignored when `--accessions-file` is set — use `--max-archaea`.
`--num-plasmid`	Number of plasmid sequences. Optional; default 0. Plasmids are additional negative samples. Use when you want to avoid classifying plasmid-derived reads as viral.
`--output-dir`	Output directory. All category folders (`bacteria/`, `virus/`, etc.) are created under this path. Use a dedicated directory (e.g. `working_directory/downloaded/`) to keep runs organized.
`--accessions-file`	Reproducible run. Path to a JSON file containing accession IDs (e.g. from `snapshot` or a previous `--save-accessions` run). NCBI search is skipped; by default all accessions in the file are downloaded. Use when you need the same genome set on every run (e.g. for benchmarks or paper reproducibility).
`--max-bacteria`, `--max-virus`, `--max-archaea`, `--max-plasmid`	Limit how many to use from the snapshot. When using `--accessions-file`, these set an upper bound per category: the tool takes a random sample of that many accessions (or all if the file has fewer). Omit to download the full snapshot. Example: `--accessions-file snap.json --max-bacteria 50 --max-virus 50` downloads 50 bacterial + 50 viral from the file.
`--sample-seed`	Reproducible subset. When using `--max-*` with `--accessions-file`, seed for the random sample (default 42). Use the same seed to get the same subset on every run.
`--save-accessions`	Save chosen accessions. After searching NCBI, write the selected accession lists and a UTC timestamp to this JSON path. Use this file later as `--accessions-file` to re-download the same set. Ignored when `--accessions-file` is set.
`--complete-only`	Complete genomes only. When searching NCBI (no `--accessions-file`), restrict results to complete genomes and exclude WGS/draft (uses NCBI `complete[Properties]` and `NOT WGS[Properties]`). For reproducible complete-only runs, create a snapshot with `snapshot --complete-only` and use that JSON as `--accessions-file`. Ignored when using `--accessions-file`.

For large snapshots, use --max-* and --sample-seed (see How CHIMERA stays reproducible).

Using your own (in-house) genome set

You can skip the download step and use your own genome FASTA files. Use the folder layout the tool expects:

Folder	Contents
`bacteria/`	One or more FASTA files (e.g. your bacterial isolates or assemblies).
`virus/`	One or more FASTA files (e.g. your viral sequences or phages).
`archaea/`	Optional. FASTA files for archaeal genomes.
`plasmid/`	Optional. FASTA files for plasmid sequences.

Requirements: At least one file in virus/ and at least one file in one of bacteria/, archaea/, or plasmid/ (so both viral and non-viral categories are present). Empty folders are ignored.

File naming: Any filename (e.g. isolate_001.fasta, NC_12345.fasta). The file stem (filename without .fasta) becomes the genome identifier in the output, prefixed with the category folder name (e.g. bacteria_isolate_001_read_0 with description start=0 end=250). Multi-record FASTA files are supported: records past the first are disambiguated with a _seg{N} infix to keep IDs unique.

Workflow: Create the directory, place your FASTA files in the correct category folders, then run chunk with --input pointing at that directory. You do not need to run download.

# Example: in-house data in my_genomes/
# my_genomes/bacteria/isolate_A.fasta  my_genomes/bacteria/isolate_B.fasta
# my_genomes/virus/phage_1.fasta

metagenome-generator chunk \
  --input my_genomes \
  --output metagenome.fasta \
  --output-dir output \
  --sequence-length 250 \
  --reads-per-organism 1000

You can also mix NCBI-downloaded and in-house data: run download into a directory, then copy or symlink your own FASTA files into the same bacteria/, virus/, etc. folders before running chunk.
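
Before running chunk on an in-house directory, it can help to check the layout against the requirements above. The following is a minimal sketch, not part of CHIMERA: the helper names are illustrative, and the set of accepted file extensions is an assumption (the text above only mentions .fasta explicitly).

```python
from pathlib import Path

CATEGORIES = ("bacteria", "virus", "archaea", "plasmid")
# Assumed extensions; only ".fasta" is confirmed by the documentation.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}

def count_genomes(root):
    """Count FASTA files per category folder under `root`."""
    counts = {}
    for category in CATEGORIES:
        folder = Path(root) / category
        if folder.is_dir():
            counts[category] = sum(
                1 for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
            )
        else:
            counts[category] = 0   # missing folders count as empty
    return counts

def check_layout(root):
    """Return a list of problems; an empty list means the layout looks usable."""
    counts = count_genomes(root)
    problems = []
    if counts["virus"] == 0:
        problems.append("no FASTA files in virus/")
    if counts["bacteria"] + counts["archaea"] + counts["plasmid"] == 0:
        problems.append("no FASTA files in bacteria/, archaea/, or plasmid/")
    return problems
```

For the `my_genomes/` example above, `check_layout("my_genomes")` would return an empty list, since both a viral and a non-viral category are populated.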

Generate reads from genomes (`chunk`)

The chunk subcommand turns genome FASTAs into one metagenome FASTA (or FASTQ) by splitting each genome into fixed-length simulated reads or variable-length contigs. Input is either the download output directory or your own directory with the same layout (bacteria/, virus/, archaea/, plasmid/). See Using your own (in-house) genome set above.

metagenome-generator chunk \
  --input output/downloaded \
  --output metagenome.fasta \
  --output-dir output/metagenome \
  --sequence-length 250 \
  --reads-per-organism 1000

Option	Use
`--sequence-length`	Fixed read length (nt). Each simulated read is exactly this many nucleotides. Typical values: 250–500 for short-read style; match your classifier's expected input. Required unless you use variable-length mode.
`--reads-per-organism`	Max reads per genome. Upper limit on how many non-overlapping (or sampled) reads are taken from each genome file. Omit to use all possible reads from every genome (can produce very large outputs). Use a fixed value (e.g. 1000) for controlled dataset size and balance across genomes.
`--balanced`	Same number of reads per genome. Each genome contributes the same count of reads (the minimum across all genomes). Use when you want to avoid one category (e.g. bacteria) dominating simply because genomes are longer.
`--cap-total-reads`	Cap total reads. Downsample the whole metagenome to at most N reads. Use to match a target size (e.g. cap to the size of your positive set) or to keep evaluation sets manageable. Applied after per-genome limits and balancing.
`--min-contig-length`, `--max-contig-length`	Variable-length contigs. Instead of fixed-length reads, sample contigs with lengths uniformly between these two values (nt). Use for long-read or contig-level benchmarks (e.g. 300–2000 bp). Omit both to use fixed `--sequence-length`.
`--contig-quality-profile`	Contig-quality stratification preset. Use a preset mixture of low/medium/high contig length strata: `realistic`, `high-quality`, or `low-quality`. Mutually exclusive with `--min-contig-length`/`--max-contig-length`.
`--seed`	Random seed. Fixes randomness for variable-length sampling, cap, mutation, and train/test split. Use the same seed to reproduce the exact same metagenome; change the seed to get a different sample.
`--eve-intervals`	EVE exclusion. Path to `eve_intervals.json` produced by `blastn-filter`. Reads/contigs that overlap these endogenous viral element intervals on non-viral genomes are excluded from the metagenome. Use to avoid bacterial/archaeal regions that look viral.
`--forbid-ambiguous`	Exclude ambiguous bases. Discard any read that contains non-ACGT characters (e.g. N, R, Y). Use when your pipeline or classifier assumes strict ACGT-only sequence, or to simulate cleaner sequencing.
`--substitution-rate`, `--indel-rate`	Mutation simulation. Introduce substitutions and/or indels at the given per-base rate (0–1). Use to test classifier robustness to sequencing error or divergence (e.g. 0.01 for 1% substitution rate). Combine with `--seed` for reproducible mutated datasets.
`--error-model`	Platform-specific sequencing errors. One of `illumina`, `nanopore`, `pacbio-hifi`, `pacbio-clr`. Illumina = position-dependent substitution; long-read models add indels (and nanopore inflates error rates inside homopolymer runs ≥3 bp). See Error model, FASTQ, and abundance file below.
`--multi-length`	Multi-length benchmark output. Comma-separated list of read lengths (e.g. `300,500,1000,3000`). Writes one FASTA/FASTQ per length (`{stem}_L{N}.{fasta,fastq}`). Incompatible with variable-length contig flags and with `--train-test-split`.
`--paired`, `--insert-size`, `--insert-size-sd`	Paired-end reads. `--paired` writes `{stem}_R1{suffix}` + `{stem}_R2{suffix}` with `/1` `/2` mate tags; R2 is the reverse complement of the 3' end of each fragment. `--insert-size` sets the mean fragment length (default `3 × --sequence-length`, must be `≥ --sequence-length`); `--insert-size-sd` sets the standard deviation (default `0.1 × --insert-size`). Incompatible with `--multi-length`, `--train-test-split`, `--filter-similar`, `--eve-intervals` (pipeline: `--run-blastn-filter`), and variable-length options.
`--coverage`, `--coverage-cv`	Coverage-depth model. `--coverage X` derives per-genome read counts from a target depth (`reads ≈ bp × coverage / read_length`); `--reads-per-organism` still caps. `--coverage-cv Y` draws per-genome coverage from a log-normal with coefficient of variation `Y` so organisms get uneven depths (use `--seed` for reproducibility).
`--chimera-rate`, `--pcr-duplicate-rate`	Library-prep artefacts. `--chimera-rate` replaces a fraction of records with two-parent chimeras (`chimera_{idx}` IDs, header `chimera parents=A|B`); length preserved. `--pcr-duplicate-rate` appends bit-identical duplicates (`_dup` suffix, header `pcr_duplicate=true`). Both require `--seed` for reproducibility.
`--embed-taxonomy`	Gold-standard taxonomy labels. Appends `tax=<group>` to every record's description. Requires `--viral-taxonomy JSON`. Viral reads use the JSON lookup (`unknown` fallback); non-viral reads use the category name.
`--output-fastq`	FASTQ output. Write single-end FASTQ with per-base Phred qualities. See Error model, FASTQ, and abundance file below.
`--write-abundance`	Ground-truth abundance file. Write `{output_stem}_abundance.txt` (genome_id, read_count, proportion). See Error model, FASTQ, and abundance file below.
`--extra-viral-fasta`	Merge user viral sequences. Path to a FASTA of additional viral sequences (e.g. metavirome contigs, custom viral set). Reads are generated as for RefSeq viral genomes and merged into the viral pool.
`--abundance-profile`	Per-category read weights. Comma-separated `category=weight`, e.g. `bacteria=0.5,virus=2,archaea=1,plasmid=1`. Scales how many reads are taken from each category relative to the base limit.
`--abundance-distribution`	Per-genome abundance model. Set to `exponential` to assign each genome a weight from an exponential distribution (then normalized). Produces a few "abundant" and many "rare" genomes, similar to real communities. Use `--seed` for reproducibility.
`--viral-taxonomy`, `--balance-viral-by-taxonomy`	Taxonomy-aware viral balancing. `--viral-taxonomy` is the path to the JSON from `viral-taxonomy`. With `--balance-viral-by-taxonomy`, viral read limits are set so each taxonomy group (e.g. family) contributes equally.
`--filter-similar`	Within-metagenome similarity filter. Remove any read that is ≥ 90% similar (identity and coverage) to a read already kept. The tool oversamples and refills to try to reach the target count. Use to reduce near-duplicate sequences.
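
The coverage-depth row in the table above can be made concrete with a short Python sketch. The read-count formula is the one quoted in the table; the log-normal parameterisation (chosen so that the mean equals the target coverage and the coefficient of variation equals `cv`) is a standard identity, but whether CHIMERA parameterises its draw in exactly this way is an assumption. Function names are illustrative.

```python
import math
import random

def reads_for_coverage(genome_bp, coverage, read_length, cap=None):
    """reads ~= bp * coverage / read_length, optionally capped."""
    reads = round(genome_bp * coverage / read_length)
    return reads if cap is None else min(reads, cap)

def draw_coverage(mean_coverage, cv, rng):
    """Draw a per-genome coverage from a log-normal with the given mean and CV."""
    if cv <= 0:
        return mean_coverage                     # no variability requested
    sigma = math.sqrt(math.log(1.0 + cv * cv))   # CV^2 = exp(sigma^2) - 1
    mu = math.log(mean_coverage) - 0.5 * sigma * sigma
    return rng.lognormvariate(mu, sigma)

rng = random.Random(42)                          # fixed seed for reproducible depths
depth = draw_coverage(5.0, 0.5, rng)
n_reads = reads_for_coverage(4_600_000, depth, 250, cap=1000)
```

For a 4.6 Mb genome at exactly 5× with 250 nt reads, `reads_for_coverage(4_600_000, 5, 250)` gives 92,000 reads; a cap of 1000 then reduces that to 1000, which mirrors how `--reads-per-organism` still applies.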

Train/test split. The chunk subcommand writes one metagenome file only. For an 80/20 (or similar) split with train-vs-test similarity filtering, either run pipeline --train-test-split 80, or run split-metagenome-train-test on the FASTA produced by chunk (see Percentage split with similarity check).

Read and contig IDs; traceability. Every output ID is prefixed with the source category derived from the input directory: bacteria, virus, archaea, or plasmid. Fixed-length segments are named reads ({category}_{stem}_read_{idx}, e.g. bacteria_NC_000913.3_read_17); variable-length segments are contigs ({category}_{stem}_contig_{idx}, e.g. virus_NC_001798.2_contig_3). With accession-named genome files {stem} is the accession, so the full ID is e.g. virus_NC_001798.2_read_0. Multi-record FASTAs (multi-segment viruses, multi-contig drafts) append _seg{N} starting from the second record to keep IDs unique: virus_{stem}_seg1_read_0, virus_{stem}_seg2_read_0, etc. The FASTA/FASTQ description carries start= and end= (0-based positions on the source record) so any read or contig is traceable to its origin.
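
As an illustration of how these IDs can be consumed downstream, the sketch below parses a header into its parts with a regular expression. It follows the naming scheme described above; the function name is illustrative, and it deliberately does not handle paired-end mate tags, `chimera_{idx}` IDs, or `_dup` suffixes. A file stem that itself ends in `_seg{N}` would be ambiguous under this pattern.

```python
import re

READ_ID = re.compile(
    r"^(?P<category>bacteria|virus|archaea|plasmid)_"
    r"(?P<stem>.+?)"                      # accession or file stem (may contain "_")
    r"(?:_seg(?P<segment>\d+))?"          # present only for multi-record FASTAs
    r"_(?P<kind>read|contig)_(?P<index>\d+)$"
)
COORD = re.compile(r"\b(start|end)=(\d+)")

def parse_header(header):
    """Split a FASTA header line into ID fields plus start/end coordinates."""
    record_id, _, description = header.lstrip(">").partition(" ")
    match = READ_ID.match(record_id)
    if match is None:
        raise ValueError(f"unrecognised read ID: {record_id!r}")
    fields = match.groupdict()
    fields["index"] = int(fields["index"])
    if fields["segment"] is not None:
        fields["segment"] = int(fields["segment"])
    for key, value in COORD.findall(description):
        fields[key] = int(value)          # 0-based positions on the source record
    return fields
```

For example, `parse_header(">bacteria_NC_000913.3_read_17 start=4250 end=4500")` yields category `bacteria`, stem `NC_000913.3`, kind `read`, index 17, and the two coordinates.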

Error model, FASTQ, and abundance file

--error-model illumina — Position-dependent substitution (low at 5′, higher at 3′); use for realistic short-read benchmarking. Use --seed for reproducibility.
--error-model nanopore — Long-read profile: substitutions + indels (~5.5 % total error) with homopolymer inflation (×2.5 inside runs ≥3 bp) to reproduce the classic basecaller weakness. FASTQ output uses a flat low Phred (~Q13).
--error-model pacbio-hifi — CCS-grade, low-error (~0.3 % total), substitution-biased; FASTQ Phred ≈ Q25.
--error-model pacbio-clr — Continuous long-read, indel-heavy (~12 % total); FASTQ Phred ≈ Q9.
--output-fastq — Write FASTQ with per-base Phred qualities. Illumina uses a position-dependent quality curve; long-read models use a flat per-model Phred derived from their total error rate.
--write-abundance — Write {stem}_abundance.txt next to the metagenome (columns: genome_id, read_count, proportion). Use as ground truth for abundance estimators or method papers.
--multi-length 300,500,1000,3000 — Emit one file per length ({stem}_L300.fasta, {stem}_L500.fasta, …) from the same genome set; useful for matched-genome DeepVirFinder / VirFinder-style benchmarks.
--paired --insert-size 450 --insert-size-sd 30 — Paired-end output; produces {stem}_R1{suffix} + {stem}_R2{suffix} with Illumina-style /1 /2 mate tags.
--coverage 5 --coverage-cv 0.5 — Target 5× coverage per genome, with per-genome depths drawn log-normally so organisms receive uneven coverage (matches real metagenomes).
--chimera-rate 0.02 --pcr-duplicate-rate 0.1 — Inject realistic library-prep artefacts (2 % chimeras, 10 % duplicate reads); tagged in headers for downstream evaluation.
--embed-taxonomy --viral-taxonomy viral_taxonomy.json — Embed tax=<group> in every record's description so supervised trainers can read labels directly from the FASTA/FASTQ header.
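
The flat Phred values quoted for the long-read models are consistent with the standard relation Q = −10·log10(p) applied to each model's total error rate. The short Python sketch below reproduces that arithmetic; the Phred+33 character encoding is the common FASTQ convention and is assumed here rather than confirmed for CHIMERA's output.

```python
import math

def phred_from_error(error_rate):
    """Phred quality Q = -10 * log10(p) for error probability p."""
    return -10.0 * math.log10(error_rate)

def phred_char(quality, offset=33):
    """Encode a Phred score as a FASTQ quality character (Phred+33 assumed)."""
    return chr(int(round(quality)) + offset)

# Total error rates quoted in the list above.
for model, rate in [("nanopore", 0.055), ("pacbio-hifi", 0.003), ("pacbio-clr", 0.12)]:
    print(f"{model}: Q{phred_from_error(rate):.1f}")
# nanopore: Q12.6
# pacbio-hifi: Q25.2
# pacbio-clr: Q9.2
```

Rounded to integers these are Q13, Q25, and Q9, matching the values listed for each model.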

Example: add --error-model illumina --output-fastq --write-abundance --seed 42 to chunk or pipeline.

Example (paired-end nanopore, log-normal coverage, duplicates):

metagenome-generator chunk \
  --input output/downloaded \
  --output metagenome.fastq \
  --output-dir output/metagenome \
  --sequence-length 500 \
  --paired --insert-size 1500 --insert-size-sd 150 \
  --error-model nanopore \
  --coverage 5 --coverage-cv 0.5 \
  --pcr-duplicate-rate 0.1 \
  --output-fastq --seed 42
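
To make the paired-end geometry concrete, here is a minimal Python sketch of drawing one read pair from a fragment, with R2 taken as the reverse complement of the fragment's 3' end as described in the options table. Drawing the fragment length from a normal distribution and clamping it to at least the read length are assumptions of this sketch; the documentation only states a mean and a standard deviation. Function names are illustrative.

```python
import random

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse-complement a DNA string; other characters (e.g. N) pass through."""
    return seq.translate(_COMPLEMENT)[::-1]

def simulate_pair(genome, read_length, insert_size, insert_sd, rng):
    """Draw one fragment and return (start, r1, r2), or None if it does not fit."""
    # Assumed: normally distributed fragment length, never shorter than one read.
    fragment_len = max(read_length, round(rng.gauss(insert_size, insert_sd)))
    if fragment_len > len(genome):
        return None
    start = rng.randrange(len(genome) - fragment_len + 1)
    fragment = genome[start:start + fragment_len]
    r1 = fragment[:read_length]                       # 5' end, forward strand
    r2 = reverse_complement(fragment[-read_length:])  # 3' end, reverse-complemented
    return start, r1, r2
```

With `read_length=500` and `insert_size=1500`, as in the command above, the two mates are separated by roughly 500 bp of unsequenced fragment.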

Pipeline (download + read generation)

One command to download genomes and generate reads; optionally run BLASTN (EVE). Layout: output-dir/downloaded/, blastn/, logs/, and the final metagenome FASTA in output-dir/<output> (e.g. output-dir/metagenome.fasta).

metagenome-generator pipeline \
  --num-bacteria 10 \
  --num-virus 10 \
  --output-dir output \
  --output metagenome.fasta \
  --sequence-length 250 \
  --reads-per-organism 1000

pipeline accepts the read-generation options of chunk (e.g. --balanced, --eve-intervals) and adds --train-test-split (which chunk itself does not take), --run-blastn-filter, --accessions-file, and --complete-only. See metagenome-generator pipeline --help.

Structured benchmark recipe

Fixed N per category (e.g. 50 bacterial, 50 viral), optional replicates; one command, reproducible and comparable to published benchmarks. No NCBI search at recipe time — samples from your snapshot. Replicates are selected to be maximally diverse (greedy genome-level BLAST scoring), and each replicate is split into train/test reads with similarity filtering.

# 1. Create a snapshot once (or use an existing one)
metagenome-generator snapshot

# 2. Run the recipe: 50 bacterial + 50 viral per replicate, 5 replicates, seed 42
metagenome-generator benchmark-recipe \
  --accessions-file snapshots/accession_snapshot_$(date +%Y-%m-%d).json \
  --output-dir benchmarks/run1 \
  --per-category 50 \
  --replicates 5 \
  --seed 42 \
  --sequence-length 250 \
  --reads-per-organism 1000

Output (per replicate): benchmarks/run1/replicate_001/downloaded/ plus train/test read FASTAs inside the replicate directory:

benchmarks/run1/replicate_001/metagenome_train.fasta
benchmarks/run1/replicate_001/metagenome_test.fasta

…and so on for replicate_002 through replicate_005.

Optional: --archaea 50, --plasmid 50 to include archaea/plasmid in each replicate; --output metagenome.fasta to set the {output_stem}_train.* / {output_stem}_test.* filenames.

Train/test defaults: --train-test-split 80, --train-test-similarity-threshold 90, --min-coverage 0.8.

Train/test similarity filtering knobs (BLAST): --train-test-blast-threads, --train-test-blast-batch-size.

Diversity selection knobs (genome-level BLAST scoring): --diversity-max-attempts, --diversity-blast-perc-identity, --diversity-blast-min-coverage, --diversity-blast-threads.
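
The idea behind greedy diversity selection can be sketched in a few lines of Python. This is an illustration of one common greedy strategy (repeatedly add the candidate whose highest similarity to the already-chosen set is lowest), not CHIMERA's actual implementation. The `similarity` dict of pairwise scores is hypothetical; in practice such scores would come from genome-level BLAST.

```python
import random

def pair_score(similarity, a, b):
    """Look up a symmetric pairwise score; missing pairs count as 0 (no hit)."""
    return similarity.get((a, b), similarity.get((b, a), 0.0))

def greedy_diverse(genomes, similarity, k, seed=42):
    """Pick k genomes, each time adding the one least similar to those chosen."""
    pool = sorted(genomes)                  # canonical order for reproducibility
    if k <= 0:
        return []
    if k >= len(pool):
        return pool
    rng = random.Random(seed)
    chosen = [pool.pop(rng.randrange(len(pool)))]   # seeded starting genome
    while len(chosen) < k:
        # Score each candidate by its worst-case (highest) similarity so far.
        best = min(
            pool,
            key=lambda g: max(pair_score(similarity, g, c) for c in chosen),
        )
        pool.remove(best)
        chosen.append(best)
    return chosen
```

With three genomes where A and B are 99% similar and C is distant from both, selecting two never returns A and B together, whichever genome the seed picks first.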

Train/test split and similarity filtering

Two workflows:

Temporal split: Split accessions by NCBI CreateDate; build train and test metagenomes separately; then run filter-test-against-train to remove test reads whose similarity to train meets or exceeds the threshold. Use when you want "train on past, test on future" (e.g. generalization to novel viruses).
Percentage split: Build one metagenome and split reads (e.g. 80% / 20%); the tool automatically removes from test any read whose similarity to a train read meets or exceeds the threshold. Use for quick train/test from a single dataset.

Removing test reads similar to train avoids inflated metrics from near-identical strains; CHIMERA supports this for both split types.
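
As a rough illustration of the filtering idea, the sketch below decides which test reads to drop from BLAST tabular output (the standard 12-column `-outfmt 6` layout, test reads as queries, train reads as subjects). The function name is hypothetical, and computing coverage as aligned query span divided by query length is an assumption; CHIMERA's exact definition may differ. The defaults of 90 and 0.8 are the ones quoted in this README.

```python
def reads_to_drop(blast_lines, query_lengths, threshold=90.0, min_coverage=0.8):
    """Return IDs of test reads with a hit at or above both thresholds."""
    drop = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        qseqid = fields[0]
        pident = float(fields[2])          # percent identity
        qstart, qend = int(fields[6]), int(fields[7])
        aligned = abs(qend - qstart) + 1   # aligned span on the query
        coverage = aligned / query_lengths[qseqid]
        if pident >= threshold and coverage >= min_coverage:
            drop.add(qseqid)
    return drop


# Hypothetical hits for three 250 bp test reads.
hits = [
    "test_1\ttrain_9\t97.5\t240\t6\t0\t1\t240\t11\t250\t1e-100\t400",  # similar and well covered
    "test_2\ttrain_4\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-40\t180",    # similar but short hit
    "test_3\ttrain_4\t85.0\t250\t37\t0\t1\t250\t1\t250\t1e-60\t250",   # covered but below 90%
]
lengths = {"test_1": 250, "test_2": 250, "test_3": 250}
dropped = reads_to_drop(hits, lengths)
print(dropped)
```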

Temporal split (by NCBI CreateDate)

Find a split date that gives at least N train (total) and M test (total) genomes.

By default, the suggested test split also enforces per-category minima: test bacterial >= M and test viral >= M (so you don't get 0 viral reads when you only pass --min-test).

Optional per-category overrides (all optional): --min-test-bacteria, --min-test-virus/--min-test-viral, --min-test-archaea, --min-test-plasmid. If you don't set the archaea/plasmid flags, they default to 0 for the test set.

Optional train per-category minima: --min-train-bacteria, --min-train-virus/--min-train-viral, --min-train-archaea, --min-train-plasmid (default 0; only --min-train total is enforced by default).
```
metagenome-generator temporal-split-search \
  --accessions-file snapshots/accession_snapshot_YYYY-MM-DD.json \
  --min-train 100 --min-test 20
```
Prints a suggested --split-date and counts. Then use that date in the steps below.
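
The search can be sketched as a scan over candidate dates. Everything here is an assumption made for illustration: the record layout, the rule that genomes created strictly before the split date go to train, and the choice to return the earliest date that satisfies the constraints. Only the constraints themselves (total train >= N, total test >= M, test bacteria and test virus >= M by default) come from the description above.

```python
from collections import Counter
from datetime import date


def count_split(records, split_date):
    """Count genomes per category on each side of split_date."""
    train, test = Counter(), Counter()
    for category, created in records:
        (train if created < split_date else test)[category] += 1
    return train, test


def find_split_date(records, min_train, min_test, min_test_per_category=None):
    """Return the earliest candidate date meeting all minima, or None."""
    if min_test_per_category is None:
        # Default described above: bacterial and viral test minima equal M.
        min_test_per_category = {"bacteria": min_test, "virus": min_test}
    for candidate in sorted({created for _, created in records}):
        train, test = count_split(records, candidate)
        if sum(train.values()) < min_train or sum(test.values()) < min_test:
            continue
        if all(test[cat] >= need for cat, need in min_test_per_category.items()):
            return candidate
    return None


# Hypothetical (category, CreateDate) records.
records = [
    ("bacteria", date(2015, 1, 1)),
    ("virus", date(2016, 1, 1)),
    ("bacteria", date(2017, 1, 1)),
    ("virus", date(2018, 1, 1)),
    ("bacteria", date(2019, 1, 1)),
    ("virus", date(2020, 1, 1)),
]
suggested = find_split_date(records, min_train=2, min_test=1)
print(suggested)
```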

Preview counts for a chosen date (no files written):

metagenome-generator temporal-split-info \
  --accessions-file snapshots/accession_snapshot_2026-03-10.json \
  --split-date 2019-06-01

Write train/test JSONs:

metagenome-generator temporal-split \
  --accessions-file snapshots/accession_snapshot_2026-03-10.json \
  --split-date 2019-06-01

Build train and test metagenomes: Run download (or pipeline) twice with --accessions-file train_<basename>.json and --accessions-file test_<basename>.json into separate dirs; generate reads from each to get train_metagenome.fasta and test_metagenome.fasta.
Filter test against train (important for rigorous evaluation): remove test reads whose similarity to a train read meets or exceeds the threshold, so near-duplicates are not counted as "novel" test data.
```
metagenome-generator filter-test-against-train \
  --train-fasta output_train/train_metagenome.fasta \
  --test-fasta output_test/test_unfiltered.fasta \
  --output output_test/test_metagenome.fasta \
  --similarity-threshold 90
```
With temporal-pipeline, the output dir contains only train_downloaded/, test_downloaded/, blastn/, train_metagenome.fasta, and test_metagenome.fasta (filtered). For manual runs, --output places the filtered test FASTA where you want; omit it to write one folder up from the test FASTA. Options: --min-coverage (default 0.8), --threads, --batch-size. Requires BLAST+.

Percentage split with similarity check (single metagenome)

Option A — pipeline (one command): --train-test-split accepts either 80 or 0.8 for an 80/20 split. The pipeline builds the metagenome and applies the same similarity filter as Option B.

metagenome-generator pipeline \
  --num-bacteria 10 --num-virus 10 \
  --output-dir output --output metagenome.fasta \
  --sequence-length 250 --reads-per-organism 1000 \
  --train-test-split 80 --seed 42

Option B — chunk then split-metagenome-train-test: Use this when you need one combined metagenome FASTA first (e.g. to run several split seeds or lengths without re-chunking). The chunk CLI does not implement --train-test-split.

metagenome-generator chunk \
  --input output/downloaded \
  --output metagenome.fasta \
  --output-dir output \
  --sequence-length 250 \
  --reads-per-organism 1000 \
  --seed 42

metagenome-generator split-metagenome-train-test \
  --input output/metagenome.fasta \
  --output-dir output \
  --train-test-split 80 \
  --seed 42

Output train/test files: {output_stem}_train.fasta and {output_stem}_test.fasta next to --output-dir (default stem is the input filename stem, e.g. metagenome_train.fasta). The initial split follows the requested ratio; after similarity filtering, test can shrink if near-duplicates are removed. Options on split-metagenome-train-test: --similarity-threshold, --min-coverage, --threads, --batch-size.
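
A minimal sketch of the initial split, before any similarity filtering, is shown below. The helper names are hypothetical and the shuffling scheme is an assumption; the sketch only illustrates how a value of 80 or 0.8 can be normalised to the same fraction and how a fixed seed makes the partition repeatable.

```python
import random


def normalise_split(value: float) -> float:
    """Accept 80 or 0.8 and return the train fraction."""
    fraction = value / 100 if value > 1 else value
    if not 0 < fraction < 1:
        raise ValueError("split must be strictly between 0 and 100 percent")
    return fraction


def split_ids(read_ids, train_test_split, seed):
    """Deterministically partition read IDs into (train, test)."""
    fraction = normalise_split(train_test_split)
    shuffled = sorted(read_ids)            # fixed starting order
    random.Random(seed).shuffle(shuffled)  # seeded, independent of global state
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


ids = [f"read_{i}" for i in range(10)]
train, test = split_ids(ids, 80, seed=42)
print(len(train), len(test))
```

Sorting before shuffling means the result depends only on the set of IDs and the seed, not on the order in which reads were written.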

BLASTN filtering (EVE removal)

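Endogenous viral elements (EVEs) and prophages embedded in non-viral genomes can be misclassified as viral. CHIMERA runs BLASTN with the non-viral genomes as queries against a viral database, then excludes reads/contigs that overlap the hits when building the metagenome.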
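
The exclusion step amounts to an interval-overlap test. The sketch below assumes half-open `(start, end)` intervals and non-overlapping fixed-length windows; the function names and the interval representation are illustrative and do not describe the layout of `eve_intervals.json`.

```python
def overlaps_any(start, end, intervals):
    """True if half-open window [start, end) overlaps any (i_start, i_end) interval."""
    return any(start < i_end and i_start < end for i_start, i_end in intervals)


def allowed_read_starts(genome_length, read_length, eve_intervals):
    """Start positions of non-overlapping windows that avoid all EVE intervals."""
    starts = range(0, genome_length - read_length + 1, read_length)
    return [s for s in starts if not overlaps_any(s, s + read_length, eve_intervals)]


# A 1000 bp genome with one hypothetical EVE hit at positions 300-400.
kept = allowed_read_starts(genome_length=1000, read_length=250, eve_intervals=[(300, 400)])
print(kept)
```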

Viral reference for proper prophage/EVE detection. By default, the viral BLAST DB is built from the virus/ folder in --genome-dir (i.e. only the viral genomes you downloaded for that run). Prophage/EVE regions that match viruses not in that set are missed. To check against the full viral catalog, use a pre-built viral DB or build one yourself:

Pre-built (recommended): viral reference DBs are available from this repository's Releases page. Download the latest viral_db_YYYY-MM-DD.tar.gz, extract it, then run:

metagenome-generator blastn-filter --genome-dir output/downloaded --out-dir output/blastn \
  --viral-db /path/to/viral_db_YYYY-MM-DD/blastn_db/viral_db \
  --viral-db-manifest /path/to/viral_db_YYYY-MM-DD/viral_db_manifest.json

Build your own: if you need a DB for a snapshot date not yet in Releases, run build-viral-db once (creates viral_reference/viral_db_YYYY-MM-DD/ using the snapshot date, plus viral_db_manifest.json), then pass the printed DB path to blastn-filter --viral-db:

metagenome-generator build-viral-db --accessions-file snapshots/accession_snapshot_YYYY-MM-DD.json --output-dir viral_reference
metagenome-generator blastn-filter --genome-dir output/downloaded --out-dir output/blastn \
  --viral-db viral_reference/viral_db_YYYY-MM-DD/blastn_db/viral_db \
  --viral-db-manifest viral_reference/viral_db_YYYY-MM-DD/viral_db_manifest.json

You can instead pass a FASTA of viral sequences with --viral-reference-fasta (the tool will run makeblastdb on it). If you pin a specific DB release, use --require-viral-db-sha256 <aggregate_sha256_from_manifest> to hard-fail on mismatches.
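
To show what pinning by an aggregate checksum can look like, here is a hedged sketch. The way CHIMERA derives the aggregate value in `viral_db_manifest.json` is not described here, so the scheme below (SHA-256 over the sorted list of file names and per-file SHA-256 digests) and all function names are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def aggregate_sha256(paths) -> str:
    """Order-independent fingerprint over a set of files (assumed scheme)."""
    lines = sorted(f"{Path(p).name}\t{file_sha256(Path(p))}" for p in paths)
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()


def require_fingerprint(paths, expected: str) -> None:
    """Hard-fail when the files do not match the pinned fingerprint."""
    actual = aggregate_sha256(paths)
    if actual != expected:
        raise SystemExit(f"viral DB fingerprint mismatch: {actual} != {expected}")


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp, "a.bin")
    b = Path(tmp, "b.bin")
    a.write_bytes(b"ACGT")
    b.write_bytes(b"TTGA")
    fingerprint = aggregate_sha256([a, b])
    order_independent = fingerprint == aggregate_sha256([b, a])
    require_fingerprint([a, b], fingerprint)  # passes silently
    b.write_bytes(b"TTGC")                    # simulate a changed DB file
    changed = aggregate_sha256([a, b]) != fingerprint
```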

Standalone (default: viral DB from genome-dir):

metagenome-generator blastn-filter --genome-dir output/downloaded --out-dir output/blastn \
  --evalue 1e-5 --perc-identity 70 \
  --export-eve-fasta output/blastn/eve_intervals.fasta --export-eve-min-length 200
metagenome-generator chunk --input output/downloaded --output metagenome.fasta --output-dir output \
  --balanced --eve-intervals output/blastn/eve_intervals.json

In pipeline: add --run-blastn-filter; optional --blastn-evalue, --blastn-perc-identity, --blastn-threads, --blastn-task, --blastn-export-eve-fasta, --blastn-export-eve-min-length. Requires BLAST+.

Speed and reuse notes (important):

Use --threads to parallelize BLASTN.
Use --task dc-megablast (default) for faster EVE search on large references.
Per-query cache: pass --eve-query-store <dir> (shared directory optional) so each non-viral genome's EVE intervals are stored under a key derived from file path, size, mtime, viral DB fingerprint, and BLAST parameters. Matching genomes skip rerunning blastn. The aggregate eve_intervals.json is still written under --out-dir.
To force recomputation for every query, use --force-recompute (blastn-filter) or --blastn-force-recompute (pipeline / temporal-pipeline).
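
The per-query cache key described above can be sketched as a hash over the listed ingredients (file path, size, mtime, viral DB fingerprint, BLAST parameters). The function name, the JSON serialisation, and the use of SHA-256 are assumptions for illustration; the actual key format may differ.

```python
import hashlib
import json
import os
import tempfile


def eve_cache_key(genome_path, viral_db_fingerprint, blast_params):
    """Derive a stable key; any change to the inputs yields a different key."""
    stat = os.stat(genome_path)
    payload = {
        "path": os.path.abspath(genome_path),
        "size": stat.st_size,
        "mtime_ns": stat.st_mtime_ns,
        "viral_db": viral_db_fingerprint,
        "blast": blast_params,
    }
    canonical = json.dumps(payload, sort_keys=True)  # stable field order
    return hashlib.sha256(canonical.encode()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    genome = os.path.join(tmp, "genome.fasta")
    with open(genome, "w") as handle:
        handle.write(">g\nACGT\n")
    params = {"evalue": 1e-5, "perc_identity": 70, "task": "dc-megablast"}
    key_a = eve_cache_key(genome, "db-v1", params)
    key_b = eve_cache_key(genome, "db-v1", params)   # same inputs, cache hit
    key_c = eve_cache_key(genome, "db-v2", params)   # different viral DB, cache miss
    key_d = eve_cache_key(genome, "db-v1", {**params, "perc_identity": 80})
```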

Viral taxonomy (taxonomy-aware balancing)

Fetch viral taxonomy from NCBI and write a JSON mapping viral accession → taxonomy group (e.g. NC_001234.1 → Herpesviridae). Use with chunk or pipeline and --balance-viral-by-taxonomy so each viral taxonomy group contributes equally.

metagenome-generator viral-taxonomy \
  --accessions-file snapshots/accession_snapshot_2026-03-10.json \
  --output output/viral_taxonomy.json \
  --level family

Then run read generation with balancing:

metagenome-generator chunk --input output/downloaded --output metagenome.fasta --output-dir output \
  --sequence-length 250 --reads-per-organism 1000 \
  --viral-taxonomy output/viral_taxonomy.json --balance-viral-by-taxonomy
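
One way to picture "each viral taxonomy group contributes equally" is to divide a read budget evenly across groups and then across the genomes inside each group. The sketch below is an assumption about how such balancing could work, with hypothetical accession names and a hypothetical function name; it is not taken from CHIMERA's code.

```python
from collections import defaultdict


def reads_per_genome(taxonomy, total_reads):
    """Split total_reads evenly across groups, then across genomes in each group."""
    groups = defaultdict(list)
    for accession, group in sorted(taxonomy.items()):
        groups[group].append(accession)
    per_group = total_reads // len(groups)
    allocation = {}
    for members in groups.values():
        base, extra = divmod(per_group, len(members))
        for i, accession in enumerate(members):
            # Spread any remainder over the first few genomes of the group.
            allocation[accession] = base + (1 if i < extra else 0)
    return allocation


# Hypothetical accession -> family mapping.
taxonomy = {
    "virus_A": "Herpesviridae",
    "virus_B": "Herpesviridae",
    "virus_C": "Herpesviridae",
    "virus_D": "Poxviridae",
}
allocation = reads_per_genome(taxonomy, total_reads=600)
print(allocation)
```

Without balancing, the three Herpesviridae genomes would supply three quarters of the viral reads; here both families supply 300 each.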

Command reference

Command	Purpose
`download`	Download genomes from NCBI into category folders.
`chunk`	Generate reads from genome FASTAs and write one metagenome FASTA (or FASTQ).
`pipeline`	Download + read generation (+ optional BLASTN).
`snapshot`	Save full accession catalog to JSON (no downloads).
`migrate-snapshot`	Convert legacy snapshot to per-category metadata format.
`temporal-split-info`	Show train/test counts for a split date (no files written).
`temporal-split-search`	Find a split date so train has at least N and test at least M genomes. Default per-category minima for test: bacterial `>= M` and viral `>= M`. Override with `--min-test-bacteria/--min-test-virus/--min-test-archaea/--min-test-plasmid`. Optional train per-category minima are also available.
`temporal-split`	Write train and test accession JSONs by CreateDate.
`temporal-pipeline`	Full temporal run: split → download train/test → optional EVE → chunk both → similarity filter.
`filter-test-against-train`	Remove test-FASTA reads similar to train (BLAST). Use after temporal split, or use `temporal-pipeline` to run everything.
`split-metagenome-train-test`	Split an existing metagenome FASTA/FASTQ into train/test with BLAST-based removal of test sequences similar to train. Use after `chunk` when you don't use `pipeline --train-test-split`.
`blastn-filter`	BLAST non-viral vs viral; EVE intervals for read generation. Use `--viral-db` or `--viral-reference-fasta` for full viral catalog. For pinned reproducibility: `--viral-db-manifest` and/or `--require-viral-db-sha256`.
`build-viral-db`	Download all viral genomes from a snapshot and build a BLAST DB for use with `blastn-filter --viral-db` (proper prophage/EVE detection). Also writes `viral_db_manifest.json` with checksums and aggregate fingerprint.
`viral-taxonomy`	Fetch viral taxonomy; write accession→group JSON for `--balance-viral-by-taxonomy`.
`fetch-biome-data`	Fetch a reproducible fraction of biome benchmark resources (`metadata`, `contigs`, or `reads`) with `--fraction`, `--max-samples`, `--seed`; writes `selection_manifest.json`.
`biome-metagenome`	End-user shortcut to generate a biome-like metagenome in one command using a preset (`marine`, `soil`, `gut`).
`biome-dataset-pipeline`	Fetch sampled biome contig FASTAs from a manifest and generate a metagenome from those fetched files in one run.
`benchmark-recipe`	Structured benchmark: fixed N per category, R diverse replicates; samples from snapshot, no NCBI search. Writes `{output_stem}_train.*` and `{output_stem}_test.*` inside each `replicate_XXX/` (default train split 80%, then removes test reads similar to train).
`genome-pool`	Shared downloads: `prepare` samples `max_*` accessions from a snapshot into `pool_dir` and downloads once; `materialize` builds a run directory (default: symlinks) from that pool.

Full options: metagenome-generator <command> --help.

Capabilities summary

Area	Features
Genomes	Download by category (RefSeq); in-house FASTAs; snapshot for reproducibility; `--complete-only`.
Read generation	Fixed/variable length; balanced or weighted; `--forbid-ambiguous`; mutation rates; Illumina-like errors; FASTQ + abundance file.
Biome convenience	One-command biome presets (`biome-metagenome`) and fractional benchmark-resource fetch (`fetch-biome-data`).
Train/test	Temporal split by CreateDate or percentage split; `filter-test-against-train` / similarity filtering.
EVE	BLAST non-viral vs viral; exclude or export provirus regions.
Benchmark	`benchmark-recipe`: fixed N per category, R diverse replicates; genome-level BLAST-driven diversity; generates `{output_stem}_train.*` and `{output_stem}_test.*` with train-vs-test similarity filtering.
Genome pool	`genome-pool prepare` / `materialize` for shared NCBI downloads and reproducible subsets.

Documentation

User Guide — end-to-end walkthroughs, including the 100/25 temporal benchmark, read-budget rationale, and similarity filtering.
CHANGELOG — notable changes per release.
docs/TOOLS_AND_FEATURES.md — comparison to related tools and feature matrix.
docs/DATA_PREPARATION_COMPARISON.md — comparison of data-prep practices against published benchmarks.
docs/improvements.md — roadmap and open items.
CLI help: metagenome-generator --help and metagenome-generator <command> --help.

Development

Clone and install in editable mode with the dev extras to get pytest and ruff:

git clone https://github.com/Alexander-Mitrofanov/MetagenomeGenerator.git
cd MetagenomeGenerator
pip install -e ".[dev]"

Run the test suite:

pytest

Run the linter:

ruff check .

BLAST+ is required for any test or command path that exercises EVE removal or similarity filtering; install it via conda env create -f environment.yml or your system package manager. Tests that need BLAST+ self-skip when it is not available.

Project structure

MetagenomeGenerator/
├── scripts/
│   └── release_assets_manifest.txt
├── pyproject.toml
├── README.md
├── USER_GUIDE.md
├── CHANGELOG.md
├── LICENSE
├── environment.yml
├── main.py
├── src/metagenome_generator/
│   ├── cli.py
│   ├── download_genomes.py
│   ├── ncbi_search.py
│   ├── accession_snapshot.py
│   ├── chunk_genomes.py
│   ├── genome_layout.py
│   ├── blastn_filter.py
│   ├── similarity_filter.py
│   ├── temporal_split.py
│   ├── viral_taxonomy.py
│   ├── benchmark_recipe.py
│   ├── genome_pool.py
│   └── biome_fetch.py
├── tests/
├── docs/
├── snapshots/
└── working_directory/

Programmatic use:

from metagenome_generator import (
    build_metagenome,
    download_genomes,
    load_accessions,
    validate_genome_dir,
)

Project status

CHIMERA is in beta (Development Status :: 4 - Beta). The CLI surface and on-disk file formats are considered stable; see CHANGELOG.md for behavioural changes.

Contributing

Bug reports, feature requests, and pull requests are welcome on GitHub Issues.

Before opening a PR, please:

Run pytest (new features should come with regression tests).
Run ruff check . to lint the code.
Update CHANGELOG.md under the [Unreleased] section.
Update the relevant documentation (README, USER_GUIDE, and/or docs/); the three docs are cross-referenced and should stay in sync.

Citation

If CHIMERA was useful in your research, please cite the repository:

@software{chimera_metagenome_generator,
  author  = {{CHIMERA contributors}},
  title   = {{CHIMERA: Configurable Hybrid In-silico Metagenome Emulator for Read Analysis}},
  url     = {https://github.com/Alexander-Mitrofanov/MetagenomeGenerator},
  year    = {2026}
}

A DOI will be provided once a versioned release is archived on Zenodo.

Notes

NCBI rate limits apply; the tool uses delays and retries (up to 3 with backoff).
Genome selection uses RefSeq and length filters; see DEFAULT_QUERIES in ncbi_search.py to change criteria.
Prefer a dedicated working directory for runs (e.g. working_directory/).
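
The retry behaviour noted above (up to 3 retries with backoff) follows a common pattern, sketched here. The function name, the exponential delay schedule, and the choice of `OSError` as the retryable error are assumptions; the tool's actual delays and error handling are not documented in this section.

```python
import time


def with_retries(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func; on OSError retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return func()
        except OSError:
            if attempt == retries:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated flaky request: fails twice, then succeeds.
calls = []
delays = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise OSError("temporary network error")
    return "ok"


# Record the delays instead of sleeping so the example runs instantly.
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```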

License

Release history

0.9.0 (this version): Apr 21, 2026
0.7.1: Mar 26, 2026


Generate reads from genomes (`chunk`)