Synthetic Hi-C / Micro-C / 3C triplet FASTQ benchmark generator and .pairs recovery analyser.
bench3c
bench3c is a synthetic benchmark generator for 3C-derived sequencing workflows. It generates controlled Hi-C-like and Micro-C-like paired-end FASTQ reads containing known triplet structures, then evaluates whether a mapping / splitting / .pairs reconstruction pipeline recovers the expected fragments.
The benchmark is mainly designed to test preprocessing tools for chimeric or multiplex reads in Hi-C, Micro-C, Pore-C-like or split-read 3C workflows.
Model
The benchmark encodes a known triplet of genomic fragments:
R1: AAAAAAABBBBBBBBB
R2: CCCCCCCCCCCCCCCC
The read name stores the true genomic coordinates:
@chrA-startA-endA:chrB-startB-endB::chrC-startC-endC
This allows downstream analysis to compare the expected fragment lengths with the observed alignments recovered in .pairsam files.
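As an illustration, the truth encoding can be decoded with a few string splits. This is a minimal sketch, not bench3c's own parser: the `Fragment` type and the function names are hypothetical, chromosome names are assumed to contain no `:` characters, and the fragment length assumes half-open coordinates (`end - start`), which the format description does not state.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: coordinates are half-open, so length is end - start.
        return self.end - self.start


def parse_fragment(token: str) -> Fragment:
    # rsplit keeps hyphens inside chromosome names (e.g. "chr1-alt") intact.
    chrom, start, end = token.rsplit("-", 2)
    return Fragment(chrom, int(start), int(end))


def parse_truth_name(read_name: str) -> tuple[Fragment, Fragment, Fragment]:
    """Decode chrA-startA-endA:chrB-startB-endB::chrC-startC-endC."""
    name = read_name.lstrip("@").split()[0]  # drop '@' and any FASTQ comment
    r1_part, r2_part = name.split("::")      # '::' separates the R1 truth from the R2 truth
    frag_a, frag_b = r1_part.split(":")      # ':' separates A from B within R1
    return parse_fragment(frag_a), parse_fragment(frag_b), parse_fragment(r2_part)


a, b, c = parse_truth_name("@chr1-1000-1070:chr2-5000-5080::chr3-200-350")
print(a, b.length, c.length)
```

Splitting from the right means a hyphen inside a chromosome name does not shift the start and end fields.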
Modes
bench3c has three main modes:
- --hic: generate Hi-C-like triplet reads from a digested genome.
- --microc: generate Micro-C-like triplet reads directly from a FASTA.
- --analyse: analyse a .pairs, .pairs.gz or .pairsam file and compare recovered alignments to the encoded truth.
Installation
From source:
git clone <repo-url>
cd <repo>
uv sync
uv run bench3c --help
Or with pip after packaging:
pip install bench3c
bench3c --help
Hi-C simulation
Generate Hi-C-like paired-end FASTQ reads:
bench3c --hic \
--fasta genome.fa \
--site GATC \
--out bench/hic_sim \
--number-reads-pairs 100000 \
--read-len 150 \
--min-piece 50 \
--max-jump 300
If --fasta is not provided, bench3c can generate a random FASTA. If --digested is not provided, the genome is digested internally using --site.
Typical outputs:
bench/hic_sim_R1.fq
bench/hic_sim_R2.fq
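The internal digestion step driven by --site can be pictured as locating every occurrence of the recognition site and splitting the sequence there. The sketch below is an illustration under stated assumptions, not bench3c's implementation: `find_sites` and `digest` are hypothetical names, and the cut is placed at the first base of the site, whereas the real cut offset depends on the enzyme.

```python
def find_sites(sequence: str, site: str = "GATC") -> list[int]:
    """Return the 0-based start position of every occurrence of `site`."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    pos = sequence.find(site)
    while pos != -1:
        positions.append(pos)
        pos = sequence.find(site, pos + 1)  # step by 1 so overlapping sites are found
    return positions


def digest(sequence: str, site: str = "GATC") -> list[tuple[int, int]]:
    """Split a sequence into half-open (start, end) fragments at each site start."""
    cuts = [p for p in find_sites(sequence, site) if p > 0]
    bounds = [0] + cuts + [len(sequence)]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


seq = "AAGATCTTTTGATCCC"
print(find_sites(seq))  # [2, 10]
print(digest(seq))      # [(0, 2), (2, 10), (10, 16)]
```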
Micro-C simulation
Generate Micro-C-like paired-end FASTQ reads:
bench3c --microc \
--fasta genome.fa \
--out bench/microc_sim \
--number-reads-pairs 100000 \
--read-len 150 \
--min-piece 50
Add non-chimeric pairs:
bench3c --microc \
--fasta genome.fa \
--out bench/microc_mixed \
--number-reads-pairs 100000 \
--read-len 150 \
--prop-nonchimeric 0.2
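To make the triplet layout from the Model section concrete, here is a toy sketch of how one read pair and its truth-encoding name could be assembled from a FASTA-like dictionary. Everything in it is an assumption for illustration: `sample_piece` and `make_triplet` are hypothetical names, pieces are drawn uniformly at random, and strand orientation, reverse complementing, and base qualities are not modelled. It does not describe how bench3c samples reads internally.

```python
import random
from typing import Optional


def sample_piece(genome: dict, length: int, rng: random.Random) -> tuple:
    """Pick a random (chrom, start, end) window of the given length."""
    chrom = rng.choice([c for c, s in genome.items() if len(s) >= length])
    start = rng.randrange(len(genome[chrom]) - length + 1)
    return chrom, start, start + length


def make_triplet(genome: dict, read_len: int = 150, min_piece: int = 50,
                 rng: Optional[random.Random] = None) -> tuple:
    """Return (name, r1_seq, r2_seq) with R1 = A + B and R2 = C."""
    if read_len < 2 * min_piece:
        raise ValueError("read_len must be at least 2 * min_piece")
    rng = rng or random.Random()
    len_a = rng.randint(min_piece, read_len - min_piece)  # A and B are both >= min_piece
    a, b, c = (sample_piece(genome, n, rng)
               for n in (len_a, read_len - len_a, read_len))

    def seq(piece):
        chrom, start, end = piece
        return genome[chrom][start:end]

    name = "{}-{}-{}:{}-{}-{}::{}-{}-{}".format(*a, *b, *c)
    return name, seq(a) + seq(b), seq(c)


rng = random.Random(0)
genome = {"chrA": "".join(rng.choice("ACGT") for _ in range(2000))}
name, r1, r2 = make_triplet(genome, rng=rng)
print(name, len(r1), len(r2))
```

Because the coordinates are written into the name at generation time, no separate truth file is needed downstream.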
Analysis mode
Analyse a .pairs, .pairs.gz or .pairsam file after mapping and reconstruction:
bench3c --analyse \
--pairs output.pairs.gz \
--read-len 150 \
--analyse-out-dir benchmark_results \
--condition my_pipeline
The analyser expects read names following the truth-encoding format:
chrA-startA-endA:chrB-startB-endB::chrC-startC-endC
It also expects recoverable alignment information, typically through sam1 and sam2 columns in the .pairs / .pairsam file.
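Before running the analyser, it can help to check that a file actually carries the sam1 and sam2 columns. The sketch below reads the `#columns:` header line defined by the .pairs format and yields each record as a dictionary. The function names are hypothetical and this is not bench3c's reader. The `SAM1` and `SAM2` values in the demo are placeholders, not real alignment records.

```python
import gzip
from typing import Iterable, Iterator


def open_pairs(path: str):
    """Open a plain or gzip-compressed .pairs / .pairsam file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def iter_pair_records(lines: Iterable[str]) -> Iterator[dict[str, str]]:
    """Yield one dict per data line, keyed by the names in '#columns:'."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#columns:"):
            columns = line[len("#columns:"):].split()
        elif line.startswith("#") or not line:
            continue  # other header lines and blank lines carry no records
        else:
            if columns is None:
                raise ValueError("no '#columns:' header line found before the data")
            yield dict(zip(columns, line.split("\t")))


demo = [
    "## pairs format v1.0",
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type sam1 sam2",
    "chr1-1000-1070:chr2-5000-5080::chr3-200-350\tchr1\t1000\tchr3\t350\t+\t-\tUU\tSAM1\tSAM2",
]
for rec in iter_pair_records(demo):
    print(rec["readID"], "sam1" in rec and "sam2" in rec)
```

For a real file, pass the handle returned by `open_pairs(path)` instead of the `demo` list.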
Analysis outputs
The analysis mode writes summary tables and plots, including:
<condition>_read_recovery.tsv
<condition>_problem_fragments.tsv
<condition>_problem_summary.tsv
<condition>_cut_summary.tsv
<condition>_histogram.pdf
<condition>_tolerance_curve.pdf
Depending on the installed version, additional outputs may include chimeric-size summaries and chimeric-specific histograms.
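One plausible reading of a tolerance curve is the fraction of fragments whose observed length falls within t bp of the expected length, plotted as t grows. The sketch below computes such a curve from two lists of lengths. This definition is an assumption made for illustration, `tolerance_curve` is a hypothetical helper, and the exact quantity plotted by bench3c may differ.

```python
def tolerance_curve(expected, observed, tolerances):
    """Return (t, fraction) pairs where fraction counts |observed - expected| <= t."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    deviations = [abs(o - e) for e, o in zip(expected, observed)]
    n = len(deviations)
    return [
        (t, sum(d <= t for d in deviations) / n if n else 0.0)
        for t in tolerances
    ]


expected = [70, 80, 150, 100]
observed = [70, 77, 150, 60]
curve = tolerance_curve(expected, observed, tolerances=[0, 5, 50])
print(curve)  # [(0, 0.5), (5, 0.75), (50, 1.0)]
```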
Typical benchmark workflow
# 1. Generate synthetic reads
bench3c --microc \
--fasta genome.fa \
--out bench/microc \
--number-reads-pairs 10000 \
--read-len 150
# 2. Map the reads with your mapper
bwa mem -SP genome.fa bench/microc_R1.fq bench/microc_R2.fq > bench/microc.sam
# 3. Convert the mappings to .pairs with your pipeline
# 4. Analyse fragment recovery
bench3c --analyse \
--pairs bench/output.pairs.gz \
--read-len 150 \
--analyse-out-dir bench/results \
--condition my_pipeline
Interpretation
A perfect read is a read for which all expected fragments are recovered with no extra observed fragment.
Common failure classes:
- missing: an expected fragment was not recovered.
- too_short: the observed fragment is shorter than the truth.
- too_long: the observed fragment is longer than the truth.
- over_split: extra observed fragments were recovered.
- under_split_or_missing: one or more truth fragments were not recovered.
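The classes above can be sketched as a small decision function. This is an illustrative reading of the definitions, not the analyser's actual logic. The `ok` label, the `tolerance` parameter, and the convention of passing `None` for an unrecovered fragment are all assumptions, and how observed fragments get matched to truth fragments is left out.

```python
from typing import Optional, Sequence


def classify_fragment(expected_len: int, observed_len: Optional[int],
                      tolerance: int = 0) -> str:
    """Label one truth fragment given its matched observed length (None if unrecovered)."""
    if observed_len is None:
        return "missing"
    if observed_len < expected_len - tolerance:
        return "too_short"
    if observed_len > expected_len + tolerance:
        return "too_long"
    return "ok"  # hypothetical label for a fragment recovered within tolerance


def classify_read(expected_lens: Sequence[int],
                  matched_lens: Sequence[Optional[int]],
                  n_extra: int = 0, tolerance: int = 0) -> list:
    """Return the sorted problem classes for a read, or ['perfect'] if there are none."""
    labels = [classify_fragment(e, o, tolerance)
              for e, o in zip(expected_lens, matched_lens)]
    problems = {label for label in labels if label != "ok"}
    if n_extra > 0:
        problems.add("over_split")  # observed fragments with no truth counterpart
    if "missing" in problems:
        problems.add("under_split_or_missing")
    return sorted(problems) or ["perfect"]


print(classify_read([70, 80, 150], [70, 80, 150]))
print(classify_read([70, 80, 150], [70, None, 140], n_extra=1, tolerance=5))
```

The first call returns `['perfect']`. The second returns `['missing', 'over_split', 'too_short', 'under_split_or_missing']`, which shows that one read can fall into several classes at once.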
Limitations
bench3c is a controlled synthetic benchmark. It does not fully model all experimental biases of real Hi-C or Micro-C libraries, such as PCR duplicates, GC bias, mappability bias, restriction efficiency, ligation bias, base-quality degradation, optical duplicates, or complex multi-mapping.
It is intended to test whether a pipeline can recover known chimeric or multiplex structures under controlled conditions.
License
AGPL-3.0-or-later.