Synthetic Hi-C / Micro-C / 3C triplet FASTQ benchmark generator and .pairs recovery analyser.

bench3c

bench3c is a synthetic benchmark generator for 3C-derived sequencing workflows. It generates controlled Hi-C-like and Micro-C-like paired-end FASTQ reads containing known triplet structures, then evaluates whether a mapping / splitting / .pairs reconstruction pipeline recovers the expected fragments.

The benchmark is mainly designed to test preprocessing tools for chimeric or multiplex reads in Hi-C, Micro-C, Pore-C-like or split-read 3C workflows.

Model

The benchmark encodes a known triplet of genomic fragments:

R1: AAAAAAABBBBBBBBB
R2: CCCCCCCCCCCCCCCC

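The read name stores the true genomic coordinates of the three fragments. A single colon separates the two R1 fragments (A and B), and a double colon separates them from the R2 fragment (C):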

@chrA-startA-endA:chrB-startB-endB::chrC-startC-endC

This allows downstream analysis to compare the expected fragment lengths with the observed alignments recovered in .pairs or .pairsam files.
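As a minimal sketch of how such a read name could be decoded (this is not bench3c's own parser), the helper below splits the name into its three fragments. The names `Fragment`, `parse_fragment` and `parse_truth_name` are hypothetical. The sketch assumes chromosome names contain no colon and that the last two hyphen-separated fields of each token are the start and end; the coordinate convention (0- or 1-based) is not stated here and is left untouched.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """One truth fragment decoded from a read name (hypothetical helper type)."""
    chrom: str
    start: int
    end: int


def parse_fragment(token: str) -> Fragment:
    # Split from the right so chromosome names containing '-' stay intact.
    chrom, start, end = token.rsplit("-", 2)
    return Fragment(chrom, int(start), int(end))


def parse_truth_name(name: str) -> tuple[Fragment, Fragment, Fragment]:
    """Decode 'chrA-sA-eA:chrB-sB-eB::chrC-sC-eC' into fragments (A, B, C)."""
    # Drop the FASTQ '@' prefix and anything after the first whitespace.
    name = name.lstrip("@").split()[0]
    r1_part, r2_part = name.split("::")   # R1 fragments :: R2 fragment
    a_token, b_token = r1_part.split(":")  # fragment A : fragment B
    return parse_fragment(a_token), parse_fragment(b_token), parse_fragment(r2_part)
```

For example, `parse_truth_name("@chr1-100-180:chr2-500-570::chr3-900-1050")` yields three `Fragment` tuples whose `end - start` values can then be compared with the aligned lengths.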

Modes

bench3c has three main modes:

  • --hic: generate Hi-C-like triplet reads from a digested genome.
  • --microc: generate Micro-C-like triplet reads directly from a FASTA.
  • --analyse: analyse a .pairs, .pairs.gz or .pairsam file and compare recovered alignments to the encoded truth.

Installation

From source:

git clone <repo-url>
cd <repo>
uv sync
uv run bench3c --help

Or with pip after packaging:

pip install bench3c
bench3c --help

Hi-C simulation

Generate Hi-C-like paired-end FASTQ reads:

bench3c --hic \
  --fasta genome.fa \
  --site GATC \
  --out bench/hic_sim \
  --number-reads-pairs 100000 \
  --read-len 150 \
  --min-piece 50 \
  --max-jump 300

If --fasta is not provided, bench3c can generate a random FASTA. If --digested is not provided, the genome is digested internally using --site.

Typical outputs:

bench/hic_sim_R1.fq
bench/hic_sim_R2.fq
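The internal digestion step mentioned above can be pictured with the following sketch. It is an illustration only, not bench3c's implementation: the function name `digest` is hypothetical, and placing the cut at the first base of each recognition site is an assumption.

```python
def digest(seq: str, site: str = "GATC") -> list[tuple[int, int]]:
    """Return half-open (start, end) intervals between occurrences of `site`.

    Assumption: the cut falls at the first base of each site occurrence.
    """
    seq = seq.upper()
    site = site.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos)
        # Search again one base further so overlapping sites are also found.
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    # Drop empty intervals (e.g. when the sequence starts with the site).
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]
```

Under these assumptions, a sequence with two GATC sites produces three fragments, and a sequence without any site is returned as a single fragment.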

Micro-C simulation

Generate Micro-C-like paired-end FASTQ reads:

bench3c --microc \
  --fasta genome.fa \
  --out bench/microc_sim \
  --number-reads-pairs 100000 \
  --read-len 150 \
  --min-piece 50
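One way to read the interaction between --read-len and --min-piece is that R1 is split into two pieces (A and B) that together fill the read, each at least --min-piece bases long. The sketch below illustrates that reading only; the function name `split_read_length` is hypothetical, and the uniform choice of split point is an assumption, not a description of how bench3c samples.

```python
import random


def split_read_length(read_len=150, min_piece=50, rng=None):
    """Pick lengths for pieces A and B of a chimeric R1 (illustrative only)."""
    rng = rng or random.Random()
    if 2 * min_piece > read_len:
        raise ValueError("read_len is too short to hold two pieces of min_piece")
    # Assumption: the split point is uniform over all positions that keep
    # both pieces at least min_piece long.
    len_a = rng.randint(min_piece, read_len - min_piece)
    return len_a, read_len - len_a
```

With the defaults shown, piece A is between 50 and 100 bases and piece B takes the remainder of the 150-base read.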

Add non-chimeric pairs:

bench3c --microc \
  --fasta genome.fa \
  --out bench/microc_mixed \
  --number-reads-pairs 100000 \
  --read-len 150 \
  --prop-nonchimeric 0.2

Analysis mode

Analyse a .pairs, .pairs.gz or .pairsam file after mapping and reconstruction:

bench3c --analyse \
  --pairs output.pairs.gz \
  --read-len 150 \
  --analyse-out-dir benchmark_results \
  --condition my_pipeline

The analyser expects read names following the truth-encoding format:

chrA-startA-endA:chrB-startB-endB::chrC-startC-endC

It also expects recoverable alignment information, typically through sam1 and sam2 columns in the .pairs / .pairsam file.
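To check that a file carries the columns the analyser needs, a small reader along these lines can help. It is a sketch, not part of bench3c: the function name `iter_pairs_records` is hypothetical, and it relies on the `#columns:` header line that .pairs files normally carry. The content of the sam1 and sam2 fields is returned as-is, since its internal encoding depends on the tool that wrote the file.

```python
def iter_pairs_records(lines):
    """Yield one dict per data line, keyed by the names in the '#columns:' header."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header line; only the column declaration is used here.
            if line.startswith("#columns:"):
                columns = line[len("#columns:"):].split()
            continue
        if columns is None:
            raise ValueError("no '#columns:' header found before the first record")
        yield dict(zip(columns, line.split("\t")))
```

A gzip-compressed file can be passed in as `gzip.open(path, "rt")`. Records lacking a `sam1` or `sam2` key indicate that the alignment columns were dropped upstream.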

Analysis outputs

The analysis mode writes summary tables and plots, including:

<condition>_read_recovery.tsv
<condition>_problem_fragments.tsv
<condition>_problem_summary.tsv
<condition>_cut_summary.tsv
<condition>_histogram.pdf
<condition>_tolerance_curve.pdf

Depending on the current version, additional outputs may include chimeric-size summaries and chimeric-specific histograms.
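The exact content of the tolerance-curve plot is not specified here. One plausible reading, sketched below as an assumption, is the fraction of fragments whose observed length lies within a given tolerance of the expected length, evaluated over a range of tolerances. The function name `tolerance_curve` and its inputs are hypothetical.

```python
def tolerance_curve(expected, observed, tolerances):
    """Return (tolerance, fraction recovered) pairs.

    A fragment counts as recovered at tolerance t when
    |observed length - expected length| <= t (assumed definition).
    """
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    diffs = [abs(o - e) for e, o in zip(expected, observed)]
    n = len(diffs)
    curve = []
    for t in tolerances:
        within = sum(1 for d in diffs if d <= t)
        curve.append((t, within / n if n else 0.0))
    return curve
```

A curve that only reaches 1.0 at large tolerances suggests that fragments are found but their boundaries are systematically off.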

Typical benchmark workflow

# 1. Generate synthetic reads
bench3c --microc \
  --fasta genome.fa \
  --out bench/microc \
  --number-reads-pairs 10000 \
  --read-len 150

# 2. Map the reads with your mapper (bwa needs an index: run `bwa index genome.fa` once beforehand)
bwa mem -SP genome.fa bench/microc_R1.fq bench/microc_R2.fq > bench/microc.sam

# 3. Convert the mappings to .pairs with your pipeline

# 4. Analyse fragment recovery
bench3c --analyse \
  --pairs bench/output.pairs.gz \
  --read-len 150 \
  --analyse-out-dir bench/results \
  --condition my_pipeline

Interpretation

A read is perfect when all of its expected fragments are recovered and no extra fragment is observed.

Common failure classes:

  • missing: an expected fragment was not recovered.
  • too_short: the observed fragment is shorter than the truth.
  • too_long: the observed fragment is longer than the truth.
  • over_split: extra observed fragments were recovered.
  • under_split_or_missing: one or more truth fragments were not recovered.
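The failure classes above can be illustrated with a small classifier. This is a sketch under stated assumptions, not the analyser's actual logic: the function names, the `tolerance` parameter and the `"ok"` and `"count_ok"` labels are hypothetical, and lengths are compared in base pairs.

```python
def classify_fragment(expected_len, observed_len, tolerance=0):
    """Label one expected fragment from its observed length (None if not found)."""
    if observed_len is None:
        return "missing"
    if observed_len < expected_len - tolerance:
        return "too_short"
    if observed_len > expected_len + tolerance:
        return "too_long"
    return "ok"  # hypothetical label for a fragment within tolerance


def classify_read(n_expected, n_observed):
    """Label a read by comparing fragment counts only."""
    if n_observed > n_expected:
        return "over_split"
    if n_observed < n_expected:
        return "under_split_or_missing"
    return "count_ok"  # hypothetical label: counts match, lengths not checked
```

Under this sketch a read would be perfect only when `classify_read` returns `"count_ok"` and every fragment is labelled `"ok"`.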

Limitations

bench3c is a controlled synthetic benchmark. It does not fully model all experimental biases of real Hi-C or Micro-C libraries, such as PCR duplicates, GC bias, mappability bias, restriction efficiency, ligation bias, base-quality degradation, optical duplicates, or complex multi-mapping.

It is intended to test whether a pipeline can recover known chimeric or multiplex structures under controlled conditions.

License

AGPL-3.0-or-later.
