Compression and interactive exploration of large-scale sequencing alignments with circular mapping support

TheBIGbam is a genome browser and alignment viewer designed for massive metagenomic and metatranscriptomic datasets.

It compresses and visualizes large genomic annotation files (GenBank) and alignment files (BAM), and can generate alignments with explicit circular-genome mapping.

Built with Rust for fast BAM processing and Python + Bokeh for interactive visualization.


Installation

Option 1: conda

conda install -c bioconda thebigbam

This also installs samtools, minimap2, and bwa-mem2 for read mapping.

Option 2: pip

pip install thebigbam

Mapping tools need to be installed independently.

Check that the installation succeeded

First, check that the main command works:

thebigbam -h

Then run a test with the HK97 example data:

thebigbam calculate \
 -g tests/HK97/HK97_GCF_000848825.1_pharokka.gbk \
 -b tests/HK97/ \
 -o tests/HK97/test.db

Finally, visualize the test data interactively:

thebigbam serve --db tests/HK97/test.db --port 5006

Open your browser at http://localhost:5006

See the installation guide for more detailed instructions.


Main usage

The TheBIGbam workflow consists of three main steps:

  • (optional) Generation of alignment files for your samples with circular-genome support
  • Generation of a DuckDB database summarizing your genomic and mapping files with Rust
  • Interactive visualization of the DuckDB database content using Python and Bokeh

Quick usage with HK97 test data

# only available if you installed the mapping dependencies
thebigbam mapping-per-sample \
  -r1 tests/HK97/HK97_R1_illumina.fastq.gz \
  -r2 tests/HK97/HK97_R2_illumina.fastq.gz \
  -a tests/HK97/HK97_GCF_000848825.1.fasta \
  --circular -o tests/HK97/HK97_illumina_circular.bam

thebigbam calculate \
  -b tests/HK97 \
  -g tests/HK97/HK97_GCF_000848825.1_pharokka.gbk \
  -m coverage,misalignment \
  -o tests/HK97/HK97.db \
  -t 4

thebigbam serve --db tests/HK97/HK97.db --port 5006

For more complex examples see the usage page.

Database computation

The thebigbam calculate command converts large BAM files and associated annotated assemblies into a compact, queryable DuckDB database.

Example command to compute the database for a single sample containing paired-end short reads mapped to the reference genome of phage HK97:

thebigbam calculate \
  -b tests/HK97 \
  -g tests/HK97/HK97_GCF_000848825.1_pharokka.gbk \
  -m coverage,misalignment \
  -o tests/HK97/HK97.db \
  -t 4

For more complex examples see the usage page.

What input files do I need?

You need to provide at least one of the following:

  • BAM mapping files (-b)

  • A GenBank annotation file (-g)

If only an annotation file is provided, contig-level data (annotations, GC content, repeats) is calculated without any sample-level mapping features.

If only BAM files are provided, an assembly file of contigs (FASTA format) can be supplied with -a to allow the computation of sequence-dependent, mapping-derived features.

Alignment files

Parameter: --bam_files DIRECTORY, short-version -b

Your mapping files need to be sorted BAM files with MD tags. If you have mapping files but not in the right format, SAMtools is your friend!

# to convert a SAM/BAM file into a sorted BAM file
samtools view -bS example.sam | samtools sort -o example.sorted.bam

# to add an index file to your BAM file
samtools index example.sorted.bam

# to add MD tags to your BAM file 
# you also need the fasta file used during the mapping step
samtools calmd -b example.sorted.bam ref.fasta > example.sorted.md.bam

Alternatively, you can produce your alignment files directly in theBIGbam as specified in the Mapping section.

Annotation file

Parameter: --genbank FILE, short-version -g

The annotation file should be in GenBank (.gbk, .gbff, .gb) or GFF3 (.gff, .gff3) format, produced with the tool of your choice: bakta for bacteria, pharokka or phold for phages, eggnog-mapper for eukaryotes, etc.

Example commands to generate such annotations are available on the usage page.

Which features can I calculate?

Parameter (optional): --modules COMMA-SEPARATED LIST, short version -m

When BAM files are provided, theBIGbam performs fast Rust-based computations on them to extract relevant values. Individual read information is discarded in favor of lightweight per-position averages for each contig in each sample.

All mapping-derived modules are computed and stored in the database unless you request a specific subset of modules. Five mapping-derived modules currently exist:

  • Coverage: computes per-position coverage for primary, secondary, and supplementary reads, as well as the mapping quality (MAPQ)
  • Misalignment: computes per-position number of clippings, insertions, deletions and mismatches
  • Long-reads: computes per-position average length of reads
  • Paired-reads: computes per-position average insert size of reads along with the number of incorrect pair orientations (non-inward pairs, mates unmapped or mapping on another contig)
  • Phage termini: computes per-position coverage for primary reads starting with an exact match (a short clipping < 5 bp is tolerated). Among those reads, the number of mapped reads starting and ending at each position is computed. This module requires sequences to be provided
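
To make "per-position" values concrete, here is a minimal Python sketch of how a coverage track can be derived from read intervals alone. It is an illustration, not theBIGbam's Rust implementation: the function name and the 0-based, half-open interval convention are assumptions made for this example.

```python
from itertools import accumulate


def per_position_coverage(contig_length, read_intervals):
    """Return the number of reads covering each position of a contig.

    read_intervals holds 0-based, half-open (start, end) pairs; this
    convention is an assumption made for the example.
    """
    # Difference array: +1 where a read starts, -1 just after it ends.
    delta = [0] * (contig_length + 1)
    for start, end in read_intervals:
        delta[start] += 1
        delta[end] -= 1
    # The running sum of the differences is the depth at each position.
    return list(accumulate(delta[:-1]))


coverage = per_position_coverage(10, [(0, 4), (2, 6), (5, 10)])
print(coverage)  # [1, 1, 2, 2, 1, 2, 1, 1, 1, 1]
```

Once the track is computed, the individual reads are no longer needed, which is the reason the stored data can be much smaller than the alignments.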

When contig sequences are provided, the Genome module is computed. It calculates GC content, GC skew and the repeats contained within each contig using an autoblast. If annotations are available (GenBank file provided), contig annotations (e.g. positions of the coding sequences and their functions) are also saved.
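
As an illustration of the two GC-based features, the sketch below computes GC content and GC skew over consecutive windows, using the conventional definition of skew, (G - C) / (G + C). The window size, the function name and the choice to report a skew of 0 for windows without G or C are assumptions for this example; the README does not specify how theBIGbam windows these values.

```python
def gc_content_and_skew(sequence, window=1000):
    """Yield (window_start, gc_content, gc_skew) for consecutive windows."""
    sequence = sequence.upper()
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        gc_content = (g + c) / len(chunk)
        # Conventional GC skew; defined here as 0 when the window has no G or C.
        gc_skew = (g - c) / (g + c) if g + c else 0.0
        yield start, gc_content, gc_skew


windows = list(gc_content_and_skew("GGGCATATCCCGATAT", window=8))
print(windows)  # [(0, 0.5, 0.5), (8, 0.5, -0.5)]
```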

A more detailed explanation of the modules and the features they contain is available in the features section.

Database compression

Parameters (optional): --min_aligned_fraction, --min_coverage_depth, --variation_percentage, --contig_variation_percentage, --coverage_percentage

Discarding the reads and keeping only the main features of the mappings (such as the coverage per position) already makes the DuckDB database much lighter than the original BAM files. The database itself is also structured to be as light as possible.

First, the database is organised per contig per sample (hereafter a contig/sample pair). Only pairs in which the contig is present in the sample are stored in the database. The definition of presence can be tuned with two parameters:

  • --min_aligned_fraction controls the minimum percentage of positions that received reads (default 50%, meaning a contig is considered present only if more than half of it received reads)

  • --min_coverage_depth sets the minimum mean coverage depth required for contig inclusion (default 0, i.e. disabled; set it to e.g. 5 to filter out contigs with very low depth that produce noisy signals).
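
The presence test described by these two parameters can be sketched as follows. This is a hedged illustration rather than theBIGbam's code: the function name is hypothetical, the aligned fraction is expressed as a fraction between 0 and 1 (the unit expected on the command line is not stated here), and a position is assumed to have "received reads" when its depth is above zero.

```python
def is_contig_present(coverage, min_aligned_fraction=0.5, min_coverage_depth=0.0):
    """Decide whether a contig counts as present in a sample.

    coverage is the per-position depth of the contig in that sample.
    """
    covered_positions = sum(1 for depth in coverage if depth > 0)
    aligned_fraction = covered_positions / len(coverage)
    mean_depth = sum(coverage) / len(coverage)
    # "More than half" in the text above suggests a strict comparison.
    return aligned_fraction > min_aligned_fraction and mean_depth >= min_coverage_depth


print(is_contig_present([3, 3, 3, 0, 0, 0, 0, 0, 0, 0]))  # False: 30% of positions covered
print(is_contig_present([2, 2, 2, 2, 2, 2, 0, 0, 0, 0]))  # True: 60% of positions covered
print(is_contig_present([2, 2, 2, 2, 2, 2, 0, 0, 0, 0], min_coverage_depth=5))  # False: mean depth 1.2
```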

To further reduce the size of the database, values per feature are compressed rather than saving all positions. The type of compression depends on the type of plots:

  • A Run-Length Encoding (RLE) approach is applied to the continuous plots (features from the Coverage, Paired-reads and Long-reads modules, and the "Coverage reduced" feature of the Phage termini module). RLE stores consecutive genomic positions with similar values as a single entry, preserving the overall signal while substantially reducing storage size. The allowed percentage of variation can be adjusted with the --variation_percentage parameter (default: 50%, i.e. 0.5) for mapping-related features, and with the --contig_variation_percentage parameter (default: 10%, i.e. 0.1) for contig-related features

  • For bar plots (Misalignment and Phage termini modules, except for the "Coverage reduced" feature), only positions with values above a defined percentage of the local coverage are retained. For each position, values are compared to the local coverage and discarded if they fall below the --coverage_percentage threshold (default 10%), ensuring that only meaningful peaks are preserved
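
Both compression schemes can be sketched in a few lines of Python. The sketch rests on assumptions that the text above does not settle: a position joins the current run while its value stays within the allowed variation of the run's first value, the run stores that first value, and thresholds are given as fractions. The function names are hypothetical and the actual rules in theBIGbam may differ.

```python
def rle_compress(values, variation=0.5):
    """Compress a per-position track into (start, end, value) runs.

    Assumption: a run is extended while the value stays within
    +/- variation (as a fraction) of the run's first value.
    """
    runs = []
    run_start, reference = 0, values[0]
    for position in range(1, len(values)):
        if abs(values[position] - reference) > variation * abs(reference):
            runs.append((run_start, position - 1, reference))
            run_start, reference = position, values[position]
    runs.append((run_start, len(values) - 1, reference))
    return runs


def filter_bars(values, coverage, coverage_percentage=0.1):
    """Keep (position, value) pairs reaching the given fraction of local coverage."""
    return [
        (position, value)
        for position, (value, depth) in enumerate(zip(values, coverage))
        if depth > 0 and value / depth >= coverage_percentage
    ]


print(rle_compress([10, 12, 14, 30, 31, 0, 0]))
# [(0, 2, 10), (3, 4, 30), (5, 6, 0)]
print(filter_bars([0, 1, 8, 0], [50, 50, 40, 0]))
# [(2, 8)]
```

In the first call seven positions collapse into three runs; in the second, the single mismatch at depth 50 (2%) is dropped while the eight mismatches at depth 40 (20%) are kept.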

The output is a DuckDB database that is typically 10–100 times smaller than the original BAM files while retaining the essential characteristics of the mapping data. When using theBIGbam only for a GenBank file, the main objective is visualization, as the output database is typically similar in size to the original file.

For more information see the compression section.

Metrics computed per contig and per sample

In addition to per-position information, summary metrics are computed and stored in the database per contig, per sample and per contig–sample pair. These metrics combine the per-position values into average values like the coverage mean to help identify informative contig–sample pairs without requiring specific hypotheses.

Metrics belong to 4 categories:

  • Presence detection

  • Misassembly

  • Microdiversity

  • Topology

A description of all metrics is available in the filters section.

Visualization

Once the database has been computed, it can be visualized interactively using the thebigbam serve command. This starts a local web server that hosts the interactive plots.

Example command:

thebigbam serve --db tests/HK97/HK97.db --port 5006

When accessing the web server (http://localhost:5006), you will be presented with a web interface:

image

Selection panel

One Sample mode

You are initially in the One Sample mode, which allows exploration of all computed features for a single sample. Several sections on the left panel control what is plotted:

  • Filtering: Only contig/sample pairs matching the selected filters are available in the Contigs and Samples sections. For instance, if the contig length filter is set to >10 kbp, only contigs longer than this threshold will appear in the Contigs section, and only samples containing at least one such contig will appear in the Samples section. For the list of available filters, see the filtering page

  • Contigs: Select the contig you want to explore. If sequences and/or annotations were provided when creating the database, genomic features (gene maps, repeats, GC content, GC skew) can be selected for plotting by clicking on the contig features

  • Samples: Select the sample you want to explore

  • Variables: Select the features to plot. You can either use the checkboxes to select all features from a module or click individual features within a module

  • Plotting parameters: You can customize several aesthetic aspects of the plots (e.g. the heights of the genomic feature tracks and mapping-derived plots)

Finally, click Apply to visualize the requested features for the selected contig and sample. Alternatively, click Peruse Data to display tables containing the metrics and feature values.

All Samples mode

All Samples mode enables comparison of a specific feature across multiple samples. Compared to the One Sample mode, the Samples section is omitted, and only a single feature can be selected in the Variables section (e.g. mismatches on the figure above).

Plotting

Genomic tracks are plotted at the top and mapping-derived features below. In the figure above, for instance, you can see the gene map, the sequence track and the codon track, with the mismatch tracks of the displayed samples below them (All Samples mode).

All plots leverage the full capabilities of Bokeh: you can pan, zoom, and hover over specific points to inspect local values. For misalignment tracks, the dominant alternative sequence among the misaligned reads is displayed, along with the potential replacement codon for mismatches (for example a glycine in the figure above).

Buttons in the top-right section allow you to disable pan, zoom, or hover interactions, reset the plots to their original state, and export the current view as a PNG image. In addition, the green button SHOW SUMMARY opens a new html page showing the metrics computed per contig per sample for the contig and samples in display. The blue buttons allow you to download:

  • The metrics relative to the contig (DOWNLOAD CONTIG SUMMARY)

  • The metrics relative to the contig per sample for all samples in display (DOWNLOAD METRICS SUMMARY)

  • All data plotted at the moment (considering all points without adaptive resolution rendering) (DOWNLOAD DATA)

Adaptive resolution rendering

Sequences and contig annotations are only meaningful in a small window: by default, sequences are plotted for windows ≤ 1 kbp and gene maps for windows ≤ 100 kbp.

The level of detail of the other plots is adapted to the viewing window size to ensure responsive plotting:

  • Full resolution (≤ 100 kbp window): All data points are plotted

  • Downsampled view (> 100 kbp window): SQL-side binning reduces the number of points sent to the browser: the visible window is divided into 1000 fixed-width bins and the maximum value per bin is kept to preserve spikes and outliers
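
The downsampling rule can be illustrated in plain Python. theBIGbam performs this binning on the SQL side; the sketch below only mirrors the logic described above (fixed-width bins over the visible window, maximum value kept per bin) and uses hypothetical function names and a list of (position, value) points as input.

```python
def downsample_max(points, window_start, window_end, n_bins=1000):
    """Keep the point with the maximum value in each fixed-width bin."""
    bin_width = (window_end - window_start) / n_bins
    best = {}
    for position, value in points:
        if not window_start <= position < window_end:
            continue
        index = int((position - window_start) / bin_width)
        if index not in best or value > best[index][1]:
            best[index] = (position, value)
    return [best[index] for index in sorted(best)]


def points_for_window(points, window_start, window_end,
                      full_resolution_limit=100_000, n_bins=1000):
    """Return all points for small windows, binned maxima for large ones."""
    if window_end - window_start <= full_resolution_limit:
        return [(p, v) for p, v in points if window_start <= p < window_end]
    return downsample_max(points, window_start, window_end, n_bins)


# A 200 kbp track that is flat except for a single spike.
track = [(position, 10) for position in range(200_000)]
track[123_456] = (123_456, 500)
binned = points_for_window(track, 0, 200_000)
print(len(binned), max(binned, key=lambda point: point[1]))
# 1000 (123456, 500)
```

Keeping the maximum rather than the mean is what lets the spike at position 123456 survive a 200-fold reduction in the number of points.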

The binning thresholds are configurable in the Plotting parameters via 3 spinners:

  • Feature plots without binning (default: 100 kbp)

  • Gene map (bp) (default: 100 kbp)

  • Sequence plots (bp) (default: 1000 bp)

After zooming or panning, click APPLY again to refresh the plots for the current window size. For more information, consult the visualization section.

Mapping

TheBIGbam is not a read aligner: it relies on minimap2 or bwa-mem2 for alignment, applying minimal modifications to generate output compatible with the thebigbam calculate command. The thebigbam mapping-per-sample command produces sorted, indexed BAM files with MD tags.

Default mapping uses minimap2 for short reads while keeping secondary and supplementary reads. The mapper preset can be changed with the --mapper option:

  • minimap2-sr: minimap2 with short-reads preset

  • minimap2-sr-secondary: minimap2 short-read preset, but retains secondary alignments (default)

  • bwa-mem2: BWA-MEM2 for short reads

  • minimap2-ont: minimap2 with Oxford Nanopore preset

  • minimap2-pb: minimap2 with PacBio CLR preset

  • minimap2-hifi: minimap2 with PacBio HiFi preset

  • minimap2-no-preset: minimap2 with no preset (advanced users, parameters can be provided using --minimap2-params instead)

Additional parameters can be passed to minimap2 and bwa-mem2 using the --minimap2-params and --bwa-params options. These parameters take precedence over the preset parameters when different values are provided for the same parameter.

Mapping with circular genome support

The --circular flag is specific to theBIGbam mapping and provides explicit circular-genome support. Each contig is duplicated prior to alignment, enabling seamless mapping across the junction. Artificial secondary and supplementary alignments arising from the duplication are removed, and reads are reassigned to their correct positions before output. This approach preserves consistent coverage at contig ends of circular genomes.

Example command:

thebigbam mapping-per-sample \
  -r1 tests/HK97/HK97_R1_illumina.fastq.gz \
  -r2 tests/HK97/HK97_R2_illumina.fastq.gz \
  -a tests/HK97/HK97_GCF_000848825.1.fasta \
  --circular -o tests/HK97/HK97_illumina_circular.bam

For more details on theBIGbam circular genome support, you can consult the circular mapping page.
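
The coordinate handling behind the contig duplication can be sketched as follows. This is a simplified illustration under stated assumptions, not theBIGbam's implementation: positions are 0-based, alignments are half-open (start, end) intervals on the doubled contig, the function names are hypothetical, and the removal of artificial secondary and supplementary alignments is not modelled.

```python
def double_contig(sequence):
    """Concatenate a contig with itself so reads can align across the junction."""
    return sequence + sequence


def to_circular_position(position, contig_length):
    """Map a 0-based position on the doubled contig back onto the original contig."""
    return position % contig_length


def circular_coverage(contig_length, alignments):
    """Per-position coverage on the original contig from doubled-contig alignments."""
    coverage = [0] * contig_length
    for start, end in alignments:
        for position in range(start, end):
            coverage[to_circular_position(position, contig_length)] += 1
    return coverage


# A read aligned at 8..13 on the doubled version of a 10 bp contig spans the junction.
print(circular_coverage(10, [(8, 13)]))
# [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

On a linear reference the same read would be clipped or split at the contig end, which is what lowers coverage near the ends of circular genomes.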


Additional utilities

Exporting data

Export any metric as a TSV matrix (with contigs as rows and samples as columns):

thebigbam export -d tests/HK97/HK97.db --metric Coverage_mean -o tests/HK97/coverage.tsv

Run thebigbam export -h to see the full list of available metrics.
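
The exported matrix can then be loaded with standard tools. The sketch below uses only the Python standard library and assumes a layout that is not confirmed by this README: a header row with a first label column followed by sample names, one row per contig, and empty cells for missing contig/sample pairs. The function name and the example values are made up for illustration.

```python
import csv
import io


def read_metric_matrix(handle):
    """Read a contig-by-sample TSV matrix into a nested dictionary."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # header row: label column, then sample names
    matrix = {}
    for row in reader:
        contig, cells = row[0], row[1:]
        matrix[contig] = {
            sample: (float(cell) if cell else None)
            for sample, cell in zip(samples, cells)
        }
    return matrix


# Made-up content standing in for an exported file.
example = io.StringIO(
    "contig\tsample_A\tsample_B\n"
    "contig_1\t12.5\t0.0\n"
    "contig_2\t\t3.0\n"
)
matrix = read_metric_matrix(example)
print(matrix["contig_1"]["sample_A"])  # 12.5
```

For a real export, replace the in-memory example with `open("tests/HK97/coverage.tsv", newline="")`.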

Database maintenance

Consult DATABASE.md for instructions on reading and modifying the database after it has been created. The documentation explains how to add, remove, or list samples, contigs, and variables. It also describes how to query the database directly using SQL.


