Compression and interactive exploration of large-scale sequencing alignments with circular mapping support

TheBIGbam is a genome browser and alignment viewer designed for massive metagenomic and metatranscriptomic datasets.

It enables compression and visualization of large genomic annotation (GenBank) and alignment files (BAM). It supports the generation of alignments with explicit circular-genome mapping.

Built with Rust for fast BAM processing and Python + Bokeh for interactive visualization.



Installation

Option 1 (recommended): conda

conda install -c bioconda thebigbam

This also installs samtools, minimap2, bwa-mem2, and BLAST+ (for repeat detection).

Option 2: pip

Make sure you have Python >= 3.10 installed. You can check your Python version with python --version or python3 --version.

Then you can install theBIGbam with pip:

pip install thebigbam

Mapping tools need to be installed independently.

Check installation succeeded

First, check that the main command works:

thebigbam -h

Then run a test with the HK97 example data:

thebigbam calculate \
 -g tests/HK97/HK97_GCF_000848825.1_pharokka.gbk \
 -b tests/HK97/ \
 -o tests/HK97/test.db

Finally, visualize the test data interactively:

thebigbam serve --db tests/HK97/test.db --port 5006

Open your browser at http://localhost:5006

See the installation guide for more detailed instructions.


Main usage

TheBIGbam consists of 3 main steps:

  • (optional) Generation of alignment files for your samples with circular-genome support
  • Generation of a DuckDB database summarizing your genomic and mapping files with Rust
  • Interactive visualization of the DuckDB database content using Python and Bokeh

Quick usage with HK97 test data

# only available if you downloaded the mapping dependencies
thebigbam mapping-per-sample \
  -r1 tests/HK97/HK97_R1_illumina.fastq.gz \
  -r2 tests/HK97/HK97_R2_illumina.fastq.gz \
  -a tests/HK97/HK97_GCF_000848825.1.fasta \
  --circular -o tests/HK97/HK97_illumina_circular.bam

thebigbam calculate \
  -b tests/HK97 \
  -g tests/HK97/HK97_GCF_000848825.1_pharokka.gbk \
  -m coverage,misalignment \
  -o tests/HK97/HK97.db \
  -t 4

thebigbam serve --db tests/HK97/HK97.db --port 5006

For more complex examples see the usage page.

Database computation

The thebigbam calculate command converts large BAM files and associated annotated assemblies into a compact, queryable DuckDB database.

Example command to compute the database for a single sample containing paired-end short reads mapped to the reference genome of phage HK97:

thebigbam calculate \
  -b tests/HK97 \
  -g tests/HK97/HK97_GCF_000848825.1_pharokka.gbk \
  -m coverage,misalignment \
  -o tests/HK97/HK97.db \
  -t 4

For more complex examples see the usage page.

What input files do I need?

You need to provide at least one of the following:

  • BAM mapping files (-b)

  • A GenBank annotation file (-g)

If only an annotation file is provided, contig-level data (annotations, GC content, repeats) is calculated without any sample-level mapping features.

If only BAM files are provided, an assembly file of contigs (FASTA format) can be supplied with -a to allow the computation of sequence-dependent, mapping-derived features.

Alignment files

Parameter: --bam_files DIRECTORY, short-version -b

Your mapping files need to be sorted BAM files with MD tags. If you have mapping files but not in the right format, SAMtools is your friend!

# to convert a SAM/BAM file into a sorted BAM file
samtools view -bS example.sam | samtools sort -o example.sorted.bam

# to add an index file to your BAM file
samtools index example.sorted.bam

# to add MD tags to your BAM file 
# you also need the fasta file used during the mapping step
samtools calmd -b example.sorted.bam ref.fasta > example.sorted.md.bam

Alternatively, you can produce your alignment files directly in theBIGbam as specified in the Mapping section.

Annotation file

Parameter: --genbank FILE, short-version -g

The annotation file should be in GenBank (.gbk, .gbff, .gb) or GFF3 (.gff, .gff3) format, made with the tool of your choice: bakta for bacteria, pharokka or phold for phages, eggnog-mapper for eukaryotes, etc.

Examples of commands to generate such annotations are available in the usage page.

Which features can I calculate?

Parameter (optional): --modules COMMA-SEPARATED LIST, short version -m

When BAM files are provided, theBIGbam performs fast Rust-based computations on them to extract relevant values. Individual read information is discarded in favor of lightweight per-position averages for each contig in each sample.

All mapping-derived modules are computed and stored in the database unless you provide a specific subset of modules. 5 mapping-derived modules exist at the moment:

  • Coverage: computes per-position coverage for primary, secondary, and supplementary reads, as well as the mapping quality (MAPQ)
  • Misalignment: computes per-position number of clippings, insertions, deletions and mismatches
  • Long-reads: computes per-position average length of reads
  • Paired-reads: computes per-position average insert size of reads along with the number of incorrect pair orientations (non-inward pairs, mates unmapped or mapping on another contig)
  • Phage termini: computes per-position coverage for primary reads starting with an exact match (a short clipping < 5 bp is tolerated). Among those reads, the number of mapped reads starting and ending at each position is computed. This module requires contig sequences to be provided

When contig sequences are provided, the Genome module is computed. It calculates GC content, GC skew and the repeats contained within each contig using an autoblast (a BLAST search of the contigs against themselves). If annotations are available (GenBank file provided), contig annotations (e.g. positions of the coding sequences and their functions) are also saved.
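As an illustration of the two sequence-derived values mentioned above, here is a minimal Python sketch of GC content and GC skew computed over fixed windows. It uses the standard definitions (GC content = (G+C)/(A+C+G+T), GC skew = (G−C)/(G+C)); the window size, the step, and the function names are assumptions made for this example and may not match what theBIGbam does internally.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_skew(seq: str) -> float:
    """Standard GC skew: (G - C) / (G + C), or 0.0 when there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def gc_profile(seq: str, window: int = 4, step: int = 4):
    """Return (window_start, gc_content, gc_skew) for each window.

    The window and step values are hypothetical defaults chosen for the example.
    """
    last_start = max(len(seq) - window + 1, 1)
    return [
        (start, gc_content(seq[start:start + window]), gc_skew(seq[start:start + window]))
        for start in range(0, last_start, step)
    ]


profile = gc_profile("GGGCATAT", window=4, step=4)
```

In this toy sequence the first window (GGGC) is entirely G/C with a positive skew, while the second window (ATAT) contains no G or C at all.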

A more detailed explanation of the modules and the features they contain is available in the features section.

Database compression

Parameters (optional): --min_aligned_fraction, --min_coverage_depth, --variation_percentage, --contig_variation_percentage, --coverage_percentage

Discarding the reads and keeping only the main features of the mappings (such as the coverage per position) already makes the DuckDB database much smaller than the original BAM files. The database itself is also structured to be as compact as possible.

First, the database is organised per contig per sample (referred to hereafter as a contig/sample pair). Only pairs in which the contig is present in the sample are stored in the database. The definition of presence can be tuned via two parameters:

  • --min_aligned_fraction controls the minimum percentage of positions that received reads (default 50%, meaning a contig is considered present only if more than half of it received reads)

  • --min_coverage_depth sets the minimum mean coverage depth required for contig inclusion (default 0, i.e. disabled — set to e.g. 5 to filter out contigs with very low depth that produce noisy signals).
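The presence test described by these two parameters can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the thresholds are expressed as a fraction and a mean depth, and whether the real comparison is inclusive at the boundary is not specified here.

```python
def is_present(depths, min_aligned_fraction=0.5, min_coverage_depth=0.0):
    """Decide whether a contig counts as present in a sample.

    depths: per-position read depth along the contig.
    Assumption: a position "received reads" when its depth is > 0, and
    both thresholds are treated as inclusive lower bounds.
    """
    if not depths:
        return False
    aligned_fraction = sum(1 for d in depths if d > 0) / len(depths)
    mean_depth = sum(depths) / len(depths)
    return aligned_fraction >= min_aligned_fraction and mean_depth >= min_coverage_depth


well_covered = is_present([3, 2, 0, 0, 1, 4])                   # 4 of 6 positions covered
mostly_empty = is_present([1, 0, 0, 0])                         # 1 of 4 positions covered
too_shallow = is_present([1, 1, 1, 0], min_coverage_depth=5)    # mean depth 0.75
```

Raising min_coverage_depth in this sketch rejects the third contig even though most of its positions are covered, which mirrors the stated purpose of filtering out low-depth, noisy contigs.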

To further reduce the size of the database, values per feature are compressed rather than saving all positions. The type of compression depends on the type of plots:

  • A Run-Length Encoding approach (RLE) is applied to the continuous plots (features from the Coverage, Paired-reads and Long-reads modules, and the "Coverage reduced" feature of the Phage termini module). RLE stores consecutive genomic positions with similar values as a single entry, preserving the overall signal while substantially reducing storage size. The allowed percentage of variation can be adjusted using the --variation_percentage parameter (default: 50%, i.e. 0.5) for mapping-related features, and the --contig_variation_percentage parameter (default: 10%, i.e. 0.1) for contig-related features

  • Only positions with values above a defined percentage of the local coverage are retained for Bar plots (Misalignment and Phage termini module except for "Coverage reduced" feature). For each position, values are compared to the local coverage and discarded if they fall below the --coverage_percentage threshold (default 10%), ensuring that only meaningful peaks are preserved
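The two compression strategies above can be sketched in Python as follows. This is a simplified illustration, not theBIGbam's Rust implementation: it assumes that a run is extended while a value stays within the allowed variation of the first value of the run, and that a bar value is kept when it reaches the given fraction of the local coverage. Function names and the run representation are hypothetical.

```python
def rle_compress(values, variation=0.5):
    """Tolerant run-length encoding.

    Returns (start, end, value) runs with inclusive 0-based positions.
    Assumption: a value joins the current run when it differs from the
    run's first value by at most `variation` times that value.
    """
    runs = []
    for pos, value in enumerate(values):
        if runs:
            start, _, ref = runs[-1]
            if abs(value - ref) <= variation * abs(ref):
                runs[-1] = (start, pos, ref)  # extend the current run
                continue
        runs.append((pos, pos, value))  # open a new run
    return runs


def keep_peaks(counts, coverage, coverage_percentage=0.1):
    """Keep only positions whose count reaches a fraction of the local coverage."""
    return {
        pos: count
        for pos, (count, cov) in enumerate(zip(counts, coverage))
        if cov > 0 and count >= coverage_percentage * cov
    }


runs = rle_compress([10, 11, 12, 30, 31, 0, 0], variation=0.5)
peaks = keep_peaks([0, 1, 5, 20], [100, 100, 100, 100], coverage_percentage=0.1)
```

In this example seven coverage values collapse into three runs, and only the position where 20 of 100 reads carry the event survives the bar-plot filter.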

The output is a DuckDB database that is typically 10–100 times smaller than the original BAM files while retaining the essential characteristics of the mapping data. When using theBIGbam only for a GenBank file, the main objective is visualization, as the output database is typically similar in size to the original file.

For more information see the compression section.

Metrics computed per contig and per sample

In addition to per-position information, summary metrics are computed and stored in the database per contig, per sample and per contig–sample pair. These metrics combine the per-position values into average values like the coverage mean to help identify informative contig–sample pairs without requiring specific hypotheses.

Metrics belong to 4 categories:

  • Presence detection

  • Misassembly

  • Microdiversity

  • Topology

A description of all metrics is available in the filters section.

Visualization

Once the database has been computed, it can be visualized interactively using the thebigbam serve command. This starts a local web server that hosts the interactive plots.

Example command:

thebigbam serve --db tests/HK97/HK97.db --port 5006

When accessing the web server (http://localhost:5006), you will be presented with a web interface:

[Figure: screenshot of the theBIGbam web interface]

Selection panel

One Sample mode

You are initially in the One Sample mode, which allows exploration of all computed features for a single sample. Several sections on the left panel control what is plotted:

  • Filtering: Only contig/sample pairs matching the selected filters are available in the Contigs and Samples sections. For instance, if the contig length filter is set to >10 kbp, only contigs longer than this threshold will appear in the Contigs section, and only samples containing at least one such contig will appear in the Samples section. To see the list of available filters, have a look at the filtering page

  • Contigs: Select the contig you want to explore. If sequences and/or annotations were provided when creating the database, genomic features (gene maps, repeats, GC content, GC skew) can be selected for plotting by clicking on the contig features

  • Samples: Select the sample you want to explore

  • Variables: Select the features to plot. You can either use the checkboxes to select all features from a module or click individual features within a module

  • Plotting parameters: You can customize several aesthetic aspects of the plots (e.g. the heights of the genomic feature tracks and mapping-derived plots)

Finally, click Apply to visualize the requested features for the selected contig and sample. Alternatively, click Peruse Data to display tables containing the metrics and feature values.

All Samples mode

All Samples mode enables comparison of a specific feature across multiple samples. Compared to the One Sample mode, the Samples section is omitted, and only a single feature can be selected in the Variables section (e.g. mismatches on the figure above).

Plotting

Genomic tracks are plotted at the top and mapping-derived features below. On the figure, for instance, you can see the gene map, the sequence track and the codon track. The mismatch tracks for the displayed samples (All Samples mode) are shown below them.

All plots leverage the full capabilities of Bokeh: you can pan, zoom, and hover over specific points to inspect local values. For misalignment tracks, the dominant alternative sequence among the misaligned reads is displayed, along with the potential replacement codon for mismatches (for example a Glycine on the figure above).

Buttons in the top-right section allow you to disable pan, zoom, or hover interactions, reset the plots to their original state, and export the current view as a PNG image. In addition, the green button SHOW SUMMARY opens a new HTML page showing the metrics computed per contig per sample for the contig and samples in display. The blue buttons allow you to download:

  • The metrics relative to the contig (DOWNLOAD CONTIG SUMMARY)

  • The metrics relative to the contig per sample for all samples in display (DOWNLOAD METRICS SUMMARY)

  • All data plotted at the moment (considering all points without adaptive resolution rendering) (DOWNLOAD DATA)

Adaptive resolution rendering

Sequences and contig annotations only make sense when looking at a small window: by default, sequences are plotted for windows ≤ 1 kbp and gene maps for windows ≤ 100 kbp.

The level of detail of the other plots is adapted to the viewing window size to ensure responsive plotting:

  • Full resolution (≤ 100 kbp window): All data points are plotted

  • Downsampled view (> 100 kbp window): SQL-side binning reduces the number of points sent to the browser: the visible window is divided into 1000 fixed-width bins and the maximum value per bin is kept to preserve spikes and outliers
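The downsampled view can be illustrated with a pure-Python equivalent of max-per-bin reduction. theBIGbam performs this step on the SQL side; the sketch below only mirrors the idea, and the function name, the half-open window convention, and the choice of reporting the position of the maximum (rather than, say, the bin centre) are assumptions.

```python
def downsample_max(positions, values, window_start, window_end, n_bins=1000):
    """Reduce points in [window_start, window_end) to at most n_bins points.

    The window is split into n_bins fixed-width bins and, for each non-empty
    bin, the point with the largest value is kept so that spikes survive.
    """
    if window_end <= window_start or n_bins <= 0:
        return []
    width = (window_end - window_start) / n_bins
    best = {}  # bin index -> (position, value) of the current maximum
    for pos, val in zip(positions, values):
        if not (window_start <= pos < window_end):
            continue
        bin_index = min(int((pos - window_start) / width), n_bins - 1)
        if bin_index not in best or val > best[bin_index][1]:
            best[bin_index] = (pos, val)
    return [best[b] for b in sorted(best)]


reduced = downsample_max(
    positions=list(range(10)),
    values=[1, 5, 2, 2, 9, 3, 0, 0, 7, 1],
    window_start=0,
    window_end=10,
    n_bins=5,
)
```

Keeping the maximum rather than the mean is what preserves the isolated peaks at positions 1, 4 and 8 in this toy example.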

The binning thresholds are configurable in the Plotting parameters via 3 spinners:

  • Feature plots without binning (default: 100 kbp)

  • Gene map (bp) (default: 100 kbp)

  • Sequence plots (bp) (default: 1000 bp)

When zooming or panning, you need to re-click APPLY to refresh the plots with the current window size. For more information consult the visualization section.

Mapping

TheBIGbam is not a read aligner: it relies on minimap2 or bwa-mem2 for alignment, applying minimal modifications to generate output compatible with the thebigbam calculate command. The thebigbam mapping-per-sample command produces sorted, indexed BAM files with MD tags.

The default mapping uses minimap2 for short reads while keeping secondary and supplementary reads. The mapper preset can be changed with the --mapper option:

  • minimap2-sr: minimap2 with short-reads preset

  • minimap2-sr-secondary: minimap2 short-read preset, but retains secondary alignments (default)

  • bwa-mem2: BWA-MEM2 for short reads

  • minimap2-ont: minimap2 with Oxford Nanopore preset

  • minimap2-pb: minimap2 with PacBio CLR preset

  • minimap2-hifi: minimap2 with PacBio HiFi preset

  • minimap2-no-preset: minimap2 with no preset (for advanced users; parameters can be provided using --minimap2-params instead)

Additional parameters can be passed to minimap2 and bwa-mem2 using the --minimap2-params and --bwa-params options. These parameters take precedence over the preset parameters when both set a value for the same parameter.

Mapping with circular genome support

The --circular flag is specific to theBIGbam mapping and provides explicit circular-genome support. Each contig is duplicated prior to alignment, enabling seamless mapping across the junction. Artificial secondary and supplementary alignments arising from the duplication are removed, and reads are reassigned to their correct positions before output. This approach preserves consistent coverage at contig ends of circular genomes.
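The coordinate logic behind the contig-duplication idea can be sketched in Python. This is a conceptual illustration only: the function names are hypothetical, alignments are assumed to be no longer than the contig, and the removal of artificial secondary and supplementary alignments is not modelled.

```python
def double_contig(seq: str) -> str:
    """Concatenate a contig with itself so that reads can span the junction."""
    return seq + seq


def to_circular_position(pos_on_doubled: int, contig_length: int) -> int:
    """Map a 0-based position on the doubled contig back to the original contig."""
    return pos_on_doubled % contig_length


def split_alignment(start: int, length: int, contig_length: int):
    """Project an alignment from the doubled contig onto the original one.

    Returns half-open (start, end) intervals; an alignment that crosses the
    junction is reported as two pieces (end of contig, then start of contig).
    """
    start = to_circular_position(start, contig_length)
    end = start + length
    if end <= contig_length:
        return [(start, end)]
    return [(start, contig_length), (0, end - contig_length)]


contig = "ACGT"
junction_read = "TA"  # last base followed by first base of the circular contig
hit_linear = contig.find(junction_read)                  # not found on the linear contig
hit_doubled = double_contig(contig).find(junction_read)  # found across the junction
pieces = split_alignment(start=8, length=4, contig_length=10)
```

The junction-spanning read cannot be placed on the linear contig but aligns end to end on the doubled one, after which its coordinates are folded back with a modulo.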

Example command:

thebigbam mapping-per-sample \
  -r1 tests/HK97/HK97_R1_illumina.fastq.gz \
  -r2 tests/HK97/HK97_R2_illumina.fastq.gz \
  -a tests/HK97/HK97_GCF_000848825.1.fasta \
  --circular -o tests/HK97/HK97_illumina_circular.bam

For more details on theBIGbam circular genome support, you can consult the circular mapping page.


Additional utilities

Exporting data

Export any metric as a TSV matrix (with contigs as rows and samples as columns):

thebigbam export -d tests/HK97/HK97.db --metric Coverage_mean -o tests/HK97/coverage.tsv

Run thebigbam export -h to see the full list of available metrics.
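Once exported, such a matrix can be loaded with Python's standard csv module. The sketch below assumes a layout that is not confirmed here: a header row whose first cell labels the contig column and whose remaining cells are sample names, with empty cells for contig/sample pairs absent from the database. The sample data and names are made up for the example.

```python
import csv
import io


def read_metric_matrix(handle):
    """Read a contigs-by-samples TSV into {contig: {sample: value or None}}.

    Assumed layout: first row is a header (contig label, then sample names);
    empty cells are interpreted as missing values.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue
        contig, cells = row[0], row[1:]
        matrix[contig] = {
            sample: (float(cell) if cell != "" else None)
            for sample, cell in zip(samples, cells)
        }
    return matrix


# Hypothetical content standing in for an exported file such as coverage.tsv
example = "contig\tsampleA\tsampleB\ncontig_1\t12.5\t\ncontig_2\t0.0\t3.25\n"
matrix = read_metric_matrix(io.StringIO(example))
```

With a real export, the StringIO object would be replaced by an opened file, for example open("tests/HK97/coverage.tsv", newline="").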

Database maintenance

Consult DATABASE.md for instructions on reading and modifying the database after it has been created. The documentation explains how to add, remove, or list samples, contigs, and variables. It also describes how to query the database directly using SQL.

