
A samtools wrapper for CRAM conversion automation.

Introduction

cram-archiver was written to help with an archival task where a substantial volume of BAM files needed to be converted to CRAM in order to save disk space.

Features:

  • Automated recursive discovery of all .bam files in a directory.

  • Multiple reference support. cram-archiver loads FASTA-indexed references and uses the contig and length information in each BAM header to check that the BAM file is matched to the appropriate reference.

  • Performs CRAM conversion using samtools view.

  • Runs samtools checksum --all on both the BAM and the CRAM file and checks that the checksums match.

  • On by default: writes checksum files for manual verification.

  • On by default: writes CRAM indexes.

  • Optional: deletes BAM file after conversion.

  • Optional: Set a minimum age in days for the BAM file’s last modified time. If the file is “older” than the set number of days, the file will be converted.
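The reference check described in the feature list can be pictured with a small sketch. This is an illustration under assumptions, not cram-archiver's actual code: the function names are hypothetical, the index text is assumed to be a standard .fai file (tab-separated, contig name and length in the first two columns), and the header text is assumed to be SAM header text such as `samtools view -H` prints.

```python
def parse_fai(fai_text):
    """Return {contig_name: length} from the text of a .fai index."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def parse_header_contigs(sam_header_text):
    """Return {contig_name: length} from the @SQ lines of a SAM header."""
    contigs = {}
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like "SN:chr1" or "LN:1000".
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def pick_reference(header_contigs, references):
    """Return the first reference that contains every header contig with
    the same length, or None if no reference fits.

    `references` maps a reference path to its {contig: length} dict.
    A header without any contigs never matches (an assumption of this sketch).
    """
    if not header_contigs:
        return None
    for ref_path, ref_contigs in references.items():
        if all(ref_contigs.get(name) == length
               for name, length in header_contigs.items()):
            return ref_path
    return None
```

With several references loaded, a BAM file whose header contigs fit none of them yields None in this sketch, so it could be skipped rather than converted against the wrong reference.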

Caveats

CRAM was never intended or built as a “pure” archival format with bit-for-bit reproducibility. As a result, it is impossible to get the original BAM file back from a CRAM file alone. There are several reasons for this:

  • BAM files are by definition always bgzip compressed using the DEFLATE algorithm. Different DEFLATE implementations can produce different compressed output from the same input.

  • When a BAM file and its derived CRAM file are both converted to SAM, the two SAM files can differ as well:

    • MD and NM tags are not stored in CRAM files but always calculated on the fly when decoding. If the MD and NM tags were not present in the original BAM, this can cause differences.

    • The order of tags might be different.

    • M, = and X in CIGAR strings. = means that the read base is the same as the reference at this position. X means a mismatch at this position. M means an alignment match (no indel), but gives no information on whether the position is X or =. Since X and = can be derived from the sequence, the extra information is redundant and CRAM stores everything as M. This can give rise to differences.

    • Redundant information in BAM files, such as MAPQ values or CIGAR strings on unaligned reads. This does not get stored.

    • Errors, such as wrong mate pair information. Some of it may be fixed during the CRAM conversion.
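To make the CIGAR point concrete, here is a minimal sketch of how = and X operations could be collapsed into M before comparing two alignments. The function name is hypothetical, and merging neighbouring runs after the rewrite is an assumption of this sketch about what a fair comparison needs, not a description of CRAM internals.

```python
import re

# One CIGAR operation: a run length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def collapse_cigar(cigar):
    """Rewrite '=' and 'X' as 'M' and merge neighbouring runs of the same op."""
    if cigar == "*":  # no CIGAR available
        return cigar
    merged = []
    for length, op in CIGAR_OP.findall(cigar):
        op = "M" if op in "=X" else op
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)
        else:
            merged.append([int(length), op])
    return "".join(f"{length}{op}" for length, op in merged)
```

For example, `collapse_cigar("5=1X4=")` returns `"10M"`, so a BAM record written with = and X compares equal to the M-only CIGAR of the corresponding CRAM record.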

To ensure the CRAM file is “functionally the same” as the BAM file, the samtools checksum tool is run with the --all flag. For more information about comparing BAM and CRAM, check out the discussion here.

Quickstart

Converting a single BAM file:

cram-archiver -r my_reference.fasta my.bam

This will create my.cram, my.cram.crai, my.cram.checksum and my.bam.checksum. Checksum file creation can be turned off with --dont-write-checksums. The checksums will still be checked, just not written to disk.

Archiving a directory of BAM files, converting only those with a last modified time older than 30 days. The directory contains both hg19 and hg38 BAM files, so both references are given:

cram-archiver --reference hg19.fasta --reference hg38.fasta --minimum-age-days 30 my_directory

If the --delete flag is added, all the converted BAM files will be deleted and only the CRAM files remain. This only happens when the conversion is successful and the checksums match.
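The directory example above combines recursive discovery with an age filter. A rough sketch of that selection step follows; `find_bams` and its `now` parameter are hypothetical names introduced for illustration, and a day is taken to be exactly 24x60x60 seconds, as the help text for --minimum-age-days states.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def find_bams(path, minimum_age_days=0, now=None):
    """Yield BAM files at or under `path` that were last modified at least
    `minimum_age_days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - minimum_age_days * SECONDS_PER_DAY
    root = Path(path)
    # A single file is checked directly; a directory is searched recursively.
    candidates = [root] if root.is_file() else sorted(root.rglob("*.bam"))
    for bam in candidates:
        if bam.is_file() and bam.stat().st_mtime <= cutoff:
            yield bam
```

Passing `now` explicitly keeps the function easy to test; in normal use it can be left out so that the current system time is used.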

Usage

usage: cram-archiver [-h] -r REFERENCE [-t THREADS] [-d MINIMUM_AGE_DAYS]
                     [--delete] [--cram-version CRAM_VERSION]
                     [--dont-write-checksums] [--dont-write-index]
                     [--dry-run] [-v] [-q]
                     PATH

positional arguments:
  PATH                  Path to BAM file or directory to be recursively
                        searched.

options:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        Reference to be used for CRAM conversion. Can be
                        used multiple times. Reference will be checked
                        with the BAM file.
  -t THREADS, --threads THREADS
                        The number of threads used for conversion and
                        checksumming. Default: 1.
  -d MINIMUM_AGE_DAYS, --minimum-age-days MINIMUM_AGE_DAYS
                        The minimum last modification of the BAM file in
                        days prior. This assumes the system clock timezone
                        matches that of the file while also assuming that
                        every day has 24x60x60 seconds. Default 0.
  --delete              Delete BAM files after successful conversion.
  --cram-version CRAM_VERSION
                        CRAM version to use for CRAM conversion. Default:
                        3.0.
  --dont-write-checksums
                        Do not store samtools checksum output on disk.
  --dont-write-index    Do not write index files for CRAM files.
  --dry-run             Print the paths of the to be archived BAM files.
                        Perform no actions.
  -v, --verbose         Display more logging information.
  -q, --quiet           Display less logging information.

On CRAM format settings

cram-archiver uses version 3.0 of the CRAM standard by default, because CRAM version 3.0 is better supported than version 3.1. CRAM version 3.1 comes with newer codecs and can achieve smaller file sizes as a result. For more information, check out the article on advances in CRAM by James Bonfield.

cram-archiver uses the default CRAM settings. CRAM offers several presets: fast, normal, small and archive. However, the size difference between normal and archive is quite small (archive was less than 6% smaller in our tests). On top of that, the memory requirements rise steeply, especially for very long read alignments from ONT data.
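As an illustration of how these settings could reach samtools, the sketch below builds a `samtools view` command line with an explicit CRAM version. The function name and the mapping of its parameters to samtools options are assumptions of this sketch; the exact command cram-archiver runs may differ. No preset is passed, so samtools falls back to its own defaults.

```python
def build_view_command(bam, cram, reference, threads=1, cram_version="3.0"):
    """Return an argv list that converts `bam` to `cram` with samtools view."""
    return [
        "samtools", "view",
        "-C",                    # write CRAM output
        "-T", str(reference),    # reference FASTA used for encoding
        "-@", str(threads),      # thread count passed on to samtools
        "--output-fmt-option", f"version={cram_version}",
        "-o", str(cram),
        str(bam),
    ]
```

The resulting list can be handed to `subprocess.run(command, check=True)`, which avoids shell quoting problems with unusual file names.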

Acknowledgements

A huge thank you to James Bonfield (@jkbonfield) for providing a lot of information and background about CRAM and its tooling. This was invaluable for creating this project. James Bonfield has also put a lot of effort into making CRAM the very usable format it is today, for which we are very grateful.


