remove genetic variation from sequencing data
BAMboozle.py: de-identification of sequencing reads
BAMboozle.py is a tool that can remove genetic variation from sequencing reads stored in BAM file format to protect the privacy and genetic information of donor individuals.
Installation
BAMboozle.py is available through PyPI. To install, run the following command (add -U to upgrade an existing installation):
pip install BAMboozle
Alternatively, you can install from this GitHub repository for the latest version:
pip install git+https://github.com/sandberg-lab/dataprivacy
Add --user if you don't have write permissions in your default Python folder.
Usage
BAMboozle.py requires only an aligned .bam file and the reference genome in fasta format.
The fasta file should be indexed (samtools faidx).
The .bam file should be coordinate-sorted and indexed; if it is not, BAMboozle.py will try to sort and index it for you.
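To check the inputs before a run, a small helper can look for the index files. This is only a sketch: the function name `missing_indexes` is hypothetical, and it assumes the usual samtools naming conventions (`<fasta>.fai` for the fasta index, `<bam>.bai` or `<name>.bai` for the BAM index). It does not check whether the BAM file is coordinate-sorted.

```python
from pathlib import Path


def missing_indexes(bam_path, fasta_path):
    """Return the expected index files that are not present on disk."""
    bam = Path(bam_path)
    fasta = Path(fasta_path)
    missing = []

    # samtools faidx writes <fasta>.fai next to the fasta file.
    fai = Path(str(fasta) + ".fai")
    if not fai.exists():
        missing.append(fai)

    # samtools index writes <bam>.bai; some tools write <name>.bai instead.
    bai_candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    if not any(candidate.exists() for candidate in bai_candidates):
        missing.append(bai_candidates[0])

    return missing


if __name__ == "__main__":
    # File names here are placeholders.
    for path in missing_indexes("sample.bam", "genome.fa"):
        print(f"missing index: {path}")
```

An empty list means both index files were found.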
usage: BAMboozle [-h] [--bam FILENAME] [--out FILENAME] [--fa FILENAME]
[--p P] [--strict] [--keepsecondary] [--keepunmapped]
optional arguments:
-h, --help show this help message and exit
--bam FILENAME Path to input BAM file
--out FILENAME Path to output bam file
--fa FILENAME Path to genome reference fasta
--p P Number of processes to use
--strict Strict: also sanitize mapping score & auxiliary tags (eg. AS / NH).
--keepsecondary Keep secondary alignments in output bam file.
--keepunmapped Keep unmapped reads in output bam file.
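When BAMboozle is called from a pipeline, the command line can be assembled in Python. The sketch below uses only the flags listed in the help text above; the helper name `build_bamboozle_command` and the file names are hypothetical.

```python
import shlex


def build_bamboozle_command(bam, out, fa, processes=None, strict=False,
                            keep_secondary=False, keep_unmapped=False):
    """Assemble a BAMboozle argument list from the documented options."""
    cmd = ["BAMboozle", "--bam", str(bam), "--out", str(out), "--fa", str(fa)]
    if processes is not None:
        cmd += ["--p", str(processes)]
    if strict:
        cmd.append("--strict")
    if keep_secondary:
        cmd.append("--keepsecondary")
    if keep_unmapped:
        cmd.append("--keepunmapped")
    return cmd


if __name__ == "__main__":
    cmd = build_bamboozle_command(
        "sample.bam", "sample.sanitized.bam", "genome.fa",
        processes=4, strict=True,
    )
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to execute the tool.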
Description
BAMboozle sanitizes sequence reads to provide privacy protection and facilitate data sharing.
The BAMboozle procedure replaces the observed read sequence with the reference genome sequence and sanitizes auxiliary tags.
Here is an overview of the sequence correction strategy:
- SNPs: Mismatches to the reference (either explicitly X coded in the CIGAR value or within M matched segments) are replaced by the reference base.
- Insertions: The inserted bases are dropped and the read is extended with reference sequence by the length of the insertion, keeping the 5' mapping position constant.
- Deletions: The missing reference sequence is inserted into the read while removing an equal number of bases from the 3' end.
- Clipping: Soft- or hard-clipped bases (CIGAR: S / H) are replaced by the matching reference sequence. For single-end reads that start with clipped bases, the reference position of the read start is adjusted. This is not possible for paired-end reads because it would invalidate the mate-pair information (TLEN and PNEXT fields); instead, the clipped portion is added to the end of the read.
- Splicing: Splicing is observed and splice-sites are conserved even in the case of deletions and insertions.
- Multimapping: By default, only primary alignments are emitted. The user can choose to keep secondary alignments, but note that anonymization cannot be guaranteed in that case.
- Unmapped reads: Unmapped reads cannot be sanitized and are discarded in default settings.
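The core of the strategy above can be illustrated with a simplified sketch. This is not BAMboozle's actual code: the function `sanitize_read` is hypothetical, it works on plain strings rather than BAM records, and it assumes a forward-strand read whose leftmost mapping position stays fixed. Soft-clipped bases are counted toward the read length, which in effect appends them at the end of the read (the paired-end behaviour described above), and hard clips are ignored.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def sanitize_read(reference, pos, cigar):
    """Return (sequence, cigar) for a reference-identical version of a read.

    reference: sequence of the contig the read maps to
    pos:       0-based leftmost mapping position
    cigar:     CIGAR string of the original alignment
    """
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]

    # Read length is preserved: count every operation that consumes read bases.
    remaining = sum(length for length, op in ops if op in "M=XIS")

    # Record splice junctions (N) as reference intervals so they are conserved.
    introns = []
    ref_pos = pos
    for length, op in ops:
        if op == "N":
            introns.append((ref_pos, ref_pos + length))
        if op in "M=XDN":
            ref_pos += length

    # Walk the reference from the fixed start, skipping the recorded introns,
    # until the original number of read bases has been emitted.
    seq_parts, cigar_parts = [], []
    ref_pos = pos
    for start, end in introns:
        block = min(remaining, start - ref_pos)
        seq_parts.append(reference[ref_pos:ref_pos + block])
        cigar_parts.append(f"{block}M")
        remaining -= block
        if remaining == 0:
            break
        cigar_parts.append(f"{end - start}N")
        ref_pos = end
    if remaining:
        seq_parts.append(reference[ref_pos:ref_pos + remaining])
        cigar_parts.append(f"{remaining}M")

    return "".join(seq_parts), "".join(cigar_parts)


if __name__ == "__main__":
    ref = "ACGTACGTAC" + "GGGGG" + "TTTTTTTTTT"
    # A read with one insertion, one deletion and one splice junction.
    print(sanitize_read(ref, 2, "3M1I2M1D2M5N4M"))
```

Because the insertion and the deletion are each one base long, the junction stays at the same reference coordinates and the sanitized alignment reduces to match blocks separated by the original splice gap.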
Donor-related information could also be inferred from standard bam fields and auxiliary tags:
- The CIGAR value is matched to the BAMboozled sequence (e.g. 100M).
- The MD tag, if present, is matched to the BAMboozled sequence (e.g. 100).
- NM and nM tags are sanitized by replacement with 0.
- Tags containing information on the alignment are discarded (MC, XN, XM, XO, XG)
In --strict mode, the following fields and tags are also changed:
- Mapping quality set to max/unavailable (255)
- AS and MQ are set to read length
- NH is set to 1
- The following tags are discarded: HI, IH, H1, H2, OA, OC, OP, OQ, SA, SM, XA, XS
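The tag rules can be written down as a small function over a dictionary of tags. This is a sketch rather than the tool's implementation: `sanitize_tags` is a hypothetical name, tags are modelled as a plain dict, and NM/nM are set to 0 on the assumption that a reference-identical read has an edit distance of zero. Mapping quality is a BAM field rather than a tag, so it is not handled here.

```python
# Tags removed in every run.
DISCARD_ALWAYS = {"MC", "XN", "XM", "XO", "XG"}
# Tags removed only in --strict mode.
DISCARD_STRICT = {"HI", "IH", "H1", "H2", "OA", "OC", "OP", "OQ",
                  "SA", "SM", "XA", "XS"}


def sanitize_tags(tags, read_length, strict=False):
    """Return a sanitized copy of a read's auxiliary tags."""
    clean = {}
    for name, value in tags.items():
        if name in DISCARD_ALWAYS:
            continue
        if strict and name in DISCARD_STRICT:
            continue
        if name in ("NM", "nM"):
            value = 0                  # assumed: no mismatches remain
        elif name == "MD":
            value = str(read_length)   # e.g. "100" for a 100 bp read
        elif strict and name in ("AS", "MQ"):
            value = read_length
        elif strict and name == "NH":
            value = 1
        clean[name] = value
    return clean


if __name__ == "__main__":
    tags = {"NM": 3, "MD": "50A49", "MC": "100M", "AS": 95, "NH": 4, "RG": "lib1"}
    print(sanitize_tags(tags, read_length=100))
    print(sanitize_tags(tags, read_length=100, strict=True))
```

Tags that are not named in either list, such as the read group in the example, pass through unchanged.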
The output bam file will also contain a @PG line reflecting the invoked command line.
Reference
https://www.biorxiv.org/content/10.1101/2021.01.11.426206v1
FAQ
Help! I am getting the following error message:
ERROR: Could not find a version that satisfies the requirement BAMboozle (from versions: none)
ERROR: No matching distribution found for BAMboozle
Make sure that you are using pip from a Python 3 installation. Try pip3 install BAMboozle instead.