Analyse sam file and keep the unique aligment record.
Unique-Sam is a simple command line tool to remove the duplicated alignments in the SAM file. If the MAPQ field of the alignment is available, unique-sam will keep one and only one alignment with the highest score. Otherwise, unique-sam will calculate a score according to the alignment’s MD or CIGAR field and use the calculated value to remove the duplicated alignments.
Install with the source code, in the source folder:
python setup.py install
If you have **pip** installed, you can simply run
pip install unique-sam
After installation you can access unique-sam from your command line.
unique-sam need a SAM format file to run properly. Before using unique-sam command, we must sort the sam file by the QNAMEfield. You can use samtools to achieve this purpose, refer to samtools for more helps:
samtools sort --help
For basic usage, in your command line environment:
unique-sam input.sam -o output.sam
If you don’t have access to samtools, you can use -s option of unique-sam:
unique-sam -s input.sam -o output.sam
The sort functionality of unique-sam is implemented as
- copy a temp file from the original sam file
- extract the header of the sam file
- sort the alignments with Bash sort program
-k parameter give you the control on how to extract the alignment key from qname field the parameter of the -k is a regular expression. You should group the key part with parentheses. ** e.g. 1** > qname: N|GACGCGGATCTT/500407:4:H03E5AFXX:1:21109:5977:6969_2:N:0:ATACAA > > -k '(.*)\_[1-2](.*)' > > key will be: N|GACGCGGATCTT/500407:4:H03E5AFXX:1:21109:5977:6969:N:0:ATACAA > which will remove the _1/2 part of the qname.
** e.g. 2** > qname: HWI-ST667_0147:1:1101:1128:2079#CGATGT/1 > > -k '(.*)\/[1-2]' > > key will be: HWI-ST667_0147:1:1101:1128:2079#CGATGT > which will remove the /1/2 part of the qname.
For more about unique-sam run:
Following strategies are applied to find the unique & the best alignment
- Keep the alignment pair that has the highest score. If more than one pairs are found to have the same “Highest Score”, these pairs will be removed.
- Read1 and Read2 should be mapped on different strands.
- The segment length decided by the read pairs should be longer than 0.7 * read length
All removed alignments will be written into log file input.sam.log under current folder. Each line of the log file start with a symbol and followed by the deleted alignment (the original alignment record in the input.sam). The symbol describe the reason of why this/these alignments should be removed. The specification of these symbols are listed in the follow table:
| Symbol | Description | | —— | ———– | | ! | Error lines | | < | Low score alignments | | = | Pairs with more than one best score | | ~ | Read pair mapped on the same strand | | ? | Segment length too short | | - | Invalid read1/2 information in flag filed or unmapped segment |
Copyright (c) 2015 dlmeduLi@163.com