Extract soft-clipped sequences from BAM/SAM files
Extract Soft-Clipped Sequences
A Python tool to extract soft-clipped sequences from BAM/SAM files.
Installation
From PyPI (recommended)
pip install extract-soft-clipped
With uvx (no installation required)
If you use uv, you can run the tool directly without installing:
uvx extract-soft-clipped --left input.bam
For Development
This project uses uv for dependency management. Clone and install dependencies with:
git clone <repo-url>
cd extract-soft-clipped
uv sync
Usage
Basic Usage
Extract left-clipped sequences:
extract-soft-clipped --left input.bam
Extract right-clipped sequences:
extract-soft-clipped --right input.sam
Alternative Usage Methods
With uvx (no installation):
uvx extract-soft-clipped --left input.bam
Development usage:
uv run extract-soft-clipped --left input.bam
Filter by Length
Only extract soft-clipped sequences of a minimum length:
# Only extract left clips of at least 20 bases
extract-soft-clipped --left --min-length 20 input.bam
# Only extract right clips of at least 10 bases
extract-soft-clipped --right --min-length 10 input.sam
FASTQ Output
Output sequences in FASTQ format with query IDs and quality scores:
# Extract left clips in FASTQ format
extract-soft-clipped --left --fastq input.bam
# Extract right clips in FASTQ format with minimum length
extract-soft-clipped --right --fastq --min-length 15 input.sam
# Preserve original query IDs (no sequence numbering)
extract-soft-clipped --left --fastq --preserve-ids input.bam
FASTA Output
Output sequences in FASTA format with query IDs:
# Extract left clips in FASTA format
extract-soft-clipped --left --fasta input.bam
# Extract right clips in FASTA format with minimum length
extract-soft-clipped --right --fasta --min-length 15 input.sam
# Preserve original query IDs (no sequence numbering)
extract-soft-clipped --left --fasta --preserve-ids input.bam
Region Filtering
Filter soft-clipped sequences by reference coordinate overlap:
# Extract clips that overlap reference positions 1000-1025
extract-soft-clipped --left --region 1000-1025 input.bam
# Multiple regions can be specified
extract-soft-clipped --right --region 1000-1025 --region 2000-2100 input.bam
# Region coordinates are included in FASTA/FASTQ headers when filtering
extract-soft-clipped --left --fasta --region 1000-1025 input.bam
Summarize Clipped Sequences
To get a frequency summary of the N bases of each clipped sequence that lie closest to the aligned portion of the read:
# Summarize last 10 bases of left-clipped sequences
extract-soft-clipped --left --summarize 10 input.bam
# Summarize first 15 bases of right-clipped sequences
extract-soft-clipped --right --summarize 15 input.bam
Examples
# Extract all left-clipped sequences
extract-soft-clipped --left reads.bam > left_clips.txt
# Using uvx (no installation required)
uvx extract-soft-clipped --left reads.bam > left_clips.txt
# Get frequency of 8-mers at the end of left clips
extract-soft-clipped --left --summarize 8 reads.bam
# Extract right clips and save to file
extract-soft-clipped --right reads.sam > right_clips.txt
# Extract left clips of at least 15 bases
extract-soft-clipped --left --min-length 15 reads.bam
# Combine length filtering with summarization
extract-soft-clipped --right --min-length 10 --summarize 5 reads.bam
# Extract left clips in FASTQ format
extract-soft-clipped --left --fastq reads.bam > left_clips.fastq
# Extract right clips with length filtering in FASTQ format
extract-soft-clipped --right --min-length 20 --fastq reads.bam > right_clips.fastq
# Extract left clips in FASTA format
extract-soft-clipped --left --fasta reads.bam > left_clips.fasta
# Extract right clips with length filtering in FASTA format
extract-soft-clipped --right --min-length 10 --fasta reads.bam > right_clips.fasta
# Extract clips overlapping specific reference regions
extract-soft-clipped --left --region 1000-1025 reads.bam > region_clips.txt
# Multiple filters combined with FASTA output
extract-soft-clipped --right --min-length 15 --region 2000-3000 --fasta reads.bam > filtered_clips.fasta
# Preserve original query IDs in output
extract-soft-clipped --left --fasta --preserve-ids reads.bam > original_ids.fasta
Library Usage
You can also use extract-soft-clipped as a Python library:
from extract_soft_clipped import extract_soft_clips, extract_soft_clips_iter
# Extract all left-clipped sequences
clips = extract_soft_clips("reads.bam", extract_left=True, extract_right=False)
for clip in clips:
    print(f"Read: {clip.query_name}")
    print(f"Sequence: {clip.sequence}")
    print(f"Quality: {clip.quality}")
    print(f"Is left clip: {clip.is_left_clip}")
# Use the generator for memory efficiency with large files
for clip in extract_soft_clips_iter("reads.bam", extract_left=True, extract_right=False, min_length=10):
    print(clip.sequence)
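Because the generator yields one clip at a time, downstream statistics can be computed without holding every clip in memory. The following is a minimal sketch of a clip-length tally. The helper name `clip_length_histogram` is hypothetical and not part of the package, and the demo uses stand-in objects that only mimic the `sequence` attribute shown above.

```python
from collections import Counter
from types import SimpleNamespace


def clip_length_histogram(clips):
    """Count how many clips have each length.

    `clips` is any iterable of objects with a `sequence` attribute,
    for example the items yielded by extract_soft_clips_iter().
    """
    return Counter(len(clip.sequence) for clip in clips)


# Stand-in objects so the sketch runs without a BAM file (not real library output).
demo_clips = [
    SimpleNamespace(sequence="ACGT"),
    SimpleNamespace(sequence="GG"),
    SimpleNamespace(sequence="TTAA"),
]
histogram = clip_length_histogram(demo_clips)
for length, count in sorted(histogram.items()):
    print(f"{length}\t{count}")
```

With the real library, the same helper could presumably be called as `clip_length_histogram(extract_soft_clips_iter("reads.bam", extract_left=True, extract_right=False))`.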
How it Works
The script parses the CIGAR string in SAM/BAM alignments to identify soft-clipped regions:
- Left clips: Soft-clipped bases at the start of reads (before alignment)
- Right clips: Soft-clipped bases at the end of reads (after alignment)
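The idea can be illustrated with a small pure-Python sketch that works on a CIGAR string and the read sequence directly. This is not the package's actual implementation (which builds on pysam); the function name `soft_clips_from_cigar` and its handling of hard clips are assumptions made for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips_from_cigar(cigar: str, sequence: str) -> tuple[str, str]:
    """Return (left_clip, right_clip) for one alignment; "" means no clip."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are absent from the stored sequence,
    # so drop them before looking at the two ends.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return "", ""  # e.g. CIGAR "*" for an unmapped read
    left_len = ops[0][0] if ops[0][1] == "S" else 0
    # Require more than one operation so a lone "S" is not counted twice.
    right_len = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    left = sequence[:left_len]
    right = sequence[len(sequence) - right_len:] if right_len else ""
    return left, right


left, right = soft_clips_from_cigar("3S5M2S", "AAACCCCCGG")
print(left, right)
```

Here `3S5M2S` means three soft-clipped bases, five aligned bases, then two soft-clipped bases, so the left clip is the first three bases of the read and the right clip is the last two.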
For the --summarize N option:
- Left clips: Analyzes the last N bases of each left-clipped sequence (closest to aligned portion)
- Right clips: Analyzes the first N bases of each right-clipped sequence (closest to aligned portion)
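A rough sketch of this kind of summary, using plain strings in place of real clips, is shown below. The function name `summarize_clip_ends` is hypothetical, and skipping clips shorter than N is an assumption; the tool itself may treat short clips differently.

```python
from collections import Counter


def summarize_clip_ends(clips, n, left=True):
    """Count the N-base strings nearest the aligned portion of each clip.

    For left clips that is the last N bases; for right clips, the first N.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    counts = Counter()
    for seq in clips:
        if len(seq) < n:
            continue  # assumption: clips shorter than N are ignored
        counts[seq[-n:] if left else seq[:n]] += 1
    return counts


left_summary = summarize_clip_ends(["ACGTT", "GGGTT", "AC"], 2, left=True)
for kmer, count in left_summary.most_common():
    print(f"{kmer}\t{count}")
```

Taking the bases adjacent to the alignment, rather than the outer end of the clip, keeps the summary focused on the sequence right at the clip junction.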
For the --region START-END option:
- Coordinates: 1-based inclusive coordinates (e.g., 1000-1025 includes positions 1000 and 1025)
- Left clips: Calculates where soft-clipped bases would map before the alignment start
- Right clips: Calculates where soft-clipped bases would map after the alignment end
- Output: When using --fasta or --fastq with region filtering, headers include the full reference range (e.g., region-990-1030)
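The coordinate arithmetic described above can be sketched as follows. All three helper names are hypothetical, and the sketch assumes the alignment start and end are already 1-based inclusive positions (with pysam, that would correspond to `reference_start + 1` and `reference_end`). The package's own logic may differ in detail.

```python
def parse_region(text):
    """Parse 'START-END' into a (start, end) tuple of 1-based inclusive ints."""
    start, end = (int(part) for part in text.split("-"))
    if start > end:
        raise ValueError(f"region start exceeds end: {text}")
    return start, end


def clip_reference_range(aln_start, aln_end, clip_len, left):
    """Projected 1-based inclusive reference range of a soft clip.

    aln_start / aln_end are the 1-based inclusive reference positions
    of the first and last aligned bases of the read.
    """
    if left:
        return aln_start - clip_len, aln_start - 1
    return aln_end + 1, aln_end + clip_len


def overlaps(a, b):
    """True if two inclusive ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


region = parse_region("1000-1025")
left_range = clip_reference_range(1010, 1050, clip_len=20, left=True)
right_range = clip_reference_range(1010, 1050, clip_len=20, left=False)
print(left_range, overlaps(left_range, region))
print(right_range, overlaps(right_range, region))
```

In this example a read aligned at 1010-1050 with 20 soft-clipped bases on each side projects its left clip onto 990-1009, which overlaps 1000-1025, and its right clip onto 1051-1070, which does not.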
Requirements
- Python ≥3.10
- pysam ≥0.24.0
File details
Details for the file extract_soft_clipped-0.1.1.tar.gz.
File metadata
- Download URL: extract_soft_clipped-0.1.1.tar.gz
- Upload date:
- Size: 61.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8817c07337da7dd026b535eadb50d0b42ab4fde78952a05cdaf8319a41bbe033 |
| MD5 | d45517153f4dbdec6ca2aa3a712e5c17 |
| BLAKE2b-256 | 50b4b1469865bc9693eb13601820ccf0e07cfe776405a51d4d24dbef29cbfba8 |
File details
Details for the file extract_soft_clipped-0.1.1-py3-none-any.whl.
File metadata
- Download URL: extract_soft_clipped-0.1.1-py3-none-any.whl
- Upload date:
- Size: 9.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eca19dfd93650aabbb3f87212774ea84c8e24988a1c0f4819cebaf83943cec79 |
| MD5 | 0310e5bdc62f0918cfc260e9fe4b623b |
| BLAKE2b-256 | 7d3ea6313bb8ce8a1f79eb7ea1a0d904c1b40239186449a81f2a0d6db8281c88 |