Extract soft-clipped sequences from BAM/SAM files

Extract Soft-Clipped Sequences

A Python tool to extract soft-clipped sequences from BAM/SAM files.

Installation

From PyPI (recommended)

pip install extract-soft-clipped

With uvx (no installation required)

If you use uv, you can run the tool directly without installing:

uvx extract-soft-clipped --left input.bam

For Development

This project uses uv for dependency management. Clone and install dependencies with:

git clone <repo-url>
cd extract-soft-clipped
uv sync

Usage

Basic Usage

Extract left-clipped sequences:

extract-soft-clipped --left input.bam

Extract right-clipped sequences:

extract-soft-clipped --right input.sam

Alternative Usage Methods

With uvx (no installation):

uvx extract-soft-clipped --left input.bam

Development usage:

uv run extract-soft-clipped --left input.bam

Filter by Length

Only extract soft-clipped sequences of a minimum length:

# Only extract left clips of at least 20 bases
extract-soft-clipped --left --min-length 20 input.bam

# Only extract right clips of at least 10 bases
extract-soft-clipped --right --min-length 10 input.sam

FASTQ Output

Output sequences in FASTQ format with query IDs and quality scores:

# Extract left clips in FASTQ format
extract-soft-clipped --left --fastq input.bam

# Extract right clips in FASTQ format with minimum length
extract-soft-clipped --right --fastq --min-length 15 input.sam

# Preserve original query IDs (no sequence numbering)
extract-soft-clipped --left --fastq --preserve-ids input.bam

FASTA Output

Output sequences in FASTA format with query IDs:

# Extract left clips in FASTA format
extract-soft-clipped --left --fasta input.bam

# Extract right clips in FASTA format with minimum length
extract-soft-clipped --right --fasta --min-length 15 input.sam

# Preserve original query IDs (no sequence numbering)
extract-soft-clipped --left --fasta --preserve-ids input.bam

Region Filtering

Filter soft-clipped sequences by reference coordinate overlap:

# Extract clips that overlap reference positions 1000-1025
extract-soft-clipped --left --region 1000-1025 input.bam

# Multiple regions can be specified
extract-soft-clipped --right --region 1000-1025 --region 2000-2100 input.bam

# Region coordinates are included in FASTA/FASTQ headers when filtering
extract-soft-clipped --left --fasta --region 1000-1025 input.bam

Summarize Clipped Sequences

Get a frequency summary of the N bases of each clip that lie closest to the aligned portion of the read:

# Summarize last 10 bases of left-clipped sequences
extract-soft-clipped --left --summarize 10 input.bam

# Summarize first 15 bases of right-clipped sequences
extract-soft-clipped --right --summarize 15 input.bam

Examples

# Extract all left-clipped sequences
extract-soft-clipped --left reads.bam > left_clips.txt

# Using uvx (no installation required)
uvx extract-soft-clipped --left reads.bam > left_clips.txt

# Get frequency of 8-mers at the end of left clips
extract-soft-clipped --left --summarize 8 reads.bam

# Extract right clips and save to file
extract-soft-clipped --right reads.sam > right_clips.txt

# Extract left clips of at least 15 bases
extract-soft-clipped --left --min-length 15 reads.bam

# Combine length filtering with summarization
extract-soft-clipped --right --min-length 10 --summarize 5 reads.bam

# Extract left clips in FASTQ format
extract-soft-clipped --left --fastq reads.bam > left_clips.fastq

# Extract right clips with length filtering in FASTQ format
extract-soft-clipped --right --min-length 20 --fastq reads.bam > right_clips.fastq

# Extract left clips in FASTA format
extract-soft-clipped --left --fasta reads.bam > left_clips.fasta

# Extract right clips with length filtering in FASTA format
extract-soft-clipped --right --min-length 10 --fasta reads.bam > right_clips.fasta

# Extract clips overlapping specific reference regions
extract-soft-clipped --left --region 1000-1025 reads.bam > region_clips.txt

# Multiple filters combined with FASTA output
extract-soft-clipped --right --min-length 15 --region 2000-3000 --fasta reads.bam > filtered_clips.fasta

# Preserve original query IDs in output
extract-soft-clipped --left --fasta --preserve-ids reads.bam > original_ids.fasta

Library Usage

You can also use extract-soft-clipped as a Python library:

from extract_soft_clipped import extract_soft_clips, extract_soft_clips_iter

# Extract all left-clipped sequences
clips = extract_soft_clips("reads.bam", extract_left=True, extract_right=False)

for clip in clips:
    print(f"Read: {clip.query_name}")
    print(f"Sequence: {clip.sequence}")
    print(f"Quality: {clip.quality}")
    print(f"Is left clip: {clip.is_left_clip}")

# Use the generator for memory efficiency with large files
for clip in extract_soft_clips_iter("reads.bam", extract_left=True, extract_right=False, min_length=10):
    print(clip.sequence)
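
The clip objects shown above expose query_name, sequence, quality, and is_left_clip, so they can be post-processed with ordinary Python. The following is a minimal sketch of formatting clips as FASTA text by hand. The helper name clips_to_fasta and the header layout are hypothetical and are not the format the command-line tool writes. SimpleNamespace objects stand in for real clips so the example runs without a BAM file.

```python
from types import SimpleNamespace


def clips_to_fasta(clips) -> str:
    """Format clip-like objects as FASTA text (hypothetical header layout)."""
    lines = []
    for clip in clips:
        # Tag each record with the side the clip came from.
        side = "left" if clip.is_left_clip else "right"
        lines.append(f">{clip.query_name}_{side}")
        lines.append(clip.sequence)
    # End with a newline only when there is at least one record.
    return "\n".join(lines) + ("\n" if lines else "")


# Stand-in objects with the same attribute names as the clips above.
demo = [
    SimpleNamespace(query_name="read1", sequence="ACGT", is_left_clip=True),
    SimpleNamespace(query_name="read2", sequence="GG", is_left_clip=False),
]
print(clips_to_fasta(demo), end="")
```

In real use, the list passed in would be the result of extract_soft_clips or the generator from extract_soft_clips_iter.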

How it Works

The tool parses the CIGAR string of each SAM/BAM alignment to identify soft-clipped regions:

  • Left clips: Soft-clipped bases at the start of reads (before alignment)
  • Right clips: Soft-clipped bases at the end of reads (after alignment)
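
To illustrate the idea, here is a minimal sketch of finding soft clips from a CIGAR string in pure Python. It is not the tool's actual implementation. The function names soft_clip_lengths and split_soft_clips are hypothetical, and the handling of hard clips and fully clipped reads is an assumption.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar: str) -> tuple[int, int]:
    """Return (left, right) soft-clip lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside soft clips; drop them from both ends.
    while ops and ops[0][1] == "H":
        ops.pop(0)
    while ops and ops[-1][1] == "H":
        ops.pop()
    # Assumption: with fewer than two operations there is no aligned
    # portion next to a clip, so report no clips.
    if len(ops) < 2:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if ops[-1][1] == "S" else 0
    return (left, right)


def split_soft_clips(sequence: str, cigar: str) -> tuple[str, str]:
    """Return the (left clip, right clip) substrings of a read sequence."""
    left, right = soft_clip_lengths(cigar)
    left_clip = sequence[:left]
    # Guard against right == 0, where sequence[-0:] would be the whole read.
    right_clip = sequence[len(sequence) - right:] if right else ""
    return (left_clip, right_clip)


print(soft_clip_lengths("5S20M3S"))
```

Soft-clipped bases remain in the stored read sequence, which is why they can be sliced directly from its start and end.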

For the --summarize N option:

  • Left clips: Analyzes the last N bases of each left-clipped sequence (closest to aligned portion)
  • Right clips: Analyzes the first N bases of each right-clipped sequence (closest to aligned portion)

For the --region START-END option:

  • Coordinates: 1-based inclusive coordinates (e.g., 1000-1025 includes positions 1000 and 1025)
  • Left clips: Calculates where soft-clipped bases would map before the alignment start
  • Right clips: Calculates where soft-clipped bases would map after the alignment end
  • Output: When using --fasta or --fastq with region filtering, headers include the full reference range (e.g., region-990-1030)
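
The coordinate arithmetic behind region filtering can be sketched as follows. This is a simplified model, not the tool's code: the function names are hypothetical, and it assumes each soft-clipped base would occupy exactly one reference position directly next to the alignment.

```python
def parse_region(text: str) -> tuple[int, int]:
    """Parse 'START-END' into a 1-based inclusive (start, end) pair."""
    start, end = text.split("-")
    return (int(start), int(end))


def clip_reference_range(
    aln_start: int, aln_end: int, clip_len: int, left: bool
) -> tuple[int, int]:
    """Return the 1-based inclusive range a clip would cover if it had aligned.

    aln_start and aln_end are the 1-based inclusive alignment bounds.
    Note: pysam reports reference_start as 0-based and reference_end as
    exclusive, so aln_start = reference_start + 1 and aln_end = reference_end.
    """
    if left:
        return (aln_start - clip_len, aln_start - 1)
    return (aln_end + 1, aln_end + clip_len)


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two inclusive ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


region = parse_region("1000-1025")
left_range = clip_reference_range(1010, 1059, clip_len=20, left=True)
print(left_range, overlaps(left_range, region))
```

In this example a 20-base left clip on an alignment starting at position 1010 would cover 990-1009, which overlaps the region 1000-1025.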

Requirements

  • Python ≥3.10
  • pysam ≥0.24.0
