
A Python library for manipulation of CIGAR, MSA, and other alignment formats.



Alignment tools

This Python library provides tools for handling CIGAR strings and related alignment formats in bioinformatics. CIGAR ("Compact Idiosyncratic Gapped Alignment Report") strings compactly represent an alignment between sequences (such as genomic sequences) by recording runs of matches, mismatches, insertions, deletions, and other operations. The library wraps CIGAR strings in classes that make them easier to manipulate, interpret, and convert for analysis and visualization.

Features

  • Easy CIGAR String Parsing and Construction: Parse CIGAR strings from text or create them programmatically, with detailed annotations for operations such as matches, insertions, and deletions.
  • Alignment Manipulation: Slice, trim, and concatenate sequence alignments for advanced genomic analyses.
  • Coordinate Mapping: Provides bi-directional mapping between reference and query sequences, enabling you to track and modify coordinates with precision.
  • Gap Identification: Detect gaps, insertions, and deletions within alignments to facilitate adjustments and further analysis.
  • Multiple Sequence Alignment (MSA): Convert CIGAR strings into multiple sequence alignment (MSA) representations for visualization or further processing.
  • CIGAR String Normalization: Merge consecutive identical operations and support merging non-overlapping alignments to normalize complex CIGAR strings.

Examples

Parsing and Manipulating a CIGAR String

You can easily parse a CIGAR string and perform various operations using the Cigar class:

from aligntools import Cigar

# Parse a CIGAR string
cigar = Cigar.coerce("10M1I5D5M")

# Enumerate operations
for operation in cigar.iterate_operations():
    print(operation)

# Output:
# CigarActions.MATCH
# CigarActions.MATCH
# ... (continues)
# CigarActions.MATCH
# CigarActions.INSERT
# CigarActions.DELETE
# ... (continues)
# CigarActions.MATCH
# CigarActions.MATCH

# Calculate the lengths of the reference and query sequences
print(f"Reference Length: {cigar.ref_length}")
print(f"Query Length: {cigar.query_length}")

# Output:
# Reference Length: 20
# Query Length: 16
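
To make the length arithmetic above explicit, here is a minimal standalone sketch that does not use aligntools. The helper names (parse_cigar, ref_length, query_length) are hypothetical, and the sets of reference- and query-consuming operations follow the SAM convention, which is an assumption here rather than something taken from the library's source.

```python
import re

# Operations that advance the reference / the query (SAM convention, assumed).
REF_CONSUMING = set("MDN=X")
QUERY_CONSUMING = set("MIS=X")


def parse_cigar(text):
    """Split a CIGAR string into (count, operation) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", text)
    # Reject input containing anything the pattern did not consume.
    if "".join(n + op for n, op in pairs) != text:
        raise ValueError(f"invalid CIGAR string: {text!r}")
    return [(int(n), op) for n, op in pairs]


def ref_length(pairs):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in pairs if op in REF_CONSUMING)


def query_length(pairs):
    """Number of query positions covered by the alignment."""
    return sum(n for n, op in pairs if op in QUERY_CONSUMING)


pairs = parse_cigar("10M1I5D5M")
print(ref_length(pairs), query_length(pairs))  # 20 16
```

The reference length counts the 10 + 5 matches and the 5 deleted positions, while the query length counts the matches and the single inserted base.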

Cutting a CIGAR String by Reference Position

CIGAR strings can be split at specific reference positions, allowing for more fine-grained control over the alignment:

from aligntools import Cigar, CigarHit

# Parse a CIGAR string and create a CigarHit
cigar = Cigar.coerce("10M5I10M")
hit = CigarHit(cigar, r_st=0, r_ei=19, q_st=0, q_ei=24)

# Cut the alignment at reference position 10.5 (at midpoint between positions 10 and 11).
left, right = hit.cut_reference(10.5)

print("Left slice:", left)
print("Right slice:", right)

# Output:
# Left slice: 10M5I1M@[0,15]->[0,10]
# Right slice: 9M@[16,24]->[11,19]
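
The idea behind cutting at a fractional reference position can be sketched without the library. The function names below are hypothetical, only M, I, and D are handled, and the rule for which side receives an insertion next to the cut point is an assumption chosen to reproduce the example above; the library may decide such boundary cases differently.

```python
from itertools import groupby


def expand(cigar):
    """Turn e.g. '2M1I' into the list ['M', 'M', 'I']."""
    ops, count = [], ""
    for ch in cigar:
        if ch.isdigit():
            count += ch
        else:
            ops.extend(ch * int(count))
            count = ""
    return ops


def compress(ops):
    """Inverse of expand: run-length encode a list of operations."""
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def cut_at_reference(cigar, cut_point, r_st=0):
    """Split a CIGAR string at a fractional reference position."""
    left, right = [], []
    ref_pos = r_st  # next reference position to be consumed
    for op in expand(cigar):
        # Assumption: an operation goes left while the next reference
        # position still lies before the cut point.
        (left if ref_pos < cut_point else right).append(op)
        if op in "MD":  # M and D advance the reference
            ref_pos += 1
    return compress(left), compress(right)


print(cut_at_reference("10M5I10M", 10.5))  # ('10M5I1M', '9M')
```

Using a fractional cut point such as 10.5 avoids any ambiguity about whether position 10 itself belongs to the left or the right part.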

Trimming Query and Reference Sequences

You can trim unmatched regions from either the query or reference sequence:

from aligntools import Cigar

# Parse a CIGAR string with unmatched regions
cigar = Cigar.coerce("5S10M5S")

# Trim unmatched regions from the right end of the query sequence
trimmed_cigar = cigar.rstrip_query()

print(f"Trimmed CIGAR: {trimmed_cigar}")

# Output:
# Trimmed CIGAR: 5S10M
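
As a rough illustration of right-side trimming, the standalone sketch below drops trailing operations that consume query bases but no reference bases. The function name rstrip_query_only is hypothetical, and treating I and S as the "query-only" operations is an assumption that happens to reproduce the output above; it is not taken from the library's implementation.

```python
import re

# Operations that consume query bases but no reference bases (assumed set).
QUERY_ONLY = set("IS")


def rstrip_query_only(cigar):
    """Remove trailing query-only operations from a CIGAR string."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    while pairs and pairs[-1][1] in QUERY_ONLY:
        pairs.pop()
    return "".join(n + op for n, op in pairs)


print(rstrip_query_only("5S10M5S"))  # 5S10M
```

The leading 5S is untouched because only the right end of the string is inspected.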

Converting CIGAR Strings to Multiple Sequence Alignments (MSA)

Convert a CIGAR string to a multiple sequence alignment (MSA) representation for better visualization of how sequences align:

from aligntools import Cigar

# Parse a CIGAR string
cigar = Cigar.coerce("5M2I5M")

# Define reference and query sequences
ref_seq =   "ACGTACGTAC"
query_seq = "ACGTTACGTATG"

# Convert to MSA
ref_msa, query_msa = cigar.to_msa(ref_seq, query_seq)

print(f"Reference MSA: {ref_msa}")
print(f"Query MSA:     {query_msa}")

# Output:
# Reference MSA: ACGTA--CGTAC
# Query MSA:     ACGTTACGTATG
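
The gap placement in the output above follows directly from the operations: an insertion produces a gap in the reference row and a deletion produces a gap in the query row. A minimal standalone sketch of that rule, using a hypothetical free function and handling only M, I, and D, could look like this:

```python
import re


def to_msa(cigar, ref_seq, query_seq):
    """Build gapped reference and query strings (M, I, D only)."""
    ref_out, query_out = [], []
    r = q = 0  # cursors into ref_seq and query_seq
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op == "M":    # both sequences advance
            ref_out.append(ref_seq[r:r + n])
            query_out.append(query_seq[q:q + n])
            r += n
            q += n
        elif op == "I":  # extra query bases: gap in the reference row
            ref_out.append("-" * n)
            query_out.append(query_seq[q:q + n])
            q += n
        else:            # "D", missing query bases: gap in the query row
            ref_out.append(ref_seq[r:r + n])
            query_out.append("-" * n)
            r += n
    return "".join(ref_out), "".join(query_out)


ref_msa, query_msa = to_msa("5M2I5M", "ACGTACGTAC", "ACGTTACGTATG")
print(ref_msa)    # ACGTA--CGTAC
print(query_msa)  # ACGTTACGTATG
```

Both returned strings always have the same length, which is what allows them to be printed one above the other.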

Merging Consecutive Alignments

aligntools can merge two consecutive CIGAR strings into a single normalized CIGAR string:

from aligntools import Cigar

# Parse two CIGAR strings
cigar1 = Cigar.coerce("5M5D")
cigar2 = Cigar.coerce("10M")

# Merge the two alignments
merged_cigar = cigar1 + cigar2

print(f"Merged CIGAR: {merged_cigar}")
# Output: Merged CIGAR: 5M5D10M
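
Normalization matters when the pieces being joined end and begin with the same operation. The following standalone sketch (a hypothetical normalize function, not the library's own code) shows one way to merge adjacent runs of identical operations and discard zero-length runs:

```python
import re
from itertools import groupby


def normalize(cigar):
    """Merge adjacent runs of the same operation; drop zero-length runs."""
    pairs = [
        (int(n), op)
        for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if int(n) > 0  # filter first so neighbours on both sides can merge
    ]
    return "".join(
        f"{sum(n for n, _ in run)}{op}"
        for op, run in groupby(pairs, key=lambda pair: pair[1])
    )


print(normalize("5M" + "5M3D" + "0I" + "2D"))  # 10M5D
```

Zero-length runs are removed before grouping; otherwise the "0I" would keep the two deletion runs apart.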

Advanced Usage: Coordinate Mapping Between Reference and Query Sequences

You can manage coordinate translations between the reference and query using CoordinateMapping:

from aligntools import Cigar

# Create a CIGAR and its coordinate mapping
cigar = Cigar.coerce("5M2I3M")
mapping = cigar.coordinate_mapping

# Translate reference coordinates to query coordinates
ref_coordinate = 3
query_coordinate = mapping.ref_to_query[ref_coordinate]

print(f"Query coordinate: {query_coordinate}")
# Output:
# Query coordinate: 3

ref_coordinate = 6
query_coordinate = mapping.ref_to_query[ref_coordinate]
print(f"Query coordinate: {query_coordinate}")
# Output:
# Query coordinate: 8
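
The two results above can be checked by hand with a small standalone sketch that walks the operations and records which positions are aligned to each other. The function build_mapping is hypothetical and only covers positions inside M operations; how the library treats inserted or deleted positions is not modeled here.

```python
import re


def build_mapping(cigar):
    """Return (ref_to_query, query_to_ref) dicts for matched positions."""
    ref_to_query, query_to_ref = {}, {}
    r = q = 0  # current reference and query positions
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        for _ in range(int(count)):
            if op == "M":    # aligned pair: record it in both directions
                ref_to_query[r] = q
                query_to_ref[q] = r
                r += 1
                q += 1
            elif op == "I":  # query base with no reference partner
                q += 1
            else:            # "D": reference base with no query partner
                r += 1
    return ref_to_query, query_to_ref


ref_to_query, query_to_ref = build_mapping("5M2I3M")
print(ref_to_query[3], ref_to_query[6])  # 3 8
```

Reference position 6 maps to query position 8 because the two inserted query bases (positions 5 and 6) shift everything after them by two.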

Using CigarHit for Complex Alignment Manipulations

The CigarHit class is a higher-level abstraction that allows you to perform complex operations on a sequence alignment:

from aligntools import Cigar, CigarHit

# Define a complex CIGAR string and create a CigarHit
cigar = Cigar.coerce("5M2D5M")
hit = CigarHit(cigar, r_st=0, r_ei=11, q_st=0, q_ei=9)

# Cut and trim the alignment
left, right = hit.cut_reference(4.5)
trimmed_hit = right.rstrip_reference()

print("Left Hit:           ", left)
print("Right (trimmed) Hit:", trimmed_hit)

# Output:
# Left Hit:            5M@[0,4]->[0,4]
# Right (trimmed) Hit: 2D5M@[5,9]->[5,11]

Installation

To use aligntools for your projects, simply run pip install aligntools.

Contributing

We welcome contributions to the aligntools project! Whether you want to fix a bug, add new features, or improve documentation, feel free to fork the repository, make your changes, and submit a pull request. We also welcome issues and suggestions.

To make changes to aligntools:

  • Fork the repository through the GitHub UI, using the "Fork" button.
  • Clone your fork and set up a development environment:
# Get the repository sources.
git clone https://github.com/${YOUR_USERNAME}/aligntools
cd aligntools
pip install .[dev,test] # Install all development and test dependencies.
git checkout -b $YOUR_CHANGE_NAME
  • Edit some files and commit the results.
  • Run the validation:
pytest && flake8 && bandit
  • Push the changes back to GitHub:
git push origin HEAD
  • Create a pull request through the GitHub UI: go to the aligntools repository and select the "Create pull request" option.

License

This project is licensed under the AGPLv3.0 License. See the COPYING file for details.

