SigAlign's language binding for Python

SigAlign for Python

SigAlign is a sequence alignment algorithm. This repository hosts the Python language bindings for SigAlign. The original project can be found here.

Requirements

  • Python >= 3.10

Installation

  • Via pip

    pip install sigalign
    
  • Manual build: SigAlign uses maturin as its build backend. To build from source:

    pip install maturin
    maturin develop
    

Usage Example

(1) Import SigAlign

from sigalign import Reference, Aligner

(2) Construct Reference

# Build Reference object from `iterable` of tuples (label, sequence).
reference = Reference.from_iterable([
    ("target_1", "ACACAGATCGCAAACTCACAATTGTATTTCTTTGCCACCTGGGCATATACTTTTTGCGCCCCCTCATTTA"),
    ("target_2", "TCTGGGGCCATTGTATTTCTTTGCCAGCTGGGGCATATACTTTTTCCGCCCCCTCATTTACGCTCATCAC"),
])
# Or only sequences
reference = Reference.from_iterable([
    "ACACAGATCGCAAACTCACAATTGTATTTCTTTGCCACCTGGGCATATACTTTTTGCGCCCCCTCATTTA",
    "TCTGGGGCCATTGTATTTCTTTGCCAGCTGGGGCATATACTTTTTCCGCCCCCTCATTTACGCTCATCAC",
])
# Bytes can be used instead of strings
reference = Reference.from_iterable([
    b"ACACAGATCGCAAACTCACAATTGTATTTCTTTGCCACCTGGGCATATACTTTTTGCGCCCCCTCATTTA",
    b"TCTGGGGCCATTGTATTTCTTTGCCAGCTGGGGCATATACTTTTTCCGCCCCCTCATTTACGCTCATCAC",
])

# FASTA format can be used
reference = Reference.from_fasta(b""">target_1
ACACAGATCGCAAACTCACAATTGTATTTCTTTGCCACCTGGGCATATACTTTTTGCGCCCCCTCATTTA
>target_2
TCTGGGGCCATTGTATTTCTTTGCCAGCTGGGGCATATACTTTTTCCGCCCCCTCATTTACGCTCATCAC""")
# Or from a file
# reference = Reference.from_fasta_file("reference.fasta")
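
For readers unfamiliar with the FASTA layout used above, here is a minimal pure-Python sketch of how such text maps onto (label, sequence) tuples, the same shape that `Reference.from_iterable` accepts. The helper `parse_fasta` is hypothetical and not part of sigalign. It assumes the whole header line after `>` is the label, which may differ from how `Reference.from_fasta` treats headers.

```python
def parse_fasta(data):
    """Yield (label, sequence) tuples from FASTA text given as str or bytes.

    Hypothetical helper for illustration only; not part of sigalign.
    """
    text = data.decode("ascii") if isinstance(data, bytes) else data
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if label is not None:
                yield label, "".join(chunks)
            # Assumption: the full header line (minus ">") is the label.
            label, chunks = line[1:].strip(), []
        elif label is not None:
            # Sequence lines of one record may be wrapped over several lines.
            chunks.append(line)
    if label is not None:
        yield label, "".join(chunks)


records = list(parse_fasta(b""">target_1
ACACAGATCG
CAAACTCACA
>target_2
TCTGGGGCCA"""))
print(records)
```

Each tuple in `records` pairs a label with its joined sequence, so wrapped sequence lines end up as a single string.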

Check status of Reference

print("# Reference Status")
print(f" - Num targets: {reference.num_targets}")
print(f" - Total length: {reference.total_length} bps")
print(f" - Estimated size: {reference.estimated_size / 1024:.2f} KiB")
  • Output:

    # Reference Status
      - Num targets: 2
      - Total length: 140 bps
      - Estimated size: 1.32 KiB
    

Parse target label and sequence

for target_index in range(reference.num_targets):
    print(f"# Target {target_index}")
    print(f"  - Label: {reference.get_label(target_index)}")
    print(f"  - Sequence: {reference.get_sequence(target_index)}")
  • Output:

    # Target 0
      - Label: target_1
      - Sequence: ACACAGATCGCAAACTCACAATTGTATTTCTTTGCCACCTGGGCATATACTTTTTGCGCCCCCTCATTTA
    # Target 1
      - Label: target_2
      - Sequence: TCTGGGGCCATTGTATTTCTTTGCCAGCTGGGGCATATACTTTTTCCGCCCCCTCATTTACGCTCATCAC
    

Save and load

# Save
reference.save_to_file("reference.sigref")

# Load
reference = Reference.load_from_file("reference.sigref")

(3) Initialize Aligner

aligner = Aligner(
    4,     # Mismatch penalty
    6,     # Gap-open penalty
    2,     # Gap-extend penalty
    50,    # Minimum length
    0.2,   # Maximum penalty per length
    use_local_mode=True, # Use local alignment (default: True)
    use_limit=None,      # Limit the number of alignments (default: None)
    use_chunk=None,      # Align the query in chunks, given as (chunk size, sliding window size) (default: None)
)

Check status of Aligner

print("# Aligner Status")
print("  - Penalties")
print(f"    - Mismatch penalty: {aligner.px}")
print(f"    - Gap-open penalty: {aligner.po}")
print(f"    - Gap-extend penalty: {aligner.pe}")
print("  - Similarity Cutoffs")
print(f"    - Minimum length: {aligner.minl}")
print(f"    - Maximum penalty per length: {aligner.maxp:.2f}")
print(f"  - Mode is {'Local' if aligner.is_local_mode else 'Semi-global'}")
print(f"    - Max alignments: {'Infinity' if aligner.limitation is None else aligner.limitation}")
print(f"    - Chunk: {aligner.chunk}")
  • Output:

    # Aligner Status
      - Penalties
        - Mismatch penalty: 4
        - Gap-open penalty: 6
        - Gap-extend penalty: 2
      - Similarity Cutoffs
        - Minimum length: 50
        - Maximum penalty per length: 0.20
      - Mode is Local
        - Max alignments: Infinity
        - Chunk: None
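
The two similarity cutoffs can be read as a filter on each alignment: it must reach the minimum length, and its penalty divided by its length must not exceed the maximum penalty per length. The following is a minimal sketch of that reading, assuming both bounds are inclusive. `passes_cutoffs` is a hypothetical helper, not a sigalign function.

```python
def passes_cutoffs(penalty, length, min_length, max_penalty_per_length):
    """Return True if an alignment satisfies both similarity cutoffs.

    Hypothetical illustration; inclusive bounds are an assumption.
    """
    if length <= 0 or length < min_length:
        return False
    return penalty / length <= max_penalty_per_length


# Cutoffs from the Aligner above: minimum length 50, maximum penalty per length 0.2.
print(passes_cutoffs(8, 60, 50, 0.2))   # long enough, 8/60 is below 0.2
print(passes_cutoffs(8, 40, 50, 0.2))   # shorter than the minimum length
print(passes_cutoffs(16, 60, 50, 0.2))  # 16/60 exceeds 0.2
```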
    

(4) Perform Alignment

# Align a query str to the reference
query = "CAAACTCACAATTGTATTTCTTTGCCAGCTGGGCATATACTTTTTCCGCCCCCTCATTTAACTTCTTGGA"
results = aligner.align_query(query, reference)

# Or query bytes can be used
query = b"CAAACTCACAATTGTATTTCTTTGCCAGCTGGGCATATACTTTTTCCGCCCCCTCATTTAACTTCTTGGA"
results = aligner.align_query(query, reference, with_label=True) # include target labels in the result; slightly slower than omitting them (default: False)

# FASTA (str or bytes) can be used
fasta = b""">query_1
CAAACTCACAATTGTATTTCTTTGCCAGCTGGGCATATACTTTTTCCGCCCCCTCATTTAACTTCTTGGA"""
results = aligner.align_fasta(
    fasta,
    reference,
)
# Or file can be used:
# results = aligner.align_fasta_file(
#     "path/to/file.fasta",
#     reference,
# )

# FASTQ (str or bytes) can be used:
fastq = b"""@query_1
CAAACTCACAATTGTATTTCTTTGCCAGCTGGGCATATACTTTTTCCGCCCCCTCATTTAACTTCTTGGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"""
results = aligner.align_fastq(
    fastq,
    reference,
    with_label=True, # include label in the result (default: False)
    with_reverse_complementary=False, # also align the reverse complement of each read (default: False)
    allow_interrupt=True, # allow interrupting with KeyboardInterrupt (default: False)
)
# Or file can be used:
# results = aligner.align_fastq_file(
#     "path/to/file.fastq",
#     reference,
# )
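
The `with_reverse_complementary` option refers to the reverse complement of a read: the sequence read backwards with each base swapped for its pair (A with T, C with G). The sketch below shows that transformation in plain Python. `reverse_complement` is a hypothetical helper; how sigalign computes it internally is not shown in this README.

```python
# Translation table for DNA bases; any other character (e.g. N) is left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string (illustrative helper)."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # CGTT

# Applying the transformation twice gives back the original read.
query = "CAAACTCACAATTGTATTTCTTTGCCAGCTGGGCATATAC"
assert reverse_complement(reverse_complement(query)) == query
```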

(5) Display Results

for read_alignment in results:
    print(f"# Query: {read_alignment.read} (is forward: {read_alignment.is_forward})")
    for target_alignment in read_alignment.result:
        print(f"  - Target: {target_alignment.label} (index {target_alignment.index})")
        for idx, alignment in enumerate(target_alignment.alignments):
            print(f"    - Result {idx+1}")
            print(f"      * Penalty: {alignment.penalty}")
            print(f"      * Length: {alignment.length}")
            print(f"      * Query position: {alignment.query_position}")
            print(f"      * Target position: {alignment.target_position}")
  • Output:

    # Query: query_1 (is forward: True)
      - Target: target_1 (index 0)
        - Result 1
          * Penalty: 8
          * Length: 60
          * Query position: (0, 60)
          * Target position: (10, 70)
      - Target: target_2 (index 1)
        - Result 1
          * Penalty: 8
          * Length: 51
          * Query position: (10, 60)
          * Target position: (9, 60)
    

Convert results to JSON or dict

import json
json.loads(results.to_json())
  • Output:

    [{'read': 'query_1',
      'is_forward': True,
      'result': [{'index': 0,
                  'label': 'target_1',
                  'alignments': [{'penalty': 8,
                                  'length': 60,
                                  'query_position': [0, 60],
                                  'target_position': [10, 70],
                                  'operations': [{'operation': 'Match', 'count': 27},
                                                 {'operation': 'Subst', 'count': 1},
                                                 {'operation': 'Match', 'count': 17},
                                                 {'operation': 'Subst', 'count': 1},
                                                 {'operation': 'Match', 'count': 14}]}]},
                 {'index': 1,
                  'label': 'target_2',
                  'alignments': [{'penalty': 8,
                                  'length': 51,
                                  'query_position': [10, 60],
                                  'target_position': [9, 60],
                                  'operations': [{'operation': 'Match', 'count': 23},
                                                 {'operation': 'Deletion', 'count': 1},
                                                 {'operation': 'Match', 'count': 27}]}]}]}]
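
The `penalty` and `length` fields can be cross-checked against the `operations` list. The sketch below assumes that each substitution costs the mismatch penalty, that a gap run of k bases costs gap-open plus k times gap-extend, and that the length is the sum of all operation counts. These assumptions reproduce the values in the output above, but they are inferred from this example rather than taken from the sigalign documentation. The operation name `'Insertion'` is also an assumption, since only `'Match'`, `'Subst'`, and `'Deletion'` appear above.

```python
def penalty_and_length(operations, mismatch, gap_open, gap_extend):
    """Recompute (penalty, length) from a list of operation dicts.

    Hypothetical helper; the scoring rules are assumptions stated above.
    """
    penalty = 0
    length = 0
    for op in operations:
        name, count = op["operation"], op["count"]
        length += count
        if name == "Subst":
            penalty += mismatch * count
        elif name in ("Insertion", "Deletion"):
            # Each run of gap operations is treated as one gap of `count` bases.
            penalty += gap_open + gap_extend * count
        elif name != "Match":
            raise ValueError(f"unknown operation: {name}")
    return penalty, length


target_1_ops = [
    {"operation": "Match", "count": 27},
    {"operation": "Subst", "count": 1},
    {"operation": "Match", "count": 17},
    {"operation": "Subst", "count": 1},
    {"operation": "Match", "count": 14},
]
target_2_ops = [
    {"operation": "Match", "count": 23},
    {"operation": "Deletion", "count": 1},
    {"operation": "Match", "count": 27},
]

# Penalties used by the Aligner above: mismatch 4, gap-open 6, gap-extend 2.
print(penalty_and_length(target_1_ops, 4, 6, 2))  # (8, 60)
print(penalty_and_length(target_2_ops, 4, 6, 2))  # (8, 51)
```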
    

Convert results to a table

import pandas as pd
df = pd.DataFrame(
    results.to_rows(),
    columns = [
        'query_label', 'is_forward',
        'target_index', 'target_label', 'penalty', 'length',
        'query_start', 'query_end', 'target_start', 'target_end', 'operations',
    ],
)
df
  • Output:

       query_label  is_forward  target_index target_label  penalty  length  query_start  query_end  target_start  target_end     operations
    0      query_1        True             1     target_2        8      51           10         60             9          60       23=1D27=
    1      query_1        True             0     target_1        8      60            0         60            10          70  27=1X17=1X14=

import polars as pl
df = pl.DataFrame(
    results.to_rows(),
    orient="row",
    schema=[
        'query_label', 'is_forward',
        'target_index', 'target_label', 'penalty', 'length',
        'query_start', 'query_end', 'target_start', 'target_end', 'operations',
    ],
)
df
  • Output:

    query_label  is_forward  target_index  target_label  penalty  length  query_start  query_end  target_start  target_end  operations
    str          bool        i64           str           i64      i64     i64          i64        i64           i64         str
    "query_1"    true        1             "target_2"    8        51      10           60         9             60          "23=1D27="
    "query_1"    true        0             "target_1"    8        60      0            60         10            70          "27=1X17=1X14="
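
The strings in the last column of the tables above look like CIGAR strings built from the operations list, with `=` for a match, `X` for a substitution, and `D` for a deletion. The following is a minimal sketch of that mapping. `to_cigar` is a hypothetical helper, and the `'Insertion'` to `I` entry is an assumption, since no insertion appears in this example.

```python
# Assumed mapping from operation names to CIGAR symbols.
CIGAR_SYMBOLS = {"Match": "=", "Subst": "X", "Insertion": "I", "Deletion": "D"}


def to_cigar(operations):
    """Join operation dicts into a CIGAR-style string (illustrative helper)."""
    return "".join(
        f"{op['count']}{CIGAR_SYMBOLS[op['operation']]}" for op in operations
    )


operations = [
    {"operation": "Match", "count": 23},
    {"operation": "Deletion", "count": 1},
    {"operation": "Match", "count": 27},
]
print(to_cigar(operations))  # 23=1D27=
```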

Additional Information

This Python library provides bindings for the Rust crate sigalign. It offers a set of functions sufficient for most common tasks. However, for more customization, using the Rust crate directly is recommended.

Support

For any questions or issues, please refer to the original project's GitHub issue tracker.

License

SigAlign for Python is released under the MIT license.

Citation

Bahk, K., & Sung, J. (2024). SigAlign: an alignment algorithm guided by explicit similarity criteria. Nucleic Acids Research, gkae607.

Download files

Source Distribution

  • sigalign-0.3.1.tar.gz (77.4 kB)

Built Distributions

  • sigalign-0.3.1-cp310-abi3-win_amd64.whl (392.1 kB): CPython 3.10+, Windows x86-64
  • sigalign-0.3.1-cp310-abi3-manylinux_2_28_x86_64.whl (12.2 MB): CPython 3.10+, manylinux (glibc 2.28+), x86-64
  • sigalign-0.3.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (454.6 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • sigalign-0.3.1-cp310-abi3-macosx_10_12_x86_64.whl (380.3 kB): CPython 3.10+, macOS 10.12+, x86-64
