Adjusted Identity Calculator for DNA Sequences with MycoBLAST-style preprocessing and MSA support

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

Adjusted Identity Calculator for DNA Sequences

A Python package implementing MycoBLAST-style sequence identity calculations for DNA sequences, specifically designed for mycological DNA barcoding applications. This package provides sophisticated sequence alignment and scoring with various adjustments for homopolymer differences, IUPAC ambiguity codes, and sequencing artifacts.

Based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. See the foundational article: "Why NCBI BLAST identity scores can be misleading for fungi" which explains the theoretical basis and motivation for these sequence preprocessing adjustments.

Features

Homopolymer Length Normalization: Ignore differences in homopolymer run lengths (e.g., "AAA" vs "AAAA")
Repeat Motif Adjustment: Handle dinucleotide and longer repeat motifs (e.g., "ATATAT" vs "ATATATAT")
IUPAC Ambiguity Code Handling: Allow different ambiguity codes to match via nucleotide intersection
MSA Dual-Gap Support: Correctly handle sequences from multi-sequence alignments (MSA) where both sequences may have gaps at the same position
End Trimming: Skip mismatches in terminal regions to avoid sequencing artifacts (disabled by default, set end_skip_distance to enable)
Indel Normalization: Count contiguous indels as single evolutionary events
Comprehensive Alignment: Multi-stage bidirectional alignment optimization using edlib
Flexible Configuration: Enable/disable individual adjustments as needed

Installation

From GitHub

pip install git+https://github.com/joshuaowalker/adjusted-identity.git

Development Installation

git clone https://github.com/joshuaowalker/adjusted-identity.git
cd adjusted-identity
pip install -e ".[dev]"

Quick Start

Option 1: Complete Solution (align + score)

from adjusted_identity import align_and_score

# For raw sequences - handles alignment and scoring
result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {result.identity:.3f}")
print(f"Coverage: seq1={result.seq1_coverage:.3f}, seq2={result.seq2_coverage:.3f}")

Option 2: Core Scoring Only (use with your alignments)

from adjusted_identity import score_alignment

# For pre-aligned sequences from BLAST, BioPython, etc.
aligned_seq1 = "ATCG-AAAT"  # From your alignment tool
aligned_seq2 = "ATCGAAAAT"  # From your alignment tool

result = score_alignment(aligned_seq1, aligned_seq2)
print(f"Adjusted identity: {result.identity:.3f}")

Compare with Traditional Identity

from adjusted_identity import align_and_score, RAW_ADJUSTMENT_PARAMS

# Traditional identity (no adjustments)
raw_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC", RAW_ADJUSTMENT_PARAMS)
print(f"Traditional identity: {raw_result.identity:.3f}")

# Adjusted identity (with MycoBLAST adjustments)  
adj_result = align_and_score("ATCGAAAAATGTC", "ATCGAAAATGTC")
print(f"Adjusted identity: {adj_result.identity:.3f}")

# Examine the scoring pattern
print(f"Score pattern: {adj_result.score_aligned}")
# '|' = exact match, '=' = ambiguous/homopolymer, ' ' = substitution

Use Cases

Mycological DNA Barcoding

Common scenario: ITS sequences with homopolymer differences due to sequencing artifacts.

from adjusted_identity import align_and_score

# Fungal ITS sequences with different homopolymer lengths
its_seq1 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTAAA"    # 3 A's at end
its_seq2 = "TCCGTAGGTGAACCTGCGGAAGGATCATTACCGAGTTTTAAAA"   # 4 A's at end

result = align_and_score(its_seq1, its_seq2)
print(f"Species identity: {result.identity:.3f}")  # Should be ~1.0 with homopolymer adjustment

Handling Ambiguous Bases

from adjusted_identity import align_and_score

# Sequences with IUPAC ambiguity codes
barcode1 = "ATCGRGTC"  # R = A or G
barcode2 = "ATCGKGTC"  # K = G or T (both R and K contain G)

result = align_and_score(barcode1, barcode2)
print(f"Identity with IUPAC handling: {result.identity:.3f}")  # Should be 1.0
print(f"Score pattern: {result.score_aligned}")  # Shows '=' for ambiguous matches

Understanding Score Patterns

The score_aligned field provides a visual representation of how each position was scored:

| = Exact match between standard nucleotides (A=A, C=C, G=G, T=T)
= = Ambiguous match (IUPAC codes) or homopolymer/repeat extension
(space) = Substitution (mismatch)
- = Indel extension (normalized)
. = End-trimmed, dual-gap, or overhang position (not scored)

from adjusted_identity import align_and_score

result = align_and_score("ATCGRAAATGTC", "ATCGAAAAATGTC")
print(f"Seq1: {result.seq1_aligned}")
print(f"Seq2: {result.seq2_aligned}") 
print(f"Score: {result.score_aligned}")
# Output might show: ||||==||||||
#                    ATCG = exact matches (||||)
#                    R vs A = ambiguous match (=)
#                    AAA vs AAAA = homopolymer extension (=)

Repeat Motif Handling

from adjusted_identity import align_and_score, AdjustmentParams

# Dinucleotide repeat differences (AT repeat from Russell article)
seq1 = "CGATAT--C"  # Missing one AT unit
seq2 = "CGATATATC"  # Has extra AT unit

# With repeat motif adjustment (default)
result = align_and_score(seq1, seq2)
print(f"Adjusted identity: {result.identity:.3f}")  # Should be 1.0

# Control max repeat motif length
params = AdjustmentParams(
    max_repeat_motif_length=3  # Detect up to trinucleotide repeats (e.g., CAG)
)
result = align_and_score("CAGCAG---TTC", "CAGCAGCAGTTC", params)

Multi-Sequence Alignment (MSA) Support

The package correctly handles sequence pairs extracted from multi-sequence alignments (MSA), where both sequences may have gaps at the same position due to alignment with third sequences.

from adjusted_identity import score_alignment

# Sequences from MSA (e.g., spoa, MUSCLE, MAFFT output)
# Both sequences have gaps at positions 3-4 due to alignment with a third sequence
msa_seq1 = "AGA--TT"
msa_seq2 = "AGAT-TT"

result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}")  # Should be 1.0 - 'T' recognized as homopolymer
print(f"Score pattern: {result.score_aligned}")

# Another example with consensus-based homopolymer detection
msa_seq1 = "AGG-AC"  # G at position 2
msa_seq2 = "AG-GAC"  # G at position 3

result = score_alignment(msa_seq1, msa_seq2)
print(f"MSA identity: {result.identity:.3f}")  # Both G's recognized as homopolymer extensions

Key MSA features:

Dual-gap handling: Positions where both sequences have '-' are excluded from scoring (marked with .)
Consensus context: Homopolymer detection uses consensus nucleotides from both sequences
Conflict resolution: When sequences disagree at context positions, homopolymer extension is not applied

Custom Adjustments

from adjusted_identity import align_and_score, AdjustmentParams

# Custom adjustment parameters
custom_params = AdjustmentParams(
    normalize_homopolymers=True,   # Enable homopolymer adjustment
    handle_iupac_overlap=False,    # Disable IUPAC intersection
    normalize_indels=True,         # Enable indel normalization
    end_skip_distance=10,         # Skip 10bp from each end (default is 0 = disabled)
    max_repeat_motif_length=2     # Detect up to dinucleotide repeats (default)
)

result = align_and_score(seq1, seq2, custom_params)

API Architecture

This package provides a layered API design that separates sequence alignment from identity scoring, giving you maximum flexibility:

Core Layer: `score_alignment()`

The core implementation that applies MycoBLAST-style adjustments to any pre-aligned sequences. This allows you to use any alignment library (BLAST, BioPython, edlib, etc.) with the adjusted identity algorithm.

Input: Just needs gapped sequences of equal length
Output: Adjusted identity metrics with detailed scoring information

Convenience Layer: `align_and_score()`

A higher-level function that combines fast edlib alignment with the scoring algorithm. Provides BLAST-like infix alignment that's fast enough for production use without requiring external dependencies.

Input: Raw unaligned sequences
Output: Complete alignment and adjusted identity results

API Reference

Core Function

`score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None)`

The core implementation - applies MycoBLAST-style adjustments to pre-aligned sequences from any source.

Use this when:

You already have alignments from BLAST, BioPython, or other tools
You want to integrate adjusted identity into existing pipelines
You need maximum control over the alignment process

Parameters:

seq1_aligned, seq2_aligned (str): Pre-aligned sequences with gaps (must be same length)
adjustment_params (AdjustmentParams, optional): Adjustment parameters
scoring_format (ScoringFormat, optional): Scoring visualization format

Returns:

AlignmentResult: Scoring results and metrics

Example with BLAST alignment:

# After getting BLAST alignment results
from adjusted_identity import score_alignment

blast_seq1 = "ATCG-AAAT"  # From BLAST output
blast_seq2 = "ATCGAAAAT"  # From BLAST output

result = score_alignment(blast_seq1, blast_seq2)
print(f"Adjusted identity: {result.identity:.3f}")

Convenience Function

`align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None)`

High-level convenience function that handles both alignment and scoring in one step.

Use this when:

You want a simple, fast solution without additional alignment tools
You need BLAST-like performance for production use
You're comparing raw sequences end-to-end

Parameters:

seq1, seq2 (str): Raw DNA sequences to compare
adjustment_params (AdjustmentParams, optional): Adjustment parameters
scoring_format (ScoringFormat, optional): Scoring visualization format

Returns:

AlignmentResult: Contains identity metrics, alignment, and coverage information

Example:

from adjusted_identity import align_and_score

result = align_and_score("ATCGAAAATGTC", "ATCGAAAATGTC")
print(f"Identity: {result.identity:.3f}")

Configuration Classes

`AdjustmentParams`

Configure which sequence adjustments to apply:

AdjustmentParams(
    normalize_homopolymers=True,    # Ignore homopolymer length differences
    handle_iupac_overlap=True,      # Allow IUPAC ambiguity intersections
    normalize_indels=True,          # Count contiguous indels as single events
    end_skip_distance=0,           # Skip first/last N nucleotides (0 = disabled by default)
    max_repeat_motif_length=2      # Maximum repeat motif length to detect (1=homopolymers only, 2=dinucleotides, etc.)
)

`ScoringFormat`

Customize alignment visualization characters:

ScoringFormat(
    match='|',                     # Exact match (A=A, C=C, G=G, T=T)
    ambiguous_match='=',           # Ambiguous nucleotide match (any IUPAC code match)
    substitution=' ',              # Nucleotide substitution
    indel_start=' ',               # First position of indel
    indel_extension='-',           # Indel positions (normalization)
    homopolymer_extension='=',     # Homopolymer length difference
    end_trimmed='.'               # Position outside scoring region
)

`AlignmentResult`

Results returned by alignment functions:

AlignmentResult(
    identity=0.95,                 # Identity score (0.0-1.0)
    mismatches=2,                  # Number of mismatches counted
    scored_positions=40,           # Positions used for identity calculation
    seq1_coverage=0.98,            # Fraction of seq1 in alignment
    seq2_coverage=0.97,            # Fraction of seq2 in alignment
    seq1_aligned="ATCG-ATCG",      # Aligned sequence 1 with gaps
    seq2_aligned="ATCGATCG-",      # Aligned sequence 2 with gaps
    score_aligned="||||=|||"       # Scoring visualization
)

Constants

DEFAULT_ADJUSTMENT_PARAMS: All adjustments enabled (recommended)
RAW_ADJUSTMENT_PARAMS: No adjustments (traditional identity)
DEFAULT_SCORING_FORMAT: Default visualization characters

Integration with Other Alignment Tools

The core score_alignment() function works with alignments from any source. Here are examples with popular alignment libraries:

Using with NCBI BLAST

# NOTE: This example is illustrative and not tested
from Bio.Blast import NCBIWWW, NCBIXML
from adjusted_identity import score_alignment

# Run BLAST search (example)
result_handle = NCBIWWW.qblast("blastn", "nt", query_sequence)
blast_records = NCBIXML.parse(result_handle)

for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            # Extract aligned sequences from BLAST HSP
            query_aligned = hsp.query
            subject_aligned = hsp.sbjct
            
            # Apply adjusted identity scoring
            adj_result = score_alignment(query_aligned, subject_aligned)
            
            print(f"BLAST identity: {hsp.identities/hsp.align_length:.3f}")
            print(f"Adjusted identity: {adj_result.identity:.3f}")

Using with BioPython PairwiseAligner

from Bio import Align
from adjusted_identity import score_alignment

# Create BioPython aligner
aligner = Align.PairwiseAligner()
aligner.match_score = 2
aligner.mismatch_score = -1

# Example sequences
seq1 = "ATCGATCG"
seq2 = "ATCGTCG"  # Missing one nucleotide

# Perform alignment
alignments = aligner.align(seq1, seq2)
best_alignment = alignments[0]

# Extract aligned sequences using indexing (idiomatic BioPython)
seq1_aligned = str(best_alignment[0])
seq2_aligned = str(best_alignment[1])

# Apply adjusted scoring
result = score_alignment(seq1_aligned, seq2_aligned)
print(f"Adjusted identity: {result.identity:.3f}")

Using with Custom/External Aligners

# NOTE: This example is illustrative template code
from adjusted_identity import score_alignment

def process_alignment_file(alignment_file):
    """Process alignments from any external tool."""
    results = []
    
    # Parse your alignment format (FASTA, SAM, custom, etc.)
    for seq1_aligned, seq2_aligned in parse_alignment_file(alignment_file):
        # Ensure sequences are the same length
        assert len(seq1_aligned) == len(seq2_aligned)
        
        # Apply adjusted identity scoring
        result = score_alignment(seq1_aligned, seq2_aligned)
        results.append(result)
    
    return results

Understanding End Trimming Behavior

The end_skip_distance parameter implements "digital end trimming" to skip sequencing artifacts near read ends. Important: This parameter counts nucleotides (non-gap characters), not alignment positions.

Automatic Activation

End trimming only activates when sequences are long enough:

from adjusted_identity import align_and_score, AdjustmentParams

# Short sequences (< 2 × end_skip_distance nucleotides): NO trimming applied
short_seq1 = "ATCGATCG"      # 8 nucleotides 
short_seq2 = "ATCGATCG"      # 8 nucleotides
result = align_and_score(short_seq1, short_seq2)  # end_skip_distance=0 by default
print(f"Scored positions: {result.scored_positions}")  # 8 (full sequence)
print(f"Score pattern: {result.score_aligned}")        # "||||||||" (no trimming dots)

# Long sequences (≥ 2 × end_skip_distance nucleotides): Trimming applied  
long_seq1 = "A" * 25 + "TCGX" + "T" * 25    # 54 nucleotides
long_seq2 = "A" * 25 + "TCGA" + "T" * 25    # 54 nucleotides
result = align_and_score(long_seq1, long_seq2)
print(f"Scored positions: {result.scored_positions}")  # ~14 (middle region only)
print(f"Score pattern: {result.score_aligned}")        # ".......|||| |||||......." (dots show trimmed regions)

Nucleotide vs Position Counting

End trimming counts actual nucleotides in each sequence, ignoring gaps:

# This alignment has gaps, but nucleotide counting still works correctly
seq1_aligned = "AAA---TCGATCG---TTT"  # 12 nucleotides (ignoring gaps)
seq2_aligned = "---AAATCGATCGTTT---"  # 12 nucleotides (ignoring gaps)

# With end_skip_distance=5: skips first 5 and last 5 nucleotides from each sequence
# Only the middle "TCGATCG" region (2 nucleotides) would be scored

Customizing End Trimming

# Disable end trimming completely
no_trim_params = AdjustmentParams(end_skip_distance=0)
result = align_and_score(long_seq1, long_seq2, no_trim_params)

# Use shorter trimming distance for smaller sequences
short_trim_params = AdjustmentParams(end_skip_distance=5) 
result = align_and_score(medium_seq1, medium_seq2, short_trim_params)

Rule of thumb: For sequences shorter than 2 × end_skip_distance nucleotides, end trimming has no effect and the entire alignment is scored.

Advanced Usage

Batch Processing

from adjusted_identity import align_and_score

sequences = [
    ("seq1", "ATCGATCGATCG"),
    ("seq2", "ATCGATCGATCC"),
    ("seq3", "ATCGATCGATCG"),
]

reference = sequences[0][1]
results = []

for name, seq in sequences[1:]:
    result = align_and_score(reference, seq)
    results.append({
        'name': name,
        'identity': result.identity,
        'coverage': min(result.seq1_coverage, result.seq2_coverage)
    })

# Sort by identity
results.sort(key=lambda x: x['identity'], reverse=True)

Understanding Scoring Patterns

The score_aligned field shows how each position was scored:

result = align_and_score("AAA-TTT", "AAAATTT")
print(result.score_aligned)  # "|||=|||"
# | = match
# = = homopolymer extension (ignored if adjustment enabled)

Scoring symbols:

|: Exact match (A=A, C=C, G=G, T=T)
=: Ambiguous match (IUPAC) or homopolymer/repeat extension
(space): Substitution or indel start (counts as mismatch)
-: Indel extension (ignored if normalization enabled)
.: End-trimmed, dual-gap, or overhang (not scored)

Testing

Run the comprehensive test suite:

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=adjusted_identity --cov-report=html

The test suite includes:

Unit tests for all adjustment types
Edge cases and error conditions
Real-world mycological scenarios
Performance tests with long sequences
Documentation examples

Background

This package implements the sequence preprocessing approach described in the MycoBLAST algorithm by Stephen Russell and Mycota Lab, adapted for general-purpose DNA sequence comparison. The foundational research is detailed in "Why NCBI BLAST identity scores can be misleading for fungi".

How the Variant Range Algorithm Works

Starting in v0.2.0, this package uses a variant range algorithm that provides more accurate scoring for complex indel patterns, especially in multi-sequence alignments.

The key insight: Standard aligners don't know about homopolymers—they just find the minimum-edit alignment. This can produce patterns where a simple homopolymer expansion looks like a complex substitution or scattered indels.

How it works:

Find variant ranges: Scan the alignment for contiguous regions where sequences differ (gaps, mismatches, or both). These are bounded by matching positions on each side.
Extract alleles: For each variant range, pull out the gap-free content from each sequence. For example, in TGC-C-TC vs TGCT--TC, the variant range yields alleles "C" and "T".
Check for extensions: Ask whether each allele could be explained as a repeat of the adjacent context:
- Does "C" extend the left context C? Yes (homopolymer)
- Does "T" extend the right context T? Yes (homopolymer)
Apply Occam's razor: If both alleles are valid extensions of their respective contexts, they represent equivalent repeat expansions → 0 edits. No mismatch is counted because both placements are biologically plausible.

Example:

seq1: ATTCA     Traditional scoring: 1 substitution (T vs C)
seq2: ATCCA     Variant range: T extends left T, C extends right C → 0 edits

This approach handles cases that position-by-position algorithms miss, such as "floating" nucleotides in MSA data where gap placement is arbitrary.

For the complete specification, see docs/SCORING_SPEC.md.

Why These Adjustments Matter

The adjustments are particularly valuable for:

Fungal taxonomy: ITS sequences often have homopolymer differences
DNA barcoding: Technical artifacts can obscure phylogenetic signal
Sequence quality assessment: End-trimming handles poor-quality regions
Phylogenetic analysis: IUPAC codes preserve ambiguous but valid matches

Credit: This implementation is based on the MycoBLAST algorithm developed by Stephen Russell and Mycota Lab. The theoretical framework and biological motivation are thoroughly explained in their foundational article.

Contributing

Fork the repository
Create a feature branch: git checkout -b feature-name
Make changes and add tests
Run the test suite: pytest
Submit a pull request

Citation

If you use this package in your research, please cite:

Walker, J. (2025). Adjusted Identity Calculator for DNA Sequences. 
GitHub: https://github.com/joshuaowalker/adjusted-identity

Please also cite the foundational work:

Russell, S. (2025). Why NCBI BLAST identity scores can be misleading for fungi.
Mycota Lab. https://mycotalab.substack.com/p/why-ncbi-blast-identity-scores-can

License

BSD 2-Clause License - see LICENSE file for details.

Changelog

Version 0.2.2

Removed: score_aligned_seq2 field (added in v0.2.1) has been removed
- Analysis showed it was redundant: same as score_aligned 98% of the time
- Scoring is symmetric: swap seq1/seq2 arguments to get the alternate perspective
- This simplifies the API and reduces memory overhead

Version 0.2.1

Bug Fix: Fixed dual-gap handling so they don't split variant ranges (key regression test added)
Bug Fix: Fixed visualization when one position is extension and other is core with matching cores
Improved visualization for indel normalization: first core position shows , subsequent show -

Version 0.2.0

Major Enhancement: Implemented variant range algorithm for improved homopolymer and repeat motif detection
Key behavioral change: Alternating indel patterns like TGC-C-TC vs TGCT--TC now correctly score as identity=1.0
- The algorithm recognizes that C extends the left C context and T extends the right T context
- Both alleles are valid repeat extensions → 0 edits (Occam's razor principle)
Algorithm improvements:
- Variant regions are now bounded by non-gap match positions (respects alignment boundaries)
- Alleles extracted from variant ranges are analyzed for left/right repeat extensions
- Split scoring: partial extensions allowed (e.g., "AAG" where "AA" extends context scores AA as 0 edits, G as 1 edit)
- Opposite direction extensions are valid (allele1 extending left + allele2 extending right = both valid)
IUPAC integration: Motif matching uses _are_nucleotides_equivalent() so IUPAC codes can extend context
Breaking change: end_skip_distance now defaults to 0 (disabled). Set end_skip_distance=20 to restore previous behavior.
Removed 218 lines of dead code from previous indel processing implementation

Version 0.1.7

Feature: Added multi-sequence alignment (MSA) dual-gap support for homopolymer normalization
Consensus-based context extraction now handles sequences where both have gaps at the same position (common in MSA outputs from spoa, MUSCLE, MAFFT)
Dual-gap positions ('-' vs '-') are now correctly treated as matches, not indels
Homopolymer detection uses consensus from both sequences when extracting context
Added 17 comprehensive tests for MSA edge cases
100% backward compatible - all 133 tests pass
No API changes - existing code works unchanged

Version 0.1.6

Enhancement: Added validation for contradictory AdjustmentParams configuration
Now raises ValueError when normalize_homopolymers=True but max_repeat_motif_length < 1 (which would silently disable homopolymer normalization)
Added comprehensive test coverage for parameter validation edge cases
No API changes - existing valid configurations work unchanged

Version 0.1.5

Enhancement: Added ambiguous_match field to ScoringFormat to distinguish between exact nucleotide matches and ambiguous matches
Modified _are_nucleotides_equivalent() to return a tuple indicating match type
Score patterns now show | for exact standard nucleotide matches (A=A, C=C, G=G, T=T) and = for any matches involving IUPAC ambiguity codes
No breaking changes - existing code works unchanged but score visualization is more informative

Version 0.1.4

Bug fix: Fixed overhang scoring behavior when end_skip_distance=0
Now correctly scores only positions where both sequences have content (no gap vs nucleotide scoring)
Added comprehensive test suite for overhang region handling edge cases
No API changes - existing code will work unchanged but may see different results for overhang alignments

Version 0.1.3

Bug fix: Fixed alignment length mismatch error in align_edlib_bidirectional()
Resolved "Aligned sequences must have same length" errors for certain sequence pairs
Simplified suffix trimming logic by removing unnecessary sequence trimming/reattachment
No API changes or performance impact

Version 0.1.2

Breaking: Removed BioPython dependency - now only requires edlib
Implemented custom reverse_complement() function with full IUPAC support
Reduced package size and installation complexity
Added comprehensive test coverage for reverse complement functionality
Maintains 100% API compatibility (no code changes needed)

Version 0.1.1

Added repeat motif adjustment support (dinucleotide and longer repeats)
Implemented intelligent motif length detection with degeneracy handling
Added max_repeat_motif_length parameter to AdjustmentParams
Enhanced left-right indel processing algorithm for mixed motif lengths
Added comprehensive test coverage for repeat motif scenarios

Version 0.1.0

Initial release
Complete MycoBLAST-style adjustment implementation (except repeat motifs)
Comprehensive test suite
Full documentation and examples

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

joshuaowalker

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.2.6

Feb 9, 2026

0.2.5

Jan 29, 2026

This version

0.2.4

Dec 5, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

adjusted_identity-0.2.4.tar.gz (71.3 kB view details)

Uploaded Dec 5, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

adjusted_identity-0.2.4-py3-none-any.whl (59.2 kB view details)

Uploaded Dec 5, 2025 Python 3

File details

Details for the file adjusted_identity-0.2.4.tar.gz.

File metadata

Download URL: adjusted_identity-0.2.4.tar.gz
Upload date: Dec 5, 2025
Size: 71.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for adjusted_identity-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`d646dbccfb0d71069bdccbee2b11ab04516cfb486d70d1ecebe9e07c1d099963`
MD5	`ab0fadf017392bbbcca096bbf3eefd24`
BLAKE2b-256	`3be046cc0b21b232ee59606ba669486aead09a8bf9291dd9a4c42389ef18b01d`

See more details on using hashes here.

Provenance

The following attestation bundles were made for adjusted_identity-0.2.4.tar.gz:

Publisher: publish.yml on joshuaowalker/adjusted-identity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: adjusted_identity-0.2.4.tar.gz
- Subject digest: d646dbccfb0d71069bdccbee2b11ab04516cfb486d70d1ecebe9e07c1d099963
- Sigstore transparency entry: 743791537
- Sigstore integration time: Dec 5, 2025
Source repository:
- Permalink: joshuaowalker/adjusted-identity@4ee876b6f838adefc0cf6e4de95c24324e2b52a0
- Branch / Tag: refs/tags/v0.2.4
- Owner: https://github.com/joshuaowalker
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4ee876b6f838adefc0cf6e4de95c24324e2b52a0
- Trigger Event: release

File details

Details for the file adjusted_identity-0.2.4-py3-none-any.whl.

File metadata

Download URL: adjusted_identity-0.2.4-py3-none-any.whl
Upload date: Dec 5, 2025
Size: 59.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for adjusted_identity-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`917a8a37fe62ef9c332acce7f9fdd33a6fcb9a09cf099dd3e2adf2b77871b689`
MD5	`c13decee60e527105be04e28bd230726`
BLAKE2b-256	`987ac5c35ca7a29e4f90c0f0371ed519dc43802fe48f93581ba49901403f377f`

See more details on using hashes here.

Provenance

The following attestation bundles were made for adjusted_identity-0.2.4-py3-none-any.whl:

Publisher: publish.yml on joshuaowalker/adjusted-identity

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: adjusted_identity-0.2.4-py3-none-any.whl
- Subject digest: 917a8a37fe62ef9c332acce7f9fdd33a6fcb9a09cf099dd3e2adf2b77871b689
- Sigstore transparency entry: 743791538
- Sigstore integration time: Dec 5, 2025
Source repository:
- Permalink: joshuaowalker/adjusted-identity@4ee876b6f838adefc0cf6e4de95c24324e2b52a0
- Branch / Tag: refs/tags/v0.2.4
- Owner: https://github.com/joshuaowalker
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4ee876b6f838adefc0cf6e4de95c24324e2b52a0
- Trigger Event: release

adjusted-identity 0.2.4

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Adjusted Identity Calculator for DNA Sequences

Features

Installation

From GitHub

Development Installation

Quick Start

Option 1: Complete Solution (align + score)

Option 2: Core Scoring Only (use with your alignments)

Compare with Traditional Identity

Use Cases

Mycological DNA Barcoding

Handling Ambiguous Bases

Understanding Score Patterns

Repeat Motif Handling

Multi-Sequence Alignment (MSA) Support

Custom Adjustments

API Architecture

Core Layer: score_alignment()

Convenience Layer: align_and_score()

API Reference

Core Function

score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None)

Convenience Function

align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None)

Configuration Classes

AdjustmentParams

ScoringFormat

AlignmentResult

Constants

Integration with Other Alignment Tools

Using with NCBI BLAST

Using with BioPython PairwiseAligner

Using with Custom/External Aligners

Understanding End Trimming Behavior

Automatic Activation

Nucleotide vs Position Counting

Customizing End Trimming

Advanced Usage

Batch Processing

Understanding Scoring Patterns

Testing

Background

How the Variant Range Algorithm Works

Why These Adjustments Matter

Contributing

Citation

License

Changelog

Version 0.2.2

Version 0.2.1

Version 0.2.0

Version 0.1.7

Version 0.1.6

Version 0.1.5

Version 0.1.4

Version 0.1.3

Version 0.1.2

Version 0.1.1

Version 0.1.0

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Core Layer: `score_alignment()`

Convenience Layer: `align_and_score()`

`score_alignment(seq1_aligned, seq2_aligned, adjustment_params=None, scoring_format=None)`

`align_and_score(seq1, seq2, adjustment_params=None, scoring_format=None)`

`AdjustmentParams`

`ScoringFormat`

`AlignmentResult`