STR Similarity Scorer

This repository contains a Python package to compute the Tanabe score ("non-empty markers" mode) for pairs of records in an input data frame.
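
As a rough illustration of what the score measures, the sketch below computes it for a single pair in plain Python. It assumes the score is 2 × matching alleles / total alleles, counted only over loci that are non-empty in both records; that assumption reproduces the numbers in the example output under "Output formats". The helper names parse_alleles and tanabe_pair are hypothetical and are not part of the package, and missing values are represented as None here rather than pd.NA.

```python
def parse_alleles(cell):
    """Split a comma-separated allele string into a set (None -> empty set)."""
    if cell is None:
        return set()
    return {a.strip() for a in str(cell).split(",") if a.strip()}


def tanabe_pair(rec_a, rec_b, loci):
    """Return (n_common_loci, n_matching_alleles, n_total_alleles, score)."""
    n_common = n_matching = n_total = 0
    for locus in loci:
        a = parse_alleles(rec_a.get(locus))
        b = parse_alleles(rec_b.get(locus))
        if not a or not b:
            continue  # skip loci that are empty in either record
        n_common += 1
        n_matching += len(a & b)       # alleles shared at this locus
        n_total += len(a) + len(b)     # alleles from both records
    score = 2 * n_matching / n_total if n_total else float("nan")
    return n_common, n_matching, n_total, score


# Toy records (a subset of the loci used later in this README)
s1 = {"csf1po": "11, 13", "th01": "6, 8", "d21s11": "29, 31.2"}
s3 = {"csf1po": "11, 12", "th01": "6, 9.3", "d21s11": None}
print(tanabe_pair(s1, s3, ["csf1po", "th01", "d21s11"]))  # (2, 2, 8, 0.5)
```

Alleles are compared as strings in this sketch, so "13" and "13.1" count as different alleles; how the package itself normalizes allele values is not shown here.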

This package computes the number of common loci, matching alleles, and total alleles for every pair of records largely through matrix algebra, so it is fast enough to run on thousands of samples (millions of pairs).
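
To give a feel for how the pairwise counts can come out of matrix products, here is a minimal NumPy sketch on toy data. It is not the package's actual implementation: the matrices X, C, and P, the toy samples, and the dense integer arrays are all assumptions made for illustration (a real implementation at scale would more likely use sparse matrices).

```python
import numpy as np

# Hypothetical toy input: one dict per sample, locus -> set of alleles.
samples = [
    {"csf1po": {"11", "13"}, "th01": {"6", "8"}, "tpox": {"11"}},
    {"csf1po": {"11", "12"}, "th01": {"6", "9.3"}},
    {"csf1po": {"12"}, "tpox": {"8", "11"}},
]

loci = sorted({locus for s in samples for locus in s})
features = sorted(
    {(locus, a) for s in samples for locus, alleles in s.items() for a in alleles}
)

# X[i, k] = 1 if sample i carries the k-th (locus, allele) pair
X = np.zeros((len(samples), len(features)), dtype=int)
# C[i, l] = number of alleles sample i has at locus l
C = np.zeros((len(samples), len(loci)), dtype=int)
for i, s in enumerate(samples):
    for locus, alleles in s.items():
        C[i, loci.index(locus)] = len(alleles)
        for a in alleles:
            X[i, features.index((locus, a))] = 1

# P[i, l] = 1 if locus l is non-empty for sample i
P = (C > 0).astype(int)

n_common_loci = P @ P.T        # loci present in both samples
n_matching = X @ X.T           # (locus, allele) pairs shared by both samples
n_total = C @ P.T + P @ C.T    # alleles of i plus alleles of j over shared loci

# Pairs with no shared loci would divide by zero and yield NaN
with np.errstate(divide="ignore", invalid="ignore"):
    tanabe = 2 * n_matching / n_total

print(tanabe)
```

Each product handles all pairs at once, which is why this formulation scales better than looping over pairs in Python.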

Installation

Install str_sim_scorer from PyPI using the package manager of your choice.

Usage

The StrSimScorer class provides an object-oriented interface with caching for efficient computation:

import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

# Create the scorer object
scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
)

# Get Tanabe scores as a DataFrame (upper triangle only)
tanabe_scores = scorer.tanabe_scores(output="df")

Output formats

Using output="df" returns a DataFrame for distinct pairs of IDs:

>>> print(tanabe_scores)
       id1      id2  n_common_loci  n_matching_alleles  n_total_alleles  tanabe_score
0  sample1  sample2             14                   6               46      0.260870
1  sample1  sample3             12                   2               35      0.114286
2  sample2  sample3             12                   5               35      0.285714

Using output="symmetric_df" returns the same data with both (id1, id2) and (id2, id1) rows:

>>> tanabe_scores_sym = scorer.tanabe_scores(output="symmetric_df")
>>> print(tanabe_scores_sym)
       id1      id2  n_common_loci  n_matching_alleles  n_total_alleles  tanabe_score
0  sample1  sample2             14                   6               46      0.260870
1  sample1  sample3             12                   2               35      0.114286
2  sample2  sample3             12                   5               35      0.285714
5  sample3  sample2             12                   5               35      0.285714
4  sample3  sample1             12                   2               35      0.114286
3  sample2  sample1             14                   6               46      0.260870

Using output="array" returns the raw symmetric matrix:

>>> tanabe_array = scorer.tanabe_scores(output="array")
>>> tanabe_array
array([[1.        , 0.26086957, 0.11428571],
       [0.26086957, 1.        , 0.28571429],
       [0.11428571, 0.28571429, 1.        ]])
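
The three formats carry the same scores. As a sketch of how they relate, the plain-Python snippet below turns a symmetric matrix into the two long-form layouts. It uses nested lists in place of the real array and assumes the rows and columns follow the order of the sample IDs; the variable names are illustrative only.

```python
from itertools import combinations

sample_ids = ["sample1", "sample2", "sample3"]
scores = [
    [1.0, 0.26086957, 0.11428571],
    [0.26086957, 1.0, 0.28571429],
    [0.11428571, 0.28571429, 1.0],
]

# Upper triangle: each unordered pair once, like output="df"
pairs = [
    (sample_ids[i], sample_ids[j], scores[i][j])
    for i, j in combinations(range(len(sample_ids)), 2)
]

# Both orientations of every pair, like output="symmetric_df"
symmetric_pairs = pairs + [(id2, id1, score) for id1, id2, score in pairs]

for row in symmetric_pairs:
    print(row)
```

The diagonal (each sample against itself) is always 1 and is left out of both long-form layouts.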

Accessing individual components

The StrSimScorer class also provides access to intermediate matrices:

from str_sim_scorer import StrSimScorer

scorer = StrSimScorer(...)

# Get the processed alleles DataFrame
alleles = scorer.alleles()

# Get individual matrices
common_loci = scorer.n_common_loci()
matching_alleles = scorer.n_matching_alleles()
total_alleles = scorer.n_total_alleles()

# Get sample IDs in order of the rows/columns of the matrices
sample_ids = scorer.sample_ids()

Development

Installation

  1. Install the required system dependencies:

  2. Install the required Python version (>=3.9):

    pyenv install "$(cat .python-version)"
    
  3. Confirm that python maps to the correct version:

    python --version
    
  4. Set the Poetry interpreter and install the Python dependencies:

    poetry env use "$(pyenv which python)"
    poetry install
    

To check static types with Pyright, run poetry run pyright.

Testing

poetry run pytest
