STR Similarity Scorer

This repository contains a Python package that computes similarity scores for pairs of short tandem repeat (STR) profiles stored as records in an input data frame. Some records may optionally be flagged as reference profiles.

The package counts matching alleles, total alleles, and common loci largely through matrix algebra, so it is fast enough to run on thousands of samples (millions of pairs).
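
To make the matrix-algebra idea concrete, here is a minimal sketch. It is not the package's actual implementation: it assumes each profile is encoded as a binary row over the distinct (locus, allele) pairs, and the toy profiles and variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy profiles: locus -> set of alleles
profiles = {
    "a": {"th01": {"6", "8"}, "tpox": {"11"}},
    "b": {"th01": {"6", "9.3"}, "tpox": {"11"}},
    "c": {"th01": {"7"}, "tpox": {"8", "11"}},
}
sample_ids = list(profiles)

# One column per distinct (locus, allele) pair seen in any profile
columns = sorted(
    {
        (locus, allele)
        for profile in profiles.values()
        for locus, alleles in profile.items()
        for allele in alleles
    }
)
col_index = {col: k for k, col in enumerate(columns)}

# Binary indicator matrix: X[i, k] == 1 if sample i carries allele k
X = np.zeros((len(sample_ids), len(columns)), dtype=int)
for i, sample_id in enumerate(sample_ids):
    for locus, alleles in profiles[sample_id].items():
        for allele in alleles:
            X[i, col_index[(locus, allele)]] = 1

# shared[i, j] is the number of alleles that samples i and j have in common;
# the diagonal holds each sample's own allele count
shared = X @ X.T
print(shared)
```

A single matrix product yields the shared-allele count for every pair at once, which is why this style of computation scales to millions of pairs.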

Installation

Install str_sim_scorer from PyPI using the package manager of your choice.

Usage

The StrSimScorer class provides an object-oriented interface with caching for efficient computation:

import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

# Create the comparison object
scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
)

scores = scorer.scores(output="df")

Output formats

Using output="df" returns a DataFrame for distinct pairs of IDs:

>>> print(scores)
       id1      id2  n_loci_used     score
0  sample1  sample2           14   0.26087
1  sample1  sample3           12  0.114286
2  sample2  sample3           12  0.285714

Using output="symmetric_df" returns the same data with both (id1, id2) and (id2, id1) rows:

>>> scores_sym = scorer.scores(output="symmetric_df")
>>> print(scores_sym)
       id1      id2  n_loci_used     score
0  sample1  sample2           14   0.26087
1  sample1  sample3           12  0.114286
2  sample2  sample1           14   0.26087
3  sample2  sample3           12  0.285714
4  sample3  sample1           12  0.114286
5  sample3  sample2           12  0.285714
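
If a square, labelled table is more convenient than long-form rows, the symmetric output can be pivoted with pandas. The sketch below rebuilds the data frame printed above by hand so that it runs without the package; the variable name `wide` is only illustrative.

```python
import pandas as pd

# Hand-built copy of the symmetric output shown above
scores_sym = pd.DataFrame(
    {
        "id1": ["sample1", "sample1", "sample2", "sample2", "sample3", "sample3"],
        "id2": ["sample2", "sample3", "sample1", "sample3", "sample1", "sample2"],
        "n_loci_used": [14, 12, 14, 12, 12, 12],
        "score": [0.26087, 0.114286, 0.26087, 0.285714, 0.114286, 0.285714],
    }
)

# Rows are id1, columns are id2; pairs that are absent (here, the diagonal) become NaN
wide = scores_sym.pivot(index="id1", columns="id2", values="score")
print(wide)
```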

Using output="array" returns the raw similarity matrix as a numpy masked array:

>>> array = scorer.scores(output="array")
>>> print(array)
masked_array(
  data=[[--, 0.2608695652173913, 0.11428571428571428],
        [0.2608695652173913, --, 0.2857142857142857],
        [0.11428571428571428, 0.2857142857142857, --]],
  mask=[[ True, False, False],
        [False,  True, False],
        [False, False,  True]],
  fill_value=0.0)
>>> print(scorer.sample_ids)  # the row/column names of the matrix
['sample1', 'sample2', 'sample3']

Only cells representing a valid and relevant pair of samples are unmasked, which is why the diagonal is masked in this example.
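
As a rough illustration of working with the masked array, the sketch below rebuilds the matrix shown above by hand (so it runs without the package) and uses standard NumPy masked-array methods to find each sample's highest-scoring partner. The names `best` and `frame` are only illustrative.

```python
import numpy as np
import pandas as pd

sample_ids = ["sample1", "sample2", "sample3"]

# Values copied from the output above; the diagonal entries are placeholders
data = np.array(
    [
        [0.0, 0.2608695652173913, 0.11428571428571428],
        [0.2608695652173913, 0.0, 0.2857142857142857],
        [0.11428571428571428, 0.2857142857142857, 0.0],
    ]
)
scores = np.ma.masked_array(data, mask=np.eye(3, dtype=bool))

# argmax skips masked cells. A row that is entirely masked would still
# return index 0, so check scores.mask.all(axis=1) first in that case.
best = {
    sample_ids[i]: sample_ids[int(j)]
    for i, j in enumerate(scores.argmax(axis=1))
}
print(best)

# Replace masked cells with NaN to get a labelled DataFrame
frame = pd.DataFrame(scores.filled(np.nan), index=sample_ids, columns=sample_ids)
print(frame)
```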

Algorithms

Tanabe / non-empty markers

This package implements two algorithms. For a pair of samples where neither is indicated as a "reference", the score is calculated using the Tanabe algorithm under the "non-empty markers" mode. Thus, n_loci_used is the number of loci where both samples had data.
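
For reference, the Tanabe score is commonly written as 2 × (shared alleles) / (alleles in sample A + alleles in sample B). The pure-Python sketch below applies that formula to sample1 and sample3 from the Usage example, counting only loci where both samples have data. The helper names are hypothetical, `None` stands in for `pd.NA`, and the package itself computes this through matrix operations rather than a loop.

```python
def parse_alleles(value):
    """Split a comma-separated allele string into a set; None means no data."""
    if value is None:
        return set()
    return {allele.strip() for allele in value.split(",") if allele.strip()}


def tanabe_non_empty(p, q):
    """Return (score, n_loci_used), using only loci where both profiles have data."""
    shared = total = n_loci_used = 0
    for locus in p.keys() & q.keys():
        a, b = parse_alleles(p[locus]), parse_alleles(q[locus])
        if not a or not b:
            continue  # skip loci that are empty in either sample
        n_loci_used += 1
        shared += len(a & b)
        total += len(a) + len(b)
    score = 2 * shared / total if total else float("nan")
    return score, n_loci_used


sample1 = {
    "csf1po": "11, 13", "d13s317": "11, 12", "d16s539": "9, 12",
    "d18s51": "11, 19", "d21s11": "29, 31.2", "d3s1358": "17",
    "d5s818": "13", "d7s820": "10", "d8s1179": "12, 13", "fga": "24",
    "penta_d": "9, 12", "penta_e": "7, 13", "th01": "6, 8", "tpox": "11",
}
sample3 = {
    "csf1po": "11, 12", "d13s317": "8", "d16s539": "11",
    "d18s51": "18", "d21s11": None, "d3s1358": "16",
    "d5s818": "10, 11", "d7s820": "12", "d8s1179": "11", "fga": "26",
    "penta_d": None, "penta_e": "13.1, 12.1", "th01": "6, 9.3", "tpox": "12",
}

score, n_loci_used = tanabe_non_empty(sample1, sample3)
print(n_loci_used, round(score, 6))  # 12 0.114286
```

The two samples share 2 alleles across the 12 loci where both have data, and carry 19 and 16 alleles at those loci, giving 2 × 2 / (19 + 16) = 4/35, which matches the sample1/sample3 row in the output above.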

Master vs. reference / reference markers

If your input data frame has a boolean column indicating that some samples are references (e.g. STR profiles from a canonical source like Cellosaurus), set the is_reference_col_name argument to the name of that column. A cell i,j in the scores matrix is computed using the "master vs. reference" algorithm if scorer.sample_ids[i] is a master (i.e. a real sample) and scorer.sample_ids[j] is a reference. In this case, n_loci_used is the number of loci present in the reference.

import pandas as pd
from str_sim_scorer import StrSimScorer

df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "is_ref": False,
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "is_ref": True,
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "is_ref": False,
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
    is_reference_col_name="is_ref",
)

scores = scorer.scores(output="array")

>>> print(scores)
masked_array(
  data=[[--, 0.2608695652173913, 0.11428571428571428],
        [--, --, --],
        [0.11428571428571428, 0.21739130434782608, --]],
  mask=[[ True, False, False],
        [ True,  True,  True],
        [False, False,  True]],
  fill_value=0.0)

Cells for pairs of non-reference samples, like 0,2 and 2,0 (sample1 vs. sample3), are computed using the Tanabe algorithm under the "non-empty markers" mode.
Cells like 0,1 (sample1 vs. sample2) and 2,1 (sample3 vs. sample2) compare a master against the reference sample2, so they use the master vs. reference algorithm under the "reference markers" mode.
Cells 1,0 (sample2 vs. sample1) and 1,2 (sample2 vs. sample3) are masked because a comparison with the reference in the row position is irrelevant under both algorithms.
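
The value in cell 2,1 can be checked by hand. The sketch below assumes the master vs. reference score is the number of shared alleles divided by the number of alleles in the reference, counted over every locus present in the reference; that assumption is consistent with the matrix above. The helper names are hypothetical and `None` stands in for `pd.NA`.

```python
def parse_alleles(value):
    """Split a comma-separated allele string into a set; None means no data."""
    if value is None:
        return set()
    return {allele.strip() for allele in value.split(",") if allele.strip()}


def master_vs_reference(master, reference):
    """Return (score, n_loci_used), counting every locus present in the reference."""
    shared = ref_total = n_loci_used = 0
    for locus, ref_value in reference.items():
        ref_alleles = parse_alleles(ref_value)
        if not ref_alleles:
            continue  # locus is absent from the reference
        n_loci_used += 1
        ref_total += len(ref_alleles)
        # A locus missing from the master contributes no shared alleles
        shared += len(ref_alleles & parse_alleles(master.get(locus)))
    score = shared / ref_total if ref_total else float("nan")
    return score, n_loci_used


sample2 = {  # the reference
    "csf1po": "12", "d13s317": "11, 12", "d16s539": "8, 12",
    "d18s51": "17, 18", "d21s11": "28, 33.2", "d3s1358": "16",
    "d5s818": "11", "d7s820": "8, 13", "d8s1179": "9, 10", "fga": "21, 25",
    "penta_d": "9, 12", "penta_e": "7", "th01": "7, 9.3", "tpox": "8",
}
sample3 = {  # the master
    "csf1po": "11, 12", "d13s317": "8", "d16s539": "11",
    "d18s51": "18", "d21s11": None, "d3s1358": "16",
    "d5s818": "10, 11", "d7s820": "12", "d8s1179": "11", "fga": "26",
    "penta_d": None, "penta_e": "13.1, 12.1", "th01": "6, 9.3", "tpox": "12",
}

score, n_loci_used = master_vs_reference(sample3, sample2)
print(n_loci_used, round(score, 6))  # 14 0.217391
```

sample3 shares 5 of the 23 alleles in sample2, giving 5/23. The loci missing from sample3 (d21s11 and penta_d) still count toward the denominator, which is why this value differs from the Tanabe score for the same pair in the first example.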

Development

Installation

Install the required system dependencies:

- pyenv
- Poetry
- pre-commit

Install the required Python version (>=3.9):
```
pyenv install "$(cat .python-version)"
```
Confirm that python maps to the correct version:
```
python --version
```
Set the Poetry interpreter and install the Python dependencies:
```
poetry env use "$(pyenv which python)"
poetry install
```

Run poetry run pyright to check static types with Pyright.

Testing

poetry run pytest

Release history

- 3.1.1 (Oct 2, 2025), this version
- 3.1.0 (Oct 2, 2025)
- 3.0.2 (Aug 28, 2025)
- 3.0.1 (Aug 28, 2025)
- 3.0.0 (Aug 28, 2025)
- 2.2.1 (Aug 11, 2025)
- 2.2.0 (Aug 11, 2025)
- 2.0.1 (Aug 11, 2025)
- 2.0.0 (Aug 11, 2025)
- 1.0.0 (Aug 5, 2025)
- 0.1.1 (Sep 26, 2024)

Download files

Source Distribution

str_sim_scorer-3.1.0.tar.gz (10.5 kB view details)

Uploaded Oct 2, 2025 Source

Built Distribution

str_sim_scorer-3.1.0-py3-none-any.whl (9.7 kB view details)

Uploaded Oct 2, 2025 Python 3

File details

Details for the file str_sim_scorer-3.1.0.tar.gz.

File metadata

Download URL: str_sim_scorer-3.1.0.tar.gz
Upload date: Oct 2, 2025
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.5 CPython/3.12.9 Darwin/24.6.0

File hashes

Hashes for str_sim_scorer-3.1.0.tar.gz
Algorithm	Hash digest
SHA256	`17f063e698d7155f95cdbf46d347517dbd7f1dae885f749ee254b3572d73af57`
MD5	`4867e68d1f1f2e922f56c84838dc90d8`
BLAKE2b-256	`0b064ade8d5c0b706777d47cdaa4da7cc9ea9082c13f6dad2144ab7212435426`

File details

Details for the file str_sim_scorer-3.1.0-py3-none-any.whl.

File metadata

Download URL: str_sim_scorer-3.1.0-py3-none-any.whl
Upload date: Oct 2, 2025
Size: 9.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.8.5 CPython/3.12.9 Darwin/24.6.0

File hashes

Hashes for str_sim_scorer-3.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a5e86445b2f6108a18ce04e53fc78630af3729b77a02a7a92cdfc268c1d451f0`
MD5	`5ceafee7fb7c5313ef7bee593a7ed357`
BLAKE2b-256	`f6e793c3f6211180b6a76cc3ddaf142312081581eb02e45b5ca63e4500d5b8f3`
