STR Similarity Scorer
This repository contains a Python package to compute the Tanabe score ("non-empty markers" mode) for pairs of records in an input data frame.
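For orientation, the example output later in this README is consistent with the score being twice the number of matching alleles divided by the total number of alleles, counted only over loci where both samples have a call. The following is a minimal pure-Python sketch of that definition for a single pair. The helper names `parse_alleles` and `tanabe_pair` are hypothetical and not part of the package, and returning NaN when no loci are shared is an assumption.

```python
def parse_alleles(value):
    """Split a comma-separated allele string into a set; None means no call."""
    if value is None:
        return set()
    return {allele.strip() for allele in value.split(",") if allele.strip()}


def tanabe_pair(profile_a, profile_b):
    """Return (n_common_loci, n_matching_alleles, n_total_alleles, score)."""
    n_common = n_matching = n_total = 0
    for locus in profile_a.keys() & profile_b.keys():
        a = parse_alleles(profile_a[locus])
        b = parse_alleles(profile_b[locus])
        if not a or not b:
            continue  # "non-empty markers": skip loci missing in either sample
        n_common += 1
        n_matching += len(a & b)
        n_total += len(a) + len(b)
    # Assumption: NaN when the two samples share no called loci
    score = 2 * n_matching / n_total if n_total else float("nan")
    return n_common, n_matching, n_total, score


# Toy profiles using a few loci from the usage example in this README
s1 = {"csf1po": "11, 13", "th01": "6, 8", "d21s11": "29, 31.2", "tpox": "11"}
s3 = {"csf1po": "11, 12", "th01": "6, 9.3", "d21s11": None, "tpox": "12"}
result = tanabe_pair(s1, s3)
print(result)  # (3, 2, 10, 0.4)
```

Here d21s11 is skipped because one sample has no call, so three loci contribute 2 matching alleles out of 10 total.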
This package computes the number of matching and total alleles and common loci largely through matrix algebra, so it's fast enough to be run on thousands of samples (millions of pairs).
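To illustrate how pairwise counts can come out of matrix products, here is a small NumPy sketch on toy data. It is one possible formulation under stated assumptions, not necessarily how the package is implemented (for example, it uses dense arrays), and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical toy input: sample -> locus -> set of alleles (empty set = no call)
profiles = {
    "s1": {"csf1po": {"11", "13"}, "th01": {"6", "8"}, "tpox": {"11"}},
    "s2": {"csf1po": {"12"}, "th01": {"7", "9.3"}, "tpox": {"8"}},
    "s3": {"csf1po": {"11", "12"}, "th01": {"6", "9.3"}, "tpox": set()},
}
samples = list(profiles)
loci = ["csf1po", "th01", "tpox"]

# One feature column per (locus, allele) combination seen in any sample
features = sorted(
    {(locus, allele) for p in profiles.values() for locus in loci for allele in p[locus]}
)

# X[i, k] = 1 if sample i carries allele feature k
X = np.array(
    [[int(allele in profiles[s][locus]) for (locus, allele) in features] for s in samples]
)
# C[i, l] = number of alleles of sample i at locus l; P[i, l] = 1 if that locus is called
C = np.array([[len(profiles[s][locus]) for locus in loci] for s in samples])
P = (C > 0).astype(int)

n_matching = X @ X.T            # shared alleles per pair
n_common_loci = P @ P.T         # loci called in both samples
n_total = C @ P.T + P @ C.T     # alleles of both samples over their common loci

with np.errstate(divide="ignore", invalid="ignore"):
    tanabe = np.where(n_total > 0, 2 * n_matching / n_total, np.nan)

print(n_common_loci[0, 2], n_matching[0, 2], n_total[0, 2], tanabe[0, 2])  # 2 2 8 0.5
```

Every pair is handled by the same three products, which is why this style of computation scales to many samples without an explicit loop over pairs.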
Installation
Install str_sim_scorer from PyPI using the package manager of your choice.
Usage
The StrSimScorer class provides an object-oriented interface with caching for efficient computation:
import pandas as pd
from str_sim_scorer import StrSimScorer
df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)
# Create the comparison object
scorer = StrSimScorer(
    df,
    sample_id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
)
# Get Tanabe scores as a DataFrame (upper triangle only)
tanabe_scores = scorer.tanabe_scores(output="df")
Output formats
Using output="df" returns a DataFrame for distinct pairs of IDs:
>>> print(tanabe_scores)
       id1      id2  n_common_loci  n_matching_alleles  n_total_alleles  tanabe_score
0  sample1  sample2             14                   6               46      0.260870
1  sample1  sample3             12                   2               35      0.114286
2  sample2  sample3             12                   5               35      0.285714
Using output="symmetric_df" returns the same data with both (id1, id2) and (id2, id1) rows:
>>> tanabe_scores_sym = scorer.tanabe_scores(output="symmetric_df")
>>> print(tanabe_scores_sym)
       id1      id2  n_common_loci  n_matching_alleles  n_total_alleles  tanabe_score
0  sample1  sample2             14                   6               46      0.260870
1  sample1  sample3             12                   2               35      0.114286
2  sample2  sample3             12                   5               35      0.285714
5  sample3  sample2             12                   5               35      0.285714
4  sample3  sample1             12                   2               35      0.114286
3  sample2  sample1             14                   6               46      0.260870
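The relationship between the two DataFrame formats can be sketched with plain pandas: the symmetric form is the distinct-pairs form plus a mirrored copy with id1 and id2 swapped. This is an illustration only, not the package's implementation (its row order and index differ from what this sketch produces), and the variable names are hypothetical.

```python
import pandas as pd

# Distinct pairs, using the IDs and scores from the example output
pairs = pd.DataFrame(
    {
        "id1": ["sample1", "sample1", "sample2"],
        "id2": ["sample2", "sample3", "sample3"],
        "tanabe_score": [0.260870, 0.114286, 0.285714],
    }
)

# Swap the two ID columns, then restore the original column order
mirrored = pairs.rename(columns={"id1": "id2", "id2": "id1"})[pairs.columns]
symmetric = pd.concat([pairs, mirrored], ignore_index=True)

# With both directions present, one filter on id1 finds every partner of a sample
partners = symmetric.loc[symmetric["id1"] == "sample3", ["id2", "tanabe_score"]]
print(partners)
```

The symmetric form is convenient when looking up all comparisons for one sample, since a single filter on id1 is enough.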
Using output="array" returns the raw symmetric matrix:
>>> tanabe_array = scorer.tanabe_scores(output="array")
>>> tanabe_array
array([[1.        , 0.26086957, 0.11428571],
       [0.26086957, 1.        , 0.28571429],
       [0.11428571, 0.28571429, 1.        ]])
Accessing individual components
The StrSimScorer class also provides access to intermediate matrices:
from str_sim_scorer import StrSimScorer
scorer = StrSimScorer(...)
# Get the processed alleles DataFrame
alleles = scorer.alleles()
# Get individual matrices
common_loci = scorer.n_common_loci()
matching_alleles = scorer.n_matching_alleles()
total_alleles = scorer.n_total_alleles()
# Get sample IDs in order of the rows/columns of the matrices
sample_ids = scorer.sample_ids()
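As a sketch of how these components relate to the final score, the arrays below stand in for the outputs of `n_matching_alleles()`, `n_total_alleles()`, and `sample_ids()`. The off-diagonal values are taken from the example output above; the diagonal values are hand-counted from the example profiles and are an assumption about what the package stores there, as is the use of dense NumPy arrays.

```python
import numpy as np
import pandas as pd

# Stand-ins for scorer.sample_ids(), scorer.n_matching_alleles(), scorer.n_total_alleles()
sample_ids = ["sample1", "sample2", "sample3"]
n_matching = np.array([[23, 6, 2], [6, 23, 5], [2, 5, 16]])  # diagonal: assumed
n_total = np.array([[46, 46, 35], [46, 46, 35], [35, 35, 32]])  # diagonal: assumed

# Tanabe score: twice the matching alleles over the total alleles
tanabe = 2 * n_matching / n_total

# Label rows and columns with the sample IDs for readable lookups
labeled = pd.DataFrame(tanabe, index=sample_ids, columns=sample_ids)
print(labeled.round(6))
```

Labeling the matrix this way allows lookups such as `labeled.loc["sample1", "sample2"]` instead of positional indexing.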
Development
Installation
- Install the required system dependencies (the commands below assume pyenv and Poetry are available).
- Install the required Python version (>=3.9):
  pyenv install "$(cat .python-version)"
- Confirm that python maps to the correct version:
  python --version
- Set the Poetry interpreter and install the Python dependencies:
  poetry env use "$(pyenv which python)"
  poetry install
Run poetry run pyright to check static types with Pyright.
Testing
poetry run pytest