STR Similarity Scorer
This repository contains a single Python module to compute the Tanabe score ("non-empty markers" mode) for pairs of records in an input data frame.
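For orientation, the Tanabe score in "non-empty markers" mode can be sketched in plain Python as twice the number of shared alleles divided by the total number of alleles in both records, counting only loci where both records have a value. This is a minimal illustration of the standard formula, not the module's actual implementation. The helper names parse_alleles and tanabe_score_pair are hypothetical, and missing values are assumed to be None rather than pd.NA.

```python
def parse_alleles(value):
    """Split a comma-separated allele string into a set.

    Assumption: missing values are passed as None (not pd.NA).
    """
    if value is None:
        return set()
    return {a.strip() for a in str(value).split(",") if a.strip()}


def tanabe_score_pair(rec_a, rec_b, loci):
    """Tanabe score over loci that are non-empty in both records."""
    shared = 0  # alleles present in both records at the same locus
    total = 0   # alleles in record A plus alleles in record B
    for locus in loci:
        a = parse_alleles(rec_a.get(locus))
        b = parse_alleles(rec_b.get(locus))
        if not a or not b:
            continue  # skip loci that are empty in either record
        shared += len(a & b)
        total += len(a) + len(b)
    return 2 * shared / total if total else float("nan")


# Hypothetical toy records: fga is missing in rec_b, so it is skipped.
rec_a = {"th01": "6, 8", "tpox": "11", "fga": "24"}
rec_b = {"th01": "6, 9.3", "tpox": "11", "fga": None}
score = tanabe_score_pair(rec_a, rec_b, ["th01", "tpox", "fga"])
print(score)  # 2 shared alleles out of 6 total: 2 * 2 / 6
```

Skipping a locus as soon as either side is empty is what distinguishes this mode from variants that penalize missing markers.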
Usage
import pandas as pd

import str_sim_scorer

# One row per sample; each locus column holds a comma-separated allele string.
df = pd.DataFrame(
    [
        {
            "id": "sample1",
            "csf1po": "11, 13",
            "d13s317": "11, 12",
            "d16s539": "9, 12",
            "d18s51": "11, 19",
            "d21s11": "29, 31.2",
            "d3s1358": "17",
            "d5s818": "13",
            "d7s820": "10",
            "d8s1179": "12, 13",
            "fga": "24",
            "penta_d": "9, 12",
            "penta_e": "7, 13",
            "th01": "6, 8",
            "tpox": "11",
        },
        {
            "id": "sample2",
            "csf1po": "12",
            "d13s317": "11, 12",
            "d16s539": "8, 12",
            "d18s51": "17, 18",
            "d21s11": "28, 33.2",
            "d3s1358": "16",
            "d5s818": "11",
            "d7s820": "8, 13",
            "d8s1179": "9, 10",
            "fga": "21, 25",
            "penta_d": "9, 12",
            "penta_e": "7",
            "th01": "7, 9.3",
            "tpox": "8",
        },
        {
            "id": "sample3",
            "csf1po": "11, 12",
            "d13s317": "8",
            "d16s539": "11",
            "d18s51": "18",
            "d21s11": pd.NA,  # missing marker
            "d3s1358": "16",
            "d5s818": "10, 11",
            "d7s820": "12",
            "d8s1179": "11",
            "fga": "26",
            "penta_d": pd.NA,  # missing marker
            "penta_e": "13.1, 12.1",
            "th01": "6, 9.3",
            "tpox": "12",
        },
    ]
)

tanabe_scores = str_sim_scorer.compare(
    df,
    id_col_name="id",
    locus_col_names=[
        "csf1po",
        "d13s317",
        "d16s539",
        "d18s51",
        "d21s11",
        "d3s1358",
        "d5s818",
        "d7s820",
        "d8s1179",
        "fga",
        "penta_d",
        "penta_e",
        "th01",
        "tpox",
    ],
    output="df",
)
Using output="df" returns a data frame for distinct pairs of IDs:
>>> print(tanabe_scores)
       id1      id2  shared_alleles  common_loci  tanabe_score
0  sample1  sample2               6           46      0.260870
1  sample1  sample3               2           35      0.114286
2  sample2  sample3               5           35      0.285714
Using output="symmetric_df" returns the same data frame with (id1, id2) rows repeated as (id2, id1):
>>> print(tanabe_scores)
       id1      id2  shared_alleles  common_loci  tanabe_score
0  sample1  sample2               6           46      0.260870
1  sample1  sample3               2           35      0.114286
2  sample2  sample3               5           35      0.285714
3  sample2  sample1               6           46      0.260870
4  sample3  sample1               2           35      0.114286
5  sample3  sample2               5           35      0.285714
Finally, output="array" returns a dictionary of symmetric NumPy arrays for shared_alleles, common_loci, and tanabe_score, along with their row/column names:
>>> print(tanabe_scores["tanabe_scores"])
array([[1.        , 0.26086957, 0.11428571],
       [0.26086957, 1.        , 0.28571429],
       [0.11428571, 0.28571429, 1.        ]])
This package computes the number of shared alleles and common markers largely through matrix algebra, so it is fast enough to be run on thousands of records (millions of pairs).
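One way such a matrix formulation could look is sketched below with NumPy. It is an illustration under stated assumptions, not the package's code: the function tanabe_matrices and the matrix names X, C, and P are hypothetical, and missing markers are assumed to be None. Each (locus, allele) pair becomes a one-hot column, so a single matrix product counts shared alleles for every pair of records at once, and two more products give the summed allele counts over loci that are non-empty in both records.

```python
import numpy as np


def tanabe_matrices(records, loci):
    """Return (shared, total, score) as n x n arrays for all record pairs.

    Assumption: each record maps locus -> "a, b" string, or None if missing.
    """
    columns = {}  # (locus, allele) -> one-hot column index
    parsed = []
    for rec in records:
        row = []
        for locus in loci:
            value = rec.get(locus)
            alleles = set() if value is None else {
                a.strip() for a in value.split(",") if a.strip()
            }
            for allele in alleles:
                columns.setdefault((locus, allele), len(columns))
            row.append(alleles)
        parsed.append(row)

    n = len(records)
    X = np.zeros((n, len(columns)))  # one-hot allele membership
    C = np.zeros((n, len(loci)))     # allele count per record and locus
    for i, row in enumerate(parsed):
        for j, alleles in enumerate(row):
            C[i, j] = len(alleles)
            for allele in alleles:
                X[i, columns[(loci[j], allele)]] = 1.0

    P = (C > 0).astype(float)  # 1 where the locus is non-empty
    shared = X @ X.T           # shared alleles for every pair
    # Alleles of both records, summed over loci non-empty in both.
    total = C @ P.T + P @ C.T
    with np.errstate(divide="ignore", invalid="ignore"):
        score = 2 * shared / total
    return shared, total, score


# Hypothetical toy records; fga is missing in the second one.
records = [
    {"th01": "6, 8", "tpox": "11", "fga": "24"},
    {"th01": "6, 9.3", "tpox": "11", "fga": None},
    {"th01": "7", "tpox": "8", "fga": "24"},
]
shared, total, score = tanabe_matrices(records, ["th01", "tpox", "fga"])
print(score)
```

In the usage example above, the common_loci values (46 and 35) match this summed allele count, and each tanabe_score equals 2 * shared_alleles / common_loci. For thousands of records, sparse matrices would likely be a better fit than the dense arrays used here.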
Development
Installation
- Install the required system dependencies (the commands below use pyenv and Poetry).
- Install the required Python version (>=3.9):
  pyenv install "$(cat .python-version)"
- Confirm that python maps to the correct version:
  python --version
- Set the Poetry interpreter and install the Python dependencies:
  poetry env use "$(pyenv which python)"
  poetry install
Run poetry run pyright to check static types with Pyright.
Testing
poetry run pytest