texim
texim: a text similarity tool tuned for record linkage.
Description
texim is a text similarity tool for record linkage tasks.
It adds two extensions to cosine and Jaccard similarity:
- length-sensitive weights
- a semi-match method for field matching
weight type
Classical cosine similarity uses TF-IDF as the token weight. texim uses plain TF instead, because it targets short strings: record linkage usually matches short fields such as name, email, and address.
There are three weight types:
- tf : frequency of the token in the string
- len : length of the token
- 1 : constant 1
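To make the three weight types concrete, here is a minimal sketch. It is not texim's code: `token_weights` is a hypothetical helper, whitespace tokenization is assumed, and it is assumed that `len` assigns each distinct token its character length regardless of how often it occurs.

```python
from collections import Counter


def token_weights(text, wtype="tf"):
    """Map each distinct token to a weight (hypothetical helper, not texim's API)."""
    # Assumption: tokens are whitespace-separated and already lower case.
    counts = Counter(text.split())
    if wtype == "tf":
        return dict(counts)                       # token frequency
    if wtype == "len":
        return {tok: len(tok) for tok in counts}  # token length in characters
    if wtype == "1":
        return {tok: 1 for tok in counts}         # constant weight
    raise ValueError(f"unknown wtype: {wtype!r}")


print(token_weights("a b b b v", "tf"))
print(token_weights("vandesompele j", "len"))
print(token_weights("a b b b v", "1"))
```

With `len`, a long surname outweighs a one-letter initial, which is what makes the weight length sensitive.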
semi-match
Abbreviations are common in record fields, for example "alan turing" vs "a turing". Semi-match pairs an abbreviated token with its full form, so "alan" matches "a" and "turing" matches "turing".
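One way such a matcher could work is sketched below. This is an illustration, not texim's implementation: `semi_match_pairs` is a hypothetical name, and the rule that a token "semi-matches" another when one is a prefix of the other is an assumption based on the example above.

```python
def semi_match_pairs(tokens1, tokens2):
    """Pair exact matches first, then pair leftovers by prefix (assumed rule)."""
    remaining = list(tokens2)
    pairs, leftovers = [], []
    # Pass 1: exact matches consume tokens from the second list.
    for tok in tokens1:
        if tok in remaining:
            remaining.remove(tok)
            pairs.append((tok, tok))
        else:
            leftovers.append(tok)
    # Pass 2: an unmatched token pairs with the first remaining token
    # that is a prefix of it, or that it is a prefix of.
    for tok in leftovers:
        for cand in remaining:
            if cand.startswith(tok) or tok.startswith(cand):
                remaining.remove(cand)
                pairs.append((tok, cand))
                break
    return pairs


print(semi_match_pairs(["alan", "turing"], ["a", "turing"]))
# [('turing', 'turing'), ('alan', 'a')]
```

Running exact matches first keeps a full token from being consumed by a shorter abbreviation when an identical token is available.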
Install
pip install texim
Examples
from texim import similarity, words_match
## semi-match
tokens1 = ["vandesompele","j"]
tokens2 = ["jo", "vandesompele"]
words_match(tokens1, tokens2, semi_match=False)
# [("vandesompele", "vandesompele")]
words_match(tokens1, tokens2, semi_match=True)
# [("vandesompele", "vandesompele"), ("j", "jo")]
## cosine similarity
text1, text2 = "vandesompele j", "jo vandesompele"
text3, text4 = "a b b b v", "a b b c"
similarity(text1, text2, semi_match=False, wtype="len") # 0.98
similarity(text1, text2, semi_match=True, wtype="len") # 1.0
similarity(text1, text2, semi_match=False, wtype="tf") # 0.5
similarity(text3, text4, semi_match=False, wtype="tf") # 0.86
similarity(text3, text4, semi_match=False, wtype="1") # 0.67
## jaccard similarity
similarity(text1, text2, method="jaccard", semi_match=False, wtype="tf") # 0.33
similarity(text1, text2, method="jaccard", semi_match=False, wtype="len") # 0.8
similarity(text1, text2, method="jaccard", semi_match=True, wtype="tf") # 1.0
similarity(text3, text4, method="jaccard", semi_match=False, wtype="tf") # 0.64
similarity(text3, text4, method="jaccard", semi_match=False, wtype="1") # 0.5
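The values above with semi_match=False can be reproduced with plain weighted formulas. The sketch below is an inferred reading, not texim's source: the helper names are hypothetical, tokenization is assumed to be whitespace splitting, and the Jaccard variant (shared tokens weighted by the mean of their two weights) was chosen only because it agrees with the documented numbers. The semi_match=True cases are not covered.

```python
import math
from collections import Counter


def weights(text, wtype):
    """Token weights for one string (hypothetical helper)."""
    counts = Counter(text.split())
    if wtype == "tf":
        return dict(counts)
    if wtype == "len":
        return {t: len(t) for t in counts}
    return {t: 1 for t in counts}  # wtype "1"


def cosine_sketch(text1, text2, wtype="tf"):
    """Cosine similarity between the two weight vectors."""
    w1, w2 = weights(text1, wtype), weights(text2, wtype)
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def jaccard_sketch(text1, text2, wtype="tf"):
    """Weighted Jaccard; shared tokens count with their mean weight (assumed)."""
    w1, w2 = weights(text1, wtype), weights(text2, wtype)
    shared = w1.keys() & w2.keys()
    inter = sum((w1[t] + w2[t]) / 2 for t in shared)
    unshared = sum(w for t, w in w1.items() if t not in shared)
    unshared += sum(w for t, w in w2.items() if t not in shared)
    union = inter + unshared
    return inter / union if union else 0.0


text1, text2 = "vandesompele j", "jo vandesompele"
text3, text4 = "a b b b v", "a b b c"
print(round(cosine_sketch(text1, text2, "len"), 2))   # 0.98
print(round(cosine_sketch(text3, text4, "tf"), 2))    # 0.86
print(round(jaccard_sketch(text1, text2, "len"), 2))  # 0.8
print(round(jaccard_sketch(text3, text4, "tf"), 2))   # 0.64
```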
Notice
- All fields must be converted to lower case before comparison.
- You can call texim.cosine and texim.jaccard directly if you need custom tokenization and weight computation.
Please check setup.py for details.
Source Distribution
texim-0.0.1.tar.gz (4.9 kB)
File details
Details for the file texim-0.0.1.tar.gz.
File metadata
- Download URL: texim-0.0.1.tar.gz
- Upload date:
- Size: 4.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.3.1 requests-toolbelt/0.9.1 tqdm/4.41.0 CPython/3.7.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | e9c8f525e1b36768246bf2957c5885c8128e708194dadb838d34bafa06542043
MD5 | ebe7e2e63540d6f2318d075ac3cbc2a1
BLAKE2b-256 | 04b24c0e9fd95e560508795f8263838218f0291048640c8c761c91b401b2dd02