texim

texim: a text similarity tool designed to work better for record linkage.

Description

texim is a text similarity tool for record linkage tasks.
It adds two refinements to cosine and Jaccard similarity:

  • length sensitive weight
  • semi-match method for field matching

weight type

Classical cosine similarity uses TF-IDF as the token weight; here we use plain TF, because the strings are short. Record linkage commonly matches individual fields such as name, email, and address.

There are 3 weight types:

  • tf : frequency of the token in the string
  • len : length of the token in characters
  • 1 : the constant 1 for every token
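
To make the three weight types concrete, here is a minimal sketch of how each one could assign weights to the tokens of a short string. The function name `token_weights` and the whitespace tokenization are assumptions for illustration only; this is not texim's own code.

```python
from collections import Counter


def token_weights(text, wtype="tf"):
    """Map each distinct token to a weight (hypothetical helper, not part of texim)."""
    # Assumption: tokens are obtained by lower-casing and splitting on whitespace.
    counts = Counter(text.lower().split())
    if wtype == "tf":
        # Weight = number of times the token occurs in the string.
        return dict(counts)
    if wtype == "len":
        # Weight = number of characters in the token.
        return {tok: len(tok) for tok in counts}
    if wtype == "1":
        # Every distinct token gets the same weight.
        return {tok: 1 for tok in counts}
    raise ValueError(f"unknown weight type: {wtype!r}")


print(token_weights("a b b b v", "tf"))        # {'a': 1, 'b': 3, 'v': 1}
print(token_weights("vandesompele j", "len"))  # {'vandesompele': 12, 'j': 1}
print(token_weights("a b b b v", "1"))         # {'a': 1, 'b': 1, 'v': 1}
```

With `len`, a long surname dominates a one-letter initial, which is why this weight suits name fields where initials carry little information.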

semi-match

Abbreviations are common: for "alan turing" vs "a turing", semi-match pairs "alan" with "a" and "turing" with "turing".
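
One way to picture semi-match is as prefix matching on the tokens that are left over after exact matches are paired. The sketch below assumes exactly that rule (one token is a prefix of the other) and uses a hypothetical function name, `prefix_match`; texim's actual matching rule may differ.

```python
def prefix_match(tokens1, tokens2, semi_match=True):
    """Pair tokens: exact matches first, then prefix matches (illustrative only)."""
    remaining = list(tokens2)  # tokens of the second string not yet paired
    pairs = []
    unmatched = []
    # Pass 1: pair identical tokens.
    for tok in tokens1:
        if tok in remaining:
            remaining.remove(tok)
            pairs.append((tok, tok))
        else:
            unmatched.append(tok)
    # Pass 2 (assumed semi-match rule): pair tokens where one is a prefix of the other.
    if semi_match:
        for tok in unmatched:
            for cand in remaining:
                if cand.startswith(tok) or tok.startswith(cand):
                    remaining.remove(cand)
                    pairs.append((tok, cand))
                    break  # stop after the first prefix match for this token
    return pairs


print(prefix_match(["alan", "turing"], ["a", "turing"]))
# [('turing', 'turing'), ('alan', 'a')]
print(prefix_match(["alan", "turing"], ["a", "turing"], semi_match=False))
# [('turing', 'turing')]
```

Pairing exact matches first keeps a full token from being consumed by an abbreviation that happens to share its first letters.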

Install

pip install texim 

Examples

from texim import similarity, words_match

## semi-match 
tokens1 = ["vandesompele","j"]
tokens2 = ["jo", "vandesompele"]
words_match(tokens1, tokens2, semi_match=False)
# [("vandesompele", "vandesompele")]
words_match(tokens1, tokens2, semi_match=True)
# [("vandesompele", "vandesompele"), ("j", "jo")]

## cosine similarity
text1, text2 = "vandesompele j", "jo vandesompele"
text3, text4 = "a b b b v", "a b b c"
similarity(text1, text2, semi_match=False, wtype="len") # 0.98
similarity(text1, text2, semi_match=True, wtype="len")  # 1.0
similarity(text1, text2, semi_match=False, wtype="tf")  # 0.5
similarity(text3, text4, semi_match=False, wtype="tf")  # 0.86
similarity(text3, text4, semi_match=False, wtype="1")   # 0.67

## jaccard similarity
similarity(text1, text2, method="jaccard", semi_match=False, wtype="tf")  # 0.33
similarity(text1, text2, method="jaccard", semi_match=False, wtype="len")  # 0.8
similarity(text1, text2, method="jaccard", semi_match=True, wtype="tf")  # 1.0
similarity(text3, text4, method="jaccard", semi_match=False, wtype="tf")  # 0.64
similarity(text3, text4, method="jaccard", semi_match=False, wtype="1")  # 0.5
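
For readers who want to see where the cosine values without semi-match come from, the following is a minimal sketch of a weighted cosine over token weights. It is a plain textbook formulation with a hypothetical name, `weighted_cosine`, and whitespace tokenization; that texim computes it the same way is an assumption, although the sketch agrees with the four non-semi-match cosine values listed above when rounded to two decimals.

```python
import math
from collections import Counter


def weighted_cosine(text1, text2, wtype="tf"):
    """Cosine similarity over token weights (illustrative sketch, not texim's code)."""

    def weights(text):
        counts = Counter(text.lower().split())
        if wtype == "tf":
            return dict(counts)
        if wtype == "len":
            return {t: len(t) for t in counts}
        return {t: 1 for t in counts}  # treated as wtype == "1"

    w1, w2 = weights(text1), weights(text2)
    # Dot product over the tokens that both strings share.
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


print(round(weighted_cosine("vandesompele j", "jo vandesompele", "len"), 2))  # 0.98
print(round(weighted_cosine("vandesompele j", "jo vandesompele", "tf"), 2))   # 0.5
print(round(weighted_cosine("a b b b v", "a b b c", "tf"), 2))                # 0.86
print(round(weighted_cosine("a b b b v", "a b b c", "1"), 2))                 # 0.67
```

For the first call, the only shared token is "vandesompele" (weight 12 on both sides), so the result is 144 / (sqrt(145) * sqrt(148)), roughly 0.983.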

Notice

  • All fields need to be converted to lower case before matching.
  • You can call texim.cosine and texim.jaccard directly if you need custom tokenization and weight computation.
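
The signatures of texim.cosine and texim.jaccard are not documented here, so the sketch below does not call them. It only illustrates what custom tokenization and weighting might look like, using two hypothetical helpers: `custom_tokens` splits on any non-alphanumeric character, and `weighted_jaccard` scores two token-to-weight mappings. The Jaccard formulation (matched tokens contribute the average of their two weights) is an assumption chosen for illustration.

```python
import re
from collections import Counter


def custom_tokens(text):
    """Lower-case the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_jaccard(w1, w2):
    """Weighted Jaccard over two token->weight mappings (illustrative sketch)."""
    shared = w1.keys() & w2.keys()
    # Assumption: a shared token contributes the mean of its two weights.
    inter = sum((w1[t] + w2[t]) / 2 for t in shared)
    # Tokens present on one side only count toward the union but not the intersection.
    only = sum(w for t, w in w1.items() if t not in shared)
    only += sum(w for t, w in w2.items() if t not in shared)
    union = inter + only
    return inter / union if union else 0.0


print(custom_tokens("Jo.Vandesompele@example.org"))
# ['jo', 'vandesompele', 'example', 'org']

tf3 = Counter(custom_tokens("a b b b v"))  # custom TF weights
tf4 = Counter(custom_tokens("a b b c"))
print(round(weighted_jaccard(tf3, tf4), 2))  # 0.64
```

Separating tokenization from scoring this way lets an email or address field use its own splitting rule while the similarity function stays unchanged.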

E-Mail

See setup.py for the contact e-mail address.
