
Semantic Similarity and Relatedness Toolkit

A toolkit for estimating semantic similarity and relatedness between two words or sentences.

Installation

pip install semantic-kit

Functions

  1. Lesk algorithm and an improved variant
  2. Similarity algorithms, including WordNet-based, word2vec, LDA, and GoogleNews-based methods
  3. Distance algorithms such as Jaccard, Sørensen-Dice, and Levenshtein, plus their improved versions (the first two are sketched below)
  4. Use the Open Multilingual Wordnet to generate related keywords in multiple languages
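
For reference, the two set-overlap measures in item 3 can be written in a few lines of plain Python. This is an illustrative sketch of the measures themselves, not semantic-kit's API:

def jaccard_similarity(a, b):
    # Jaccard: |A ∩ B| / |A ∪ B| over token sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def sorensen_dice(a, b):
    # Sørensen-Dice: 2|A ∩ B| / (|A| + |B|) over token sets
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

print(jaccard_similarity('the cat sat'.split(), 'the cat ran'.split()))  # 0.5
print(sorensen_dice('the cat sat'.split(), 'the cat ran'.split()))       # 0.666...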

Examples

Lesk Algorithm

from semantickit.relatedness.lesk import lesk
from semantickit.relatedness.lesk_max_overlap import lesk_max_overlap

# disambiguate 'bank' (as a noun) in a tokenized sentence
sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
m1, s1 = lesk(sent, 'bank', 'n')              # basic Lesk
m2, s2 = lesk_max_overlap(sent, 'bank', 'n')  # improved variant (maximum overlap)
print(m1, s1)
print(m2, s2)
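
NLTK ships its own reference implementation of Lesk, which can serve as a baseline for comparing the two variants above (plain NLTK, independent of semantic-kit):

from nltk.wsd import lesk as nltk_lesk

sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
# returns the WordNet synset whose gloss overlaps the context most
print(nltk_lesk(sent, 'bank', 'n'))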

WordNet-based Similarity

from semantickit.similarity.wordnet_similarity import wordnet_similarity_all

# compare two WordNet synsets with the toolkit's WordNet-based measures
print(wordnet_similarity_all("dog.n.1", "cat.n.1"))
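
For comparison, individual WordNet measures such as path similarity and Wu-Palmer similarity can be computed directly with NLTK (a plain-NLTK sketch, not semantic-kit's API):

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print(dog.path_similarity(cat))  # shortest-path measure in (0, 1]
print(dog.wup_similarity(cat))   # Wu-Palmer depth-based measure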

Corpus-based Similarity

from semantickit.similarity.word2vec_similarity import build_model, similarity_model

# build a word2vec similarity model from a text source (here, the text8 corpus)
build_model(data_path="text8", save_path="wiki_model")

# estimate similarity between two words using the built model
sim = similarity_model("wiki_model", "france", "spain")
print("word2vec similarity:", sim)

Pre-trained model-based Similarity

from semantickit.similarity.googlenews_similarity import googlenews_similarity

# path to the pre-trained GoogleNews word2vec binary (downloaded separately)
data_path = r'GoogleNews-vectors-negative300.bin'
sim = googlenews_similarity(data_path, 'human', 'people')
print(sim)
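
The same pre-trained vectors can also be queried directly through gensim's KeyedVectors, which is useful for checking semantic-kit's output (a plain-gensim sketch):

from gensim.models import KeyedVectors

# load the binary GoogleNews vectors (downloaded separately; the file is large)
vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
print(vectors.similarity('human', 'people'))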

Weighted Levenshtein

from semantickit.distance.n_gram.train_ngram import TrainNgram
from semantickit.distance.weighted_levenshtein import weighted_levenshtein, Build_TFIDF

# train an n-gram model on the domain corpus (ICD-10 disease names)
train_data_path = 'wlev/icd10_train.txt'
wordict_path = 'wlev/word_dict.model'
transdict_path = 'wlev/trans_dict.model'
words_path = 'wlev/dict_words.txt'
trainer = TrainNgram()
trainer.train(train_data_path, wordict_path, transdict_path)

# build the TF-IDF weights file for the corpus vocabulary
Build_TFIDF(train_data_path, words_path)

# estimate the weighted Levenshtein distance between two disease names:
# 'benign tumor of connective tissue of the neck' vs. 'benign tumor of ear cartilage'
s0 = '颈结缔组织良性肿瘤'
s1 = '耳软骨良性肿瘤'
result = weighted_levenshtein(s0, s1,
                              word_dict_path=wordict_path,
                              trans_dict_path=transdict_path,
                              data_path=train_data_path,
                              words_path=words_path)
print(result)
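
For reference, the plain Levenshtein distance that the weighted variant generalizes, where every insertion, deletion, and substitution costs 1, can be written as:

def levenshtein(a, b):
    # classic dynamic-programming edit distance with unit costs
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # 3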

Get Related Words

import nltk
from semantickit.lang.wordnet import *

if __name__ == "__main__":
    # use WordNet plus the Open Multilingual Wordnet to generate
    # related keywords in multiple languages
    nltk.download("wordnet")
    nltk.download("omw")
    text = "digitalization meets carbon neutrality, digital economy"
    dict_lang_all = get_all_related_word_from_text(text)
    for lang in dict_lang_all:
        print(lang, ','.join(dict_lang_all[lang]))
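
The multilingual lookup itself can be reproduced with plain NLTK by passing an ISO 639-3 language code to lemma_names (a sketch independent of semantic-kit):

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')
nltk.download('omw')

# Japanese lemmas for each WordNet sense of 'economy'
for synset in wn.synsets('economy'):
    print(synset.name(), synset.lemma_names('jpn'))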

Chinese Word2Vec Similarity

from semantickit.similarity.chinese_word2vec_similarity import ChineseWord2Vec

# segment a Chinese corpus, then train and query a word2vec model
cwv = ChineseWord2Vec(
    data_path="data/source",                 # raw corpus directory
    output_segment_path="data/segment",      # where segmented text is written
    stop_words_path="data/stop_words.txt",   # stop words to filter out
    user_dict_path="data/user_dict.txt",     # user-defined segmentation terms
    word2vec_model_path="models/word2Vec.model"
)
cwv.train()
sim = cwv.similarity('沙瑞金', '易学习')  # two names that occur in the corpus
print(sim)
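
Chinese text has no word boundaries, so the corpus must be segmented before word2vec training. semantic-kit's segmenter is not documented here, but a typical pipeline step with jieba (an assumed, not confirmed, dependency) looks like:

import jieba

jieba.load_userdict('data/user_dict.txt')  # register custom terms before cutting

with open('data/stop_words.txt', encoding='utf-8') as f:
    stop_words = set(line.strip() for line in f)

text = '数字化遇上碳中和'  # "digitalization meets carbon neutrality"
tokens = [w for w in jieba.lcut(text) if w.strip() and w not in stop_words]
print(tokens)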

License

The Semantic-Kit project is provided by Donghua Chen.
