Semantic Similarity and Relatedness Toolkit
A toolkit to estimate semantic similarity and relatedness between two words or sentences.
Installation
pip install semantic-kit
Functions
- Lesk algorithm and an improved variant
- Similarity algorithms, including WordNet-, word2vec-, LDA-, and GoogleNews-based methods
- Distance algorithms such as Jaccard, Sørensen-Dice (soren), and Levenshtein, plus improved variants (see the set-overlap sketch after this list)
- Open Multilingual Wordnet support for generating related keywords in multiple languages
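As a primer on the set-overlap distances listed above, here is a minimal, self-contained sketch of Jaccard and Sørensen-Dice similarity (illustration only, not semantic-kit's API):

# Minimal set-overlap similarity measures (illustration only; not semantic-kit's API).
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over two token sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def sorensen_dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) over two token sets."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

print(jaccard("the cat sat".split(), "the cat ran".split()))        # 0.5
print(sorensen_dice("the cat sat".split(), "the cat ran".split()))  # ~0.667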
Examples
Lesk Algorithm
from semantickit.relatedness.lesk import lesk
from semantickit.relatedness.lesk_max_overlap import lesk_max_overlap

sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
m1, s1 = lesk(sent, 'bank', 'n')                # basic Lesk sense disambiguation
m2, s2 = lesk_max_overlap(sent, 'bank', 'n')    # improved maximum-overlap variant
print(m1, s1)
print(m2, s2)
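At its core, Lesk picks the sense whose dictionary gloss shares the most words with the context. A minimal sketch using NLTK's WordNet directly, independent of semantic-kit (simple_lesk is a hypothetical name, not the library's function):

from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def simple_lesk(context, word, pos):
    # Pick the sense whose gloss shares the most words with the context.
    ctx = set(w.lower() for w in context)
    best, best_overlap = None, -1
    for sense in wn.synsets(word, pos=pos):
        overlap = len(ctx & set(sense.definition().lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best, best_overlap

sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
print(simple_lesk(sent, 'bank', 'n'))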
WordNet-based Similarity
from semantickit.similarity.wordnet_similarity import wordnet_similarity_all
print(wordnet_similarity_all("dog.n.1", "cat.n.1"))
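These scores come from WordNet's taxonomy; the individual measures (presumably what wordnet_similarity_all aggregates) are available directly from NLTK:

from nltk.corpus import wordnet as wn

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.path_similarity(cat))   # shortest-path measure, in [0, 1]
print(dog.wup_similarity(cat))    # Wu-Palmer, based on the least common subsumer
print(dog.lch_similarity(cat))    # Leacock-Chodorow, log-scaled path length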
Corpus-based Similarity
from semantickit.similarity.word2vec_similarity import build_model, similarity_model

# build a word2vec model from a text corpus (here, the text8 dump)
build_model(data_path="text8", save_path="wiki_model")
# estimate the similarity between two words using the trained model
sim = similarity_model("wiki_model", "france", "spain")
print("word2vec similarity:", sim)
Pre-trained model-based Similarity
from semantickit.similarity.googlenews_similarity import googlenews_similarity

data_path = r'GoogleNews-vectors-negative300.bin'
sim = googlenews_similarity(data_path, 'human', 'people')
print(sim)
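This presumably loads the pre-trained GoogleNews embeddings through gensim; the direct gensim calls look like this (the .bin file must be downloaded separately):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)  # large: ~3.6 GB in memory
print(kv.similarity('human', 'people'))                 # cosine similarity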
Weighted Levenshtein
from semantickit.distance.n_gram.train_ngram import TrainNgram
from semantickit.distance.weighted_levenshtein import weighted_levenshtein, Build_TFIDF

# train the n-gram model
train_data_path = 'wlev/icd10_train.txt'
wordict_path = 'wlev/word_dict.model'
transdict_path = 'wlev/trans_dict.model'
words_path = "wlev/dict_words.txt"
trainer = TrainNgram()
trainer.train(train_data_path, wordict_path, transdict_path)

# build the words TF-IDF file
Build_TFIDF(train_data_path, words_path)

# estimate the weighted Levenshtein distance between two ICD-10 disease names:
# "benign tumor of the connective tissue of the neck" vs. "benign tumor of the ear cartilage"
s0 = '颈结缔组织良性肿瘤'
s1 = '耳软骨良性肿瘤'
result = weighted_levenshtein(s0, s1,
                              word_dict_path=wordict_path,
                              trans_dict_path=transdict_path,
                              data_path=train_data_path,
                              words_path=words_path)
print(result)
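For reference, the unweighted Levenshtein distance that this generalizes is the classic dynamic program below; the weighted variant replaces the unit edit costs with weights derived from the n-gram and TF-IDF statistics trained above. A sketch, not semantic-kit code:

def levenshtein(s, t):
    # prev[j] holds the edit distance between s[:i-1] and t[:j], rolled row by row.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution (free on match)
        prev = curr
    return prev[-1]

print(levenshtein('颈结缔组织良性肿瘤', '耳软骨良性肿瘤'))  # 5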
Get related words
# use WordNet to generate multilingual keywords
import nltk
from semantickit.lang.wordnet import *

if __name__ == "__main__":
    nltk.download("wordnet")
    nltk.download("omw")
    text = "digitalization meets carbon neutrality, digital economy"
    dict_lang_all = get_all_related_word_from_text(text)
    for lang in dict_lang_all:
        print(lang, ','.join(dict_lang_all[lang]))
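get_all_related_word_from_text builds on the Open Multilingual Wordnet; with NLTK alone you can look up lemmas of a synset in other languages, which is presumably what happens per extracted keyword (available languages depend on the installed OMW data):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet")
nltk.download("omw")
economy = wn.synsets("economy")[0]   # first WordNet sense of "economy"
print(economy.lemma_names("jpn"))    # lemmas for the same concept in Japanese
print(economy.lemma_names("fra"))    # ... and in French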
Chinese WordVector Similarity
from semantickit.similarity.chinese_word2vec_similarity import ChineseWord2Vec

cwv = ChineseWord2Vec(
    data_path="data/source",
    output_segment_path="data/segment",
    stop_words_path="data/stop_words.txt",
    user_dict_path="data/user_dict.txt",
    word2vec_model_path="models/word2Vec.model"
)
cwv.train()
# similarity between two character names ("Sha Ruijin", "Yi Xuexi") from "In the Name of the People"
sim = cwv.similarity('沙瑞金', '易学习')
print(sim)
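The stop-word and user-dictionary paths suggest that train() first segments the raw Chinese text, most likely with a tokenizer such as jieba (an assumption; semantic-kit's internals are not shown here). A minimal segmentation sketch with placeholder file paths:

import jieba

jieba.load_userdict("data/user_dict.txt")   # keep domain terms as single tokens
stop_words = set(open("data/stop_words.txt", encoding="utf-8").read().split())
# segment a sentence ("Sha Ruijin praises Yi Xuexi's magnanimity"), then drop stop words
tokens = [w for w in jieba.lcut("沙瑞金赞叹易学习的胸怀") if w not in stop_words]
print(tokens)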
License
The Semantic-Kit project is provided by Donghua Chen.
Download files
Source Distribution
semantic-kit-0.0.3.tar.gz (43.8 kB)
Built Distribution
semantic_kit-0.0.3-py3-none-any.whl (50.3 kB)
File details
Details for the file semantic-kit-0.0.3.tar.gz.
File metadata
- Download URL: semantic-kit-0.0.3.tar.gz
- Size: 43.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | bec985817c89b5ded2817eefae8ee45916dfa90cea244cf79357c054933f1ebb
MD5 | 158aadefcfe6194cd6bc5ece273c66fd
BLAKE2b-256 | 895e6a81d63925f1448ff076528f03a4b69612f0eb4e66e524077ddf515203aa
File details
Details for the file semantic_kit-0.0.3-py3-none-any.whl.
File metadata
- Download URL: semantic_kit-0.0.3-py3-none-any.whl
- Size: 50.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.31.1 CPython/3.6.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | e216ad45f94c9754fd9883d5fcfedd1c0be41e882bad8afbe6098a563750b192
MD5 | 188bc5b6fae17aee7417c2ec3527b9e2
BLAKE2b-256 | db4a317c4fa130fcf0bf0b1834cb416c504b2cb8bcdbe1c444ed620c3e813dc8