Chinese text similarity calculation package for TensorFlow/PyTorch
Text-Similarity
Overview
- Dataset: Chinese/English corpora, ☞ click here
- Paper: detailed notes on the related papers, ☞ click here
- The following methods are implemented:
- TF-IDF
- BM25
- LSH
- SIF/uSIF
- FastText
- RNN Base (Siamese RNN, Stack RNN)
- CNN Base (Fast Text, Text CNN, Char CNN, VDCNN)
- Bert Base
- Albert
- NEZHA
- RoBERTa
- SimCSE
- Poly-Encoder
- ColBERT
- RE2(Simple-Effective-Text-Matching)
Usages
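1: The examples directory contains the preprocess/train/evaluate code for each model, which you can modify as needed
2: The examples below import the actuator method from examples; prepare the corresponding model configuration file and they are ready to run
3: inference.py in the examples directory is the inference code for trained models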
TF-IDF
# Example
# Sklearn version
from examples.run_tfidf_sklearn import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")
# Custom version
from examples.run_tfidf import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")
# Utility class usage
from sim.tf_idf import TFIdf
tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]
tf_idf = TFIdf(tokens_list, split=" ")
print(tf_idf.get_score(query, 0)) # score
print(tf_idf.get_score_list(query, 10)) # [(index, score), ...]
print(tf_idf.weight()) # list or numpy array
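To make the kind of score a TF-IDF ranker produces more concrete, the sketch below builds TF-IDF vectors and compares them with cosine similarity in plain Python. It is an independent illustration, not the code behind `sim.tf_idf.TFIdf`; the names `tfidf_vectors` and `cosine`, and the smoothed IDF variant, are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per space-tokenized document."""
    tokenized = [doc.split(" ") for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = ["这是 一个 什么 样 的 工具", "非常 好用 的 工具", "今天 天气 不错"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # the documents share two terms, so the score is positive
print(cosine(vecs[0], vecs[2]))  # no shared terms, so the score is 0.0
```

Sparse dictionaries keep the example dependency-free; a real corpus would normally use a vectorized implementation such as the sklearn version shown above.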
BM25
# Example
from examples.run_bm25 import actuator
actuator("./corpus/chinese/breeno/train.tsv", query1="12 23 4160 276", query2="29 23 169 1495")
# Utility class usage
from sim.bm25 import BM25
tokens_list = ["这是 一个 什么 样 的 工具", "..."]
query = ["非常 好用 的 工具"]
bm25 = BM25(tokens_list, split=" ")
print(bm25.get_score(query, 0)) # score
print(bm25.get_score_list(query, 10)) # [(index, score), ...]
print(bm25.weight()) # list or numpy array
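For readers who want to see what a BM25 score is made of, here is a minimal sketch of Okapi BM25 over space-tokenized documents. It does not reproduce `sim.bm25.BM25`; the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every space-tokenized document against a space-tokenized query."""
    tokenized = [doc.split(" ") for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Longer-than-average documents are penalized through this factor.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.split(" "):
            tf = counts[term]
            if tf == 0:
                continue
            # Assumed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["这是 一个 什么 样 的 工具", "非常 好用 的 工具", "今天 天气 不错"]
scores = bm25_scores(docs, "好用 的 工具")
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [(index, score), ...] with the best match first
```

The second document matches all three query terms and is shorter than average, so it ranks first; the third shares no terms with the query and scores 0.0.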
LSH
from sim.lsh import E2LSH
from sim.lsh import MinHash
e2lsh = E2LSH()
min_hash = MinHash()
candidates = [[3.6216, 8.6661, -2.8073, -0.44699, 0], ...]
query = [-2.7769, -5.6967, 5.9179, 0.37671, 1]
print(e2lsh.search(candidates, query)) # index in candidates
print(min_hash.search(candidates, query)) # index in candidates
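As background on the MinHash idea, the sketch below estimates the Jaccard similarity of two token sets from short signatures. It is a generic illustration and makes no claim about how `sim.lsh.MinHash` works internally; `minhash_signature`, `estimated_jaccard`, and the use of salted MD5 as the hash family are assumptions. Token lists are assumed to be non-empty.

```python
import hashlib


def minhash_signature(tokens, num_hashes=64):
    """For each salted hash function, keep the minimum hash over the token set."""
    unique = set(tokens)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            # hashlib gives a hash that is stable across processes, unlike hash().
            int.from_bytes(hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()[:8], "big")
            for token in unique
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_a = minhash_signature("这是 一个 什么 样 的 工具".split(" "))
sig_b = minhash_signature("非常 好用 的 工具".split(" "))
sig_c = minhash_signature("今天 天气 不错".split(" "))
print(estimated_jaccard(sig_a, sig_b))  # approximates the true Jaccard similarity of 2/8
print(estimated_jaccard(sig_a, sig_c))  # disjoint token sets
```

More hash functions tighten the estimate at the cost of longer signatures. An LSH index would additionally split each signature into bands so that similar items land in the same bucket.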
SIF
- A Simple But Tough-To-Beat Baseline For Sentence Embeddings
- Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline
sentences = [["token1", "token2", "..."], ...]
vector = [[[1, 1, 1], [2, 2, 2], [...]], ...]
from sim.sif_usif import SIF
from sim.sif_usif import uSIF
sif = SIF(n_components=5, component_type="svd")
sif.fit(tokens_list=sentences, vector_list=vector)
usif = uSIF(n_components=5, n=1, component_type="svd")
usif.fit(tokens_list=sentences, vector_list=vector)
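The weighting step at the heart of SIF can be sketched in a few lines of plain Python. Each token vector is scaled by a / (a + p(w)), where p(w) is the token's relative frequency, and the scaled vectors are averaged per sentence. This is only an illustration of that step and not the code in `sim.sif_usif`: the function name `sif_weighted_average` and the default `a=1e-3` are assumptions, p(w) is estimated from the input sentences rather than a large corpus, and the common-component removal that `n_components` controls is left out.

```python
from collections import Counter


def sif_weighted_average(sentences, vectors, a=1e-3):
    """Average token vectors per sentence using SIF weights a / (a + p(w))."""
    # Estimate p(w) from the input sentences themselves (an assumption of this sketch).
    counts = Counter(token for sentence in sentences for token in sentence)
    total = sum(counts.values())
    embeddings = []
    for tokens, token_vectors in zip(sentences, vectors):
        summed = [0.0] * len(token_vectors[0])
        for token, vec in zip(tokens, token_vectors):
            # Frequent tokens get a small weight, rare tokens a weight close to 1.
            weight = a / (a + counts[token] / total)
            for i, value in enumerate(vec):
                summed[i] += weight * value
        embeddings.append([s / len(tokens) for s in summed])
    return embeddings


# Same layout as `sentences` and `vector` above: one vector per token.
toy_sentences = [["the", "cat"], ["the", "dog"]]
toy_vectors = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, -1.0]]]
emb = sif_weighted_average(toy_sentences, toy_vectors)
print(emb)  # "the" is more frequent, so it contributes less than "cat" or "dog"
```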
FastText
# TensorFlow version
from examples.tensorflow.run_fast_text import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_L-12_H-768_A-12")
# Pytorch version
from examples.pytorch.run_fast_text import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
RNN Base
- Siamese Recurrent Architectures for Learning Sentence Similarity
- Learning Text Similarity with Siamese Recurrent Networks
# TensorFlow version
from examples.tensorflow.run_siamese_rnn import actuator
actuator("./data/config/siamse_rnn.json", execute_type="train")
# Pytorch version
from examples.pytorch.run_siamese_rnn import actuator
actuator("./data/config/siamse_rnn.json", execute_type="train")
CNN Base
- Convolutional Neural Networks for Sentence Classification
- Character-Aware Neural Language Models
- Highway Networks
- Very Deep Convolutional Networks for Text Classification
# TensorFlow version
from examples.tensorflow.run_cnn_base import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_L-12_H-768_A-12")
# Pytorch version
from examples.pytorch.run_cnn_base import actuator
actuator(execute_type="train", model_type="bert", model_dir="./data/chinese_wwm_pytorch")
Bert Base
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train")
# Pytorch version
from examples.pytorch.run_basic_bert import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train")
Albert
# TensorFlow version
from examples.tensorflow.run_albert import actuator
actuator(model_dir="./data/albert_small_zh_google", execute_type="train")
# Pytorch version
from examples.pytorch.run_albert import actuator
actuator(model_dir="./data/albert_chinese_small", execute_type="train")
NEZHA
# TensorFlow version
from examples.tensorflow.run_nezha import actuator
actuator(model_dir="./data/NEZHA-Base-WWM", execute_type="train")
# Pytorch version
from examples.pytorch.run_nezha import actuator
actuator(model_dir="./data/nezha-base-wwm", execute_type="train")
RoBERTa
# TensorFlow version
from examples.tensorflow.run_basic_bert import actuator
actuator(model_dir="./data/chinese_roberta_L-6_H-384_A-12", execute_type="train")
# Pytorch version
from examples.pytorch.run_basic_bert import actuator
actuator(model_dir="./data/chinese-roberta-wwm-ext", execute_type="train")
SimCSE
# TensorFlow version
from examples.tensorflow.run_simcse import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")
# Pytorch version
from examples.pytorch.run_simcse import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
Poly-Encoder
# TensorFlow version
from examples.tensorflow.run_poly_encoder import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")
# Pytorch version
from examples.pytorch.run_poly_encoder import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
ColBERT
# TensorFlow version
from examples.tensorflow.run_colbert import actuator
actuator(model_dir="./data/chinese_wwm_L-12_H-768_A-12", execute_type="train", model_type="bert")
# Pytorch version
from examples.pytorch.run_colbert import actuator
actuator(model_dir="./data/chinese_wwm_pytorch", execute_type="train", model_type="bert")
RE2
# TensorFlow version
from examples.tensorflow.run_re2 import actuator
actuator("./data/config/re2.json", execute_type="train")
# Pytorch version
from examples.pytorch.run_re2 import actuator
actuator("./data/config/re2.json", execute_type="train")
Cite
@misc{text-similarity,
title={text-similarity},
author={Bocong Deng},
year={2021},
howpublished={\url{https://github.com/DengBoCong/text-similarity}},
}