
Text-to-vector tool: encode text as vectors



Text2vec

text2vec, Text to Vector.

A text-representation tool that turns text into vector matrices, the first step in processing text computationally.

text2vec implements several text-representation and text-similarity models, including Word2Vec, RankBM25, BERT, Sentence-BERT, and CoSENT, and compares their performance on semantic text matching (similarity) tasks.


Features

Text representation models

  • Word2Vec: word-vector lookup over Tencent AI Lab's large-scale, high-quality Chinese word embeddings (light version with 8 million Chinese words; file: light_Tencent_AILab_ChineseEmbedding.bin, extraction code: tawe). This project builds sentence vectors by averaging word vectors; see the sketch after this list.
  • SBERT (Sentence-BERT): a sentence-embedding model that trades off performance and efficiency. Training supervises a classification head on top of the encoder; at matching time the sentence vectors are compared directly by cosine. This project reimplements Sentence-BERT training and inference in PyTorch.
  • CoSENT (Cosine Sentence): introduces a ranking loss that brings training closer to prediction, so it converges faster and performs better than Sentence-BERT. This project implements CoSENT training and inference in PyTorch.
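
To make the averaging concrete, here is a minimal sketch of the word2vec sentence representation, assuming the light Tencent embedding file named above has already been downloaded and sentences arrive pre-segmented into tokens. It uses gensim directly; the project's Word2Vec class wraps the same idea:

import numpy as np
from gensim.models import KeyedVectors

# Assumption: the light Tencent embedding file sits in the working directory.
wv = KeyedVectors.load_word2vec_format(
    'light_Tencent_AILab_ChineseEmbedding.bin', binary=True)

def sentence_vector(tokens):
    # Average the vectors of in-vocabulary tokens; zeros if none are found.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

print(sentence_vector(['银行', '卡']).shape)  # e.g. (200,) for the Tencent vectors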

Evaluation

Text matching

  • Results on English matching datasets:

| Arch | Backbone | Model Name | English-STS-B |
|:-----|:---------|:-----------|:-------------:|
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
  • Results on Chinese matching datasets:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|:-----|:---------|:-----------|:----:|:--:|:-----:|:-----:|:-----:|:---:|:---:|
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 80.51 | 68.77 | 3008 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | 50.52 | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 3365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 | 2948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 81.13 | 68.85 | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
  • Chinese matching results for the models released by this project:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|:-----|:---------|:-----------|:----:|:--:|:-----:|:-----:|:-----:|:---:|:---:|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 23769 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 48.25 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | - | - | - | - | - | - | - |

Notes:

  • All scores are Spearman correlation coefficients.
  • Each model was trained only on the dataset's train split and evaluated on its test split; no external data was used.
  • The shibing624/text2vec-base-chinese model was trained with the CoSENT method on top of MacBERT using the Chinese STS-B data and reaches SOTA on the Chinese STS-B test set. Running examples/training_sup_text_matching_model.py reproduces the result. The model has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese and is the recommended model for Chinese semantic matching.
  • The SBERT-macbert-base model was trained with the SBERT method; run examples/training_sup_text_matching_model.py to reproduce the result.
  • paraphrase-multilingual-MiniLM-L12-v2 is the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model, trained with SBERT; it is the multilingual version of paraphrase-MiniLM-L12-v2 and supports Chinese, English, and other languages.
  • w2v-light-tencent-chinese is a Word2Vec model over the Tencent word vectors; it loads on CPU and suits Chinese literal (surface-form) matching and cold-start scenarios with little data.
  • Each pretrained model can be loaded through transformers, e.g. the MacBERT model with --model_name hfl/chinese-macbert-base or a RoBERTa model with --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall.
  • Download links for the Chinese matching datasets are given below.
  • Experiments on the Chinese matching tasks show that the best pooling is first_last_avg, i.e. EncoderType.FIRST_LAST_AVG in SentenceModel; its predictions differ only marginally from EncoderType.MEAN. A minimal usage sketch follows this list.
  • QPS was measured on a Tesla V100 GPU with 32 GB of memory.
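
A minimal sketch of selecting the pooling strategy, assuming the SentenceModel constructor accepts an encoder_type argument as in the project's examples:

from text2vec import SentenceModel, EncoderType

# FIRST_LAST_AVG pooled best in the Chinese matching experiments above;
# EncoderType.MEAN performs almost identically.
model = SentenceModel(
    model_name_or_path='shibing624/text2vec-base-chinese',
    encoder_type=EncoderType.FIRST_LAST_AVG,
)
print(model.encode('如何更换花呗绑定银行卡').shape)  # (768,)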

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

Run the example examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py

Install

pip install torch # conda install pytorch
pip install -U text2vec

or

pip install torch # conda install pytorch
pip install -r requirements.txt

git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install --no-deps .

Usage

Text embedding

Compute text vectors with a pretrained model:

>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> m.encode("如何更换花呗绑定银行卡")
Embedding shape: (768,)

example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    # Chinese sentence-embedding model (CoSENT); recommended for Chinese semantic matching; supports fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)

    # Multilingual sentence-embedding model (Sentence-BERT); recommended for English semantic matching; supports fine-tuning
    sbert_model = SentenceModel("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    compute_emb(sbert_model)

    # Chinese word-embedding model (Word2Vec); suited to Chinese literal matching and cold-start scenarios
    w2v_model = Word2Vec("w2v-light-tencent-chinese")
    compute_emb(w2v_model)

output:

<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ... 
  • The returned embeddings are numpy.ndarray values with shape (sentences_size, model_embedding_size). Any one of the three models will do; the first is recommended.
  • The shibing624/text2vec-base-chinese model was trained with the CoSENT method on the Chinese STS-B dataset and has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese. It is the default model of text2vec.SentenceModel; call it as in the example above, or through the transformers library as shown below. It downloads automatically to ~/.cache/huggingface/transformers.
  • The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is Sentence-BERT's multilingual sentence-embedding model, suited to paraphrase identification and text matching; it can be called through either text2vec.SentenceModel or the sentence-transformers library.
  • w2v-light-tencent-chinese is a Word2Vec model loaded through gensim; it uses the Tencent word vectors (Tencent_AILab_ChineseEmbedding.tar.gz) to compute word vectors, and the sentence vector is the average of the word vectors. It downloads automatically to ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin.

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, pass your input through the transformer model; then apply the right pooling operation on top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load model and predict:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)

Word2Vec word vectors

Two kinds of Word2Vec word vectors are provided; pick either: the light Tencent word vectors (light_Tencent_AILab_ChineseEmbedding.bin, used by w2v-light-tencent-chinese) or the full Tencent word vectors (Tencent_AILab_ChineseEmbedding.tar.gz).

Downstream tasks

1. Sentence similarity computation

example: examples/semantic_text_similarity_demo.py

import sys

sys.path.append('..')
from text2vec import Similarity

# Two lists of sentences
sentences1 = ['如何更换花呗绑定银行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['花呗更改绑定银行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

sim_model = Similarity()
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        score = sim_model.get_score(sentences1[i], sentences2[j])
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], score))

output:

如何更换花呗绑定银行卡 		 花呗更改绑定银行卡 		 Score: 0.9477
如何更换花呗绑定银行卡 		 The dog plays in the garden 		 Score: -0.1748
如何更换花呗绑定银行卡 		 A woman watches TV 		 Score: -0.0839
如何更换花呗绑定银行卡 		 The new movie is so great 		 Score: -0.0044
The cat sits outside 		 花呗更改绑定银行卡 		 Score: -0.0097
The cat sits outside 		 The dog plays in the garden 		 Score: 0.1908
The cat sits outside 		 A woman watches TV 		 Score: -0.0203
The cat sits outside 		 The new movie is so great 		 Score: 0.0302
A man is playing guitar 		 花呗更改绑定银行卡 		 Score: -0.0010
A man is playing guitar 		 The dog plays in the garden 		 Score: 0.1062
A man is playing guitar 		 A woman watches TV 		 Score: 0.0055
A man is playing guitar 		 The new movie is so great 		 Score: 0.0097
The new movie is awesome 		 花呗更改绑定银行卡 		 Score: 0.0302
The new movie is awesome 		 The dog plays in the garden 		 Score: -0.0160
The new movie is awesome 		 A woman watches TV 		 Score: 0.1321
The new movie is awesome 		 The new movie is so great 		 Score: 0.9591

The sentence cosine similarity score ranges over [-1, 1]; the larger the value, the more similar the sentences. An equivalent computation with cos_sim is sketched below.
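
The score can also be computed from the raw embeddings with cos_sim (exported by text2vec, as the search example below also shows); this sketch assumes cos_sim follows the sentence-transformers convention of returning a 1x1 tensor for two single vectors:

from text2vec import SentenceModel, cos_sim

model = SentenceModel()
emb1 = model.encode('如何更换花呗绑定银行卡')
emb2 = model.encode('花呗更改绑定银行卡')
print(float(cos_sim(emb1, emb2)))  # ~0.95, matching Similarity.get_score above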

2. Text match and search

Finds the texts in a candidate corpus most similar to a query; commonly used for question matching in QA and for similar-text retrieval.

example: examples/semantic_search_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel, cos_sim, semantic_search

embedder = SentenceModel()

# Corpus with example sentences
corpus = [
    '花呗更改绑定银行卡',
    '我什么时候开通了花呗',
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = [
    '如何更换花呗绑定银行卡',
    'A man is eating pasta.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=5)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

output:

Query: 如何更换花呗绑定银行卡
Top 5 most similar sentences in corpus:
花呗更改绑定银行卡 (Score: 0.9477)
我什么时候开通了花呗 (Score: 0.3635)
A man is eating food. (Score: 0.0321)
A man is riding a horse. (Score: 0.0228)
Two men pushed carts through the woods. (Score: 0.0090)

======================
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.6734)
A man is eating a piece of bread. (Score: 0.4269)
A man is riding a horse. (Score: 0.2086)
A man is riding a white horse on an enclosed ground. (Score: 0.1020)
A cheetah is running behind its prey. (Score: 0.0566)

======================
Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.8167)
A cheetah is running behind its prey. (Score: 0.2720)
A woman is playing violin. (Score: 0.1721)
A man is riding a horse. (Score: 0.1291)
A man is riding a white horse on an enclosed ground. (Score: 0.1213)

======================
Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.9147)
A monkey is playing drums. (Score: 0.2655)
A man is riding a horse. (Score: 0.1933)
A man is riding a white horse on an enclosed ground. (Score: 0.1733)
A man is eating food. (Score: 0.0329)

Supporting libraries for downstream tasks

The similarities library [recommended]

For text-similarity computation and match-and-search tasks, the similarities library is recommended. It is compatible with the Word2Vec, SBERT, and CoSENT semantic-matching models released by this project, and additionally supports literal (surface-form) similarity computation and match-search algorithms, for both text and images.

Install: pip install -U similarities

Sentence similarity computation:

from similarities import Similarity

m = Similarity()
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

Models

CoSENT model

CoSENT (Cosine Sentence) is a text-matching model that improves on Sentence-BERT with a cosine ranking loss (CosineRankLoss) over sentence embeddings. A hedged sketch of the loss follows.

Network structure: training and inference diagrams (omitted here).
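
For intuition, here is a hedged PyTorch sketch of the CoSENT ranking loss (following Su Jianlin's published formulation; an illustrative re-implementation, not necessarily line-for-line what this project ships). It pushes the cosine of every similar pair above the cosine of every dissimilar pair:

import torch

def cosent_loss(cos_scores, labels, scale=20.0):
    # cos_scores: tensor of cosine similarities, one per sentence pair in the batch
    # labels: tensor with 1 for similar pairs, 0 for dissimilar pairs
    scores = cos_scores * scale
    # diff[i, j] = s_i - s_j; keep pairs where i is dissimilar and j is similar
    diff = scores[:, None] - scores[None, :]
    diff = diff[labels[:, None] < labels[None, :]]
    # loss = log(1 + sum(exp(diff))), via logsumexp with an appended zero term
    diff = torch.cat((torch.zeros(1, dtype=diff.dtype, device=diff.device), diff))
    return torch.logsumexp(diff, dim=0)

Because only the relative order of the cosine scores enters the loss, training optimizes exactly the quantity used at prediction time, which is the convergence advantage described above.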

CoSENT supervised model

Train and predict with the CoSENT model:

  • Train and evaluate the CoSENT model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-cosent
  • Train and evaluate the CoSENT model on the Ant Financial matching dataset ATEC

The following Chinese matching datasets are supported: 'ATEC', 'STS-B', 'BQ', 'LCQMC', 'PAWSX'; see the HuggingFace dataset https://huggingface.co/datasets/shibing624/nli_zh for details.

python training_sup_text_matching_model.py --task_name ATEC --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/ATEC-cosent
  • Train a model on your own Chinese dataset

example: examples/training_sup_text_matching_model_selfdata.py

python training_sup_text_matching_model_selfdata.py --do_train --do_predict
  • Train and evaluate the CoSENT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased  --output_dir ./outputs/STS-B-en-cosent

CoSENT unsupervised model

  • Train the CoSENT model on the English NLI dataset and evaluate it on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-cosent

Sentence-BERT model

Sentence-BERT text-matching model, a representation-based sentence-embedding approach.

Network structure: training and inference diagrams (omitted here).

SentenceBERT supervised model

  • Train and evaluate the SBERT model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-sbert
  • Train and evaluate the SBERT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-sbert

SentenceBERT unsupervised model

  • Train the SBERT model on the English NLI dataset and evaluate it on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-sbert

BERT-Match model

BERT text-matching model with the native BERT matching network structure: an interaction-based matching model.

Network structure: diagram (omitted here).

Training and inference: the training script is the same as above, examples/training_sup_text_matching_model.py.

Model Distillation

Since models trained with text2vec can be loaded with the sentence-transformers library, its model-distillation methods (distillation) are reused here:

  1. Dimensionality reduction: following dimensionality_reduction.py, use PCA to reduce the dimensionality of the output embeddings. This eases storage pressure on vector databases such as Milvus and can even slightly improve quality; a minimal sketch follows this list.
  2. Model distillation: following model_distillation.py, distill a large teacher model into a student model with fewer layers; at a modest cost in quality, this greatly speeds up inference.
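
A minimal sketch of the PCA idea, using scikit-learn rather than the sentence-transformers script itself; the two-component target is a toy value (in practice you would fit on thousands of sentences and keep e.g. 128 dimensions):

from sklearn.decomposition import PCA
from text2vec import SentenceModel

model = SentenceModel()
sentences = ['花呗更改绑定银行卡', '我什么时候开通了花呗', '如何更换花呗绑定银行卡']
embeddings = model.encode(sentences)        # shape (3, 768)

pca = PCA(n_components=2)                   # toy value, capped by the tiny sample
reduced = pca.fit_transform(embeddings)
print(embeddings.shape, '->', reduced.shape)
# Store the reduced vectors in Milvus etc.; apply pca.transform() to queries too.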

Model deployment

Two ways to deploy the model as a service are provided: 1) a gRPC service built on Jina [recommended]; 2) a plain HTTP service built on FastAPI.

Jina service

Builds a high-performance service in client/server mode; supports cloud-native Docker deployment, gRPC/HTTP/WebSocket, serving multiple models concurrently, and multi-GPU processing.

  • Install: pip install jina

  • Start the server:

example: examples/jina_server_demo.py

from jina import Flow

port = 50001
f = Flow(port=port).add(
    uses='jinahub://Text2vecEncoder',
    uses_with={'model_name': 'shibing624/text2vec-base-chinese'}
)

with f:
    # block the process and serve forever
    f.block()

The model's prediction executor has been uploaded to JinaHub, which includes Docker and Kubernetes deployment instructions.

  • Call the service:
from jina import Client
from docarray import Document, DocumentArray

port = 50001

c = Client(port=port)

data = ['如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡']
print("data:", data)
print('data embs:')
r = c.post('/', inputs=DocumentArray([Document(text=t) for t in data]))
print(r.embeddings)

For batched calls, see the example examples/jina_client_demo.py

FastAPI service

  • Install: pip install fastapi uvicorn

  • Start the server:

example: examples/fastapi_server_demo.py

cd examples
python fastapi_server_demo.py
  • Call the service:
curl -X 'GET' \
  'http://0.0.0.0:8001/emb?q=hello' \
  -H 'accept: application/json'

Datasets

The Chinese semantic-matching datasets have been uploaded to HuggingFace datasets: https://huggingface.co/datasets/shibing624/nli_zh

Example usage:

pip install datasets

from datasets import load_dataset

dataset = load_dataset("shibing624/nli_zh", "STS-B")  # ATEC, BQ, LCQMC, PAWSX, or STS-B
print(dataset)
print(dataset['test'][0])

output:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5231
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1458
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1361
    })
})
{'sentence1': '一个女孩在给她的头发做发型。', 'sentence2': '一个女孩在梳头。', 'label': 2}

The common Chinese semantic-matching datasets cover five tasks: ATEC, BQ, LCQMC, PAWSX, and STS-B. You can download each from its own link, or from Baidu Netdisk (extraction code: qkt6). The senteval_cn directory aggregates the evaluation datasets, and senteval_cn.zip is an archive of that directory; either one is sufficient.

Overview of text embedding methods

Question

How should text be represented as vectors, and which model works best for text matching?

Many NLP tasks depend on training high-quality, effective text-representation vectors, in particular semantic textual similarity (e.g. paraphrase detection and QA question-pair matching) and dense text retrieval.

Solution

Traditional approach: feature-based matching

  • Extract lexical- and topic-level features of the two texts with algorithms such as TF-IDF, BM25, Jaccard, SimHash, and LDA, then train a classifier with machine-learning models (LR, XGBoost)
  • Pros: good interpretability
  • Cons: relies on hand-crafted features and generalizes poorly; the limited number of features also caps model quality

Representative model:

  • BM25

The BM25 algorithm scores a candidate sentence by how well its terms cover the query's terms; higher-scoring candidates match the query better. It mainly addresses lexical-level similarity, as the sketch below illustrates.
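
A minimal sketch with the rank_bm25 package (matching the RankBM25 model listed at the top); the character-level tokenization is a naive stand-in for a real Chinese tokenizer:

from rank_bm25 import BM25Okapi

corpus = ['花呗更改绑定银行卡', '我什么时候开通了花呗', '如何更换花呗绑定银行卡']
tokenized_corpus = [list(doc) for doc in corpus]   # naive character tokens

bm25 = BM25Okapi(tokenized_corpus)
query = list('如何更换花呗绑定银行卡')
print(bm25.get_scores(query))  # higher score = better lexical coverage of the query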

Deep approach: representation-based matching

  • In representation-based matching, the two texts are first processed independently: each is encoded by a deep neural network into a representation (embedding), and a similarity function over the two embeddings yields the texts' similarity
  • Pros: BERT-based models fine-tuned with supervision achieve strong performance on text representation and matching
  • Cons: sentence vectors taken directly from BERT (without fine-tuning, by averaging all token vectors) are of low quality, even below GloVe, and so fail to reflect the semantic similarity of two sentences

The main reasons are:

1. BERT tends to encode all sentences into a small region of the embedding space, so most sentence pairs receive high similarity scores, even semantically unrelated ones.

2. The clustering of BERT sentence embeddings is tied to the sentence's high-frequency words. When the sentence vector is the average of token vectors, high-frequency tokens dominate it and obscure the original meaning. Removing some high-frequency words before averaging eases the clustering somewhat, but weakens the representation.
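
The whitening trick (the "-whiten" row in the English results above) counteracts this anisotropy by mapping embeddings to zero mean and identity covariance. A hedged numpy sketch of the standard BERT-whitening recipe, with random data standing in for real BERT embeddings:

import numpy as np

def whitening(embeddings):
    # Shift to zero mean, then rotate/scale so the covariance becomes identity.
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    return (embeddings - mu) @ w

embs = np.random.randn(100, 16)             # toy stand-in; real ones are (n, 768)
white = whitening(embs)
print(np.round(np.cov(white.T)[:2, :2], 2)) # ~identity in the top-left corner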

Representative models:

Since BERT upended NLP in 2018, models that predate it are not discussed or compared here (if you are interested, see MatchZoo and MatchZoo-py, open-sourced by the Chinese Academy of Sciences).

This project therefore surveys embedding models that outperform vanilla BERT and suit text matching: Sentence-BERT (2019), BERT-flow (2020), SimCSE (2021), and CoSENT (2022).

Deep approach: interaction-based matching

  • Interaction-based matching holds that computing similarity only at the final stage over-relies on representation quality and discards basic text features (lexical, syntactic, and so on); it therefore lets the text features interact as early as possible to capture those basic signals, and computes the matching score from them in the upper layers
  • Pros: end-to-end processing with strong results
  • Cons: such models (cross-encoders) take two sentences as input and output the pair's similarity; they produce no sentence embedding and cannot take a single sentence as input, so they are impractical for tasks that need text-vector representations

Representative model:

Cross-encoders are well suited to re-ranking in vector retrieval, as in the sketch below.
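
A hedged sketch of cross-encoder re-ranking, using the sentence-transformers CrossEncoder class with a public English re-ranking checkpoint (not a model released by this project):

from sentence_transformers import CrossEncoder

# The model scores each (query, candidate) pair jointly; no sentence vectors exist.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = 'A man is eating pasta.'
candidates = ['A man is eating food.', 'A man is riding a horse.']
scores = model.predict([(query, c) for c in candidates])
print(sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True))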

Contact

  • Issues (suggestions): GitHub issues
  • Email: xuming624@qq.com
  • WeChat: add xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.

Citation

If you use text2vec in your research, please cite it as follows:

APA:

Xu, M. Text2vec: Text to vector toolkit (Version 1.1.2) [Computer software]. https://github.com/shibing624/text2vec

BibTeX:

@misc{Text2vec,
  author = {Xu, Ming},
  title = {Text2vec: Text to vector toolkit},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/text2vec}},
}

License

Licensed under The Apache License 2.0; free for commercial use. Please include a link to text2vec and the license in your product documentation.

Contribute

The code is still rough; if you improve it, you are welcome to contribute back. Before submitting, note two points:

  • Add corresponding unit tests under tests
  • Run all unit tests with python -m pytest -v and make sure they pass

Then you can submit a PR.

