Text-to-vector tool: encode text
text2vec-onnx
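This project is the onnxruntime inference version of the text2vec project. It implements embedding extraction and text-matching search. To keep the project lightweight, it depends on only three libraries: onnxruntime, tokenizers, and numpy.
It has been tested mainly on the GanymedeNil/text2vec-base-chinese-onnx model; in theory it supports BERT-family models.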
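BERT-style sentence encoders commonly turn per-token outputs into a single sentence vector by averaging over the non-padding tokens. The following is a minimal numpy sketch of that pooling step, assuming mean pooling; the text above does not say which pooling this package uses, and `mean_pool` is a hypothetical helper name, not part of the package.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden) array of token vectors.
    attention_mask:   (batch, seq_len) array, 1 for real tokens, 0 for padding.
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden axis.
    mask = attention_mask[..., np.newaxis].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for fully padded rows.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy data standing in for model output: 2 sentences, 3 tokens, 4 dimensions.
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)
```

The mask matters because padded positions would otherwise pull the average toward whatever the model emits for padding tokens.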
Installation
CPU version
pip install text2vec2onnx[cpu]
GPU version
pip install text2vec2onnx[gpu]
Usage
Model download
Using GanymedeNil/text2vec-base-chinese-onnx as an example, download the model to a local directory.
- Download from Hugging Face
huggingface-cli download --resume-download GanymedeNil/text2vec-base-chinese-onnx --local-dir text2vec-base-chinese-onnx
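The same download can also be scripted from Python. This is a hedged sketch that assumes the `huggingface_hub` package is installed; `download_model` is a hypothetical wrapper name, while `snapshot_download` is the function that library provides for fetching a whole repository.

```python
def download_model(repo_id: str, local_dir: str) -> str:
    """Download every file of a Hugging Face repo into local_dir and return the path."""
    # Imported lazily so this file can be loaded even when huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_model("GanymedeNil/text2vec-base-chinese-onnx", "text2vec-base-chinese-onnx")` would mirror the CLI command above and requires network access.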
Getting embeddings
from text2vec2onnx import SentenceModel
# 'local-dir' is a placeholder: pass the directory the model was downloaded to
embedder = SentenceModel(model_dir_path='local-dir')
emb = embedder.encode("你好")
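Once two texts have been encoded, a common way to compare them is cosine similarity. The sketch below uses plain numpy with stand-in vectors instead of real model output; `cosine_similarity` is a hypothetical helper, and it flattens its inputs on the assumption that each argument is a single embedding.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Stand-in vectors; in practice these would come from embedder.encode(...).
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 0.0])
sim = cosine_similarity(v1, v2)
print(round(sim, 4))
```

Scores close to 1 indicate vectors pointing in nearly the same direction; scores near 0 indicate unrelated directions.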
Text-matching search
from text2vec2onnx import SentenceModel, semantic_search

embedder = SentenceModel(model_dir_path='local-dir')
corpus = [
    "谢谢观看 下集再见",
    "感谢您的观看",
    "请勿模仿",
    "记得订阅我们的频道哦",
    "The following are sentences in English.",
    "Thank you. Bye-bye.",
    "It's true",
    "I don't know.",
    "Thank you for watching!",
]
# Encode the corpus once and reuse the embeddings for every query
corpus_embeddings = embedder.encode(corpus)
queries = [
    'Thank you. Bye.',
    '你干啥呢',
    '感谢您的收听',
]
for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=1)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 1 most similar sentence in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
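To illustrate what a search of this kind computes, here is a small numpy sketch that ranks corpus vectors by cosine similarity and returns results in the same list-of-dicts shape used above (`corpus_id`, `score`). It is an assumption-based illustration, not the package's actual `semantic_search` implementation; `top_k_cosine` is a hypothetical name.

```python
import numpy as np

def top_k_cosine(query_embeddings, corpus_embeddings, top_k=1):
    """Return, for each query, the top_k corpus entries ranked by cosine similarity."""
    q = np.atleast_2d(np.asarray(query_embeddings, dtype=np.float64))
    c = np.atleast_2d(np.asarray(corpus_embeddings, dtype=np.float64))
    # Normalise rows so that a dot product equals cosine similarity.
    q = q / np.clip(np.linalg.norm(q, axis=1, keepdims=True), 1e-12, None)
    c = c / np.clip(np.linalg.norm(c, axis=1, keepdims=True), 1e-12, None)
    scores = q @ c.T  # shape: (num_queries, num_corpus)
    k = min(top_k, c.shape[0])
    results = []
    for row in scores:
        order = np.argsort(-row)[:k]  # indices of the k highest scores
        results.append([{"corpus_id": int(i), "score": float(row[i])} for i in order])
    return results

# Toy vectors standing in for real sentence embeddings.
corpus_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([2.0, 0.1])
toy_hits = top_k_cosine(query_vec, corpus_vecs, top_k=2)
for hit in toy_hits[0]:
    print(hit["corpus_id"], "(Score: {:.4f})".format(hit["score"]))
```

Returning one list per query is why the example above indexes `hits[0]` when it passes a single query.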