
Text-to-vector tool for encoding text


text2vec-onnx

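This project is the onnxruntime inference version of the text2vec project. It implements embedding retrieval and text matching search. To keep the project lightweight, it depends on only three libraries: onnxruntime, tokenizers, and numpy.

It has mainly been tested on the GanymedeNil/text2vec-base-chinese-onnx model; in theory it supports BERT-family models.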

Installation

CPU version

pip install text2vec2onnx[cpu]

GPU version

pip install text2vec2onnx[gpu]

Usage

Model download

Taking GanymedeNil/text2vec-base-chinese-onnx as an example, download the model to a local directory.

  • Download from Hugging Face
huggingface-cli download --resume-download GanymedeNil/text2vec-base-chinese-onnx --local-dir text2vec-base-chinese-onnx

Getting embeddings

from text2vec2onnx import SentenceModel
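# Replace 'local-dir' with the directory the model was downloaded to,
# e.g. text2vec-base-chinese-onnx from the download command above.
embedder = SentenceModel(model_dir_path='local-dir')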
emb = embedder.encode("你好")
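
Embeddings like the one returned above are typically compared with cosine similarity. The following is a minimal sketch, not part of text2vec2onnx: the helper name cosine_similarity and the toy vectors are assumptions made for illustration. It uses only the standard library and accepts any sequence of floats, which should include a 1-D numpy array.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # close to 0: orthogonal
```

With real embeddings, a score near 1 suggests the two sentences are semantically close, and a score near 0 suggests they are unrelated.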

Text matching search

from text2vec2onnx import SentenceModel, semantic_search

embedder = SentenceModel(model_dir_path='local-dir')

corpus = [
    "谢谢观看 下集再见",
    "感谢您的观看",
    "请勿模仿",
    "记得订阅我们的频道哦",
    "The following are sentences in English.",
    "Thank you. Bye-bye.",
    "It's true",
    "I don't know.",
    "Thank you for watching!",
]
corpus_embeddings = embedder.encode(corpus)

queries = [
    'Thank you. Bye.',
    '你干啥呢',
    '感谢您的收听']

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=1)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 1 most similar sentence in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))
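
To illustrate the kind of result the loop above consumes, here is a rough sketch of a top-k search over embeddings. It is not the library's semantic_search: the function toy_semantic_search, its cosine scoring, and the 2-dimensional toy vectors are assumptions. Only the return shape (one list of hits per query, each hit a dict with 'corpus_id' and 'score') is taken from the example above.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_semantic_search(query_embedding, corpus_embeddings, top_k=5):
    """Hypothetical stand-in for a semantic search over one query embedding."""
    scored = [
        {"corpus_id": i, "score": _cosine(query_embedding, emb)}
        for i, emb in enumerate(corpus_embeddings)
    ]
    # Keep the top_k highest-scoring entries, best first.
    hits = heapq.nlargest(top_k, scored, key=lambda h: h["score"])
    return [hits]  # one hit list per query; this sketch handles a single query


# Toy 2-dimensional vectors standing in for real embeddings.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]

for hit in toy_semantic_search(toy_query, toy_corpus, top_k=2)[0]:
    print(hit["corpus_id"], "(Score: {:.4f})".format(hit["score"]))
```

Raising top_k returns more candidates per query, at the cost of including weaker matches.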

License

Apache License 2.0



