
an elegant bert4vector


bert4vector

Vector computation, storage, retrieval, and similarity calculation (compatible with sentence_transformers)


Documentation | Bert4torch | Examples | Source code

1. Installation

  • Install the stable release
pip install bert4vector
  • Install the latest version from source
pip install git+https://github.com/Tongjilibo/bert4vector

2. Quick start

  • Vector computation
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/Tongjilibo/simbert-chinese-tiny')
sentences = ['喜欢打篮球的男生喜欢什么样的女生', '西安下雪了?是不是很冷啊?', '第一次去见女朋友父母该如何表现?', '小蝌蚪找妈妈怎么样', '给我推荐一款红色的车', '我喜欢北京']
vecs = model.encode(sentences, convert_to_numpy=True, normalize_embeddings=False)
print(vecs.shape)
# (6, 312)
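The call above leaves normalize_embeddings set to False. Assuming the flag follows the sentence_transformers convention of scaling each vector to unit L2 norm (the source does not spell this out), its effect can be sketched in plain Python on stand-in vectors. `l2_normalize` is a hypothetical helper, not part of bert4vector.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

# Stand-in embeddings; real ones would come from model.encode(...).
raw = [[3.0, 4.0], [1.0, 0.0]]
unit = [l2_normalize(v) for v in raw]

# Once vectors have unit length, a plain dot product equals cosine similarity.
dot = sum(a * b for a, b in zip(unit[0], unit[1]))
print(unit[0], round(dot, 4))
```

Normalizing once at encoding time means later comparisons need only a dot product.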
  • Similarity calculation
from bert4vector.core import BertSimilarity
text2vec = BertSimilarity('/data/pretrain_ckpt/Tongjilibo/simbert-chinese-tiny')
sent1 = ['你好', '天气不错']
sent2 = ['你好啊', '天气很好']
similarity = text2vec.similarity(sent1, sent2)
print(similarity)
# [[0.9075422  0.42991278]
#  [0.19584633 0.72635853]]
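The printed result has one row per entry of sent1 and one column per entry of sent2. Below is a minimal sketch of such a pairwise matrix, assuming the scores are cosine similarities (the source does not name the metric) and using made-up 2-d vectors in place of real embeddings. `cosine` and `pairwise_similarity` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_similarity(left, right):
    """Return a len(left) x len(right) matrix of cosine similarities."""
    return [[cosine(u, v) for v in right] for u in left]

# Stand-in vectors; real ones would be sentence embeddings.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[1.0, 0.0], [1.0, 1.0]]
matrix = pairwise_similarity(left, right)

# matrix[i][j] compares left[i] with right[j].
for row in matrix:
    print([round(x, 4) for x in row])
```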
  • Vector storage and retrieval
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/Tongjilibo/simbert-chinese-tiny')
model.add_corpus(['你好', '我选你', '天气不错', '人很好看'])
print(model.search('你好'))
# {'你好': [{'corpus_id': 0, 'score': 0.9999, 'text': '你好'},
#           {'corpus_id': 3, 'score': 0.5694, 'text': '人很好看'}]} 
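The output above maps each query to a ranked list of corpus hits. The idea can be sketched as a toy in-memory index: store one vector per text, score the query against all of them, and keep the best k. `ToyIndex`, the character-count `embed` function, and the `topk` parameter are stand-ins for illustration, not bert4vector's implementation or signature.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: character counts (a real model returns dense vectors)."""
    return Counter(text)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

class ToyIndex:
    def __init__(self):
        self.texts = []
        self.vecs = []

    def add_corpus(self, texts):
        # Store each text together with its vector; the position is the corpus_id.
        for t in texts:
            self.texts.append(t)
            self.vecs.append(embed(t))

    def search(self, query, topk=2):
        # Score the query against every stored vector and keep the top-k hits.
        q = embed(query)
        scored = [
            {"corpus_id": i, "score": cosine(q, v), "text": t}
            for i, (t, v) in enumerate(zip(self.texts, self.vecs))
        ]
        scored.sort(key=lambda hit: hit["score"], reverse=True)
        return {query: scored[:topk]}

index = ToyIndex()
index.add_corpus(['你好', '我选你', '天气不错', '人很好看'])
hits = index.search('你好')
print(hits)
```

Because the stand-in embedding only counts characters, its ranking and scores differ from the model output shown above; only the shape of the result is meant to match.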
  • API deployment
from bert4vector.pipelines import SimilaritySever
server = SimilaritySever('/data/pretrain_ckpt/embedding/BAAI--bge-base-zh-v1.5')
port = 8000  # assumed example value; use any free port
server.run(port=port)
# For how to call the API, see './examples/api.py'

3. Supported sentence-embedding weights (in addition to the weights below, any weights supported by sentence_transformers can be used)

| Model category | Model name | Weight source | Weight link | Notes (if any) |
| --- | --- | --- | --- | --- |
| simbert | simbert | Zhuiyi Technology (追一科技) | Tongjilibo/simbert-chinese-base, Tongjilibo/simbert-chinese-small, Tongjilibo/simbert-chinese-tiny | |
| | simbert_v2/roformer-sim | Zhuiyi Technology (追一科技) | junnyu/roformer_chinese_sim_char_base, junnyu/roformer_chinese_sim_char_ft_base, junnyu/roformer_chinese_sim_char_small, junnyu/roformer_chinese_sim_char_ft_small | |
| embedding | text2vec-base-chinese | shibing624 | shibing624/text2vec-base-chinese | |
| | m3e | moka-ai | moka-ai/m3e-base | |
| | bge | BAAI | BAAI/bge-large-en-v1.5, BAAI/bge-large-zh-v1.5, BAAI/bge-base-en-v1.5, BAAI/bge-base-zh-v1.5, BAAI/bge-small-en-v1.5, BAAI/bge-small-zh-v1.5 | |
| | gte | thenlper | thenlper/gte-large-zh, thenlper/gte-base-zh | |

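*Notes:

  1. In addition to the models above, any model supported by sentence_transformers is also supported.

  2. Names shown in highlighted format (e.g. Tongjilibo/simbert-chinese-small) can be downloaded directly over the network.

  3. To speed up downloads through a mirror site in mainland China, use any of the following:

    • HF_ENDPOINT=https://hf-mirror.com python your_script.py
    • Run export HF_ENDPOINT=https://hf-mirror.com, then run your Python code
    • Set the variable at the top of your Python code, as follows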
    import os
    os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
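
A slightly more defensive variant of the in-code option, offered as a sketch: `os.environ.setdefault` sets the mirror only when HF_ENDPOINT is not already defined, so a value exported in the shell still takes precedence. As with the snippet above, it should run before importing any library that reads the variable.

```python
import os

# Set the mirror only if the caller has not already chosen an endpoint.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

endpoint = os.environ["HF_ENDPOINT"]
print(endpoint)
```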
    

4. Version history

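| Release date | bert4vector version | Release notes |
| --- | --- | --- |
| 20251013 | 0.0.7.post2 | Removed the hard dependency on torch |
| 20251009 | 0.0.7 | Added OpenaiSimilarityRequest and OpenaiSimilarityAiohttp for accessing remote models that expose an OpenAI-format API |
| 20250601 | 0.0.6 | add_corpus gains a corpus_property parameter; added the delete_corpus method; any sentence_transformers model is supported |
| 20240928 | 0.0.5 | Minor changes; reset is available in the API |
| 20240710 | 0.0.4 | Added longest-common-subsequence literal recall; some features work without installing torch |
| 20240628 | 0.0.3 | Added several literal-recall methods; added API deployment |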

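5. Update history:

  • 20240928: Minor changes; reset is available in the API
  • 20240710: Added longest-common-subsequence literal recall; some features work without installing torch
  • 20240628: Added several literal-recall methods; added API deployment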

6. Reference

