Skip to main content

an elegant bert4vector

Project description

bert4vector

向量计算、存储、检索、相似度计算

licence GitHub release PyPI PyPI - Downloads GitHub stars GitHub Issues contributions welcome

Documentation | Bert4torch | Examples | Source code

1. 下载安装

  • 安装稳定版
pip install bert4vector
  • 安装最新版
pip install git+https://github.com/Tongjilibo/bert4vector

2. 快速使用

  • 向量计算
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
sentences = ['喜欢打篮球的男生喜欢什么样的女生', '西安下雪了?是不是很冷啊?', '第一次去见女朋友父母该如何表现?', '小蝌蚪找妈妈怎么样', '给我推荐一款红色的车', '我喜欢北京']
vecs = model.encode(sentences, convert_to_numpy=True, normalize_embeddings=False)
print(vecs.shape)
# (6, 312)
  • 相似度计算
from bert4vector.core import BertSimilarity
text2vec = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
sent1 = ['你好', '天气不错']
sent2 = ['你好啊', '天气很好']
similarity = text2vec.similarity(sent1, sent2)
print(similarity)
# [[0.9075422  0.42991278]
#  [0.19584633 0.72635853]]
  • 向量存储和检索
from bert4vector.core import BertSimilarity
model = BertSimilarity('/data/pretrain_ckpt/simbert/sushen@simbert_chinese_tiny')
model.add_corpus(['你好', '我选你', '天气不错', '人很好看'])
print(model.search('你好'))
# {'你好': [{'corpus_id': 0, 'score': 0.9999, 'text': '你好'},
#           {'corpus_id': 3, 'score': 0.5694, 'text': '人很好看'}]} 
  • api部署
from bert4vector.pipelines import SimilaritySever
server = SimilaritySever('/data/pretrain_ckpt/embedding/BAAI--bge-base-zh-v1.5')
server.run(port=port)
# 接口调用可以参考'./examples/api.py'

3. 支持的句向量权重

模型分类 模型名称 权重来源 权重链接 备注(若有)
simbert simbert 追一科技 Tongjilibo/simbert-chinese-base, Tongjilibo/simbert-chinese-small, Tongjilibo/simbert-chinese-tiny
simbert_v2/roformer-sim 追一科技 junnyu/roformer_chinese_sim_char_basejunnyu/roformer_chinese_sim_char_ft_basejunnyu/roformer_chinese_sim_char_smalljunnyu/roformer_chinese_sim_char_ft_small roformer_chinese_sim_char_base, roformer_chinese_sim_char_ft_base, roformer_chinese_sim_char_small, roformer_chinese_sim_char_ft_small
embedding text2vec-base-chinese shibing624 shibing624/text2vec-base-chinese text2vec-base-chinese
m3e moka-ai moka-ai/m3e-base m3e-base
bge BAAI BAAI/bge-large-en-v1.5, BAAI/bge-large-zh-v1.5, BAAI/bge-base-en-v1.5, BAAI/bge-base-zh-v1.5, BAAI/bge-small-en-v1.5, BAAI/bge-small-zh-v1.5 bge-large-en-v1.5, bge-large-zh-v1.5, bge-base-en-v1.5, bge-base-zh-v1.5, bge-small-en-v1.5, bge-small-zh-v1.5
gte thenlper thenlper/gte-large-zh, thenlper/gte-base-zh gte-base-zh, gte-large-zh

*注:

  1. 高亮格式(如Tongjilibo/simbert-chinese-small)的表示可直接联网下载
  2. 国内镜像网站加速下载
    • HF_ENDPOINT=https://hf-mirror.com python your_script.py
    • export HF_ENDPOINT=https://hf-mirror.com后再执行python代码
    • 在python代码开头如下设置
    import os
    os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
    

4. 版本历史

更新日期 bert4vector 版本说明
20240928 0.0.5 小修改,api中可以reset
20240710 0.0.4 增加最长公共子序列字面召回,不安装torch也可以使用部分功能
20240628 0.0.3 增加多种字面召回,增加api接口部署

5. 更新历史:

  • 20240928:小修改,api中可以reset
  • 20240710:增加最长公共子序列字面召回,不安装torch也可以使用部分功能
  • 20240628:增加多种字面召回,增加api接口部署

6. Reference

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bert4vector-0.0.5.tar.gz (30.9 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page