# Chinese Massive Text Embedding Benchmark
Installation | Evaluation | Leaderboard | Tasks | Acknowledgement
## Installation
C-MTEB is developed based on MTEB.
```bash
pip install -U C_MTEB
```
Or clone this repo and install it in editable mode:
```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/C_MTEB
pip install -e .
```
## Evaluation
### Evaluate reranker
```bash
python eval_cross_encoder.py --model_name_or_path BAAI/bge-reranker-base
```
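If you just want to sanity-check the reranker outside the evaluation script, it can also be loaded as a cross-encoder through sentence-transformers. The snippet below is a minimal sketch; the query-passage pairs are made up for illustration:

```python
# Minimal sketch: score query-passage pairs with the reranker via
# sentence-transformers' CrossEncoder (the pairs are illustrative only).
from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-base", max_length=512)
scores = model.predict([
    ["什么是大熊猫?", "大熊猫是一种生活在中国的哺乳动物。"],
    ["什么是大熊猫?", "今天股市大幅上涨。"],
])
print(scores)  # a higher score means the passage is more relevant to the query
```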
### Evaluate embedding model
- With our scripts

You can reproduce the results of baai-general-embedding (bge) using the provided Python script (see `eval_C-MTEB.py`):
```bash
python eval_C-MTEB.py --model_name_or_path BAAI/bge-large-zh

# for MTEB leaderboard
python eval_MTEB.py --model_name_or_path BAAI/bge-large-en
```
- With sentence-transformers

You can use C-MTEB in the same way as MTEB. Note that the original sentence-transformers models do not support instructions, so this method cannot test the performance of the bge-* models.
```python
from mteb import MTEB
from C_MTEB import *
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "bert-base-uncased"

model = SentenceTransformer(model_name)
evaluation = MTEB(task_langs=['zh'])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")
```
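You can also restrict the run to a subset of tasks, e.g. for a quick sanity check. The selection below is just an example; task names follow the Tasks table at the end of this page:

```python
# Run only a couple of C-MTEB tasks as a quick check (example selection)
evaluation = MTEB(tasks=["T2Retrieval", "TNews"])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")
```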
- Using a custom model

To evaluate a new model, you can load it via sentence_transformers if it is supported by sentence_transformers. Otherwise, implement your model as below, with an `encode` function that takes a list of sentences as input and returns a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.):
```python
class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model)
```
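As a concrete illustration, a custom model could wrap a Hugging Face encoder with mean pooling. This is only a sketch: the class name `HFMeanPoolModel` and the `bert-base-chinese` checkpoint are example choices, not part of C-MTEB:

```python
# Example custom model (illustrative sketch): mean-pooled embeddings
# from a Hugging Face encoder.
import torch
from transformers import AutoModel, AutoTokenizer

class HFMeanPoolModel:
    def __init__(self, model_name="bert-base-chinese"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).eval()

    @torch.no_grad()
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            inputs = self.tokenizer(batch, padding=True, truncation=True,
                                    max_length=512, return_tensors="pt")
            hidden = self.model(**inputs).last_hidden_state  # (B, L, H)
            mask = inputs["attention_mask"].unsqueeze(-1).float()
            # Mean pooling over non-padding tokens
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            embeddings.extend(pooled.cpu().numpy())
        return embeddings
```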
## Leaderboard
### 1. Reranker
Model | T2Reranking | T2RerankingZh2En* | T2RerankingEn2Zh* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
---|---|---|---|---|---|---|---|
text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 |
multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 |
m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 |
m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 |
bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 |
bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 |
BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
\* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks.
### 2. Embedding
Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
---|---|---|---|---|---|---|---|---|
BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 |
BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 |
BAAI/bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 |
BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 |
multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 |
BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 |
m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 |
m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 |
multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 |
multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 |
text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 |
text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |
#### 2.1. Retrieval
Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
---|---|---|---|---|---|---|---|---|---|
luotuo-bert-medium | 58.67 | 55.31 | 59.36 | 55.48 | 18.04 | 40.48 | 29.8 | 38.04 | 44.4 |
text2vec-large-chinese | 50.52 | 45.96 | 51.87 | 60.48 | 15.53 | 37.58 | 30.93 | 42.65 | 41.94 |
text2vec-base-chinese | 51.67 | 44.06 | 52.23 | 44.81 | 15.91 | 34.59 | 27.56 | 39.52 | 38.79 |
m3e-base | 73.14 | 65.45 | 75.76 | 66.42 | 30.33 | 50.27 | 42.8 | 51.11 | 56.91 |
m3e-large | 72.36 | 61.06 | 74.69 | 61.33 | 30.73 | 45.18 | 48.66 | 44.02 | 54.75 |
OpenAI(text-embedding-ada-002) | 69.14 | 69.86 | 71.17 | 57.21 | 22.36 | 44.49 | 37.92 | 43.85 | 52.0 |
multilingual-e5-small | 71.39 | 73.17 | 81.35 | 72.82 | 24.38 | 53.56 | 44.84 | 58.09 | 59.95 |
multilingual-e5-base | 70.86 | 76.04 | 81.64 | 73.45 | 27.2 | 54.17 | 48.35 | 61.3 | 61.63 |
multilingual-e5-large | 76.11 | 79.2 | 85.32 | 75.51 | 28.67 | 54.75 | 51.44 | 58.25 | 63.66 |
BAAI/bge-small-zh | 77.59 | 67.56 | 77.89 | 68.95 | 35.18 | 58.17 | 49.9 | 69.33 | 63.07 |
BAAI/bge-base-zh | 83.35 | 79.11 | 86.02 | 72.07 | 41.77 | 63.53 | 56.64 | 73.76 | 69.53 |
bge-large-zh-noinstruct | 84.39 | 81.38 | 84.68 | 75.07 | 41.03 | 65.6 | 58.28 | 73.94 | 70.55 |
bge-large-zh | 84.82 | 81.28 | 86.94 | 74.06 | 42.4 | 66.12 | 59.39 | 77.19 | 71.53 |
#### 2.2. STS
Model | ATEC | BQ | LCQMC | PAWSX | STSB | AFQMC | QBQTC | STS22 (zh) | Avg |
---|---|---|---|---|---|---|---|---|---|
luotuo-bert-medium | 30.84 | 43.33 | 66.74 | 12.31 | 73.22 | 22.24 | 27.2 | 66.4 | 42.78 |
text2vec-large-chinese | 32.45 | 44.22 | 69.16 | 14.55 | 79.45 | 24.51 | 29.51 | 65.94 | 44.97 |
text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.3 | 26.06 | 24.62 | 55.35 | 43.41 |
m3e-base | 41.27 | 63.81 | 74.88 | 12.19 | 76.97 | 35.87 | 32.07 | 66.73 | 50.47 |
m3e-large | 41.8 | 65.2 | 74.2 | 15.95 | 74.16 | 36.53 | 32.65 | 62.91 | 50.42 |
OpenAI(text-embedding-ada-002) | 29.25 | 45.33 | 68.41 | 16.55 | 70.61 | 23.88 | 30.27 | 62.53 | 43.35 |
multilingual-e5-small | 35.14 | 43.27 | 72.7 | 11.01 | 77.73 | 25.21 | 30.25 | 66.84 | 45.27 |
multilingual-e5-base | 37.01 | 45.45 | 74.15 | 12.14 | 79.05 | 29.67 | 28.81 | 65.64 | 46.49 |
multilingual-e5-large | 39.81 | 46.44 | 75.95 | 14.63 | 81.08 | 33.02 | 29.77 | 66.82 | 48.44 |
BAAI/bge-small-zh | 43.17 | 55.47 | 72.61 | 9.97 | 76.48 | 33.93 | 36.45 | 67.54 | 49.45 |
BAAI/bge-base-zh | 48.28 | 61.21 | 74.98 | 20.65 | 78.66 | 42.53 | 38.01 | 68.64 | 54.12 |
bge-large-zh-noinstruct | 48.29 | 60.53 | 74.71 | 16.64 | 78.41 | 43.06 | 35.2 | 67.19 | 53 |
bge-large-zh | 49.75 | 62.93 | 75.45 | 22.45 | 78.51 | 44.57 | 38.92 | 67.24 | 54.98 |
#### 2.3. PairClassification
Model | Ocnli | Cmnli | Avg |
---|---|---|---|
luotuo-bert-medium | 60.7 | 72.55 | 66.62 |
text2vec-large-chinese | 64.04 | 77.67 | 70.86 |
text2vec-base-chinese | 60.95 | 73.87 | 67.41 |
m3e-base | 58.0 | 69.98 | 63.99 |
m3e-large | 59.33 | 69.27 | 64.3 |
OpenAI(text-embedding-ada-002) | 63.08 | 76.03 | 69.56 |
multilingual-e5-small | 60.77 | 72.12 | 66.45 |
multilingual-e5-base | 59.63 | 74.51 | 67.07 |
multilingual-e5-large | 61.6 | 78.18 | 69.89 |
BAAI/bge-small-zh | 65.25 | 75.46 | 70.35 |
BAAI/bge-base-zh | 73.32 | 81.69 | 77.5 |
bge-large-zh-noinstruct | 71.37 | 82.17 | 76.77 |
bge-large-zh | 75.75 | 82.12 | 78.94 |
#### 2.4. Classification
Model | TNews | IFlyTek | MultilingualSentiment | JDReview | OnlineShopping | Waimai | AmazonReviewsClassification (zh) | MassiveIntentClassification (zh-CN) | MassiveScenarioClassification (zh-CN) | Avg |
---|---|---|---|---|---|---|---|---|---|---|
luotuo-bert-medium | 45.22 | 41.75 | 61.21 | 79.68 | 84.3 | 79.57 | 34.46 | 57.47 | 65.32 | 61 |
text2vec-large-chinese | 38.92 | 41.54 | 58.97 | 81.56 | 83.51 | 76.01 | 33.77 | 63.23 | 68.45 | 60.66 |
text2vec-base-chinese | 43.02 | 42.05 | 60.98 | 82.14 | 85.69 | 77.22 | 34.12 | 63.98 | 70.52 | 62.19 |
m3e-base | 48.28 | 44.42 | 71.9 | 85.33 | 87.77 | 83.99 | 43.02 | 68.4 | 74.6 | 67.52 |
m3e-large | 48.26 | 43.96 | 72.47 | 86.92 | 89.59 | 86.1 | 44.44 | 67.23 | 74.88 | 68.2 |
OpenAI(text-embedding-ada-002) | 45.77 | 44.62 | 67.99 | 74.6 | 88.94 | 82.37 | 38.3 | 64.81 | 71.4 | 64.31 |
multilingual-e5-small | 48.38 | 47.35 | 64.74 | 79.34 | 88.73 | 83.9 | 37.5 | 68.24 | 74.47 | 65.85 |
multilingual-e5-base | 47.06 | 44.93 | 65.28 | 76.21 | 88.4 | 84.42 | 37.23 | 69.16 | 75.42 | 65.35 |
multilingual-e5-large | 48.38 | 45.47 | 68.58 | 80.99 | 90.81 | 85.02 | 38.83 | 71.12 | 76.83 | 67.34 |
BAAI/bge-small-zh | 47.67 | 42.07 | 65.07 | 80.64 | 87.4 | 83.8 | 37.31 | 61.44 | 67.39 | 63.64 |
BAAI/bge-base-zh | 49.97 | 44.54 | 70.63 | 83.92 | 91.38 | 85.46 | 40.68 | 65.72 | 71.3 | 67.07 |
bge-large-zh-noinstruct | 52.05 | 45.32 | 73.7 | 85.38 | 91.66 | 86.83 | 41.94 | 66.96 | 73.39 | 68.58 |
bge-large-zh | 50.84 | 45.09 | 74.41 | 85.08 | 91.6 | 86.54 | 42.39 | 67.18 | 71.76 | 68.32 |
#### 2.5. Reranking
Model | T2Reranking | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
---|---|---|---|---|---|
luotuo-bert-medium | 65.76 | 14.55 | 57.82 | 58.88 | 49.25 |
text2vec-large-chinese | 64.82 | 12.48 | 58.92 | 60.41 | 49.16 |
text2vec-base-chinese | 65.95 | 12.76 | 59.26 | 59.82 | 49.45 |
m3e-base | 66.03 | 17.51 | 77.05 | 76.76 | 59.34 |
m3e-large | 66.13 | 16.46 | 77.76 | 78.27 | 59.66 |
OpenAI(text-embedding-ada-002) | 66.65 | 23.39 | 63.08 | 64.02 | 54.28 |
multilingual-e5-small | 65.24 | 24.33 | 63.44 | 62.41 | 53.86 |
multilingual-e5-base | 64.39 | 21.76 | 65.21 | 66.06 | 54.35 |
multilingual-e5-large | 65.83 | 21.34 | 68.25 | 68.56 | 56.00 |
BAAI/bge-small-zh | 66.2 | 22.82 | 77.08 | 79.82 | 61.48 |
BAAI/bge-base-zh | 66.49 | 28.24 | 80.12 | 84.78 | 64.91 |
bge-large-zh-noinstruct | 66.16 | 27.1 | 81.72 | 84.64 | 64.91 |
bge-large-zh | 66.19 | 26.23 | 83.01 | 85.01 | 65.11 |
#### 2.6. Clustering
Model | CLSClusteringS2S | CLSClusteringP2P | ThuNewsClusteringS2S | ThuNewsClusteringP2P | Avg |
---|---|---|---|---|---|
luotuo-bert-medium | 33.46 | 37.01 | 48.26 | 58.83 | 44.39 |
text2vec-large-chinese | 28.77 | 30.13 | 26.14 | 35.05 | 30.02 |
text2vec-base-chinese | 32.42 | 35.27 | 40.01 | 42.92 | 37.66 |
m3e-base | 37.34 | 39.81 | 53.78 | 59.77 | 47.68 |
m3e-large | 38.02 | 38.6 | 58.51 | 60.39 | 48.88 |
OpenAI(text-embedding-ada-002) | 35.91 | 38.26 | 49.86 | 58.71 | 45.68 |
multilingual-e5-small | 37.79 | 39.14 | 48.93 | 55.18 | 45.26 |
multilingual-e5-base | 36.99 | 32.41 | 52.36 | 40.98 | 40.68 |
multilingual-e5-large | 38.59 | 40.68 | 55.59 | 58.05 | 48.23 |
BAAI/bge-small-zh | 34.34 | 38.23 | 51.84 | 55.95 | 45.09 |
BAAI/bge-base-zh | 36.59 | 38.79 | 56.16 | 59.0 | 47.63 |
bge-large-zh-noinstruct | 40.04 | 41.23 | 56.75 | 62.03 | 50.01 |
bge-large-zh | 38.05 | 40.92 | 58.79 | 55.79 | 48.39 |
## Tasks
An overview of the tasks and datasets available in C-MTEB is provided in the following table:
Name | Hub URL | Description | Type | Category | Test #Samples |
---|---|---|---|---|---|
T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: A large-scale Chinese Benchmark for Passage Ranking | Retrieval | s2p | 24,832 |
MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Retrieval | s2p | 7,437 |
DuRetrieval | C-MTEB/DuRetrieval | A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine | Retrieval | s2p | 4,000 |
CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation text | Retrieval | s2p | 3,999 |
EcomRetrieval | C-MTEB/EcomRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in e-commerce domain | Retrieval | s2p | 1,000 |
MedicalRetrieval | C-MTEB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in medical domain | Retrieval | s2p | 1,000 |
VideoRetrieval | C-MTEB/VideoRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in video domain | Retrieval | s2p | 1,000 |
T2Reranking | C-MTEB/T2Reranking | T2Ranking: A large-scale Chinese Benchmark for Passage Ranking | Reranking | s2p | 24,382 |
MMarcoReranking | C-MTEB/MMarco-reranking | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Reranking | s2p | 7,437 |
CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical question answering | Reranking | s2p | 2,000 |
CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical question answering | Reranking | s2p | 4,000 |
Ocnli | C-MTEB/OCNLI | Original Chinese Natural Language Inference dataset | PairClassification | s2s | 3,000 |
Cmnli | C-MTEB/CMNLI | Chinese Multi-Genre NLI | PairClassification | s2s | 139,000 |
CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering of titles from CLS dataset. Clustering of 13 sets, based on the main category. | Clustering | s2s | 10,000 |
CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering of titles + abstract from CLS dataset. Clustering of 13 sets, based on the main category. | Clustering | p2p | 10,000 |
ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering of titles + abstract from the THUCNews dataset | Clustering | p2p | 10,000 |
ATEC | C-MTEB/ATEC | ATEC NLP sentence pair similarity competition | STS | s2s | 20,000 |
BQ | C-MTEB/BQ | Bank Question Semantic Similarity | STS | s2s | 10,000 |
LCQMC | C-MTEB/LCQMC | A large-scale Chinese question matching corpus. | STS | s2s | 12,500 |
PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
STSB | C-MTEB/STSB | STS-B translated into Chinese | STS | s2s | 1,360 |
AFQMC | C-MTEB/AFQMC | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
QBQTC | C-MTEB/QBQTC | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
TNews | C-MTEB/TNews-classification | Short text classification for news | Classification | s2s | 10,000 |
IFlyTek | C-MTEB/IFlyTek-classification | Long text classification for the descriptions of apps | Classification | s2s | 2,600 |
Waimai | C-MTEB/waimai-classification | Sentiment Analysis of user reviews on takeaway platforms | Classification | s2s | 1,000 |
OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment Analysis of User Reviews on Online Shopping Websites | Classification | s2s | 1,000 |
MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A collection of multilingual sentiments datasets grouped into 3 classes -- positive, neutral, negative | Classification | s2s | 3,000 |
JDReview | C-MTEB/JDReview-classification | iPhone reviews from JD.com | Classification | s2s | 533 |
For retrieval tasks, we sample 100,000 candidates (including the ground truths) from the entire corpus to reduce the inference cost.
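For illustration only, the candidate-sampling idea could look like the following sketch (the function name, seed, and interface are hypothetical, not C-MTEB's actual preprocessing code):

```python
# Illustrative sketch: keep all ground-truth passages and fill the
# candidate pool with random negatives until it reaches 100,000 documents.
import random

def sample_candidates(corpus_ids, ground_truth_ids, k=100_000, seed=42):
    gt = set(ground_truth_ids)
    negatives = [cid for cid in corpus_ids if cid not in gt]
    rng = random.Random(seed)
    sampled = rng.sample(negatives, k - len(gt))
    return gt | set(sampled)
```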
## Acknowledgement
We thank the Massive Text Embedding Benchmark for the great tool and the Chinese NLP community for the open-source datasets.
## Citation

If you find this repository useful, please consider citing it:
```bibtex
@misc{c-pack,
  title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
  author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
  year={2023},
  eprint={2309.07597},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```