Chinese Massive Text Embedding Benchmark
Installation | Usage | Leaderboard | Tasks | Acknowledgement
Installation
C-MTEB is developed based on MTEB.
pip install C_MTEB
Or clone this repo and install it in editable mode:
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/benchmark
pip install -e .
Usage
- With FlagDRESModel
You can reproduce the results of baai-general-embedding (bge) using the provided Python script (see eval_C-MTEB.py):
python eval_C-MTEB.py --model_name_or_path BAAI/bge-large-zh
We wrap MTEB's DRESModel into FlagDRESModel, which supports instructions and inference with multiple GPUs.
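For intuition, the instruction support amounts to prepending a query-side instruction before encoding. The following is a minimal sketch of that idea, not the actual FlagDRESModel code; the class name and the structure are illustrative assumptions:

from sentence_transformers import SentenceTransformer

class InstructedModel:
    """Illustrative wrapper (hypothetical): prepends an instruction to
    queries only, while corpus passages are encoded as-is."""

    def __init__(self, model_name, query_instruction):
        self.model = SentenceTransformer(model_name)
        self.query_instruction = query_instruction

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # Retrieval queries get the instruction prefix.
        queries = [self.query_instruction + q for q in queries]
        return self.model.encode(queries, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # mteb retrieval corpora are dicts with "title" and "text" fields.
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self.model.encode(texts, batch_size=batch_size, **kwargs)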
- With sentence-transformers
You can use C-MTEB easily in the same way as MTEB.
Note that the original sentence-transformers models don't support instructions, so this method cannot measure the performance of the bge-* models.
from mteb import MTEB
from C_MTEB import *
from sentence_transformers import SentenceTransformer
# Define the sentence-transformers model name
model_name = "bert-base-uncased"
model = SentenceTransformer(model_name)
evaluation = MTEB(task_langs=['zh'])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")
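If you only want to run part of the benchmark, the usual MTEB selectors apply (assuming the MTEB version in use exposes the same arguments as upstream MTEB):

# Run only the Chinese retrieval tasks ...
evaluation = MTEB(task_types=['Retrieval'], task_langs=['zh'])
# ... or an explicit list of named tasks.
evaluation = MTEB(tasks=["T2Retrieval", "DuRetrieval"])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")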
- Using a custom model
To evaluate a new model, you can load it via sentence_transformers if it is supported there. Otherwise, implement a model class like the one below, providing an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.):
class MyModel:
    def encode(self, sentences, batch_size=32, **kwargs):
        """Returns a list of embeddings for the given sentences.

        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model)
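For reference, here is a minimal sketch of a concrete encode implementation using mean pooling over a Hugging Face transformer; the model name and the pooling choice are illustrative assumptions, not part of C-MTEB:

import torch
from transformers import AutoTokenizer, AutoModel

class MyHFModel:
    """Illustrative encoder: mean-pools the last hidden states of a HF model."""

    def __init__(self, model_name="bert-base-chinese"):  # assumed model name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            inputs = self.tokenizer(batch, padding=True, truncation=True,
                                    max_length=512, return_tensors="pt")
            hidden = self.model(**inputs).last_hidden_state        # (B, T, H)
            mask = inputs["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1)          # mean pooling
            embeddings.extend(pooled.cpu().numpy())
        return embeddings  # List[np.ndarray]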
Leaderboard
Overall
Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
---|---|---|---|---|---|---|---|---|
bge-large-zh | 1024 | 64.20 | 71.53 | 53.23 | 78.94 | 72.26 | 65.11 | 48.39 |
bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 50.98 | 76.77 | 72.49 | 64.91 | 50.01 |
BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 52.05 | 77.5 | 70.98 | 64.91 | 47.63 |
BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 46.87 | 70.35 | 67.78 | 61.48 | 45.09 |
m3e-base | 768 | 57.10 | 56.91 | 48.15 | 63.99 | 70.28 | 59.34 | 47.68 |
m3e-large | 1024 | 57.05 | 54.75 | 48.64 | 64.3 | 71.22 | 59.66 | 48.88 |
text-embedding-ada-002 (OpenAI) | 1536 | 53.02 | 52.0 | 40.61 | 69.56 | 67.38 | 54.28 | 45.68 |
luotuo | 1024 | 49.37 | 44.4 | 39.41 | 66.62 | 65.29 | 49.25 | 44.39 |
text2vec | 768 | 47.63 | 38.79 | 41.71 | 67.41 | 65.18 | 49.45 | 37.66 |
text2vec-large | 1024 | 47.36 | 41.94 | 41.98 | 70.86 | 63.42 | 49.16 | 30.02 |
1. Retrieval
Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
---|---|---|---|---|---|---|---|---|---|
luotuo-bert-medium | 58.67 | 55.31 | 59.36 | 55.48 | 18.04 | 40.48 | 29.8 | 38.04 | 44.4 |
text2vec-large-chinese | 50.52 | 45.96 | 51.87 | 60.48 | 15.53 | 37.58 | 30.93 | 42.65 | 41.94 |
text2vec-base-chinese | 51.67 | 44.06 | 52.23 | 44.81 | 15.91 | 34.59 | 27.56 | 39.52 | 38.79 |
m3e-base | 73.14 | 65.45 | 75.76 | 66.42 | 30.33 | 50.27 | 42.8 | 51.11 | 56.91 |
m3e-large | 72.36 | 61.06 | 74.69 | 61.33 | 30.73 | 45.18 | 48.66 | 44.02 | 54.75 |
OpenAI(text-embedding-ada-002) | 69.14 | 69.86 | 71.17 | 57.21 | 22.36 | 44.49 | 37.92 | 43.85 | 52.0 |
BAAI/bge-small-zh | 77.59 | 67.56 | 77.89 | 68.95 | 35.18 | 58.17 | 49.9 | 69.33 | 63.07 |
BAAI/bge-base-zh | 83.35 | 79.11 | 86.02 | 72.07 | 41.77 | 63.53 | 56.64 | 73.76 | 69.53 |
bge-large-zh-noinstruct | 84.39 | 81.38 | 84.68 | 75.07 | 41.03 | 65.6 | 58.28 | 73.94 | 70.55 |
bge-large-zh | 84.82 | 81.28 | 86.94 | 74.06 | 42.4 | 66.12 | 59.39 | 77.19 | 71.53 |
2. STS
Model | ATEC | BQ | LCQMC | PAWSX | STSB | AFQMC | QBQTC | Avg |
---|---|---|---|---|---|---|---|---|
luotuo-bert-medium | 30.84 | 43.33 | 66.74 | 12.31 | 73.22 | 22.24 | 27.2 | 39.41 |
text2vec-large-chinese | 32.45 | 44.22 | 69.16 | 14.55 | 79.45 | 24.51 | 29.51 | 41.98 |
text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.3 | 26.06 | 24.62 | 41.71 |
m3e-base | 41.27 | 63.81 | 74.88 | 12.19 | 76.97 | 35.87 | 32.07 | 48.15 |
m3e-large | 41.8 | 65.2 | 74.2 | 15.95 | 74.16 | 36.53 | 32.65 | 48.64 |
OpenAI(text-embedding-ada-002) | 29.25 | 45.33 | 68.41 | 16.55 | 70.61 | 23.88 | 30.27 | 40.61 |
BAAI/bge-small-zh | 43.17 | 55.47 | 72.61 | 9.97 | 76.48 | 33.93 | 36.45 | 46.87 |
BAAI/bge-base-zh | 48.28 | 61.21 | 74.98 | 20.65 | 78.66 | 42.53 | 38.01 | 52.05 |
bge-large-zh-noinstruct | 48.29 | 60.53 | 74.71 | 16.64 | 78.41 | 43.06 | 35.2 | 50.98 |
bge-large-zh | 49.75 | 62.93 | 75.45 | 22.45 | 78.51 | 44.57 | 38.92 | 53.23 |
3. PairClassification
Model | Ocnli | Cmnli | Avg |
---|---|---|---|
luotuo-bert-medium | 60.7 | 72.55 | 66.62 |
text2vec-large-chinese | 64.04 | 77.67 | 70.86 |
text2vec-base-chinese | 60.95 | 73.87 | 67.41 |
m3e-base | 58.0 | 69.98 | 63.99 |
m3e-large | 59.33 | 69.27 | 64.3 |
OpenAI(text-embedding-ada-002) | 63.08 | 76.03 | 69.56 |
BAAI/bge-small-zh | 65.25 | 75.46 | 70.35 |
BAAI/bge-base-zh | 73.32 | 81.69 | 77.5 |
bge-large-zh-noinstruct | 71.37 | 82.17 | 76.77 |
bge-large-zh | 75.75 | 82.12 | 78.94 |
4. Classification
Model | TNews | IFlyTek | MultilingualSentiment | JDReview | OnlineShopping | Waimai | Avg |
---|---|---|---|---|---|---|---|
luotuo-bert-medium | 45.22 | 41.75 | 61.21 | 79.68 | 84.3 | 79.57 | 65.29 |
text2vec-large-chinese | 38.92 | 41.54 | 58.97 | 81.56 | 83.51 | 76.01 | 63.42 |
text2vec-base-chinese | 43.02 | 42.05 | 60.98 | 82.14 | 85.69 | 77.22 | 65.18 |
m3e-base | 48.28 | 44.42 | 71.9 | 85.33 | 87.77 | 83.99 | 70.28 |
m3e-large | 48.26 | 43.96 | 72.47 | 86.92 | 89.59 | 86.1 | 71.22 |
OpenAI(text-embedding-ada-002) | 45.77 | 44.62 | 67.99 | 74.6 | 88.94 | 82.37 | 67.38 |
BAAI/bge-small-zh | 47.67 | 42.07 | 65.07 | 80.64 | 87.4 | 83.8 | 67.78 |
BAAI/bge-base-zh | 49.97 | 44.54 | 70.63 | 83.92 | 91.38 | 85.46 | 70.98 |
bge-large-zh-noinstruct | 52.05 | 45.32 | 73.7 | 85.38 | 91.66 | 86.83 | 72.49 |
bge-large-zh | 50.84 | 45.09 | 74.41 | 85.08 | 91.6 | 86.54 | 72.26 |
5. Reranking
Model | T2Reranking | MmarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
---|---|---|---|---|---|
luotuo-bert-medium | 65.76 | 14.55 | 57.82 | 58.88 | 49.25 |
text2vec-large-chinese | 64.82 | 12.48 | 58.92 | 60.41 | 49.16 |
text2vec-base-chinese | 65.95 | 12.76 | 59.26 | 59.82 | 49.45 |
m3e-base | 66.03 | 17.51 | 77.05 | 76.76 | 59.34 |
m3e-large | 66.13 | 16.46 | 77.76 | 78.27 | 59.66 |
OpenAI(text-embedding-ada-002) | 66.65 | 23.39 | 63.08 | 64.02 | 54.28 |
BAAI/bge-small-zh | 66.2 | 22.82 | 77.08 | 79.82 | 61.48 |
BAAI/bge-base-zh | 66.49 | 28.24 | 80.12 | 84.78 | 64.91 |
bge-large-zh-noinstruct | 66.16 | 27.1 | 81.72 | 84.64 | 64.91 |
bge-large-zh | 66.19 | 26.23 | 83.01 | 85.01 | 65.11 |
6. Clustering
Model | CLSClusteringS2S | CLSClusteringP2P | ThuNewsClusteringS2S | ThuNewsClusteringP2P | Avg |
---|---|---|---|---|---|
luotuo-bert-medium | 33.46 | 37.01 | 48.26 | 58.83 | 44.39 |
text2vec-large-chinese | 28.77 | 30.13 | 26.14 | 35.05 | 30.02 |
text2vec-base-chinese | 32.42 | 35.27 | 40.01 | 42.92 | 37.66 |
m3e-base | 37.34 | 39.81 | 53.78 | 59.77 | 47.68 |
m3e-large | 38.02 | 38.6 | 58.51 | 60.39 | 48.88 |
OpenAI(text-embedding-ada-002) | 35.91 | 38.26 | 49.86 | 58.71 | 45.68 |
BAAI/bge-small-zh | 34.34 | 38.23 | 51.84 | 55.95 | 45.09 |
BAAI/bge-base-zh | 36.59 | 38.79 | 56.16 | 59.0 | 47.63 |
bge-large-zh-noinstruct | 40.04 | 41.23 | 56.75 | 62.03 | 50.01 |
bge-large-zh | 38.05 | 40.92 | 58.79 | 55.79 | 48.39 |
Tasks
An overview of tasks and datasets available in MTEB-chinese is provided in the following table:
Name | Hub URL | Description | Type | Category | Test #Samples |
---|---|---|---|---|---|
T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: A large-scale Chinese Benchmark for Passage Ranking | Retrieval | s2p | 24,832 |
MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Retrieval | s2p | 7,437 |
DuRetrieval | C-MTEB/DuRetrieval | A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine | Retrieval | s2p | 4,000 |
CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation text | Retrieval | s2p | 3,999 |
EcomRetrieval | C-MTEB/EcomRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in e-commerce domain | Retrieval | s2p | 1,000 |
MedicalRetrieval | C-MTEB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in medical domain | Retrieval | s2p | 1,000 |
VideoRetrieval | C-MTEB/VideoRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in video domain | Retrieval | s2p | 1,000 |
T2Reranking | C-MTEB/T2Reranking | T2Ranking: A large-scale Chinese Benchmark for Passage Ranking | Reranking | s2p | 24,382 |
MMarcoReranking | C-MTEB/Mmarco-reranking | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Reranking | s2p | 7,437 |
CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical question answering | Reranking | s2p | 2,000 |
CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical question answering | Reranking | s2p | 4,000 |
Ocnli | C-MTEB/OCNLI | Original Chinese Natural Language Inference dataset | PairClassification | s2s | 3,000 |
Cmnli | C-MTEB/CMNLI | Chinese Multi-Genre NLI | PairClassification | s2s | 139,000 |
CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering of titles from CLS dataset. Clustering of 13 sets, based on the main category. | Clustering | s2s | 10,000 |
CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering of titles + abstract from CLS dataset. Clustering of 13 sets, based on the main category. | Clustering | p2p | 10,000 |
ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering of titles + abstract from the THUCNews dataset | Clustering | p2p | 10,000 |
ATEC | C-MTEB/ATEC | ATEC NLP sentence pair similarity competition | STS | s2s | 20,000 |
BQ | C-MTEB/BQ | Bank Question Semantic Similarity | STS | s2s | 10,000 |
LCQMC | C-MTEB/LCQMC | A large-scale Chinese question matching corpus. | STS | s2s | 12,500 |
PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
STSB | C-MTEB/STSB | Translate STS-B into Chinese | STS | s2s | 1,360 |
AFQMC | C-MTEB/AFQMC | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
QBQTC | C-MTEB/QBQTC | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
TNews | C-MTEB/TNews-classification | Short Text Classification for News | Classification | s2s | 10,000 |
IFlyTek | C-MTEB/IFlyTek-classification | Long Text classification for the description of Apps | Classification | s2s | 2,600 |
Waimai | C-MTEB/waimai-classification | Sentiment Analysis of user reviews on takeaway platforms | Classification | s2s | 1,000 |
OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment Analysis of User Reviews on Online Shopping Websites | Classification | s2s | 1,000 |
MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A collection of multilingual sentiments datasets grouped into 3 classes -- positive, neutral, negative | Classification | s2s | 3,000 |
JDReview | C-MTEB/JDReview-classification | Reviews for iPhone | Classification | s2s | 533 |
For the retrieval tasks, we sample 100,000 candidates (including the ground truths) from the entire corpus to reduce the inference cost.
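As an illustration of that sampling step (a sketch under assumed data structures, not the benchmark's actual code): given qrels mapping each query to its relevant document ids, the candidate pool keeps all ground truths and fills the rest with random corpus documents.

import random

def sample_candidates(corpus_ids, qrels, pool_size=100_000, seed=42):
    """corpus_ids: all doc ids; qrels: {query_id: set of relevant doc ids}."""
    ground_truths = set().union(*qrels.values())
    rest = [d for d in corpus_ids if d not in ground_truths]
    random.seed(seed)
    n_fill = max(0, pool_size - len(ground_truths))
    return ground_truths | set(random.sample(rest, min(n_fill, len(rest))))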
Acknowledgement
We thank the Massive Text Embedding Benchmark (MTEB) for the great tool and the Chinese NLP community for the open-source datasets.