
Chinese Massive Text Embedding Benchmark


Installation | Usage | Leaderboard | Tasks | Acknowledgement

Installation

C-MTEB is developed based on MTEB.

pip install C_MTEB

Or clone this repo and install it as editable:

git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/benchmark
pip install -e .

Usage

You can reproduce the results of baai-general-embedding (bge) using the provided Python script (see eval_C-MTEB.py):

python eval_C-MTEB.py --model_name_or_path BAAI/bge-large-zh

We wrap the DRESModel in mteb into FlagDRESModel, which supports instructions and inference with multiple GPUs.
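Roughly speaking, instruction support means prepending a retrieval instruction to each query (but not to the passages) before encoding. Below is a minimal, hypothetical sketch of that idea built on sentence-transformers; the class name InstructedRetriever and its methods are illustrative assumptions, not the actual FlagDRESModel API:

from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer

class InstructedRetriever:
    """Illustrative sketch only: prepend an instruction to queries,
    leave passages untouched, as bge-* retrieval models expect."""

    def __init__(self, model_name: str, query_instruction: str):
        self.model = SentenceTransformer(model_name)
        self.query_instruction = query_instruction

    def encode_queries(self, queries: List[str], batch_size: int = 32, **kwargs) -> np.ndarray:
        # Only queries receive the instruction prefix.
        prefixed = [self.query_instruction + q for q in queries]
        return self.model.encode(prefixed, batch_size=batch_size,
                                 normalize_embeddings=True)

    def encode_corpus(self, passages: List[str], batch_size: int = 32, **kwargs) -> np.ndarray:
        # Passages are encoded as-is, with no instruction.
        return self.model.encode(passages, batch_size=batch_size,
                                 normalize_embeddings=True)

Multi-GPU inference, the other feature of FlagDRESModel, is omitted from this sketch.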

  • With sentence-transformers

You can use C-MTEB easily in the same way as MTEB.

Note that the original sentence-transformers models do not support instructions, so this method cannot test the performance of bge-* models.

from mteb import MTEB
from C_MTEB import *  # importing C_MTEB registers the Chinese tasks with MTEB
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "bert-base-uncased"

model = SentenceTransformer(model_name)
evaluation = MTEB(task_langs=['zh'])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")

  • Using a custom model

To evaluate a new model, you can load it with sentence_transformers if it is supported by sentence_transformers. Otherwise, implement a class with an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.):
class MyModel:
    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model)
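
As a concrete (hypothetical) starting point, the sketch below fills in encode with a Hugging Face encoder and masked mean pooling; the model name bert-base-chinese and the pooling strategy are illustrative assumptions, not part of C-MTEB:

import torch
from transformers import AutoModel, AutoTokenizer

class MeanPoolingModel:
    """Illustrative example: mean-pools the last hidden states of a
    Hugging Face encoder to produce one embedding per sentence."""

    def __init__(self, model_name="bert-base-chinese"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch = self.tokenizer(sentences[i:i + batch_size], padding=True,
                                   truncation=True, max_length=512,
                                   return_tensors="pt")
            hidden = self.model(**batch).last_hidden_state
            # Zero out padding tokens, then average over the sequence length.
            mask = batch["attention_mask"].unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            embeddings.extend(pooled.cpu().numpy())
        return embeddings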

Leaderboard

Overall

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|---|---|
| bge-large-zh | 1024 | 64.20 | 71.53 | 53.23 | 78.94 | 72.26 | 65.11 | 48.39 |
| bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 50.98 | 76.77 | 72.49 | 64.91 | 50.01 |
| BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 52.05 | 77.5 | 70.98 | 64.91 | 47.63 |
| BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 46.87 | 70.35 | 67.78 | 61.48 | 45.09 |
| m3e-base | 768 | 57.10 | 56.91 | 48.15 | 63.99 | 70.28 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 48.64 | 64.3 | 71.22 | 59.66 | 48.88 |
| text-embedding-ada-002 (OpenAI) | 1536 | 53.02 | 52.0 | 40.61 | 69.56 | 67.38 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 39.41 | 66.62 | 65.29 | 49.25 | 44.39 |
| text2vec | 768 | 47.63 | 38.79 | 41.71 | 67.41 | 65.18 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 41.98 | 70.86 | 63.42 | 49.16 | 30.02 |

1. Retrieval

| Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
|---|---|---|---|---|---|---|---|---|---|
| luotuo-bert-medium | 58.67 | 55.31 | 59.36 | 55.48 | 18.04 | 40.48 | 29.8 | 38.04 | 44.4 |
| text2vec-large-chinese | 50.52 | 45.96 | 51.87 | 60.48 | 15.53 | 37.58 | 30.93 | 42.65 | 41.94 |
| text2vec-base-chinese | 51.67 | 44.06 | 52.23 | 44.81 | 15.91 | 34.59 | 27.56 | 39.52 | 38.79 |
| m3e-base | 73.14 | 65.45 | 75.76 | 66.42 | 30.33 | 50.27 | 42.8 | 51.11 | 56.91 |
| m3e-large | 72.36 | 61.06 | 74.69 | 61.33 | 30.73 | 45.18 | 48.66 | 44.02 | 54.75 |
| OpenAI (text-embedding-ada-002) | 69.14 | 69.86 | 71.17 | 57.21 | 22.36 | 44.49 | 37.92 | 43.85 | 52.0 |
| BAAI/bge-small-zh | 77.59 | 67.56 | 77.89 | 68.95 | 35.18 | 58.17 | 49.9 | 69.33 | 63.07 |
| BAAI/bge-base-zh | 83.35 | 79.11 | 86.02 | 72.07 | 41.77 | 63.53 | 56.64 | 73.76 | 69.53 |
| bge-large-zh-noinstruct | 84.39 | 81.38 | 84.68 | 75.07 | 41.03 | 65.6 | 58.28 | 73.94 | 70.55 |
| bge-large-zh | 84.82 | 81.28 | 86.94 | 74.06 | 42.4 | 66.12 | 59.39 | 77.19 | 71.53 |

2. STS

| Model | ATEC | BQ | LCQMC | PAWSX | STSB | AFQMC | QBQTC | Avg |
|---|---|---|---|---|---|---|---|---|
| luotuo-bert-medium | 30.84 | 43.33 | 66.74 | 12.31 | 73.22 | 22.24 | 27.2 | 39.41 |
| text2vec-large-chinese | 32.45 | 44.22 | 69.16 | 14.55 | 79.45 | 24.51 | 29.51 | 41.98 |
| text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.3 | 26.06 | 24.62 | 41.71 |
| m3e-base | 41.27 | 63.81 | 74.88 | 12.19 | 76.97 | 35.87 | 32.07 | 48.15 |
| m3e-large | 41.8 | 65.2 | 74.2 | 15.95 | 74.16 | 36.53 | 32.65 | 48.64 |
| OpenAI (text-embedding-ada-002) | 29.25 | 45.33 | 68.41 | 16.55 | 70.61 | 23.88 | 30.27 | 40.61 |
| BAAI/bge-small-zh | 43.17 | 55.47 | 72.61 | 9.97 | 76.48 | 33.93 | 36.45 | 46.87 |
| BAAI/bge-base-zh | 48.28 | 61.21 | 74.98 | 20.65 | 78.66 | 42.53 | 38.01 | 52.05 |
| bge-large-zh-noinstruct | 48.29 | 60.53 | 74.71 | 16.64 | 78.41 | 43.06 | 35.2 | 50.98 |
| bge-large-zh | 49.75 | 62.93 | 75.45 | 22.45 | 78.51 | 44.57 | 38.92 | 53.23 |

3. PairClassification

| Model | Ocnli | Cmnli | Avg |
|---|---|---|---|
| luotuo-bert-medium | 60.7 | 72.55 | 66.62 |
| text2vec-large-chinese | 64.04 | 77.67 | 70.86 |
| text2vec-base-chinese | 60.95 | 73.87 | 67.41 |
| m3e-base | 58.0 | 69.98 | 63.99 |
| m3e-large | 59.33 | 69.27 | 64.3 |
| OpenAI (text-embedding-ada-002) | 63.08 | 76.03 | 69.56 |
| BAAI/bge-small-zh | 65.25 | 75.46 | 70.35 |
| BAAI/bge-base-zh | 73.32 | 81.69 | 77.5 |
| bge-large-zh-noinstruct | 71.37 | 82.17 | 76.77 |
| bge-large-zh | 75.75 | 82.12 | 78.94 |

4. Classification

| Model | TNews | IFlyTek | MultilingualSentiment | JDReview | OnlineShopping | Waimai | Avg |
|---|---|---|---|---|---|---|---|
| luotuo-bert-medium | 45.22 | 41.75 | 61.21 | 79.68 | 84.3 | 79.57 | 65.29 |
| text2vec-large-chinese | 38.92 | 41.54 | 58.97 | 81.56 | 83.51 | 76.01 | 63.42 |
| text2vec-base-chinese | 43.02 | 42.05 | 60.98 | 82.14 | 85.69 | 77.22 | 65.18 |
| m3e-base | 48.28 | 44.42 | 71.9 | 85.33 | 87.77 | 83.99 | 70.28 |
| m3e-large | 48.26 | 43.96 | 72.47 | 86.92 | 89.59 | 86.1 | 71.22 |
| OpenAI (text-embedding-ada-002) | 45.77 | 44.62 | 67.99 | 74.6 | 88.94 | 82.37 | 67.38 |
| BAAI/bge-small-zh | 47.67 | 42.07 | 65.07 | 80.64 | 87.4 | 83.8 | 67.78 |
| BAAI/bge-base-zh | 49.97 | 44.54 | 70.63 | 83.92 | 91.38 | 85.46 | 70.98 |
| bge-large-zh-noinstruct | 52.05 | 45.32 | 73.7 | 85.38 | 91.66 | 86.83 | 72.49 |
| bge-large-zh | 50.84 | 45.09 | 74.41 | 85.08 | 91.6 | 86.54 | 72.26 |

5. Reranking

| Model | T2Reranking | MmarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|
| luotuo-bert-medium | 65.76 | 14.55 | 57.82 | 58.88 | 49.25 |
| text2vec-large-chinese | 64.82 | 12.48 | 58.92 | 60.41 | 49.16 |
| text2vec-base-chinese | 65.95 | 12.76 | 59.26 | 59.82 | 49.45 |
| m3e-base | 66.03 | 17.51 | 77.05 | 76.76 | 59.34 |
| m3e-large | 66.13 | 16.46 | 77.76 | 78.27 | 59.66 |
| OpenAI (text-embedding-ada-002) | 66.65 | 23.39 | 63.08 | 64.02 | 54.28 |
| BAAI/bge-small-zh | 66.2 | 22.82 | 77.08 | 79.82 | 61.48 |
| BAAI/bge-base-zh | 66.49 | 28.24 | 80.12 | 84.78 | 64.91 |
| bge-large-zh-noinstruct | 66.16 | 27.1 | 81.72 | 84.64 | 64.91 |
| bge-large-zh | 66.19 | 26.23 | 83.01 | 85.01 | 65.11 |

6. Clustering

| Model | CLSClusteringS2S | CLSClusteringP2P | ThuNewsClusteringS2S | ThuNewsClusteringP2P | Avg |
|---|---|---|---|---|---|
| luotuo-bert-medium | 33.46 | 37.01 | 48.26 | 58.83 | 44.39 |
| text2vec-large-chinese | 28.77 | 30.13 | 26.14 | 35.05 | 30.02 |
| text2vec-base-chinese | 32.42 | 35.27 | 40.01 | 42.92 | 37.66 |
| m3e-base | 37.34 | 39.81 | 53.78 | 59.77 | 47.68 |
| m3e-large | 38.02 | 38.6 | 58.51 | 60.39 | 48.88 |
| OpenAI (text-embedding-ada-002) | 35.91 | 38.26 | 49.86 | 58.71 | 45.68 |
| BAAI/bge-small-zh | 34.34 | 38.23 | 51.84 | 55.95 | 45.09 |
| BAAI/bge-base-zh | 36.59 | 38.79 | 56.16 | 59.0 | 47.63 |
| bge-large-zh-noinstruct | 40.04 | 41.23 | 56.75 | 62.03 | 50.01 |
| bge-large-zh | 38.05 | 40.92 | 58.79 | 55.79 | 48.39 |

Tasks

An overview of the tasks and datasets available in C-MTEB is provided in the following table:

| Name | Hub URL | Description | Type | Category | Test #Samples |
|---|---|---|---|---|---|
| T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: a large-scale Chinese benchmark for passage ranking | Retrieval | s2p | 24,832 |
| MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Retrieval | s2p | 7,437 |
| DuRetrieval | C-MTEB/DuRetrieval | A large-scale Chinese benchmark for passage retrieval from web search engines | Retrieval | s2p | 4,000 |
| CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
| CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation text | Retrieval | s2p | 3,999 |
| EcomRetrieval | C-MTEB/EcomRetrieval | Passage retrieval dataset collected from Alibaba search-engine systems in the e-commerce domain | Retrieval | s2p | 1,000 |
| MedicalRetrieval | C-MTEB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba search-engine systems in the medical domain | Retrieval | s2p | 1,000 |
| VideoRetrieval | C-MTEB/VideoRetrieval | Passage retrieval dataset collected from Alibaba search-engine systems in the video domain | Retrieval | s2p | 1,000 |
| T2Reranking | C-MTEB/T2Reranking | T2Ranking: a large-scale Chinese benchmark for passage ranking | Reranking | s2p | 24,382 |
| MMarcoReranking | C-MTEB/Mmarco-reranking | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Reranking | s2p | 7,437 |
| CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical question answering | Reranking | s2p | 2,000 |
| CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical question answering | Reranking | s2p | 4,000 |
| Ocnli | C-MTEB/OCNLI | Original Chinese Natural Language Inference dataset | PairClassification | s2s | 3,000 |
| Cmnli | C-MTEB/CMNLI | Chinese Multi-Genre NLI | PairClassification | s2s | 139,000 |
| CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering of titles from the CLS dataset. Clustering of 13 sets, based on the main category. | Clustering | s2s | 10,000 |
| CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering of titles + abstracts from the CLS dataset. Clustering of 13 sets, based on the main category. | Clustering | p2p | 10,000 |
| ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering of titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| ATEC | C-MTEB/ATEC | ATEC NLP sentence-pair similarity competition | STS | s2s | 20,000 |
| BQ | C-MTEB/BQ | Bank question semantic similarity | STS | s2s | 10,000 |
| LCQMC | C-MTEB/LCQMC | A large-scale Chinese question matching corpus | STS | s2s | 12,500 |
| PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| STSB | C-MTEB/STSB | STS-B translated into Chinese | STS | s2s | 1,360 |
| AFQMC | C-MTEB/AFQMC | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
| QBQTC | C-MTEB/QBQTC | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
| TNews | C-MTEB/TNews-classification | Short text classification of news | Classification | s2s | 10,000 |
| IFlyTek | C-MTEB/IFlyTek-classification | Long text classification of app descriptions | Classification | s2s | 2,600 |
| Waimai | C-MTEB/waimai-classification | Sentiment analysis of user reviews on takeaway platforms | Classification | s2s | 1,000 |
| OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment analysis of user reviews on online shopping websites | Classification | s2s | 1,000 |
| MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A collection of multilingual sentiment datasets grouped into three classes: positive, neutral, negative | Classification | s2s | 3,000 |
| JDReview | C-MTEB/JDReview-classification | User reviews for iPhone | Classification | s2s | 533 |

In the retrieval tasks, we sample 100,000 candidates (including the ground truths) from the entire corpus to reduce the inference cost.
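
For intuition, the sampling step amounts to roughly the following hypothetical sketch (the function and its arguments are illustrative, not the benchmark's actual code):

import random

def sample_candidate_pool(corpus_ids, ground_truth_ids, pool_size=100_000, seed=42):
    """Illustrative only: keep every ground-truth passage and fill the
    rest of the candidate pool with a random sample from the corpus."""
    pool = set(ground_truth_ids)
    rng = random.Random(seed)
    others = [cid for cid in corpus_ids if cid not in pool]
    pool.update(rng.sample(others, min(len(others), pool_size - len(pool))))
    return pool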

Acknowledgement

We thank the Massive Text Embedding Benchmark (MTEB) for its great tooling and the Chinese NLP community for its open-source datasets.
