
Chinese Massive Text Embedding Benchmark


Installation | Evaluation | Leaderboard | Tasks | Acknowledgement

Installation

C-MTEB is developed based on MTEB.

```bash
pip install -U C_MTEB
```

Or clone this repo and install it in editable mode:

```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/C_MTEB
pip install -e .
```

Evaluation

Evaluate reranker

```bash
python eval_cross_encoder.py --model_name_or_path BAAI/bge-reranker-base
```
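If you want to sanity-check a reranker outside the evaluation script, a cross-encoder scores query–passage pairs directly. A minimal sketch using FlagEmbedding's FlagReranker, assuming the FlagEmbedding package from this repo is installed (the example texts are illustrative):

```python
from FlagEmbedding import FlagReranker

# Load the cross-encoder; use_fp16 trades a little precision for speed on GPU.
reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

# A cross-encoder consumes (query, passage) pairs and returns relevance scores;
# higher means more relevant.
scores = reranker.compute_score([
    ['what is a panda?', 'The giant panda is a bear species endemic to China.'],
    ['what is a panda?', 'Paris is the capital of France.'],
])
print(scores)  # the first pair should score higher than the second
```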

Evaluate embedding model

  • With our scripts

You can reproduce the results of baai-general-embedding (bge) using the provided Python script (see eval_C-MTEB.py):

```bash
python eval_C-MTEB.py --model_name_or_path BAAI/bge-large-zh

# for the MTEB leaderboard
python eval_MTEB.py --model_name_or_path BAAI/bge-large-en
```

  • With sentence-transformers

You can use C-MTEB in the same way as MTEB.

Note that the original sentence-transformers models do not support instructions, so this method cannot reproduce the performance of the bge-* models, which prepend an instruction to each query. A sketch of instruction-aware encoding follows the code block below.

```python
from mteb import MTEB
from C_MTEB import *
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "bert-base-uncased"

model = SentenceTransformer(model_name)
evaluation = MTEB(task_langs=['zh'])
results = evaluation.run(model, output_folder=f"zh_results/{model_name}")
```
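For bge-* models, the FlagEmbedding package provides a wrapper that adds the retrieval instruction to queries. A minimal sketch, assuming the FlagEmbedding package is installed; the instruction string is the one suggested in the bge model cards, and the example texts are illustrative:

```python
from FlagEmbedding import FlagModel

# FlagModel prepends the instruction to queries only; passages are encoded as-is.
model = FlagModel(
    'BAAI/bge-large-zh',
    query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章：",
)

queries = ["中国的首都是哪里"]                      # "What is the capital of China?"
passages = ["北京是中华人民共和国的首都。",          # "Beijing is the capital of the PRC."
            "巴黎是法国的首都。"]                   # "Paris is the capital of France."

q_embeddings = model.encode_queries(queries)   # instruction added here
p_embeddings = model.encode(passages)          # no instruction for passages
scores = q_embeddings @ p_embeddings.T         # inner-product similarity
print(scores)
```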
  • Using a custom model

To evaluate a new model, load it via sentence_transformers if it is supported there. Otherwise, implement a class with an encode function that takes a list of sentences and returns a list of embeddings (np.array, torch.Tensor, etc.):
```python
from mteb import MTEB

class MyModel:
    def encode(self, sentences, batch_size=32, **kwargs):
        """Returns a list of embeddings for the given sentences.

        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["T2Retrieval"])
evaluation.run(model)
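```

As a concrete illustration of this interface, here is a minimal hypothetical implementation that wraps a Hugging Face transformer with mean pooling; the model choice and pooling strategy are assumptions for the sketch, not part of C-MTEB:

```python
import torch
from transformers import AutoModel, AutoTokenizer

class HFEncoder:
    """Hypothetical encoder wrapping a Hugging Face model with mean pooling."""

    def __init__(self, model_name="bert-base-chinese"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch = sentences[i:i + batch_size]
            inputs = self.tokenizer(batch, padding=True, truncation=True,
                                    max_length=512, return_tensors="pt")
            outputs = self.model(**inputs)
            # Mean-pool token embeddings, ignoring padding positions.
            mask = inputs["attention_mask"].unsqueeze(-1).float()
            summed = (outputs.last_hidden_state * mask).sum(dim=1)
            counts = mask.sum(dim=1).clamp(min=1e-9)
            embeddings.extend((summed / counts).cpu().numpy())
        return embeddings
```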

Leaderboard

1. Reranker

| Model | T2Reranking | T2RerankingZh2En* | T2RerankingEn2Zh* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|---|---|
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
| multilingual-e5-large | 64.55 | 61.61 | 54.28 | 28.6 | 67.42 | 67.92 | 57.4 |
| multilingual-e5-base | 64.21 | 62.13 | 54.68 | 29.5 | 66.23 | 66.98 | 57.29 |
| m3e-base | 66.03 | 62.74 | 56.07 | 17.51 | 77.05 | 76.76 | 59.36 |
| m3e-large | 66.13 | 62.72 | 56.1 | 16.46 | 77.76 | 78.27 | 59.57 |
| bge-base-zh-v1.5 | 66.49 | 63.25 | 57.02 | 29.74 | 80.47 | 84.88 | 63.64 |
| bge-large-zh-v1.5 | 65.74 | 63.39 | 57.03 | 28.74 | 83.45 | 85.44 | 63.97 |
| BAAI/bge-reranker-base | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
| BAAI/bge-reranker-large | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |

* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks.

2. Embedding

| Model | Embedding dimension | Avg | Retrieval | STS | PairClassification | Classification | Reranking | Clustering |
|---|---|---|---|---|---|---|---|---|
| BAAI/bge-large-zh-v1.5 | 1024 | 64.53 | 70.46 | 56.25 | 81.6 | 69.13 | 65.84 | 48.99 |
| BAAI/bge-base-zh-v1.5 | 768 | 63.13 | 69.49 | 53.72 | 79.75 | 68.07 | 65.39 | 47.53 |
| BAAI/bge-small-zh-v1.5 | 512 | 57.82 | 61.77 | 49.11 | 70.41 | 63.96 | 60.92 | 44.18 |
| BAAI/bge-large-zh | 1024 | 64.20 | 71.53 | 54.98 | 78.94 | 68.32 | 65.11 | 48.39 |
| BAAI/bge-large-zh-noinstruct | 1024 | 63.53 | 70.55 | 53 | 76.77 | 68.58 | 64.91 | 50.01 |
| BAAI/bge-base-zh | 768 | 62.96 | 69.53 | 54.12 | 77.5 | 67.07 | 64.91 | 47.63 |
| multilingual-e5-large | 1024 | 58.79 | 63.66 | 48.44 | 69.89 | 67.34 | 56.00 | 48.23 |
| BAAI/bge-small-zh | 512 | 58.27 | 63.07 | 49.45 | 70.35 | 63.64 | 61.48 | 45.09 |
| m3e-base | 768 | 57.10 | 56.91 | 50.47 | 63.99 | 67.52 | 59.34 | 47.68 |
| m3e-large | 1024 | 57.05 | 54.75 | 50.42 | 64.3 | 68.2 | 59.66 | 48.88 |
| multilingual-e5-base | 768 | 55.48 | 61.63 | 46.49 | 67.07 | 65.35 | 54.35 | 40.68 |
| multilingual-e5-small | 384 | 55.38 | 59.95 | 45.27 | 66.45 | 65.85 | 53.86 | 45.26 |
| text-embedding-ada-002(OpenAI) | 1536 | 53.02 | 52.0 | 43.35 | 69.56 | 64.31 | 54.28 | 45.68 |
| luotuo | 1024 | 49.37 | 44.4 | 42.78 | 66.62 | 61 | 49.25 | 44.39 |
| text2vec-base | 768 | 47.63 | 38.79 | 43.41 | 67.41 | 62.19 | 49.45 | 37.66 |
| text2vec-large | 1024 | 47.36 | 41.94 | 44.97 | 70.86 | 60.66 | 49.16 | 30.02 |

2.1. Retrieval

| Model | T2Retrieval | MMarcoRetrieval | DuRetrieval | CovidRetrieval | CmedqaRetrieval | EcomRetrieval | MedicalRetrieval | VideoRetrieval | Avg |
|---|---|---|---|---|---|---|---|---|---|
| luotuo-bert-medium | 58.67 | 55.31 | 59.36 | 55.48 | 18.04 | 40.48 | 29.8 | 38.04 | 44.4 |
| text2vec-large-chinese | 50.52 | 45.96 | 51.87 | 60.48 | 15.53 | 37.58 | 30.93 | 42.65 | 41.94 |
| text2vec-base-chinese | 51.67 | 44.06 | 52.23 | 44.81 | 15.91 | 34.59 | 27.56 | 39.52 | 38.79 |
| m3e-base | 73.14 | 65.45 | 75.76 | 66.42 | 30.33 | 50.27 | 42.8 | 51.11 | 56.91 |
| m3e-large | 72.36 | 61.06 | 74.69 | 61.33 | 30.73 | 45.18 | 48.66 | 44.02 | 54.75 |
| OpenAI(text-embedding-ada-002) | 69.14 | 69.86 | 71.17 | 57.21 | 22.36 | 44.49 | 37.92 | 43.85 | 52.0 |
| multilingual-e5-small | 71.39 | 73.17 | 81.35 | 72.82 | 24.38 | 53.56 | 44.84 | 58.09 | 59.95 |
| multilingual-e5-base | 70.86 | 76.04 | 81.64 | 73.45 | 27.2 | 54.17 | 48.35 | 61.3 | 61.63 |
| multilingual-e5-large | 76.11 | 79.2 | 85.32 | 75.51 | 28.67 | 54.75 | 51.44 | 58.25 | 63.66 |
| BAAI/bge-small-zh | 77.59 | 67.56 | 77.89 | 68.95 | 35.18 | 58.17 | 49.9 | 69.33 | 63.07 |
| BAAI/bge-base-zh | 83.35 | 79.11 | 86.02 | 72.07 | 41.77 | 63.53 | 56.64 | 73.76 | 69.53 |
| bge-large-zh-noinstruct | 84.39 | 81.38 | 84.68 | 75.07 | 41.03 | 65.6 | 58.28 | 73.94 | 70.55 |
| bge-large-zh | 84.82 | 81.28 | 86.94 | 74.06 | 42.4 | 66.12 | 59.39 | 77.19 | 71.53 |

2.2. STS

| Model | ATEC | BQ | LCQMC | PAWSX | STSB | AFQMC | QBQTC | STS22 (zh) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| luotuo-bert-medium | 30.84 | 43.33 | 66.74 | 12.31 | 73.22 | 22.24 | 27.2 | 66.4 | 42.78 |
| text2vec-large-chinese | 32.45 | 44.22 | 69.16 | 14.55 | 79.45 | 24.51 | 29.51 | 65.94 | 44.97 |
| text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.3 | 26.06 | 24.62 | 55.35 | 43.41 |
| m3e-base | 41.27 | 63.81 | 74.88 | 12.19 | 76.97 | 35.87 | 32.07 | 66.73 | 50.47 |
| m3e-large | 41.8 | 65.2 | 74.2 | 15.95 | 74.16 | 36.53 | 32.65 | 62.91 | 50.42 |
| OpenAI(text-embedding-ada-002) | 29.25 | 45.33 | 68.41 | 16.55 | 70.61 | 23.88 | 30.27 | 62.53 | 43.35 |
| multilingual-e5-small | 35.14 | 43.27 | 72.7 | 11.01 | 77.73 | 25.21 | 30.25 | 66.84 | 45.27 |
| multilingual-e5-base | 37.01 | 45.45 | 74.15 | 12.14 | 79.05 | 29.67 | 28.81 | 65.64 | 46.49 |
| multilingual-e5-large | 39.81 | 46.44 | 75.95 | 14.63 | 81.08 | 33.02 | 29.77 | 66.82 | 48.44 |
| BAAI/bge-small-zh | 43.17 | 55.47 | 72.61 | 9.97 | 76.48 | 33.93 | 36.45 | 67.54 | 49.45 |
| BAAI/bge-base-zh | 48.28 | 61.21 | 74.98 | 20.65 | 78.66 | 42.53 | 38.01 | 68.64 | 54.12 |
| bge-large-zh-noinstruct | 48.29 | 60.53 | 74.71 | 16.64 | 78.41 | 43.06 | 35.2 | 67.19 | 53 |
| bge-large-zh | 49.75 | 62.93 | 75.45 | 22.45 | 78.51 | 44.57 | 38.92 | 67.24 | 54.98 |

2.3. PairClassification

| Model | Ocnli | Cmnli | Avg |
|---|---|---|---|
| luotuo-bert-medium | 60.7 | 72.55 | 66.62 |
| text2vec-large-chinese | 64.04 | 77.67 | 70.86 |
| text2vec-base-chinese | 60.95 | 73.87 | 67.41 |
| m3e-base | 58.0 | 69.98 | 63.99 |
| m3e-large | 59.33 | 69.27 | 64.3 |
| OpenAI(text-embedding-ada-002) | 63.08 | 76.03 | 69.56 |
| multilingual-e5-small | 60.77 | 72.12 | 66.45 |
| multilingual-e5-base | 59.63 | 74.51 | 67.07 |
| multilingual-e5-large | 78.18 | 78.18 | 69.89 |
| BAAI/bge-small-zh | 65.25 | 75.46 | 70.35 |
| BAAI/bge-base-zh | 73.32 | 81.69 | 77.5 |
| bge-large-zh-noinstruct | 71.37 | 82.17 | 76.77 |
| bge-large-zh | 75.75 | 82.12 | 78.94 |

2.4. Classification

| Model | TNews | IFlyTek | MultilingualSentiment | JDReview | OnlineShopping | Waimai | AmazonReviewsClassification (zh) | MassiveIntentClassification (zh-CN) | MassiveScenarioClassification (zh-CN) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| luotuo-bert-medium | 45.22 | 41.75 | 61.21 | 79.68 | 84.3 | 79.57 | 34.46 | 57.47 | 65.32 | 61 |
| text2vec-large-chinese | 38.92 | 41.54 | 58.97 | 81.56 | 83.51 | 76.01 | 33.77 | 63.23 | 68.45 | 60.66 |
| text2vec-base-chinese | 43.02 | 42.05 | 60.98 | 82.14 | 85.69 | 77.22 | 34.12 | 63.98 | 70.52 | 62.19 |
| m3e-base | 48.28 | 44.42 | 71.9 | 85.33 | 87.77 | 83.99 | 43.02 | 68.4 | 74.6 | 67.52 |
| m3e-large | 48.26 | 43.96 | 72.47 | 86.92 | 89.59 | 86.1 | 44.44 | 67.23 | 74.88 | 68.2 |
| OpenAI(text-embedding-ada-002) | 45.77 | 44.62 | 67.99 | 74.6 | 88.94 | 82.37 | 38.3 | 64.81 | 71.4 | 64.31 |
| multilingual-e5-small | 48.38 | 47.35 | 64.74 | 79.34 | 88.73 | 83.9 | 37.5 | 68.24 | 74.47 | 65.85 |
| multilingual-e5-base | 47.06 | 44.93 | 65.28 | 76.21 | 88.4 | 84.42 | 37.23 | 69.16 | 75.42 | 65.35 |
| multilingual-e5-large | 48.38 | 45.47 | 68.58 | 80.99 | 90.81 | 85.02 | 38.83 | 71.12 | 76.83 | 67.34 |
| BAAI/bge-small-zh | 47.67 | 42.07 | 65.07 | 80.64 | 87.4 | 83.8 | 37.31 | 61.44 | 67.39 | 63.64 |
| BAAI/bge-base-zh | 49.97 | 44.54 | 70.63 | 83.92 | 91.38 | 85.46 | 40.68 | 65.72 | 71.3 | 67.07 |
| bge-large-zh-noinstruct | 52.05 | 45.32 | 73.7 | 85.38 | 91.66 | 86.83 | 41.94 | 66.96 | 73.39 | 68.58 |
| bge-large-zh | 50.84 | 45.09 | 74.41 | 85.08 | 91.6 | 86.54 | 42.39 | 67.18 | 71.76 | 68.32 |

2.5. Reranking

| Model | T2Reranking | MmarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|---|---|---|---|---|---|
| luotuo-bert-medium | 65.76 | 14.55 | 57.82 | 58.88 | 49.25 |
| text2vec-large-chinese | 64.82 | 12.48 | 58.92 | 60.41 | 49.16 |
| text2vec-base-chinese | 65.95 | 12.76 | 59.26 | 59.82 | 49.45 |
| m3e-base | 66.03 | 17.51 | 77.05 | 76.76 | 59.34 |
| m3e-large | 66.13 | 16.46 | 77.76 | 78.27 | 59.66 |
| OpenAI(text-embedding-ada-002) | 66.65 | 23.39 | 63.08 | 64.02 | 54.28 |
| multilingual-e5-small | 65.24 | 24.33 | 63.44 | 62.41 | 53.86 |
| multilingual-e5-base | 64.39 | 21.76 | 65.21 | 66.06 | 54.35 |
| multilingual-e5-large | 65.83 | 21.34 | 68.25 | 68.56 | 56.00 |
| BAAI/bge-small-zh | 66.2 | 22.82 | 77.08 | 79.82 | 61.48 |
| BAAI/bge-base-zh | 66.49 | 28.24 | 80.12 | 84.78 | 64.91 |
| bge-large-zh-noinstruct | 66.16 | 27.1 | 81.72 | 84.64 | 64.91 |
| bge-large-zh | 66.19 | 26.23 | 83.01 | 85.01 | 65.11 |

2.6. Clustering

| Model | CLSClusteringS2S | CLSClusteringP2P | ThuNewsClusteringS2S | ThuNewsClusteringP2P | Avg |
|---|---|---|---|---|---|
| luotuo-bert-medium | 33.46 | 37.01 | 48.26 | 58.83 | 44.39 |
| text2vec-large-chinese | 28.77 | 30.13 | 26.14 | 35.05 | 30.02 |
| text2vec-base-chinese | 32.42 | 35.27 | 40.01 | 42.92 | 37.66 |
| m3e-base | 37.34 | 39.81 | 53.78 | 59.77 | 47.68 |
| m3e-large | 38.02 | 38.6 | 58.51 | 60.39 | 48.88 |
| OpenAI(text-embedding-ada-002) | 35.91 | 38.26 | 49.86 | 58.71 | 45.68 |
| multilingual-e5-small | 37.79 | 39.14 | 48.93 | 55.18 | 45.26 |
| multilingual-e5-base | 36.99 | 32.41 | 52.36 | 40.98 | 40.68 |
| multilingual-e5-large | 38.59 | 40.68 | 55.59 | 58.05 | 48.23 |
| BAAI/bge-small-zh | 34.34 | 38.23 | 51.84 | 55.95 | 45.09 |
| BAAI/bge-base-zh | 36.59 | 38.79 | 56.16 | 59.0 | 47.63 |
| bge-large-zh-noinstruct | 40.04 | 41.23 | 56.75 | 62.03 | 50.01 |
| bge-large-zh | 38.05 | 40.92 | 58.79 | 55.79 | 48.39 |

Tasks

An overview of tasks and datasets available in MTEB-chinese is provided in the following table:

| Name | Hub URL | Description | Type | Category | Test #Samples |
|---|---|---|---|---|---|
| T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: A large-scale Chinese benchmark for passage ranking | Retrieval | s2p | 24,832 |
| MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Retrieval | s2p | 7,437 |
| DuRetrieval | C-MTEB/DuRetrieval | A large-scale Chinese benchmark for passage retrieval from web search engines | Retrieval | s2p | 4,000 |
| CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
| CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation text | Retrieval | s2p | 3,999 |
| EcomRetrieval | C-MTEB/EcomRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in the e-commerce domain | Retrieval | s2p | 1,000 |
| MedicalRetrieval | C-MTEB/MedicalRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in the medical domain | Retrieval | s2p | 1,000 |
| VideoRetrieval | C-MTEB/VideoRetrieval | Passage retrieval dataset collected from Alibaba search engine systems in the video domain | Retrieval | s2p | 1,000 |
| T2Reranking | C-MTEB/T2Reranking | T2Ranking: A large-scale Chinese benchmark for passage ranking | Reranking | s2p | 24,382 |
| MMarcoReranking | C-MTEB/MMarco-reranking | mMARCO is a multilingual version of the MS MARCO passage ranking dataset | Reranking | s2p | 7,437 |
| CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical question answering | Reranking | s2p | 2,000 |
| CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical question answering | Reranking | s2p | 4,000 |
| Ocnli | C-MTEB/OCNLI | Original Chinese Natural Language Inference dataset | PairClassification | s2s | 3,000 |
| Cmnli | C-MTEB/CMNLI | Chinese Multi-Genre NLI | PairClassification | s2s | 139,000 |
| CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering of titles from the CLS dataset: 13 sets, based on the main category | Clustering | s2s | 10,000 |
| CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering of titles + abstracts from the CLS dataset: 13 sets, based on the main category | Clustering | p2p | 10,000 |
| ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering of titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| ATEC | C-MTEB/ATEC | ATEC NLP sentence pair similarity competition | STS | s2s | 20,000 |
| BQ | C-MTEB/BQ | Bank Question Semantic Similarity | STS | s2s | 10,000 |
| LCQMC | C-MTEB/LCQMC | A large-scale Chinese question matching corpus | STS | s2s | 12,500 |
| PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| STSB | C-MTEB/STSB | STS-B translated into Chinese | STS | s2s | 1,360 |
| AFQMC | C-MTEB/AFQMC | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
| QBQTC | C-MTEB/QBQTC | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
| TNews | C-MTEB/TNews-classification | Short text classification for news | Classification | s2s | 10,000 |
| IFlyTek | C-MTEB/IFlyTek-classification | Long text classification of app descriptions | Classification | s2s | 2,600 |
| Waimai | C-MTEB/waimai-classification | Sentiment analysis of user reviews on takeaway platforms | Classification | s2s | 1,000 |
| OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment analysis of user reviews on online shopping websites | Classification | s2s | 1,000 |
| MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A collection of multilingual sentiment datasets grouped into three classes: positive, neutral, negative | Classification | s2s | 3,000 |
| JDReview | C-MTEB/JDReview-classification | Reviews of iPhone products | Classification | s2s | 533 |

For retrieval tasks, we sample 100,000 candidates (including the ground truths) from the entire corpus to reduce the inference cost.
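A sketch of what this candidate sampling could look like; the function and names here are illustrative, not the benchmark's actual code:

```python
import random

def sample_candidates(corpus_ids, gold_ids, k=100_000, seed=42):
    """Sample k candidate documents, always keeping the ground-truth ids."""
    rng = random.Random(seed)
    gold = set(gold_ids)
    # Draw the remaining slots from documents that are not ground truths.
    negatives = [doc_id for doc_id in corpus_ids if doc_id not in gold]
    sampled = rng.sample(negatives, k - len(gold))
    return list(gold) + sampled
```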

Acknowledgement

We thank the authors of the Massive Text Embedding Benchmark for their great tool and the Chinese NLP community for the open-source datasets.

Citation

If you find this repository useful, please consider citing it:

```bibtex
@misc{c-pack,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
