Multi-Expert Chain for Audio Tasks (MECAT)
MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
📖 arXiv | 🛠️ GitHub Code | 🔊 MECAT-Caption Dataset (HuggingFace) | 🔊 MECAT-QA Dataset (HuggingFace)
Table of Contents
- 1. Introduction
- 2. Features
- 3. Data Distribution
- 4. Tasks
- 5. Example Data
- 6. Evaluation Metrics
- 7. Usage
- 8. Results
- 9. Acknowledgement
- 10. Contributing
- 11. Citation
- 12. License
1. Introduction
MECAT is a comprehensive benchmark constructed on large-scale data to evaluate machine understanding of audio content through two core tasks:
- Audio Captioning: Generating textual descriptions for given audio
- Audio Question Answering: Answering questions about given audio
2. Features
- Data Source: Diverse-scenario coverage drawn from a part of the ACAV100M dataset
- Processing Pipeline:
  - MetaInfo: Source video metadata extraction (titles/descriptions)
  - Content-Specific: Content-specific feature extraction using 10-20 dedicated models (speech/music/general audio)
  - Content-Unrelated: Non-content audio analysis: quality metrics, loudness measurements, reverberation assessment
  - Understanding & Generation: LLM-powered comprehension & generation with Chain-of-Thought
- Quality Control: Multi-stage verification framework
- Evaluation System: Multi-perspective assessment with progressive difficulty levels
3. Data Distribution
| Data Code | Description | Audio Caption # Pairs (Train) | Audio Caption # Pairs (Test) | Audio Question Answering # Pairs (Train) | Audio Question Answering # Pairs (Test) |
|---|---|---|---|---|---|
| 000 | silence | 173 | 179 | 865 | 895 |
| 00A | general sound excluding speech and music | 837 | 848 | 4185 | 4240 |
| 0M0 | music | 2593 | 2593 | 12965 | 12965 |
| 0MA | music and general sound | 206 | 199 | 1030 | 995 |
| S00 | speech | 7839 | 7839 | 39195 | 39195 |
| S0A | speech and general sound | 2424 | 2439 | 12120 | 12195 |
| SM0 | speech and music | 5312 | 5312 | 26560 | 26560 |
| SMA | speech, music and general sound | 668 | 643 | 3340 | 3215 |
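The three-character data codes appear to act as presence flags (`S` for speech, `M` for music, `A` for general sound, `0` for absent). The sketch below decodes a code under that reading, which is an assumption that happens to match all eight rows, and totals the test-split counts copied from the table. The helper name `decode_data_code` is hypothetical and not part of the `mecat` package.

```python
# Test-split caption pair counts copied from the table above.
CAPTION_TEST_PAIRS = {
    "000": 179, "00A": 848, "0M0": 2593, "0MA": 199,
    "S00": 7839, "S0A": 2439, "SM0": 5312, "SMA": 643,
}


def decode_data_code(code: str) -> dict:
    """Read a code such as 'S0A' as speech/music/general-sound flags.

    Assumption: position 0 is 'S' or '0', position 1 is 'M' or '0',
    position 2 is 'A' or '0'.
    """
    if len(code) != 3:
        raise ValueError(f"expected a 3-character code, got {code!r}")
    return {
        "speech": code[0] == "S",
        "music": code[1] == "M",
        "sound": code[2] == "A",
    }


total_caption = sum(CAPTION_TEST_PAIRS.values())
# In the table, every QA count is exactly five times the caption count.
total_qa = 5 * total_caption

print(decode_data_code("S0A"))
print(total_caption, total_qa)
```

The caption total agrees with the `num_samples` value reported for the whole-audio subtasks in the sample evaluation output of Section 7.2.2.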
4. Tasks
4.1 Audio-Captioning
| Type | Subtask | Category | Level | Description | Evaluated Data Abbreviation |
|---|---|---|---|---|---|
| Systematic | Short | | 🔵 Specialized | Simplified caption over the whole audio within 15 words | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Systematic | Long | | 🔵 Specialized | Caption over the whole audio using 1-2 sentences | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Content-Specific | Speech | Clean | 🟢 Basic | Caption over clean speech | S00 |
| Content-Specific | Speech | Mixed | 🔴 Complex | Caption over speech with music/sound interference | 0MA, S0A, SM0, SMA |
| Content-Specific | Music | Clean | 🟢 Basic | Caption over clean music | 0M0 |
| Content-Specific | Music | Mixed | 🔴 Complex | Caption over music with speech/sound interference | 0MA, S0A, SM0, SMA |
| Content-Specific | Sound | Clear | 🟢 Basic | Caption over general sound excluding speech and music | 00A |
| Content-Specific | Sound | Mixed | 🔴 Complex | Caption over sound with speech/music interference | 0MA, S0A, SM0, SMA |
| Content-Unrelated | Environment | | 🔵 Specialized | Caption over acoustic characteristics and environment | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
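As a rough cross-check, the number of test samples behind each caption subtask can be derived by summing the per-domain test counts from Section 3 over the domains listed in the last column. This is only a sketch: the subtask names are borrowed from the sample output in Section 7.2.2, and the mapping dictionary is an illustration rather than a structure exposed by `mecat`.

```python
# Test-split caption pair counts per domain (from the table in Section 3).
CAPTION_TEST_PAIRS = {
    "000": 179, "00A": 848, "0M0": 2593, "0MA": 199,
    "S00": 7839, "S0A": 2439, "SM0": 5312, "SMA": 643,
}

ALL_DOMAINS = list(CAPTION_TEST_PAIRS)
MIXED_DOMAINS = ["0MA", "S0A", "SM0", "SMA"]

# Domains evaluated for each subtask, following the table above.
SUBTASK_DOMAINS = {
    "content_long": ALL_DOMAINS,
    "content_short": ALL_DOMAINS,
    "pure_speech": ["S00"],
    "mixed_speech": MIXED_DOMAINS,
    "pure_music": ["0M0"],
    "mixed_music": MIXED_DOMAINS,
    "pure_sound": ["00A"],
    "mixed_sound": MIXED_DOMAINS,
    "environment": ALL_DOMAINS,
}

samples_per_subtask = {
    subtask: sum(CAPTION_TEST_PAIRS[d] for d in domains)
    for subtask, domains in SUBTASK_DOMAINS.items()
}

for subtask, count in samples_per_subtask.items():
    print(f"{subtask:<14} {count}")
```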
4.2 Audio-Question-Answering
Description
| Type | Subtask | Level | Description | Data Abbreviation |
|---|---|---|---|---|
| Perception | Direct_Perception | 🟢🟡 | Perceive sound types | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Analysis | Sound_Characteristics | 🟢🟡🟠🔴 | Analyze sound characteristics | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Analysis | Quality_Assessment | 🟢🟡🟠🔴 | Analyze sound quality | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Reasoning | Environment_Reasoning | 🟢🟡🟠🔴 | Reason about the acoustic environment | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Reasoning | Inference_Judgment | 🟢🟡🟠🔴 | Cross-modal reasoning | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
| Reasoning | Application_Context | 🟢🟡🟠🔴 | Semantic understanding | 000, 00A, 0M0, 0MA, S00, S0A, SM0, SMA |
Difficulty Distribution
| Difficulty | Symbol | Ratio (%) | Description |
|---|---|---|---|
| Basic | 🟢 | 25 | Direct descriptive questions |
| Intermediate | 🟡 | 35 | Analytical questions |
| Advanced | 🟠 | 25 | Inferential questions |
| Complex | 🔴 | 15 | Comprehensive judgment questions |
5. Example Data
5.1 Audio Captioning Example (SMA - Speech, Music and General Sound)
The following example shows the comprehensive caption annotations for a single audio sample from the SMA domain. This is the first data sample from the HuggingFace dataset:
Data Source: MECAT-Caption/SMA/Test/test_0000-0000000.tar.gz
{
  "RjRMEFDocEY_78_681_88_681": {
    "short": [
      "Energetic electronic music accompanies animated speech with intermittent dog barks and background interference.",
      "Upbeat instrumental track plays under expressive dialogue and occasional canine vocalizations amid noise.",
      "Dynamic speech with emotional shifts over electronic music featuring sporadic barking and audio artifacts."
    ],
    "long": [
      "A female voice delivers emotionally varied speech ranging from laughter to frustration, accompanied by rhythmic electronic instrumentation with guitar elements. Occasional dog barks emerge through persistent background static and audio distortion.",
      "Expressive vocal performance transitions between cheerfulness and intensity, layered over a driving electronic beat with occasional animal sounds and recording imperfections.",
      "Vivid speech with fluctuating emotional tones interacts with synth-driven musical backing, punctuated by canine noises and low-fidelity artifacts."
    ],
    "speech": [
      "Animated female speech displaying rapid emotional shifts from laughter to frustration.",
      "Expressive vocal delivery alternating between cheerful and agitated tones.",
      "Dynamic spoken performance transitioning between amusement and intensity."
    ],
    "music": [
      "Moderate-tempo electronic composition featuring prominent guitar and rhythmic percussion elements.",
      "Driving synth-based arrangement with guitar accents and steady beat.",
      "Energetic instrumental track combining electronic textures with rhythmic guitar work."
    ],
    "sound": [
      "Intermittent dog vocalizations amidst persistent electrical interference.",
      "Occasional canine barks layered over background static.",
      "Sporadic animal noises punctuating continuous audio distortion."
    ],
    "environment": [
      "Low-quality recording with noticeable background interference and distortion.",
      "Audio artifacts and electrical noise throughout the recording.",
      "Persistent static and signal degradation affecting audio clarity."
    ],
    "domain": "SMA"
  }
}
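A record like the one above can be inspected with the standard `json` module. The sketch below parses a trimmed copy of the example (only the `short` and `domain` fields are kept for brevity) and checks the three reference captions against the 15-word limit that Section 4.1 gives for the short subtask. Counting words by whitespace splitting is an assumption made here, not necessarily how the benchmark counts them.

```python
import json

# Trimmed copy of the example record above (only two fields kept).
record_json = """
{
  "RjRMEFDocEY_78_681_88_681": {
    "short": [
      "Energetic electronic music accompanies animated speech with intermittent dog barks and background interference.",
      "Upbeat instrumental track plays under expressive dialogue and occasional canine vocalizations amid noise.",
      "Dynamic speech with emotional shifts over electronic music featuring sporadic barking and audio artifacts."
    ],
    "domain": "SMA"
  }
}
"""

record = json.loads(record_json)
audio_key, annotation = next(iter(record.items()))

# Whitespace-based word count for each reference caption.
word_counts = [len(caption.split()) for caption in annotation["short"]]

print(audio_key, annotation["domain"], word_counts)
print("all within 15 words:", all(n <= 15 for n in word_counts))
```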
5.2 Audio Question Answering Example (SMA - Speech, Music and General Sound)
The following example shows a QA pair from the SMA domain. This is the first data sample from the HuggingFace dataset:
Data Source: MECAT-QA/SMA/Test/test_0000-0000000.tar.gz
{
  "RjRMEFDocEY_78_681_88_681_ffd8b511": {
    "category": "direct_perception",
    "difficulty": "basic",
    "question": "What type of vocal sounds are present?",
    "answer": "A woman speaking expressively and dog barks.",
    "domain": "SMA"
  }
}
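In the two examples, the QA key equals the caption key followed by an underscore and a short suffix. Assuming that pattern holds for every QA key (it is inferred from this single example, not documented), QA items can be grouped by their source audio as sketched below. The helper `audio_key_from_qa_key` is hypothetical.

```python
from collections import defaultdict


def audio_key_from_qa_key(qa_key: str) -> str:
    """Drop the trailing '_<suffix>' from a QA key.

    Assumption: every QA key is '<audio key>_<suffix>', as in the example.
    """
    audio_key, separator, _suffix = qa_key.rpartition("_")
    if not separator:
        raise ValueError(f"no '_' found in QA key {qa_key!r}")
    return audio_key


# The QA record shown above, keyed as in the dataset example.
qa_records = {
    "RjRMEFDocEY_78_681_88_681_ffd8b511": {
        "category": "direct_perception",
        "difficulty": "basic",
        "question": "What type of vocal sounds are present?",
        "answer": "A woman speaking expressively and dog barks.",
        "domain": "SMA",
    }
}

# Group QA keys by the audio clip they refer to.
qa_by_audio = defaultdict(list)
for qa_key in qa_records:
    qa_by_audio[audio_key_from_qa_key(qa_key)].append(qa_key)

print(dict(qa_by_audio))
```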
6. Evaluation Metrics
MECAT supports multiple evaluation metrics for comprehensive assessment:
- Traditional Metrics: BLEU
- FENSE: Fluency ENhanced Sentence-bert Evaluation for audio captioning
- DATE: Discriminability-based Audio Task Evaluation. DATE is particularly effective for audio captioning and question-answering tasks because it considers both the quality of the generated text and the model's discriminative capabilities.
7. Usage
7.1 Installation
python3 -m pip install mecat
# Or install the development version
# pip install git+https://git.n.xiaomi.com/niuyadong/mecat_public.git
7.2 Quick Start with Qwen2-Audio Example
This section provides a complete walkthrough of evaluating audio models using MECAT, using Qwen2-Audio as a practical example. The same approach can be adapted for other audio understanding models.
7.2.1 Preliminary Steps: Environment Setup and Model Loading
import torch
from tqdm import tqdm
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load Qwen2-Audio model and processor
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B",
    trust_remote_code=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B",
    trust_remote_code=True
)
7.2.2 Audio Caption Evaluation
Step 1: Load MECAT-Caption Dataset
from datasets import load_dataset

data = load_dataset(
    'mispeech/MECAT-Caption',
    split='test',
)
print(f"Loaded {len(data)} samples from datasets")
Step 2: Generate and Evaluate Captions
Method 1: Single Dictionary Approach (for non-instruction-following models)
Generation:
import csv

from mecat import evaluate

# Generate general predictions using a single prompt
predictions = {}
for item in tqdm(data, desc="Generating general captions"):
    key = item['__key__']
    audio = item['flac']['array']
    sampling_rate = item['flac']['sampling_rate']
    # Note: the sampling rate of audio provided by MECAT is 16kHz

    # Create general prompt for caption generation
    prompt = "<|audio_bos|><|AUDIO|><|audio_eos|>Generate the caption in English:"

    # Process inputs
    inputs = processor(
        text=prompt,
        audio=audio,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).to(device)

    # Generate response with greedy decoding
    # (temperature has no effect when do_sample=False, so it is omitted)
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_length=512,
            do_sample=False
        )

    # Decode only the newly generated tokens (strip the prompt tokens)
    generated_ids = generated_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    predictions[key] = response.strip()

print(f"Generated {len(predictions)} general captions")

# Save single prediction file (newline='' is what the csv module expects)
with open('caption_predictions.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for key, value in predictions.items():
        writer.writerow([key, value])
Evaluation:
# Evaluate general predictions across all subtasks
results = evaluate(
    predicted_data=predictions,
    task='caption',
    metrics=['fense', 'date']
)
print("\nSingle Dictionary Evaluation Results:")
print(results)
Method 2: Multi-Dictionary Approach (recommended for instruction-following models)
Generation:
# Generate task-specific predictions using different prompts
import csv

task_prompts = {
    'long': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to this audio and describe it in 1-2 sentences:",
    'short': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption for this audio within 15 words:",
    'speech': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption describing the speech content in this audio:",
    'music': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption for the music content in this audio:",
    'sound': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a general sound excluding speech and music:",
    'environment': "<|audio_bos|><|AUDIO|><|audio_eos|>Listen to the audio and provide a caption for quality or acoustic environment for this audio:"
}

# Generate predictions for each subtask
subtask_predictions = {}
for subtask, prompt_template in task_prompts.items():
    print(f"\nGenerating {subtask} captions...")
    subtask_predictions[subtask] = {}
    for item in tqdm(data, desc=f"Generating {subtask} captions"):
        key = item['__key__']
        audio = item['flac']['array']
        sampling_rate = item['flac']['sampling_rate']

        # Process inputs with task-specific prompt
        inputs = processor(
            text=prompt_template,
            audio=audio,
            sampling_rate=sampling_rate,
            return_tensors="pt"
        ).to(device)

        # Generate response with greedy decoding
        # (temperature has no effect when do_sample=False, so it is omitted)
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                max_length=512,
                do_sample=False
            )

        # Decode only the newly generated tokens (strip the prompt tokens)
        generated_ids = generated_ids[:, inputs.input_ids.size(1):]
        response = processor.batch_decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]
        subtask_predictions[subtask][key] = response.strip()

# Save separate prediction files for each subtask
for subtask, preds in subtask_predictions.items():
    filename = f'{subtask}_caption.csv'
    with open(filename, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for key, value in preds.items():
            writer.writerow([key, value])
    print(f"Saved {len(preds)} {subtask} predictions to {filename}")
Evaluation:
# Evaluate task-specific predictions for optimal performance
results_multisubtask = evaluate(
    predicted_data=subtask_predictions,
    task='caption',
    metrics=['fense', 'date']
)
print("\nMulti-Dictionary Evaluation Results:")
print(results_multisubtask)
Step 3: Expected Results
Expected caption evaluation output (the format is illustrative; these numbers do not represent the actual performance of Qwen2-Audio-7B):
subtask num_samples fense date
content_long 20052 47.3 40.5
content_short 20052 45.8 41.0
pure_speech 7839 30.9 28.5
mixed_speech 8593 31.7 27.1
pure_music 2593 42.1 50.7
mixed_music 8593 28.3 33.1
pure_sound 848 41.2 46.6
mixed_sound 8593 16.2 34.1
environment 20052 45.4 47.8
score_caption <NA> 35.2 39.3
Note:
The formula for score_caption is:
$S_{\rm caption} = 0.4\times(0.8S_{\rm long} + 0.2S_{\rm short}) + 0.4\times(0.6S_{\rm speech} + 0.3S_{\rm music} + 0.1S_{\rm sound}) + 0.2\times S_{\rm environment}$
where $S_{\rm speech}$, $S_{\rm music}$ and $S_{\rm sound}$ are each the average of the scores on pure and mixed data, e.g., $S_{\rm speech} = \frac{S_{\rm speech,pure}+S_{\rm speech,mixed}}{2}$.
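The weighting above can be written out directly. The following is a minimal sketch of the stated formula, not the package's own implementation: the function name, the dictionary keys, and the input scores are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def score_caption(s: dict) -> float:
    """Combine per-subtask scores with the weights from the formula above."""
    systematic = 0.8 * s["long"] + 0.2 * s["short"]
    # Each content score is the mean of its pure and mixed variants.
    speech = (s["pure_speech"] + s["mixed_speech"]) / 2
    music = (s["pure_music"] + s["mixed_music"]) / 2
    sound = (s["pure_sound"] + s["mixed_sound"]) / 2
    content = 0.6 * speech + 0.3 * music + 0.1 * sound
    return 0.4 * systematic + 0.4 * content + 0.2 * s["environment"]


# Hypothetical scores, not taken from any model in this document.
example_scores = {
    "long": 60.0, "short": 40.0,
    "pure_speech": 30.0, "mixed_speech": 20.0,
    "pure_music": 40.0, "mixed_music": 20.0,
    "pure_sound": 50.0, "mixed_sound": 10.0,
    "environment": 10.0,
}

overall = score_caption(example_scores)
print(round(overall, 1))
```

With these inputs the systematic term is 56, the content term is 27, and the total is 0.4 × 56 + 0.4 × 27 + 0.2 × 10 = 35.2. Because the three outer weights sum to 1, a model scoring the same value on every subtask receives that value as its overall score.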
7.2.3 Audio Question Answering Evaluation
Step 1: Load MECAT-QA Dataset
# Load MECAT-QA test data
qa_data = load_dataset(
    'mispeech/MECAT-QA',
    split='test',
)
print(f"Loaded {len(qa_data)} QA samples from datasets")
Step 2: Generate and Evaluate Answers
Generation:
# Generate predictions for each question-audio pair
import csv

qa_predictions = {}
for item in tqdm(qa_data, desc="Generating answers"):
    key = item['__key__']
    audio = item['flac']['array']
    sampling_rate = item['flac']['sampling_rate']
    question = item['json']['question']

    # Create prompt for QA
    prompt = f"<|audio_bos|><|AUDIO|><|audio_eos|>{question}"

    # Process inputs
    inputs = processor(
        text=prompt,
        audio=audio,
        sampling_rate=sampling_rate,
        return_tensors="pt"
    ).to(device)

    # Generate response with greedy decoding
    # (temperature has no effect when do_sample=False, so it is omitted)
    with torch.no_grad():
        generated_ids = model.generate(
            **inputs,
            max_length=512,
            do_sample=False
        )

    # Decode only the newly generated tokens (strip the prompt tokens)
    generated_ids = generated_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    qa_predictions[key] = response.strip()

print(f"Generated {len(qa_predictions)} answers")

# Write the results to a CSV file
with open('qa_predictions.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for key, value in qa_predictions.items():
        writer.writerow([key, value])
Evaluation:
# Evaluate using MECAT metrics
qa_results = evaluate(
    predicted_data=qa_predictions,
    task='qa',
    metrics=['fense', 'date']
)
print("\nQA Evaluation Results:")
print(qa_results)
Step 3: Expected Results
Expected QA evaluation output (the format is illustrative; these numbers do not represent the actual performance of Qwen2-Audio-7B):
subtask num_samples fense date
direct_perception 20624 44.0 54.0
sound_characteristics 19767 39.0 53.1
quality_assessment 18942 18.0 17.8
environment_reasoning 18300 42.0 35.5
inference_judgement 19756 51.0 42.0
application_context 2871 40.0 49.9
score_qa <NA> 39.0 42.1
Note: the final score (score_qa) is the unweighted average of the scores of all six subtasks.
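As a quick illustration of that averaging, the sketch below recomputes the overall value from the six per-subtask numbers in the `fense` column of the sample output above. The variable names are illustrative and not part of the `mecat` API.

```python
from statistics import fmean

# Per-subtask values from the 'fense' column of the sample output above.
fense_by_subtask = {
    "direct_perception": 44.0,
    "sound_characteristics": 39.0,
    "quality_assessment": 18.0,
    "environment_reasoning": 42.0,
    "inference_judgement": 51.0,
    "application_context": 40.0,
}

# Unweighted mean over the six subtasks, regardless of sample counts.
score_qa = fmean(fense_by_subtask.values())
print(round(score_qa, 1))
```

Note that the mean is taken over subtasks rather than over samples, so a small subtask such as application_context carries the same weight as the larger ones.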
7.3 Command Line Evaluation
You can also use the command line interface for evaluation:
7.3.1 Single File Evaluation
# Caption evaluation for different audio types (using single dictionary predictions)
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask long --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask short --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask music --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask speech --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask sound --metrics fense date
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --subtask environment --metrics fense date
# Batch evaluation across all subsets (using single dictionary predictions)
python -m mecat.evaluate --prediction caption_predictions.csv --task caption --metrics fense date
7.3.2 Multi-File Evaluation (Recommended for Caption Task)
For instruction-following models that can generate task-specific captions, you can provide multiple prediction files at once to get comprehensive evaluation results across all caption subtasks:
# Evaluate multiple caption prediction files in order: long, short, speech, music, sound, environment
python -m mecat.evaluate --prediction \
    long_caption.csv \
    short_caption.csv \
    speech_caption.csv \
    music_caption.csv \
    sound_caption.csv \
    environment_caption.csv \
    --task caption --metrics fense date

# Evaluate with fewer files (will evaluate only available subtasks with warning)
python -m mecat.evaluate --prediction \
    long_caption.csv \
    short_caption.csv \
    --task caption --metrics fense date
Benefits of Multi-File Evaluation:
- ✅ Complete Coverage: Evaluates all caption subtasks with task-specific predictions
- ✅ Better Performance: Each prediction file contains responses optimized for specific caption types
- ✅ Comprehensive Results: Provides the full evaluation matrix including overall scores
- ⚠️ File Order Matters: Files are mapped to subtasks in order:
long → short → speech → music → sound → environment
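Because the files are matched to subtasks purely by position, it can be safer to assemble the command from a subtask-to-file mapping instead of typing the order by hand. The sketch below builds the argument list for the full six-file command shown above, using only the flags that appear in this document. The helper `build_caption_command` is hypothetical and the command is constructed but not executed.

```python
import sys

# Positional order expected by the multi-file caption evaluation.
SUBTASK_ORDER = ["long", "short", "speech", "music", "sound", "environment"]


def build_caption_command(files_by_subtask: dict) -> list:
    """Return an argv list with prediction files sorted into subtask order."""
    missing = [s for s in SUBTASK_ORDER if s not in files_by_subtask]
    if missing:
        raise ValueError(f"missing prediction files for: {missing}")
    ordered_files = [files_by_subtask[s] for s in SUBTASK_ORDER]
    return [
        sys.executable, "-m", "mecat.evaluate",
        "--prediction", *ordered_files,
        "--task", "caption",
        "--metrics", "fense", "date",
    ]


# The mapping is deliberately out of order; the helper restores the order.
files = {
    "music": "music_caption.csv",
    "long": "long_caption.csv",
    "environment": "environment_caption.csv",
    "short": "short_caption.csv",
    "sound": "sound_caption.csv",
    "speech": "speech_caption.csv",
}
cmd = build_caption_command(files)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `mecat` is installed and the files exist.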
7.3.3 QA Task Evaluation
# QA evaluation for different question types
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask direct_perception --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask sound_characteristics --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask quality_assessment --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask environment_reasoning --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask inference_judgement --metrics fense date
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --subtask application_context --metrics fense date
# Batch evaluation across all subsets (recommended)
python -m mecat.evaluate --prediction qa_predictions.csv --task qa --metrics fense date
Prediction File Format:
# CSV file: one row per prediction, two quoted fields (key, text), no header row
"audio_key_1","Generated caption or answer text"
"audio_key_2","Another generated response"
"audio_key_3","More predictions..."
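Before running an evaluation, it may help to confirm that a prediction file really has this two-column shape. The following is a minimal sketch using the standard `csv` module: the function `load_predictions` is hypothetical (not part of `mecat`), and it reads from an in-memory sample so that it runs without any files on disk.

```python
import csv
import io


def load_predictions(stream) -> dict:
    """Read key/text rows into a dict, rejecting malformed or duplicate rows."""
    predictions = {}
    for row_number, row in enumerate(csv.reader(stream), start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2:
            raise ValueError(f"row {row_number}: expected 2 fields, got {len(row)}")
        key, text = row
        if key in predictions:
            raise ValueError(f"row {row_number}: duplicate key {key!r}")
        predictions[key] = text
    return predictions


# In-memory sample in the same quoted style that csv.QUOTE_ALL produces.
sample = io.StringIO(
    '"audio_key_1","Generated caption, with a comma"\n'
    '"audio_key_2","Another generated response"\n'
)
loaded = load_predictions(sample)
print(loaded)
```

For a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object in place of the sample stream. Quoting every field keeps commas inside captions from being read as column separators.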
Important Notes:
- Audio Captioning Task:
  - For instruction-following models (Recommended):
    - Generate 6 different prediction files using task-specific prompts (one per subtask). This requires 6 inference passes.
    - Example prompts:
      - long: "Listen to this audio and describe it in 1-2 sentences"
      - short: "Listen to the audio and provide a caption for this audio within 15 words"
      - speech: "Listen to the audio and provide a caption describing the speech content in this audio"
      - music: "Listen to the audio and provide a caption for the music content in this audio"
      - sound: "Listen to the audio and provide a general sound excluding speech and music"
      - environment: "Listen to the audio and provide a caption for quality or acoustic environment for this audio"
  - For non-instruction-following models:
    - Evaluate using a single prediction file (single inference pass).
    - The same predictions will be evaluated across all subtasks.
- Audio Question Answering Task:
  - Evaluate all subtasks in a single inference pass using the standard method.
  - A single prediction file is sufficient because the questions are task-specific.
8. Results
8.1 Audio-Captioning Task
8.1.1 DATE
| Model Type | Model Name | Systematic (long) | Systematic (short) | Speech-Focused (pure) | Speech-Focused (mixed) | Music-Focused (pure) | Music-Focused (mixed) | Sound-Focused (pure) | Sound-Focused (mixed) | Content-Unrelated (environment) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caption-Only | enclap | 48.6 | 53.1 | 30.2 | 31.8 | 17.9 | 15.9 | 48.8 | 15.2 | 6.8 | 33.3 |
| Caption-Only | pengi | 43.5 | 46.8 | 27.2 | 29.5 | 29.3 | 13.1 | 42.8 | 14.6 | 7.1 | 30.6 |
| LALM | audio-flamingo | 48.6 | 49.7 | 30.5 | 34.3 | 28.8 | 25.6 | 41.2 | 18.5 | 17.5 | 35.6 |
| LALM | kimi-audio | 49.5 | 54.2 | 30.0 | 31.3 | 27.7 | 16.9 | 43.1 | 16.2 | 7.0 | 34.3 |
| LALM | omni3b | 56.4 | 55.2 | 42.5 | 41.3 | 46.6 | 29.7 | 52.9 | 23.9 | 19.4 | 42.6 |
| LALM | omni7b | 61.1 | 56.5 | 39.9 | 40.9 | 32.1 | 30.9 | 50.7 | 23.8 | 17.9 | 43.0 |
8.1.2 FENSE
| Model Type | Model Name | Systematic (long) | Systematic (short) | Speech-Focused (pure) | Speech-Focused (mixed) | Music-Focused (pure) | Music-Focused (mixed) | Sound-Focused (pure) | Sound-Focused (mixed) | Content-Unrelated (environment) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caption-Only | enclap-both | 40.5 | 45.0 | 28.7 | 29.5 | 39.3 | 15.0 | 41.2 | 17.3 | 17.9 | 31.6 |
| Caption-Only | pengi | 37.5 | 41.0 | 26.6 | 29.2 | 39.6 | 11.8 | 35.4 | 16.2 | 17.8 | 29.5 |
| LALM | audio-flamingo2 | 43.8 | 43.3 | 28.5 | 33.7 | 43.1 | 30.3 | 41.0 | 24.7 | 45.4 | 39.4 |
| LALM | kimi-audio | 40.8 | 45.7 | 25.6 | 27.1 | 39.5 | 16.2 | 35.8 | 19.4 | 16.7 | 30.8 |
| LALM | qwen2.5-omni3b | 48.3 | 45.3 | 37.3 | 37.5 | 50.7 | 34.7 | 46.6 | 34.1 | 47.8 | 44.1 |
| LALM | qwen2.5-omni7b | 52.7 | 46.2 | 35.3 | 37.5 | 39.2 | 33.1 | 45.2 | 32.1 | 41.0 | 43.4 |
8.2 Audio-Question-Answering
8.2.1 DATE
| Model Type | Model Name | Perception (direct perception) | Analysis (sound characteristics) | Analysis (quality assessment) | Reasoning (environment reasoning) | Reasoning (inference judgement) | Reasoning (application context) | Overall |
|---|---|---|---|---|---|---|---|---|
| LALM | audio-flamingo2 | 45.1 | 46.3 | 34.9 | 37.5 | 44.0 | 42.4 | 41.7 |
| LALM | kimi-audio | 45.6 | 39.2 | 18.7 | 34.6 | 48.9 | 41.2 | 38.0 |
| LALM | qwen2.5-omni3b | 55.7 | 53.2 | 38.6 | 41.1 | 51.8 | 50.8 | 48.5 |
| LALM | qwen2.5-omni7b | 57.8 | 52.9 | 39.1 | 44.0 | 53.2 | 50.8 | 49.6 |
8.2.2 FENSE
| Model Type | Model Name | Perception (direct perception) | Analysis (sound characteristics) | Analysis (quality assessment) | Reasoning (environment reasoning) | Reasoning (inference judgement) | Reasoning (application context) | Overall |
|---|---|---|---|---|---|---|---|---|
| LALM | audio-flamingo2 | 39.1 | 39.0 | 37.4 | 41.3 | 35.5 | 35.8 | 38.0 |
| LALM | kimi-audio | 37.5 | 32.5 | 19.2 | 37.5 | 38.8 | 33.8 | 33.2 |
| LALM | qwen2.5-omni3b | 47.2 | 43.8 | 39.7 | 43.2 | 41.0 | 41.9 | 42.8 |
| LALM | qwen2.5-omni7b | 49.7 | 43.8 | 40.5 | 44.1 | 42.5 | 41.9 | 43.7 |
9. Acknowledgement
We have referred to the implementation of FENSE for the evaluation.
10. Contributing
Yadong Niu* · Tianzi Wang* · Heinrich Dinkel · Xingwei Sun · Jiahao Zhou · Gang Li · Jizhong Liu · Xunying Liu · Junbo Zhang · Jian Luan
*: Equal Contribution
11. Citation
@article{mecat2025,
  title={MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks},
  author={Niu, Yadong and Wang, Tianzi and Dinkel, Heinrich and Sun, Xingwei and Zhou, Jiahao and Li, Gang and Liu, Jizhong and Liu, Xunying and Zhang, Junbo and Luan, Jian},
  journal={arXiv preprint arXiv:2507.23511},
  year={2025}
}
12. License
The dataset of the project is drawn from a part of ACAV100M, under the Creative Commons Attribution 3.0 (CC BY 3.0) license.
The code of the project is under the Apache License 2.0.