Tencent Hunyuan and Partner Institutions Release MMAE, the First Audio Editing Benchmark

Tencent Hy@TencentHunyuan


2026-06-08 13:54

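Why it was featured

Speech and music generation have been hot over the past year, but audio editing had not been seriously tested. Tencent's benchmark lays the current state bare: an accuracy below 5% means the whole direction is still in its early stage.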

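AI Summary

Tencent Hunyuan, together with Shanghai Jiao Tong University, Nanyang Technological University, and other institutions, has released MMAE (Massive Multitask Audio Editing Benchmark), the first benchmark to comprehensively evaluate AI speech and audio editing. MMAE requires a model to understand an existing audio clip and modify it precisely according to natural-language instructions, rather than simply generate audio. Current models reach an Exact Match Rate (EMR) below 5% on the benchmark, exposing a gap in reliable audio editing. MMAE contains 2,000 high-fidelity samples from real-world scenarios and 17,741 fine-grained evaluation items, covering 7 modality settings across sound, music, speech, and their mixtures; 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing; and 8 operation types, from local to global. The paper, code, dataset, and demo are publicly available.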

Can AI truly edit audio, not just generate it? 🎧

Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, and other collaborators, introduces MMAE.

MMAE, a Massive Multitask Audio Editing Benchmark, is the first comprehensive evaluation benchmark for speech and audio "Banana🍌".

Instead of simply requiring the AI to "generate" audio, it demands that the AI understand an existing audio clip and precisely modify it according to natural language instructions, altering what needs to be changed while leaving the rest untouched.

Current models show an Exact Match Rate (EMR) below 5%, revealing a major gap in reliable audio editing.

MMAE includes:
✅ 2,000 high-fidelity samples from real-world scenarios
✅ 17,741 fine-grained rubric evaluation items
✅ 7 modality settings across sound, music, speech, and their mixtures
✅ 6 task complexity levels, from basic modifications to multi-hop reasoning and multi-round editing
✅ 8 operation types across local and global granularities
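The post does not spell out how EMR is computed from the rubric items. The sketch below is only one plausible reading, stated as an assumption: a sample counts as an exact match only when every one of its rubric items passes. The `SampleResult` structure, the function names, and the toy verdicts are all hypothetical and are not taken from the MMAE code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Rubric verdicts for one edited clip (hypothetical structure)."""
    sample_id: str
    rubric_passed: List[bool]  # one pass/fail verdict per rubric item


def exact_match_rate(results: List[SampleResult]) -> float:
    """Fraction of samples whose rubric items ALL pass (assumed definition)."""
    if not results:
        return 0.0
    matched = sum(
        1 for r in results if r.rubric_passed and all(r.rubric_passed)
    )
    return matched / len(results)


def item_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of individual rubric items that pass, pooled over samples."""
    total = sum(len(r.rubric_passed) for r in results)
    if total == 0:
        return 0.0
    passed = sum(sum(r.rubric_passed) for r in results)
    return passed / total


# Toy verdicts, invented purely for illustration.
results = [
    SampleResult("a", [True, True, True]),
    SampleResult("b", [True, False, True]),
    SampleResult("c", [True, True, False, True]),
    SampleResult("d", [False, True]),
]

emr = exact_match_rate(results)   # only sample "a" passes every item
ipr = item_pass_rate(results)     # 9 of 12 items pass

# Average rubric items per sample, from the counts quoted in the post.
avg_items = 17741 / 2000

print(f"EMR={emr:.2f}, item pass rate={ipr:.2f}, avg items/sample={avg_items:.2f}")
```

Under this assumed definition, an all-or-nothing score is much stricter than a per-item score: with roughly 8.9 rubric items per sample on average (17,741 / 2,000), a model can satisfy most items and still miss the exact match.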

How to use:
arXiv: http://arxiv.org/abs/2606.07229
GitHub: https://github.com/ddlBoJack/MMAE
HuggingFace: https://huggingface.co/datasets/BoJack/MMAE
Demo: https://youtu.be/6At5nTWhlXI

