OpenAI releases new voice models, bringing GPT-5-level reasoning to real-time conversations

2026-05-08 02:44 · Matthias Bastian

AI summary

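OpenAI has released three new voice models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. GPT-Realtime-2 offers real-time reasoning on par with GPT-5 and is designed for smoother, smarter live conversations. GPT-Realtime-Translate supports real-time translation from more than 70 input languages, while GPT-Realtime-Whisper focuses on real-time speech transcription. Together, the models mark significant progress for OpenAI in real-time audio processing and interaction, and they are expected to noticeably improve cross-language communication and voice applications.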

Original article

OpenAI's new voice model brings GPT-5-level reasoning to real-time conversations

OpenAI is shipping GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, a new generation of voice models built to reason, translate, and transcribe on the fly.

ChatGPT has had an audio mode for a while, and Google offers a similar real-time conversation feature through Gemini. But the models behind these voice interactions have been significantly weaker than their text-only counterparts, especially compared to text reasoning models that take time to think through problems.

According to OpenAI, that's no longer cutting it. A modern voice agent needs to understand what someone actually means, keep track of context, roll with changes, use tools, and respond appropriately, all at the same time.

The company came up with three new interaction patterns that can also be combined. With "Voice-to-Action," a user describes what they need out loud, and the system reasons through the request, calls the right tools, and gets the job done.

With "Systems-to-Voice," software turns context into spoken guidance. A travel app could tell a passenger that their connecting flight is still reachable despite a delay, give them the fastest route to the new gate, and confirm their luggage transfer.

With "Voice-to-Voice," AI helps people carry on live conversations across language barriers. Deutsche Telekom is already testing this pattern for customer support.

These features are also coming soon to ChatGPT's audio mode, OpenAI suggests. According to the company, "Voice can truly become the primary interface now."

GPT-Realtime-2 buys thinking time with stalling tricks

The centerpiece of the release is GPT-Realtime-2, which OpenAI says brings reasoning on par with GPT-5. The model is built for live voice interactions where it needs to hold a conversation, think through requests, call tools, and handle interruptions all at once.

On the technical side, the context window jumps from 32,000 to 128,000 tokens, which should support longer and more complex conversations. The model can call multiple tools in parallel and make those actions audible with phrases like "let me check that." Short lead-in sentences called preambles, such as "one moment," let the user know the system is working. When something goes wrong, the model no longer goes silent. Instead, it says things like "I'm having trouble with that right now."
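The pattern described here (announce the work, run tools in parallel, speak up on failure) can be sketched in a few lines of Python. This is only an illustration of the behavior, not OpenAI's implementation or the Realtime API: the tool functions, the flight number, and the `say` callback are hypothetical stand-ins, and only the spoken phrases come from the article.

```python
import asyncio

# Phrases quoted in the article; everything else below is hypothetical.
PREAMBLE = "One moment, let me check that."
FALLBACK = "I'm having trouble with that right now."


async def lookup_flight(flight: str) -> str:
    """Hypothetical tool: pretends to query a flight status service."""
    await asyncio.sleep(0.01)  # stand-in for a network call
    return f"{flight} is on time"


async def lookup_gate(flight: str) -> str:
    """Hypothetical tool that fails, to exercise the fallback path."""
    await asyncio.sleep(0.01)
    raise TimeoutError("gate service unavailable")


async def handle_turn(tools, say) -> list:
    """Speak a preamble, run all tools concurrently, and report failures aloud."""
    say(PREAMBLE)
    # return_exceptions=True keeps one failing tool from cancelling the others.
    results = await asyncio.gather(*(tool() for tool in tools), return_exceptions=True)
    if any(isinstance(r, Exception) for r in results):
        say(FALLBACK)
    return [r for r in results if not isinstance(r, Exception)]
```

Calling `asyncio.run(handle_turn([lambda: lookup_flight("LH123"), lambda: lookup_gate("LH123")], print))` would speak the preamble, run both lookups at once, and then voice the fallback because one of them failed.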

OpenAI says the model is better at handling specialized terminology, proper names, and medical terms than its predecessor. Tone of voice is more controllable too: calm during problem-solving, empathetic with frustrated users, and upbeat after successful actions.

Developers can dial reasoning intensity across five levels: minimal, low, medium, high, and xhigh. The default is "low" to keep latency down for simple requests, while tougher tasks can tap into more compute.
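As a rough sketch of how an application might treat these levels, the snippet below models them as an ordered ladder with "low" as the default. The level names and the default come from the article; the helper functions `resolve_level` and `escalate` are hypothetical and do not correspond to any documented API parameter.

```python
from typing import Optional

# Level names and the default are taken from the article.
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_LEVEL = "low"


def resolve_level(requested: Optional[str] = None) -> str:
    """Return a valid level, falling back to the default when none is given."""
    if requested is None:
        return DEFAULT_LEVEL
    if requested not in REASONING_LEVELS:
        raise ValueError(f"unknown reasoning level: {requested!r}")
    return requested


def escalate(level: str, steps: int = 1) -> str:
    """Move up the ladder by `steps`, capping at the highest level."""
    index = REASONING_LEVELS.index(resolve_level(level))
    return REASONING_LEVELS[min(index + steps, len(REASONING_LEVELS) - 1)]
```

An app could start every session at the default and call `escalate` only when a request looks difficult, trading latency for more compute on those turns.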

On benchmarks, GPT-Realtime-2 outperforms its predecessor, GPT-Realtime-1.5. At the "high" setting, it hits 96.6 percent accuracy on Big Bench Audio, up from 81.4 percent. On Audio MultiChallenge, which tests instruction-following in multi-turn dialogues, the "xhigh" variant reaches a 48.5 percent average pass rate, compared with 34.7 percent for the predecessor.

Live translation covers 70+ languages, real-time transcription targets meetings and workflows

GPT-Realtime-Translate is a standalone live translation model that handles more than 70 input languages and 13 output languages, according to OpenAI. It preserves meaning while keeping pace with the speaker, even when dealing with context switches, regional accents, and specialized vocabulary. Use cases include customer support, cross-border sales, education, events, and media.

The third model, GPT-Realtime-Whisper, is a low-latency streaming transcription model. It transcribes speech as it happens, targeting live captions for meetings, classrooms, broadcasts, and events. Teams can use it to generate notes and summaries while conversations are still going, build voice agents with continuous speech understanding, and spin up faster follow-up workflows for customer support, healthcare, sales, and recruiting.

Pricing runs on tokens and minutes

All three models are available now through the Realtime API and can be tested in the Playground. GPT-Realtime-2 costs $32 per million audio input tokens ($0.40 for cached input tokens) and $64 per million audio output tokens. GPT-Realtime-Translate runs $0.034 per minute, and GPT-Realtime-Whisper comes in at $0.017 per minute.
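To get a feel for what these rates mean in practice, here is a small cost estimator. The prices are the ones quoted above; the function names are illustrative, and the sketch assumes cached input tokens are counted separately from regular input tokens, which the article does not spell out.

```python
# Prices quoted in the article (USD). Function names are illustrative only.
AUDIO_INPUT_PER_M = 32.00    # per million audio input tokens
CACHED_INPUT_PER_M = 0.40    # per million cached input tokens
AUDIO_OUTPUT_PER_M = 64.00   # per million audio output tokens
TRANSLATE_PER_MIN = 0.034    # GPT-Realtime-Translate, per minute
WHISPER_PER_MIN = 0.017      # GPT-Realtime-Whisper, per minute


def realtime2_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimated GPT-Realtime-2 audio cost in USD.

    Assumption: `cached_tokens` are billed at the cached rate and are NOT
    included in `input_tokens`.
    """
    total = (
        input_tokens * AUDIO_INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * AUDIO_OUTPUT_PER_M
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimated cost in USD for the per-minute models."""
    return minutes * rate_per_minute
```

Under these assumptions, an hour of live translation is `per_minute_cost(60, TRANSLATE_PER_MIN)`, and an hour of streaming transcription costs half that, since the per-minute rate is half.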

The Realtime API supports EU data residency for EU-based applications and is covered by OpenAI's enterprise privacy commitments.

The Decoder: AI News (RSS)