Qwen3.5-LiveTranslate: From Sound to Sight, From Word to Right

2026-05-19 17:40 · 30 days ago · QwenTeam

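Why It Was Selected

This release takes simultaneous interpretation from "usable" to "good": language coverage jumps from 18 to 60, latency drops to 2.8 seconds, and visual context resolves ambiguity. Worth following for anyone working in international business or livestreaming.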

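AI Summary

Qwen3.5-LiveTranslate-Flash is the latest simultaneous interpretation model in the Qwen family. Built on the Qwen3.5-Omni architecture, it supports real-time multimodal translation (audio, video, and visual context). Language coverage expands substantially: input audio and output text grow from 18 to 60 languages, and output audio from 10 to 29. Using Readable Unit technology, average end-to-end per-token latency falls to 2.8 seconds; compared with its predecessor, first-token latency drops by 3.45 seconds and per-token latency by 1.88 seconds. It supports real-time voice cloning that starts from a single sentence, as well as dynamically configurable hotword enhancement. On the FLEURS and CoVoST2 benchmarks, its translation accuracy exceeds that of mainstream commercial large speech models.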


Qwen3.5-LiveTranslate: From Sound to Sight, From Word to Right

2026/05/19 · 5 minutes · 1070 words · QwenTeam

Qwen3.5-LiveTranslate-Flash is the latest simultaneous interpretation model in the Qwen family, built on top of Qwen3.5-Omni. It delivers real-time, multimodal translation that not only hears and translates speech, but also sees and understands visual context to produce more accurate translations. Compared with its predecessor Qwen3-LiveTranslate, Qwen3.5-LiveTranslate-Flash brings major upgrades across language coverage, latency, voice cloning, and terminology handling, making it well-suited for international meetings, livestream localization, online classrooms, and business negotiations.

Key Highlights#

Massively expanded language coverage: understands 18 → 60 languages, speaks 10 → 29 languages. The language support of input audio and output text has grown from 18 to 60, and output audio language support from 10 to 29, covering far more cross-lingual combinations to meet multilingual interpretation needs in international meetings, livestream localization, online classrooms, and business negotiations.

Ultra-low latency: powered by Readable Unit technology, faster text and speech output. A novel Readable Unit real-time translation technique achieves more aggressive streaming output while preserving translation readability and semantic consistency. Average speech-to-speech per-token latency is reduced to 2.8 seconds, which suits latency-sensitive scenarios such as livestreams, co-hosting, and press conferences.

Real-time voice cloning: one sentence to start, instantly “interpret in your voice”. During simultaneous interpretation, the system automatically replicates the speaker’s vocal characteristics, keeping the translated speech sounding like “the same person” across languages, enhancing immersion and identity consistency, especially critical for streamers, guests, and hosts.

Hotword enhancement: proper nouns and industry terms “recognized right, written right, translated right”. Built-in Hotword capability prioritizes the recognition and translation of names, places, brand names, product models, and industry terminology. Hotwords can be dynamically configured and updated in real time per scenario, significantly reducing terminology mistranslation risk, well-suited for technical launches, medical/legal/financial meetings, and enterprise training.
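The hotword mechanism can be pictured as a small glossary sent along with the session. The sketch below is a minimal illustration, assuming the `corpus.phrases` mapping that appears (commented out) in the DashScope sample in this post and the 1,000-entry limit listed in the comparison table. `build_hotword_corpus` and `MAX_HOTWORDS` are hypothetical names, not part of any SDK.

```python
MAX_HOTWORDS = 1000  # limit taken from the comparison table in this post


def build_hotword_corpus(glossary: dict[str, str]) -> dict:
    """Return a `corpus` dict shaped like the one in the API sample.

    Hypothetical helper: trims whitespace, drops empty entries, and
    rejects glossaries larger than MAX_HOTWORDS.
    """
    phrases = {}
    for source, target in glossary.items():
        source, target = source.strip(), target.strip()
        if source and target:
            phrases[source] = target
    if len(phrases) > MAX_HOTWORDS:
        raise ValueError(f"{len(phrases)} hotwords exceeds the limit of {MAX_HOTWORDS}")
    return {"phrases": phrases}


corpus = build_hotword_corpus(
    {"人工智能": "Artificial Intelligence", " 机器学习 ": "Machine Learning", "": "ignored"}
)
print(corpus)
```

Validating the glossary on the client side keeps a malformed entry from silently weakening terminology handling mid-session; whether the server performs its own validation is not stated in the post.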

Performance#

We evaluate Qwen3.5-LiveTranslate-Flash in both offline and real-time (streaming) settings.

Offline Translation#

On public multilingual speech translation benchmarks (FLEURS, CoVoST2), Qwen3.5-LiveTranslate-Flash achieves higher translation accuracy than mainstream commercial large speech models, significantly surpasses its predecessor Qwen3-LiveTranslate-Flash, and delivers breakthroughs in both language coverage and translation quality.


Real-Time Translation#

With the Readable Unit streaming strategy, Qwen3.5-LiveTranslate-Flash reduces first-token latency by 3.45 s and per-token latency by 1.88 s compared to Qwen3-LiveTranslate-Flash, achieving an average speech-to-speech per-token latency of 2.8 s, with virtually no loss in translation quality.
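As a quick sanity check on these figures, the predecessor's per-token latency can be back-computed. This is a sketch under the assumption that the 2.8 s average and the 1.88 s reduction refer to the same speech-to-speech per-token metric; the post does not give the predecessor's absolute numbers, so the result is implied rather than reported.

```python
current_per_token_s = 2.8     # reported average speech-to-speech per-token latency
per_token_reduction_s = 1.88  # reported reduction vs. Qwen3-LiveTranslate-Flash

# Assumption: both figures describe the same metric, so they can be added.
implied_previous_s = current_per_token_s + per_token_reduction_s
relative_reduction = per_token_reduction_s / implied_previous_s

print(f"implied previous per-token latency: {implied_previous_s:.2f} s")  # 4.68 s
print(f"relative reduction: {relative_reduction:.0%}")                    # 40%
```

The first-token figure (3.45 s lower) cannot be treated the same way, because no absolute first-token latency is given for either model.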


Model Architecture#

Qwen3.5-LiveTranslate is a large translation model built on the Qwen3.5-Omni Thinker-Talker architecture. The Thinker receives interleaved visual and audio inputs and generates text translations, while the Talker takes the translated text and source audio to produce speech with cross-lingual voice cloning. For real-time simultaneous interpretation, we adopt a chunk-wise streaming input mechanism and introduce Readable Unit tags to control speech synthesis granularity, effectively reducing interpretation latency. Meanwhile, dynamic cross-lingual voice cloning enables the model to preserve the speaker's original vocal characteristics during real-time translation.
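To make the role of Readable Unit tags concrete, here is a toy sketch of how a boundary tag in the translated text stream could decide when text is handed to speech synthesis. The tag string `<ru>` and the function `readable_units` are hypothetical; the post does not publish the actual tag format or the model's internal logic, so this only illustrates the general idea of synthesizing at sub-sentence granularity.

```python
from typing import Iterable, Iterator

RU_TAG = "<ru>"  # hypothetical boundary marker; the real tag format is not published


def readable_units(tokens: Iterable[str]) -> Iterator[str]:
    """Group a stream of translated text tokens into units ending at RU_TAG.

    Each yielded string is a chunk that could be passed to a speech
    synthesizer immediately instead of waiting for the full sentence.
    """
    buffer: list[str] = []
    for token in tokens:
        if token == RU_TAG:
            if buffer:
                yield "".join(buffer).strip()
                buffer = []
        else:
            buffer.append(token)
    if buffer:  # flush whatever is left when the stream ends
        yield "".join(buffer).strip()


stream = ["The ", "meeting ", "starts ", RU_TAG, "at ", "nine ", RU_TAG, "tomorrow."]
units = list(readable_units(stream))
print(units)  # ['The meeting starts', 'at nine', 'tomorrow.']
```

Smaller units mean speech can start earlier, at the cost of less context per synthesized chunk; the post reports that its approach keeps translation quality nearly unchanged.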

Qwen3.5-LiveTranslate model architecture overview

More Supported Languages#

Compared to Qwen3-LiveTranslate, Qwen3.5-LiveTranslate significantly expands language coverage. The support of input audio and output text grows from 18 to 60 languages, and output audio support from 10 to 29 languages, enabling a far wider range of cross-lingual translation combinations across global scenarios.

| | Qwen3-LiveTranslate | Qwen3.5-LiveTranslate |
| --- | --- | --- |
| Input Modality | Audio / Video | Audio / Video |
| Inference Mode | Offline / Streaming | Offline / Streaming |
| Voice Cloning | ✗ | ✓ (3 modes: pre-registered / clone-once / real-time) |
| Hotwords | Up to 1,000 | Up to 1,000 |
| Input Audio Languages & Output Text Languages | 18 languages: Chinese, English, Russian, French, German, Portuguese, Spanish, Italian, Indonesian, Korean, Japanese, Vietnamese, Thai, Arabic, Cantonese, Hindi, Greek, Turkish | 60 languages: Afrikaans, Arabic, Asturian, Azerbaijani, Basque, Belarusian, Bengali, Bosnian, Bulgarian, Cantonese, Catalan, Cebuano, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Interlingua, Italian, Japanese, Javanese, Kannada, Kazakh, Korean, Kyrgyz, Lingala, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Norwegian Bokmål, Nynorsk, Odia, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tajik, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Uyghur, Vietnamese |
| Output Audio Languages | 10 languages: Chinese, English, French, German, Russian, Italian, Spanish, Portuguese, Japanese, Korean | 29 languages: Chinese, English, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Thai, Indonesian, Arabic, Vietnamese, Turkish, Finnish, Polish, Hindi, Dutch, Czech, Urdu, Filipino, Swedish, Danish, Hebrew, Icelandic, Malay, Norwegian, Persian |
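Because speech output covers fewer languages than text output, a client may want to fall back to text-only mode for some targets. The sketch below is one possible way to do that, using the 29 output audio languages from the table above and the `modalities` values from the API sample in this post. `choose_modalities` is a hypothetical helper, and it works on English language names because the post does not list the API's language codes for all languages.

```python
# Output audio languages for Qwen3.5-LiveTranslate, copied from the table above.
OUTPUT_AUDIO_LANGUAGES = {
    "Chinese", "English", "German", "Italian", "Portuguese", "Spanish",
    "Japanese", "Korean", "French", "Russian", "Thai", "Indonesian",
    "Arabic", "Vietnamese", "Turkish", "Finnish", "Polish", "Hindi",
    "Dutch", "Czech", "Urdu", "Filipino", "Swedish", "Danish", "Hebrew",
    "Icelandic", "Malay", "Norwegian", "Persian",
}


def choose_modalities(target_language: str) -> list[str]:
    """Request audio only when the target language has speech output.

    Hypothetical helper: it does not check that the language is also a
    supported text output language.
    """
    if target_language in OUTPUT_AUDIO_LANGUAGES:
        return ["text", "audio"]
    return ["text"]


print(choose_modalities("Thai"))     # ['text', 'audio']
print(choose_modalities("Swahili"))  # ['text']
```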

🎬 See It in Action#

International Meeting#

A multilingual business meeting where participants speak in different languages and switch between them mid-sentence. Qwen3.5-LiveTranslate handles code-switching, diverse accents, and domain-specific terminology in real time — delivering fluent, natural translations without missing a beat.

Video 1

Traveling Abroad#

A real-world travel scenario powered by Qwen AI Glasses: a Chinese tourist orders food at a local restaurant in Thailand. The model performs live Thai-to-Chinese translation on-device, combining visual context from the menu with spoken dialogue to produce accurate, context-aware translations — making cross-language communication effortless on the go.

Video 2

Livestream Scenarios#

E-commerce livestream translation scenario. Qwen3.5-LiveTranslate accurately translates product specifications and numerical information, ensuring precise cross-language delivery of product parameters.

Video 3

Classical Chinese Translation#

A scene from Romance of the Three Kingdoms narrated in classical Chinese (文言文). Qwen3.5-LiveTranslate accurately interprets and translates archaic Chinese prose into modern English, demonstrating its ability to handle literary and historical language beyond everyday speech.

Video 4

Visual Disambiguation#

Qwen3.5-LiveTranslate leverages visual context to resolve translation ambiguities. When a word or phrase has multiple possible meanings, the model uses what it sees — on-screen text, objects, or scene context — to select the correct interpretation, producing translations that are both accurate and contextually grounded.
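On the API side, visual context is supplied as image frames sent alongside the audio. The following sketch builds one such message, modeled on the `send_image_frame` method of the sample client in this post. The underscore spelling of `event_id` and `input_image_buffer.append` is an assumption, `build_image_event` is a hypothetical helper, and the post does not specify accepted image formats or a recommended frame rate.

```python
import base64
import json
import time


def build_image_event(image_bytes: bytes) -> str:
    """Serialize one image frame as a JSON event string (hypothetical helper)."""
    if not image_bytes:
        raise ValueError("image_bytes cannot be empty")
    event = {
        "event_id": f"event_{int(time.time() * 1000)}",  # millisecond timestamp as ID
        "type": "input_image_buffer.append",             # assumed event type name
        "image": base64.b64encode(image_bytes).decode(),
    }
    return json.dumps(event)


# Placeholder bytes standing in for a real encoded frame read from a camera or file.
fake_frame = b"\xff\xd8\xff\xe0 placeholder frame bytes"
payload = json.loads(build_image_event(fake_frame))
print(payload["type"], len(payload["image"]))
```

Keeping message construction separate from the WebSocket send makes it easy to unit-test the payload without a live connection.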

Video 5

Using Qwen3.5-LiveTranslate via DashScope API#

```python
import os
import time
import base64
import asyncio
import json
import websockets
import pyaudio
import queue
import threading
import traceback


class LiveTranslateClient:
    """Client for the DashScope live-translation service: captures mic audio, sends it to the server, and plays back the translated speech."""

    def __init__(self, api_key: str, target_language: str = "en", *, audio_enabled: bool = True):
        if not api_key:
            raise ValueError("API key cannot be empty.")
        self.api_key = api_key
        self.target_language = target_language
        self.audio_enabled = audio_enabled
        self.ws = None
        self.api_url = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-livetranslate-flash-realtime"

        # Audio input parameters (microphone capture)
        self.input_rate = 16000
        self.input_chunk = 1600
        self.input_format = pyaudio.paInt16
        self.input_channels = 1

        # Audio output parameters (local playback)
        self.output_rate = 24000
        self.output_chunk = 2400
        self.output_format = pyaudio.paInt16
        self.output_channels = 1

        # Runtime state and playback resources
        self.is_connected = False
        self.audio_player_thread = None
        self.audio_playback_queue = queue.Queue()
        self.pyaudio_instance = pyaudio.PyAudio()

    async def connect(self):
        """Open a WebSocket connection to the translation service."""
        headers = {"Authorization": f"Bearer {self.api_key}"}
        try:
            self.ws = await websockets.connect(self.api_url, additional_headers=headers)
            self.is_connected = True
            print(f"Successfully connected to server: {self.api_url}")
            await self.configure_session()
        except Exception as e:
            print(f"Connection failed: {e}")
            self.is_connected = False
            raise

    async def configure_session(self):
        """Configure the translation session: target language, audio formats, and optional features."""
        config = {
            "event_id": f"event_{int(time.time() * 1000)}",
            "type": "session.update",
            "session": {
                # modalities decides what the server returns:
                #   ["text", "audio"]: both translated text and synthesized speech (recommended)
                #   ["text"]: translated text only
                "modalities": ["text", "audio"] if self.audio_enabled else ["text"],
                "input_audio_format": "pcm",
                "output_audio_format": "pcm",
                # input_audio_transcription: enable source-language ASR.
                # Setting model to 'qwen3-asr-flash-realtime' also streams back the source transcript.
                # "input_audio_transcription": {
                #     "model": "qwen3-asr-flash-realtime",
                #     "language": "zh"  # source language; defaults to 'en'
                # },
                "translation": {
                    "language": self.target_language,
                    # corpus: register hotwords to boost accuracy on proper nouns and domain-specific terms.
                    # "corpus": {
                    #     "phrases": {
                    #         "人工智能": "Artificial Intelligence",
                    #         "机器学习": "Machine Learning"
                    #     }
                    # }
                }
            }
        }
        print(f"Sending session config: {json.dumps(config, indent=2, ensure_ascii=False)}")
        await self.ws.send(json.dumps(config))

    async def send_audio_chunk(self, audio_data: bytes):
        """Base64-encode an audio chunk and send it to the server."""
        if not self.is_connected:
            return
        event = {
            "event_id": f"event_{int(time.time() * 1000)}",
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(audio_data).decode()
        }
        await self.ws.send(json.dumps(event))

    async def send_image_frame(self, image_bytes: bytes, *, event_id: str | None = None):
        """Send an image frame to the server as visual context for translation."""
        if not self.is_connected:
            return
        if not image_bytes:
            raise ValueError("image_bytes cannot be empty")
        image_b64 = base64.b64encode(image_bytes).decode()
        event = {
            "event_id": event_id or f"event_{int(time.time() * 1000)}",
            "type": "input_image_buffer.append",
            "image": image_b64,
        }
        await self.ws.send(json.dumps(event))

    def audio_player_task(self):
        """Background thread task: drain PCM chunks from the playback queue and write them to the speaker output stream."""
        stream = self.pyaudio_instance.open(
            format=self.output_format,
            channels=self.output_channels,
            rate=self.output_rate,
            output=True,
            frames_per_buffer=self.output_chunk,
        )
        try:
            while self.is_connected or not self.audio_playback_queue.empty():
                try:
                    audio_chunk = self.audio_playback_queue.get(timeout=0.1)
                    if audio_chunk is None:  # sentinel: stop the playback loop
                        break
                    stream.write(audio_chunk)
                    self.audio_playback_queue.task_done()
                except queue.Empty:
                    continue
        finally:
            stream.stop_stream()
            stream.close()

    def start_audio_player(self):
        """Spin up the background audio playback thread (no-op when audio output is disabled)."""
        if not self.audio_enabled:
            return
        if self.audio_player_thread is None or not self.audio_player_thread.is_alive():
            self.audio_player_thread = threading.Thread(target=self.audio_player_task, daemon=True)
            self.audio_player_thread.start()

    async def handle_server_messages(self, on_text_received):
        """Continuously receive and dispatch event messages pushed by the server."""
        try:
            async for message in self.ws:
                event = json.loads(message)
                event_type = event.get("type")
                if event_type == "response.audio.delta" and self.audio_enabled:
                    audio_b64 = event.get("delta", "")
                    if audio_b64:
                        audio_data = base64.b64decode(audio_b64)
                        self.audio_playback_queue.put(audio_data)
                elif event_type == "response.done":
                    print("\n[INFO] Response complete.")
                    usage = event.get("response", {}).get("usage", {})
                    if usage:
                        print(f"[INFO] Token usage: {json.dumps(usage, indent=2, ensure_ascii=False)}")
                # Receive source-language ASR results (requires input_audio_transcription.model to be enabled)
                # elif event_type == "conversation.item.input_audio_transcription.text":
                #     stash = event.get("stash", "")  # streaming partial result, not yet finalized
                #     print(f"[Recognizing] {stash}")
                # elif event_type == "conversation.item.input_audio_transcription.completed":
                #     transcript = event.get("transcript", "")  # final transcript for an utterance
                #     print(f"[Source] {transcript}")
                # In voice + text mode, the translation text arrives alongside synthesized audio under the transcript field
                elif event_type == "response.audio_transcript.done":
                    print("\n[INFO] Translation complete.")
                    text = event.get("transcript", "")
                    if text:
                        print(f"[INFO] Translation: {text}")
                # In text-only mode, the translation arrives via response.text.done under the text field
                elif event_type == "response.text.done":
                    print("\n[INFO] Translation complete.")
                    text = event.get("text", "")
                    if text:
                        print(f"[INFO] Translation: {text}")
        except websockets.exceptions.ConnectionClosed as e:
            print(f"[WARNING] Connection closed: {e}")
            self.is_connected = False
        except Exception as e:
            print(f"[ERROR] Unknown error during message handling: {e}")
            traceback.print_exc()
            self.is_connected = False

    async def start_microphone_streaming(self):
        """Continuously capture microphone audio and stream it to the server in real time."""
        stream = self.pyaudio_instance.open(
            format=self.input_format,
            channels=self.input_channels,
            rate=self.input_rate,
            input=True,
            frames_per_buffer=self.input_chunk
        )
        print("Microphone started, please begin speaking...")
        try:
            while self.is_connected:
                audio_chunk = await asyncio.get_event_loop().run_in_executor(
                    None, stream.read, self.input_chunk
                )
                await self.send_audio_chunk(audio_chunk)
        finally:
            stream.stop_stream()
            stream.close()

    async def close(self):
        """Gracefully close the WebSocket connection and release audio resources."""
        self.is_connected = False
        if self.ws:
            await self.ws.close()
            print("WebSocket connection closed.")
        if self.audio_player_thread:
            self.audio_playback_queue.put(None)  # signal the playback thread to exit
            self.audio_player_thread.join(timeout=1)
            print("Audio playback thread stopped.")
        self.pyaudio_instance.terminate()
        print("PyAudio instance released.")


def print_banner():
    print("=" * 60)
    print(" Powered by Qwen qwen3.5-livetranslate-flash-realtime")
    print("=" * 60 + "\n")


def get_user_config():
    """Collect runtime parameters from the user via CLI: output mode and target language."""
    print("Select mode:")
    print("1. Voice + Text [default] | 2. Text only")
    mode_choice = input("Enter option (press Enter for Voice + Text): ").strip()
    audio_enabled = (mode_choice != "2")

    if audio_enabled:
        lang_map = {
            "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt",
            "7": "es", "8": "it", "9": "ko", "10": "ja", "11": "yue"
        }
        print("Select target translation language (Voice + Text mode):")
        print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Korean | 10. Japanese | 11. Cantonese")
    else:
        lang_map = {
            "1": "en", "2": "zh", "3": "ru", "4": "fr", "5": "de", "6": "pt",
            "7": "es", "8": "it", "9": "id", "10": "ko", "11": "ja", "12": "vi",
            "13": "th", "14": "ar", "15": "yue", "16": "hi", "17": "el", "18": "tr"
        }
        print("Select target translation language (Text only mode):")
        print("1. English | 2. Chinese | 3. Russian | 4. French | 5. German | 6. Portuguese | 7. Spanish | 8. Italian | 9. Indonesian | 10. Korean | 11. Japanese | 12. Vietnamese | 13. Thai | 14. Arabic | 15. Cantonese | 16. Hindi | 17. Greek | 18. Turkish")

    choice = input("Enter option (default is the first one): ").strip()
    target_language = lang_map.get(choice, next(iter(lang_map.values())))
    return target_language, audio_enabled


async def main():
    """Program entry point: connect, configure the session, and drive the live-translation loop."""
    print_banner()
    api_key = os.environ.get("DASHSCOPE_API_KEY")
    if not api_key:
        print("[ERROR] Please set the environment variable DASHSCOPE_API_KEY")
        print(" Example: export DASHSCOPE_API_KEY='your_api_key_here'")
        return

    target_language, audio_enabled = get_user_config()
    print("\nConfiguration complete:")
    print(f" - Target language: {target_language}")
    if not audio_enabled:
        print(" - Output mode: Text only")

    client = LiveTranslateClient(api_key=api_key, target_language=target_language, audio_enabled=audio_enabled)

    # Callback fired as translated text arrives: stream it to stdout, character by character
    def on_translation_text(text):
        print(text, end="", flush=True)

    try:
        print("Connecting to the translation service...")
        await client.connect()

        # Launch the audio playback thread (only does real work when audio output is enabled)
        client.start_audio_player()

        print("\n" + "-" * 60)
        print("Connected! Please speak into the microphone.")
        print("The program will translate your speech in real time and play the results. Press Ctrl+C to exit.")
        print("-" * 60 + "\n")

        # Run two coroutines concurrently: server-message handling + microphone audio upload
        message_handler = asyncio.create_task(client.handle_server_messages(on_translation_text))
        tasks = [message_handler]

        # Microphone capture is the translation input source, required regardless of output mode
        microphone_streamer = asyncio.create_task(client.start_microphone_streaming())
        tasks.append(microphone_streamer)

        await asyncio.gather(*tasks)
    except KeyboardInterrupt:
        print("\n\nUser interrupted, exiting...")
    except Exception as e:
        print(f"\nFatal error occurred: {e}")
    finally:
        print("\nCleaning up resources...")
        await client.close()
        print("Program exited.")


if __name__ == "__main__":
    asyncio.run(main())
```
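The chunk sizes in the sample client determine how often audio is sent and played. The short calculation below works this out from the values in the sample (16 kHz capture with 1600-frame chunks, 24 kHz playback with 2400-frame chunks, 16-bit mono PCM). `chunk_stats` is a hypothetical helper added only for illustration.

```python
import base64


def chunk_stats(sample_rate_hz: int, frames_per_chunk: int,
                bytes_per_sample: int = 2, channels: int = 1) -> tuple[float, int]:
    """Return (duration in ms, raw size in bytes) for one PCM chunk."""
    duration_ms = 1000 * frames_per_chunk / sample_rate_hz
    size_bytes = frames_per_chunk * bytes_per_sample * channels
    return duration_ms, size_bytes


mic = chunk_stats(16000, 1600)      # microphone capture settings from the sample
speaker = chunk_stats(24000, 2400)  # playback settings from the sample
print(mic)      # (100.0, 3200)
print(speaker)  # (100.0, 4800)

# Base64 encoding grows the payload by about one third before it goes into JSON.
encoded_len = len(base64.b64encode(bytes(mic[1])))
print(encoded_len)  # 4268
```

Both directions therefore work in 100 ms chunks, so the client sends roughly ten audio messages per second while the microphone is open.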

Future Directions#

We will continue exploring the capability boundaries of multimodal translation and focus on the following directions:

Lower latency: keep reducing end-to-end simultaneous interpretation latency toward real-time experience limits.

More languages and dialects: expand input/output coverage for low-resource languages, regional dialects, and cross-regional expressions.

Longer context and stronger consistency: maintain terminology, names, and context consistency in long meetings and multi-turn dialogues.

Higher-fidelity voice cloning: preserve speaker characteristics while restoring ambient sounds and on-site atmosphere more naturally.

Richer interaction modes: support multilingual, mixed-dialect expression, speaker separation, and joint multimodal modeling with gestures, lip movement, and expressions.

Citation#

Feel free to cite the following article if you find Qwen3.5-LiveTranslate helpful:

```bibtex
@misc{qwen35livetranslateblog,
    title = {Qwen3.5-LiveTranslate: From Sound to Sight, From Word to Right},
    url = {https://qwen.ai/blog?id=qwen3.5-livetranslate},
    author = {Qwen Team},
    month = {May},
    year = {2026}
}
```
