Integrating the Qwen Realtime Model
A complete guide to wiring Qwen-Omni-Realtime (China region, Beijing) into tio-boot via the DashScope Java SDK.
Intended for Java developers building a WebSocket voice pipeline on tio-boot (the browser sends 16k PCM upstream, the server forwards it, and the browser plays 24k PCM downstream).
1. Goals and Overall Architecture
Target capabilities:
- The browser captures microphone audio (16 kHz, PCM16, mono) and streams it to tio-boot in real time
- tio-boot opens a session with DashScope Realtime (Qwen-Omni-Realtime) and continuously appends audio
- The model returns, in real time:
  - Input transcription (ASR)
  - Output transcription (subtitles)
  - Output audio (TTS; Qwen3-Omni-Flash-Realtime supports pcm24)
  - Lifecycle events (speech_started / speech_stopped / response.done, etc.)
Data flow:
Browser ↔ tio-boot WebSocket ↔ QwenOmniRealtimeBridge (DashScope Java SDK) ↔ DashScope Realtime service
2. Prerequisites
2.1 Region and Endpoint (China, Beijing)
- WebSocket endpoint (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime (the international Singapore endpoint uses dashscope-intl; this document covers Beijing only)
- The connection requires a model query parameter, e.g. ?model=qwen3-omni-flash-realtime
- Authentication: HTTP header Authorization: Bearer DASHSCOPE_API_KEY (supplied via apikey in the SDK)
The connection and parameter requirements above come from the Realtime/Omni-Realtime documentation and samples. ([AlibabaCloud][1])
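As a concrete illustration of the endpoint rules above, the full connection URL can be assembled as follows. This is only a sketch: the DashScope SDK builds the URL itself when you pass `url` and `model`, so `RealtimeEndpoint` is a hypothetical helper, not SDK API.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RealtimeEndpoint {
  // Beijing-region realtime endpoint (Singapore uses dashscope-intl instead).
  static final String BEIJING_WS = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime";

  // Append the required ?model= query parameter.
  public static String buildUrl(String model) {
    return BEIJING_WS + "?model=" + URLEncoder.encode(model, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(buildUrl("qwen3-omni-flash-realtime"));
  }
}
```

The Authorization: Bearer header is supplied separately by the SDK from the apikey parameter; it is never part of the URL.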
2.2 DashScope Java SDK Version
Install the SDK: https://bailian.console.aliyun.com/cn-beijing/?tab=api#/api/?type=model&url=2712193
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>2.22.10</version>
</dependency>
If you run into a logging-library conflict, exclude slf4j-simple:
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>2.22.10</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
</exclusion>
</exclusions>
</dependency>
2.3 API Key
- Environment variable: DASHSCOPE_API_KEY
- Calls to the Beijing region require a Beijing-region key (Beijing and Singapore keys are separate, per the official region documentation). ([AlibabaCloud][4])
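A missing or wrong-region key otherwise only surfaces at connect time, so it is worth failing fast at startup. A minimal sketch (the `requireApiKey` helper is hypothetical; it takes the env map as a parameter so it can be tested):

```java
import java.util.Map;

public class ApiKeyCheck {
  // Read DASHSCOPE_API_KEY from an env map; fail fast when it is missing or blank.
  public static String requireApiKey(Map<String, String> env) {
    String key = env.get("DASHSCOPE_API_KEY");
    if (key == null || key.trim().isEmpty()) {
      throw new IllegalStateException("DASHSCOPE_API_KEY is not set (a Beijing-region key is required)");
    }
    return key;
  }

  public static void main(String[] args) {
    // In production you would pass System.getenv() here.
    System.out.println(requireApiKey(Map.of("DASHSCOPE_API_KEY", "sk-demo")));
  }
}
```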
3. Model and Session Parameters
3.1 Model
Start with the stable version:
qwen3-omni-flash-realtime
(Dated snapshot versions are also available; for production, pin a stable or fixed version.)
3.2 Session Configuration (session.update, i.e. updateSession in the SDK)
Key fields (consistent with the official event documentation):
- modalities: ["text","audio"], or ["text"] only
- voice: e.g. Cherry
- input_audio_format: only pcm16 is supported
- output_audio_format: Qwen3-Omni-Flash-Realtime supports pcm24
- instructions: system instructions
- turn_detection:
  - server_vad: the server detects speech start/stop automatically (recommended for call scenarios)
  - null: Manual mode; the client must send commit + response.create
Semantics of input_audio_buffer.commit: it only submits the buffer and does not trigger a reply by itself. The server responds with input_audio_buffer.committed; to get a reply you still need response.create (Manual mode). ([AlibabaCloud][5])
4. Mapping the tio-boot Protocol onto DashScope Realtime
The existing frontend/backend protocol (from the Gemini Live implementation) is:
- Binary, browser → backend: raw 16k PCM16 stream
- Binary, backend → browser: raw 24k PCM16 stream (played directly)
- JSON control messages: setup / text / audio_end / close
The DashScope Omni-Realtime protocol (server side) instead requires:
- input_audio_buffer.append: a JSON event; the audio must be a Base64 string
- Output audio arrives as response.audio.delta: a JSON event whose audio payload is a Base64 string
- All transcription and lifecycle signals are also JSON events (conversation.item.input_audio_transcription.completed, response.audio_transcript.delta, response.done, etc.)
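The format mismatch between the two protocols therefore reduces to a Base64 round trip on the backend: encode raw PCM before `input_audio_buffer.append`, and decode the `delta` payload of `response.audio.delta` before forwarding it as binary. A sketch using only `java.util.Base64`:

```java
import java.util.Arrays;
import java.util.Base64;

public class PcmBase64 {
  // Browser → DashScope: raw PCM16 bytes must be Base64 for input_audio_buffer.append.
  public static String encodeForAppend(byte[] pcm) {
    return Base64.getEncoder().encodeToString(pcm);
  }

  // DashScope → browser: the "delta" field of response.audio.delta is Base64 PCM.
  public static byte[] decodeDelta(String base64Delta) {
    return Base64.getDecoder().decode(base64Delta);
  }

  public static void main(String[] args) {
    byte[] pcm = {0, 1, 2, 3};
    System.out.println(Arrays.equals(pcm, decodeDelta(encodeForAppend(pcm)))); // true
  }
}
```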
4.1 The Minimal Change: Leave the Frontend Alone, Transcode and Bridge Events in the Backend
- The browser keeps sending binary PCM16 to tio-boot
- In onBytes, tio-boot:
  - Base64-encodes the PCM chunk
  - Calls conversation.appendAudio(base64) (or the equivalent method) via the DashScope SDK
- When DashScope returns response.audio.delta:
  - Base64-decode it to get 24k PCM bytes
  - Send them to the browser as binary via tio-boot, as before
- For DashScope transcription events:
  - Convert them into the existing WsVoiceAgentResponseMessage (type="transcript_in/out") and send them to the frontend
This way index.html / app.js / mic-worklet.js need essentially no changes.
4.2 Redefining audio_end
Today, audio_end in the Gemini Live flow means "tell the model the audio stream has ended". With DashScope:
- In server_vad mode: audio_end is generally unnecessary; the server commits automatically once silence exceeds the threshold and triggers a response
- In Manual mode: map audio_end to input_audio_buffer.commit followed by response.create
This matches the official Manual interaction flow. ([AlibabaCloud][6])
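To make the Manual-mode ordering concrete, here is a toy model (not SDK code) of one turn: append one or more chunks, then let audio_end trigger commit plus response.create. Committing an empty buffer is rejected, mirroring the empty-buffer errors mentioned later in the debugging checklist.

```java
import java.util.ArrayList;
import java.util.List;

// Toy state machine for one Manual-mode turn; records the events it would emit.
public class ManualTurn {
  private final List<byte[]> buffer = new ArrayList<>();
  private final List<String> emitted = new ArrayList<>();

  public void appendAudio(byte[] pcm) {
    buffer.add(pcm);
    emitted.add("input_audio_buffer.append");
  }

  // The mapping of the frontend's audio_end in Manual mode.
  public void onAudioEnd() {
    if (buffer.isEmpty()) {
      throw new IllegalStateException("commit on an empty audio buffer");
    }
    emitted.add("input_audio_buffer.commit");
    emitted.add("response.create");
    buffer.clear();
  }

  public List<String> emitted() {
    return emitted;
  }
}
```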
5. Configuration
5.1 Environment Variable
export DASHSCOPE_API_KEY=xxxx
6. Backend Implementation: QwenOmniRealtimeBridge (Replacing GeminiLiveBridge)
- Keep the RealtimeBridgeCallback interface unchanged (reusing the existing WebSocket send logic)
- Add QwenOmniRealtimeBridge, which internally uses the DashScope Java SDK's OmniRealtimeConversation
- In VoiceSocketHandler, replace GeminiLiveBridge with QwenOmniRealtimeBridge
6.1 RealtimeModelBridge
package com.litongjava.voice.agent.bridge;
import java.util.concurrent.CompletableFuture;
/**
 * Unified abstraction: every realtime voice/audio model bridge implements this interface.
 * VoiceSocketHandler depends only on this interface.
 */
public interface RealtimeModelBridge {
// Implementations are expected to provide a constructor like:
// (RealtimeBridgeCallback sender, String url, String model, String voiceName)
/**
 * Establish the session connection to the model and apply the necessary session/setup configuration.
 */
CompletableFuture<Void> connect(RealtimeSetup setup);
/**
 * Send a chunk of upstream audio (PCM16, 16 kHz, mono, little-endian, pushed from the browser).
 */
CompletableFuture<Void> sendPcm16k(byte[] pcm16k);
/**
 * End the current audio input / trigger model generation:
 * - Gemini: sendAudioStreamEnd()
 * - Qwen Manual: commit + response.create
 * - Qwen server VAD: may be a no-op (or a manual-trigger fallback)
 */
CompletableFuture<Void> endAudioInput();
/**
 * Optional: send text input (chat box).
 */
CompletableFuture<Void> sendText(String text);
/**
 * Close the session and release resources.
 */
CompletableFuture<Void> close();
}
6.2 QwenOmniRealtimeBridge.java
package com.litongjava.voice.agent.bridge;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConfig;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;
import com.alibaba.dashscope.audio.omni.OmniRealtimeModality;
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import com.litongjava.tio.utils.environment.EnvUtils;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.tio.utils.json.JsonUtils;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class QwenOmniRealtimeBridge implements RealtimeModelBridge {
// China region (Beijing)
private String url = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime";
private String model = "qwen3-omni-flash-realtime";
private String voiceName = "Cherry";
private final RealtimeBridgeCallback callback;
private volatile OmniRealtimeConversation conversation;
private final AtomicBoolean connected = new AtomicBoolean(false);
// Upstream from the frontend is 16k PCM16; Qwen3-Omni-Flash-Realtime downstream is typically 24k PCM16 (pcm24).
// Note: audio in DashScope events is Base64; we still send raw bytes (binary) to the browser.
public QwenOmniRealtimeBridge(RealtimeBridgeCallback callback) {
this.callback = callback;
}
public QwenOmniRealtimeBridge(RealtimeBridgeCallback callback, String url, String model, String voiceName) {
this.callback = callback;
if (url != null) {
this.url = url;
}
if (model != null) {
this.model = model;
}
if (voiceName != null) {
this.voiceName = voiceName;
}
}
public CompletableFuture<Void> connect(RealtimeSetup setup) {
return CompletableFuture.runAsync(() -> {
try {
String apiKey = EnvUtils.getStr("DASHSCOPE_API_KEY");
if (StrUtil.isBlank(apiKey)) {
throw new IllegalStateException("DASHSCOPE_API_KEY is empty");
}
OmniRealtimeParam param = OmniRealtimeParam.builder().model(model).apikey(apiKey).url(url).build();
this.conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
connected.set(true);
sendJson(new WsVoiceAgentResponseMessage("qwen_connected", model));
}
@Override
public void onClose(int code, String reason) {
connected.set(false);
sendJson(new WsVoiceAgentResponseMessage("close", reason));
callback.close("dashscope closed: " + code + ", " + reason);
}
@Override
public void onEvent(JsonObject event) {
handleEvent(event);
}
});
conversation.connect();
// Session configuration: default to server_vad + text/audio + transcription
OmniRealtimeConfig cfg = buildSessionConfig(setup);
conversation.updateSession(cfg);
sendJson(new WsVoiceAgentResponseMessage("setup_sent_to_model"));
} catch (NoApiKeyException e) {
log.error("NoApiKeyException", e);
sendError("no_api_key", e.getMessage());
callback.close("no api key");
} catch (Exception e) {
log.error("connect error", e);
sendError("connect_error", e.getMessage());
callback.close("connect failed");
}
});
}
public CompletableFuture<Void> close() {
return CompletableFuture.runAsync(() -> {
try {
OmniRealtimeConversation c = this.conversation;
if (c != null) {
c.close(1000, "server close");
}
} catch (Exception ignore) {
} finally {
connected.set(false);
callback.close("close");
}
});
}
/**
 * Raw 16k PCM16 bytes pushed from the browser (little-endian).
 * DashScope requires Base64 via an input_audio_buffer.append event (wrapped by the SDK as appendAudio).
 */
public CompletableFuture<Void> sendPcm16k(byte[] pcm16k) {
return CompletableFuture.runAsync(() -> {
OmniRealtimeConversation c = this.conversation;
if (c == null || !connected.get() || pcm16k == null || pcm16k.length == 0) {
return;
}
try {
String b64 = Base64.getEncoder().encodeToString(pcm16k);
c.appendAudio(b64);
} catch (Exception e) {
log.error("appendAudio failed", e);
sendError("append_audio_failed", e.getMessage());
}
});
}
/**
 * Manual mode: map the existing audio_end to commit + createResponse.
 * Not needed when using server_vad.
 */
public CompletableFuture<Void> commitAndCreateResponse() {
return CompletableFuture.runAsync(() -> {
OmniRealtimeConversation c = this.conversation;
if (c == null || !connected.get())
return;
try {
c.commit();
c.createResponse(null, null);
} catch (Exception e) {
log.error("commit/createResponse failed", e);
sendError("commit_failed", e.getMessage());
}
});
}
/**
 * Text input: some SDK versions support sending text items; skip it if you only need voice.
 * To support a text input box, use the Realtime conversation item event (conversation.item.create)
 * or the SDK's text API, depending on the SDK version.
 */
public CompletableFuture<Void> sendText(String text) {
// Safest default behavior for now: fold into instructions, or ignore
return CompletableFuture.completedFuture(null);
}
private OmniRealtimeConfig buildSessionConfig(RealtimeSetup setup) {
// Combine the setup fields into instructions
String instructions = buildInstructions(setup);
// server_vad (call mode): enableTurnDetection(true)
// manual (push-to-talk): enableTurnDetection(false), and commit+createResponse on audio_end
boolean useServerVad = true;
List<OmniRealtimeModality> modalities = Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT);
OmniRealtimeConfig.OmniRealtimeConfigBuilder b = OmniRealtimeConfig.builder()
//
.modalities(modalities).voice(voiceName)
//
.enableInputAudioTranscription(true)
//
.enableTurnDetection(useServerVad);
//
if (instructions != null) {
b.parameters(Map.of("instructions", instructions));
}
return b.build();
}
private String buildInstructions(RealtimeSetup setup) {
if (setup == null) {
return null;
}
StringBuilder sb = new StringBuilder();
if (StrUtil.notBlank(setup.getSystem_prompt()))
sb.append(setup.getSystem_prompt()).append("\n");
if (StrUtil.notBlank(setup.getJob_description()))
sb.append(setup.getJob_description()).append("\n");
if (StrUtil.notBlank(setup.getResume()))
sb.append(setup.getResume()).append("\n");
if (StrUtil.notBlank(setup.getGreeting()))
sb.append(setup.getGreeting()).append("\n");
if (StrUtil.notBlank(setup.getQuestions()))
sb.append(setup.getQuestions()).append("\n");
return sb.length() == 0 ? null : sb.toString();
}
private void handleEvent(JsonObject event) {
try {
String type = event.has("type") ? event.get("type").getAsString() : "";
switch (type) {
// Session created/updated
case "session.created":
sendJson(new WsVoiceAgentResponseMessage("setup_complete"));
break;
case "session.updated":
// optional: log the updated config
break;
// Server VAD lifecycle (can drive "interrupt playback" on the frontend)
case "input_audio_buffer.speech_started":
sendJson(new WsVoiceAgentResponseMessage("speech_started"));
break;
case "input_audio_buffer.speech_stopped":
sendJson(new WsVoiceAgentResponseMessage("speech_stopped"));
break;
// User input transcription completed
case "conversation.item.input_audio_transcription.completed": {
String transcript = event.has("transcript") ? event.get("transcript").getAsString() : "";
sendJson(new WsVoiceAgentResponseMessage("transcript_in", transcript));
break;
}
// Model output subtitles (delta / done)
case "response.audio_transcript.delta": {
String delta = event.has("delta") ? event.get("delta").getAsString() : "";
sendJson(new WsVoiceAgentResponseMessage("transcript_out", delta));
break;
}
case "response.audio_transcript.done": {
String transcript = event.has("transcript") ? event.get("transcript").getAsString() : "";
// Alternatively send the full sentence as a text event
sendJson(new WsVoiceAgentResponseMessage("text", transcript));
break;
}
// Model output audio (Base64)
case "response.audio.delta": {
String b64 = event.has("delta") ? event.get("delta").getAsString() : "";
if (StrUtil.notBlank(b64)) {
byte[] pcm = Base64.getDecoder().decode(b64);
callback.sendBinary(pcm);
}
break;
}
case "response.audio.done":
// end of an audio segment (optional)
break;
// One turn complete
case "response.done":
sendJson(new WsVoiceAgentResponseMessage("turn_complete"));
break;
// Errors
case "error":
// depending on the SDK version the payload field may be error/message
sendError("remote_error", event.toString());
break;
default:
// enable when debugging events
// sendJson(new WsVoiceAgentResponseMessage("evt", event.toString()));
break;
}
} catch (Exception e) {
log.error("handleEvent error", e);
sendError("handle_event_error", e.getMessage());
}
}
private void sendJson(WsVoiceAgentResponseMessage msg) {
try {
String json = JsonUtils.toSkipNullJson(msg);
callback.sendText(json);
} catch (Exception e) {
callback.sendText("{\"type\":\"error\",\"message\":\"serialize error\"}");
}
}
private void sendError(String where, String message) {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("error");
m.setWhere(where);
m.setMessage(message == null ? "" : message);
sendJson(m);
}
@Override
public CompletableFuture<Void> endAudioInput() {
return CompletableFuture.completedFuture(null);
}
}
Key points of this bridge:
- The browser keeps sending binary PCM16; in sendPcm16k() the backend Base64-encodes it and calls the SDK's appendAudio
- The downstream response.audio.delta payload is Base64; the backend decodes it to bytes and reuses the existing callback.sendBinary(bytes) so the browser plays it unchanged
- response.done maps to the frontend's familiar turn_complete
commitAndCreateResponse() is for Manual mode; with server_vad (recommended for calls) it is usually never called. The Manual-mode commit + response.create semantics match the official documentation. ([AlibabaCloud][5])
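The bridge's event translation can be summarized as a lookup table. The DashScope event names on the left come from the server protocol; the message types on the right are this project's own WsVoiceAgentResponseMessage types:

```java
import java.util.Map;

public class EventMapping {
  // DashScope server event type → frontend message type used by this project.
  public static final Map<String, String> EVENT_TO_FRONTEND = Map.of(
      "session.created", "setup_complete",
      "input_audio_buffer.speech_started", "speech_started",
      "input_audio_buffer.speech_stopped", "speech_stopped",
      "conversation.item.input_audio_transcription.completed", "transcript_in",
      "response.audio_transcript.delta", "transcript_out",
      "response.audio_transcript.done", "text",
      "response.done", "turn_complete");

  public static void main(String[] args) {
    System.out.println(EVENT_TO_FRONTEND.get("response.done")); // turn_complete
  }
}
```

response.audio.delta is missing on purpose: it is forwarded as binary, not as a JSON message.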
7. Modifying VoiceSocketHandler
package com.litongjava.voice.agent.bridge;
import com.litongjava.consts.ModelPlatformName;
import com.litongjava.tio.utils.environment.EnvUtils;
public class RealtimeModelBridgeFactory {
public static RealtimeModelBridge createBridge(String platform, RealtimeBridgeCallback callback) {
if (platform == null) {
platform = EnvUtils.getStr("vioce.agent.platform");
}
RealtimeModelBridge bridge = null;
if (ModelPlatformName.GOOGLE.equals(platform)) {
bridge = new GoogleGeminiRealtimeBridge(callback);
} else if (ModelPlatformName.BAILIAN.equals(platform)) {
bridge = new QwenOmniRealtimeBridge(callback);
} else {
bridge = new GoogleGeminiRealtimeBridge(callback);
}
return bridge;
}
}
package com.litongjava.voice.agent.handler;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.http.common.HttpRequest;
import com.litongjava.tio.http.common.HttpResponse;
import com.litongjava.tio.utils.json.JsonUtils;
import com.litongjava.tio.websocket.common.WebSocketRequest;
import com.litongjava.tio.websocket.common.WebSocketResponse;
import com.litongjava.tio.websocket.common.WebSocketSessionContext;
import com.litongjava.tio.websocket.server.handler.IWebSocketHandler;
import com.litongjava.voice.agent.audio.SessionAudioRecorder;
import com.litongjava.voice.agent.bridge.RealtimeBridgeCallback;
import com.litongjava.voice.agent.bridge.RealtimeModelBridge;
import com.litongjava.voice.agent.bridge.RealtimeModelBridgeFactory;
import com.litongjava.voice.agent.bridge.RealtimeSetup;
import com.litongjava.voice.agent.callback.WsRealtimeBridgeCallback;
import com.litongjava.voice.agent.consts.VoiceAgentConst;
import com.litongjava.voice.agent.model.WsVoiceAgentRequestMessage;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import com.litongjava.voice.agent.model.WsVoiceAgentType;
import com.litongjava.voice.agent.utils.ChannelContextUtils;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class VoiceSocketHandler implements IWebSocketHandler {
// one bridge per frontend connection
private static final Map<String, RealtimeModelBridge> BRIDGES = new ConcurrentHashMap<>();
@Override
public HttpResponse handshake(HttpRequest httpRequest, HttpResponse response, ChannelContext channelContext)
throws Exception {
log.info("handshake request: {}", httpRequest);
return response;
}
@Override
public void onAfterHandshaked(HttpRequest httpRequest, HttpResponse httpResponse, ChannelContext channelContext)
throws Exception {
log.info("handshake complete: {}", httpRequest);
}
@Override
public Object onClose(WebSocketRequest wsRequest, byte[] bytes, ChannelContext channelContext) throws Exception {
String k = ChannelContextUtils.key(channelContext);
RealtimeModelBridge bridge = BRIDGES.remove(k);
if (bridge != null) {
bridge.close();
}
Tio.remove(channelContext, "client closed the connection");
return null;
}
@Override
public Object onBytes(WebSocketRequest wsRequest, byte[] bytes, ChannelContext channelContext) throws Exception {
String k = ChannelContextUtils.key(channelContext);
// Frontend pushes raw 16k PCM mono; record the user's upstream audio
try {
SessionAudioRecorder.appendUserPcm(k, bytes);
} catch (Exception ex) {
log.warn("appendUserPcm failed: {}", ex.getMessage());
}
RealtimeModelBridge bridge = BRIDGES.get(ChannelContextUtils.key(channelContext));
if (bridge != null) {
bridge.sendPcm16k(bytes);
}
return null;
}
@Override
public Object onText(WebSocketRequest wsRequest, String text, ChannelContext channelContext) throws Exception {
WebSocketSessionContext wsSessionContext = (WebSocketSessionContext) channelContext.get();
String path = wsSessionContext.getHandshakeRequest().getRequestLine().path;
log.info("path: {}, received: {}", path, text);
String t = text == null ? "" : text.trim();
// First try to parse as JSON -> WsVoiceAgentRequestMessage
WsVoiceAgentRequestMessage msg = null;
try {
msg = JsonUtils.parse(t, WsVoiceAgentRequestMessage.class);
} catch (Exception je) {
// Parse failed: treat as plain text and ignore
log.debug("received non-JSON text or unparsable WsVoiceAgentRequestMessage: {}", je.getMessage());
return null;
} catch (Throwable e) {
log.error("error while parsing the received message", e);
return null;
}
RealtimeModelBridge bridge = BRIDGES.get(ChannelContextUtils.key(channelContext));
if (bridge == null && msg != null && msg.getType() != null) {
String typeStr = msg.getType().trim().toUpperCase();
WsVoiceAgentType typeEnum = null;
try {
typeEnum = WsVoiceAgentType.valueOf(typeStr);
} catch (Exception ex) {
// Unknown type: log it; typeEnum stays null
log.debug("unknown type: {}", typeStr);
}
if (typeEnum == null) {
return null;
}
switch (typeEnum) {
case SETUP:
String platform = msg.getPlatform();
String systemPrompt = msg.getSystem_prompt();
String user_prompt = msg.getUser_prompt();
String job_description = msg.getJob_description();
String resume = msg.getResume();
String questions = msg.getQuestions();
String greeting = msg.getGreeting();
RealtimeSetup realtimeSetup = new RealtimeSetup(systemPrompt, user_prompt, job_description, resume, questions,
greeting);
connectLLM(channelContext, platform, realtimeSetup);
// Echo an acknowledgement
String json = toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.SETUP_RECEIVED.name()));
Tio.send(channelContext, WebSocketResponse.fromText(json, VoiceAgentConst.CHARSET));
break;
default:
break;
}
return null;
}
if (bridge == null) {
String respJson = toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.ERROR.name(), "no bridge"));
Tio.send(channelContext, WebSocketResponse.fromText(respJson, VoiceAgentConst.CHARSET));
return null;
}
try {
if (msg != null && msg.getType() != null) {
String typeStr = msg.getType().trim().toUpperCase();
WsVoiceAgentType typeEnum = null;
try {
typeEnum = WsVoiceAgentType.valueOf(typeStr);
} catch (Exception ex) {
// Unknown type: log it; typeEnum stays null and is skipped below
log.debug("unknown type: {}", typeStr);
}
if (typeEnum != null) {
switch (typeEnum) {
case AUDIO_END:
bridge.endAudioInput();
break;
case TEXT:
String userText = msg.getText() == null ? "" : msg.getText();
bridge.sendText(userText);
break;
case CLOSE:
bridge.close();
Tio.remove(channelContext, "client requested close");
break;
default:
// Other types: echo the raw JSON back
Tio.send(channelContext, WebSocketResponse.fromText(
toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.IGNORED.name(), t)), VoiceAgentConst.CHARSET));
break;
}
}
}
} catch (Exception e) {
log.error(e.getMessage(), e);
}
return null;
}
private String toJson(WsVoiceAgentResponseMessage wsVoiceAgentResponseMessage) {
return JsonUtils.toSkipNullJson(wsVoiceAgentResponseMessage);
}
private void connectLLM(ChannelContext channelContext, String platform, RealtimeSetup setup) {
String k = ChannelContextUtils.key(channelContext);
// Start the recorder (user audio is 16k; model output defaults to 24k)
try {
SessionAudioRecorder.start(k, 16000, 24000);
} catch (Exception e) {
log.warn("start recorder failed: {}", e.getMessage());
}
RealtimeBridgeCallback callback = new WsRealtimeBridgeCallback(channelContext);
callback.start(setup);
RealtimeModelBridge bridge = RealtimeModelBridgeFactory.createBridge(platform, callback);
BRIDGES.put(k, bridge);
// Connect to the selected realtime model bridge (async)
bridge.connect(setup);
}
}
8. Does the Frontend Need Changes?
8.1 It Works Without Changes (Recommended First Step)
The frontend already:
- Sends 16k PCM16 upstream (binary)
- Receives 24k PCM16 downstream (binary) and resamples it for playback
The backend bridge converts DashScope events into binary frames, so the frontend never sees DashScope's Base64 payloads or event names.
8.2 Optional Enhancement: Barge-in (Interrupting Playback)
With server_vad, DashScope emits:
input_audio_buffer.speech_started and input_audio_buffer.speech_stopped
In the JSON branch of the frontend's ws.onmessage handler you can add:
- On speech_started: flush the playback queue (playback is currently scheduled via nextPlayTime; add a flush/reset step that sets nextPlayTime = playCtx.currentTime + 0.01 and drops any unplayed chunks)
- This delivers the barge-in experience that call scenarios usually expect
9. Audio Format and Packet Size Recommendations
- Upstream: 16 kHz, PCM16, mono, little-endian
- Downstream (Qwen3-Omni-Flash-Realtime): typically 24 kHz, PCM16 (configured as pcm24)
- Chunking: send roughly 100 ms per packet (16000 samples/s × 0.1 s × 2 bytes = 3200 bytes). The official Java sample also uses 3200-byte chunks.
Chunks that are too small tend to trigger "buffer too small / empty buffer" errors, especially on commit in Manual mode. The Manual-mode interaction also stresses the ordering: append first, then commit, then create_response. ([AlibabaCloud][5])
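The 3200-byte figure is just sampleRate × duration × bytesPerSample = 16000 × 0.1 × 2. If the browser delivers buffers of arbitrary size, a hypothetical re-chunking helper like the one below keeps each appendAudio call near 100 ms:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PcmChunker {
  // 100 ms of 16 kHz PCM16 mono: 16000 * 0.1 * 2 = 3200 bytes.
  public static final int CHUNK_BYTES = 16000 / 10 * 2;

  // Split a PCM buffer into CHUNK_BYTES frames; the final frame may be shorter.
  public static List<byte[]> chunk(byte[] pcm) {
    List<byte[]> frames = new ArrayList<>();
    for (int off = 0; off < pcm.length; off += CHUNK_BYTES) {
      frames.add(Arrays.copyOfRange(pcm, off, Math.min(off + CHUNK_BYTES, pcm.length)));
    }
    return frames;
  }

  public static void main(String[] args) {
    System.out.println(chunk(new byte[7000]).size()); // 3 frames: 3200 + 3200 + 600
  }
}
```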
10. Deployment and Debugging Checklist (China Region)
- Confirm the environment variable DASHSCOPE_API_KEY is readable at runtime
- Confirm the Beijing endpoint: wss://dashscope.aliyuncs.com/api-ws/v1/realtime (do not use the intl one) ([AlibabaCloud][1])
For the first integration run:
- Disable business-layer extras (recording to disk, transcoding, etc.) and keep only the forwarding
- Enable event logging in the bridge layer (if needed, print event.toString() in the default branch)
If you get no response:
- server_vad: check enableTurnDetection(true) and ambient noise; switch to Manual if necessary
- Manual: confirm audio was appended before commit + createResponse, and that each turn's audio is non-empty ([AlibabaCloud][5])
11. Common Extension Points
11.1 Image Input (For Future Video Calls)
DashScope Omni-Realtime supports input_image_buffer.append (Base64 JPG/JPEG) and requires at least one audio append before an image is sent. The full interaction flow and event list are described in the Omni-Realtime interaction documentation. ([AlibabaCloud][6])
11.2 ASR-only or TTS-only
- ASR only: modalities=["text"], with input_audio_transcription enabled
- Voice reply only: modalities=["audio"] (if allowed), or keep text for easier debugging
12. Enabling Input Transcription
qwen3-omni-flash-realtime supports transcribing the user's audio, but you must specify the ASR model used for input transcription. If that model is not configured, or if commit is never triggered in Manual mode, the server only returns subtitles for the model's output audio and never a transcript of the user's input.
1) User-audio transcription in qwen3-omni-flash-realtime requires a separate ASR model
The official Java SDK documentation is explicit: enableInputAudioTranscription only toggles input audio recognition on or off; the model that actually performs the recognition is configured via InputAudioTranscription, which currently only supports gummy-realtime-v1. ([Alibaba Cloud][1]) It also states that transcription only happens when the input audio buffer is committed and input_audio_transcription is configured. ([Alibaba Cloud][1]) The reason given: the Omni model's text output is an answer to the input, not a verbatim transcript, so input transcription must use a separate ASR model. ([Alibaba Cloud][1])
2) Why you only see the model's audio transcript
The "model audio transcript" you see corresponds to these server events:
response.audio_transcript.delta / done (subtitles for the assistant's output audio). The user-input transcript corresponds to conversation.item.input_audio_transcription.completed.
Without InputAudioTranscription=gummy-realtime-v1, or without a commit in Manual mode, it is easy to end up with only assistant subtitles and no user transcript.
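The two failure causes can be collapsed into one predicate: a user transcript only appears when the ASR model is configured and the buffer actually gets committed (automatically under server VAD, explicitly otherwise). A sketch:

```java
public class TranscriptCheck {
  // True when conversation.item.input_audio_transcription.completed can be expected.
  public static boolean userTranscriptExpected(boolean asrConfigured, boolean serverVad, boolean committed) {
    // Server VAD commits the buffer automatically; Manual mode needs an explicit commit.
    return asrConfigured && (serverVad || committed);
  }

  public static void main(String[] args) {
    System.out.println(userTranscriptExpected(true, true, false));  // true
    System.out.println(userTranscriptExpected(false, true, false)); // false
    System.out.println(userTranscriptExpected(true, false, false)); // false
  }
}
```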
3) Java SDK: Minimal Configuration to Enable User Input Transcription
VAD mode (enableTurnDetection=true)
In VAD mode the server commits the buffer automatically, so the key point is: in addition to enableInputAudioTranscription(true), set InputAudioTranscription to gummy-realtime-v1. ([Alibaba Cloud][1])
Example (pick whichever method name your SDK actually exposes; check the builder in your IDE):
conversation.updateSession(OmniRealtimeConfig.builder()
.modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
.voice("Cherry")
.enableTurnDetection(true)
.enableInputAudioTranscription(true)
// Key: set the input-transcription ASR model (only gummy-realtime-v1 is supported)
.inputAudioTranscription("gummy-realtime-v1") // if your SDK builder exposes this method
// or: .inputAudioTranscriptionModel("gummy-realtime-v1")
.parameters(Map.of(
"instructions", "You are…",
"smooth_output", true
))
.build()
);
And handle the event in the callback:
case "conversation.item.input_audio_transcription.completed":
System.out.println("user: " + event.get("transcript").getAsString());
break;
Manual mode (enableTurnDetection=false)
In Manual mode, input audio transcription normally only happens after commit. ([Alibaba Cloud][1]) So the flow must be:
- appendAudio(...) one or more times
- conversation.commit() (submits this turn's input to the server; this is what triggers the input transcription) ([Alibaba Cloud][1])
- conversation.createResponse(...) (starts generating the model's reply)
