Integrating the Qwen Realtime Model
A complete guide to wiring Qwen-Omni-Realtime (China region, Beijing) into tio-boot via the DashScope Java SDK.
Intended for Java developers building a WebSocket voice pipeline on tio-boot (the browser sends 16k PCM upstream, the server forwards it, and the browser plays 24k PCM downstream).
1. Goals and Overall Architecture
Target capabilities:
- The browser captures microphone audio (16 kHz, PCM16, mono) and streams it to tio-boot in real time
- tio-boot opens a session with DashScope Realtime (Qwen-Omni-Realtime) and continuously appends audio
- The model returns, in real time:
  - Input transcription (ASR)
  - Output transcription (subtitles)
  - Output audio (TTS; Qwen3-Omni-Flash-Realtime supports pcm24)
  - Lifecycle events (speech_started / speech_stopped / response.done, etc.)
Data flow:
Browser ↔ tio-boot WebSocket ↔ QwenOmniRealtimeBridge (DashScope Java SDK) ↔ DashScope Realtime service
2. Prerequisites
2.1 Region and Endpoint (China, Beijing)
- WebSocket endpoint (Beijing): wss://dashscope.aliyuncs.com/api-ws/v1/realtime (the international Singapore endpoint uses dashscope-intl; this document covers Beijing only)
- The connection requires a model query parameter, e.g. ?model=qwen3-omni-flash-realtime
- Authentication: HTTP header Authorization: Bearer DASHSCOPE_API_KEY (supplied via apikey in the SDK)
The connection and parameter requirements above come from the Realtime/Omni-Realtime documentation and samples. ([AlibabaCloud][1])
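As a concrete illustration of the endpoint rules above, the full connection URL can be assembled as follows. This is only a sketch: the DashScope SDK builds the URL itself when you pass `url` and `model`, so `RealtimeEndpoint` is a hypothetical helper, not SDK API.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class RealtimeEndpoint {
  // Beijing-region realtime endpoint (Singapore uses dashscope-intl instead).
  static final String BEIJING_WS = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime";

  // Append the required ?model= query parameter.
  public static String buildUrl(String model) {
    return BEIJING_WS + "?model=" + URLEncoder.encode(model, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    System.out.println(buildUrl("qwen3-omni-flash-realtime"));
  }
}
```

The Authorization: Bearer header is supplied separately by the SDK from the apikey parameter; it is never part of the URL.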
2.2 DashScope Java SDK Version
Install the SDK: https://bailian.console.aliyun.com/cn-beijing/?tab=api#/api/?type=model&url=2712193
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>2.22.10</version>
</dependency>
If you run into a logging-library conflict, exclude slf4j-simple:
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>dashscope-sdk-java</artifactId>
<version>2.22.10</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
</exclusion>
</exclusions>
</dependency>
2.3 API Key
- Environment variable: DASHSCOPE_API_KEY
- Calls to the Beijing region require a Beijing-region key (Beijing and Singapore keys are separate, per the official region documentation). ([AlibabaCloud][4])
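A missing or wrong-region key otherwise only surfaces at connect time, so it is worth failing fast at startup. A minimal sketch (the `requireApiKey` helper is hypothetical; it takes the env map as a parameter so it can be tested):

```java
import java.util.Map;

public class ApiKeyCheck {
  // Read DASHSCOPE_API_KEY from an env map; fail fast when it is missing or blank.
  public static String requireApiKey(Map<String, String> env) {
    String key = env.get("DASHSCOPE_API_KEY");
    if (key == null || key.trim().isEmpty()) {
      throw new IllegalStateException("DASHSCOPE_API_KEY is not set (a Beijing-region key is required)");
    }
    return key;
  }

  public static void main(String[] args) {
    // In production you would pass System.getenv() here.
    System.out.println(requireApiKey(Map.of("DASHSCOPE_API_KEY", "sk-demo")));
  }
}
```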
3. Model and Session Parameters
3.1 Model
Start with the stable version:
qwen3-omni-flash-realtime
(Dated snapshot versions are also available; for production, pin a stable or fixed version.)
3.2 Session Configuration (session.update, i.e. updateSession in the SDK)
Key fields (consistent with the official event documentation):
- modalities: ["text","audio"], or ["text"] only
- voice: e.g. Cherry
- input_audio_format: only pcm16 is supported
- output_audio_format: Qwen3-Omni-Flash-Realtime supports pcm24
- instructions: system instructions
- turn_detection:
  - server_vad: the server detects speech start/stop automatically (recommended for call scenarios)
  - null: Manual mode; the client must send commit + response.create
Semantics of input_audio_buffer.commit: it only submits the buffer and does not trigger a reply by itself. The server responds with input_audio_buffer.committed; to get a reply you still need response.create (Manual mode). ([AlibabaCloud][5])
4. Mapping the tio-boot Protocol onto DashScope Realtime
The existing frontend/backend protocol (from the Gemini Live implementation) is:
- Binary, browser → backend: raw 16k PCM16 stream
- Binary, backend → browser: raw 24k PCM16 stream (played directly)
- JSON control messages: setup / text / audio_end / close
The DashScope Omni-Realtime protocol (server side) instead requires:
- input_audio_buffer.append: a JSON event; the audio must be a Base64 string
- Output audio arrives as response.audio.delta: a JSON event whose audio payload is a Base64 string
- All transcription and lifecycle signals are also JSON events (conversation.item.input_audio_transcription.completed, response.audio_transcript.delta, response.done, etc.)
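The format mismatch between the two protocols therefore reduces to a Base64 round trip on the backend: encode raw PCM before `input_audio_buffer.append`, and decode the `delta` payload of `response.audio.delta` before forwarding it as binary. A sketch using only `java.util.Base64`:

```java
import java.util.Arrays;
import java.util.Base64;

public class PcmBase64 {
  // Browser → DashScope: raw PCM16 bytes must be Base64 for input_audio_buffer.append.
  public static String encodeForAppend(byte[] pcm) {
    return Base64.getEncoder().encodeToString(pcm);
  }

  // DashScope → browser: the "delta" field of response.audio.delta is Base64 PCM.
  public static byte[] decodeDelta(String base64Delta) {
    return Base64.getDecoder().decode(base64Delta);
  }

  public static void main(String[] args) {
    byte[] pcm = {0, 1, 2, 3};
    System.out.println(Arrays.equals(pcm, decodeDelta(encodeForAppend(pcm)))); // true
  }
}
```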
4.1 The Minimal Change: Leave the Frontend Alone, Transcode and Bridge Events in the Backend
- The browser keeps sending binary PCM16 to tio-boot
- In onBytes, tio-boot:
  - Base64-encodes the PCM chunk
  - Calls conversation.appendAudio(base64) (or the equivalent method) via the DashScope SDK
- When DashScope returns response.audio.delta:
  - Base64-decode it to get 24k PCM bytes
  - Send them to the browser as binary via tio-boot, as before
- For DashScope transcription events:
  - Convert them into the existing WsVoiceAgentResponseMessage (type="transcript_in/out") and send them to the frontend
This way index.html / app.js / mic-worklet.js need essentially no changes.
4.2 Redefining audio_end
Today, audio_end in the Gemini Live flow means "tell the model the audio stream has ended". With DashScope:
- In server_vad mode: audio_end is generally unnecessary; the server commits automatically once silence exceeds the threshold and triggers a response
- In Manual mode: map audio_end to input_audio_buffer.commit followed by response.create
This matches the official Manual interaction flow. ([AlibabaCloud][6])
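To make the Manual-mode ordering concrete, here is a toy model (not SDK code) of one turn: append one or more chunks, then let audio_end trigger commit plus response.create. Committing an empty buffer is rejected, mirroring the empty-buffer errors mentioned later in the debugging checklist.

```java
import java.util.ArrayList;
import java.util.List;

// Toy state machine for one Manual-mode turn; records the events it would emit.
public class ManualTurn {
  private final List<byte[]> buffer = new ArrayList<>();
  private final List<String> emitted = new ArrayList<>();

  public void appendAudio(byte[] pcm) {
    buffer.add(pcm);
    emitted.add("input_audio_buffer.append");
  }

  // The mapping of the frontend's audio_end in Manual mode.
  public void onAudioEnd() {
    if (buffer.isEmpty()) {
      throw new IllegalStateException("commit on an empty audio buffer");
    }
    emitted.add("input_audio_buffer.commit");
    emitted.add("response.create");
    buffer.clear();
  }

  public List<String> emitted() {
    return emitted;
  }
}
```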
5. Configuration
5.1 Environment Variable
export DASHSCOPE_API_KEY=xxxx
6. Backend Implementation: QwenOmniRealtimeBridge (Replacing GeminiLiveBridge)
- Keep the RealtimeBridgeCallback interface unchanged (reusing the existing WebSocket send logic)
- Add QwenOmniRealtimeBridge, which internally uses the DashScope Java SDK's OmniRealtimeConversation
- In VoiceSocketHandler, replace GeminiLiveBridge with QwenOmniRealtimeBridge
6.1 RealtimeModelBridge
package com.litongjava.voice.agent.bridge;
import java.util.concurrent.CompletableFuture;
/**
 * Unified abstraction: every realtime voice/audio model bridge implements this interface.
 * VoiceSocketHandler depends only on this interface.
 */
public interface RealtimeModelBridge {
// Implementations are expected to provide a constructor like:
// (RealtimeBridgeCallback sender, String url, String model, String voiceName)
/**
 * Establish the session connection to the model and apply the necessary session/setup configuration.
 */
CompletableFuture<Void> connect(RealtimeSetup setup);
/**
 * Send a chunk of upstream audio (PCM16, 16 kHz, mono, little-endian, pushed from the browser).
 */
CompletableFuture<Void> sendPcm16k(byte[] pcm16k);
/**
 * End the current audio input / trigger model generation:
 * - Gemini: sendAudioStreamEnd()
 * - Qwen Manual: commit + response.create
 * - Qwen server VAD: may be a no-op (or a manual-trigger fallback)
 */
CompletableFuture<Void> endAudioInput();
/**
 * Optional: send text input (chat box).
 */
CompletableFuture<Void> sendText(String text);
/**
 * Close the session and release resources.
 */
CompletableFuture<Void> close();
}
6.2 QwenOmniRealtimeBridge.java
package com.litongjava.voice.agent.bridge;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import com.alibaba.dashscope.audio.omni.OmniRealtimeCallback;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConfig;
import com.alibaba.dashscope.audio.omni.OmniRealtimeConversation;
import com.alibaba.dashscope.audio.omni.OmniRealtimeModality;
import com.alibaba.dashscope.audio.omni.OmniRealtimeParam;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.google.gson.JsonObject;
import com.litongjava.tio.utils.environment.EnvUtils;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.tio.utils.json.JsonUtils;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class QwenOmniRealtimeBridge implements RealtimeModelBridge {
// China region (Beijing)
private String url = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime";
private String model = "qwen3-omni-flash-realtime";
private String voiceName = "Cherry";
private final RealtimeBridgeCallback callback;
private volatile OmniRealtimeConversation conversation;
private final AtomicBoolean connected = new AtomicBoolean(false);
// Upstream from the frontend is 16k PCM16; Qwen3-Omni-Flash-Realtime downstream is typically 24k PCM16 (pcm24).
// Note: audio in DashScope events is Base64; we still send raw bytes (binary) to the browser.
public QwenOmniRealtimeBridge(RealtimeBridgeCallback callback) {
this.callback = callback;
}
public QwenOmniRealtimeBridge(RealtimeBridgeCallback callback, String url, String model, String voiceName) {
this.callback = callback;
if (url != null) {
this.url = url;
}
if (model != null) {
this.model = model;
}
if (voiceName != null) {
this.voiceName = voiceName;
}
}
public CompletableFuture<Void> connect(RealtimeSetup setup) {
return CompletableFuture.runAsync(() -> {
try {
String apiKey = EnvUtils.getStr("DASHSCOPE_API_KEY");
if (StrUtil.isBlank(apiKey)) {
throw new IllegalStateException("DASHSCOPE_API_KEY is empty");
}
OmniRealtimeParam param = OmniRealtimeParam.builder().model(model).apikey(apiKey).url(url).build();
this.conversation = new OmniRealtimeConversation(param, new OmniRealtimeCallback() {
@Override
public void onOpen() {
connected.set(true);
sendJson(new WsVoiceAgentResponseMessage("qwen_connected", model));
}
@Override
public void onClose(int code, String reason) {
connected.set(false);
sendJson(new WsVoiceAgentResponseMessage("close", reason));
callback.close("dashscope closed: " + code + ", " + reason);
}
@Override
public void onEvent(JsonObject event) {
handleEvent(event);
}
});
conversation.connect();
// Session configuration: default to server_vad + text/audio + transcription
OmniRealtimeConfig cfg = buildSessionConfig(setup);
conversation.updateSession(cfg);
sendJson(new WsVoiceAgentResponseMessage("setup_sent_to_model"));
} catch (NoApiKeyException e) {
log.error("NoApiKeyException", e);
sendError("no_api_key", e.getMessage());
callback.close("no api key");
} catch (Exception e) {
log.error("connect error", e);
sendError("connect_error", e.getMessage());
callback.close("connect failed");
}
});
}
public CompletableFuture<Void> close() {
return CompletableFuture.runAsync(() -> {
try {
OmniRealtimeConversation c = this.conversation;
if (c != null) {
c.close(1000, "server close");
}
} catch (Exception ignore) {
} finally {
connected.set(false);
callback.close("close");
}
});
}
/**
 * Raw 16k PCM16 bytes pushed from the browser (little-endian).
 * DashScope requires Base64 via an input_audio_buffer.append event (wrapped by the SDK as appendAudio).
 */
public CompletableFuture<Void> sendPcm16k(byte[] pcm16k) {
return CompletableFuture.runAsync(() -> {
OmniRealtimeConversation c = this.conversation;
if (c == null || !connected.get() || pcm16k == null || pcm16k.length == 0) {
return;
}
try {
String b64 = Base64.getEncoder().encodeToString(pcm16k);
c.appendAudio(b64);
} catch (Exception e) {
log.error("appendAudio failed", e);
sendError("append_audio_failed", e.getMessage());
}
});
}
/**
 * Manual mode: map the existing audio_end to commit + createResponse.
 * Not needed when using server_vad.
 */
public CompletableFuture<Void> commitAndCreateResponse() {
return CompletableFuture.runAsync(() -> {
OmniRealtimeConversation c = this.conversation;
if (c == null || !connected.get())
return;
try {
c.commit();
c.createResponse(null, null);
} catch (Exception e) {
log.error("commit/createResponse failed", e);
sendError("commit_failed", e.getMessage());
}
});
}
/**
 * Text input: some SDK versions support sending text items; skip it if you only need voice.
 * To support a text input box, use the Realtime conversation item event (conversation.item.create)
 * or the SDK's text API, depending on the SDK version.
 */
public CompletableFuture<Void> sendText(String text) {
// Safest default behavior for now: fold into instructions, or ignore
return CompletableFuture.completedFuture(null);
}
private OmniRealtimeConfig buildSessionConfig(RealtimeSetup setup) {
// Combine the setup fields into instructions
String instructions = buildInstructions(setup);
// server_vad (call mode): enableTurnDetection(true)
// manual (push-to-talk): enableTurnDetection(false), and commit+createResponse on audio_end
boolean useServerVad = true;
List<OmniRealtimeModality> modalities = Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT);
OmniRealtimeConfig.OmniRealtimeConfigBuilder b = OmniRealtimeConfig.builder()
//
.modalities(modalities).voice(voiceName)
//
.enableInputAudioTranscription(true)
//
.enableTurnDetection(useServerVad);
//
if (instructions != null) {
b.parameters(Map.of("instructions", instructions));
}
return b.build();
}
private String buildInstructions(RealtimeSetup setup) {
if (setup == null) {
return null;
}
StringBuilder sb = new StringBuilder();
if (StrUtil.notBlank(setup.getSystem_prompt()))
sb.append(setup.getSystem_prompt()).append("\n");
if (StrUtil.notBlank(setup.getJob_description()))
sb.append(setup.getJob_description()).append("\n");
if (StrUtil.notBlank(setup.getResume()))
sb.append(setup.getResume()).append("\n");
if (StrUtil.notBlank(setup.getGreeting()))
sb.append(setup.getGreeting()).append("\n");
if (StrUtil.notBlank(setup.getQuestions()))
sb.append(setup.getQuestions()).append("\n");
return sb.length() == 0 ? null : sb.toString();
}
private void handleEvent(JsonObject event) {
try {
String type = event.has("type") ? event.get("type").getAsString() : "";
switch (type) {
// Session created/updated
case "session.created":
sendJson(new WsVoiceAgentResponseMessage("setup_complete"));
break;
case "session.updated":
// optional: log the updated config
break;
// Server VAD lifecycle (can drive "interrupt playback" on the frontend)
case "input_audio_buffer.speech_started":
sendJson(new WsVoiceAgentResponseMessage("speech_started"));
break;
case "input_audio_buffer.speech_stopped":
sendJson(new WsVoiceAgentResponseMessage("speech_stopped"));
break;
// User input transcription completed
case "conversation.item.input_audio_transcription.completed": {
String transcript = event.has("transcript") ? event.get("transcript").getAsString() : "";
sendJson(new WsVoiceAgentResponseMessage("transcript_in", transcript));
break;
}
// Model output subtitles (delta / done)
case "response.audio_transcript.delta": {
String delta = event.has("delta") ? event.get("delta").getAsString() : "";
sendJson(new WsVoiceAgentResponseMessage("transcript_out", delta));
break;
}
case "response.audio_transcript.done": {
String transcript = event.has("transcript") ? event.get("transcript").getAsString() : "";
// Alternatively send the full sentence as a text event
sendJson(new WsVoiceAgentResponseMessage("text", transcript));
break;
}
// Model output audio (Base64)
case "response.audio.delta": {
String b64 = event.has("delta") ? event.get("delta").getAsString() : "";
if (StrUtil.notBlank(b64)) {
byte[] pcm = Base64.getDecoder().decode(b64);
callback.sendBinary(pcm);
}
break;
}
case "response.audio.done":
// end of an audio segment (optional)
break;
// One turn complete
case "response.done":
sendJson(new WsVoiceAgentResponseMessage("turn_complete"));
break;
// Errors
case "error":
// depending on the SDK version the payload field may be error/message
sendError("remote_error", event.toString());
break;
default:
// enable when debugging events
// sendJson(new WsVoiceAgentResponseMessage("evt", event.toString()));
break;
}
} catch (Exception e) {
log.error("handleEvent error", e);
sendError("handle_event_error", e.getMessage());
}
}
private void sendJson(WsVoiceAgentResponseMessage msg) {
try {
String json = JsonUtils.toSkipNullJson(msg);
callback.sendText(json);
} catch (Exception e) {
callback.sendText("{\"type\":\"error\",\"message\":\"serialize error\"}");
}
}
private void sendError(String where, String message) {
WsVoiceAgentResponseMessage m = new WsVoiceAgentResponseMessage("error");
m.setWhere(where);
m.setMessage(message == null ? "" : message);
sendJson(m);
}
@Override
public CompletableFuture<Void> endAudioInput() {
return CompletableFuture.completedFuture(null);
}
}
Key points of this bridge:
- The browser keeps sending binary PCM16; in sendPcm16k() the backend Base64-encodes it and calls the SDK's appendAudio
- The downstream response.audio.delta payload is Base64; the backend decodes it to bytes and reuses the existing callback.sendBinary(bytes) so the browser plays it unchanged
- response.done maps to the frontend's familiar turn_complete
commitAndCreateResponse() is for Manual mode; with server_vad (recommended for calls) it is usually never called. The Manual-mode commit + response.create semantics match the official documentation. ([AlibabaCloud][5])
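The bridge's event translation can be summarized as a lookup table. The DashScope event names on the left come from the server protocol; the message types on the right are this project's own WsVoiceAgentResponseMessage types:

```java
import java.util.Map;

public class EventMapping {
  // DashScope server event type → frontend message type used by this project.
  public static final Map<String, String> EVENT_TO_FRONTEND = Map.of(
      "session.created", "setup_complete",
      "input_audio_buffer.speech_started", "speech_started",
      "input_audio_buffer.speech_stopped", "speech_stopped",
      "conversation.item.input_audio_transcription.completed", "transcript_in",
      "response.audio_transcript.delta", "transcript_out",
      "response.audio_transcript.done", "text",
      "response.done", "turn_complete");

  public static void main(String[] args) {
    System.out.println(EVENT_TO_FRONTEND.get("response.done")); // turn_complete
  }
}
```

response.audio.delta is missing on purpose: it is forwarded as binary, not as a JSON message.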
7. Modifying VoiceSocketHandler
package com.litongjava.voice.agent.bridge;
import com.litongjava.consts.ModelPlatformName;
import com.litongjava.tio.utils.environment.EnvUtils;
public class RealtimeModelBridgeFactory {
public static RealtimeModelBridge createBridge(String platform, RealtimeBridgeCallback callback) {
if (platform == null) {
platform = EnvUtils.getStr("vioce.agent.platform");
}
RealtimeModelBridge bridge = null;
if (ModelPlatformName.GOOGLE.equals(platform)) {
bridge = new GoogleGeminiRealtimeBridge(callback);
} else if (ModelPlatformName.BAILIAN.equals(platform)) {
bridge = new QwenOmniRealtimeBridge(callback);
} else {
bridge = new GoogleGeminiRealtimeBridge(callback);
}
return bridge;
}
}
package com.litongjava.voice.agent.handler;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.http.common.HttpRequest;
import com.litongjava.tio.http.common.HttpResponse;
import com.litongjava.tio.utils.json.JsonUtils;
import com.litongjava.tio.websocket.common.WebSocketRequest;
import com.litongjava.tio.websocket.common.WebSocketResponse;
import com.litongjava.tio.websocket.common.WebSocketSessionContext;
import com.litongjava.tio.websocket.server.handler.IWebSocketHandler;
import com.litongjava.voice.agent.audio.SessionAudioRecorder;
import com.litongjava.voice.agent.bridge.RealtimeBridgeCallback;
import com.litongjava.voice.agent.bridge.RealtimeModelBridge;
import com.litongjava.voice.agent.bridge.RealtimeModelBridgeFactory;
import com.litongjava.voice.agent.bridge.RealtimeSetup;
import com.litongjava.voice.agent.callback.WsRealtimeBridgeCallback;
import com.litongjava.voice.agent.consts.VoiceAgentConst;
import com.litongjava.voice.agent.model.WsVoiceAgentRequestMessage;
import com.litongjava.voice.agent.model.WsVoiceAgentResponseMessage;
import com.litongjava.voice.agent.model.WsVoiceAgentType;
import com.litongjava.voice.agent.utils.ChannelContextUtils;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class VoiceSocketHandler implements IWebSocketHandler {
// one bridge per frontend connection
private static final Map<String, RealtimeModelBridge> BRIDGES = new ConcurrentHashMap<>();
@Override
public HttpResponse handshake(HttpRequest httpRequest, HttpResponse response, ChannelContext channelContext)
throws Exception {
log.info("handshake request: {}", httpRequest);
return response;
}
@Override
public void onAfterHandshaked(HttpRequest httpRequest, HttpResponse httpResponse, ChannelContext channelContext)
throws Exception {
log.info("handshake complete: {}", httpRequest);
}
@Override
public Object onClose(WebSocketRequest wsRequest, byte[] bytes, ChannelContext channelContext) throws Exception {
String k = ChannelContextUtils.key(channelContext);
RealtimeModelBridge bridge = BRIDGES.remove(k);
if (bridge != null) {
bridge.close();
}
Tio.remove(channelContext, "client closed the connection");
return null;
}
@Override
public Object onBytes(WebSocketRequest wsRequest, byte[] bytes, ChannelContext channelContext) throws Exception {
String k = ChannelContextUtils.key(channelContext);
// Frontend pushes raw 16k PCM mono; record the user's upstream audio
try {
SessionAudioRecorder.appendUserPcm(k, bytes);
} catch (Exception ex) {
log.warn("appendUserPcm failed: {}", ex.getMessage());
}
RealtimeModelBridge bridge = BRIDGES.get(ChannelContextUtils.key(channelContext));
if (bridge != null) {
bridge.sendPcm16k(bytes);
}
return null;
}
@Override
public Object onText(WebSocketRequest wsRequest, String text, ChannelContext channelContext) throws Exception {
WebSocketSessionContext wsSessionContext = (WebSocketSessionContext) channelContext.get();
String path = wsSessionContext.getHandshakeRequest().getRequestLine().path;
log.info("path: {}, received: {}", path, text);
String t = text == null ? "" : text.trim();
// First try to parse as JSON -> WsVoiceAgentRequestMessage
WsVoiceAgentRequestMessage msg = null;
try {
msg = JsonUtils.parse(t, WsVoiceAgentRequestMessage.class);
} catch (Exception je) {
// Parse failed: treat as plain text and ignore
log.debug("received non-JSON text or unparsable WsVoiceAgentRequestMessage: {}", je.getMessage());
return null;
} catch (Throwable e) {
log.error("error while parsing the received message", e);
return null;
}
RealtimeModelBridge bridge = BRIDGES.get(ChannelContextUtils.key(channelContext));
if (bridge == null && msg != null && msg.getType() != null) {
String typeStr = msg.getType().trim().toUpperCase();
WsVoiceAgentType typeEnum = null;
try {
typeEnum = WsVoiceAgentType.valueOf(typeStr);
} catch (Exception ex) {
// Unknown type: log it; typeEnum stays null
log.debug("unknown type: {}", typeStr);
}
if (typeEnum == null) {
return null;
}
switch (typeEnum) {
case SETUP:
String platform = msg.getPlatform();
String systemPrompt = msg.getSystem_prompt();
String user_prompt = msg.getUser_prompt();
String job_description = msg.getJob_description();
String resume = msg.getResume();
String questions = msg.getQuestions();
String greeting = msg.getGreeting();
RealtimeSetup realtimeSetup = new RealtimeSetup(systemPrompt, user_prompt, job_description, resume, questions,
greeting);
connectLLM(channelContext, platform, realtimeSetup);
// Echo an acknowledgement
String json = toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.SETUP_RECEIVED.name()));
Tio.send(channelContext, WebSocketResponse.fromText(json, VoiceAgentConst.CHARSET));
break;
default:
break;
}
return null;
}
if (bridge == null) {
String respJson = toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.ERROR.name(), "no bridge"));
Tio.send(channelContext, WebSocketResponse.fromText(respJson, VoiceAgentConst.CHARSET));
return null;
}
try {
if (msg != null && msg.getType() != null) {
String typeStr = msg.getType().trim().toUpperCase();
WsVoiceAgentType typeEnum = null;
try {
typeEnum = WsVoiceAgentType.valueOf(typeStr);
} catch (Exception ex) {
// Unknown type: log it; typeEnum stays null and is skipped below
log.debug("unknown type: {}", typeStr);
}
if (typeEnum != null) {
switch (typeEnum) {
case AUDIO_END:
bridge.endAudioInput();
break;
case TEXT:
String userText = msg.getText() == null ? "" : msg.getText();
bridge.sendText(userText);
break;
case CLOSE:
bridge.close();
Tio.remove(channelContext, "client requested close");
break;
default:
// Other types: echo the raw JSON back
Tio.send(channelContext, WebSocketResponse.fromText(
toJson(new WsVoiceAgentResponseMessage(WsVoiceAgentType.IGNORED.name(), t)), VoiceAgentConst.CHARSET));
break;
}
}
}
} catch (Exception e) {
log.error(e.getMessage(), e);
}
return null;
}
private String toJson(WsVoiceAgentResponseMessage wsVoiceAgentResponseMessage) {
return JsonUtils.toSkipNullJson(wsVoiceAgentResponseMessage);
}
private void connectLLM(ChannelContext channelContext, String platform, RealtimeSetup setup) {
String k = ChannelContextUtils.key(channelContext);
// Start the recorder (user audio is 16k; model output defaults to 24k)
try {
SessionAudioRecorder.start(k, 16000, 24000);
} catch (Exception e) {
log.warn("start recorder failed: {}", e.getMessage());
}
RealtimeBridgeCallback callback = new WsRealtimeBridgeCallback(channelContext);
callback.start(setup);
RealtimeModelBridge bridge = RealtimeModelBridgeFactory.createBridge(platform, callback);
BRIDGES.put(k, bridge);
// Connect to the selected realtime model bridge (async)
bridge.connect(setup);
}
}
8. Does the Frontend Need Changes?
8.1 It Works Without Changes (Recommended First Step)
The frontend already:
- Sends 16k PCM16 upstream (binary)
- Receives 24k PCM16 downstream (binary) and resamples it for playback
The backend bridge converts DashScope events into binary frames, so the frontend never sees DashScope's Base64 payloads or event names.
8.2 Optional Enhancement: Barge-in (Interrupting Playback)
With server_vad, DashScope emits:
input_audio_buffer.speech_started and input_audio_buffer.speech_stopped
In the JSON branch of the frontend's ws.onmessage handler you can add:
- On speech_started: flush the playback queue (playback is currently scheduled via nextPlayTime; add a flush/reset step that sets nextPlayTime = playCtx.currentTime + 0.01 and drops any unplayed chunks)
- This delivers the barge-in experience that call scenarios usually expect
9. Audio Format and Packet Size Recommendations
- Upstream: 16 kHz, PCM16, mono, little-endian
- Downstream (Qwen3-Omni-Flash-Realtime): typically 24 kHz, PCM16 (configured as pcm24)
- Chunking: send roughly 100 ms per packet (16000 samples/s × 0.1 s × 2 bytes = 3200 bytes). The official Java sample also uses 3200-byte chunks.
Chunks that are too small tend to trigger "buffer too small / empty buffer" errors, especially on commit in Manual mode. The Manual-mode interaction also stresses the ordering: append first, then commit, then create_response. ([AlibabaCloud][5])
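The 3200-byte figure is just sampleRate × duration × bytesPerSample = 16000 × 0.1 × 2. If the browser delivers buffers of arbitrary size, a hypothetical re-chunking helper like the one below keeps each appendAudio call near 100 ms:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PcmChunker {
  // 100 ms of 16 kHz PCM16 mono: 16000 * 0.1 * 2 = 3200 bytes.
  public static final int CHUNK_BYTES = 16000 / 10 * 2;

  // Split a PCM buffer into CHUNK_BYTES frames; the final frame may be shorter.
  public static List<byte[]> chunk(byte[] pcm) {
    List<byte[]> frames = new ArrayList<>();
    for (int off = 0; off < pcm.length; off += CHUNK_BYTES) {
      frames.add(Arrays.copyOfRange(pcm, off, Math.min(off + CHUNK_BYTES, pcm.length)));
    }
    return frames;
  }

  public static void main(String[] args) {
    System.out.println(chunk(new byte[7000]).size()); // 3 frames: 3200 + 3200 + 600
  }
}
```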
10. Deployment and Debugging Checklist (China Region)
- Confirm the environment variable DASHSCOPE_API_KEY is readable at runtime
- Confirm the Beijing endpoint: wss://dashscope.aliyuncs.com/api-ws/v1/realtime (do not use the intl one) ([AlibabaCloud][1])
For the first integration run:
- Disable business-layer extras (recording to disk, transcoding, etc.) and keep only the forwarding
- Enable event logging in the bridge layer (if needed, print event.toString() in the default branch)
If you get no response:
- server_vad: check enableTurnDetection(true) and ambient noise; switch to Manual if necessary
- Manual: confirm audio was appended before commit + createResponse, and that each turn's audio is non-empty ([AlibabaCloud][5])
11. Common Extension Points
11.1 Image Input (For Future Video Calls)
DashScope Omni-Realtime supports input_image_buffer.append (Base64 JPG/JPEG) and requires at least one audio append before an image is sent. The full interaction flow and event list are described in the Omni-Realtime interaction documentation. ([AlibabaCloud][6])
11.2 ASR-only or TTS-only
- ASR only: modalities=["text"], with input_audio_transcription enabled
- Voice reply only: modalities=["audio"] (if allowed), or keep text for easier debugging
12. Enabling Input Transcription
qwen3-omni-flash-realtime supports transcribing the user's audio, but you must specify the ASR model used for input transcription. If that model is not configured, or if commit is never triggered in Manual mode, the server only returns subtitles for the model's output audio and never a transcript of the user's input.
1) User-audio transcription in qwen3-omni-flash-realtime requires a separate ASR model
The official Java SDK documentation is explicit: enableInputAudioTranscription only toggles input audio recognition on or off; the model that actually performs the recognition is configured via InputAudioTranscription, which currently only supports gummy-realtime-v1. ([Alibaba Cloud][1]) It also states that transcription only happens when the input audio buffer is committed and input_audio_transcription is configured. ([Alibaba Cloud][1]) The reason given: the Omni model's text output is an answer to the input, not a verbatim transcript, so input transcription must use a separate ASR model. ([Alibaba Cloud][1])
2) Why you only see the model's audio transcript
The "model audio transcript" you see corresponds to these server events:
response.audio_transcript.delta / done (subtitles for the assistant's output audio). The user-input transcript corresponds to conversation.item.input_audio_transcription.completed.
Without InputAudioTranscription=gummy-realtime-v1, or without a commit in Manual mode, it is easy to end up with only assistant subtitles and no user transcript.
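The two failure causes can be collapsed into one predicate: a user transcript only appears when the ASR model is configured and the buffer actually gets committed (automatically under server VAD, explicitly otherwise). A sketch:

```java
public class TranscriptCheck {
  // True when conversation.item.input_audio_transcription.completed can be expected.
  public static boolean userTranscriptExpected(boolean asrConfigured, boolean serverVad, boolean committed) {
    // Server VAD commits the buffer automatically; Manual mode needs an explicit commit.
    return asrConfigured && (serverVad || committed);
  }

  public static void main(String[] args) {
    System.out.println(userTranscriptExpected(true, true, false));  // true
    System.out.println(userTranscriptExpected(false, true, false)); // false
    System.out.println(userTranscriptExpected(true, false, false)); // false
  }
}
```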
3) Java SDK: Minimal Configuration to Enable User Input Transcription
VAD mode (enableTurnDetection=true)
In VAD mode the server commits the buffer automatically, so the key point is: in addition to enableInputAudioTranscription(true), set InputAudioTranscription to gummy-realtime-v1. ([Alibaba Cloud][1])
Example (pick whichever method name your SDK actually exposes; check the builder in your IDE):
conversation.updateSession(OmniRealtimeConfig.builder()
.modalities(Arrays.asList(OmniRealtimeModality.AUDIO, OmniRealtimeModality.TEXT))
.voice("Cherry")
.enableTurnDetection(true)
.enableInputAudioTranscription(true)
// Key: set the input-transcription ASR model (only gummy-realtime-v1 is supported)
.inputAudioTranscription("gummy-realtime-v1") // if your SDK builder exposes this method
// or: .inputAudioTranscriptionModel("gummy-realtime-v1")
.parameters(Map.of(
"instructions", "You are…",
"smooth_output", true
))
.build()
);
And handle the event in the callback:
case "conversation.item.input_audio_transcription.completed":
System.out.println("user: " + event.get("transcript").getAsString());
break;
Manual mode (enableTurnDetection=false)
In Manual mode, input audio transcription normally only happens after commit. ([Alibaba Cloud][1]) So the flow must be:
- appendAudio(...) one or more times
- conversation.commit() (submits this turn's input to the server; this is what triggers the input transcription) ([Alibaba Cloud][1])
- conversation.createResponse(...) (starts generating the model's reply)
