Building Voice Agents

Audio handling

Some transport layers, such as the default OpenAIRealtimeWebRTC, handle audio input and output automatically. For other transports, such as OpenAIRealtimeWebSocket, you have to handle session audio yourself:

import {
  RealtimeAgent,
  RealtimeSession,
  TransportLayerAudio,
} from '@openai/agents/realtime';

const agent = new RealtimeAgent({ name: 'My agent' });
const session = new RealtimeSession(agent);
const newlyRecordedAudio = new ArrayBuffer(0);

session.on('audio', (event: TransportLayerAudio) => {
  // play your audio
});

// send new audio to the agent
session.sendAudio(newlyRecordedAudio);
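
The buffer passed to sendAudio has to be in the audio format the session expects. As a minimal sketch, assuming the session uses pcm16 and that your capture code yields Float32 samples in the range [-1, 1] (as the Web Audio API does), a conversion helper could look like this. The name floatTo16BitPCM is hypothetical and not part of the SDK, and the helper does not resample.

```typescript
// Convert Float32 samples in [-1, 1] to little-endian 16-bit PCM.
// Hypothetical helper for illustration; not part of the SDK.
function floatTo16BitPCM(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp so out-of-range samples do not wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale by 0x8000, positive values by 0x7fff.
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, value, true); // true = little-endian
  }
  return buffer;
}
```

With a helper like this, a captured chunk could be forwarded with session.sendAudio(floatTo16BitPCM(chunk)).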

Session configuration

You can configure your session either by passing additional options to the RealtimeSession constructor or by passing options when you call connect(...).

import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
});

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
  config: {
    inputAudioFormat: 'pcm16',
    outputAudioFormat: 'pcm16',
    inputAudioTranscription: {
      model: 'gpt-4o-mini-transcribe',
    },
  },
});

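These transport layers let you pass any parameter that matches session.

For parameters that are new and do not yet have a matching parameter in RealtimeSessionConfig, you can use providerData. Anything passed in providerData is passed through directly as part of the session object.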

Handoffs

As with regular agents, you can use handoffs to split your agent into multiple agents and orchestrate between them, which improves agent performance and scopes each problem more tightly.

import { RealtimeAgent } from '@openai/agents/realtime';

const mathTutorAgent = new RealtimeAgent({
  name: 'Math Tutor',
  handoffDescription: 'Specialist agent for math questions',
  instructions:
    'You provide help with math problems. Explain your reasoning at each step and include examples',
});

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
  handoffs: [mathTutorAgent],
});

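Unlike with regular agents, handoffs behave slightly differently for realtime agents. When a handoff happens, the ongoing session is updated with the new agent configuration. As a result, the agent automatically has access to the ongoing conversation history, and input filters are currently not applied.

In addition, the voice and model cannot be changed as part of a handoff, and you can only hand off to other realtime agents. If you need a different model, for example a reasoning model such as gpt-5-mini, you can use delegation through tools.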

Tools

As with regular agents, realtime agents can call tools to perform actions. You define a tool with the same tool() function that you would use for a regular agent.

import { tool, RealtimeAgent } from '@openai/agents/realtime';
import { z } from 'zod';

const getWeather = tool({
  name: 'get_weather',
  description: 'Return the weather for a city.',
  parameters: z.object({ city: z.string() }),
  async execute({ city }) {
    return `The weather in ${city} is sunny.`;
  },
});

const weatherAgent = new RealtimeAgent({
  name: 'Weather assistant',
  instructions: 'Answer weather questions.',
  tools: [getWeather],
});

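Realtime agents can only use function tools, and these tools run in the same place as your realtime session. So if the realtime session runs in the browser, the tools run in the browser too. If you need to perform more sensitive actions, you can make an HTTP request to your backend server from within the tool.

While a tool is executing, the agent cannot process new requests from the user. One way to improve the experience is to instruct the agent to announce that it is about to run a tool, or to say specific filler phrases that buy time for the tool to run.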
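
To illustrate the backend request mentioned above, here is a minimal sketch of a function that a tool's execute callback could delegate to. Everything named here is an assumption for illustration: the /api/refunds endpoint, the FetchLike type, and executeRefundOnBackend are hypothetical and not part of the SDK. The fetch implementation is injected so the function can be exercised with a stub; in a browser, the global fetch should be compatible with this shape.

```typescript
// Minimal shape of the fetch function this sketch relies on.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Hypothetical endpoint; replace with a route on your own backend.
const REFUND_ENDPOINT = '/api/refunds';

// Forwards the request to the backend and returns a string that the
// agent can relay to the user.
async function executeRefundOnBackend(
  request: string,
  fetchImpl: FetchLike,
): Promise<string> {
  const response = await fetchImpl(REFUND_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ request }),
  });
  if (!response.ok) {
    // Return a readable message instead of throwing, so the agent can
    // tell the user that something went wrong.
    return `The refund service returned an error (status ${response.status}).`;
  }
  return response.text();
}
```

Keeping the sensitive logic behind the endpoint means secrets and business rules stay on the server, while the tool running in the browser only relays the request and the result.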

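Accessing the conversation history

In addition to the arguments the agent called a tool with, you can access a snapshot of the current conversation history tracked by the realtime session. This is useful when you need to perform a more complex action based on the current state of the conversation, or when you plan to use tools for delegation.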

import {
  tool,
  RealtimeContextData,
  RealtimeItem,
} from '@openai/agents/realtime';
import { z } from 'zod';

const parameters = z.object({
  request: z.string(),
});

const refundTool = tool<typeof parameters, RealtimeContextData>({
  name: 'Refund Expert',
  description: 'Evaluate a refund',
  parameters,
  execute: async ({ request }, details) => {
    // The history might not be available
    const history: RealtimeItem[] = details?.context?.history ?? [];
    // making your call to process the refund request
  },
});

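Approval before tool execution

If you define a tool with needsApproval: true, the agent emits a tool_approval_requested event before executing the tool.

By listening to this event, you can show a UI that lets the user approve or reject the tool call.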

import { session } from './agent';

session.on('tool_approval_requested', (_context, _agent, request) => {
  // show a UI to the user to approve or reject the tool call
  // you can use the `session.approve(...)` or `session.reject(...)` methods to approve or reject the tool call

  session.approve(request.approvalItem); // or session.reject(request.rawItem);
});

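Guardrails

Guardrails offer a way to monitor whether what the agent has said violates a set of rules and to cut off the response immediately if it does. These guardrail checks run against the transcript of the agent's response, so the model's text output must be enabled (it is enabled by default).

The guardrails you provide run asynchronously as the model response comes in, so you can cut off the response based on a predefined classification trigger, for example "mentions a specific banned word".

When a guardrail trips, the session emits a guardrail_tripped event. The event also provides a details object containing the itemId that triggered the guardrail.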

import { RealtimeOutputGuardrail, RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
});

const guardrails: RealtimeOutputGuardrail[] = [
  {
    name: 'No mention of Dom',
    async execute({ agentOutput }) {
      const domInOutput = agentOutput.includes('Dom');
      return {
        tripwireTriggered: domInOutput,
        outputInfo: { domInOutput },
      };
    },
  },
];

const guardedSession = new RealtimeSession(agent, {
  outputGuardrails: guardrails,
});

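By default, guardrails run every 100 characters or when the response text has finished generating. Because speaking the text usually takes longer than generating it, in most cases the guardrail can catch a violation before the user hears it.

To change this behavior, pass an outputGuardrailSettings object to the session.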

import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Greeter',
  instructions: 'Greet the user with cheer and answer questions.',
});

const guardedSession = new RealtimeSession(agent, {
  outputGuardrails: [
    /*...*/
  ],
  outputGuardrailSettings: {
    debounceTextLength: 500, // run guardrail every 500 characters or set it to -1 to run it only at the end
  },
});
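
To build intuition for how debounceTextLength trades check frequency against latency, the following sketch simulates the idea on a stream of text deltas. It is an illustration under stated assumptions, not the SDK's internal implementation: runDebounced and GuardrailCheck are hypothetical names, and the check is a plain synchronous function.

```typescript
// Returns true when the text violates the rule.
type GuardrailCheck = (textSoFar: string) => boolean;

// Runs `check` each time at least `debounceTextLength` new characters
// have arrived, and once more at the end if the full text is unchecked.
// A negative `debounceTextLength` means "only check at the end".
function runDebounced(
  deltas: string[],
  check: GuardrailCheck,
  debounceTextLength: number,
): { runs: number; tripped: boolean } {
  let text = '';
  let lastCheckedLength = 0;
  let runs = 0;
  for (const delta of deltas) {
    text += delta;
    const due =
      debounceTextLength >= 0 &&
      text.length - lastCheckedLength >= debounceTextLength;
    if (due) {
      runs++;
      lastCheckedLength = text.length;
      // Stop early: a tripped guardrail cuts the response off.
      if (check(text)) return { runs, tripped: true };
    }
  }
  // Skip the final check if the complete text was already checked.
  if (runs > 0 && lastCheckedLength === text.length) {
    return { runs, tripped: false };
  }
  runs++;
  return { runs, tripped: check(text) };
}
```

In this model a smaller threshold runs the check more often and catches a violation earlier in the response, while -1 runs it exactly once, after the response text is complete.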

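Turn detection / voice activity detection

The realtime session automatically detects when the user is speaking and triggers new turns using the Realtime API's built-in voice activity detection modes.

You can change the voice activity detection mode by passing a turnDetection object to the session.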

import { RealtimeSession } from '@openai/agents/realtime';
import { agent } from './agent';

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
  config: {
    turnDetection: {
      type: 'semantic_vad',
      eagerness: 'medium',
      createResponse: true,
      interruptResponse: true,
    },
  },
});

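Adjusting the turn detection settings can help calibrate unwanted interruptions and deal with silence. See the Realtime API documentation for details on the individual settings.

Interruptions

With built-in voice activity detection, speaking over the agent is detected automatically and the context is updated based on what was said. The session also emits an audio_interrupted event, which you can use to stop all audio playback immediately (this applies only to WebSocket connections).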

import { session } from './agent';

session.on('audio_interrupted', () => {
  // handle local playback interruption
});

To interrupt manually, for example to offer a "stop" button in your UI, call interrupt() yourself:

import { session } from './agent';

session.interrupt();
// this will still trigger the `audio_interrupted` event for you
// to cut off the audio playback when using WebSockets

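In either case, the realtime session takes care of interrupting the agent's generation, truncating its knowledge of what was said to the user, and updating the history.

If you connect to your agent over WebRTC, the audio output is cleared as well. If you use WebSocket, you need to handle this yourself by stopping playback of whatever audio is queued.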
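
For the WebSocket case, one possible way to organize local playback is a small queue that the audio handler fills and the audio_interrupted handler empties. The sketch below is an assumption about how you might structure this; PlaybackQueue is a hypothetical class, not part of the SDK, and actually decoding and playing the chunks is left to your audio code.

```typescript
// Queue of audio chunks waiting to be played locally.
// Hypothetical helper for illustration; not part of the SDK.
class PlaybackQueue {
  private chunks: ArrayBuffer[] = [];

  // Call from the session's 'audio' event handler.
  enqueue(chunk: ArrayBuffer): void {
    this.chunks.push(chunk);
  }

  // Call from your player when it is ready for the next chunk.
  next(): ArrayBuffer | undefined {
    return this.chunks.shift();
  }

  // Call from the 'audio_interrupted' handler to drop pending audio.
  // Returns the number of chunks that were discarded.
  clear(): number {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }

  get size(): number {
    return this.chunks.length;
  }
}
```

Clearing the queue only discards audio that has not started playing; the chunk that is currently playing also has to be stopped in your player for the interruption to sound immediate.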

Text input

To send text input to your agent, use the sendMessage method on RealtimeSession.

This is useful if you want users to interact with the agent in both modalities, or if you want to provide additional context to the conversation.

import { RealtimeSession, RealtimeAgent } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Assistant',
});

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
});

session.sendMessage('Hello, how are you?');

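Conversation history management

RealtimeSession automatically manages the conversation history in its history property.

You can use this to render the history to the customer or to perform additional processing on it. Because the history changes constantly over the course of a conversation, you can listen for the history_updated event.

To modify the history, for example to remove a message entirely or update its transcript, use the updateHistory method.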

import { RealtimeSession, RealtimeAgent } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'Assistant',
});

const session = new RealtimeSession(agent, {
  model: 'gpt-realtime',
});

await session.connect({ apiKey: '<client-api-key>' });

// listening to the history_updated event
session.on('history_updated', (history) => {
  // returns the full history of the session
  console.log(history);
});

// Option 1: explicit setting
session.updateHistory([
  /* specific history */
]);

// Option 2: override based on current state like removing all agent messages
session.updateHistory((currentHistory) => {
  return currentHistory.filter(
    (item) => !(item.type === 'message' && item.role === 'assistant'),
  );
});
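
Updater functions like the one in Option 2 can be factored into small reusable helpers. As a sketch, the helper below builds an updater that removes a single item by ID, for example an item reported by a guardrail_tripped event. It assumes that history items expose an itemId string field; removeItem is a hypothetical name and not part of the SDK.

```typescript
// Builds an updater that removes the item with the given ID.
// Assumes each history item has an `itemId` string field.
function removeItem(itemId: string) {
  return <T extends { itemId: string }>(history: T[]): T[] =>
    // filter() returns a new array and leaves the input untouched.
    history.filter((item) => item.itemId !== itemId);
}
```

If the assumption about itemId holds for your items, the result could be passed straight to the session, as in session.updateHistory(removeItem('<item-id>')).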

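Limitations

You currently cannot update or change function tool calls after the fact
Text output in the history requires transcripts and text modalities to be enabled
Responses that were truncated due to an interruption do not have a transcript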

Delegation through tools

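By combining the conversation history with a tool call, you can delegate the conversation to another backend agent to perform a more complex action and then pass the result back to the user.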

import {
  RealtimeAgent,
  RealtimeContextData,
  tool,
} from '@openai/agents/realtime';
import { handleRefundRequest } from './serverAgent';
import z from 'zod';

const refundSupervisorParameters = z.object({
  request: z.string(),
});

const refundSupervisor = tool<
  typeof refundSupervisorParameters,
  RealtimeContextData
>({
  name: 'escalateToRefundSupervisor',
  description: 'Escalate a refund request to the refund supervisor',
  parameters: refundSupervisorParameters,
  execute: async ({ request }, details) => {
    // This will execute on the server
    return handleRefundRequest(request, details?.context?.history ?? []);
  },
});

const agent = new RealtimeAgent({
  name: 'Customer Support',
  instructions:
    'You are a customer support agent. If you receive any requests for refunds, you need to delegate to your supervisor.',
  tools: [refundSupervisor],
});

The code below runs on the server, in this example through a Next.js server action.

// This runs on the server
import 'server-only';

import { Agent, run } from '@openai/agents';
import type { RealtimeItem } from '@openai/agents/realtime';
import z from 'zod';

const agent = new Agent({
  name: 'Refund Expert',
  instructions:
    'You are a refund expert. You are given a request to process a refund and you need to determine if the request is valid.',
  model: 'gpt-5-mini',
  outputType: z.object({
    reasoning: z.string(),
    refundApproved: z.boolean(),
  }),
});

export async function handleRefundRequest(
  request: string,
  history: RealtimeItem[],
) {
  const input = `
The user has requested a refund.

The request is: ${request}

Current conversation history:
${JSON.stringify(history, null, 2)}
`.trim();

  const result = await run(agent, input);

  return JSON.stringify(result.finalOutput, null, 2);
}