
Using OpenAI's Realtime API

21 min read
David Négrier

In this article, I'm going to describe our experience creating a WorkAdventure bot using OpenAI's new Realtime API. This API is revolutionary because it allows you to interact with an AI model in speech-to-speech mode.

Before diving into the details, let's take a look at the final result:

info

This article is targeted at developers who want to start using OpenAI's new Realtime API in their projects. The API is still in beta as I'm writing this, so things might have changed by the time you read it.

Previously, to interact with an AI model, you had to turn your voice into text, send it to the model, and then turn the model's response back into speech. This process was somewhat slow. It could take a few seconds for the AI to respond, leading to an awkward conversation.

We have already experimented with the previous OpenAI API in WorkAdventure. The results were good, but the conversation was not as smooth as we would have liked.

With the new OpenAI Realtime API, the model directly takes your voice as input and responds in real time. This makes the conversation smoother. Furthermore, because there is no need to convert speech to text, the model does not lose information that a transcription would discard, like the tone of your voice. And it can also respond with an appropriate tone.

The API is still in beta, but the demos were very impressive, so we decided to give it a try.

Interacting with the API is very different from the previous chat completion API. Because we are dealing with audio, messages keep flowing in both directions over a WebSocket: we send audio chunks to the model and receive audio chunks as responses. Because we are using a WebSocket, the API is now stateful. This means you no longer need to resend the context of the conversation at each turn: the model keeps track of the conversation context and can respond accordingly.

The context of WorkAdventure is somewhat special. Bots are actually JavaScript scripts that run on the browser side: each bot runs in a headless browser tab on a server and uses the Scripting API to interact with the WorkAdventure map. Because the bots run in a headless browser on the server side, we can actually put the OpenAI key right into the browser. This saves us from having to manage the key on the server side and from using a separate server as a proxy to the API.

If you are looking to implement the Realtime API in your own project, your setup will likely be different. You might need a relay server that lives on the server, takes calls from the client, and forwards them to the OpenAI API, adding the API key in the process. I would also like to mention that LiveKit seems to have a great higher-level library for handling bots: https://docs.livekit.io/agents/openai/overview/. I haven't had the opportunity to test it yet, but it looks promising and you should probably have a look at it before starting.
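If you go down the relay route, the core of such a server only takes a few dozen lines. Here is a minimal sketch using Node.js and the ws package; the upstream URL and the OpenAI-Beta header are the ones documented for the beta at the time of writing and may change, and you would of course need to authenticate your own clients:

import { WebSocket, WebSocketServer } from "ws";

const OPENAI_REALTIME_URL =
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01";

const relay = new WebSocketServer({ port: 8080 });

relay.on("connection", (client) => {
    // Open one upstream connection to OpenAI per client, adding the API key
    const upstream = new WebSocket(OPENAI_REALTIME_URL, {
        headers: {
            Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
            "OpenAI-Beta": "realtime=v1",
        },
    });

    // Buffer client messages until the upstream connection is ready
    const pending: string[] = [];
    upstream.on("open", () => {
        for (const message of pending) {
            upstream.send(message);
        }
        pending.length = 0;
    });

    // Forward messages in both directions
    client.on("message", (data) => {
        const message = data.toString();
        if (upstream.readyState === WebSocket.OPEN) {
            upstream.send(message);
        } else {
            pending.push(message);
        }
    });
    upstream.on("message", (data) => client.send(data.toString()));

    // Tear down both sides together
    client.on("close", () => upstream.close());
    upstream.on("close", () => client.close());
});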

Getting started

Instead of directly talking on the WebSocket with the OpenAI Realtime API, we decided to use a wrapper library provided by OpenAI: https://github.com/openai/openai-realtime-api-beta.

$ npm install openai/openai-realtime-api-beta --save

This wrapper is still in its infancy:

Yeah... version 0.0.0 with 17 weekly downloads. This is bleeding-edge technology!

Now that the library is installed, let's create our "Robot" class that will handle the communication with the API:

import {RealtimeClient} from "@openai/realtime-api-beta";

class Robot {
private realtimeClient: RealtimeClient;

constructor(
private audioTranscriptionModel = "whisper-1",
private voice = "alloy",
) {
}

async init(instructions: string): Promise<void> {
console.log("Robot is starting...");

this.realtimeClient = new RealtimeClient({
// In WorkAdventure, we get the API key from the URL hash
// parameters.
// The way you get the API key will depend on your setup.
apiKey: WA.room.hashParameters.openaiApiKey,
// We are fine with having the API key in the browser because
// this is a headless browser running on the server.
dangerouslyAllowAPIKeyInBrowser: true,
});

// The "session", in the Realtime API, refers to the WebSocket
// connection.
// There is one session per WebSocket connection.

// We update the session with the instructions we want to send to
// the model.
this.realtimeClient.updateSession({
instructions: instructions
});

// We update the session with the voice and the audio transcription
// model we want to use.
this.realtimeClient.updateSession({voice: this.voice});
this.realtimeClient.updateSession({
// VAD means "Voice Activity Detection".
// In "server_vad" mode, we let OpenAI decide when to start and
// stop speaking.
turn_detection: {type: 'server_vad'},
input_audio_transcription: {model: this.audioTranscriptionModel},
});

// Each time the conversation is updated (each time we receive a
// bit of audio from the model or a word from the transcription),
// this callback is called.
this.realtimeClient.on('conversation.updated', (event) => {
console.log(event.delta);
if (event.delta.audio) {
// Let's do something with the audio.
}
});

// Connect to Realtime API
// Note: in the future, we should be able to pass the LLM model here.
await this.realtimeClient.connect();

// Send an initial message to the model to trigger a response.
this.realtimeClient.sendUserMessageContent([
{type: 'input_text', text: `How are you?`}
]);
}
}

At this point, we can see in the console that the model is responding with audio chunks.

If we take a look at the "delta" part of the event, we can see some events contain "transcripts", which are usually single words, and other events contain "audio" chunks.
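For instance, we can already split the handling of the two kinds of deltas. This is a small sketch; the delta.transcript and delta.audio fields are the ones exposed by the openai-realtime-api-beta wrapper:

this.realtimeClient.on('conversation.updated', (event) => {
    if (event.delta?.transcript) {
        // A piece of the transcription, usually a single word
        console.log('Transcript chunk:', event.delta.transcript);
    }
    if (event.delta?.audio) {
        // A chunk of raw PCM16 audio, exposed as an Int16Array
        console.log('Audio chunk of', event.delta.audio.length, 'samples');
    }
});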

Playing the audio

The next step is to send the audio chunks to the browser's audio API to play the audio. In the case of WorkAdventure, the audio would be played in the headless browser on the server.

"On the server, no one can hear you scream."

-- Alien... almost (1979)

Ok, so we have the audio chunks and we need to turn them into a MediaStream. A media stream is a stream of audio data that can be played by the browser. We can create a MediaStream from an array of audio chunks. The MediaStream can then be played by the browser or sent to a WebRTC connection.

The audio chunks sent by the Realtime API are in a PCM 16 format (that is: raw 16 bit PCM audio at 24kHz, 1 channel, little-endian).

In the browser, however, the API in charge of managing audio is the Web Audio API, which operates on 32-bit floats. So we need to convert the 16-bit integer audio chunks into 32-bit float audio chunks.

First of all, I was very worried about this "little-endian" thing. Little-endian means that the least significant byte is stored first, and handling that by hand in JavaScript is not straightforward. However, the wrapper library already hands you the audio output from OpenAI as a decoded Int16Array, so you don't have to worry about endianness here.
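For the curious, here is what explicit endianness handling would look like if you ever received the audio as raw bytes (an ArrayBuffer) rather than a ready-made Int16Array. This is purely illustrative; you don't need it with the wrapper library:

// Decode raw PCM16 little-endian bytes into an Int16Array
function pcm16LittleEndianToInt16Array(buffer: ArrayBuffer): Int16Array {
    // DataView lets us be explicit about the byte order when decoding
    const view = new DataView(buffer);
    const samples = new Int16Array(buffer.byteLength / 2);
    for (let i = 0; i < samples.length; i++) {
        // The second argument ("true") requests little-endian decoding
        samples[i] = view.getInt16(i * 2, true);
    }
    return samples;
}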

So converting from 16-bit PCM to 32-bit float boils down to turning an integer value between -32768 and 32767 into a float value between -1 and 1, which is quite simple.

Here is the code to do this conversion:

const int16Array = event.delta.audio;

// Convert Int16Array to Float32Array as the Web Audio API uses Float32
const float32Array = new Float32Array(int16Array.length);
for (let i = 0; i < int16Array.length; i++) {
// Convert 16-bit PCM to float [-1, 1]
float32Array[i] = int16Array[i] / 32768.0;
}

Now that we have the audio in the correct format, we need to create a MediaStream from it.

There are many ways to do that. The simplest one involves using a ScriptProcessorNode. This is easy, but unfortunately, it is deprecated. You can also use an AudioBufferSourceNode, but the documentation says it does not play well with streaming audio and is better suited to playing short audio clips.

The technique I will use relies on an AudioWorkletNode. This is the modern way to do things, but it is a bit more complex.

Audio worklets run in a separate thread and can be used to process audio in real time. We don't perform heavy computations that could block the main thread here, but using an audio worklet means our audio processing happens in that dedicated thread anyway.

An audio worklet can receive audio chunks if a MediaStream is plugged into its input, and can send audio output to a destination MediaStream. It can also communicate with the main thread using the postMessage API.

In this case, we will use the audio worklet to play the audio chunks we receive from the Realtime API. We will send the audio chunks to the audio worklet using the postMessage API, and the audio worklet will generate a MediaStream.

The code will be split into two files: one contains the audio worklet processor, and the other is the main file that creates the audio worklet node and connects it to the audio output.

Let's start with the audio worklet processor (the one running in a dedicated thread).

Data will be arriving via the postMessage API (in the onmessage method), as an object with this structure:

{
pcmData: Float32Array
}

The pcmData field contains the audio data in the correct format (32bit float). We will store this data in the audioQueue.

class OutputAudioWorkletProcessor extends AudioWorkletProcessor {
private audioQueue: Float32Array[] = [];

constructor() {
super();
this.port.onmessage = (event: MessageEvent) => {
const data = event.data.pcmData;
if (data instanceof Float32Array) {
this.audioQueue.push(data);
} else {
console.error("Invalid data type received in worklet",
event.data);
}
};
}

process(inputs: Float32Array[][],
outputs: Float32Array[][],
parameters: Record<string, Float32Array>): boolean {
const output = outputs[0];
const outputData = output[0];

let nextChunk: Float32Array | undefined;
let currentOffset = 0;
while (nextChunk = this.audioQueue[0]) {
if (currentOffset + nextChunk.length <= outputData.length) {
// If the next chunk fits in the output buffer,
// copy it and remove it from the queue
outputData.set(nextChunk, currentOffset);
currentOffset += nextChunk.length;
this.audioQueue.shift();
} else {
// If the next chunk does not fit in the output buffer,
// copy only what fits and keep the rest in the queue
outputData.set(
nextChunk.subarray(0, outputData.length - currentOffset),
currentOffset);
this.audioQueue[0] = nextChunk.subarray(
outputData.length - currentOffset
);
break;
}
}

return true; // Keep processor alive
}
}

// Required registration for the worklet
registerProcessor('output-pcm-worklet-processor', OutputAudioWorkletProcessor);
export {};

When the processor is running, the process method is called each time the audio worklet needs to process audio data (so every few milliseconds). The process method has "inputs" and "outputs" parameters. The "inputs" parameter is completely ignored in our case. The "outputs" parameter is an array of arrays of Float32Arrays: one entry per output, each containing one Float32Array per channel. In our case, we will only use the first channel of the first output.

If there is data in the audioQueue, the processor will copy the data to the output buffer, trying to fill it as much as possible. It is important to note that when the process function is called, the output buffer is already filled with zeros (silence). So if there is no data in the audioQueue, the output buffer will be filled with silence. This is exactly what we want.

Last but not least, please note that we call registerProcessor at the end of the file. The output-pcm-worklet-processor is the name we will use to create the audio worklet node in the main file.

TypeScript support

If you are using TypeScript, it is likely you will be missing the AudioWorkletProcessor type. There is an npm package to add it (@types/audioworklet), but it conflicts with the DOM types. So instead, I copied the types from a GitHub issue found here into my project.
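For reference, the handful of ambient declarations needed looks roughly like this (a simplified version; adjust it to your needs):

// Globals available inside an AudioWorkletGlobalScope
declare class AudioWorkletProcessor {
    readonly port: MessagePort;
    process(
        inputs: Float32Array[][],
        outputs: Float32Array[][],
        parameters: Record<string, Float32Array>
    ): boolean;
}

declare function registerProcessor(
    name: string,
    processorCtor: new () => AudioWorkletProcessor
): void;

// The sample rate of the AudioContext the worklet belongs to
declare const sampleRate: number;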

Did I say that we are working on bleeding-edge technology? Yes, we are.

Instantiating the audio worklet node in the main file

Now that we have the audio worklet processor, we need to create the audio worklet node in the main file.

From a bird's eye view, without any error management, the process looks like this:

// Sample rate returned by OpenAI is 24kHz
const audioContext = new AudioContext({ sampleRate: 24000 });

// Let's create the MediaStream destination
const mediaStreamDestination = audioContext.createMediaStreamDestination();

await audioContext.resume();

// Let's load the audio worklet processor (assuming the file is served as "output-pcm-processor.js")
await audioContext.audioWorklet.addModule("output-pcm-processor.js");

// Instantiate the audio worklet node (the 'output-pcm-worklet-processor' string matches the name we used in the registerProcessor call)
const workletNode = new AudioWorkletNode(audioContext, 'output-pcm-worklet-processor');

// Connect the worklet node to the media stream destination
workletNode.connect(mediaStreamDestination);

In practice, it is a bit more complex. In a real-world application, you will use a bundler, so we cannot reference the processor file directly without letting the bundler know about it.

In our case, we are using Vite.

In Vite, we can use the "?worker&url" suffix on an import to get a URL pointing to the bundled file. This is very useful for our use case.

import audioWorkletProcessorUrl from "./pcm-processor.ts?worker&url";

// ...
await this.audioContext.audioWorklet.addModule(audioWorkletProcessorUrl);

When this is done, we can send the audio chunks to the audio worklet processor:

workletNode.port.postMessage(
{ pcmData: float32Array },
{ transfer: [float32Array.buffer] }
);

The transfer option is used to transfer the ownership of the buffer to the audio worklet processor. This is important because the buffer is a large object and we don't want to copy it (that would waste CPU cycles). We want to transfer it.
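A small side effect worth knowing: once transferred, the underlying ArrayBuffer is detached on the sending side, so the typed array cannot be reused afterwards.

// Before the transfer, the array holds our samples
console.log(float32Array.length); // e.g. 128

workletNode.port.postMessage(
    { pcmData: float32Array },
    { transfer: [float32Array.buffer] }
);

// After the transfer, the buffer belongs to the worklet thread
console.log(float32Array.length); // 0: the array is detached on this side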

And that's it!

The full class looks like this:

import audioWorkletProcessorUrl from "./AudioWorkletProcessor.ts?worker&url";

export class OutputPCMStreamer {
private audioContext: AudioContext;
private mediaStreamDestination: MediaStreamAudioDestinationNode;
private workletNode: AudioWorkletNode | null = null;
private isWorkletLoaded = false;

constructor(sampleRate = 24000) {
this.audioContext = new AudioContext({ sampleRate });
this.mediaStreamDestination =
this.audioContext.createMediaStreamDestination();
}

// Initialize the AudioWorklet and load the processor script
public async initWorklet() {
if (!this.isWorkletLoaded) {
try {
await this.audioContext.resume();
await this.audioContext.audioWorklet.addModule(audioWorkletProcessorUrl);
this.workletNode = new AudioWorkletNode(this.audioContext, 'output-pcm-worklet-processor');
this.workletNode.connect(this.mediaStreamDestination);
this.isWorkletLoaded = true;
} catch (err) {
console.error('Failed to load AudioWorkletProcessor:', err);
}
}
}

// Method to get the MediaStream that can be added to WebRTC
public getMediaStream(): MediaStream {
return this.mediaStreamDestination.stream;
}

// Method to append new PCM data (Float32Array) to the audio stream
public appendPCMData(float32Array: Float32Array): void {
if (!this.isWorkletLoaded || !this.workletNode) {
console.error('AudioWorklet is not loaded yet.');
return;
}

// Send the PCM data to the AudioWorkletProcessor via its port
this.workletNode.port.postMessage(
{ pcmData: float32Array },
{transfer: [float32Array.buffer]}
);
}

// Method to close the AudioWorkletNode and clean up resources
public close(): void {
if (this.workletNode) {
// Disconnect the worklet node from the audio context
this.workletNode.disconnect();
this.workletNode.port.close(); // Close the MessagePort

// Set the worklet node to null to avoid further use
this.workletNode = null;
}

// Optionally, stop the AudioContext if no longer needed
if (this.audioContext.state !== 'closed') {
this.audioContext.close().then(() => {
console.log('AudioContext closed.');
}).catch((err) => {
console.error('Error closing AudioContext:', err);
});
}

this.isWorkletLoaded = false;
}
}

From there, you can use the OutputPCMStreamer class to stream audio data to a MediaStream that can be used in a WebRTC connection. Whenever the Realtime API sends an audio chunk, you convert it to a Float32Array and pass it to the OutputPCMStreamer instance via the appendPCMData method.
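To make the wiring concrete, here is a minimal sketch putting the previous snippets together (the names match the code above; error handling is omitted):

const pcmStreamer = new OutputPCMStreamer();
await pcmStreamer.initWorklet();

this.realtimeClient.on('conversation.updated', (event) => {
    const int16Array = event.delta?.audio;
    if (!int16Array) {
        return;
    }

    // Convert the 16-bit PCM chunk to the 32-bit float format expected by the Web Audio API
    const float32Array = new Float32Array(int16Array.length);
    for (let i = 0; i < int16Array.length; i++) {
        float32Array[i] = int16Array[i] / 32768.0;
    }

    pcmStreamer.appendPCMData(float32Array);
});

// The resulting MediaStream can then be attached to a WebRTC connection or an <audio> element
const mediaStream = pcmStreamer.getMediaStream();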

Success!

But this is only the beginning. So far, we are getting audio chunks from the Realtime API and playing them in the browser. Now, we need to send our audio data to the Realtime API!

Sending audio to the Realtime API

The Realtime API is expecting audio chunks in the PCM 16 format. This is the same format as the audio chunks we are receiving.

Fortunately, the openai-realtime-api-beta library handles the conversion of 32-bit float audio data to 16-bit little-endian PCM for us (the little-endian part is important here).

We will just have to make sure we are sending the samples at the requested 24kHz sample rate.

Turning a MediaStream into audio chunks turns out to be quite similar to what we did previously. We will use the Web Audio API and design a new worklet. This time, the worklet will receive a MediaStream as input -- in the case of WorkAdventure, where multiple people can talk to the AI at the same time, it can even receive several MediaStreams. The MediaStreams will be mixed together and sent back to the main thread using the postMessage API.

The code of our audio Worklet will look like this:

class InputAudioWorkletProcessor extends AudioWorkletProcessor {
process(inputs: Float32Array[][], outputs: Float32Array[][], parameters: Record<string, Float32Array>): boolean {

if (!inputs || inputs.length === 0 || inputs[0].length === 0) {
return true; // Keep processor alive
}

// Let's merge all the inputs in one big Float32Array by summing them
const mergedInput = new Float32Array(inputs[0][0].length);

for (let k = 0; k < inputs.length; k++) {
const input = inputs[k];
const channelNumber = input.length;
for (let j = 0; j < channelNumber; j++) {
const channel = input[j];
for (let i = 0; i < channel.length; i++) {
// We are dividing by the number of channels to avoid clipping
// Clipping could still happen if the sum of all the inputs is too high
mergedInput[i] += channel[i] / channelNumber;
}
}
}

this.port.postMessage({ pcmData: mergedInput }, {transfer: [mergedInput.buffer]});

return true; // Keep processor alive
}
}

// Required registration for the worklet
registerProcessor('input-pcm-worklet-processor', InputAudioWorkletProcessor);
export {};

A quick note about the parameters passed to the process function. The inputs parameter is an array of arrays of Float32Arrays. Why do we have an array of arrays of arrays?

The innermost array is the audio data for a single channel. Each value represents the audio level at a given time. But microphones can be stereo. In this case, we don't have one, but two channels. So we wrap the Float32Arrays into an array that will contain one channel if the microphone is mono, or two channels if the microphone is stereo.

But we can also have many input data sources (many microphones, or many streams coming from different WebRTC sources...). The outermost array is used to represent these different input sources.
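In other words, inside the process function the indexing reads inputs[sourceIndex][channelIndex][sampleIndex]. A purely illustrative example:

// With two connected sources, the first one being a stereo microphone:
const leftOfFirstSource = inputs[0][0];   // first source, left channel (~128 samples)
const rightOfFirstSource = inputs[0][1];  // first source, right channel
const monoSecondSource = inputs[1][0];    // second source, single channel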

On each call to the process function, we will sum all the input sources together to create a single audio stream, and send this stream to the main thread via the postMessage API.

The code to instantiate the audio Worklet in the main file will look like this:

import audioWorkletProcessorUrl from "./input-pcm-processor.ts?worker&url";

// Sample rate required by OpenAI is 24kHz
const audioContext = new AudioContext({ sampleRate: 24000 });

await this.audioContext.resume();

// Let's load the audio worklet processor (assuming the file is available in the pcm-processor.js)
await this.audioContext.audioWorklet.addModule(audioWorkletProcessorUrl);

// Instantiate the audio worklet node (we use the 'input-pcm-worklet-processor' string to match the name we used in the registerProcessor call)
const workletNode = new AudioWorkletNode(this.audioContext, 'input-pcm-worklet-processor');

// Connect the media stream as an input to our worklet node (this assume you already have a "mediaStream" variable connected
// to your microphone)
const sourceNode = this.audioContext.createMediaStreamSource(mediaStream);
sourceNode.connect(workletNode);

workletNode.port.onmessage = (event: MessageEvent) => {
const data = event.data.pcmData;
if (data instanceof Float32Array) {
console.log('Received audio data from the input worklet:', data);
} else {
console.error("Invalid data type received in worklet", event.data);
}
};

Sending the audio chunks to the Realtime API

Sending the audio chunks to the Realtime API is quite simple. We just need to call the appendInputAudio method of the RealtimeClient. The data we pass must be a 16-bit PCM audio buffer. We can use the RealtimeUtils.floatTo16BitPCM method provided by the openai-realtime-api-beta library to convert the 32-bit float audio data to 16-bit PCM.

this.realtimeClient.appendInputAudio(RealtimeUtils.floatTo16BitPCM(audioBuffer));

However, just doing this will lead to a conversation that will fail after a few seconds. In our experience, this is because we are calling the Realtime API too often.

Throttling the audio chunks

Out of the box, when I tested the audio worklet, it generated Float32Arrays of 128 samples (i.e. 128 values per array). Because we run at 24kHz, this means we are sending 24000 / 128 = 187.5 chunks per second. This is probably a bit too much for the Realtime API.

We don't need to send audio chunks at a very high rate. One audio chunk every ~50ms should be more than enough. That means we could target a chunk size of 24000 * 0.05 = 1200 samples.

We can simply buffer the audio chunks in the main thread and send them to the Realtime API when we have enough data.

const audioBuffer = new Float32Array(24000 * 0.05); // 50ms of audio data
let bufferIndex = 0;

workletNode.port.onmessage = (event: MessageEvent) => {
    const data = event.data.pcmData;
    if (data instanceof Float32Array) {
        if (bufferIndex + data.length > audioBuffer.length) {
            // If the buffer is about to overflow, fill it completely,
            // then send the audio to the Realtime API.
            const remainingSpace = audioBuffer.length - bufferIndex;
            audioBuffer.set(data.subarray(0, remainingSpace), bufferIndex);
            this.realtimeClient.appendInputAudio(RealtimeUtils.floatTo16BitPCM(audioBuffer));
            // Finally, put the samples that did not fit at the beginning of the buffer
            audioBuffer.set(data.subarray(remainingSpace));
            bufferIndex = data.length - remainingSpace;
        } else {
            // Otherwise, append the data to the buffer
            audioBuffer.set(data, bufferIndex);
            bufferIndex += data.length;
        }
    } else {
        console.error("Invalid data type received in worklet", event.data);
    }
};

Managing interruptions

The Realtime API sends audio chunks faster than we can play them. This means that at some point, we might have several seconds of audio buffered in our OutputAudioWorklet.

If we interrupt the conversation, this buffer is already on our computer and will still be played. This is not what we want.

Fortunately, in VAD mode, the Realtime API sends a conversation.interrupted event whenever it detects that the user has interrupted the conversation. We can use this event to clear the buffer. The buffer lives inside the OutputAudioWorklet, so we need to send a message to the worklet to clear it.

Let's say that when the worklet receives the following message, it clears its buffer:

{
emptyBuffer: true
}

The updated audio worklet code will look like this:

class OutputAudioWorkletProcessor extends AudioWorkletProcessor {
private audioQueue: Float32Array[] = [];

constructor() {
super();
this.port.onmessage = (event: MessageEvent) => {
// In case the event is a buffer clear event, we empty the buffer
if (event.data.emptyBuffer === true) {
this.audioQueue = [];
} else {
const data = event.data.pcmData;
if (data instanceof Float32Array) {
this.audioQueue.push(data);
} else {
console.error("Invalid data type received in worklet", event.data);
}
}
};
}

//...
}

On the main thread side, we can add an additional method to the OutputPCMStreamer class to clear the buffer:

export class OutputPCMStreamer {
// ...
public resetAudioBuffer(): void {
if (!this.isWorkletLoaded || !this.workletNode) {
console.error('AudioWorklet is not loaded yet.');
return;
}

// Ask the AudioWorkletProcessor to empty its buffer via its port
this.workletNode.port.postMessage({ emptyBuffer: true });
}
}

When we receive the conversation.interrupted event, we can call the resetAudioBuffer method of the OutputPCMStreamer instance.

this.realtimeClient.on('conversation.interrupted', () => {
outputPCMStreamer.resetAudioBuffer();
});

What's remaining?

The system is working well, but there is still room for improvement.

Remember that OpenAI sends audio chunks faster than we can play them? Now imagine the conversation is interrupted. We are going to remove a lot of audio chunks from the buffer, but we never tell the Realtime API that those chunks have not been played. So the Realtime API will "think" we heard everything it said, when we didn't.

We will fix this in the next article.

Using WorkAdventure?

If you are using WorkAdventure (you should!) and are looking to implement a bot using the Realtime API, we added a few methods to the scripting API to help you with that.

You can now use WA.player.proximityMeeting.listenToAudioStream to listen to the audio streams coming from a bubble, and WA.player.proximityMeeting.startAudioStream to send an audio stream to a bubble.

This will make things considerably easier for you as we take care of the audio worklet instantiation and the audio stream management.

Conclusion

This is it! We have a fully working system that allows us to interact with bots via voice! The bots' reactivity is insane. The conversation is smooth and the tone of the bot is appropriate.

It took us about 3 days of work to get this first version working and we are very happy with the result. We are now looking forward to integrating this system deeper into WorkAdventure, by adding more features via the function calling feature of the Realtime API!

Next steps will be to have the bot react to the environment, guide you through your office, and more.