Real Time Audio Manipulation in the Browser
How AudioContext saved the day (again)
Shane Kerr
Jan 29, 2026
Problem
The instinct is almost always correct - push the hard stuff to the backend. Audio manipulation and mixing of music, effects and a number of speaker tracks certainly sounds like the hard stuff. That’s what we did, that’s what we do; we have a pipeline to handle exactly this. Let me take a step back and walk you through what we have, and what the issue is, using a theoretical example. It’s 2027 and there exists a fantastic reboot of the Golden Girls starring Marisa Tomei, Courteney Cox, Gwendoline Christie, Tina Fey and Paul Rudd as Stanley. Not surprisingly the whole budget was blown on the English version; just look at that star power! We run the episodes through the Mockingbird pipeline: we extract what was said by whom, we translate that text, we generate speech in a new language, we mix each speaker down to a single vocal track, and we mix that track along with the background music and effects. Bingo bango, Las Chicas Doradas or Die Goldenen Mädchen.
In our dubbing review feature we let you change the text and re-run the TTS for any specific utterance. That leaves us with a challenge: how do we get these new utterances into the video preview? How can we mix what might be tens of speakers on the fly and remain responsive?
Let’s consider our options and the issues.
Use the existing pipeline to regenerate the mix. This felt like the easy option.
Pros: We are not changing how we approach the files - post processing, mixing and overlap mitigation all stay the same, maintaining complete consistency.
Cons: A terribly long round trip, with file transfer and full component execution on every change.
We could ask the user to evaluate the new TTS by listening to that utterance in isolation, but we knew that context is usually the reason for the change in the first place. Getting the right inflection or tone, or adjusting a translation to fit better, whether for impact or timing, needs to be judged within the full mix.
We had separated the music and effects from the vocals, but kept the vocals as a single mixed file. Though having every speaker’s individual vocal track would have made execution easier, we couldn’t be sure the browser would have enough memory to load them all.
Approach
Each utterance was contained in a segment. With the start and end times we know where in the main vocal file an utterance exists. If an utterance’s TTS changed we could punch out the section it occupies and insert the new TTS. We set out to implement this approach on the front end.
We’ll dig into the specifics and run over a number of handy uses of the AudioContext. I’ll introduce the API, nodes, audio graphs, and dig into some simple audio manipulation. We’ll also look briefly at how to connect all this to a video.
AudioContext
For most audio applications on the web, the <audio> tag works just fine. It plays the audio; in fact, that’s all it is: a black-box audio player. But what if you need more control? The AudioContext gives us access to the raw signal chain, allowing us to intercept and manipulate the audio data in real time.
The Audio Context is the sandbox where all your audio lives and interacts. It owns the pieces that facilitate audio playback:
It owns the timeline and keeps track of currentTime (with a high-precision internal clock)
It owns the destination (think your speakers)
It owns and manages the state: running, suspended, closed
Within this context we are able to build an Audio Graph with audio nodes and connections as the building blocks
Nodes: Audio Sources, Effects, Analyzers, basic audio operations
Connections: Virtual cables connecting nodes
The Web Audio API handles operations inside an Audio Context, and allows modular routing through audio nodes. I think of this like a single channel on a mixer, or like a guitar setup: guitar (source), connected to the pedals for gain/effects (processing nodes), then to the amp (destination), all of it living inside the studio (AudioContext).
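Here’s a minimal sketch of that chain using an oscillator as a stand-in source. None of this is from our codebase; it’s just the smallest possible graph:

// A minimal audio graph: source -> gain -> destination
const ctx = new AudioContext()

const source = ctx.createOscillator() // the "guitar"
const gain = ctx.createGain()         // the "pedal"
gain.gain.value = 0.5                 // attenuate the signal

source.connect(gain)                  // patch the cables
gain.connect(ctx.destination)         // the "amp"

// Assumes the context has been resumed after a user gesture - more on that below
source.start()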
Let’s have a look at some specifics of the AudioContext
Destination: This is the final node in the graph; it represents the actual audio hardware we are listening to.
currentTime: A double representing the time in seconds since the context started. This is the master clock that allows syncing multiple source nodes and scheduling of events.
state: Crucial for modern browsers. The context often starts in a suspended state to prevent auto-play; we’ll look at this later.
So what does using the AudioContext do for us and our issue? First, it gives us full control over the mix of vocals, music and effects, and the ability to swap in sources when we need to. Second, it exposes the raw audio data and allows us to manipulate it directly. We’ll briefly go over the concepts we need for this, but I would suggest reading more about the basic concepts of audio graphs, samples and channels in the MDN Web Audio API documentation.
Audio Samples
Digital audio is made possible by breaking a continuous signal down into a discrete one (sampling). These samples are 32-bit floating point values representing the audio at a specific moment within a particular channel (left/right for stereo). The sample rate is the number of those samples played per second, measured in Hz.
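For example, one second of audio at a 48,000 Hz sample rate is 48,000 samples per channel, so a 30-second stereo track at that rate is 48,000 × 30 × 2 = 2,880,000 floats, or roughly 11 MB at 32 bits per sample. Worth keeping in mind for the memory considerations later.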
Audio Buffers
Audio buffers are groupings of these samples and have specific characteristics.
Number of channels (1 for mono, 2 for stereo, 6 for 5.1)
Sample rate
Length - the number of samples inside the buffer
Manipulating the data in these buffers is what enables our punch-out approach, letting us slide new audio into an existing buffer. Though there are many possible source types, we will mostly be working with AudioBuffer.
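As a small illustration of those three characteristics (the numbers are arbitrary, and ctx is a context created as in the earlier sketch):

// Create an empty one-second stereo buffer at 44.1 kHz
const buffer = ctx.createBuffer(2, 44100, 44100)

console.log(buffer.numberOfChannels) // 2
console.log(buffer.sampleRate)       // 44100
console.log(buffer.length)           // 44100 samples per channel
console.log(buffer.duration)         // 1 (second)

// Each channel is a Float32Array of samples we can read and write
const left = buffer.getChannelData(0)
left[0] = 0.25 // set the very first sample of the left channel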
The Setup
Let’s look at the setup for the AudioContext:
const AudioContextClass = globalThis.window?.AudioContext || globalThis.window?.webkitAudioContext
this.ctx = new AudioContextClass()
One thing to keep in mind here is that when you initialize a new AudioContext, it often starts in a suspended state. If you try to play sound immediately without user intervention, the browser will ignore you or throw a warning in the console. The audio clock effectively starts only after the user has interacted with the page (clicked, tapped, or pressed a key). We will tie this to our video playback interaction, but any user action will work to unlock playback.
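A minimal sketch of that unlock, assuming it lives in the same class as this.ctx and this.mediaElement, and that playButton is a hypothetical play control:

// Browsers start the context suspended until a user gesture
playButton.addEventListener('click', async () => {
  if (this.ctx.state === 'suspended') {
    await this.ctx.resume() // unlock the audio clock
  }
  this.mediaElement.play() // start the video; our graph is now audible
})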
Next we will load the audio files to get the audio buffer we will be handling.
// "response" is the fetch() response for this track's audio file
const arrayBuffer = await response.arrayBuffer()
const sourceBuffer = await this.ctx.decodeAudioData(arrayBuffer)

So that we can control the volume of each track individually, and the volume of all the tracks at once, we will create a couple of GainNodes: one attached to the destination itself, and one per track.
this.masterGain = this.ctx.createGain()
this.masterGain.connect(this.ctx.destination)

We’ll use this master gain as a gate to our destination, connecting each source to its own gain node, then to the master gain, and then on to the destination.
const gainNode = this.ctx.createGain()
gainNode.gain.value = 0.8
// An AudioBuffer can't be connected directly, so wrap it in a buffer source node
const trackSource = this.ctx.createBufferSource()
trackSource.buffer = sourceBuffer
trackSource.connect(gainNode)
gainNode.connect(this.masterGain)

The last setup piece is controlling the audio from the video, and tying playback to it. We can include the video audio in our playback by using the context’s createMediaElementSource:
const vidSource = this.ctx.createMediaElementSource(this.mediaElement)
const vidGain = this.ctx.createGain()
vidSource.connect(vidGain).connect(this.masterGain)

Our current Audio Graph, assuming we have loaded the BG and Vocals audio tracks, looks like this: each track’s source feeds its own gain node, the video element feeds vidGain, and all of those feed masterGain, which feeds the destination.
A word on syncing external audio playback with the video. A simple approach is to use video events to drive AudioContext playback. However, because <video> and AudioContext use independent clocks, audio drift will naturally accumulate over time. To combat this, we treat the video element as the authoritative timeline and continuously measure drift between video.currentTime and the AudioContext clock. Small offsets are corrected using subtle playback-rate adjustments, while larger offsets trigger a controlled re-alignment of the audio graph. Significant events such as seeking, buffering, or tab suspension always force a hard resync.
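Here is a rough sketch of that loop. The thresholds are arbitrary, and getAudioPlayheadTime, hardResync and this.sources are hypothetical stand-ins for however you track the audio playhead, rebuild the graph and hold the active buffer sources:

// Video is the authoritative clock; nudge or resync the audio to follow it
const checkDrift = () => {
  const drift = this.mediaElement.currentTime - this.getAudioPlayheadTime()

  if (Math.abs(drift) > 0.25) {
    // Large offset: a seek, buffering or a suspended tab - rebuild and restart the sources
    this.hardResync(this.mediaElement.currentTime)
  } else if (Math.abs(drift) > 0.03) {
    // Small offset: nudge the playback rate so the audio catches up gradually
    const rate = drift > 0 ? 1.02 : 0.98
    this.sources.forEach((s) => s.playbackRate.setValueAtTime(rate, this.ctx.currentTime))
  } else {
    // In sync: make sure we are back at normal speed
    this.sources.forEach((s) => s.playbackRate.setValueAtTime(1, this.ctx.currentTime))
  }
}
setInterval(checkDrift, 500)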
Now that we have access to the Vocal track samples through sourceBuffer, we can look at the case where we want to change a specific utterance.
Update Utterances
The concept here is pretty simple. We regenerate the utterance, apply the same post processing that was applied to every utterance in the first mix of the Vocals track, and then replace the samples in sourceBuffer with those from the new utterance. First we need a function that does the punching: it takes the two buffers, plus the startTime and endTime of where we want to insert.
A couple of simplifications:
We will assume the sampleRates are the same as the base
We will also assume that the number of channels is the same
We will fit the new clip from the start:
If it is too long we will clip it to fit
If it is too short we will add silence to fit
async punchIn(
  baseBuffer: AudioBuffer,
  clipBuffer: AudioBuffer,
  startTime: number,
  endTime: number,
  options: { fadeMs: number } = { fadeMs: 10 }, // edge fade length in milliseconds
): Promise<AudioBuffer> {

The first step is to convert time to samples using the sample rate:
// Convert the insert window from seconds to sample indices
const sampleRate = baseBuffer.sampleRate
const startSample = Math.floor(startTime * sampleRate)
const endSample = Math.floor(endTime * sampleRate)
const windowSize = endSample - startSample

We need to create a destination buffer matching the base buffer, because in-place buffer manipulation doesn’t behave the way we want. So we will be replacing the whole source buffer with the new one:
const outputBuffer = this.ctx.createBuffer(
  baseBuffer.numberOfChannels,
  baseBuffer.length,
  baseBuffer.sampleRate,
)

Now, for each channel, we write the samples we want into the new buffer: taking from the base unless we are inside the window, and crossfading the window boundaries to avoid any audible blips at the edges:
for (let channel = 0; channel < baseBuffer.numberOfChannels; channel++) {
  const baseData = baseBuffer.getChannelData(channel)
  const outData = outputBuffer.getChannelData(channel)
  // Re-use clip channels if the clip has fewer channels than the base
  const clipCh = channel % clipBuffer.numberOfChannels
  const clipData = clipBuffer.getChannelData(clipCh)

  // Everything before the window comes straight from the base
  outData.set(baseData.subarray(0, startSample), 0)

  // Copy the clip into the window, trimming if it is too long;
  // a too-short clip leaves the zero-initialized remainder as silence
  const copyLength = Math.min(clipData.length, windowSize)
  const segment = new Float32Array(clipData.subarray(0, copyLength))
  this.crossFade(segment, options.fadeMs, sampleRate)
  outData.set(segment, startSample)

  // Everything after the window comes straight from the base
  outData.set(baseData.subarray(endSample), endSample)
}

return outputBuffer
}

That’s it. Now that we have a new buffer containing the new utterance, we can restart playback to use the new audio.
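The crossFade helper used above isn’t shown in this post. A minimal sketch of what it might look like is a short linear fade-in and fade-out applied in place at the edges of the segment (the linear ramp and the method shape are my assumptions, not the actual implementation):

// Fade the first and last few milliseconds of the segment in place
private crossFade(segment: Float32Array, fadeMs: number, sampleRate: number): void {
  const fadeSamples = Math.min(
    Math.floor((fadeMs / 1000) * sampleRate),
    Math.floor(segment.length / 2), // never let the fades overlap
  )
  for (let i = 0; i < fadeSamples; i++) {
    const t = i / fadeSamples
    segment[i] *= t                      // fade in at the start
    segment[segment.length - 1 - i] *= t // fade out at the end
  }
}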
There are a number of edge cases and some more complex mixing strategies for this specific use case that we didn’t cover here, but hopefully this gives a brief overview of what can be done with AudioContext and manipulation of audio data.
Considerations
Size matters: There is only so much browser memory available
Support matters: Not every feature is supported by every browser
Audio processing can get expensive fast; see AudioWorklet
Buffers are effectively immutable during playback (why we created a new one)
Nodes are not auto-disconnected. They persist until:
Explicitly disconnected (see the sketch below)
Or garbage collected
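For the one-shot buffer sources used above, a small cleanup sketch might look like this (onended is a real AudioBufferSourceNode event; the surrounding variables are assumed from the earlier setup):

// Buffer sources are one-shot; disconnect them when they finish so they can be collected
const source = this.ctx.createBufferSource()
source.buffer = sourceBuffer
source.connect(gainNode)
source.onended = () => {
  source.disconnect()
}
source.start()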