The road to gapless
Part 3 of the devblog for Musicat. Originally published on GitHub Discussions.
Ever since I started working on musicat, the audiophile in me wanted to implement gapless playback. Yet I decided to park it, live with the gaps on The Dark Side Of The Moon and just use the `<audio>` element. That way I could focus on building out the features that I needed first.
But now that I’m quite happy with how the app is turning out, it’s time to (temporarily) stop making fancy features and start getting my hands dirty in a PCM stream. It’s also time to offload the CPU-heavy stuff away from the web frontend, and start actually learning Rust, since this is a Tauri app after all.
So over the past few months I have played around with a few ideas to try and enable gapless support, constantly re-architecting the whole playback engine and eventually scrapping WebAudio altogether and doing the decoding and the playback in Rust (Symphonia + cpal). Of course, going native is great, but let’s have a look at why the various WebAudio approaches didn’t work:
- Approach 1: No WebAudio, just `<audio src="local track">`
  - Need two `<audio>` elements and try to crossfade between them during the gapless transition (sketched below).
  - No control over decoding, resampling and playback when changing `src`.
  - No access to samples, can’t precisely time things, so gaps are inevitable.
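For reference, the crossfade hack looks roughly like the sketch below. It's a minimal version (the helper and the 50 ms fade window are my own illustration, not musicat's code), and it already shows why timer-based scheduling can never be sample-accurate:

```ts
// Hypothetical helper: fade `current` out and `next` in over `ms` milliseconds.
function crossfade(current: HTMLAudioElement, next: HTMLAudioElement, ms = 50) {
  next.volume = 0;
  void next.play(); // fire-and-forget; play() returns a promise
  const start = performance.now();
  const tick = () => {
    const t = Math.min((performance.now() - start) / ms, 1);
    current.volume = 1 - t;
    next.volume = t;
    if (t < 1) requestAnimationFrame(tick);
    else current.pause();
  };
  requestAnimationFrame(tick);
}
```

You'd call this a moment before `current` ends, but "a moment before" is exactly the problem: `timeupdate` events and rAF ticks are orders of magnitude coarser than a single sample.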
- Approach 2: mse-gapless-poc
  - Consisted of fetching the audio chunks manually over Tauri’s asset protocol and feeding them to a `MediaSource`, to which you can append chunks using `SourceBuffer.appendBuffer()` (sketched below). Browsers can be very strict about how you use this buffer, and I found the implementation to be very clunky. I believe this is what YouTube and Spotify use.
  - Managing buffered/played chunks was a pain: you constantly run into `QuotaExceededError`, and the quota is different for every browser.
  - Decoding and resampling are still handled by the browser/webview.
  - Without sample-level precision, I wasn’t sure how to implement precise seeking, so I had to approximate the desired seek position and guess which chunk to request. Not ideal.
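For the curious, here's a minimal sketch of that MSE flow, assuming pre-split MP3 chunks. The URLs, helper names and eviction strategy are illustrative, not the actual poc code:

```ts
// Hypothetical flow: append fetched chunks to a SourceBuffer, evicting played
// media when the browser's (per-browser, undocumented) quota is hit.
async function playChunks(audio: HTMLAudioElement, chunkUrls: string[]) {
  const mediaSource = new MediaSource();
  audio.src = URL.createObjectURL(mediaSource);
  await new Promise((r) =>
    mediaSource.addEventListener("sourceopen", r, { once: true })
  );

  const buf = mediaSource.addSourceBuffer("audio/mpeg"); // assuming MP3 chunks
  const updated = () =>
    new Promise((r) => buf.addEventListener("updateend", r, { once: true }));

  for (const url of chunkUrls) {
    const chunk = await (await fetch(url)).arrayBuffer();
    try {
      buf.appendBuffer(chunk);
    } catch (e) {
      if ((e as DOMException).name !== "QuotaExceededError") throw e;
      // Evict everything up to just behind the playhead, then retry once.
      buf.remove(0, Math.max(audio.currentTime - 1, 0.1));
      await updated();
      buf.appendBuffer(chunk);
    }
    await updated(); // appendBuffer is async; wait before the next append
  }
  mediaSource.endOfStream();
}
```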
- Approach 3: webaudio-worklet-poc
  - Here we take control of decoding the chunks using WASM libraries, then feed the decoded raw PCM into an AudioWorklet (sketched below).
  - The browser is no longer in charge of resampling, so I tried resampling in the AudioWorklet, which proved to be quite slow for real-time use. So I settled for re-initializing the AudioContext with a new sample rate if it changes between files, meaning gapless was only possible between files that had the same sample rate.
  - However, this still introduced gaps, sometimes even between chunks, especially when playing MP3s, where compression requires data from previous frames. Basically: don’t decode MP3s chunk-by-chunk unless you have split the file on exact frame boundaries.
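The worklet side looked roughly like this minimal, mono-only sketch. The processor name is made up, and it assumes decoded Float32 chunks are posted from the main thread:

```ts
// pcm-player.worklet.ts: a sketch of an AudioWorklet fed raw PCM, not
// musicat's actual processor.
class PcmPlayerProcessor extends AudioWorkletProcessor {
  private queue: Float32Array[] = [];
  private offset = 0; // read position inside queue[0]

  constructor() {
    super();
    this.port.onmessage = (e: MessageEvent<Float32Array>) =>
      this.queue.push(e.data);
  }

  // Called for every 128-frame render quantum at the AudioContext sample rate.
  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const out = outputs[0][0];
    let written = 0;
    while (written < out.length && this.queue.length > 0) {
      const chunk = this.queue[0];
      const n = Math.min(out.length - written, chunk.length - this.offset);
      out.set(chunk.subarray(this.offset, this.offset + n), written);
      written += n;
      this.offset += n;
      if (this.offset === chunk.length) {
        this.queue.shift();
        this.offset = 0;
      }
    }
    // Any frames we couldn't fill stay silent: an audible gap on underrun.
    return true; // keep the processor alive
  }
}

registerProcessor("pcm-player", PcmPlayerProcessor);
```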
- Approach 4: webrtc-worklet-poc
- The decoding is offloaded to the Rust backend using Symphonia, which streams the raw PCM to the web app over a WebRTC DataChannel, played back using an AudioWorklet, no chunking needed. When I thought of this, I knew it was overkill for a local app, but I tried it anyway. This research paper follows a similar approach.
- Requires flow control on both sides, to make sure the sender keeps up with playback but doesn’t send so fast that it overflows the ring buffer. I had to implement a VLC-like algorithm where the consumer sends the producer its “receive rate” at regular intervals, and the decoder slows down or speeds up accordingly, between 0.8x and 3x playback speed (sketched below).
- The approach generally worked, and since it’s a local app there weren’t any “lost packets”. So technically we could do gapless. But the CPU usage increased heavily due to WebRTC, and when moving the app to the background I sometimes experienced the WebKit WebView throttling the connection, causing the playback to stutter.
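The consumer half of that flow control is simple enough to sketch. The JSON feedback message below is illustrative, not the poc's actual wire format:

```ts
// Hypothetical consumer: count PCM bytes arriving on the DataChannel and
// report the receive rate back to the Rust producer at a fixed interval.
function monitorReceiveRate(channel: RTCDataChannel, intervalMs = 500): void {
  let bytesReceived = 0;
  channel.binaryType = "arraybuffer";

  channel.onmessage = (e: MessageEvent<ArrayBuffer>) => {
    bytesReceived += e.data.byteLength;
    // ...hand the PCM off to the AudioWorklet's buffer here...
  };

  setInterval(() => {
    const bytesPerSecond = bytesReceived / (intervalMs / 1000);
    bytesReceived = 0;
    // The producer uses this to clamp its decode pace between 0.8x and 3x.
    channel.send(JSON.stringify({ receiveRate: bytesPerSecond }));
  }, intervalMs);
}
```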
- Final approach: rust-audio-backend
  - Decoding, playback and resampling are done in Rust using Symphonia + cpal on a separate thread.
  - Direct control over the raw audio stream sent to the native device.
  - No flow control required, since we’re now using a ring buffer to send samples to the device.
  - The frontend communicates the necessary file, volume and seek information (sketched below).
  - WebRTC is still used, but only for streaming the real-time FFT data for the spectroscope visualization on the frontend.
  - CPU usage is acceptable, at around 15-20%.
  - Road to gapless and other audio features paved ahead!
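To give an idea of how thin the frontend layer becomes, here's a minimal sketch assuming Tauri v2's invoke and made-up command names; musicat's actual commands may differ:

```ts
import { invoke } from "@tauri-apps/api/core"; // Tauri v2

// Hypothetical commands: the Rust side owns the Symphonia decoder, the ring
// buffer and the cpal stream; the frontend only sends intent.
export const playFile = (path: string) => invoke("play_file", { path });

export const setVolume = (volume: number) => invoke("set_volume", { volume });

export const seek = (seconds: number) => invoke("seek", { seconds });
```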
Essentially, I went down a rabbit hole, getting progressively closer to the raw audio stream along the way. And I’m so glad I did, because I feel like I have peeled off all the layers of abstraction and browser limitations that were getting in the way. In hindsight, building a desktop audio player using web technologies probably wasn’t the best idea. But I’m happy with a hybrid architecture where Rust is the I/O, audio and heavy-lifting layer, and the Svelte app is just the presentation layer. Granted, the database still uses IndexedDB, which lives in the browser, but that’s a problem for another day.