Making a better sound engine

Hey folks!

I hope you’re ready for another technical article, although I promise — this time — to keep the code down to a few quick sketches.

Two weeks ago I briefly touched on the subject of sound: how SDL_mixer didn’t cut it, and how we had to code our own engine. But I didn’t really get into the details, and I wanted to quickly do so.

First of all, I want to state clearly that SDL_mixer doesn’t suck. On the contrary, it handles the 99% case very well: a game with a single background music playing at any given time, and various one-shot sound effects.

In most games, the playback position of the music makes no difference, so it makes sense for SDL_mixer not to expose that information, especially since it has to support a variety of backends: not only compressed audio formats like Ogg, MP3, and FLAC, but also sequenced formats like MIDI and tracker modules like MOD.

In our case, though, we really, really need to know where exactly we are in a given music. And we might want to play several music files at the same time, synchronized, and adjust volumes individually. And we also might want to store our one-shot sound effects as .ogg to save a little disk space (we can preload them, performance really isn’t an issue here).


If we’re going to code our own audio engine, we’re going to need a way to send audio to the sound interface. OpenAL is pretty good at doing that. The joke I made two weeks ago about OpenAL not really “making sound” comes from the fact that, like its cousin OpenGL, OpenAL is really just a specification, and it’s up to the driver to “render” sound — but that’s a pretty uninteresting implementation detail.

What we’re really interested in is how to send audio down the pipes. And there we go with diagrams again:

Each buffer contains a bunch of audio samples: each sample is a value that determines how high on the sound wave you’re riding. If you want to play stereo sound, you need twice as many samples. If you store your samples as 8-bit integers, you only have 256 values to choose from. 16-bit will give you 65536 values per sample: that’s how many different height values you have available.

And then there’s frequency: the more samples per second, the better, and you measure that in hertz: for example, 22050Hz (if you’re living in the past). I’m simplifying here, of course. It’s generally accepted that anything higher than 16-bit / 44.1kHz or 16-bit / 48kHz is pointless: per the Nyquist theorem, those rates already cover the full range of frequencies the human ear can perceive.

So, playing a sound goes a little bit like this:

  • Create a source
  • Queue a few buffers
  • Ask OpenAL to play the source
  • Once in a while, unqueue processed buffers and queue fresh buffers
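In C, with OpenAL, those first three steps might look roughly like this. It’s a sketch, not gospel: `decode_next` is a hypothetical helper standing in for whatever produces uncompressed PCM data, and the buffer count and sizes are arbitrary:

```c
#include <AL/al.h>
#include <stdint.h>

#define NUM_BUFFERS 4
#define BUF_BYTES   16384

/* Hypothetical helper: fills `pcm` with up to `max` bytes of decoded
   16-bit stereo audio and returns how many bytes it wrote. */
extern int decode_next(int16_t *pcm, int max);

void start_streaming(ALuint *source_out, ALuint buffers[NUM_BUFFERS]) {
    ALuint source;
    alGenSources(1, &source);
    alGenBuffers(NUM_BUFFERS, buffers);

    /* Pre-fill every buffer with decoded audio. */
    int16_t pcm[BUF_BYTES / sizeof(int16_t)];
    for (int i = 0; i < NUM_BUFFERS; i++) {
        int n = decode_next(pcm, BUF_BYTES);
        alBufferData(buffers[i], AL_FORMAT_STEREO16, pcm, n, 44100);
    }

    /* Queue everything on the source, then start playback. */
    alSourceQueueBuffers(source, NUM_BUFFERS, buffers);
    alSourcePlay(source);
    *source_out = source;
}
```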

The last step is crucial: if you don’t queue buffers fast enough, you’ll experience an audio underrun: sound will stop playing for a moment, until the queue is filled again.

Nowadays it’s not something we hear much, but a few years back, on single-core processors, it wasn’t uncommon for the CPU to be unable to generate audio fast enough for the sound card when doing heavy sound synthesis. This would result in choppy audio.

We don’t want to queue too many buffers either: while it’s unlikely that we’ll fill a modern-day computer’s memory with a few minutes of uncompressed audio, we don’t want to use more memory than necessary.


Now, we know how to play sound, but the explanations above might have clued you in on the fact that uncompressed audio can get big pretty fast. 16 bits multiplied by 44100 Hz, multiplied by 2 channels, multiplied by god knows how many seconds — that’s a lot of bits!
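Running those numbers, as a quick sketch (the function is mine, just arithmetic):

```c
/* Uncompressed PCM throughput: how many bytes one second of audio needs. */
static long pcm_bytes_per_second(long bits_per_sample, long sample_rate,
                                 long channels) {
    return bits_per_sample / 8 * sample_rate * channels;
}

/* 16-bit, 44100 Hz, stereo:
   pcm_bytes_per_second(16, 44100, 2) == 176400 bytes per second,
   so a minute of CD-quality audio weighs in at roughly 10.6 MB. */
```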

In practice, we store audio in a compressed file format. Ogg Vorbis is a lossy audio compression format, which means you lose information when you compress audio using it. That’s definitely worth it for the savings involved, though — and it takes a good human ear to tell the difference!

Vorbis is also a variable bit rate (VBR) audio codec. It means that every packet (which contains a bunch of samples) can be of a different size, depending on what goes on at that particular portion of the sound. If you have a few seconds of silence, you’ll get a series of very small packets. If you’re in the middle of a Progressive Metal solo, packets might get bigger to maintain good fidelity.

So, our ogg vorbis file looks a little bit like this:

Except there’s many, many more packets.

If we had to decode it ourselves, things would get really complicated at this point. I already did that before and frankly, I have a game to finish. Instead, we can use libvorbis/libvorbisfile, generously provided by the Xiph.Org Foundation under a commercial-friendly BSD license.

libvorbis is a very neat abstraction because it allows us to treat the ogg file as a music stream that has a length, and from which we can extract any amount of (uncompressed audio) samples we want, no matter where we are in the file, how big the packets actually are, what its compression settings are, etc.
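Here’s a sketch of what that abstraction looks like in practice with libvorbisfile — the file name is made up, and error handling is minimal:

```c
#include <stdio.h>
#include <vorbis/vorbisfile.h>

/* Decode up to `max_bytes` of 16-bit little-endian PCM from an already
   opened OggVorbis_File. We loop because ov_read() returns one
   packet-sized chunk at a time, not the full amount we asked for. */
static long read_pcm(OggVorbis_File *vf, char *pcm, long max_bytes) {
    long total = 0;
    while (total < max_bytes) {
        int bitstream;
        long n = ov_read(vf, pcm + total, (int)(max_bytes - total),
                         0 /* little-endian */, 2 /* 16-bit words */,
                         1 /* signed */, &bitstream);
        if (n <= 0) break; /* end of file, or an error */
        total += n;
    }
    return total;
}

int main(void) {
    OggVorbis_File vf;
    if (ov_fopen("music.ogg", &vf) != 0) return 1; /* hypothetical file */

    /* The stream has a length... */
    printf("track length: %f seconds\n", ov_time_total(&vf, -1));

    /* ...and hands us however many uncompressed samples we want. */
    char pcm[16384];
    long n = read_pcm(&vf, pcm, sizeof(pcm));
    printf("decoded %ld bytes\n", n);

    ov_clear(&vf);
    return 0;
}
```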

From there, our battle plan is abundantly clear:

Everything under the timeline is handled by OpenAL / a sound driver / an operating system / whatever low-level trickery is going on there. All we have to do is check on our sound sources periodically, and when a certain number of buffers have been processed, refill the queue with freshly decoded buffers we get from libvorbis.
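That periodic check might be sketched like so — again with a hypothetical `decode_pcm` helper standing in for the libvorbis side:

```c
#include <AL/al.h>

/* Hypothetical decoder hook: fills `pcm` with up to `max` bytes of
   decoded 16-bit stereo PCM and returns how many bytes it wrote. */
extern long decode_pcm(char *pcm, long max);

/* Call this once per frame: recycle any buffers OpenAL is done with. */
void refill_source(ALuint source) {
    ALint processed = 0;
    alGetSourcei(source, AL_BUFFERS_PROCESSED, &processed);

    while (processed-- > 0) {
        char pcm[16384];
        long n = decode_pcm(pcm, sizeof(pcm));
        if (n <= 0) break; /* end of stream: let the queue drain */

        ALuint buffer;
        alSourceUnqueueBuffers(source, 1, &buffer);
        alBufferData(buffer, AL_FORMAT_STEREO16, pcm, (ALsizei)n, 44100);
        alSourceQueueBuffers(source, 1, &buffer);
    }
}
```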

There are two kinds of things that could go wrong with this scheme:

  • The game suddenly running extremely slowly, e.g. taking a few seconds per frame. In this case, we would run out of queued buffers and the audio would cut out (although, if that happens, the player has other problems, believe me)
  • Decoding audio takes too much time: FPS would drop, although probably not significantly. With modern hardware, decoding a very small amount of vorbis audio is nothing compared to the strain our very naive graphics engine puts on both the CPU and the GPU.

A final subtlety

Let’s go over our checklist again, to make sure our new engine satisfies all our needs:

  • Play multiple tracks at the same time: sure, OpenAL allows a great number of sources. As long as the CPU is fast enough to decode & mix, it’ll work.
  • Know the length of tracks: libvorbis gives us that information.
  • Know where we are in the tracks: it’s a simple question of knowing how much time each buffer represents, and keeping track of how many buffers have been processed.

For determining the current time, the above method is actually not enough: intra-buffer precision is required when you rely on it to display musical notes flying towards the player. Thankfully, OpenAL gives us one more way to retrieve the playback position in a given source. By adding the total amount of time we have already dequeued from the source to the current offset within the source (which counts both processed-but-not-yet-dequeued buffers and buffers awaiting processing), we get precise-enough data.
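A sketch of that bookkeeping — `dequeued_seconds` is a counter I’m assuming we update every time we unqueue a processed buffer elsewhere in the engine:

```c
#include <AL/al.h>

/* Seconds of audio already unqueued from the source; updated whenever
   we recycle a processed buffer (hypothetical bookkeeping variable). */
static double dequeued_seconds;

/* Precise playback position: time already dequeued, plus OpenAL's
   offset within the current queue, which covers both
   processed-but-still-queued buffers and the partially played one. */
double playback_position(ALuint source) {
    ALfloat offset = 0.0f;
    alGetSourcef(source, AL_SEC_OFFSET, &offset);
    return dequeued_seconds + (double)offset;
}
```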

But what we didn’t talk about is seeking! Jumping to a particular position in time within a track. Our whole ‘refill when needed’ scheme falls apart there, because suddenly the data we already queued is not valid anymore!

The solution is actually quite simple:

  • Tell libvorbis to seek to a certain offset in seconds — and let it do its magic
  • Stop our source
  • Unqueue all buffers from our source
  • Ask libvorbis for a few buffer’s worth of uncompressed audio
  • Queue them
  • Play our source
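The steps above map to a sketch like this, with the same kind of hypothetical `decode_pcm` helper wrapping the libvorbis side:

```c
#include <AL/al.h>
#include <vorbis/vorbisfile.h>

/* Hypothetical decoder hook: fills `pcm` with up to `max` bytes of
   decoded 16-bit stereo PCM and returns how many bytes it wrote. */
extern long decode_pcm(OggVorbis_File *vf, char *pcm, long max);

void seek_to(ALuint source, OggVorbis_File *vf, double seconds) {
    /* Let libvorbisfile do its magic. */
    ov_time_seek(vf, seconds);

    /* Stop the source: everything we had queued belongs to the old
       position, so we drop it all. */
    alSourceStop(source);
    ALint queued = 0;
    alGetSourcei(source, AL_BUFFERS_QUEUED, &queued);

    while (queued-- > 0) {
        ALuint buffer;
        alSourceUnqueueBuffers(source, 1, &buffer);

        /* Refill and requeue with audio from the new position. */
        char pcm[16384];
        long n = decode_pcm(vf, pcm, sizeof(pcm));
        if (n > 0) {
            alBufferData(buffer, AL_FORMAT_STEREO16, pcm, (ALsizei)n, 44100);
            alSourceQueueBuffers(source, 1, &buffer);
        }
    }

    /* And off we go again. */
    alSourcePlay(source);
}
```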

And that’s it! In the end, it’s just a bunch of samples anyway.