At Recall.ai we process a lot of real-time media. We build bots that join video conferences across Zoom, Teams, Google Meet and more. These bots ingest, transcode and analyze the audio and video in real time. That comes with a number of engineering challenges, and this is the story of how we identified and fixed a segfault in our audio processing pipeline that was triggered roughly once in every 36,000,000 seconds of audio.
A few months ago we began to notice a strange pattern in some of our bots. Very occasionally, a bot would abruptly crash without any prior indication that anything was going awry. We use Kubernetes to orchestrate our bots and Karpenter to manage our elastic scaling. kubectl describe pod showed that the bots exited with code 139, which is 128 + 11: the process was killed by signal 11. Signal 11 is SIGSEGV, a segmentation fault (or segfault for short), raised when a program tries to access memory it is not supposed to. This mistake is bad enough that execution of the program is halted immediately.
Getting a core dump
So how do we determine the cause of a SIGSEGV? The answer is often a core dump file. A core dump lets us inspect the state of the program at the instant of the crash: the stack trace, the contents of variables and memory, and more.
Core dumps are written to disk when a process crashes. However, because our bots run as containers in Kubernetes, that disk is ephemeral and becomes inaccessible once the container exits and Karpenter terminates the underlying VM. To diagnose the cause of this segfault we needed to start reliably collecting core dumps across the nodes in our cluster.
We wanted something lightweight that could be easily integrated into our existing Docker images, so we built Garbage Truck, which detects core dump files and uploads them to S3 before the pod terminates.
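The real Garbage Truck has more moving parts (hooking into the pod's shutdown sequence, retries, metadata), but the core idea is small. Here is a minimal, hypothetical sketch in C; the /cores directory, the bucket name, the polling loop and shelling out to the aws CLI are all illustrative assumptions rather than our actual implementation:

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Minimal sketch: watch a directory for core files and ship them to S3.
// Assumes kernel.core_pattern writes cores into /cores and that the aws CLI
// is available inside the container image.
int main(void)
{
    const char *dump_dir = "/cores";            // illustrative path
    const char *bucket = "s3://my-core-dumps";  // illustrative bucket

    for (;;) {
        DIR *dir = opendir(dump_dir);
        if (dir) {
            struct dirent *entry;
            while ((entry = readdir(dir)) != NULL) {
                // only pick up files named core*
                if (strncmp(entry->d_name, "core", 4) != 0)
                    continue;

                char cmd[1024];
                snprintf(cmd, sizeof(cmd), "aws s3 cp %s/%s %s/ && rm %s/%s",
                         dump_dir, entry->d_name, bucket,
                         dump_dir, entry->d_name);
                if (system(cmd) != 0)
                    fprintf(stderr, "upload failed for %s\n", entry->d_name);
            }
            closedir(dir);
        }
        sleep(5); // polling keeps the sketch short; inotify would be nicer
    }
}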
It is not often that I've been eager to see segfaults occur in production, but this was one such occasion. After deploying Garbage Truck we waited for the next crash so we could take the investigation further.
Days passed, then a week and then two. At this point we started to wonder if Garbage Truck itself had changed the execution environment in such a way that prevented the segfaults from happening. However, on the 16th day we observed another crash and Garbage Truck diligently transported the core dump to S3 for us.
Inspecting the stack trace
Excitedly, I spun up a new pod on our cluster using the same image the crash occurred in, to ensure all binaries and system libraries were where the core dump expected them to be. We use a Debian base image, so I installed gdb and loaded up the core dump:
# Install tools
> sudo apt install -y gdb file
# Tell us about the core dump
> file /path/to/core-dump
core-dump: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from 'python3 entrypoint.py', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/python3', platform: 'x86_64'
# file tells us that /usr/bin/python3 was the executing binary
# Let's load up the core dump with gdb
> gdb /usr/bin/python3 /path/to/core-dump
[Current thread is 1 (Thread 0x7f1c3e1fc6c0 (LWP 432))]
# Show me the stacktrace
(gdb) bt
0 0x00007f1caaa58cbb in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
1 0x00007f1caaa5bbda in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
2 0x00007f1caaa6540b in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
3 0x00007f1caaa597ed in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
4 0x00007f1caaa58fdb in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
5 0x00007f1cab2a4a58 in ?? () from /lib/x86_64-linux-gnu/gstreamer-1.0/libgstvoaacenc.so
6 0x00007f1caa6ca481 in ?? () from /lib/x86_64-linux-gnu/libgstaudio-1.0.so.0
7 0x00007f1caa6cb6ac in ?? () from /lib/x86_64-linux-gnu/libgstaudio-1.0.so.0
8 0x00007f1cadd202a9 in ?? () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
9 0x00007f1cadd227b1 in ?? () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
10 0x00007f1cadd29b6b in gst_pad_push () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
11 0x00007f1caaad5ed4 in ?? () from /lib/x86_64-linux-gnu/gstreamer-1.0/libgstcoreelements.so
12 0x00007f1cadd59441 in ?? () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
13 0x00007f1cae2ec6ca in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
14 0x00007f1cae2ebcfd in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
15 0x00007f1cb0191fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
16 0x00007f1cb021266c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Now we have the stack trace telling us where our program crashed. We use GStreamer as our media processing framework, so it is unsurprising to see the crash originating there. The immediate location of the crash was in /lib/x86_64-linux-gnu/libvo-aacenc.so.0, which is our AAC audio encoding library. The majority of the stack frames are designated with ?? meaning gdb does not have symbol information for the associated binaries. Without it, besides staring blindly at the disassembly, there is little we can do to better understand the crash. At this point I suspected the following, in order of probability:
We are subtly misusing some GStreamer API causing a use-after-free or corrupting the encoder state
The voaacenc GStreamer plugin has a bug causing the crash
To narrow it down we needed debug symbols. Fortunately, Debian publishes them through a public debuginfod server, which gdb can fetch from on demand:
# Specify the debuginfod url for gdb
> export DEBUGINFOD_URLS="https://debuginfod.debian.net"
# Load up the core dump
> gdb /usr/bin/python3 /path/to/core-dump
# gdb loads all the required debug files...
# Show me the stacktrace
(gdb) bt
0 0x00007f1caaa58cbb in voAACEnc_pow2_xy (x=<optimized out>, y=y@entry=28) at aacenc/basic_op/oper_32b.c:358
1 0x00007f1caaa5bbda in allowMoreHoles (desiredPe=<optimized out>, nChannels=<optimized out>, ahParam=0x7f1c74021082, ahFlag=0x7f1c3e1fb354, peData=0x7f1c3e1fae90, psyOutElement=0x7f1c74022228,
psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:676
2 adaptThresholdsToPe (msaParam=0x7f1c74021088, ahParam=0x7f1c74021082, desiredPe=1467, nChannels=1, peData=0x7f1c3e1fae90, logSfbEnergy=<optimized out>, psyOutElement=0x7f1c74022228,
psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:812
3 AdjustThresholds (adjThrState=adjThrState@entry=0x7f1c7402105c, AdjThrStateElement=AdjThrStateElement@entry=0x7f1c7402107c, psyOutChannel=psyOutChannel@entry=0x7f1c74022488,
psyOutElement=psyOutElement@entry=0x7f1c74022228, chBitDistribution=chBitDistribution@entry=0x7f1c3e1fb8d4, logSfbEnergy=logSfbEnergy@entry=0x7f1c74021284, sfbNRelevantLines=0x7f1c74021194,
qcOE=0x7f1c74022208, elBits=0x7f1c7402104c, nChannels=1, maxBitFac=395) at aacenc/src/adj_thr.c:1187
4 0x00007f1caaa6540b in QCMain (hQC=hQC@entry=0x7f1c7402103c, elBits=elBits@entry=0x7f1c7402104c, adjThrStateElement=adjThrStateElement@entry=0x7f1c7402107c, psyOutChannel=0x7f1c74022488,
psyOutElement=psyOutElement@entry=0x7f1c74022228, qcOutChannel=0x7f1c74021378, qcOutElement=0x7f1c74022208, nChannels=1, ancillaryDataBytes=0) at aacenc/src/qc_main.c:298
5 0x00007f1caaa597ed in AacEncEncode (aacEnc=aacEnc@entry=0x7f1c74021020, timeSignal=<optimized out>, ancBytes=ancBytes@entry=0x0, numAncBytes=numAncBytes@entry=0x7f1c3e1fb9a6, outBytes=<optimized out>,
numOutBytes=numOutBytes@entry=0x7f1c3e1fba18) at aacenc/src/aacenc_core.c:180
6 0x00007f1caaa58fdb in voAACEncGetOutputData (hCodec=0x7f1c74021020, pOutput=0x7f1c3e1fba10, pOutInfo=0x7f1c3e1fba30) at aacenc/src/aacenc.c:247
7 0x00007f1cab2a4a58 in gst_voaacenc_handle_frame (benc=0x3547ab0, buf=0x7f1c30017360) at ../ext/voaacenc/gstvoaacenc.c:424
8 0x00007f1caa6ca481 in gst_audio_encoder_push_buffers (enc=enc@entry=0x3547ab0, force=force@entry=0) at ../gst-libs/gst/audio/gstaudioencoder.c:1180
9 0x00007f1caa6cb6ac in gst_audio_encoder_chain (pad=<optimized out>, parent=<optimized out>, buffer=0x7f1c68054d80) at ../gst-libs/gst/audio/gstaudioencoder.c:1410
10 0x00007f1cadd202a9 in gst_pad_chain_data_unchecked (pad=pad@entry=0x338ef50, type=type@entry=4112, data=data@entry=0x7f1c68054d80) at ../gst/gstpad.c:4463
11 0x00007f1cadd227b1 in gst_pad_push_data (pad=pad@entry=0x338ed00, type=type@entry=4112, data=data@entry=0x7f1c68054d80) at ../gst/gstpad.c:4739
12 0x00007f1cadd29b6b in gst_pad_push (pad=0x338ed00, buffer=buffer@entry=0x7f1c68054d80) at ../gst/gstpad.c:4858
13 0x00007f1caaad5ed4 in gst_queue_push_one (queue=0x3544160) at ../plugins/elements/gstqueue.c:1388
14 gst_queue_loop (pad=<optimized out>) at ../plugins/elements/gstqueue.c:1541
15 0x00007f1cadd59441 in gst_task_func (task=0x354acb0) at ../gst/gsttask.c:384
16 0x00007f1cae2ec6ca in g_thread_pool_thread_proxy (data=<optimized out>) at ../../../glib/gthreadpool.c:352
17 0x00007f1cae2ebcfd in g_thread_proxy (data=0x7f1c74017520) at ../../../glib/gthread.c:831
18 0x00007f1cb0191fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
19 0x00007f1cb021266c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
From the above we know that this crash occurs in voAACEnc_pow2_xy in libvo-aacenc. A quick Google search does not reveal much either: only one poor soul, using the same encoder via ffmpeg, seems to have run into this over 9 years ago.
It turns out that GStreamer relies on v0.1.3 of this library, a release that is over ten years old. Perhaps this is not the most well-maintained GStreamer plugin...
It seems that we are indexing out of bounds of the statically allocated pow2Table. The function itself is an optimised implementation of 2^(x/y) that assumes some invariants (x <= 0, y > 0), so perhaps we are breaking one of those invariants somehow.
Digging (much) deeper
As the next step in my investigation, I looked at the code that called this pow2_xy function. Here is an annotated version:
// average energy
avgEn = 0;
// minimum energy
minEn = MAX_32;
// counter
ahCnt = 0;

// for each channel
for (ch=0; ch<nChannels; ch++) {
    PSY_OUT_CHANNEL *psyOutChan = &psyOutChannel[ch];
    for (sfb=startSfb[ch]; sfb<psyOutChan->sfbCnt; sfb++) {
        if ((ahFlag[ch][sfb] != NO_AH) &&
            (psyOutChan->sfbEnergy[sfb] > psyOutChan->sfbThreshold[sfb])) {
            // calculate the running minimum
            minEn = min(minEn, psyOutChan->sfbEnergy[sfb]);
            // sum all energies
            avgEn = L_add(avgEn, psyOutChan->sfbEnergy[sfb]);
            ahCnt++;
        }
    }
}

// calculate the average using fixed-point multiplication to "divide" by ahCnt
if(ahCnt) {
    Word32 iahCnt;
    shift = norm_l(ahCnt);
    iahCnt = Div_32( 1 << shift, ahCnt << shift );
    avgEn = fixmul(avgEn, iahCnt);
}

// calculate the logarithmic-scale difference
enDiff = iLog4(avgEn) - iLog4(minEn);

// whatever this means...
/* calc some energy borders between minEn and avgEn */
for (enIdx=0; enIdx<4; enIdx++) {
    Word32 enFac;
    enFac = ((6-(enIdx << 1)) * enDiff);
    en[enIdx] = fixmul(avgEn, pow2_xy(L_negate(enFac),7*4));
}
To try to decipher which invariant (if any) was violated, I inspected the variables and registers through gdb.
Because this is a heavily-optimised release build, the core dump stubbornly refuses to reveal the state of most of the variables. From what we can see, y is 28 and fPart is an absurdly large 0xfffffffa, which points to some kind of overflow. We don't know x.
To figure out what x might have been, I modified the function and fuzzed over a range of x:
Word32 pow2_xy(Word32 x, Word32 y)
{
    UWord32 iPart;
    UWord32 fPart;
    Word32 res;
    Word32 tmp, tmp2;
    Word32 shift, shift2;

    tmp2 = -x;
    iPart = tmp2 / y;
    fPart = tmp2 - iPart*y;
    iPart = min(iPart,INT_BITS-1);

    // bail out if we see the fPart observed in the core dump (0xfffffffa)
    if (fPart == 4294967290) {
        printf("**reproduced!**\n");
        return 0;
    }

    // bail out before indexing past the end of pow2Table
    if ((POW2_TABLE_SIZE*fPart)/y >= 256) {
        printf("invalid state\n");
        return 0;
    }

    res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;

    return(res);
}

int main(int argc, char *argv[])
{
    for (int i = -10; i < 10; i++) {
        int res = pow2_xy(i, 28);
        printf("pow2_xy(%d, %d) = %d\n", i, 28, res);
    }
}
Running this shows that x = 6 hits the offending fPart of 0xfffffffa, which means enDiff = -1 would reproduce our crash exactly. In general, if enDiff < 0 for any reason, we would trigger a segfault in pow2_xy. Because enDiff = iLog4(avgEn) - iLog4(minEn), enDiff < 0 means iLog4(avgEn) < iLog4(minEn), which implies avgEn < minEn. If average(energy) < minimum(energy), something else has gone very wrong.
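To spell out why a small positive x is fatal, here is a back-of-the-envelope check that mirrors the arithmetic in pow2_xy above. The 256-entry table size is an assumption based on the bounds check in the harness; everything else follows the original code:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // with enDiff = -1 the caller passes x = 6, y = 28
    int32_t x = 6, y = 28;

    int32_t tmp2 = -x;                                      // -6: the x <= 0 invariant is broken
    uint32_t iPart = (uint32_t)(tmp2 / y);                  // -6 / 28 == 0
    uint32_t fPart = (uint32_t)tmp2 - iPart * (uint32_t)y;  // -6 wraps to 0xfffffffa

    // assuming a 256-entry pow2Table, as the bounds check above suggests
    uint32_t idx = (256u * fPart) / (uint32_t)y;            // 32-bit wrap, then /28: a huge index

    printf("fPart = 0x%08x, table index = %u\n", (unsigned)fPart, (unsigned)idx);
    return 0;
}

fPart comes out as exactly the 0xfffffffa we saw in the core dump, and the table index lands far past the end of the table, hence the wild read and the SIGSEGV.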
If we could reproduce this locally, then we could step through the code and (hopefully) pinpoint exactly where things started to go awry. Reading through the libvo-aacenc source some more, it seems that the encoding functions are pure over the encoder state. Put simply, they rely only on the current state of the encoder to operate. So what if we dump the encoder state from our core dump and resume the execution locally? gdb's dump binary memory is our friend here: after inspecting the encoder structure in gdb, we dumped it and each of the buffers it references into files.
Now we should be able to execute the encoder and reproduce the crash!
> make && ./aac-enc
# snip...
loaded memdump/encoder-state.bin into buffer size 17408
loaded memdump/quantSpec.bin into buffer size 10240
loaded memdump/maxValueInSfb.bin into buffer size 10240
loaded memdump/scf.bin into buffer size 10240
loaded memdump/quantSpec2.bin into buffer size 10240
loaded memdump/maxValueInSfb2.bin into buffer size 10240
loaded memdump/scf2.bin into buffer size 10240
loaded memdump/sfbEnergy.bin into buffer size 10240
loaded memdump/sfbSpreadedEnergy.bin into buffer size 10240
loaded memdump/sfbThreshold.bin into buffer size 10240
loaded memdump/mdctSpectrum.bin into buffer size 10240
loaded memdump/sfbOffset.bin into buffer size 10240
loaded memdump/sfbOffset2.bin into buffer size 10240
loaded memdump/mdctDelayBuffer.bin into buffer size 10240
loaded memdump/mdctSpectrum.bin into buffer size 10240
loaded memdump/mdctDelayBuffer2.bin into buffer size 10240
loaded memdump/mdctSpectrum2.bin into buffer size 10240
loaded memdump/pScratchTns.bin into buffer size 10240
loaded memdump/bitBufBase.bin into buffer size 102400
loaded memdump/hBitStream.bin into buffer size 102400
loaded memdump/bitBufBase.bin into buffer size 102400
loaded memdump/intbuf.bin into buffer size 102400
loaded memdump/encbuf.bin into buffer size 10240
loaded memdump/output.bin into buffer size 8
loaded memdump/output-buf.bin into buffer size 2048
Done
Done!? Wait, where is our burning pile of invalid memory dereferences? It seems that the resumed encoder exited cleanly. What if we add some logging to pow2_xy?
There are no invalid x values here either. Stepping through the code with gdb makes it clear why: many of the internal functions mutate the encoder state throughout the lifetime of the GetOutputData entry point. So the state we restored was the state at the instant of the segfault, not the state when the bot called GetOutputData just before the crash. Unfortunately, we cannot recover that original encoder state from the core dump either.
So let us go deeper then. What if we restore the state at the frame above pow2_xy? Let's dump the memory referenced by its arguments and try again:
# Go up a frame to "allowMoreHoles"
(gdb) f 1
#1 0x00007f1caaa5bbda in allowMoreHoles (desiredPe=<optimized out>, nChannels=<optimized out>, ahParam=0x7f1c74021082, ahFlag=0x7f1c3e1fb354, peData=0x7f1c3e1fae90, psyOutElement=0x7f1c74022228,
psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:676
676 en[enIdx] = fixmul(avgEn, pow2_xy(L_negate(enFac),7*4));
# Show the arguments
(gdb) info args
desiredPe = <optimized out>
nChannels = <optimized out>
ahParam = 0x7f1c74021082
ahFlag = 0x7f1c3e1fb354
peData = 0x7f1c3e1fae90
psyOutElement = 0x7f1c74022228
psyOutChannel = 0x7f1c74022488
# Dump the argument state
(gdb) dump binary memory psyOutChannel.bin 0x7f1c74022488 0x7f1c74022488+sizeof(PSY_OUT_CHANNEL)*2
(gdb) dump binary memory sfbEnergy.bin 0x7f1c74023970 0x7f1c74023970+100
(gdb) dump binary memory sfbSpreadedEnergy.bin 0x7f1c74023f58 0x7f1c74023f58+100
(gdb) dump binary memory sfbThreshold.bin 0x7f1c740236a0 0x7f1c740236a0+10240
(gdb) dump binary memory mdctSpectrum.bin 0x7f1c74026480 0x7f1c74026480+10240
(gdb) dump binary memory peData.bin 0x7f1c3e1fae90 0x7f1c3e1fae90+sizeof(PE_DATA)
(gdb) dump binary memory ahFlag.bin 0x7f1c3e1fb354 0x7f1c3e1fb354+1024
> make && ./aac-enc
# snip...
loaded memdump/psyOutChannel.bin into buffer size 3040
loaded memdump/sfbEnergy.bin into buffer size 10240
loaded memdump/sfbSpreadedEnergy.bin into buffer size 10240
loaded memdump/sfbThreshold.bin into buffer size 10240
loaded memdump/mdctSpectrum.bin into buffer size 10240
loaded memdump/peData.bin into buffer size 2420
loaded memdump/ahFlag.bin into buffer size 1024
loaded memdump/ahParam.bin into buffer size 6
pow2_xy(6, 28)
Segmentation fault
Success! It seems we can now reproduce the crash, even if only from the stack frame above rather than from the full encode. Now we can rebuild with -O0 to disable optimisations and inspect the local state during execution.
> gdb ./.libs/aac-enc
Reading symbols from ./.libs/aac-enc...
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc
# snip...
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f942e6 in voAACEnc_pow2_xy (x=6, y=28) at aacenc/basic_op/oper_32b.c:370
370 res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;
(gdb) bt
#0 0x00007ffff7f942e6 in voAACEnc_pow2_xy (x=6, y=28) at aacenc/basic_op/oper_32b.c:370
#1 0x00007ffff7f97ea0 in allowMoreHoles (psyOutChannel=0x55555555a2a0, psyOutElement=0x0, peData=0x5555555654c0, ahFlag=0x5555555702b0, ahParam=0x555555566250,
nChannels=1, desiredPe=1540) at aacenc/src/adj_thr.c:714
#2 0x0000555555556402 in main (argc=1, argv=0x7fffffffe058) at aac-enc.c:70
(gdb) f 1
#1 0x00007ffff7f97ea0 in allowMoreHoles (psyOutChannel=0x55555555a2a0, psyOutElement=0x0, peData=0x5555555654c0, ahFlag=0x5555555702b0, ahParam=0x555555566250,
nChannels=1, desiredPe=1540) at aacenc/src/adj_thr.c:714
714 en[0] = fixmul(avgEn, pow2_xy(-(6 * enDiff), 7*4));
(gdb) info locals
enDiff = -1
avgEn = 510 # !!
minEn = 512 # !!
ahCnt = 1
en = {507315022, 0, 1679226483, 0}
maxSfb = 0
done = 0
startSfb = {15, 0}
enIdx = 0
minSfb = 1000
ch = 1
sfb = 43
actPe = 1732
shift = 30
(gdb)
We have just confirmed our previous conjecture that avgEn < minEn! So why does this happen? Let's step through the relevant code.
# Set a breakpoint
(gdb) b allowMoreHoles
# Start the program
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc
Breakpoint 1, allowMoreHoles (psyOutChannel=0x55555555a2a0, psyOutElement=0x0, peData=0x5555555654c0, ahFlag=0x5555555702b0, ahParam=0x555555566250, nChannels=1, desiredPe=1540) at aacenc/src/adj_thr.c:699
699 if(ahCnt) {
# Print out minEn and avgEn after initial calculation
(gdb) p minEn
$1 = 512
(gdb) p avgEn
$2 = 512
# They look good at this stage, keep going...
(gdb) n
701 shift = norm_l(ahCnt);
(gdb) n
702 iahCnt = Div_32( 1 << shift, ahCnt << shift );
(gdb) n
703 avgEn = fixmul(avgEn, iahCnt);
# We just updated avgEn, lets look again
(gdb) p avgEn
$8 = 510
# Oh no, so this `fixmul` nudged `avgEn` below `minEn`
(gdb) n
707 enDiff = iLog4(avgEn) - iLog4(minEn);
# And that means enDiff is now negative...
(gdb) p enDiff
$10 = -1
# And all hell breaks loose...
(gdb) n
714 en[0] = fixmul(avgEn, pow2_xy(-(6 * enDiff), 7*4));
(gdb) n
pow2_xy(6, 28)
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f942e6 in voAACEnc_pow2_xy (x=6, y=28) at aacenc/basic_op/oper_32b.c:370
370 res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;
So now we know at least roughly where things have gone wrong: the fixed-point calculation of the average energy avgEn introduces an error that, in rare cases, leaves avgEn < minEn.
But what is actually happening in norm_l, Div_32 and fixmul that leads to this error? The source code says these operations are in "double precision", yet most of the values are stored in 32-bit integers. What is going on here? Google leads us to this relic, last modified in 1996 by France Telecom:
/*___________________________________________________________________________
| |
| This file contains operations in double precision. |
| These operations are not standard double precision operations. |
| They are used where single precision is not enough but the full 32 bits |
| precision is not necessary. For example, the function Div_32() has a |
| 24 bits precision which is enough for our purposes. |
| |
| The double precision numbers use a special representation: |
| |
| L_32 = hi<<16 + lo<<1 |
| |
| L_32 is a 32 bit integer. |
| hi and lo are 16 bit signed integers. |
| As the low part also contains the sign, this allows fast multiplication. |
| |
| 0x8000 0000 <= L_32 <= 0x7fff fffe. |
| |
| We will use DPF (Double Precision Format )in this file to specify |
| this special format. |
|___________________________________________________________________________|
*/
Ahhh, so it is a special kind of "double precision" split across 32 bits. Now with this in mind we can annotate the relevant code:
if(ahCnt) {
    // calc how many bits are needed to shift ahCnt into the normalised range [0x40000000, 0x7fff0000]
    shift = norm_l(ahCnt);
    // perform fixed-point division of normalised 1 and ahCnt
    iahCnt = Div_32(1 << shift, ahCnt << shift);
    // perform fixed-point multiplication of avgEn and iahCnt
    avgEn = fixmul(avgEn, iahCnt);
}
So where are we losing precision then? Let's step through our scenario in gdb:
# Break at `main`
(gdb) b main
Breakpoint 1 at 0x22f0: file aac-enc.c, line 60.
# Start the program
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc
Breakpoint 1, main (argc=1, argv=0x7fffffffe068) at aac-enc.c:60
# Set up the input
(gdb) set $avgEn = 512
# Calculate the normalisation left shift for ahCnt = 1
(gdb) set $shift = norm_l(1)
(gdb) p $shift
$1 = 30
# Perform the shifts
(gdb) set $num = 1 << $shift
(gdb) set $denom = 1 << $shift
# Perform the division
(gdb) set $iahCnt = voAACEnc_Div_32($num, $denom)
# Perform the multiplication
(gdb) set $mul = (uint64_t)$avgEn * (uint64_t)$iahCnt
(gdb) set $avgEn = ($mul >> 32) << 1
# Observe the results
(gdb) p (double)$iahCnt / (double)((uint64_t)2 << 31)
$3 = 0.4999999962747097
(gdb) p (double)$mul / (double)((uint64_t)2 << 31)
$3 = 254.99999810010195
(gdb) p $avgEn
$5 = 510
There are a couple of things that do not look right. Firstly, we are executing Div_32($num, $denom) where $num = $denom, breaking the num < denom invariant. We then lose some precision in the Div_32 call, and that loss propagates into our 64-bit multiplication with $avgEn: at this stage the floating-point equivalent is ~255 and our error is -1. The final left shift then doubles the error to -2, and we are left with 510.
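The error doubling is easy to demonstrate in isolation. Here is a minimal sketch mirroring the ($mul >> 32) << 1 arithmetic from the session above; the real fixmul lives in the library's basic ops, so the name and exact body here are assumptions. Even the largest representable quotient, 0x7fffffff, which is just one step short of the ideal 1 << 31, already loses two after the truncate-and-shift:

#include <stdint.h>
#include <stdio.h>

// mirrors the gdb emulation above: take the high 32 bits of the 64-bit
// product, then shift left by one -- so any truncation error is doubled
// and the result is always even
static int32_t fixmul_sketch(int32_t a, int32_t b)
{
    uint64_t mul = (uint64_t)(uint32_t)a * (uint64_t)(uint32_t)b;
    return (int32_t)((mul >> 32) << 1);
}

int main(void)
{
    int32_t avgEn = 512;
    int32_t almostOne = 0x7fffffff; // one step short of the ideal 1 << 31

    // 512 * 0x7fffffff = 0xfffffffe00; >> 32 gives 255; << 1 gives 510
    printf("fixmul_sketch(512, 0x7fffffff) = %d\n", fixmul_sketch(avgEn, almostOne));
    return 0;
}

Which is exactly the 512 -> 510 drop we watched happen to avgEn.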
What if Div_32 did not lose precision?
# Break at `main`
(gdb) b main
Breakpoint 1 at 0x22f0: file aac-enc.c, line 60.
# Start the program
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc
Breakpoint 1, main (argc=1, argv=0x7fffffffe068) at aac-enc.c:60
# Set up the input
(gdb) set $avgEn = 512
# 2 << 30 is exactly floating point 0.5 in Q31
(gdb) set $iahCnt = (uint64_t)2 << 30
# Perform the multiplication
(gdb) set $mul = (uint64_t)$avgEn * (uint64_t)$iahCnt
(gdb) set $avgEn = ($mul >> 32) << 1
# Observe the results
(gdb) p (double)$iahCnt / (double)((uint64_t)2 << 31)
$3 = 0.5
(gdb) p (double)$mul / (double)((uint64_t)2 << 31)
$3 = 256
(gdb) p $avgEn
$5 = 512
Voila! avgEn is now exactly what we expect it to be after dividing it by 1.
So how do we fix this? Could we modify Div_32 to never lose precision? Probably not without harming its performance characteristics.
In this particular case the local invariant is minEn <= avgEn. So let's just implement a lower bound for avgEn.
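Something along these lines (a sketch of the clamp, inserted right after the fixed-point average; the exact patch we shipped is not shown here):

if(ahCnt) {
    Word32 iahCnt;
    shift = norm_l(ahCnt);
    iahCnt = Div_32( 1 << shift, ahCnt << shift );
    avgEn = fixmul(avgEn, iahCnt);
    // fixed-point rounding can nudge the "average" below the minimum;
    // clamp it so the minEn <= avgEn invariant always holds
    if (avgEn < minEn) {
        avgEn = minEn;
    }
}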
That seems to have worked! So let's push this fix to prod, shall we?
Uncovering a fossil
Not so fast... At this point I was struggling to believe that nobody else had encountered this issue previously. I scoured Google, GitHub and finally Bing (of all places) to find this mailing list thread. A quick read reveals that it is almost certainly referring to the same bug, and that it was fixed upstream nearly 9 years ago. However, the fix never made it into a release, so the GStreamer plugin never picked it up.
Although it was satisfying to understand and fix the bug that had been causing these sporadic crashes, pressing on with an unmaintained codec is a liability we do not want to take on going forward. Rather than relying on an unmaintained AAC encoder from gst-plugins-bad, how about we find something more widely used and, hopefully, more stable? The libav AAC encoder, used in ffmpeg and available as a GStreamer plugin, seems like a reasonable choice.
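For the curious, the swap itself is small: the voaacenc element is replaced with gst-libav's avenc_aac in the encoding branch of the pipeline. Our pipelines are built programmatically, so the gst_parse_launch sketch below (with its test-tone source and output file) is purely illustrative:

#include <gst/gst.h>

int main(int argc, char *argv[])
{
    gst_init(&argc, &argv);

    // avenc_aac (from gst-libav / ffmpeg) takes the place of voaacenc
    GError *error = NULL;
    GstElement *pipeline = gst_parse_launch(
        "audiotestsrc num-buffers=100 ! audioconvert ! "
        "avenc_aac ! mp4mux ! filesink location=out.mp4", &error);
    if (!pipeline) {
        g_printerr("failed to build pipeline: %s\n", error->message);
        return 1;
    }

    gst_element_set_state(pipeline, GST_STATE_PLAYING);

    // wait for EOS or an error, then shut down
    GstBus *bus = gst_element_get_bus(pipeline);
    GstMessage *msg = gst_bus_timed_pop_filtered(
        bus, GST_CLOCK_TIME_NONE, GST_MESSAGE_EOS | GST_MESSAGE_ERROR);
    if (msg)
        gst_message_unref(msg);
    gst_object_unref(bus);

    gst_element_set_state(pipeline, GST_STATE_NULL);
    gst_object_unref(pipeline);
    return 0;
}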
UPDATE: We have since rolled out this shiny new codec and are yet to spot another segfault! Kudos to the ffmpeg maintainers!