Deep Dive

Debugging a 1 in 36,000,000 segfault in our audio encoder

Elliot Levin

March 29, 2023

Introduction

At Recall.ai we process a lot of real-time media. We build bots that join video-conferences across Zoom, Teams, Google Meet and more. These bots ingest, transcode and analyze the audio and video in real-time. This comes with a number of engineering challenges and this is the story of how we identified and fixed a segfault in our audio processing pipeline that was triggered by roughly 1 in every 36,000,000 seconds of audio.

A few months ago we began to notice a strange pattern in some of our bots. Very occasionally, a bot would abruptly crash without any prior indication of anything going awry. We use Kubernetes to orchestrate our bots and Karpenter to manage our elastic scaling. kubectl describe pod showed that the bots exited with code 139 = 128 + 11(signal). Signal 11 corresponds to SIGSEGV, a segmentation fault or segfault for short, this occurs when a program tries to access memory it is not supposed to. This mistake is bad enough that the execution of the program should halt immediately.

Getting a core dump

So how do we determine the cause of a SIGSEGV? The answer is often a core dump file. A core dump lets us inspect the state of the program at the instant of the crash, this lets us see many useful things such as the stacktrace, variable/memory contents and more.

Core dumps are output as files on disk when a process crashes, however because our bots are run as containers in Kubernetes, the disk is ephemeral and becomes inaccessible when the container exits and Karpenter terminates the underlying VM. In order to diagnose the cause of this segfault we need to start reliably collecting core dumps across nodes in our cluster.

We want something lightweight and that could be easily integrated into existing docker images, so we built Garbage Truck which detects and uploads core dump files to S3 before the pod terminates.

It is not often I've been eager to see segfaults occur in production but this was one such occasion. After deploying Garbage Truck we waited until the next crash so we could further the investigation.

Days passed, then a week and then two. At this point we started to wonder if Garbage Truck itself had changed the execution environment in such a way that prevented the segfaults from happening. However, on the 16th day we observed another crash and Garbage Truck diligently transported the core dump to S3 for us.

Inspecting the stack trace

Excitedly, I spun up a new pod on our cluster, using the same image where the crash occurred to ensure all binaries and system libraries are where the core dump expects them to be. We use a debian base image, so I installed gdb and loaded up the core dump:


# Install tools
> sudo apt install -y gdb file
# Tell us about the core dmp
> file /path/to/core-dump
core-dump: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style, from 'python3 entrypoint.py', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/bin/python3', platform: 'x86_64'
# file tells us that /usr/bin/python3 was the executing binary
# Lets load up the core dump with gdb
>  gdb /usr/bin/python3 /path/to/core-dump
[Current thread is 1 (Thread 0x7f1c3e1fc6c0 (LWP 432))]
# Show me the stacktrace
(gdb) bt
0  0x00007f1caaa58cbb in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
1  0x00007f1caaa5bbda in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
2  0x00007f1caaa6540b in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
3  0x00007f1caaa597ed in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
4  0x00007f1caaa58fdb in ?? () from /lib/x86_64-linux-gnu/libvo-aacenc.so.0
5  0x00007f1cab2a4a58 in ?? () from /lib/x86_64-linux-gnu/gstreamer-1.0/libgstvoaacenc.so
6  0x00007f1caa6ca481 in ?? () from /lib/x86_64-linux-gnu/libgstaudio-1.0.so.0
7  0x00007f1caa6cb6ac in ?? () from /lib/x86_64-linux-gnu/libgstaudio-1.0.so.0
8  0x00007f1cadd202a9 in ?? () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
9  0x00007f1cadd227b1 in ?? () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
10 0x00007f1cadd29b6b in gst_pad_push () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
11 0x00007f1caaad5ed4 in ?? () from /lib/x86_64-linux-gnu/gstreamer-1.0/libgstcoreelements.so
12 0x00007f1cadd59441 in ?? () from /lib/x86_64-linux-gnu/libgstreamer-1.0.so.0
13 0x00007f1cae2ec6ca in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
14 0x00007f1cae2ebcfd in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
15 0x00007f1cb0191fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
16 0x00007f1cb021266c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Now we have the stacktrace telling us where our program crashed. We use GStreamer as our media processing framework so it is unsurprising see the crash originating there. The immediate location of the crash occured in /lib/x86_64-linux-gnu/libvo-aacenc.so.0 which was our AAC audio encoding library. The majority of the stack frames are designated with ?? meaning gdb does not have the symbol information for the associated binaries. Without this, besides staring blindly at the disassembly, there is little we can do to better understand the crash. At this point I suspected the following in the order of probability:

  1. We are subtly misusing some GStreamer API causing a use-after-free or corrupting the encoder state
  2. The voaacenc GStreamer plugin has a bug causing the crash
  3. libvo-aacenc has a bug causing the crash

Thankfully, our Debian base image compiles most libraries with separate debugging symbols and makes them readily available to our gdb:


# Specify the debuginfod url for gdb
> export DEBUGINFOD_URLS="https://debuginfod.debian.net"
# Load up the core dump
> gdb /usr/bin/python3 /path/to/core-dump
# gdb loads all the required debug files...
# Show me the stacktrace
(gdb) bt
0  0x00007f1caaa58cbb in voAACEnc_pow2_xy (x=<optimized out>, y=y@entry=28) at aacenc/basic_op/oper_32b.c:358
1  0x00007f1caaa5bbda in allowMoreHoles (desiredPe=<optimized out>, nChannels=<optimized out>, ahParam=0x7f1c74021082, ahFlag=0x7f1c3e1fb354, peData=0x7f1c3e1fae90, psyOutElement=0x7f1c74022228,
   psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:676
2  adaptThresholdsToPe (msaParam=0x7f1c74021088, ahParam=0x7f1c74021082, desiredPe=1467, nChannels=1, peData=0x7f1c3e1fae90, logSfbEnergy=<optimized out>, psyOutElement=0x7f1c74022228,
   psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:812
3  AdjustThresholds (adjThrState=adjThrState@entry=0x7f1c7402105c, AdjThrStateElement=AdjThrStateElement@entry=0x7f1c7402107c, psyOutChannel=psyOutChannel@entry=0x7f1c74022488,
   psyOutElement=psyOutElement@entry=0x7f1c74022228, chBitDistribution=chBitDistribution@entry=0x7f1c3e1fb8d4, logSfbEnergy=logSfbEnergy@entry=0x7f1c74021284, sfbNRelevantLines=0x7f1c74021194,
   qcOE=0x7f1c74022208, elBits=0x7f1c7402104c, nChannels=1, maxBitFac=395) at aacenc/src/adj_thr.c:1187
4  0x00007f1caaa6540b in QCMain (hQC=hQC@entry=0x7f1c7402103c, elBits=elBits@entry=0x7f1c7402104c, adjThrStateElement=adjThrStateElement@entry=0x7f1c7402107c, psyOutChannel=0x7f1c74022488,
   psyOutElement=psyOutElement@entry=0x7f1c74022228, qcOutChannel=0x7f1c74021378, qcOutElement=0x7f1c74022208, nChannels=1, ancillaryDataBytes=0) at aacenc/src/qc_main.c:298
5  0x00007f1caaa597ed in AacEncEncode (aacEnc=aacEnc@entry=0x7f1c74021020, timeSignal=<optimized out>, ancBytes=ancBytes@entry=0x0, numAncBytes=numAncBytes@entry=0x7f1c3e1fb9a6, outBytes=<optimized out>,
   numOutBytes=numOutBytes@entry=0x7f1c3e1fba18) at aacenc/src/aacenc_core.c:180
6  0x00007f1caaa58fdb in voAACEncGetOutputData (hCodec=0x7f1c74021020, pOutput=0x7f1c3e1fba10, pOutInfo=0x7f1c3e1fba30) at aacenc/src/aacenc.c:247
7  0x00007f1cab2a4a58 in gst_voaacenc_handle_frame (benc=0x3547ab0, buf=0x7f1c30017360) at ../ext/voaacenc/gstvoaacenc.c:424
8  0x00007f1caa6ca481 in gst_audio_encoder_push_buffers (enc=enc@entry=0x3547ab0, force=force@entry=0) at ../gst-libs/gst/audio/gstaudioencoder.c:1180
9  0x00007f1caa6cb6ac in gst_audio_encoder_chain (pad=<optimized out>, parent=<optimized out>, buffer=0x7f1c68054d80) at ../gst-libs/gst/audio/gstaudioencoder.c:1410
10 0x00007f1cadd202a9 in gst_pad_chain_data_unchecked (pad=pad@entry=0x338ef50, type=type@entry=4112, data=data@entry=0x7f1c68054d80) at ../gst/gstpad.c:4463
11 0x00007f1cadd227b1 in gst_pad_push_data (pad=pad@entry=0x338ed00, type=type@entry=4112, data=data@entry=0x7f1c68054d80) at ../gst/gstpad.c:4739
12 0x00007f1cadd29b6b in gst_pad_push (pad=0x338ed00, buffer=buffer@entry=0x7f1c68054d80) at ../gst/gstpad.c:4858
13 0x00007f1caaad5ed4 in gst_queue_push_one (queue=0x3544160) at ../plugins/elements/gstqueue.c:1388
14 gst_queue_loop (pad=<optimized out>) at ../plugins/elements/gstqueue.c:1541
15 0x00007f1cadd59441 in gst_task_func (task=0x354acb0) at ../gst/gsttask.c:384
16 0x00007f1cae2ec6ca in g_thread_pool_thread_proxy (data=<optimized out>) at ../../../glib/gthreadpool.c:352
17 0x00007f1cae2ebcfd in g_thread_proxy (data=0x7f1c74017520) at ../../../glib/gthread.c:831
18 0x00007f1cb0191fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
19 0x00007f1cb021266c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

From the above we know that this crash occurs in voAACEnc_pow2_xy in libvo-aacenc. A quick google search does not reveal much either. Only one poor soul, using the same encoder via ffmpeg, potentially ran into this over 9 years ago.

It turns out that GStreamer is relying on v0.1.3 of this library. This release was over ten years old. Perhaps this is not the most well-maintained GStreamer plugin...

The source code of voAACEnc_pow2_xy is:


/*
  \brief calculates 2 ^ (x/y) for x<=0, y > 0, x <= 32768 * y

  avoids integer division

  \return
*/
Word32 pow2_xy(Word32 x, Word32 y)
{
  UWord32 iPart;
  UWord32 fPart;
  Word32 res;
  Word32 tmp, tmp2;
  Word32 shift, shift2;

  tmp2 = -x;
  iPart = tmp2 / y;
  fPart = tmp2 - iPart*y;
  iPart = min(iPart,INT_BITS-1);

  // crash occurs here
  res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;

  return(res);
}

It seems that we are indexing out-of-bounds of the statically-allocated pow2Table variable. The function itself is an optimised implementation of 2 ^ (x/y) with some invariants. So perhaps we are breaking these invariants (x<=0, y > 0) somehow.

Digging (much) deeper

As the next step in my investigation, I looked at the code that called this pow2_xy function. Here is an annotated version:


    // average energy
    avgEn = 0;
    // minimum energy
    minEn = MAX_32;
    // counter
    ahCnt = 0;
    // for each channel
    for (ch=0; ch<nChannels; ch++) {
      PSY_OUT_CHANNEL *psyOutChan = &psyOutChannel[ch];
      for (sfb=startSfb[ch]; sfb<psyOutChan->sfbCnt; sfb++) {

        if ((ahFlag[ch][sfb] != NO_AH) &&
            (psyOutChan->sfbEnergy[sfb] > psyOutChan->sfbThreshold[sfb])) {
          // calculate the running minimum
          minEn = min(minEn, psyOutChan->sfbEnergy[sfb]);
          // sum all energies
          avgEn = L_add(avgEn, psyOutChan->sfbEnergy[sfb]);
          ahCnt++;
        }
      }
    }

    // calculate the average using fixed-point multiplication to "divide" by ahCnt
    if(ahCnt) {
      Word32 iahCnt;
      shift = norm_l(ahCnt);
      iahCnt = Div_32( 1 << shift, ahCnt << shift );
      avgEn = fixmul(avgEn, iahCnt);
    }

    // calculate the logarithmic-scale difference 
    enDiff = iLog4(avgEn) - iLog4(minEn);
    // what ever this means...
    /* calc some energy borders between minEn and avgEn */
    for (enIdx=0; enIdx<4; enIdx++) {
      Word32 enFac;
      enFac = ((6-(enIdx << 1)) * enDiff);
      en[enIdx] = fixmul(avgEn, pow2_xy(L_negate(enFac),7*4));
    }

To try to decipher which invariant (if any) was violated, I inspected the variables and registers through gdb.


#0  0x00007f1caaa58cbb in voAACEnc_pow2_xy (x=<optimized out>, y=y@entry=28) at aacenc/basic_op/oper_32b.c:358
358      res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;
# Print local variables
(gdb) info locals
iPart = <optimized out>
fPart = 4294967290
res = <optimized out>
tmp = <optimized out>
tmp2 = <optimized out>
shift = <optimized out>
shift2 = <optimized out>
# Dump registers
(gdb) info reg
rax            0x9249212           153391634
rbx            0x604               1540
rcx            0x0                 0
rdx            0x7f1caaa6b000      139761098862592
rsi            0x1c                28 # y
rdi            0xfffffffa          4294967290 # fpart
rbp            0xfffffffe          0xfffffffe
rsp            0x7f1c3e1fadc8      0x7f1c3e1fadc8
r8             0x1fff8000          536838144
r9             0x1fe               510
r10            0x200               512
r11            0x1                 1
r12            0x4                 4
r13            0x1fe               510
r14            0x7f1c3e1fae84      139759278075524
r15            0x6c4               1732
rip            0x7f1caaa58cbb      0x7f1caaa58cbb <voAACEnc_pow2_xy+43>
# Go up a frame
(gdb) f 1
#1  0x00007f1caaa5bbda in allowMoreHoles (desiredPe=<optimized out>, nChannels=<optimized out>, ahParam=0x7f1c74021082, ahFlag=0x7f1c3e1fb354, peData=0x7f1c3e1fae90, psyOutElement=0x7f1c74022228,
    psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:676
676          en[enIdx] = fixmul(avgEn, pow2_xy(L_negate(enFac),7*4));
(gdb) info locals
enFac = <optimized out>
enDiff = <optimized out>
avgEn = <optimized out>
minEn = <optimized out>
ahCnt = <optimized out>
en = {4, 28, 158, 854}
maxSfb = <optimized out>
done = <optimized out>
startSfb = {15, 0}
enIdx = <optimized out>
minSfb = <optimized out>
ch = <optimized out>
sfb = <optimized out>
actPe = 1732
shift = <optimized out>
ch = <optimized out>
sfb = <optimized out>
actPe = <optimized out>
shift = <optimized out>
psyOutChanL = <optimized out>
psyOutChanR = <optimized out>
minEn = <optimized out>
startSfb = <optimized out>
avgEn = <optimized out>
minEn = <optimized out>
ahCnt = <optimized out>
enIdx = <optimized out>
enDiff = <optimized out>
en = <optimized out>
minSfb = <optimized out>
maxSfb = <optimized out>
done = <optimized out>
psyOutChan = <optimized out>
iahCnt = <optimized out>
enFac = <optimized out>
psyOutChan = <optimized out>

Because this is a heavily-optimised release build, the core dump is stubbornly refusing to reveal most of the variables state. From the above we know y is 28 and fPart is an absurdly large 0xfffffffa. This points to some kind of overflow. We don't know x.

To figure out what x might have been, I modified the function and fuzzed over a range of x:


Word32 pow2_xy(Word32 x, Word32 y)
{
  UWord32 iPart;
  UWord32 fPart;
  Word32 res;
  Word32 tmp, tmp2;
  Word32 shift, shift2;

  tmp2 = -x;
  iPart = tmp2 / y;
  fPart = tmp2 - iPart*y;
  iPart = min(iPart,INT_BITS-1);

  if (fPart == 4294967290) {
    printf("**reproduced!**\n");
    return 0;
  }
  if ((POW2_TABLE_SIZE*fPart)/y >= 256) {
    printf("invalid state\n");
    return 0;
  }

  res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;

  return(res);
}

int main(int argc, char *argv[])
{
    for (int i = -10; i < 10; i++) {
        int res = pow2_xy(i, 28);
        printf("pow2_xy(%d, %d) = %d\n", i, 28, res);
    }
}

Prints the following:


pow2_xy(-10, 28) = 1678506817
pow2_xy(-9, 28) = 1719911875
pow2_xy(-8, 28) = 1762338305
pow2_xy(-7, 28) = 1805811302
pow2_xy(-6, 28) = 1855373507
pow2_xy(-5, 28) = 1901141476
pow2_xy(-4, 28) = 1948038440
pow2_xy(-3, 28) = 1996092249
pow2_xy(-2, 28) = 2045331439
pow2_xy(-1, 28) = 2095785251
pow2_xy(0, 28) = 2147483647
invalid state
pow2_xy(1, 28) = 0
invalid state
pow2_xy(2, 28) = 0
invalid state
pow2_xy(3, 28) = 0
invalid state
pow2_xy(4, 28) = 0
invalid state
pow2_xy(5, 28) = 0
**reproduced!**
pow2_xy(6, 28) = 0
invalid state
pow2_xy(7, 28) = 0
invalid state
pow2_xy(8, 28) = 0
invalid state
pow2_xy(9, 28) = 0

So it is probable that we are invoking pow2_xy with x = 6 causing our bot to crash abruptly.

I next looked at a simplified version of the calling code from earlier (with the for loop unrolled), to see if that sheds any light:


    enDiff = iLog4(avgEn) - iLog4(minEn);
    /* calc some energy borders between minEn and avgEn */
    en[0] = fixmul(avgEn, pow2_xy(-(6 * enDiff), 7*4));
    en[1] = fixmul(avgEn, pow2_xy(-(4 * enDiff), 7*4));
    en[2] = fixmul(avgEn, pow2_xy(-(2 * enDiff), 7*4));
    en[3] = fixmul(avgEn, pow2_xy(0, 7*4));

That seems to suggest that if enDiff = -1 it would reproduce our crash exactly. In general if enDiff < 0 for any reason, this would trigger a segfault in pow2_xy. Because enDiff = iLog4(avgEn) - iLog4(minEn);, if enDiff < 0 then iLog4(avgEn) < iLog4(minEn) which implies that avgEn < minEn. If average(energy) < minimum(energy) something else has gone very wrong.

If we could reproduce this locally, then we could step the code and (hopefully) pinpoint exactly where things started to go awry. Reading through the libvo-aacenc source some more, it seems that the encoding functions are pure over the encoder state. Or put simply, they only rely on the current state of the encoder to operate. So what if we dump the encoder state from our core dump resume the execution locally? gdb's dump binary memory is our friend here. Inspecting our encoder structure in gdb:


(gdb) p *(AAC_ENCODER*)hCodec
$58 = {config = {sampleRate = 16000, bitRate = 21333, nChannelsIn = 1, nChannelsOut = 1, bandWidth = 15000, adtsUsed = 0}, elInfo = {elType = ID_SCE, instanceTag = 0, nChannelsInEl = 1, ChannelIndex = {0, 1}},
  qcKernel = {averageBitsTot = 1368, maxBitsTot = 6144, globStatBits = 3, nChannels = 1, bitResTot = 4592, maxBitFac = 395, padding = {paddingRest = 14720}, elementBits = {chBitrate = 21333, averageBits = 1365,
      maxBits = 6144, bitResLevel = 4592, maxBitResBits = 4776, relativeBits = 16384}, adjThr = {bresParamLong = {clipSaveLow = 20, clipSaveHigh = 95, minBitSave = -5, maxBitSave = 30, clipSpendLow = 20,
        clipSpendHigh = 95, minBitSpend = -10, maxBitSpend = 40}, bresParamShort = {clipSaveLow = 20, clipSaveHigh = 75, minBitSave = 0, maxBitSave = 20, clipSpendLow = 20, clipSpendHigh = 75, minBitSpend = -5,
        maxBitSpend = 50}, adjThrStateElem = {peMin = 3690, peMax = 4898, peOffset = 50, ahParam = {modifyMinSnr = 1 '\001', startSfbL = 15, startSfbS = 3}, minSnrAdaptParam = {maxRed = 536870912,
  //// snip ////
              0}, {0, 0, 0, 0, 0, 0, 0, 0}}, windowNrgF = {{0, 0, 0, 0, 0, 0, 0, 0}, {0, 0, 0, 0, 0, 0, 0, 0}}, iirStates = {0, 0}, maxWindowNrg = 0, accWindowNrg = 0}, mdctDelayBuffer = 0x7f1c7402b160,
        sfbThresholdnm1 = {41468, 13120, 6576, 2080, 1312, 1312, 1312, 1312, 1312, 1312, 2624, 2624, 2624, 2624, 2624, 2624, 2624, 3936, 3936, 3936, 3936, 5248, 5248, 6560, 6560, 7872, 7872, 9184, 9184, 20960,
          20960, 33184, 33184, 33184, 104960, 104960, 104960, 1048544, 1048544, 1048544, 1048544, 10502400, 10502400, 10502400, 10502400, 10502400, 10502400, 10502400, 31507200, 0, 0}, mdctScalenm1 = 0,
        sfbThreshold = {sfbLong = {0 <repeats 60 times>}, sfbShort = {{0 <repeats 15 times>}, {0 <repeats 15 times>}, {0 <repeats 15 times>}, {0 <repeats 15 times>}, {0 <repeats 15 times>}, {
  //// snip ////
                  0 <repeats 12 times>}, predictionGain = 0}, {tnsActive = 0, parcor = {0 <repeats 12 times>}, predictionGain = 0}, {tnsActive = 0, parcor = {0 <repeats 12 times>}, predictionGain = 0}}}}}},
    pScratchTns = 0x7f1c740284a0, sampleRateIdx = 8}, bseInit = {nChannels = 1, bitrate = 21333, sampleRate = 16000, profile = 1}, bitStream = {pBitBufBase = 0x7f1c2c00c6e0 "\0016\365\2106H#\032\006\3032",
    pBitBufEnd = 0x7f1c2c00cedf "\234\220", pWriteNext = 0x7f1c2c00c6e0 "\0016\365\2106H#\032\006\3032", cache = 0, wBitPos = 0, cntBits = 0, size = 16384, isValid = 1}, hBitStream = 0x7f1c74025378, initOK = 0,
  intbuf = 0x7f1c74025440, encbuf = 0x354b000, inbuf = 0x354b000, enclen = 1024, inlen = 1024, intlen = 0, uselength = 0, hCheck = 0x0, voMemop = 0x3547dc0, voMemoprator = {Alloc = 0x0, Free = 0x0, Set = 0x0,
    Copy = 0x0, Check = 0x0, Compare = 0x0, Move = 0x0}}

Yikes! It seems that our encoder state is a complicated structure. Oh well, we should still be able to reconstruct it...


# Dump our encoder state variables...
(gdb) dump binary memory quantSpec.bin 0x7f1c7402be00 0x7f1c7402be00+1024
(gdb) dump binary memory maxValueInSfb.bin 0x7f1c7401a660 0x7f1c7401a660+60
(gdb) dump binary memory scf.bin 0x7f1c74017820 0x7f1c74017820+60
(gdb) dump binary memory quantSpec2.bin 0x7f1c7402c600 0x7f1c7402c600+1024
(gdb) dump binary memory maxValueInSfb2.bin 0x7f1c7401a6d8 0x7f1c7401a6d8+60
(gdb) dump binary memory scf2.bin 0x7f1c74017898 0x7f1c74017898+60
(gdb) dump binary memory sfbEnergy.bin 0x7f1c74023970 0x7f1c74023970+100
(gdb) dump binary memory sfbSpreadedEnergy.bin 0x7f1c74023f58 0x7f1c74023f58+100
(gdb) dump binary memory quantSpec.bin 0x7f1c7402be00 0x7f1c7402be00+10240
(gdb) dump binary memory sfbThreshold.bin 0x7f1c740236a0 0x7f1c740236a0+10240
(gdb) dump binary memory maxValueInSfb.bin 0x7f1c7401a660 0x7f1c7401a660+10240
(gdb) dump binary memory scf.bin 0x7f1c74017820 0x7f1c74017820+10240
(gdb) dump binary memory quantSpec2.bin 0x7f1c7402c600 0x7f1c7402c600+10240
(gdb) dump binary memory maxValueInSfb2.bin 0x7f1c7401a6d8 0x7f1c7401a6d8+10240
(gdb) dump binary memory scf2.bin 0x7f1c74017898 0x7f1c74017898+10240
(gdb) dump binary memory sfbEnergy.bin 0x7f1c74023970 0x7f1c74023970+10240
(gdb) dump binary memory sfbSpreadedEnergy.bin 0x7f1c74023f58 0x7f1c74023f58+10240
(gdb) dump binary memory sfbThreshold.bin 0x7f1c740236a0 0x7f1c740236a0+10240
(gdb) dump binary memory mdctSpectrum.bin 0x7f1c74026480 0x7f1c74026480+10240
(gdb) dump binary memory sfbOffset.bin 0x7f1caaa6cbe0 0x7f1caaa6cbe0+10240
(gdb) dump binary memory sfbOffset2.bin 0x7f1caaa6cd38 0x7f1caaa6cd38+10240
(gdb) dump binary memory mdctDelayBuffer.bin 0x7f1c7402a4e0 0x7f1c7402a4e0+10240
(gdb) dump binary memory mdctDelayBuffer2.bin 0x7f1c7402b160 0x7f1c7402b160+10240
(gdb) dump binary memory mdctSpectrum2.bin 0x7f1c74027480 0x7f1c74027480+10240
(gdb) dump binary memory pScratchTns.bin 0x7f1c740284a0 0x7f1c740284a0+10240
(gdb) dump binary memory bitBufBase.bin 0x7f1c2c00c6e0 0x7f1c2c00c6e0+102400
(gdb) dump binary memory hBitStream.bin 0x7f1c74025378 0x7f1c74025378+102400
(gdb) dump binary memory intbuf.bin 0x7f1c74025440 0x7f1c74025440+102400
(gdb) dump binary memory encbuf.bin 0x354b000 0x354b000+10240
# Note down relative offsets...
(gdb) p 0x7f1c2c00cedf - 0x7f1c2c00c6e0 # bitBufEnd
$12 = 2047
(gdb) p 0x7f1c2c00c6e0 - 0x7f1c2c00c6e0 # pWriteNext
$13 = 0

Now we can hack together some C to restore our encoder state from the memory dumps:


#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <aacenc/inc/aacenc_core.h>
#include <aacenc/basic_op/oper_32b.h>
#include <aacenc/basic_op/oper_32b.c>
#include <common/include/voAAC.h>
#include <common/include/cmnMemory.h>
#include "wavreader.h"

// loads a file into an malloc'd region
void *read_to_mem(char *filename)
{
    struct stat st;
    if (stat(filename, &st) != 0)
    {
        printf("stat failed: %s", filename);
        exit(1);
    }

    void *buf = (void *)malloc(st.st_size);
    FILE *file = fopen(filename, "r");

    if (fread((unsigned char *)buf, 1, st.st_size, file) <= 0)
    {
        printf("read failed: %s", filename);
        exit(1);
    }
    fclose(file);

    printf("loaded %s into buffer size %lld\n", filename, (long long)st.st_size);

    return buf;
}

int main(int argc, char *argv[])
{
    VO_AUDIO_CODECAPI codec_api = {0};
    // load the root encoder state
    AAC_ENCODER *handle = (AAC_ENCODER *)read_to_mem("memdump/encoder-state.bin");
    // update referenced pointers
    handle->qcOut.qcChannel[0].quantSpec = (Word16 *)read_to_mem("memdump/quantSpec.bin");
    handle->qcOut.qcChannel[0].maxValueInSfb = (Word16 *)read_to_mem("memdump/maxValueInSfb.bin");
    handle->qcOut.qcChannel[0].scf = (Word16 *)read_to_mem("memdump/scf.bin");
    handle->qcOut.qcChannel[1].quantSpec = (Word16 *)read_to_mem("memdump/quantSpec2.bin");
    handle->qcOut.qcChannel[1].maxValueInSfb = (Word16 *)read_to_mem("memdump/maxValueInSfb2.bin");
    handle->qcOut.qcChannel[1].scf = (Word16 *)read_to_mem("memdump/scf2.bin");
    handle->psyOut.psyOutChannel[0].sfbEnergy = (Word32 *)read_to_mem("memdump/sfbEnergy.bin");
    handle->psyOut.psyOutChannel[0].sfbSpreadedEnergy = (Word32 *)read_to_mem("memdump/sfbSpreadedEnergy.bin");
    handle->psyOut.psyOutChannel[0].sfbThreshold = (Word32 *)read_to_mem("memdump/sfbThreshold.bin");
    handle->psyOut.psyOutChannel[0].mdctSpectrum = (Word32 *)read_to_mem("memdump/mdctSpectrum.bin");
    handle->psyKernel.psyConfLong.sfbOffset = (Word16 *)read_to_mem("memdump/sfbOffset.bin");
    handle->psyKernel.psyConfShort.sfbOffset = (Word16 *)read_to_mem("memdump/sfbOffset2.bin");
    handle->psyKernel.psyData[0].mdctDelayBuffer = (Word16 *)read_to_mem("memdump/mdctDelayBuffer.bin");
    handle->psyKernel.psyData[0].mdctSpectrum = (Word32 *)read_to_mem("memdump/mdctSpectrum.bin");
    handle->psyKernel.psyData[1].mdctDelayBuffer = (Word16 *)read_to_mem("memdump/mdctDelayBuffer2.bin");
    handle->psyKernel.psyData[1].mdctSpectrum = (Word32 *)read_to_mem("memdump/mdctSpectrum2.bin");
    handle->psyKernel.pScratchTns = (Word32 *)read_to_mem("memdump/pScratchTns.bin");
    handle->bitStream.pBitBufBase = (UWord8 *)read_to_mem("memdump/bitBufBase.bin");
    handle->hBitStream = (HANDLE_BIT_BUF)read_to_mem("memdump/hBitStream.bin");
    handle->hBitStream->pBitBufBase = (UWord8 *)read_to_mem("memdump/bitBufBase.bin");
    handle->intbuf = (short *)read_to_mem("memdump/intbuf.bin");
    handle->encbuf = (short *)read_to_mem("memdump/encbuf.bin");
    // update relative pointers
    handle->bitStream.pBitBufEnd = handle->bitStream.pBitBufBase + 2047;
    handle->bitStream.pWriteNext = handle->bitStream.pBitBufBase;
    handle->hBitStream->pBitBufEnd = handle->hBitStream->pBitBufBase + 2047;
    handle->hBitStream->pWriteNext = handle->hBitStream->pBitBufBase;
    handle->inbuf = handle->encbuf;

    // update function pointers
    handle->voMemop = (VO_MEM_OPERATOR *)malloc(sizeof(VO_MEM_OPERATOR));
    handle->voMemop->Alloc = cmnMemAlloc;
    handle->voMemop->Copy = cmnMemCopy;
    handle->voMemop->Free = cmnMemFree;
    handle->voMemop->Set = cmnMemSet;
    handle->voMemop->Check = cmnMemCheck;

    VO_CODECBUFFER *output = (VO_CODECBUFFER *)read_to_mem("memdump/output.bin");
    output->Buffer = (VO_PBYTE)read_to_mem("memdump/output-buf.bin");
    VO_AUDIO_OUTPUTINFO output_info = {0};

    voGetAACEncAPI(&codec_api);

    // trigger the encoder!
    unsigned int res = codec_api.GetOutputData((VO_HANDLE)handle, output, &output_info);
    if (res != VO_ERR_NONE)
    {
        fprintf(stderr, "Unable to encode frame: %u\n", res);
        return 1;
    }

    codec_api.Uninit(handle);

    fprintf(stderr, "Done\n");
    return 0;
}

Now we should be able to execute the encoder and reproduce the crash!


> make && ./aac-enc 
# snip...
loaded memdump/encoder-state.bin into buffer size 17408
loaded memdump/quantSpec.bin into buffer size 10240
loaded memdump/maxValueInSfb.bin into buffer size 10240
loaded memdump/scf.bin into buffer size 10240
loaded memdump/quantSpec2.bin into buffer size 10240
loaded memdump/maxValueInSfb2.bin into buffer size 10240
loaded memdump/scf2.bin into buffer size 10240
loaded memdump/sfbEnergy.bin into buffer size 10240
loaded memdump/sfbSpreadedEnergy.bin into buffer size 10240
loaded memdump/sfbThreshold.bin into buffer size 10240
loaded memdump/mdctSpectrum.bin into buffer size 10240
loaded memdump/sfbOffset.bin into buffer size 10240
loaded memdump/sfbOffset2.bin into buffer size 10240
loaded memdump/mdctDelayBuffer.bin into buffer size 10240
loaded memdump/mdctSpectrum.bin into buffer size 10240
loaded memdump/mdctDelayBuffer2.bin into buffer size 10240
loaded memdump/mdctSpectrum2.bin into buffer size 10240
loaded memdump/pScratchTns.bin into buffer size 10240
loaded memdump/bitBufBase.bin into buffer size 102400
loaded memdump/hBitStream.bin into buffer size 102400
loaded memdump/bitBufBase.bin into buffer size 102400
loaded memdump/intbuf.bin into buffer size 102400
loaded memdump/encbuf.bin into buffer size 10240
loaded memdump/output.bin into buffer size 8
loaded memdump/output-buf.bin into buffer size 2048
Done

Done!? Wait, where is our burning pile of invalid memory dereferences? It seems that the resumed encoder exited cleanly. What if we add some logging to pow2_xy?


# snip
pow2_xy(-104, 320)
pow2_xy(-141, 320)
pow2_xy(-58, 320)
pow2_xy(-64, 320)
pow2_xy(-40, 320)
pow2_xy(-68, 320)
pow2_xy(-136, 320)
pow2_xy(-40, 320)
pow2_xy(-40, 320)
pow2_xy(-40, 320)
pow2_xy(-64, 320)
pow2_xy(-40, 320)
pow2_xy(-40, 320)
pow2_xy(-90, 320)
pow2_xy(-94, 320)
pow2_xy(-130, 320)
pow2_xy(-82, 320)
pow2_xy(-104, 320)
pow2_xy(-126, 320)
pow2_xy(-86, 320)
pow2_xy(-58, 320)
pow2_xy(-166, 320)
pow2_xy(-112, 320)
pow2_xy(-76, 320)
pow2_xy(-108, 320)
pow2_xy(-10769, 1380)
pow2_xy(-10504, 1380)
pow2_xy(-66, 28)
pow2_xy(-44, 28)
pow2_xy(-22, 28)
pow2_xy(0, 28)
Done

There are no invalid x values here either. Stepping through the code using gdb makes it clear as to why. Many of the internal functions mutate the encoder state throughout the lifetime of the GetOutputData entrypoint. So the state of the the encoder that we restored was the state at the instant of the segfault and not the actual state when the bot called GetOutputData prior to the crash. Unfortunately we cannot recover the original encoder state from the core dump either.

So let us go deeper then. How about if we restore the state at the frame above pow2_xy? Lets dump the memory from those arguments and try again:


# Go up a frame to "allowMoreHoles"
(gdb) f 1
#1  0x00007f1caaa5bbda in allowMoreHoles (desiredPe=<optimized out>, nChannels=<optimized out>, ahParam=0x7f1c74021082, ahFlag=0x7f1c3e1fb354, peData=0x7f1c3e1fae90, psyOutElement=0x7f1c74022228,
    psyOutChannel=0x7f1c74022488) at aacenc/src/adj_thr.c:676
676          en[enIdx] = fixmul(avgEn, pow2_xy(L_negate(enFac),7*4));
# Show the arguments
(gdb) info args
desiredPe = <optimized out>
nChannels = <optimized out>
ahParam = 0x7f1c74021082
ahFlag = 0x7f1c3e1fb354
peData = 0x7f1c3e1fae90
psyOutElement = 0x7f1c74022228
psyOutChannel = 0x7f1c74022488
# Dump the argument state
(gdb) dump binary memory psyOutChannel.bin 0x7f1c74022488 0x7f1c74022488+sizeof(PSY_OUT_CHANNEL)*2
(gdb) dump binary memory sfbEnergy.bin 0x7f1c74023970 0x7f1c74023970+100
(gdb) dump binary memory sfbSpreadedEnergy.bin 0x7f1c74023f58 0x7f1c74023f58+100
(gdb) dump binary memory sfbThreshold.bin 0x7f1c740236a0 0x7f1c740236a0+10240
(gdb) dump binary memory mdctSpectrum.bin 0x7f1c74026480 0x7f1c74026480+10240
(gdb) dump binary memory peData.bin 0x7f1c3e1fae90 0x7f1c3e1fae90+sizeof(PE_DATA)
(gdb) dump binary memory ahFlag.bin 0x7f1c3e1fb354 0x7f1c3e1fb354+1024

Now revise our main:


int main(int argc, char *argv[])
{
    PSY_OUT_CHANNEL* psyOutChannel = (PSY_OUT_CHANNEL *)read_to_mem("memdump/psyOutChannel.bin");
    psyOutChannel[0].sfbEnergy = (Word32 *)read_to_mem("memdump/sfbEnergy.bin");
    psyOutChannel[0].sfbSpreadedEnergy = (Word32 *)read_to_mem("memdump/sfbSpreadedEnergy.bin");
    psyOutChannel[0].sfbThreshold = (Word32 *)read_to_mem("memdump/sfbThreshold.bin");
    psyOutChannel[0].mdctSpectrum = (Word32 *)read_to_mem("memdump/mdctSpectrum.bin");
    PE_DATA* peData = (PE_DATA*)read_to_mem("memdump/peData.bin");
    Word16* ahFlag = (Word16*)read_to_mem("memdump/ahFlag.bin");
    Word16* ahParam = (Word16*)read_to_mem("memdump/ahParam.bin");

    allowMoreHoles(psyOutChannel, NULL, peData, ahFlag, ahParam, 1, 1540);
}

Finally, we can run the program:


> make && ./aac-enc
# snip...
loaded memdump/psyOutChannel.bin into buffer size 3040
loaded memdump/sfbEnergy.bin into buffer size 10240
loaded memdump/sfbSpreadedEnergy.bin into buffer size 10240
loaded memdump/sfbThreshold.bin into buffer size 10240
loaded memdump/mdctSpectrum.bin into buffer size 10240
loaded memdump/peData.bin into buffer size 2420
loaded memdump/ahFlag.bin into buffer size 1024
loaded memdump/ahParam.bin into buffer size 6
pow2_xy(6, 28)
Segmentation fault

Success! It seems we can now reproduce the crash, even if it is directly from the above stack frame. Now we can modify our compiler flags with -O0 to disable optimisations and inspect the local state during execution.


> gdb ./.libs/aac-enc.
Reading symbols from ./.libs/aac-enc...
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc 
# snip...
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f942e6 in voAACEnc_pow2_xy (x=6, y=28) at aacenc/basic_op/oper_32b.c:370
370       res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;
(gdb) bt
#0  0x00007ffff7f942e6 in voAACEnc_pow2_xy (x=6, y=28) at aacenc/basic_op/oper_32b.c:370
#1  0x00007ffff7f97ea0 in allowMoreHoles (psyOutChannel=0x55555555a2a0, psyOutElement=0x0, peData=0x5555555654c0, ahFlag=0x5555555702b0, ahParam=0x555555566250, 
    nChannels=1, desiredPe=1540) at aacenc/src/adj_thr.c:714
#2  0x0000555555556402 in main (argc=1, argv=0x7fffffffe058) at aac-enc.c:70
(gdb) f 1
#1  0x00007ffff7f97ea0 in allowMoreHoles (psyOutChannel=0x55555555a2a0, psyOutElement=0x0, peData=0x5555555654c0, ahFlag=0x5555555702b0, ahParam=0x555555566250, 
    nChannels=1, desiredPe=1540) at aacenc/src/adj_thr.c:714
714         en[0] = fixmul(avgEn, pow2_xy(-(6 * enDiff), 7*4));
(gdb) info locals
enDiff = -1
avgEn = 510 # !!
minEn = 512 # !!
ahCnt = 1
en = {507315022, 0, 1679226483, 0}
maxSfb = 0
done = 0
startSfb = {15, 0}
enIdx = 0
minSfb = 1000
ch = 1
sfb = 43
actPe = 1732
shift = 30
(gdb) 

We have just confirmed our previous conjecture that avgEn < minEn! So why does this happen? Lets step through the relevant code.


# Set a breakpoint
(gdb) b allowMoreHoles
# Start the program
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc 
Breakpoint 1, allowMoreHoles (psyOutChannel=0x55555555a2a0, psyOutElement=0x0, peData=0x5555555654c0, ahFlag=0x5555555702b0, ahParam=0x555555566250, nChannels=1, desiredPe=1540) at aacenc/src/adj_thr.c:699
699         if(ahCnt) {
# Print out minEn and avgEn after initial calculation
(gdb) p minEn 
$1 = 512
(gdb) p avgEn
$2 = 512
# They look good at this stage, keep going...
(gdb) n
701           shift = norm_l(ahCnt);
(gdb) n
702               iahCnt = Div_32( 1 << shift, ahCnt << shift );
(gdb) n
703           avgEn = fixmul(avgEn, iahCnt);
# We just updated avgEn, lets look again
(gdb) p avgEn
$8 = 510
# Oh no, so this `fixmul` nudged `avgEn` below `minEn` 
(gdb) n
707         enDiff = iLog4(avgEn) - iLog4(minEn);
# And that means enDiff is now negative...
(gdb) p enDiff
$10 = -1
# And all hell breaks loose...
(gdb) n
714         en[0] = fixmul(avgEn, pow2_xy(-(6 * enDiff), 7*4));
(gdb) n
pow2_xy(6, 28)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f942e6 in voAACEnc_pow2_xy (x=6, y=28) at aacenc/basic_op/oper_32b.c:370
370       res = pow2Table[(POW2_TABLE_SIZE*fPart)/y] >> iPart;

So now we know at least roughly where things have gone wrong. The fixed-point calculation of the average energy avgEn is introducing an error causing avgEn < minEn in rare cases.

But what is actually happening in norm_l, Div_32 and fixmul that leads to this error? The source code mentions these operations are in "double precision" however most of the values are stored in 32-bit integers. What is going on here? Google leads us to this relic, last modified in 1996 by the France Telecom:


/*___________________________________________________________________________
 |                                                                           |
 |  This file contains operations in double precision.                       |
 |  These operations are not standard double precision operations.           |
 |  They are used where single precision is not enough but the full 32 bits  |
 |  precision is not necessary. For example, the function Div_32() has a     |
 |  24 bits precision which is enough for our purposes.                      |
 |                                                                           |
 |  The double precision numbers use a special representation:               |
 |                                                                           |
 |     L_32 = hi<<16 + lo<<1                                                 |
 |                                                                           |
 |  L_32 is a 32 bit integer.                                                |
 |  hi and lo are 16 bit signed integers.                                    |
 |  As the low part also contains the sign, this allows fast multiplication. |
 |                                                                           |
 |      0x8000 0000 <= L_32 <= 0x7fff fffe.                                  |
 |                                                                           |
 |  We will use DPF (Double Precision Format )in this file to specify        |
 |  this special format.                                                     |
 |___________________________________________________________________________|
*/

Ahhh, so is a special kind of "double precision" split across 32 bits. Now with this in mind we can annotate the relevant code:


if(ahCnt) {
  // calc how many bits needed to shift ahCnt into the normalised range [0x40000000, 0x7fff0000]
  shift = norm_l(ahCnt);
  // perform fixed-width division of normalised 1 and ahCnt
  iahCnt = Div_32(1 << shift, ahCnt << shift); 
  // perform fixed-width multiplication of avgEn and iahCnt 
  avgEn = fixmul(avgEn, iahCnt);
}

So where are we losing precision then? Lets step through our scenario in gdb:


# Break at `main`
(gdb) b main
Breakpoint 1 at 0x22f0: file aac-enc.c, line 60.
# Start the program
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc 
Breakpoint 1, main (argc=1, argv=0x7fffffffe068) at aac-enc.c:60
# Set up the input
(gdb) set $avgEn = 512
# Calculate the normalisation left shift for ahCnt = 1
(gdb) set $shift = norm_l(1)
(gdb) p $shift
$1 = 30
# Perform the shifts
(gdb) set $num = 1 << $shift
(gdb) set $denom = 1 << $shift
# Perform the division
(gdb) set $iahCnt = voAACEnc_Div_32($num, $denom)
# Perform the multiplication
(gdb) set $mul = (uint64_t)$avgEn * (uint64_t)$iahCnt
(gdb) set $avgEn = ($mul >> 32) << 1
# Observe the results
(gdb) p (double)$iahCnt / (double)((uint64_t)2 << 31)
$3 = 0.4999999962747097
(gdb) p (double)$mul / (double)((uint64_t)2 << 31)
$3 = 254.99999810010195
(gdb) p $avgEn
$5 = 510

There are a couple of things that do not look right. Firstly, we are executing Div_32($num, $denom) where $num = $demon breaking the num < denom invariant. We then lose some precision in our Div_32 call and this propagates to our 64-bit multiplication of $avgEn, at this stage the floating point equivalent is ~255 and our error is -1. The final left-shift then doubles the error to -2 and we are left with 510.

What if Div_32 did not lose precision?


# Break at `main`
(gdb) b main
Breakpoint 1 at 0x22f0: file aac-enc.c, line 60.
# Start the program
(gdb) run
Starting program: /storage/workspace/bots/debug/vo-aacenc-0.1.3/.libs/aac-enc 
Breakpoint 1, main (argc=1, argv=0x7fffffffe068) at aac-enc.c:60
# Set up the input
(gdb) set $avgEn = 512
# 2 << 30 is exactly floating point 0.5 in Q31
(gdb) set $iahCnt = (uint64_t)2 << 30
# Perform the multiplication
(gdb) set $mul = (uint64_t)$avgEn * (uint64_t)$iahCnt
(gdb) set $avgEn = ($mul >> 32) << 1
# Observe the results
(gdb) p (double)$iahCnt / (double)((uint64_t)2 << 31)
$3 = 0.5
(gdb) p (double)$mul / (double)((uint64_t)2 << 31)
$3 = 256
(gdb) p $avgEn
$5 = 512

Voila! avgEn is now exactly what we expect it to be after dividing it by 1.

So how do we fix this? Could we modify Div_32 to never lose precision? Probably not without harming its performance characteristics.

In this particular case the local invariant is minEn < avgEn. So lets just add implement a lower-bound for avgEn:


if(ahCnt) {
  Word32 iahCnt;
  shift = norm_l(ahCnt);
  iahCnt = Div_32( 1 << shift, ahCnt << shift );
  avgEn = fixmul(avgEn, iahCnt);
  if (avgEn < minEn) {
    avgEn = minEn;
  }
}

Now running the program again:


> make && ./aac-enc 
# snip...
loaded memdump/ahFlag.bin into buffer size 1024
pow2_xy(0, 28)
pow2_xy(0, 28)
pow2_xy(0, 28)
pow2_xy(0, 28)
> echo $?
0

That seems to have worked! So lets push this fix to prod shall we?

Uncovering a fossil

Not so fast.... At this point I was struggling to believe that nobody else had encountered this issue previously. I scoured Google, GitHub and finally Bing (of all places) to find this mailing list thread. A quick read reveals that it is almost certainly referring to the same bug. And that it was fixed upstream nearly 9 year ago. However, the fixes were never released to the gstreamer plugin.

Although it was satisfying to understand and fix the bug that had been causing these sporadic crashes, pressing on with unmaintained codec is not a liability we want to take on moving forwards. Rather than relying on an unmaintained AAC encoder from gst-plugins-bad how about we find something used more broadly and hopefully more stable? The libav AAC encoder used in ffmpeg and available as a gstreamer plugin seems like a reasonable choice.

UPDATE: We have since rolled out this shiny new codec and are yet to spot another segfault! Kudos to the ffmpeg maintainers!

Happy debugging!