Deep Dive

Debugging Audio Artifacts Caused by... a Serial Port?

Chris Constantine

October 14, 2024

The joys of scaling a 10,000 node cluster

At Recall.ai we run an enormous amount of infrastructure to process millions of meetings per month, in real time.

At peak load, our cluster contains over 10,000 VMs, processing over 1 TB/sec of raw video.

The first version of our cluster ran on Kubernetes, using Amazon's managed EKS control plane with self-managed nodes.

During 2023 we grew extremely rapidly and EKS's control plane could not keep up with our spiky load patterns, so we migrated off Kubernetes and wrote our own cluster scaler to replace it.

I deployed the migration on Friday night, and over Saturday and Sunday everything seemed to be running perfectly.

Because our customers do significantly more meetings on weekdays, we time our major deploys for Friday nights so we can monitor performance during the low-utilization periods on the weekend.

However, on Monday, we started getting customer reports that there were popping noises in the captured audio.

The least-expected failure mode

Out of all the ways this migration could have caused problems, this is not what I was expecting. Our audio recording code is completely unrelated to our cluster scaler.

On top of that, our new cluster runs exactly the same Docker containers as the Kubernetes cluster. Not a single line of application code was changed.

And yet we had not deployed any other changes, and there were zero affected bots in our un-migrated clusters, so the migration had to be involved somehow.

Huh?

Problem solved?

In the past, I had seen audio artifacts like popping sounds appear during periods of CPU starvation.

There are a few processes in the bot that are especially latency-sensitive, so we tune their nice values to give them higher scheduling priority.

No CPU spikes were showing up in our monitoring, but I hypothesized that there could be spikes too brief to register in our coarse-grained metrics.
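
One way to catch stalls that are too brief for coarse dashboards is a simple jitter probe: a loop that sleeps for a fixed interval and reports how late each wakeup actually is. Below is a minimal sketch of the idea in C; the 1 ms interval and 5 ms reporting threshold are arbitrary values chosen for illustration, not numbers from our monitoring.

    /* jitter_probe.c -- surface scheduler stalls too brief for coarse metrics.
     * Sleep for a fixed interval and report how late each wakeup is.
     * Build: gcc -O2 -o jitter_probe jitter_probe.c; runs until killed. */
    #include <stdio.h>
    #include <time.h>

    static long long elapsed_ns(struct timespec a, struct timespec b) {
        return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
    }

    int main(void) {
        const long long interval_ns  = 1000000LL;   /* sleep 1 ms per iteration */
        const long long threshold_ns = 5000000LL;   /* report wakeups more than 5 ms late */
        struct timespec req = { .tv_sec = 0, .tv_nsec = (long)interval_ns };

        for (;;) {
            struct timespec before, after;
            clock_gettime(CLOCK_MONOTONIC, &before);
            nanosleep(&req, NULL);
            clock_gettime(CLOCK_MONOTONIC, &after);

            long long late_ns = elapsed_ns(before, after) - interval_ns;
            if (late_ns > threshold_ns)
                printf("stall: woke up %.1f ms late\n", late_ns / 1e6);
        }
    }

Run next to the audio pipeline, a probe like this will flag stalls of more than a few milliseconds that never register in per-minute CPU graphs.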

I dug into this theory and struck gold almost immediately. The way I was setting nice was incorrect, and the priority wasn't being applied to the appropriate threads.
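
For context on how that can happen: on Linux, nice is effectively a per-thread attribute, and setpriority() called with a process ID only changes the thread whose thread ID equals that PID (normally the main thread), so a latency-sensitive thread has to be reniced by its own thread ID. The sketch below shows that pattern; the thread layout and the -10 value are illustrative, not our actual bot code.

    /* renice_thread.c -- apply a nice value to a specific thread, not the process.
     * Build: gcc -O2 -pthread -o renice_thread renice_thread.c
     * Negative nice values require root or CAP_SYS_NICE. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/resource.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Renice the calling thread by its kernel thread ID. */
    static void renice_current_thread(int nice_value) {
        pid_t tid = (pid_t)syscall(SYS_gettid);
        if (setpriority(PRIO_PROCESS, tid, nice_value) != 0)
            fprintf(stderr, "setpriority(tid=%d) failed: %s\n", tid, strerror(errno));
    }

    static void *audio_thread(void *arg) {
        (void)arg;
        renice_current_thread(-10);   /* illustrative value, not our production setting */
        /* ... latency-sensitive audio work would run here ... */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, audio_thread, NULL);
        pthread_join(t, NULL);
        return 0;
    }

Renicing by process ID after the worker threads have started leaves those threads at their default priority, which is consistent with the symptom above.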

This was a quick fix, and I deployed the change into production.

Problem solved?

Unfortunately not. The next day, even more customers reported audio artifacts.

Spooky action-at-a-distance

So far, I had not been able to reproduce this issue at all.

Our customers confirmed the issue was definitely real, but none of the dozens of bots I tested had any audio artifacts whatsoever.

The breakthrough came when I tried launching bots with precisely the same config as one of the affected bots. With that specific config, one bot in a dozen had popping audio.

I bisected the config options and, through a trial-and-error binary search, determined that enabling real-time transcription was the critical factor causing these artifacts to appear.

This was an extremely bizarre result. Transcription happens in a completely different part of the code from audio encoding, and it is not particularly CPU-intensive.

On top of that, the bots do transcription by streaming the audio over a WebSocket to third-party transcription partners, so nothing transcription-related runs in our cluster anyway.

???

Spot the difference

After figuring out a way to reproduce the issue, I went through our infrastructure with a fine-tooth comb.

There was something different between our new infrastructure and our old infrastructure that was causing a problem, and I just needed to figure out what that difference was.

The only difference I found was extremely far removed from our audio processing. The thought of this difference being the root cause of our audio artifacts was laughable.

Our logging system was different.

In our old Kubernetes cluster, we used the built-in Kubernetes logging infrastructure.

In our new self-built cluster, we send logs over a unix socket with Fluentbit.

Presumably, if Fluentbit logging was the culprit, disabling Fluentbit would resolve the problem.

So I disabled Fluentbit logging.

Launched a fresh bot.

And still heard popping audio.

A lucky break

Thoroughly confused, I spent a few hours just poking around on the new machines.

I noticed offhand that the Docker logs were ending up in SystemD's journal logs.

This was mysterious because I did not configure Docker to log to SystemD's journal.

Digging into this, I realized this was happening because I run the container from a Docker command inside our EC2 instance's cloud-init.

The cloud-init output gets redirected to a few places: SystemD's journal, as well as an internal AWS system that lets you retrieve the system log from a running instance.

[Screenshot: the AWS feature in question.]

Back-back-back-back-pressure

This was a thread to follow. Maybe there was something in SystemD or in AWS's cloud-init logging system that was blocking somewhere.

So I tested the theory that cloud-init logging was the root cause of our problems.

I disabled cloud-init logging.

Launched a fresh bot.

And heard pure and clear audio, without a hint of popping.

Success! Finally, after several hours, a dozen deploys, and chasing a ghost up and down several levels of the stack, I was close to finding the root cause.

Now that I had a single parameter confirmed to trigger the issue, I dove in to find out why.

Comparing the journalctl outputs of affected and unaffected bots, I saw the following log line repeated over and over again in every bot that produced audio artifacts:

...
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
serial8250: too much work for irq4
...

I searched for this log line and found a delightful GitHub thread about it.

It turns out that EC2's console logging system writes to a virtual serial device.

And if you write too much data to that serial device, it gets overwhelmed and will hang momentarily, blocking the writer from proceeding.
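
The "too much work for irq4" warning comes from the kernel's 8250 UART driver, and IRQ 4 is the legacy interrupt line for the first serial port, which is what EC2's virtual console device emulates. You can make the blocking visible from user space with a crude probe like the one below. It assumes the console is the usual /dev/ttyS0 device and that you have permission to write to it directly; treat it as an illustration of the mechanism, not a tool we actually ran.

    /* serial_stall_probe.c -- time writes to the serial console to expose stalls.
     * Build: gcc -O2 -o serial_stall_probe serial_stall_probe.c
     * Assumes /dev/ttyS0; needs root or membership in the dialout group. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/ttyS0", O_WRONLY | O_NOCTTY);
        if (fd < 0) {
            perror("open /dev/ttyS0");
            return 1;
        }

        char line[256];
        memset(line, 'x', sizeof(line) - 1);
        line[sizeof(line) - 1] = '\n';

        for (int i = 0; i < 1000; i++) {
            struct timespec before, after;
            clock_gettime(CLOCK_MONOTONIC, &before);
            ssize_t n = write(fd, line, sizeof(line));   /* blocks once the tty buffers fill */
            clock_gettime(CLOCK_MONOTONIC, &after);

            double ms = (after.tv_sec - before.tv_sec) * 1e3 +
                        (after.tv_nsec - before.tv_nsec) / 1e6;
            if (n < 0 || ms > 5.0)
                fprintf(stderr, "write #%d took %.1f ms (ret=%zd)\n", i, ms, n);
        }

        close(fd);
        return 0;
    }

Push data in faster than the emulated UART can drain it and the write() calls start to stall, while the kernel burns time servicing the port's interrupt, which is exactly where the "too much work for irq4" warnings come from.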

I traced this issue all the way back up the stack, to our audio encoding process, and I finally realized what was happening:

  1. Our transcription module emitted a large volume of debug logs to stdout.
  2. Because we launched the Docker container from cloud-init, that stdout was consumed by EC2's internal logging system.
  3. EC2's internal logging system writes to a virtual serial device.
  4. The interrupt handler that services writes to the serial device cannot itself be interrupted.
  5. The time spent in that interrupt handler acted like CPU starvation for our audio processing.