Reducing Latency by 83% with Esoteric Linux Process Flags
It was a fine Sunday evening. I had just landed in LA for our company offsite, and I met the whole team in person for the first time.
A little later, my phone buzzed with a notification, “Antonio assigned you an issue on Linear.” Antonio is an engineering team lead at Recall.ai, and he tasked me with optimizing Output Media start latency so that our customers’ AI agents would launch faster.
I thought it was going to be the quickest, most straightforward thing I’d work on all week, but nope, it took much longer than anticipated.
This is the story of how a seemingly simple task turned into a deep dive into the undocumented intricacies of the Linux kernel, Bubblewrap sandboxing, and Rust Tokio’s threading model.
The plan
Output Media is a Recall.ai feature that enables outputting ultra-low-latency audio and video from bots. Our customers use Output Media to build in-meeting AI agents, interactive applications, and more.
Under the hood, Output Media works by rendering a customer-supplied web page into audio and video, which is then emitted through the bot.
Because rendering a webpage involves running arbitrary untrusted code, we take a lot of measures to ensure this feature is secure. The most prominent technique we use is sandboxing. For performance and simplicity, we sandbox our Output Media code with Bubblewrap, giving it limited permissions – just enough for it to function properly.
The current state of the feature was pretty decent – it was reliable and the number of things you could build with it was endless. There was just one small issue. Whenever customers activated Output Media, it took a sluggish 12 seconds for video to start streaming. Yikes!
The reason this took so long was that when Output Media was activated, we’d launch an instance of Chromium to render the web page. Chromium is known to be complex, large, and resource-intensive, and the bots operate in a very resource-constrained environment.
So, what could be done to fix this?
The plan seemed simple: pre-load Chromium when the bot starts up, instead of launching it only when Output Media is activated. That way, by the time a developer requests a bot to join their meeting, Chromium would already be ready, and their page could be loaded instantaneously.
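In very rough terms, the shape of the change looks like this (hypothetical names, a chromium binary on the PATH, and Tokio’s process feature assumed; this is a sketch of the idea, not our actual code):

use tokio::process::{Child, Command};

struct OutputMedia {
    chromium: Option<Child>, // pre-warmed browser, launched at bot startup
}

impl OutputMedia {
    // Called while the bot is still booting, long before Output Media is activated.
    fn preload(&mut self) -> std::io::Result<()> {
        let child = Command::new("chromium")
            .arg("--headless")
            .arg("about:blank") // park on an empty page until a customer activates the feature
            .spawn()?;
        self.chromium = Some(child);
        Ok(())
    }

    // Called on activation: the browser already exists, so we only need to point it
    // at the customer's page (e.g. over the DevTools protocol, elided here).
    fn activate(&mut self, _page_url: &str) {
        // navigate the pre-warmed Chromium instance instead of launching a new one
    }
}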
Roadblock of the century
I was working my way through the code, refactoring it so that we could launch Chromium before our meeting join logic. It was going pretty well, so then I started testing it locally.
Ok… Bot’s running, it’s loading the pre-requisites, it’s launching Chromium, done… Now it’s announced its status. Great! Let’s push this.
At that point, I was pretty convinced I had finished the task. I pushed it to our staging platform and waited for our dozens of integration tests to succeed. But, they didn’t.
Checking the logs revealed that Chromium launched, and then it terminated – just out of the blue. The weird thing was that none of our code performed process terminations, so what happened here?
A little context
Bubblewrap is a lightweight sandboxing tool for Linux that creates isolated environments around individual program executions. Bubblewrap can spawn contained processes leveraging Linux namespaces, seccomp filters, bind mounts and kernel-level APIs.
Unlike full virtual machines or containerization like Docker, which can be resource-intensive, Bubblewrap operates with minimal overhead while still providing strong isolation, which is crucial in our bot environments where compute is limited.
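To make that concrete, here’s roughly the shape of a locked-down Bubblewrap invocation, built here with Rust’s Command API. The specific binds and flags are illustrative examples of the kind of permissions involved, not our production command:

use std::process::Command;

fn spawn_sandboxed_renderer() -> std::io::Result<std::process::Child> {
    Command::new("bwrap")
        .args([
            "--unshare-all",             // fresh namespaces: pid, net, user, mount, ...
            "--die-with-parent",         // the flag this whole post is about
            "--ro-bind", "/usr", "/usr", // read-only view of what the renderer needs
            "--ro-bind", "/lib", "/lib",
            "--proc", "/proc",           // fresh procfs for the sandbox
            "--dev", "/dev",             // minimal /dev
            "--tmpfs", "/tmp",           // scratch space that disappears with the sandbox
            "chromium",                  // the program to run inside the sandbox
            "--headless",
        ])
        .spawn()
}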
We also leverage Tokio to ensure fast performance and parallel processing for all the different types of tasks our bots handle simultaneously. This includes operations like relaying audio in real-time to transcription providers and, at the same time, handling incoming results and saving them to our database for retrieval later, all while forwarding these results to our customers’ specified ingress endpoints.
Output Media is no different–it uses many threads at once and juggles them all to ensure a smooth and optimized experience without compromising on the other essential tasks our customers rely on our bots for. This becomes very important later on…
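As a simplified flavor of that concurrency (made-up task names and channels, Tokio with its full feature set assumed; not our real pipeline):

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (audio_tx, mut audio_rx) = mpsc::channel::<Vec<u8>>(64);
    let (result_tx, mut result_rx) = mpsc::channel::<String>(64);

    // Task 1: relay audio chunks to a transcription provider as they arrive.
    tokio::spawn(async move {
        while let Some(chunk) = audio_rx.recv().await {
            println!("relaying {} bytes of audio upstream", chunk.len());
        }
    });

    // Task 2: persist transcription results and forward them to the customer's
    // endpoint, running concurrently with the audio relay above.
    tokio::spawn(async move {
        while let Some(result) = result_rx.recv().await {
            println!("saving and forwarding result: {result}");
        }
    });

    // Meanwhile the main task keeps running the bot's meeting logic and feeds the channels.
    audio_tx.send(vec![0u8; 960]).await.unwrap();
    result_tx.send("partial transcript".into()).await.unwrap();
    tokio::time::sleep(std::time::Duration::from_millis(100)).await;
}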
I found the solution! But that shouldn’t have fixed it…
Over the course of the next week, I spent every single day debugging this process termination, and I was getting exhausted.
I had explored every single theory I had as to why the process was terminating, and eventually devolved to just throwing ideas at the wall hoping that one would stick.
I decided to play around with the arguments we gave Bubblewrap to run our isolated Output Media code.
Oops, removing that one broke literally everything. Ok, this one did nothing… That one obviously didn’t fix the problem, why did I even try changing it? Alright this one is DEFINITELY unrelated, right? Save, compile. Why’s it taking so long? Please work please work please work please wor-
“bro, I fixed it.”
In our Bubblewrap command, there was a flag, --die-with-parent, that was somehow causing our Chromium to randomly terminate. Removing this argument appeared to fix the issue.
But this made no sense. The parent was the bot’s main loop, which was definitely still running. So how could removing the --die-with-parent flag have fixed the problem?
Down the rabbit hole we go
Remember when I provided context on Bubblewrap a few paragraphs ago? Well, --die-with-parent uses those aforementioned “kernel-level APIs.” From the flag’s name, I deduced that its purpose is to ensure that if the parent executor exits or crashes, its exit would propagate down to the child program being executed in Bubblewrap.
In theory, this can be quite useful to ensure that you don’t end up with a bunch of stranded processes, reducing the need for manual cleanup procedures. But evidently it turned out to be a little more nuanced than that, since when --die-with-parent was set, our child process was being terminated randomly, even though its parent continued to live.
So my next step was to actually look into Bubblewrap’s source code.
/* If --die-with-parent was specified, use PDEATHSIG to ensure SIGKILL
 * is sent to the current process when our parent dies.
 */
static void
handle_die_with_parent (void)
{
  if (opt_die_with_parent && prctl (PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) != 0)
    die_with_error ("prctl");
}
When I first saw this block of code, I was remarkably underwhelmed. It’s so simple! All it did was call the native function to set PR_SET_PDEATHSIG. This functionality is entirely handled by the Linux kernel. When this call is made, the kernel records this preference in the process’ internal data structures.
All the documentation we came across on PR_SET_PDEATHSIG was subtly misleading. The phrase in the comment block, “When our parent dies”, sounds like it refers to the parent process. But actually, it refers to the parent thread. This subtle distinction made all the difference, and we missed it for the longest time.
The actual cause
After digging through Bubblewrap’s source code and Linux kernel documentation, we finally discovered the true culprit. The issue wasn’t with process death but with how the Linux kernel interprets thread lifetimes when using PR_SET_PDEATHSIG.
When the --die-with-parent flag is set, Bubblewrap uses the Linux kernel’s prctl(PR_SET_PDEATHSIG, SIGKILL) call. This kernel feature is designed to send a SIGKILL signal to a child process when its parent thread dies–not necessarily when the entire parent process exits.
Here’s how the issue manifested in our codebase:
- Tokio’s threading model and thread parking
Our bot code uses Tokio, an asynchronous runtime that manages a pool of worker threads. Tokio efficiently utilizes thread parking and unparking to save resources. When tasks are temporarily inactive, their threads can be “parked” (essentially put to sleep), and if they remain idle for a while, they might be stopped altogether.
- Thread-based parent tracking in the kernel
When a thread in Tokio’s runtime launches Bubblewrap with PR_SET_PDEATHSIG, the Linux kernel designates that specific thread as the “parent” of the Bubblewrap process. This means that when that thread exits or is reaped by Tokio’s internal scheduler, the kernel considers the “parent” to have died, even if the overall bot process is still running.
- Misinterpretation leading to premature termination
In some cases, Tokio’s scheduler parked the parent thread that had spawned Bubblewrap. If the thread remained idle long enough, Tokio eventually cleaned it up, fully terminating it. The Linux kernel then mistakenly interpreted this as the death of the parent process, triggering PR_SET_PDEATHSIG and sending a SIGKILL to Bubblewrap. Since Bubblewrap itself was running Chromium inside its sandbox, Chromium was killed as well, leading to the seemingly random crashes we observed.
This explains why removing the --die-with-parent flag fixed the issue: we disabled the mechanism that was incorrectly interpreting thread termination as process death.
This behavior is not a bug in Tokio or Bubblewrap, but rather a hard-to-find quirk of the Linux kernel. Since PR_SET_PDEATHSIG tracks the parent thread, not the whole process, it interacts unexpectedly with runtimes that dynamically park and reap threads. This subtle interaction between Tokio’s thread management, Bubblewrap’s process isolation, and Linux kernel internals was completely undocumented in any of the sources we initially checked, making it a particularly nasty debugging challenge.
Light at the end of the tunnel
After all the debugging and investigating, I was able to pick up speed again and push this optimization to completion. The result is 10+ seconds of wait time shaved off. The time to first frame with Output Media is now just 2-3 seconds on average, leading to a much more enjoyable experience for our customers. Super worth it.