Recall.ai powers meeting bots for hundreds of companies. We capture millions of meetings per month, and operate enormous infrastructure to do so.
We run all this infrastructure on AWS. Cloud computing is enormously convenient, but also notoriously expensive, which means performance and efficiency are very important to us.
In order to deliver a cost-efficient service to our customers, we're determined to squeeze every ounce of performance we can from our hardware.
We do our video processing on the CPU instead of on the GPU, as GPU availability on the cloud providers has been patchy in the last few years. Before we started our optimization efforts, our bots generally required 4 CPU cores to run smoothly in all circumstances. These 4 CPU cores powered all parts of the bot, from the headless Chromium used to join meetings to the real-time video processing pipelines used to ingest the media.
We set a goal for ourselves to cut this CPU requirement in half, and thereby cut our cloud compute bill in half.
A lofty target, and the first step to accomplish it would be to profile our bots.
Our CPU is being spent doing what??
Everyone knows that video processing is very computationally expensive. Given that we process a ton of video, we initially expected the majority of our CPU usage to be video encoding and decoding.
We profiled a sample of running bots, and came to a shocking realization. The majority of our CPU time was actually being spent in two functions: __memmove_avx_unaligned_erms and __memcpy_avx_unaligned_erms.
Let's take a brief detour to explain what these functions do.
memmove and memcpy are both functions in the C standard library (glibc) that copy blocks of memory. memmove handles a few edge cases around copying memory into overlapping ranges, but we can broadly categorize both these functions as "copying memory".
The avx_unaligned_erms suffix means this function is specifically optimized for systems with Advanced Vector Extensions (AVX) support and is also optimized for unaligned memory access. The erms part stands for Enhanced REP MOVSB/STOSB, which are optimizations in recent Intel processors for fast memory movement. We can broadly categorize the suffix to mean "a faster implementation, for this specific processor".
In our profiling, we discovered that by far, the biggest callers of these functions were in our Python WebSocket client that was receiving the data, followed by Chromium's WebSocket implementation that was sending the data.
An expensive set of sockets...
After pondering this, the result started making more sense. For bots that join calls using a headless Chromium, we needed a way to transport the raw decoded video out of Chromium's JavaScript environment and into our encoder.
We originally settled on running a local WebSocket server, connecting to it in the JavaScript environment, and sending data over that channel.
WebSocket seemed like a decent fit for our needs. It was "fast" as far as web APIs go, convenient to access from within the JS runtime, supported binary data, and most importantly was already built into Chromium.
One complicating factor here is that raw video is surprisingly high bandwidth. A single 1080p 30fps video stream, in uncompressed I420 format, is 1080 * 1920 * 1.5 (bytes per pixel) * 30 (frames per second) = 93.312 MB/s
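The arithmetic above can be reproduced with a few lines of Rust. This is only a sketch: the function names are hypothetical, and it assumes even frame dimensions so that the two quarter-resolution chroma planes of I420 add exactly half a byte per pixel.

```rust
/// Bytes in one uncompressed I420 frame: a full-resolution luma plane plus
/// two chroma planes at quarter resolution, i.e. 1.5 bytes per pixel.
/// Assumes even width and height (hypothetical helper, not from the post).
pub fn i420_frame_bytes(width: u64, height: u64) -> u64 {
    width * height * 3 / 2
}

/// Raw bandwidth of an uncompressed I420 stream, in bytes per second.
pub fn i420_bytes_per_second(width: u64, height: u64, fps: u64) -> u64 {
    i420_frame_bytes(width, height) * fps
}
```

For a 1920x1080 stream at 30 fps this gives 3,110,400 bytes per frame and 93,312,000 bytes per second, matching the 93.312 MB/s figure.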
Our monitoring showed us that at scale, the p99 bot receives 150 MB/s of video data.
That's a lot of data to move around!
The next step was to figure out what specifically was causing the WebSocket transport to be so computationally expensive. We had to find the root cause, in order to make sure that our solution would sidestep WebSocket's pitfalls, and not introduce new issues of its own.
We read through the WebSocket RFC, and Chromium's WebSocket implementation, dug through our profile data, and discovered two primary causes of slowness: fragmentation, and masking.
Fragmentation
The WebSocket specification supports fragmenting messages. This is the process of splitting a large message across several WebSocket frames.
According to Section 5.4 of the WebSocket RFC:
The primary purpose of fragmentation is to allow sending a message that is of unknown size when the message is started without having to buffer that message. If messages couldn't be fragmented, then an endpoint would have to buffer the entire message so its length could be counted before the first byte is sent. With fragmentation, a server or intermediary may choose a reasonable size buffer and, when the buffer is full, write a fragment to the network.
A secondary use-case for fragmentation is for multiplexing, where it is not desirable for a large message on one logical channel to monopolize the output channel, so the multiplexing needs to be free to split the message into smaller fragments to better share the output channel. (Note that the multiplexing extension is not described in this document.)
Different WebSocket implementations have different standards for when to fragment.
Looking into the Chromium WebSocket source code, we found that messages larger than 131 KB are fragmented into multiple WebSocket frames.
A single 1080p raw video frame would be 1080 * 1920 * 1.5 bytes = 3110.4 KB in size, and therefore Chromium's WebSocket implementation would fragment it into 24 separate WebSocket frames.
That's a lot of copying and duplicate work!
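The fragment count is a ceiling division of the message size by the fragment limit. The sketch below assumes the 131 KB threshold corresponds to 128 KiB (131,072 bytes); the constant and function names are hypothetical and are not taken from Chromium's source.

```rust
/// Assumed fragment limit: 128 KiB. The post only says "131KB", so the
/// exact byte value is an assumption made for illustration.
pub const ASSUMED_MAX_FRAGMENT_BYTES: usize = 128 * 1024;

/// Number of WebSocket frames needed to carry a message of `message_bytes`
/// when no single frame may carry more than `max_fragment_bytes`.
pub fn fragment_count(message_bytes: usize, max_fragment_bytes: usize) -> usize {
    assert!(max_fragment_bytes > 0, "fragment limit must be non-zero");
    // Ceiling division without floating point.
    (message_bytes + max_fragment_bytes - 1) / max_fragment_bytes
}
```

With a 3,110,400-byte frame, `fragment_count(3_110_400, ASSUMED_MAX_FRAGMENT_BYTES)` evaluates to 24: twenty-three full fragments plus one partial fragment.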
Masking
The WebSocket specification also mandates that data from client to server is masked.
To avoid confusing network intermediaries (such as intercepting proxies) and for security reasons that are further discussed in Section 10.3, a client MUST mask all frames that it sends to the server
Masking the data involves obtaining a random 32-bit masking key, and XOR-ing the bytes of the original data with the masking key in 32-bit chunks.
This has security benefits, because it prevents a client from controlling the bytes that appear on the wire. If you're interested in the precise reason why this is important, Section 10.3 of the RFC goes into the details.
While this is great for security, the downside is that masking the data means making an additional pass over every byte sent over WebSocket. That is insignificant for most web usages, but a meaningful amount of work when you're dealing with 100+ MB/s.
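To make the cost concrete, here is a minimal sketch of the masking transform as the RFC describes it: byte i of the payload is XOR-ed with byte i mod 4 of the key. The function name is hypothetical, and this is a straightforward byte-at-a-time version rather than Chromium's implementation.

```rust
/// XOR every payload byte with the masking key, cycling through the four
/// key bytes. Applying the same key a second time restores the original
/// payload, so the same routine also unmasks.
pub fn apply_mask(payload: &mut [u8], key: [u8; 4]) {
    for (i, byte) in payload.iter_mut().enumerate() {
        *byte ^= key[i % 4];
    }
}
```

Whatever the implementation, every payload byte has to be read and written once more, which is where the extra per-frame work comes from.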
Quest for a cheaper transport!
We knew we needed to move away from WebSockets, so we began our quest to find a new mechanism to get data out of Chromium.
We realized pretty quickly that browser APIs are severely limited if we wanted something significantly more performant than WebSocket.
This meant we'd need to fork Chromium and implement something custom. But this also meant that the sky was the limit for how efficient we could get.
We considered 3 options: raw TCP/IP, Unix Domain Sockets, and Shared Memory.
TCP/IP
Chromium's WebSocket implementation, and the WebSocket spec in general, create some especially bad performance pitfalls.
How about we go one level deeper and add an extension to Chromium to allow us to send raw TCP/IP packets over the loopback device?
This would bypass the issues around WebSocket fragmentation and masking, and this would be pretty straightforward to implement. The loopback device would also introduce minimal latency.
There were a few drawbacks, however. Firstly, the maximum size for TCP/IP packets is much smaller than the size of our raw video frames, which means we still run into fragmentation.
In a typical TCP/IP network connected via Ethernet, the standard MTU (Maximum Transmission Unit) is 1500 bytes, resulting in a TCP MSS (Maximum Segment Size) of 1448 bytes. This is much smaller than our 3MB+ raw video frames.
Even the theoretical maximum size of a TCP/IP packet, 64 KB, is much smaller than the data we need to send, so there's no way for us to use TCP/IP without suffering from fragmentation.
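As a rough illustration of how many pieces a single frame would be split into, the sketch below divides the frame size by the two limits mentioned above. The names are hypothetical, and treating the full 65,535-byte packet limit as payload ignores header overhead, so the second figure is a lower bound.

```rust
/// Payload per segment quoted above for a 1500-byte Ethernet MTU.
pub const ETHERNET_MSS_BYTES: u64 = 1448;
/// Largest packet a 16-bit length field can describe (headers included).
pub const MAX_PACKET_BYTES: u64 = 65_535;

/// Minimum number of packets needed to carry `frame_bytes` when each packet
/// carries at most `max_payload_bytes` (ceiling division).
pub fn packets_needed(frame_bytes: u64, max_payload_bytes: u64) -> u64 {
    assert!(max_payload_bytes > 0, "payload limit must be non-zero");
    (frame_bytes + max_payload_bytes - 1) / max_payload_bytes
}
```

For one 3,110,400-byte 1080p frame this works out to 2,149 segments at a 1448-byte MSS, and at least 48 packets even at the theoretical maximum size.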
There was another issue as well. Because the Linux networking stack runs in kernel-space, any packets we send over TCP/IP need to be copied from user-space into kernel-space. This adds significant overhead as we're transporting a high volume of data.
Unix Domain Sockets
We also explored exiting the networking stack entirely, and using good old Unix domain sockets.
A classic choice for IPC, and it turns out Unix domain sockets can actually be pretty fast.
Most importantly, however, Unix domain sockets are a native part of the Linux operating system we run our bots in, and there are pre-existing functions and libraries to push data through Unix sockets.
There is one downside, however. To send data through a Unix domain socket, it needs to be copied from user-space to kernel-space, and back again. With the volume of data we're working with, this is a decent amount of overhead.
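To show where those two copies happen, here is a small sketch using the Unix domain socket support in Rust's standard library. The 4-byte length prefix and the function names are assumptions made for illustration; they are not a description of any transport we built.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;

/// Send one frame, prefixed with its length as a little-endian u32
/// (a hypothetical framing scheme chosen for this example).
pub fn send_frame(stream: &mut UnixStream, frame: &[u8]) -> io::Result<()> {
    if frame.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    stream.write_all(&(frame.len() as u32).to_le_bytes())?;
    // Copy 1: the kernel copies the frame from user-space into the socket buffer.
    stream.write_all(frame)
}

/// Receive one length-prefixed frame into a freshly allocated buffer.
pub fn recv_frame(stream: &mut UnixStream) -> io::Result<Vec<u8>> {
    let mut len_bytes = [0u8; 4];
    stream.read_exact(&mut len_bytes)?;
    let mut frame = vec![0u8; u32::from_le_bytes(len_bytes) as usize];
    // Copy 2: the kernel copies the frame from the socket buffer back to user-space.
    stream.read_exact(&mut frame)?;
    Ok(frame)
}
```

A connected pair for local experiments can be created with `UnixStream::pair()`. For frames larger than the socket buffer, the sender and receiver need to run concurrently, since `write_all` blocks until the reader drains the buffer.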
Shared Memory
We realized we could go one step further. Both TCP/IP and Unix Domain Sockets would at minimum require copying the data between user-space and kernel-space.
With a bit of DIY, we could get even more efficient using Shared Memory.
Shared memory is memory that can be accessed by multiple processes simultaneously. This means that our Chromium could write to a block of memory, which would then be read directly by our video encoder with no copying at all required in between.
However, there's no standard interface for transporting data over shared memory. It's not a standard like TCP/IP or Unix Domain Sockets. If we went the shared memory route, we'd need to build the transport ourselves from the ground up, and there's a lot that could go wrong.
Glancing at our AWS bill gave us the resolve we needed to push forward. Shared memory, for maximum efficiency, was the way to go.
Sharing is caring (about performance)
As we need to continuously read and write data serially into our shared memory, we settled on a ring buffer as our high level transport design.
There are quite a few ringbuffer implementations in the Rust community, but we had a few specific requirements for our implementation:
- Lock-free: We need consistent latency and no jitter, otherwise our real-time video processing would be disrupted.
- Multiple producer, single consumer: We have multiple Chromium threads writing audio and video data into the buffer, and a single thread in the media pipeline consuming this data.
- Dynamic Frame Sizes: Our ringbuffer needed to support audio packets, as well as video frames of different resolutions, meaning the size of each datum could vary drastically.
- Zero-Copy Reads: We want to avoid copies as much as possible, and therefore want our media pipeline to be able to read data out of the buffer without copying it.
- Sandbox Friendliness: Chromium threads are sandboxed, and we need them to be able to access the ringbuffer easily.
- Low Latency Signalling: We need our Chromium threads to be able to signal to the media pipeline when new data is available, or when buffer space is available.
We evaluated the off-the-shelf ringbuffer implementations, but didn't find one that fit our needs... so we decided to write our own!
The most non-standard part of our ringbuffer implementation is our support for zero-copy reads. Instead of the typical two pointers, we have three pointers in our ring buffer:
- write pointer: the next address to write to
- peek pointer: the address of the next frame to read
- read pointer: the address where data can be overwritten
To support zero-copy reads we feed frames from the peek pointer into our media pipeline, and only advance the read pointer when the frame has been fully processed.
This means that it's safe for the media pipeline to hold a reference to the data inside the ringbuffer, since that reference is guaranteed to be valid until the data is fully processed and the read pointer is advanced.
We use atomic operations to update the pointers in a thread-safe manner, and to signal that new data is available or buffer space is free we use a named semaphore.
After implementing this ringbuffer, and deploying this into production with a few other optimizations, we were able to reduce the CPU usage of our bots by up to 50%.
This exercise in CPU efficiency reduced our AWS bill by over a million dollars per year, a huge impact and a really great use of time!