How Expensify achieves extreme concurrency with NUMA balancing
If you've never heard of "NUMA," you aren't alone – it's a bit of an IYKYK topic that very few engineers encounter in their entire careers. But if you've ever worked on an extremely high core-count server, these four letters will send shivers down your spine.
NUMA is the dark magic that powers extremely high-performance applications on extremely large servers – and the people who work on those systems rarely have the time to write blog posts explaining how it works. This contributes to the overall air of mystery surrounding the topic, as there are very few good introductory materials on it, leaving the whole thing kind of inscrutable.
But I find myself with some holiday time on my hands, so here you go: a deep dive into how Expensify's core hardware works – down to the chip level – with an explanation of how that enables a critical feature of our software to work. I hope you'll find it useful as you architect your own NUMA-balanced applications on your own high-density servers!
What is NUMA?
Before answering that, let me first describe how our hardware works. Our SR950 servers each have:
8 CPUs, each of which has:
24 cores, and
768 GB of RAM
The CPUs are arranged in a complex pattern like this:
Each CPU is directly connected to its own "local node" of RAM, which threads on the CPU can access at extremely high speeds. But to access the RAM connected to the other CPUs, threads must "hop" on the "Ultra Path Interconnect" (UPI) bus to access "foreign" nodes.
To that end, each CPU is directly connected to three of the other CPUs, meaning it can access three foreign RAM nodes with only one hop on the bus. However, the remaining four CPUs are not directly connected via dedicated UPI buses, so memory requests must route through an intermediate CPU to reach them.
For example, any memory request from a thread on CPU 0 must hop on the UPI bus from CPU 0 to CPU 3, and then again from CPU 3 to CPU 7, to access memory stored in node 7. You can see the "distance" between each CPU and every node directly with this command:
$ numactl --hardware
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  21  31  21  21  31  31  31
  1:  21  10  21  31  31  21  31  31
  2:  31  21  10  21  31  31  21  31
  3:  21  31  21  10  31  31  31  21
  4:  21  31  31  31  10  21  21  31
  5:  31  21  31  31  21  10  31  21
  6:  31  31  21  31  21  31  10  21
  7:  31  31  31  21  31  21  21  10
The unit of this measure is abstract, with 10 meaning "the fastest possible". As such, you can see that each CPU is able to access its own local node at distance 10. A distance of 21 means roughly 2.1x the local latency, and 31 means roughly 3.1x – the consequence of hopping on the UPI bus twice.
As you can see, this means that the time (ie, "latency") to access RAM is not "uniform", and depends heavily upon the CPU on which the thread executes and the RAM node in which the memory resides. For this reason, this design is called a "non-uniform memory access" architecture, or NUMA for short.
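If you want to see how this plays out on your own hardware, a couple of stock commands will show you the layout; a minimal sketch (output obviously varies by machine):

# How many NUMA nodes the machine has, and which cores belong to each:
lscpu | grep -i numa

# The full picture: per-node core lists, per-node RAM sizes, and the distance matrix:
numactl --hardware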
Why does NUMA matter?
Expensify's main database is over 5TB of data, on a server with 6TB of RAM. Caching the whole thing in RAM requires splitting it across all NUMA nodes. This means that no matter which CPU a given thread runs on, odds are it's accessing RAM in a "remote" NUMA node. Accordingly, our system makes heavy use of the NUMA interconnects as part of regular operation.
Additionally, as exhaustively described in Scaling Onyx (a postmortem), Expensify's "Onyx" system tracks not only the current state of every object in the system and who it's currently shared with, but also a recent log of every change to those objects. Whenever anybody changes any object, we need to quickly look up everybody that change affects, and broadcast the change out to them in realtime.
That part is easy. But along with that new update, we need to identify the unique state of the database immediately prior to that update, on a per-user basis – because we want to make sure updates are applied in the correct order by the client (without any gaps in between) to ensure the client is always updated correctly. And figuring that part out is shockingly hard.
For example, if Alice posts "Hello world!" to a room with 100 other users in it, we will:
Create 100 threads (one per user in the room), each of which:
Creates 32 threads (a number that seems to work well), to:
Scan millions of objects to identify each user's unique client-side state, to ensure it can apply the update correctly.
Accordingly, what might seem like a mundane action (post to a room) can actually generate thousands of threads across our database cluster, which grind through millions of indexed rows in a matter of milliseconds. And all of that is spread across 192 CPU cores, split into 8 NUMA nodes, hopping madly over a series of interconnects.
Everyone knows balancing load across a database cluster is important. But most people don't realize that when you have extremely large servers, it's just as important to balance the load across the CPUs. And NUMA is the system that does that.
What happens if I do nothing?
Like most things, Linux works pretty great out of the box with no tuning. We got these servers a long time ago, and we never had to worry about this prior to Onyx. But as we dialed up the CPU load and memory bandwidth required to power our system, we started seeing random slowdowns in queries that should otherwise be instant.
Digging deep through all the layers, we realized that by default, the kernel "packs" a process's memory use into the fewest NUMA nodes possible – and then assigns new threads from that process to the same node. In general, this is really smart, as keeping all activity within a single node minimizes the chance that each thread needs to reach out to a remote node for memory -- thereby minimizing memory latency.
However, as the server "warms up" after a boot, it pulls in an increasing fraction of the 5TB database off of the NVMe drives and into RAM – and at some point, it "overflows" the 768GB of a single NUMA node, and starts putting new data into a remote node. But the threads continue operating in the first node, as that's where most of the data is. This results in a situation where every allocation of new memory starts to "miss" the local node – and it seems that each allocation miss has a serious impact upon performance.
This dynamic of "all the threads are in node 0, but all new memory must be allocated in node 1" causes every single new allocation to introduce a random delay – and when you are processing thousands of threads at a time, these random delays start to add up really fast.
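If you suspect you're hitting this same wall, two quick checks make it obvious; a minimal sketch (your numbers will differ):

# How much free RAM each node has left – one nearly-full node next to several
# mostly-empty ones is the tell:
numactl --hardware | grep free

# Whether the kernel's automatic NUMA balancing is currently on (1) or off (0):
cat /proc/sys/kernel/numa_balancing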
How did we manually balance across NUMA nodes?
The fix to this specific problem was relatively simple, involving three tools (a combined sketch follows the list):
vmtouch
We use this to "warm" the file system cache by looping over the entire file, loading each page one at a time. We then use its "lock" function to ensure the data isn't evicted by some other process (I'm looking at you rsyslog...), and its "daemon" function to ensure that it holds onto the lock indefinitely.
numactl
We use this to force vmtouch to "interleave" its memory allocations across all NUMA nodes.
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
This disables "automatic NUMA balancing", which attempts to move RAM "closer" to the threads that are executing. Given that most of the time our servers are not at peak traffic, that causes the server to gradually "pack" RAM into a smaller number of nodes, which sets us up for trouble when traffic spikes and the nodes overflow again.
Together, these three shift our RAM distribution from this (ie, tons of RAM free in nodes 0 and 2, but almost none in the rest):
$ numactl --hardware | grep free
node 0 free: 360944 MB
node 1 free: 801 MB
node 2 free: 126836 MB
node 3 free: 1906 MB
node 4 free: 801 MB
node 5 free: 801 MB
node 6 free: 801 MB
node 7 free: 801 MB
To this (with lots of RAM free in every node):
$ numactl --hardware | grep free
node 0 free: 106199 MB
node 1 free: 58140 MB
node 2 free: 106857 MB
node 3 free: 109993 MB
node 4 free: 101461 MB
node 5 free: 111030 MB
node 6 free: 110838 MB
node 7 free: 103001 MB
What is the effect of manual NUMA balancing?
The result is that virtually all of our "allocation misses" went away from each server, one by one – ultimately dropping the whole chart to zero – as we disabled the kernel's "automatic NUMA balancing" (a misleading name, as in our case it does the opposite) and added true manual NUMA balancing of RAM equally across all nodes:
This means that whenever a thread needs to allocate RAM, it has plenty available to it in its local node – reducing one class of "random slowdowns" that otherwise seemed to come in inexplicable waves.
What about reads and writes?
In our case, allocations were correlated with slowdowns, so this is where we focused our time. However, in general, there are three metrics to track:
Allocations - Whether a thread in a given node was able to allocate RAM in its local node. The key values come from
cat /sys/devices/system/node/node0/numastat
and include many metrics, but in particular:
numa_hit
Which increments each time an allocation intended for this node was satisfied from this node.
numa_miss
Which increments each time an allocation intended for another (full) node spilled over onto this node instead; the node that was passed over increments numa_foreign.
Note: collectd reports something called operations-miss, which as best as I can tell is numa_miss + numa_foreign.
Loads (ie, reads) - Whether memory requested by a thread was found in the local node, or needed to be fetched from a remote node. You can get this from:
sudo perf stat -e node-loads,node-load-misses --timeout 1000
and includes:
node-loads
The count of loads, across all nodes in the system, that were successfully served from the local node, accumulated over the 1-second sample.
node-load-misses
The count of loads, across all nodes in the system, that could not be served from the local node and had to be fetched from a remote node, accumulated over the 1-second sample.
Stores (ie, writes) - Whether memory written by a thread was able to be stored in the local node, or needed to be stored in a remote node. You can get this from:
sudo perf stat -e node-stores,node-store-misses --timeout 1000
and includes:
node-stores
The count of stores, across all nodes in the system, that were successfully written to the local node, accumulated over the 1-second sample.
node-store-misses
The count of stores, across all nodes in the system, that could not be written to the local node and had to be written to a remote node, accumulated over the 1-second sample.
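If you just want a quick read on all three at once, something like this works; a sketch only – the system-wide flag and the one-second window are arbitrary choices:

# Allocation counters, per node (cumulative since boot):
grep -E 'numa_hit|numa_miss|numa_foreign' /sys/devices/system/node/node*/numastat

# Load and store locality, sampled system-wide for one second:
sudo perf stat -a -e node-loads,node-load-misses,node-stores,node-store-misses --timeout 1000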
What happens if node-load-misses or node-store-misses spike?
Currently, for our application, not much. At the time of this writing, this doesn't seem to correlate with any problem in the application, so in the spirit of solving problems that exist when they exist, we're not worrying about it. However, in the event that either does spike, the solution will likely be to go beyond mere "NUMA balancing" and make Bedrock fully "NUMA aware."
This would include getting more intentional about starting threads on specific CPUs, to ensure they are close to specific types of memory (i.e., start all threads dealing with expenses on one CPU, and all threads dealing with chats on another, such that each tends to cache the data it needs in its local node).
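At its simplest, that kind of pinning can be done from the outside with numactl. A purely illustrative sketch – the worker binary and its flag are hypothetical, not how Bedrock is actually launched:

# Hypothetical: run one worker on node 2's cores and force its allocations onto
# node 2's local RAM, so its hot data stays in its local node.
sudo numactl --cpunodebind=2 --membind=2 ./bedrock-worker --queue=expenses

# A second hypothetical worker, handling a different data set, pinned to node 5:
sudo numactl --cpunodebind=5 --membind=5 ./bedrock-worker --queue=chats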
However, for now, we aren't worried about it, and we'll deal with that if it comes. We've found that Intel, Lenovo, and Ubuntu are pretty great at what they do and have ensured that, "out of the box," it performs shockingly well, without a lot of tweaking.
What's next?
For now, we seem to be pretty good. However, we bought our hardware years ago, and even though it's working great, it's coming up on the end of its usable life (ie, Lenovo won't warranty it any longer). We are evaluating many new hardware platforms for the next generation of our datacenter, codenamed:
#datacenterv5
But I'm personally pulling for just upgrading our servers to "Version 3" of our existing SR950 platform. In addition to supporting more RAM, more NVMe drives, more GPUs, more cores, and basically more of everything – of particular interest to this post – it has more UPI channels:
SR950 V1 works with 1st Gen Xeon CPUs, which have 3 UPI channels per CPU, meaning each CPU is only able to access 3 out of 7 remote nodes in 1 hop.
SR950 V3 works with 4th Gen Xeon CPUs, which have 4 UPI channels per CPU, meaning each CPU is able to access 4 out of 7 remote nodes in 1 hop. That might not sound like a lot, but that's a 25% reduction in 2-hop accesses, which is a big deal!
To visualize how this works, here is the layout of CPUs and RAM on an SR950 V3 system:
So, not only does this mean that foreign NUMA nodes are on average "closer," but UPI 2.0 is also 53% faster – which all adds up to making a system that's already crazy fast even faster. Anyway, the jury is still out; we'll see where we land.
What's after that?
The other option we're looking at – but even I recognize might be overkill – is the HPE Superdome Flex. It also uses the 4th Generation Xeon CPU, which has four UPI channels. But it takes a radical approach: grouping the CPUs into four-CPU "compute modules".
Each CPU in a module uses three of its UPI channels to connect directly to the other CPUs in its module, and the fourth UPI channel goes to a central "Superdome Flex Fabric Module" – which acts as a central router for memory requests between all the modules:
This means that any CPU can access the RAM of any other CPU in the same compute module with one hop, and the RAM of every other compute module in only two hops. It's a pretty amazing design that can achieve massive CPU density while still presenting to the OS as a single unified memory and CPU space. I've charted it here with "only" 16 CPUs, but the Superdome Flex 32X supports up to 32 4th Gen Xeon Processors.
If you pack that full of Intel Xeon Platinum 8490H processors, each with 60 cores, that adds up to 1,920 cores. And given that each compute module can support 6TB of RAM, with eight modules, that's a whopping 48TB of RAM.
A fully provisioned server would probably cost somewhere in the ballpark of a million dollars – which isn't cheap, but isn't that crazy, given the insane computational power it affords. When financed and amortized over an eight-year life (what we're experiencing with our current hardware), it could be yours for the low, low price of $12K/mo.
Anyway, as tempted as I am, that doesn't seem necessary... yet. But maybe with #datacenterv6? We'll see.
Conclusion
Long story short: at some point, if you keep scaling "up" a system long enough – with more CPUs, more RAM, and more threads – you need to start paying attention to managing NUMA bottlenecks. However, as exotic as they might seem, they're really no different than any other bottleneck.
The alternative to scaling "up" is scaling "out," but then you're basically just solving the same problem at a different layer: replacing CPUs with servers, and UPI interconnects with LANs. No matter how you slice it, data, computation, and communication need to be balanced at every layer, whether on a single chip, inside a single server, inside a single datacenter, or between datacenters. It's just the same basic problem repeated at different scales.
If you got this far... are you looking for a job?
This is a pretty low level post on an extremely specific topic. If you are actually reading this whole thing, that makes us really, really interested to talk to you. This is just one of the many hard problems we are solving here at Expensify, and I would love an opportunity to chat with you more to see if you are a fit for our C++ distributed systems engineer opening. Looking forward to chatting with you soon!