Scaling Onyx (a postmortem)
Hi there. Though it now seems to be firmly behind us, I wanted to take a moment to apologize for the performance and reliability issues we had a couple of weeks back. We understand how critical our service is to your daily operations, and we're deeply sorry for any inconvenience this may have caused. In the spirit of transparency, here are comprehensive details on what happened, how it was fixed, and how we will ensure better reliability going forward.
tl;dr
In short, this problem traces back to a major, difficult upgrade we're making to a core feature that powers New Expensify. The problem has been conclusively identified and mitigated, and a long-term fix is underway. We don't anticipate any further issues with this component.
Customer impact
To start, I'd like to acknowledge that there were some brief (and a couple of not-so-brief) periods of degraded performance, or even total downtime. As you can see on our public status page, we have generally been in the range of "four nines" of uptime (99.99%, or around 52.56 minutes of downtime a year). However, in the past few weeks we dipped into "three nines" territory (99.9%, or around 8.76 hours of downtime a year), which is not good. We of course aim for the gold standard of "five nines" of uptime (99.999%, or 5.26 minutes of downtime a year), but we still have our work cut out for us.
Key systems
In order to explain in detail how the problem happened, I need to first explain how the system is intended to work in general. There are many layers of technology, and each of them is rich and nuanced. However, let me quickly summarize the key systems that were involved in this outage, to provide enough context for you to understand the problem itself and the resulting fix:
Onyx. A major feature of New Expensify is our open-source state management system, which is a fancy way of saying "the technology that keeps the mobile app up to date". Onyx is the heart of New Expensify's "offline first" design, which is what makes the app feel so fast and reliable even on a flaky internet connection: every action you take shows as if it happened immediately, even if it might take a while to reliably synchronize up to the server. WhatsApp also has an offline first design, which is what makes it feel so snappy and rock solid, even on a crappy connection.
Onyx updates. Onyx downloads a small subset of the user's data onto their device or browser (whatever is minimally required based on the user's activity), and then waits for "updates" to that subset to be "pushed" down to it, without asking. This "push" design is what makes the app update automatically, without needing to refresh your browser.
Bedrock. The full data of every account is stored on our servers, which each have hundreds of CPUs and multiple terabytes of RAM. These are much, much larger servers than are typically used, so we have worked with SQLite to build a database named Bedrock optimized for extreme concurrency.
Stored procedures. One major feature of Bedrock is that it allows C++ code to execute within the database itself. Because the application code has direct access to the database's RAM – without copying it a bunch of times or shipping it over the network – this enables extremely high performance. And it all operates within the atomic protections of a single transaction (meaning, if a command has 10 steps, they either all complete successfully, or all of them are unwound).
Blocking queue. Bedrock executes thousands of simultaneous commands for different users in parallel, each of which modifies different "pages" of data in the same database. When two commands attempt to modify the same page at the same time, they "conflict". In most cases, conflicting commands are just retried, and they work fine on the second or third attempt. But if after a certain number of attempts they are still conflicting, they are put into the "blocking queue". This queue is designed to ensure that rare commands for large customers (like doing a bulk operation on thousands of reports) are able to successfully run, by "blocking" every other command to allow this one big command to complete. In practice, this is perfectly fine: by adding a few milliseconds to every small command every once in a while, we can ensure the big commands get done reliably. (There's a rough sketch of this transaction-then-retry-then-block flow just after this list.)
Blockchain journal. Because Bedrock runs with such extreme concurrency on such giant servers, "replicating" the changes between different copies of the database is a major challenge. Bedrock uses a Paxos algorithm to elect a "leader" for a blockchain broadcast tree, which means at any point in time, one of the nodes is "leading" and everyone else is "following". Followers execute read-only commands locally, but "escalate" read/write commands to the leader for processing – and then the changes made by the leader are replicated out to followers via the private blockchain. This blockchain is stored in a "journal", which is basically every change ever made to the database stored in a giant list, where you can recreate the current state of the database by just applying the same changes in the same order. (That replay idea is also sketched after this list.)
Onyx update journal. Finally, there is a second kind of journal that keeps track of every change that has ever happened to each individual data object (ie, an expense report). When a user connects to our servers, we quickly figure out which objects that user's client (ie, mobile device or browser) already has stored locally, and then which updates we need to push out to it to keep that object up to date.
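To make the stored procedure and blocking queue descriptions above a bit more concrete, here is a minimal, hypothetical C++ sketch – every name in it is invented for illustration, and it is not Bedrock's actual plugin API. The idea is simply: run all of a command's steps inside one transaction, retry on a page conflict, and escalate to the blocking queue if the conflicts won't stop.

    #include <sqlite3.h>
    #include <queue>
    #include <string>

    // Illustrative sketch only -- these names are invented, not Bedrock's real API.
    struct Command { std::string name; };        // stand-in for a real command object
    static std::queue<Command*> blockingQueue;   // commands that must run alone
    static const int MAX_ATTEMPTS = 3;

    // Run the command's steps inside one transaction; return false on a page conflict.
    bool runCommandOnce(sqlite3* db, Command& cmd) {
        // "BEGIN CONCURRENT" is SQLite's concurrent-write mode, where page-level
        // conflicts are detected when the transaction tries to commit.
        sqlite3_exec(db, "BEGIN CONCURRENT", nullptr, nullptr, nullptr);
        // ... application C++ runs here, reading and writing RAM-resident data directly;
        // if the command has 10 steps, all 10 happen inside this one transaction ...
        if (sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr) == SQLITE_OK) {
            return true;                         // every step became visible atomically
        }
        sqlite3_exec(db, "ROLLBACK", nullptr, nullptr, nullptr); // unwind every step
        return false;                            // we conflicted with another command
    }

    void processCommand(sqlite3* db, Command& cmd) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; ++attempt) {
            if (runCommandOnce(db, cmd)) return; // most commands succeed on an early attempt
        }
        // Still conflicting after several tries: escalate to the blocking queue, which
        // briefly pauses everything else so this one command is guaranteed to finish.
        blockingQueue.push(&cmd);
    }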
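And the journal-replay idea behind replication, in miniature – again with an assumed, simplified schema (journal(id INTEGER, query TEXT)) that exists purely for illustration: applying every committed change to a fresh database, in the original order, reproduces the current state.

    #include <sqlite3.h>

    // Assumed schema for illustration: journal(id INTEGER PRIMARY KEY, query TEXT).
    // Replaying every change in id order against a fresh database rebuilds current state.
    void replayJournal(sqlite3* freshDb, sqlite3* journalDb) {
        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(journalDb, "SELECT query FROM journal ORDER BY id", -1, &stmt, nullptr);
        while (sqlite3_step(stmt) == SQLITE_ROW) {
            const char* change = reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
            sqlite3_exec(freshDb, change, nullptr, nullptr, nullptr); // same changes, same order
        }
        sqlite3_finalize(stmt);
    }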
I'm skipping over a lot of detail and additional layers of technology. But these are the major pieces that contributed to this outage, and a quick understanding of each is necessary to understand the outage itself.
General symptoms
Every piece of a distributed system is designed with the assumption that it will fail. Every system will crash, a CPU will burn out, a hard drive will glitch, a RAM stick will get hit by a solar flare, etc. Everything will eventually break; failure is not an exception, it's the norm. We plan for it by having many redundant systems designed to quickly step in and pick up the load when one of their peers fails. Those redundant systems fail all the time too, which is why we have teams standing by 24/7 to detect problems before they happen and step in to guide things back to stability. Just like it requires constant micro-adjustments in your body to remain standing "motionless" at attention, we are constantly observing a thousand variables and tweaking countless tiny knobs to keep everything humming along smoothly. We know by heart what "normal" activity looks like for each of our servers. And it was increasingly clear that something was not normal.
In short, everything would be fine for a while, indicating that the system had the capacity to work, even at peak traffic. But sometimes it would crash catastrophically, even when traffic was low. We found and fixed a bunch of smaller problems to clear away the noise and add some incremental stability. But try as we might, the core problem kept coming back erratically, without really correlating to any particular user, command, time of day, or other common trigger.
Our initial inclination was that it was due to something we recently released, so we spent a long time investigating recent changes to understand what might be causing the instability. However, there were no major changes to any of the above systems immediately preceding the onset of the problem. Despite that, out of an abundance of caution we spent a lot of time reverting code and undoing changes to see if any of them might have contributed to the problem, to no avail.
Cascading problems
The Expensify system has many layers of interconnected systems designed to operate seamlessly together to create an uninterrupted product experience. And over 99.9% of the time, it does just that. But like a heart attack, there are certain situations that can cause a cascading series of problems that can take down the site, as follows:
The "blocking queue" is the core bottleneck of the entire system. We have plans underway to eliminate the need for it, but at the moment, it's our Achilles heel. That said, in normal operation, occasionally blocking all commands to let a few big ones through works great.
However, sometimes a particular command will have a "slow query" (ie, one that takes over 1000ms to run). The time by itself isn't a problem – we have hundreds of CPUs available, so a command taking over a second mostly just affects whichever user is waiting on the result, and we always have plenty of CPU available to do more in parallel. But if a query takes a long time to run, it's invariably because it is touching many "pages" of the database to do whatever it's doing. And the more pages it touches, the greater the chance that it will conflict with another command, forcing one of them to retry.
Most commands work without retrying, and those that conflict generally work after one retry. But every retry puts more load on the system: it's like one more user showed up doing the exact same thing at the exact same time. And if it's the slow query that is being retried (which is the common case), it's like one more user showed up to do some particularly slow operation, at the exact same time.
So we've seen in the past that if we update a sufficiently common command with a sufficiently slow query, it can create a catastrophic problem: the more this command happens, the more it conflicts with other commands – and then the retries of those commands cause more conflicts, compounding endlessly. These retries pile up while new commands are also coming in, causing a massive spike of artificial activity, many times larger than the actual activity of users at that time. This makes the blocking queue get longer and longer, resulting in the whole server pausing to let those commands through… which causes even more conflicts, which sends even more commands to the blocking queue, and so on. And because the blocking queue pauses everything, even the fast commands start executing slowly, and the whole server grinds to a halt.
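As a deliberately oversimplified toy model of that feedback loop (the numbers below are made up, and real conflict behavior depends on which pages each command touches): if each attempt at a command conflicts with probability p and every conflict is retried, the expected number of attempts per command is 1/(1 − p) – and the extra attempts are exactly what pushes p higher.

    #include <cstdio>
    #include <initializer_list>

    // Toy model only: expected attempts per command when each attempt independently
    // conflicts with probability p and every conflict is retried until it succeeds.
    double expectedAttempts(double p) { return 1.0 / (1.0 - p); }

    int main() {
        for (double p : {0.05, 0.25, 0.50, 0.90}) {
            std::printf("conflict probability %.2f -> %.1fx the work\n", p, expectedAttempts(p));
        }
        // 0.05 -> 1.1x, 0.25 -> 1.3x, 0.50 -> 2.0x, 0.90 -> 10.0x.
        // The trap: the extra attempts add load, which raises p, which adds more attempts.
        return 0;
    }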
Needless to say, we have systems and processes to identify these slow commands and fix them. However, our normal systems weren't flagging anything in particular. Everything seemed to be working as intended, and no specific commands were operating especially slowly. Additionally, everything would work fine for extended periods of time. But every once in a while, something would trigger a cascading blocking queue backlog, and the site would suffer.
So this outage bore some of the hallmarks of problems we are well aware of, and are generally able to avoid – but with some unique twists we hadn't experienced before.
The root cause
In the end we were able to trace the problem back to one of the queries in our Onyx update system – specifically the query that searches the Onyx update journal to determine which of its many millions of updates was the "most recent" for a given user. It's a pretty intensive query, but by itself isn't especially suspicious. Additionally, it's not a new query: it's been in place for many months, without change or problem.
Additionally, this query primarily executes on the followers: when you post a comment in a New Expensify chat room with 1000 people, for example, the system that notifies those 1000 people will execute this query 1000 times. Accordingly, it was clearly driving up the CPU time of our follower nodes – but read-only queries on followers execute outside of any read/write transaction, and thus shouldn't affect conflict rates.
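For illustration only – the real schema and query are more involved than this, and the names below are invented – the shape of the query was roughly "of the millions of rows in the Onyx update journal, which is the most recent one for this user?":

    // Invented table/column names -- a sketch of the query's shape, not the real query.
    // With no supporting index and no lower bound on updateID, answering this means
    // considering a huge swath of the journal on every single call.
    const char* kMostRecentUpdateForUser =
        "SELECT MAX(updateID) "
        "  FROM onyxUpdateJournal "
        " WHERE accountID = ?1";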
However, what has changed is that we have more users than ever using New Expensify. This means the load upon the Onyx update system has steadily increased – meaning the cost of this particular query has slowly crept up a little more every day, as the number of objects being searched (and the number of users doing the search) steadily grew.
Additionally, what we were slow to realize is that even though this query primarily operates safely on read-only followers, it also runs inside read/write commands on the leader. And while executing it for any single user isn't especially intensive, it is heavy enough to add a small amount of load to every single read/write transaction. Unfortunately, the load it added to any single command wasn't enough to flag that command as a cascading-blocking-queue risk by itself. But the load it added across all commands pushed us toward that slippery slope, such that whenever some other slow command did come along, the whole system was that much closer to the edge and that much easier to push over it.
The end result was that even though none of our code was changing, the slow migration of users and data objects to New Expensify gradually pushed us closer and closer to the event horizon of our main known bottleneck, resulting in a little more instability every day as we frantically explored and discarded one theory after the next.
The short term fix
Like most problems, this one was pretty easy to fix once identified. Recall that the query in question identifies which Onyx updates should be returned to the calling device/browser, which in general needs to happen on every single API request from New Expensify. However, while our New Expensify userbase is growing, it's still a tiny fraction of our Expensify Classic userbase – and we were returning this same data with every Expensify Classic API call, even though Classic ignores it. So the "quick fix" was to simply limit this query to the API calls that actually intend to use the data. Once rolled out, CPU usage dropped across all servers, and the problem went away instantly. You can see this visually by comparing the spiky CPU charts leading up to 6/6, when the fix was deployed, with the much calmer charts after.
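The shape of that fix, as a hypothetical sketch (every name here is invented; it just shows the gating logic): only run the expensive lookup for callers that will actually consume the result.

    #include <sqlite3.h>
    #include <string>

    // Hypothetical names throughout -- this sketches the shape of the fix, not our code.
    struct Request  { bool isNewExpensifyClient; long long accountID; };
    struct Response { std::string onyxUpdatesJSON; };

    std::string lookupPendingOnyxUpdates(sqlite3* db, long long accountID); // the expensive query

    void maybeAttachOnyxUpdates(sqlite3* db, const Request& request, Response& response) {
        if (!request.isNewExpensifyClient) {
            return; // Expensify Classic ignores this data anyway -- skip the query entirely
        }
        // Only New Expensify consumes Onyx updates, so only it pays for the lookup.
        response.onyxUpdatesJSON = lookupPendingOnyxUpdates(db, request.accountID);
    }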
That said, while this is obviously the biggest contributing factor to the problem, downtime in any distributed system is never fully attributable to a single cause: there are many layers of redundancy, so for them all to fail means many things went wrong simultaneously. As such, in the process of resolving the core Onyx scalability problem, we found and fixed many small problems that caused our redundancy to fail, which will result in greater reliability going forward. To give a couple of examples:
Our core servers have a process that memory-maps multiple terabytes of data into RAM. Though mapping is instant, unmapping takes many, many minutes. Previously when restarting that server, we would wait for the previous process to fully unmap before starting up the new one. Given that we are restarting the same process across many servers in a sequence, the total time it takes to restart that one process across all servers can extend into multiple hours. This isn't a "problem" per se, but when there are serious server problems, it can extend the time it takes to fix them. As part of resolving this issue, we also changed our restart process to begin the new process before the previous process has fully shut down (easier said than done). This has reduced a multi-hour upgrade process into a few minutes. This wasn't the cause of the problem, and fixing it doesn't solve it. But it extended the impact of the root problem, and now that it's fixed, we will have faster upgrades going forward.
Our servers often use regular expressions to parse strings of various kinds. While doing deep dives on the performance of various commands, we realized that we had been using the built-in std::regex library – which is not only notoriously slow, but even has infinite-recursion bugs that caused some servers to crash for reasons entirely unrelated to this core problem. For these reasons, we fully refactored our codebase to use PCRE, which is much higher performance and more reliable across the board (there's a small sketch of this just after these examples).
Our core database is many terabytes in size, and the simple act of backing it up nightly can take many hours. Add hours more to download the backup, hours more to decrypt and decompress, and then hours more to synchronize the newly restored server with the rest of the cluster. This means when a server goes down, it stays down for quite a while. This isn't normally a problem, as we have many servers precisely to allow us the redundancy to operate fine during this long recovery period. But in a severe outage situation like this, this extended recovery time makes everything much more difficult to resolve. For this reason we're rebuilding our database backup system to "pre-stage" the latest backup on each server before it crashes, to replace the multi-hour download/decrypt/decompress time with an instant file move of a fully formed database file ready to go.
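To illustrate the regex change mentioned above, here's a minimal sketch of the PCRE2 usage pattern – the pattern and subject string are just examples, not taken from our codebase: compile the expression once, then match against it as many times as needed.

    #define PCRE2_CODE_UNIT_WIDTH 8
    #include <pcre2.h>   // link with -lpcre2-8
    #include <cstdio>
    #include <cstring>

    // Example pattern and subject only -- the real patterns vary across the codebase.
    int main() {
        int errCode = 0;
        PCRE2_SIZE errOffset = 0;
        pcre2_code* re = pcre2_compile((PCRE2_SPTR)"^expense-(\\d+)$", PCRE2_ZERO_TERMINATED,
                                       0, &errCode, &errOffset, nullptr);
        if (!re) return 1;

        const char* subject = "expense-12345";
        pcre2_match_data* md = pcre2_match_data_create_from_pattern(re, nullptr);
        int rc = pcre2_match(re, (PCRE2_SPTR)subject, std::strlen(subject), 0, 0, md, nullptr);
        if (rc > 0) {
            PCRE2_SIZE* ov = pcre2_get_ovector_pointer(md); // [start, end) pairs per group
            std::printf("matched id: %.*s\n", (int)(ov[3] - ov[2]), subject + ov[2]);
        }
        pcre2_match_data_free(md);
        pcre2_code_free(re);
        return 0;
    }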
And many more. Nobody likes server downtime. But the silver lining is that it gives us a unique opportunity to stress test the system in the most extreme edge cases. We take maximum advantage of these painful, expensive lessons to ensure any repeat of these past problems will be solved better in the future.
The long term fix
The short term fixes described above were effective, and we feel confident this problem is fully solved for the foreseeable future. However, the main fixes take advantage of the fact that only a small number of our users have migrated to New Expensify – if nothing else were done, this problem would gradually return as Expensify Classic users continue to quickly adopt New Expensify. For this reason, we are working on a few longer-term fixes that will ensure this continues working at ever greater scales:
Add an index. Databases are amazing in that the performance of a slow query can often be improved by 10x or more just by adding a well-crafted database index. This was our first move, and we're already in the process of rolling this out (this item and the next are sketched just after this list).
Add a lower bound. Currently we track the most recent several days of Onyx updates, and each time we run this query we search the entire list of many tens of millions of updates. However, by having the client submit its "current updateID" with each API call, we can add a "lower bound" to the query, such that we can ignore any update prior to anything the client is reporting that it already has. This means rather than searching the past few days, we only end up searching the past few seconds – which is dramatically faster.
Make the Onyx journal NUMA aware. One detail not really discussed above is that the Onyx update journal isn't a single table – it's actually hundreds of tables, one for each CPU core on the server. Each thread inserting into the Onyx update journal randomly picks one of these tables, which dramatically reduces conflicts relative to having all threads fight over the "end" of the same journal. However, this spreads the memory for the Onyx update tables randomly across all of system RAM – even though each CPU only has direct access to 1/8th of the system RAM. As a result, each thread is ⅞ (roughly 88%) likely to need to access "remote RAM" to do the query, which is substantially slower than accessing the RAM directly connected to that thread's CPU. By "pinning" each thread to a given CPU – and then pinning each Onyx update table to that same CPU – we can ensure that each thread only accesses its local RAM bank, avoiding the cross-socket memory interconnect and (in theory) accelerating this query even further (a rough sketch of this follows the list as well). (Read here for more info on how NUMA and all those random BIOS settings we ignore affect database performance.)
Rethink it entirely. This is a relatively new system, and it's probable that we can just solve this same problem a different way, or with a different query. However, it seems like the above changes will likely do the trick.
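To make the first two of those concrete, here is a hedged sketch – the table, column, and index names are invented, and the final definitions will depend on the real schema:

    // Invented names throughout; the real schema and final index definition may differ.
    // 1) The index: let SQLite jump straight to one account's rows, already in updateID
    //    order, instead of scanning the whole journal.
    const char* kCreateOnyxJournalIndex =
        "CREATE INDEX IF NOT EXISTS onyxUpdateJournal_account_updateID "
        "    ON onyxUpdateJournal (accountID, updateID)";

    // 2) The lower bound: the client sends the newest updateID it already has, so the
    //    query only walks the last few seconds of the journal instead of several days.
    const char* kPendingUpdatesSince =
        "SELECT updateID, payload "
        "  FROM onyxUpdateJournal "
        " WHERE accountID = ?1 "
        "   AND updateID > ?2 "   // ?2 = the client's current updateID from the API call
        " ORDER BY updateID";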
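And the NUMA idea, as a rough Linux-specific sketch using libnuma (simplified – the real thread and table assignment is more careful than this): pin each worker thread to a CPU, then allocate that thread's journal table memory on the RAM bank attached to the same NUMA node, so lookups stay local.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <numa.h>      // numa_node_of_cpu, numa_alloc_onnode (link with -lnuma)
    #include <pthread.h>
    #include <sched.h>
    #include <cstddef>

    // Rough sketch: pin the calling worker thread to one CPU, then allocate that
    // thread's slice of the Onyx update journal on that CPU's local NUMA node.
    void* pinThreadAndAllocateLocal(int cpu, std::size_t bytes) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set); // thread -> CPU

        int node = numa_node_of_cpu(cpu);        // which RAM bank is local to this CPU?
        return numa_alloc_onnode(bytes, node);   // put the table's memory on that bank
    }
    // With the thread and its table on the same node, the query never has to cross the
    // (much slower) socket-to-socket interconnect to reach "remote RAM".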
In practice, no performance problem is ever "solved" – at some point, as users grow, every solved problem will eventually reappear. But this is what we call a "champagne problem", because by that time we'll have more users, more resources, faster servers, and generally more time to come up with another solution to push back the problem again and again.
Conclusion
In general, I feel very good about the performance and reliability of our systems. However, we've been making enormous changes to every layer in support of our radical New Expensify design – and that came with some unfortunate bumps along the road. Thankfully, all our major systems are now built and deployed, so I don't anticipate any major new performance or reliability surprises going forward. It was a painful episode that was disruptive to customers and very stressful for us. But I believe it's behind us, and we're better equipped than ever to achieve more "nines" going forward than we have in the past. Thank you for your patience and understanding, and thank you for checking out New Expensify!