Incident response at global scale

Incident response at global scale

Developers, Coders, SysAdmins, Site Reliability Engineers — no matter your title, we all have something in common: there are infinite possibilities of what can go wrong in the code, so learning from others mistakes is one of the best ways to increase our collective knowledge. 

If you were trying to get shit done on your expense reports around 3:30am UTC on the morning of Friday 16th June, then you would have noticed we had a site outage. We want to share the details of our Incident Response, not just to enhance our own learnings and behaviors for next time, but to spread that knowledge with our community as well.

Some background context: rekeying our database encryption

Being that we handle a lot of financial data, we encrypt a lot of that data in our database (BedrockDB) to keep it safe from prying eyes — not only from intruders but from potential inside threats too. Much like the quintessential “nuclear self-destruct” device in your favorite 70’s sci-fi drama TV show needs two people to insert and turn their keys at the same time, no single employee can decrypt our database. 

To protect the encryption key for our database, we split it using Shamir’s Secret Sharing algorithm, with parts distributed among a group of employees called (creatively) “keyholders”. There are about two dozen keyholders at any given time, and any two of them are required to reassemble the database encryption key. When we are (re)starting the BedrockDB, two keyholders must coordinate to each provide their part of the split key before the database can successfully start and read the encrypted data.

As a PCI Level 1 compliant company, Expensify is required to rotate these encryption keys — it’s generally a good idea to do that anyway regardless of your PCI compliance status. We rotate our encryption keys annually at a minimum, sometimes more often when certain other triggers instigate it. This ensures that even in the unlikely event that the encryption key somehow got leaked or compromised without anyone noticing, the compromised key becomes useless to an attacker as soon as we rotate it.

The rekey process itself is technically complicated, but conceptually simple:

  1. Generate a new encryption key, splitting it and distributing the parts to keyholders.

  2. Restarting the database, and assigning two employees to send both the old and new keys.

  3. New data will be written immediately with the new key.

  4. Initiating the rekey process that will decrypt existing data using the old key, and re-encrypt it using the new key.

  5. Once all existing data has been re-encrypted, throw away the old key.

Simple, right? But how does the database know which key to use for which data? It could just try both and see which one works, but that’s gratuitously inefficient. Following the age-old wisdom of “keep it simple, stupid” (KISS), we simply prepend the encrypted data with a version identifier that corresponds to which key the data is encrypted with. 

In the example below, the encrypted data starts with “SGV5”, and is encrypted with the key version 9 (“V9”):

V9:SGV5ISBZb3Ugc2hvdWxkIGFwcGx5IHRvIHdvcmsgZm9yIHVzIQo=

How a routine update caused an outage

As part of our annual encryption rekey to move from key version 9 to version 10, two of our Australian employees started the rekey process during their normal workday. Once the database cluster had reached a quorum of nodes running with the new and old encryption keys, all users were immediately logged out and unable to log in.

One of the things we encrypt (but of course do not store) in the database application is our “authTokens” which are little bits of data that identify you to our servers after you log in so you don’t have to keep logging in. The logs indicated that the database didn’t have a key suitable for decrypting the existing authTokens. If we can’t verify your authToken, then we can’t authenticate you, so we fail-safe and instead we’ll log you out.

But how was it possible that there was no key to decrypt the authTokens with? The logs confirm that both keys were loaded:

2023-06-16T03:41:30.994798+00:00 db1 Got versioned key for version=10. Using it.
2023-06-16T03:41:30.994892+00:00 db1 Got versioned key for version=9. Using it.

There’s a hint in the above logs that point to the problem; I will award you 42 imaginary internet points if you can spot the issue at this stage.

Here’s the code in question. It has been simplified by removing some lines that are irrelevant to the final problem, but it is sufficient to expose the fault in the logic:

int oldKeyVersion = 0;int newKeyVersion = 0;
for (const auto& nameValuePair : result.nameValueMap) {
string name = nameValuePair.first;
string value = nameValuePair.second;
int keyVersion = SToInt(SAfter(name, "-keyPart-v"));

SINFO("Got versioned key for version=" << keyVersion << ". Using it.");
if (oldKeyVersion == 0) {
oldKeyVersion = keyVersion;
}
if (keyVersion > oldKeyVersion) {
// We found a second key.
newKeyVersion = keyVersion;
}
}

If you want to find the issue yourself, keep looking at the code and stop reading now until you find it… Otherwise, keep reading to see how we identified and resolved the issue.

Working together to find a solution

The outage was immediately reported to our Infrastructure team within 10 minutes; we declared a fire, and all our resources were focused on identifying the issue. Our Incident Response protocols are multi-tiered and thorough, and one of the first steps is triage.

Making a swift assessment of an incident ensures we get the appropriate attention to any given problem. A single server going down won’t be noticed by customers and can be handled by a single SRE or Engineer, but a “site down” incident will rapidly gain everyone’s attention through our triage and escalation process.

Responding Engineers joined a group video call to focus on the code responsible for loading the encryption keys, talking it through, and theorizing on potential causes.

After analyzing the issue, our team saw that the keys were being loaded out-of-order. In other words, the “new” v10 key was being loaded before the “old” v9 key. The code assumed that the lowest key would always be processed first. Given the logic above:

  1. On the first loop, both oldKeyVersion and newKeyVersion are initialized to 0 as default values, and keyVersion == 10, so we enter the first if-block, and set oldKeyVersion to “10” – Wrong!

  2. On the second loop, keyVersion == 9, we skip the first if-block AND skip the second if-block because 9 is not greater than 10. The second key is silently discarded!

With the second key silently discarded, and the new key loaded as the old key — nothing in the database could be decrypted. 

Figuring out why the keys were being loaded out of order

The key files each of the employees that initiated the rekey had the v9 key first, followed by the v10 key. Our key files are formatted like this:

KEY
keyPart-v9:4e6f207265616c6c792c20686176652061206c6f
keyPart-v10:6f6b206174206f7572206a6f627320706167652e

The final piece of this puzzle is that when the key parts are sent to the database by our keyholders, it stores them in a std::map structure, which according to cppreference.com is “a sorted associative container that contains key-value pairs with unique keys. Keys are sorted by using the comparison function Compare.” 

The keys of the map structure in this case are being treated as strings, not numbers and so lexicographically when sorted as a string, “v10” comes before “v9” because “v1” is “less than” “v9”. We can demonstrate this using a little bash example:

$ echo "v10 v1 v3 v9 v2" | tr " " "\n" | sort
v1
v10
v2
v3
v9

For the last six years, while we had single digit version numbers, the sorting worked fine. But since we just entered double-digit version numbers for our encryption keys, the sorting algorithm exposed the fault in the key loading logic.

Once we understood the root cause, we restarted the database application again, this time with only the old v9 key, which brought the service back online. We didn’t do this earlier because until we had a proper understanding of the issue, we weren’t confident that doing so would not create a bigger problem, such as data loss. With the site back online and users able to get shit done again, we prepared an emergency deployment to improve the logic in the encryption key loading, which now looks like this (simplified):

if (oldKeyVersion == 0) {
oldKeyVersion = keyVersion;
} else if (keyVersion > oldKeyVersion) {
newKeyVersion = keyVersion;
} else {
// If oldKeyVersion is newer than the key that we've just found,
// let's swap things around.
// The new key should be what the old key was, and this key
// becomes the old key
newKeyVersion = oldKeyVersion;
oldKeyVersion = keyVersion;
}

Additional logging has also been added to confirm the loaded keys after this processing of the encryption keys so any future issues can be spotted even faster.

With all of that said, the crisis was solved, and we were back in business.

Writing a post-mortem to share our learnings

The whole point of conducting post-mortems is to not only identify what happened, but more importantly learn what we can do to avoid the same or similar issues in the future. Blame plays no part in a post-mortem; we still don’t know exactly who wrote the faulty logic in the original code. No-one has looked at the git blame, because that’s not what’s important. To maximize value and impact of our post mortems, we always share them internally with the entire company to ensure everyone learns from every incident.

These key points come directly from our internal post-mortem of this incident:

  • The existing logging around the key loading logic was insufficient, which delayd us in determining the cause.

    • We logged what keys were provided on the command line, but before additional processing while loading them into internal variables.

    • We should log results after performing additional operations to ensure that assumptions about internal behaviors are correct.

  • Data structures using std::map have a custom case-insensitive sorter, so guarantees a certain order to be maintained if you iterate over the internal map, however, this order is not guaranteed to be what we want or expect.

    • In this example we assumed lower keys would always be first like v1 before v2, but the moment we went past 10, it was sorted that v10 comes before v9.

We were able to resolve this quickly because the sun never sets at Expensify. This is thanks to our remote-first policy, dotting our employees all over the globe in almost every time zone. So besides our on-call Infrastructure team, there’s always someone available to help no matter what time zone it is.

This policy has an additional advantage because it ensures nobody has to stay up all night, and instead, we can hand things off and help our team maintain a good work/life balance. As other employees began to come online to start their day from New York or end their day in Taiwan, we had plenty of helping hands to manage this incident without adding undue stress. 

The value of asking questions and challenging assumptions

I hope you’ve found this event interesting in the eyes of hindsight; a simple assumption about internal behaviors of a computer system took our entire site down, years after the original code was written because the assumption was correct… Until it wasn’t. From this incident, the idea of validating assumptions is a key takeaway for our entire team — always ask questions, so that we can understand everything, even if we weren’t here when those decisions were made.

Is Incident Response what you dream about at night? Do you want to join a collaborative team that helps solve problems, no matter how big or small? Expensify is always looking for talented individuals to join our team. Check out our SRE position — when we say it’s fully remote, we mean it because the more distributed we are, the better! You can work from (almost) anywhere on the planet and we do not adjust salaries based on location. 

Let us know in the comments if you’re interested in learning more about our approach to incident response and post mortems!





Phil Smith

Phil joined Expensify as a Site Reliability Engineer after almost 20 years of working in the Ministry of Ones-and-Zeros. Husband of Steph and Father of Eli, when he's not working or at home, you'll find him either at the Fire Station where he volunteers as a first responder, or at the local airport enjoying anything and everything aviation related.

Related Posts

Streamline your travel expense reporting: Examples and tips

Streamline your travel expense reporting: Examples and tips

What does per diem mean? How to create a per diem expense policy

What does per diem mean? How to create a per diem expense policy

Strategies to prevent expense fraud

Strategies to prevent expense fraud

Better money management = more money to manage.

Get started with Expensify.