In the early days of 2018, the engineering team at the mobile services company Branch noticed slowdowns and errors on its Amazon Web Services cloud servers. An unexpected round of AWS server reboots in December had already struck Ian Chan, Branch’s director of engineering, as odd. But the server slowdowns a few weeks later presented a more pressing concern.

“We had six engineers crammed in a small war room all staring at charts, deploy logs, revision histories, and latency graphs looking for the cause,” Chan says. “We spent a few days eliminating possibilities one after another, but were unable to find a root cause. We were seemingly chasing a non-existent bug in our system.”

