Some Convex customers experienced downtime when an anomalous load spike pushed an internal service into a load-shedding mode, which in turn triggered an unexpected panic in a caching library we use.
Specifically, the load triggered two queue management algorithms: CoDel, which proactively drops requests to keep queues small, and adaptive-LIFO, which dequeues in reverse order to avoid wasting time on old requests. Both are rather subtle algorithms that large services use to avoid congestion collapse under high load or attack. The panic in the caching library was a bug that depended on both of these algorithms being active simultaneously.
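To make the two algorithms concrete, here is a minimal sketch of a queue that combines a simplified CoDel-style drop rule with adaptive LIFO. It is not our production code: the class `SheddingQueue`, its parameters (`interval`, `target`), and the choice to use a single "queue has not been empty for a while" signal for both behaviors are assumptions made for illustration.

```python
from collections import deque


class SheddingQueue:
    """Hypothetical queue with simplified CoDel-style dropping and adaptive LIFO."""

    def __init__(self, interval=0.100, target=0.005):
        self.interval = interval  # allowed queueing time while healthy (seconds)
        self.target = target      # allowed queueing time while overloaded (seconds)
        self._items = deque()     # (request, enqueued_at) pairs, oldest on the left
        self._last_empty = 0.0    # last time the queue was observed empty
        self.dropped = 0

    def push(self, request, now):
        if not self._items:
            self._last_empty = now
        self._items.append((request, now))

    def pop(self, now):
        # Overloaded: the queue has not drained to empty within `interval`.
        overloaded = now - self._last_empty > self.interval
        limit = self.target if overloaded else self.interval
        while self._items:
            # Adaptive LIFO: newest first under overload, plain FIFO otherwise.
            request, enqueued_at = (
                self._items.pop() if overloaded else self._items.popleft()
            )
            if now - enqueued_at > limit:
                # CoDel-style drop: the request waited too long to be worth serving.
                self.dropped += 1
                continue
            if not self._items:
                self._last_empty = now
            return request
        self._last_empty = now
        return None
```

With an explicit clock passed in as `now`, a healthy queue serves requests in arrival order, while a queue that has been backed up for longer than `interval` serves the freshest request first and drops anything older than `target`.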
We've made some changes as a result of this incident, but the key lesson is that services should avoid switching logical behavior under high load. When systems are stressed, switching to infrequently used code paths can often make matters worse. We're now going to proactively trigger CoDel and adaptive-LIFO at steady state so that this worst-case flow is exercised at all times.
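One way to picture steady-state triggering is to route a small, random fraction of healthy traffic through the same code path that overload would select. The sketch below shows only that decision; the function name `use_overload_path` and the `exercise_rate` parameter are hypothetical, and it does not describe how we actually implement this.

```python
import random


def use_overload_path(overloaded: bool, rng: random.Random,
                      exercise_rate: float = 0.01) -> bool:
    """Decide whether a dequeue should take the overload (CoDel + LIFO) path."""
    if overloaded:
        # Real overload always takes the load-shedding path.
        return True
    # Assumption: sample a small fraction of healthy traffic onto the same
    # path so it is exercised continuously rather than only during incidents.
    return rng.random() < exercise_rate
```

Passing in the random number generator keeps the decision reproducible in tests. What the overload path should do to a sampled healthy request, for example whether it uses the same strict thresholds, is a separate design choice this sketch leaves open.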
We apologize to our customers for the impact on their products and services. We’re focusing intensely over the next few weeks on hardening our systems to prevent issues like this from happening again.