High error rates on customer backends

Incident Report for Convex

Postmortem

Convex had downtime for some customers caused by an anomalous load spike that pushed an internal service into a load-shedding mode, which then triggered an unexpected panic in a caching library we were using.

Specifically, the load triggered two queue management algorithms: CoDel, which proactively drops requests to keep queues small, and adaptive-LIFO, which dequeues in reverse order to avoid wasting time on old requests. Both are rather subtle algorithms that large services use to avoid congestion collapse under high load or attack. The panic in the caching library was a bug that surfaced only when both algorithms were active simultaneously.
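For readers unfamiliar with these two algorithms, here is a minimal sketch of how they can combine in a single request queue. It is an illustration only, not the code that runs in Convex: the `SheddingQueue` class, its parameter names, and the default values (a 100 ms interval, a 5 ms target, a depth threshold of 8) are all assumptions chosen for the example, and the CoDel logic is a simplified server-queue variant rather than the full algorithm.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List, Optional


@dataclass
class Request:
    payload: Any
    enqueued_at: float  # seconds, supplied by the caller


class SheddingQueue:
    """Hypothetical queue: CoDel-style timeout plus adaptive LIFO."""

    def __init__(self, interval: float = 0.100, target: float = 0.005,
                 lifo_threshold: int = 8) -> None:
        self.interval = interval              # relaxed timeout when healthy
        self.target = target                  # strict timeout when overloaded
        self.lifo_threshold = lifo_threshold  # depth at which we go LIFO
        self.items: Deque[Request] = deque()
        self.last_empty = 0.0                 # last time the queue was empty
        self.dropped: List[Any] = []          # payloads shed by the timeout

    def push(self, payload: Any, now: float) -> None:
        if not self.items:
            self.last_empty = now
        self.items.append(Request(payload, now))

    def pop(self, now: float) -> Optional[Any]:
        # CoDel-style check: a queue that has not drained for a whole
        # interval is a standing queue, so switch to the strict timeout.
        overloaded = (now - self.last_empty) > self.interval
        timeout = self.target if overloaded else self.interval
        result = None
        while self.items:
            # Adaptive LIFO: serve the newest request while the queue is
            # deep, and fall back to FIFO once it is shallow again.
            lifo = len(self.items) > self.lifo_threshold
            req = self.items.pop() if lifo else self.items.popleft()
            if now - req.enqueued_at > timeout:
                self.dropped.append(req.payload)  # shed the stale request
                continue
            result = req.payload
            break
        if not self.items:
            self.last_empty = now
        return result
```

In this sketch a healthy queue behaves like plain FIFO with a generous timeout. Both the reverse ordering and the aggressive dropping appear only once the queue stays non-empty and deep, which is why code paths that depend on them see little exercise in normal operation.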

We've made some changes as a result of this incident, but the key lesson is that services should try to avoid switching logical behavior during high load. When systems are stressed, switching to infrequently used codepaths can often make matters worse. We're now going to proactively trigger CoDel and adaptive-LIFO at steady state so that this worst-case flow is exercised at all times.

We apologize to our customers for impacting your products and services. We’re focusing intensely over the next few weeks on hardening our systems to prevent issues like this from happening again.

Posted Mar 05, 2026 - 18:16 UTC

Resolved

Incident resolved. We're very sorry for the impact on your projects.

Our team will be publishing a detailed postmortem soon.

Posted Mar 04, 2026 - 18:28 UTC

Monitoring

We've identified the issue and remediated the problem. We're monitoring before we declare the all clear.

Posted Mar 04, 2026 - 18:14 UTC

Investigating

We are currently investigating an issue leading to elevated error rates on customer backends.

Posted Mar 04, 2026 - 17:28 UTC

This incident affected: Live Traffic (Free & Starter).