From around 5:18am to 5:54am Pacific (12:18pm to 12:54pm UTC), Convex had a 36-minute period of intermittent downtime that affected all Convex services.
The specific issue was a cascading failure in our traffic layer. A traffic node (Caddy) ran out of memory due to an unforeseen load spike. Instead of being restarted or replaced, this node was marked as permanently down by our container management layer (Nomad), which led to the issue propagating to all traffic servers.
Since the incident, we've more than doubled the size of our traffic layer and fixed the failover behavior that left nodes failed after running out of memory. We will also be investigating alternative traffic services.
As always, data was safe during this incident, but we sincerely apologize for the availability impact during that time period.