Incident Summary
On November 6th at 7:00 PM PST, our automated monitoring system detected a degradation affecting multiple services, and we declared an internal incident. During the incident, users experienced instability in the Command application and API, and some were unable to access the Command application at all.
Root Cause
The incident was caused by a surge in load on a permissions database. This spike originated from a suboptimal database operation within our regular release process, which led to excessive lock contention on the database. As a result, database queries and numerous internal requests were blocked or delayed, causing instability across the Command application and API.
Resolution and Mitigations
We quickly identified the source of the elevated database load and have since disabled the responsible operation. The measures we are taking to prevent recurrence are described under Next Steps.
Next Steps
We are conducting a broader review of our database operations in the release process to ensure stability under all conditions. Additionally, further adjustments to monitoring thresholds and alerts are underway to enhance early detection and prevent similar issues from impacting our users in the future.