Details
-
Bug
-
Resolution: Done
-
Major
-
None
-
0
Description
During the EC2->OSD migration we found that when the build cluster was over capacity, builds stayed queued as Kubernetes jobs for extended periods while they waited for build cluster capacity. Because the grace period is only 100 seconds, those builders didn't check in on time and were scheduled to be killed. Something in the bookkeeping for this scenario is broken: the build manager was receiving update events from Redis without any status payload.
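As an illustration of the failure mode only, here is a minimal sketch of a handler that tolerates update events arriving without a status payload. The function names (`parse_job_event`, `handle_job_event`), the JSON encoding of the payload, and the key format are assumptions for the example, not the build manager's actual code.

```python
import json
from typing import Optional


def parse_job_event(raw_value: Optional[str]) -> Optional[dict]:
    """Decode a job status payload; return None if it is missing or unusable.

    Hypothetical helper: assumes the status payload is a JSON object.
    """
    if not raw_value:
        # Event fired with no value at all (the case described above).
        return None
    try:
        payload = json.loads(raw_value)
    except (TypeError, ValueError):
        # Value was present but is not valid JSON.
        return None
    # Only a JSON object counts as a status payload in this sketch.
    return payload if isinstance(payload, dict) else None


def handle_job_event(key: str, raw_value: Optional[str]) -> str:
    """Return "processed" for a usable event, "skipped" for one without a payload."""
    status = parse_job_event(raw_value)
    if status is None:
        # Skip instead of raising, so one empty event cannot break the bookkeeping.
        return "skipped"
    return "processed"
```

Skipping the empty event is only one possible policy; it hides the symptom and does not explain why the payload was missing in the first place.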