Story
Resolution: Unresolved
We frequently see InstallJobDelayHigh fire because customers attempt to install with insufficient credentials, which is not actionable by hive/SRE/etc. One way this manifests is DNSZones never becoming ready. In such cases, the DNSZone gets a status condition of type "InsufficientCredentials".
It would be nice if we could label the metric on which this alert is built, hive_cluster_deployment_install_job_delay_seconds, so that we can filter these cases out in the alert definition and not waste effort tracking them down.
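A minimal sketch of the labeling idea, in Python purely for illustration (this is not hive code). It assumes the DNSZone's status conditions are available at the point where the metric is observed, which may not be true today. The `Condition` type, the `reason` label name, and the `Unspecified` fallback value are all hypothetical; only the "InsufficientCredentials" condition type comes from this card.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Condition:
    """Hypothetical stand-in for a DNSZone status condition."""
    type: str
    status: str  # "True", "False", or "Unknown"


# Condition types that mean the delay is caused by the customer's setup,
# and so is not actionable by us. Only InsufficientCredentials is named
# in this card; anything else added here would be an assumption.
NON_ACTIONABLE_CONDITIONS = ("InsufficientCredentials",)

# Hypothetical label value for delays with no known customer-side cause.
DEFAULT_REASON = "Unspecified"


def install_delay_reason(dnszone_conditions: Iterable[Condition]) -> str:
    """Return the value for a hypothetical `reason` label on the delay metric."""
    for cond in dnszone_conditions:
        if cond.type in NON_ACTIONABLE_CONDITIONS and cond.status == "True":
            return cond.type
    return DEFAULT_REASON


def extra_metric_labels(dnszone_conditions: Iterable[Condition]) -> Dict[str, str]:
    """Labels to add when observing the install job delay metric."""
    return {"reason": install_delay_reason(dnszone_conditions)}
```

With a label like this, the alert definition could exclude the non-actionable series with a matcher such as `reason!="InsufficientCredentials"`, again assuming that label name.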
This is likely to involve some nontrivial refactoring, because at the point where we observe that metric today we don't have much context to work with.
Also, I've noticed lately that problems with DNSZone readiness that I think should result in DNSNotReadyTimeout... don't. This is possibly a bug, and possibly unrelated to this card. It may be related, though: if the DNSZone never becomes ready, I think we would never actually start the provision, and therefore never observe this metric. Worth investigating before we go too far down this rabbit hole.