- Spike
- Resolution: Unresolved
- Minor
- None
- None
- False
- False
- Undefined
1. Proposed title of this feature request
Prometheus rule for cronjobs in FailedNeedsStart condition
2. What is the nature and description of the request?
When a cronjob misses a certain number of start times (100 or more), it enters a permanent failed state. This can occur when the cluster is shut down for an extended period, or when there are temporary issues in the cluster. The customer is requesting that Alertmanager fire an alert when an infrastructure cronjob is in this state. An example of the message you might see in the events:
The Elasticsearch index management cronjobs fail after a maintenance window brought down the nodes for 72 hours. The event in the cronjob description is: Warning FailedNeedsStart 69s (x24505 over 2d20h) cronjob-controller Cannot determine if job needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.
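kube-state-metrics does not expose the FailedNeedsStart condition directly, so one possible proxy is to alert when a non-suspended cronjob has not been scheduled past its next expected start time, using the kube_cronjob_next_schedule_time and kube_cronjob_spec_suspend metrics. A minimal sketch of such a rule, assuming the Prometheus Operator PrometheusRule CRD is in use (group, alert, and threshold values below are illustrative, not a proposed final rule):

```yaml
# Illustrative PrometheusRule sketch; names and thresholds are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cronjob-missed-start      # placeholder name
  namespace: openshift-monitoring # placeholder namespace
spec:
  groups:
    - name: cronjob.rules
      rules:
        - alert: CronJobMissedStart
          # Fires when a cronjob is more than 15 minutes past its next
          # scheduled start time and is not deliberately suspended.
          expr: |
            (time() - kube_cronjob_next_schedule_time > 900)
            and on (namespace, cronjob)
            (kube_cronjob_spec_suspend == 0)
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: >-
              CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} has missed
              its scheduled start time.
            description: >-
              The cronjob has not started for more than 15 minutes past its
              next scheduled time; it may be stuck in the FailedNeedsStart
              condition.
```

The 15-minute threshold is arbitrary here; a production rule would likely need a per-cronjob tolerance, since valid schedules range from every minute to monthly.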
3. Why does the customer need this? (List the business requirements here)
Reliability and uptime.
4. List any affected packages or components.
Any cronjobs for components supported by Red Hat.
This is known Kubernetes behavior, for example:
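As the event message itself suggests, the documented mitigation is to set .spec.startingDeadlineSeconds on the affected CronJob, which bounds how far back the controller counts missed start times. A minimal sketch (name, schedule, and image are placeholders):

```yaml
# Illustrative CronJob fragment; name, schedule, and image are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-index-management
spec:
  schedule: "*/15 * * * *"
  # Without a deadline, the controller counts every missed start since the
  # last run; after more than 100 misses it gives up with FailedNeedsStart.
  # With a deadline set, only misses within the last 600 seconds count.
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: registry.example.com/task:latest # placeholder image
```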