Bug | Resolution: Unresolved | 4.17.0 | Quality / Stability / Reliability | Important
Description of problem:
Firing alerts in the cluster:
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -H "Authorization: Bearer $token" \
    'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' \
    --data-urlencode 'query=count (ALERTS{alertstate="firing"}) by (namespace, alertname)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      { "metric": { "alertname": "AlertmanagerReceiversNotConfigured", "namespace": "openshift-monitoring" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "CannotRetrieveUpdates", "namespace": "openshift-cluster-version" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "ClusterNotUpgradeable", "namespace": "openshift-cluster-version" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "CsvAbnormalOver30Min", "namespace": "openshift-operator-lifecycle-manager" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "GarbageCollectorSyncFailed", "namespace": "openshift-kube-controller-manager" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "KubePodCrashLooping", "namespace": "openshift-operators" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "TechPreviewNoUpgrade", "namespace": "openshift-kube-apiserver-operator" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "TestAlert", "namespace": "ns1" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "Watchdog", "namespace": "openshift-monitoring" }, "value": [ 1721886663.818, "1" ] }
    ],
    "analysis": {}
  }
}
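For comparison, the developer console does not read alerts from this query API; it lists them via a namespace-filtered rules endpoint (see below). A sketch of the same query through the tenancy-enforcing port, assuming 9092 is thanos-querier's tenancy port as described in the monitoring docs (the port number and auth behavior here are assumptions, not verified in this report):

# Hedged sketch: the tenancy port injects a namespace label matcher, so this
# should return the firing alerts whose ALERTS series carry the namespace label.
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -H "Authorization: Bearer $token" \
    'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?namespace=openshift-cluster-version' \
    --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result[].metric.alertname'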
CannotRetrieveUpdates and ClusterNotUpgradeable are firing under openshift-cluster-version.
Go to the developer console, select the openshift-cluster-version project, and click the "Observe - Alerts" tab: no alerts are shown, see picture: https://drive.google.com/file/d/1E0qI6NJgHiElCLuL384t8IZkBD7CIj6I/view?usp=drive_link
Debugging in the browser console shows that the rules?namespace=openshift-cluster-version API response contains only the ClusterVersionOperatorDown alert, which is not firing. (NOTE: the ClusterVersionOperatorDown rule has a static "namespace: openshift-cluster-version" label, while the firing alerts CannotRetrieveUpdates/ClusterNotUpgradeable do not.)
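The same response can be reproduced without the browser by hitting the rules endpoint directly. A hedged sketch, assuming 9093 is the tenancy-rules port on thanos-querier that backs the developer console's rules requests:

# Hedged sketch: expect only ClusterVersionOperatorDown back, matching the
# browser debugging above, because the filter matches static rule labels only.
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -H "Authorization: Bearer $token" \
    'https://thanos-querier.openshift-monitoring.svc:9093/api/v1/rules?namespace=openshift-cluster-version' \
    | jq '[.data.groups[].rules[].name]'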
NOTE: if you install the latest accepted 4.17 nightly build and wait for some time, you will see ClusterNotUpgradeable/CannotRetrieveUpdates firing. ClusterNotUpgradeable/CannotRetrieveUpdates alert details:
$ oc -n openshift-cluster-version get prometheusrules cluster-version-operator -oyaml
...
spec:
  groups:
  - name: cluster-version
    rules:
    - alert: ClusterVersionOperatorDown
      annotations:
        description: The operator may be down or disabled. The cluster will not be kept up to date and upgrades will not be possible. Inspect the openshift-cluster-version namespace for events or changes to the cluster-version-operator deployment or pods to diagnose and repair. {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/k8s/cluster/projects/openshift-cluster-version.{{ end }}{{ end }}
        summary: Cluster version operator has disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="cluster-version-operator"} == 1)
      for: 10m
      labels:
        namespace: openshift-cluster-version
        severity: critical
...
    - alert: CannotRetrieveUpdates
      annotations:
        description: Failure to retrieve updates means that cluster administrators will need to monitor for available updates on their own or risk falling behind on security or other bugfixes. If the failure is expected, you can clear spec.channel in the ClusterVersion object to tell the cluster-version operator to not retrieve updates. Failure reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "RetrievedUpdates") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0)}}{{label "reason" $value}} {{end}}{{end}}{{end}}. For more information refer to `oc get clusterversion/version -o=jsonpath="{.status.conditions[?(.type=='RetrievedUpdates')]}{'\n'}"`{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
        summary: Cluster version operator has not retrieved updates in {{ $value | humanizeDuration }}.
      expr: |
        max by (namespace)
        (
          (
            time()-cluster_version_operator_update_retrieval_timestamp_seconds
          ) >= 3600
          and ignoring(condition, name, reason)
          (cluster_operator_conditions{name="version", condition="RetrievedUpdates", endpoint="metrics", reason!="NoChannel"})
        )
      labels:
        severity: warning
...
  - name: cluster-operators
    rules:
    - alert: ClusterNotUpgradeable
      annotations:
        description: In most cases, you will still be able to apply patch releases. Reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0) }}{{label "reason" $value}}.{{end}}{{end}}{{end}} For more information refer to 'oc adm upgrade'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
        summary: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
      expr: |
        max by (namespace, name, condition, endpoint)
        (
          cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0
        )
      for: 60m
      labels:
        severity: info
...
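For illustration only (a sketch of the difference, not necessarily the intended fix): if CannotRetrieveUpdates carried the same static label that ClusterVersionOperatorDown does, the namespace-filtered rules endpoint would match it:

    # Hypothetical sketch only: adds the static label that ClusterVersionOperatorDown
    # already carries. Whether the fix belongs in the rules or in the console's
    # namespace filtering is for the assignee to decide.
    - alert: CannotRetrieveUpdates
      ...
      labels:
        namespace: openshift-cluster-version  # static label seen by rules?namespace= filtering
        severity: warning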
The same applies to the alerts under the openshift-operator-lifecycle-manager/openshift-kube-controller-manager/openshift-operators/openshift-kube-apiserver-operator projects.
For the TechPreviewNoUpgrade alert firing under the openshift-kube-apiserver-operator project: debugging in the console shows the rules?namespace=openshift-kube-apiserver-operator response is empty (NOTE: the rule does not have a static namespace label), see: https://drive.google.com/file/d/13tB7_3Ko6ISfzGMZ_JkDi80kSFv3be7x/view?usp=drive_link
$ oc -n openshift-kube-apiserver-operator get prometheusrules kube-apiserver-operator -oyaml
...
spec:
  groups:
  - name: cluster-version
    rules:
    - alert: TechPreviewNoUpgrade
      annotations:
        description: Cluster has enabled Technology Preview features that cannot be undone and will prevent upgrades. The TechPreviewNoUpgrade feature set is not recommended on production clusters.
        summary: Cluster has enabled tech preview features that will prevent upgrades.
      expr: |
        cluster_feature_set{name=~"TechPreviewNoUpgrade|CustomNoUpgrade", namespace="openshift-kube-apiserver-operator"} == 0
      for: 10m
      labels:
        severity: warning
For the alerts under openshift-monitoring/ns1 (taking openshift-monitoring as an example), the alerts do show up under the Alerts tab, see https://drive.google.com/file/d/1gMjrVY_BN8OErwXeMuThfuQYutv1l0nx/view?usp=drive_link
NOTE: there is a static "namespace: openshift-monitoring" label in these prometheusrules, while the rules for the firing alerts in the other projects carry no such label; this is probably why we don't see alerts on the developer console for those projects.
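To audit which alert rules carry a static namespace label, something like the following should work (a sketch using oc and jq; the namespace list is just the projects mentioned in this report):

$ for ns in openshift-cluster-version openshift-kube-apiserver-operator \
            openshift-kube-controller-manager openshift-operator-lifecycle-manager \
            openshift-operators openshift-monitoring; do
    echo "== $ns"
    # print each alert rule and its static namespace label, if any
    oc -n "$ns" get prometheusrules -o json | jq -r '
      .items[].spec.groups[].rules[]
      | select(.alert != null)
      | "\(.alert)\t\(.labels.namespace // "<no static namespace label>")"'
  done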
$ oc -n openshift-monitoring get prometheusrules cluster-monitoring-operator-prometheus-rules -oyaml
...
  - name: general.rules
    rules:
    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      expr: vector(1)
      labels:
        namespace: openshift-monitoring
        severity: none
...
    - alert: AlertmanagerReceiversNotConfigured
      annotations:
        description: Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur. Check the OpenShift documentation to learn how to configure notifications with Alertmanager.
        summary: Receivers (notification integrations) are not configured on Alertmanager
      expr: cluster:alertmanager_integrations:max == 0
      for: 10m
      labels:
        namespace: openshift-monitoring
        severity: warning
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
Always in the developer console; there is no such issue in the administrator console.
Steps to Reproduce:
1. Trigger alerts in projects other than openshift-monitoring and user projects; see the description for example projects. One hedged way to trigger CannotRetrieveUpdates is sketched below.
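The following is inferred from the rule definition in the description, not a verified procedure: point spec.channel at a channel that cannot be resolved and wait out the rule's one-hour threshold.

# Assumption: an unresolvable channel makes RetrievedUpdates go False with a
# reason other than NoChannel, so CannotRetrieveUpdates should fire after ~1h.
$ oc adm upgrade channel no-such-channel   # may need --allow-explicit-channel
$ oc get clusterversion version \
    -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")]}{"\n"}'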
Actual results:
No alerts are shown under the developer console "Observe - Alerts" tab for some projects.
Expected results:
Firing alerts should be shown under the project's "Observe - Alerts" tab in the developer console.
Additional info: