Bug | Resolution: Unresolved | 4.17.0 | Quality / Stability / Reliability | Important
Description of problem:
Firing alerts in the cluster:
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -H "Authorization: Bearer $token" \
    'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' \
    --data-urlencode 'query=count (ALERTS{alertstate="firing"}) by (namespace, alertname)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      { "metric": { "alertname": "AlertmanagerReceiversNotConfigured", "namespace": "openshift-monitoring" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "CannotRetrieveUpdates", "namespace": "openshift-cluster-version" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "ClusterNotUpgradeable", "namespace": "openshift-cluster-version" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "CsvAbnormalOver30Min", "namespace": "openshift-operator-lifecycle-manager" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "GarbageCollectorSyncFailed", "namespace": "openshift-kube-controller-manager" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "KubePodCrashLooping", "namespace": "openshift-operators" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "TechPreviewNoUpgrade", "namespace": "openshift-kube-apiserver-operator" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "TestAlert", "namespace": "ns1" }, "value": [ 1721886663.818, "1" ] },
      { "metric": { "alertname": "Watchdog", "namespace": "openshift-monitoring" }, "value": [ 1721886663.818, "1" ] }
    ],
    "analysis": {}
  }
}
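For comparison, the developer console does not read alerts from this query API; it lists them via a namespace-filtered rules endpoint (see below). A sketch of the same query through the tenancy-enforcing port, assuming 9092 is thanos-querier's tenancy port as described in the monitoring docs (the port number and auth behavior here are assumptions, not verified in this report):

# Hedged sketch: the tenancy port injects a namespace label matcher, so this
# should return the firing alerts whose ALERTS series carry the namespace label.
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -H "Authorization: Bearer $token" \
    'https://thanos-querier.openshift-monitoring.svc:9092/api/v1/query?namespace=openshift-cluster-version' \
    --data-urlencode 'query=ALERTS{alertstate="firing"}' | jq '.data.result[].metric.alertname'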
CannotRetrieveUpdates and ClusterNotUpgradeable are firing under openshift-cluster-version.
Go to the developer console, select the openshift-cluster-version project, and click the "Observe - Alerts" tab: no alerts are shown, see picture: https://drive.google.com/file/d/1E0qI6NJgHiElCLuL384t8IZkBD7CIj6I/view?usp=drive_link
Debugging in the browser console shows that the rules?namespace=openshift-cluster-version API response contains only the ClusterVersionOperatorDown alert, which is not firing. (NOTE: the ClusterVersionOperatorDown rule has a static "namespace: openshift-cluster-version" label, while the firing alerts CannotRetrieveUpdates/ClusterNotUpgradeable do not.)
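The same response can be reproduced without the browser by hitting the rules endpoint directly. A hedged sketch, assuming 9093 is the tenancy-rules port on thanos-querier that backs the developer console's rules requests:

# Hedged sketch: expect only ClusterVersionOperatorDown back, matching the
# browser debugging above, because the filter matches static rule labels only.
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -k -H "Authorization: Bearer $token" \
    'https://thanos-querier.openshift-monitoring.svc:9093/api/v1/rules?namespace=openshift-cluster-version' \
    | jq '[.data.groups[].rules[].name]'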
NOTE: if you install the latest accepted 4.17 nightly build and wait for some time, you will see ClusterNotUpgradeable/CannotRetrieveUpdates firing. ClusterNotUpgradeable/CannotRetrieveUpdates alert details:
$ oc -n openshift-cluster-version get prometheusrules cluster-version-operator -oyaml
...
spec:
  groups:
  - name: cluster-version
    rules:
    - alert: ClusterVersionOperatorDown
      annotations:
        description: The operator may be down or disabled. The cluster will not be kept up to date and upgrades will not be possible. Inspect the openshift-cluster-version namespace for events or changes to the cluster-version-operator deployment or pods to diagnose and repair. {{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more information refer to {{ label "url" (first $console_url ) }}/k8s/cluster/projects/openshift-cluster-version.{{ end }}{{ end }}
        summary: Cluster version operator has disappeared from Prometheus target discovery.
      expr: |
        absent(up{job="cluster-version-operator"} == 1)
      for: 10m
      labels:
        namespace: openshift-cluster-version
        severity: critical
...
    - alert: CannotRetrieveUpdates
      annotations:
        description: Failure to retrieve updates means that cluster administrators will need to monitor for available updates on their own or risk falling behind on security or other bugfixes. If the failure is expected, you can clear spec.channel in the ClusterVersion object to tell the cluster-version operator to not retrieve updates. Failure reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "RetrievedUpdates") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0)}}{{label "reason" $value}} {{end}}{{end}}{{end}}. For more information refer to `oc get clusterversion/version -o=jsonpath="{.status.conditions[?(.type=='RetrievedUpdates')]}{'\n'}"`{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
        summary: Cluster version operator has not retrieved updates in {{ $value | humanizeDuration }}.
      expr: |
        max by (namespace)
        (
          (
            time()-cluster_version_operator_update_retrieval_timestamp_seconds
          ) >= 3600
          and ignoring(condition, name, reason)
          (cluster_operator_conditions{name="version", condition="RetrievedUpdates", endpoint="metrics", reason!="NoChannel"})
        )
      labels:
        severity: warning
...
  - name: cluster-operators
    rules:
    - alert: ClusterNotUpgradeable
      annotations:
        description: In most cases, you will still be able to apply patch releases. Reason {{ with $cluster_operator_conditions := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version") (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0) }}{{label "reason" $value}}.{{end}}{{end}}{{end}} For more information refer to 'oc adm upgrade'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
        summary: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
      expr: |
        max by (namespace, name, condition, endpoint)
        (
          cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0
        )
      for: 60m
      labels:
        severity: info
...
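For illustration only (a sketch of the difference, not necessarily the intended fix): if CannotRetrieveUpdates carried the same static label that ClusterVersionOperatorDown does, the namespace-filtered rules endpoint would match it:

    # Hypothetical sketch only: adds the static label that ClusterVersionOperatorDown
    # already carries. Whether the fix belongs in the rules or in the console's
    # namespace filtering is for the assignee to decide.
    - alert: CannotRetrieveUpdates
      ...
      labels:
        namespace: openshift-cluster-version  # static label seen by rules?namespace= filtering
        severity: warning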
The same applies to the alerts under the openshift-operator-lifecycle-manager/openshift-kube-controller-manager/openshift-operators/openshift-kube-apiserver-operator projects.
For the TechPreviewNoUpgrade alert firing under the openshift-kube-apiserver-operator project: debugging in the console shows the rules?namespace=openshift-kube-apiserver-operator response is empty (NOTE: the rule does not have a static namespace label), see: https://drive.google.com/file/d/13tB7_3Ko6ISfzGMZ_JkDi80kSFv3be7x/view?usp=drive_link
$ oc -n openshift-kube-apiserver-operator get prometheusrules kube-apiserver-operator -oyaml
...
spec:
  groups:
  - name: cluster-version
    rules:
    - alert: TechPreviewNoUpgrade
      annotations:
        description: Cluster has enabled Technology Preview features that cannot be undone and will prevent upgrades. The TechPreviewNoUpgrade feature set is not recommended on production clusters.
        summary: Cluster has enabled tech preview features that will prevent upgrades.
      expr: |
        cluster_feature_set{name=~"TechPreviewNoUpgrade|CustomNoUpgrade", namespace="openshift-kube-apiserver-operator"} == 0
      for: 10m
      labels:
        severity: warning
For the alerts under openshift-monitoring/ns1 (taking openshift-monitoring as an example), the alerts do show up under the Alerts tab, see https://drive.google.com/file/d/1gMjrVY_BN8OErwXeMuThfuQYutv1l0nx/view?usp=drive_link
NOTE: there is a static "namespace: openshift-monitoring" label in these prometheusrules, while the rules for the firing alerts in the other projects carry no such label; this is probably why we don't see alerts on the developer console for those projects.
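To audit which alert rules carry a static namespace label, something like the following should work (a sketch using oc and jq; the namespace list is just the projects mentioned in this report):

$ for ns in openshift-cluster-version openshift-kube-apiserver-operator \
            openshift-kube-controller-manager openshift-operator-lifecycle-manager \
            openshift-operators openshift-monitoring; do
    echo "== $ns"
    # print each alert rule and its static namespace label, if any
    oc -n "$ns" get prometheusrules -o json | jq -r '
      .items[].spec.groups[].rules[]
      | select(.alert != null)
      | "\(.alert)\t\(.labels.namespace // "<no static namespace label>")"'
  done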
$ oc -n openshift-monitoring get prometheusrules cluster-monitoring-operator-prometheus-rules -oyaml
...
  - name: general.rules
    rules:
    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing. For example the
          "DeadMansSnitch" integration in PagerDuty.
        summary: An alert that should always be firing to certify that Alertmanager is working properly.
      expr: vector(1)
      labels:
        namespace: openshift-monitoring
        severity: none
...
    - alert: AlertmanagerReceiversNotConfigured
      annotations:
        description: Alerts are not configured to be sent to a notification system, meaning that you may not be notified in a timely fashion when important failures occur. Check the OpenShift documentation to learn how to configure notifications with Alertmanager.
        summary: Receivers (notification integrations) are not configured on Alertmanager
      expr: cluster:alertmanager_integrations:max == 0
      for: 10m
      labels:
        namespace: openshift-monitoring
        severity: warning
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2024-07-20-191204
How reproducible:
Always in the developer console; there is no such issue in the administrator console.
Steps to Reproduce:
1. Trigger alerts in projects other than openshift-monitoring and user projects; see the description for example projects. One hedged way to trigger CannotRetrieveUpdates is sketched below.
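The following is inferred from the rule definition in the description, not a verified procedure: point spec.channel at a channel that cannot be resolved and wait out the rule's one-hour threshold.

# Assumption: an unresolvable channel makes RetrievedUpdates go False with a
# reason other than NoChannel, so CannotRetrieveUpdates should fire after ~1h.
$ oc adm upgrade channel no-such-channel   # may need --allow-explicit-channel
$ oc get clusterversion version \
    -o jsonpath='{.status.conditions[?(@.type=="RetrievedUpdates")]}{"\n"}'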
Actual results:
No alerts are shown under the developer console "Observe - Alerts" tab for some projects.
Expected results:
Firing alerts should be shown under the project's "Observe - Alerts" tab in the developer console.
Additional info: