  1. OpenShift Bugs
  2. OCPBUGS-37539

No alerts are shown under the developer console "Observe - Alerts" tab if the alert's PrometheusRule does not set a namespace label


    • Type: Bug
    • Resolution: Unresolved
    • 4.17.0
    • Observability UI
    • Quality / Stability / Reliability
    • Important

      Description of problem:

      The following alerts are firing in the cluster:

      $ token=`oc create token prometheus-k8s -n openshift-monitoring`
      $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=count (ALERTS{alertstate="firing"}) by (namespace, alertname)' | jq
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {
                "alertname": "AlertmanagerReceiversNotConfigured",
                "namespace": "openshift-monitoring"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "CannotRetrieveUpdates",
                "namespace": "openshift-cluster-version"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "ClusterNotUpgradeable",
                "namespace": "openshift-cluster-version"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "CsvAbnormalOver30Min",
                "namespace": "openshift-operator-lifecycle-manager"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "GarbageCollectorSyncFailed",
                "namespace": "openshift-kube-controller-manager"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "KubePodCrashLooping",
                "namespace": "openshift-operators"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "TechPreviewNoUpgrade",
                "namespace": "openshift-kube-apiserver-operator"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "TestAlert",
                "namespace": "ns1"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            },
            {
              "metric": {
                "alertname": "Watchdog",
                "namespace": "openshift-monitoring"
              },
              "value": [
                1721886663.818,
                "1"
              ]
            }
          ],
          "analysis": {}
        }
      }
      

      The alerts CannotRetrieveUpdates and ClusterNotUpgradeable are firing under openshift-cluster-version.
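
To make the per-namespace picture easier to read, the query response can be grouped by namespace. The following is a minimal Python sketch, not part of the original report: the function name `firing_alerts_by_namespace` is hypothetical, and `RESPONSE` is an abbreviated copy (three of the nine series) of the output shown above.

```python
import json
from collections import defaultdict

# Abbreviated copy of the query response shown above (three of the nine series).
RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"alertname": "CannotRetrieveUpdates",
                  "namespace": "openshift-cluster-version"},
       "value": [1721886663.818, "1"]},
      {"metric": {"alertname": "ClusterNotUpgradeable",
                  "namespace": "openshift-cluster-version"},
       "value": [1721886663.818, "1"]},
      {"metric": {"alertname": "Watchdog",
                  "namespace": "openshift-monitoring"},
       "value": [1721886663.818, "1"]}
    ]
  }
}
"""


def firing_alerts_by_namespace(response_text):
    """Map each namespace to the sorted names of the alerts firing in it."""
    grouped = defaultdict(list)
    for series in json.loads(response_text)["data"]["result"]:
        labels = series["metric"]
        # A series without a namespace label is filed under the empty string.
        grouped[labels.get("namespace", "")].append(labels["alertname"])
    return {ns: sorted(names) for ns, names in grouped.items()}


if __name__ == "__main__":
    for ns, names in sorted(firing_alerts_by_namespace(RESPONSE).items()):
        print(ns, names)
```

The grouping uses the labels on the firing ALERTS series, which is the view the developer console is expected to match for each project.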

      In the developer console, select the openshift-cluster-version project and open the "Observe - Alerts" tab. No alerts are shown; see the screenshot: https://drive.google.com/file/d/1E0qI6NJgHiElCLuL384t8IZkBD7CIj6I/view?usp=drive_link

      Debugging in the console shows that the rules?namespace=openshift-cluster-version API response contains only the ClusterVersionOperatorDown alert, which is not firing. NOTE: the ClusterVersionOperatorDown rule sets the label namespace: openshift-cluster-version, while the rules for the firing alerts CannotRetrieveUpdates and ClusterNotUpgradeable do not set such a label.
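
The observed behavior is consistent with a filter that matches only on the static labels declared in each rule. The Python sketch below models that hypothesis; it is an assumption used for illustration, not the actual console or Thanos implementation. The `Rule` class and `rules_visible_in_namespace` are hypothetical names, and the label values are taken from the cluster-version-operator PrometheusRule quoted in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Alerting rule reduced to its name and the static labels set in the rule."""
    alert: str
    labels: dict = field(default_factory=dict)


# Static labels as declared in the cluster-version-operator PrometheusRule.
RULES = [
    Rule("ClusterVersionOperatorDown",
         {"namespace": "openshift-cluster-version", "severity": "critical"}),
    Rule("CannotRetrieveUpdates", {"severity": "warning"}),
    Rule("ClusterNotUpgradeable", {"severity": "info"}),
]


def rules_visible_in_namespace(rules, namespace):
    """Hypothetical filter: keep rules whose static namespace label matches."""
    return [r.alert for r in rules if r.labels.get("namespace") == namespace]


if __name__ == "__main__":
    print(rules_visible_in_namespace(RULES, "openshift-cluster-version"))
```

Under this model only ClusterVersionOperatorDown survives the filter, which matches the API response described above, even though the two firing alerts carry a namespace label on their ALERTS series.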

      NOTE: to reproduce, install the latest accepted 4.17 nightly build and wait for some time; ClusterNotUpgradeable and CannotRetrieveUpdates will then fire. ClusterNotUpgradeable/CannotRetrieveUpdates alert details:

      $ oc -n openshift-cluster-version get prometheusrules cluster-version-operator -oyaml
      ...
      spec:
        groups:
        - name: cluster-version
          rules:
          - alert: ClusterVersionOperatorDown
            annotations:
              description: The operator may be down or disabled. The cluster will not be
                kept up to date and upgrades will not be possible. Inspect the openshift-cluster-version
                namespace for events or changes to the cluster-version-operator deployment
                or pods to diagnose and repair. {{ with $console_url := "console_url" |
                query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} For more
                information refer to {{ label "url" (first $console_url ) }}/k8s/cluster/projects/openshift-cluster-version.{{
                end }}{{ end }}
              summary: Cluster version operator has disappeared from Prometheus target discovery.
            expr: |
              absent(up{job="cluster-version-operator"} == 1)
            for: 10m
            labels:
              namespace: openshift-cluster-version
              severity: critical
      ...
          - alert: CannotRetrieveUpdates
            annotations:
              description: Failure to retrieve updates means that cluster administrators
                will need to monitor for available updates on their own or risk falling
                behind on security or other bugfixes. If the failure is expected, you can
                clear spec.channel in the ClusterVersion object to tell the cluster-version
                operator to not retrieve updates. Failure reason {{ with $cluster_operator_conditions
                := "cluster_operator_conditions" | query}}{{range $value := .}}{{if and
                (eq (label "name" $value) "version") (eq (label "condition" $value) "RetrievedUpdates")
                (eq (label "endpoint" $value) "metrics") (eq (value $value) 0.0)}}{{label
                "reason" $value}} {{end}}{{end}}{{end}}. For more information refer to `oc
                get clusterversion/version -o=jsonpath="{.status.conditions[?(.type=='RetrievedUpdates')]}{'\n'}"`{{
                with $console_url := "console_url" | query }}{{ if ne (len (label "url"
                (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{
                end }}{{ end }}.
              summary: Cluster version operator has not retrieved updates in {{ $value |
                humanizeDuration }}.
            expr: "max by (namespace)\n(\n  (\n    time()-cluster_version_operator_update_retrieval_timestamp_seconds\n
              \ ) >= 3600 \n  and ignoring(condition, name, reason) \n  (cluster_operator_conditions{name=\"version\",
              condition=\"RetrievedUpdates\", endpoint=\"metrics\", reason!=\"NoChannel\"})\n)\n"
            labels:
              severity: warning
      ...
        - name: cluster-operators
          rules:
          - alert: ClusterNotUpgradeable
            annotations:
              description: In most cases, you will still be able to apply patch releases.
                Reason {{ with $cluster_operator_conditions := "cluster_operator_conditions"
                | query}}{{range $value := .}}{{if and (eq (label "name" $value) "version")
                (eq (label "condition" $value) "Upgradeable") (eq (label "endpoint" $value)
                "metrics") (eq (value $value) 0.0) (ne (len (label "reason" $value)) 0)
                }}{{label "reason" $value}}.{{end}}{{end}}{{end}} For more information refer
                to 'oc adm upgrade'{{ with $console_url := "console_url" | query }}{{ if
                ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first
                $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
              summary: One or more cluster operators have been blocking minor version cluster
                upgrades for at least an hour.
            expr: |
              max by (namespace, name, condition, endpoint) (cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0)
            for: 60m
            labels:
              severity: info
      ...
      

      The same happens for alerts under the openshift-operator-lifecycle-manager, openshift-kube-controller-manager, openshift-operators and openshift-kube-apiserver-operator projects.

      For the TechPreviewNoUpgrade alert firing under the openshift-kube-apiserver-operator project, debugging in the console shows that the rules?namespace=openshift-kube-apiserver-operator response is empty (NOTE: the rule does not set a namespace label), see: https://drive.google.com/file/d/13tB7_3Ko6ISfzGMZ_JkDi80kSFv3be7x/view?usp=drive_link

      $ oc -n openshift-kube-apiserver-operator get prometheusrules kube-apiserver-operator -oyaml
      ...
      spec:
        groups:
        - name: cluster-version
          rules:
          - alert: TechPreviewNoUpgrade
            annotations:
              description: Cluster has enabled Technology Preview features that cannot be
                undone and will prevent upgrades. The TechPreviewNoUpgrade feature set is
                not recommended on production clusters.
              summary: Cluster has enabled tech preview features that will prevent upgrades.
            expr: |
              cluster_feature_set{name=~"TechPreviewNoUpgrade|CustomNoUpgrade", namespace="openshift-kube-apiserver-operator"} == 0
            for: 10m
            labels:
              severity: warning 

      For the openshift-monitoring and ns1 projects, the alerts are shown under the Alerts tab. Taking openshift-monitoring as an example, see https://drive.google.com/file/d/1gMjrVY_BN8OErwXeMuThfuQYutv1l0nx/view?usp=drive_link

      NOTE: the openshift-monitoring PrometheusRules set the label "namespace: openshift-monitoring", whereas the rules for the alerts firing under the other projects set no namespace label. This may be the reason why the developer console shows no alerts for those projects.

      $ oc -n openshift-monitoring get prometheusrules cluster-monitoring-operator-prometheus-rules -oyaml
      ...
        - name: general.rules
          rules:
          - alert: Watchdog
            annotations:
              description: |
                This is an alert meant to ensure that the entire alerting pipeline is functional.
                This alert is always firing, therefore it should always be firing in Alertmanager
                and always fire against a receiver. There are integrations with various notification
                mechanisms that send a notification when this alert is not firing. For example the
                "DeadMansSnitch" integration in PagerDuty.
              summary: An alert that should always be firing to certify that Alertmanager
                is working properly.
            expr: vector(1)
            labels:
              namespace: openshift-monitoring
              severity: none
      ...
          - alert: AlertmanagerReceiversNotConfigured
            annotations:
              description: Alerts are not configured to be sent to a notification system,
                meaning that you may not be notified in a timely fashion when important
                failures occur. Check the OpenShift documentation to learn how to configure
                notifications with Alertmanager.
              summary: Receivers (notification integrations) are not configured on Alertmanager
            expr: cluster:alertmanager_integrations:max == 0
            for: 10m
            labels:
              namespace: openshift-monitoring
              severity: warning
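
To find other rules that may be affected, a PrometheusRule object can be scanned for alerts that declare no static namespace label. This is a minimal Python sketch under stated assumptions: `alerts_missing_namespace_label` is a hypothetical helper, and `SAMPLE` is a hand-trimmed dictionary mirroring the cluster-version-operator rule shown earlier (annotations and expressions omitted).

```python
# Hand-trimmed stand-in for the cluster-version-operator PrometheusRule;
# only the fields needed for the check are kept.
SAMPLE = {
    "spec": {
        "groups": [
            {"name": "cluster-version", "rules": [
                {"alert": "ClusterVersionOperatorDown",
                 "labels": {"namespace": "openshift-cluster-version",
                            "severity": "critical"}},
                {"alert": "CannotRetrieveUpdates",
                 "labels": {"severity": "warning"}},
            ]},
            {"name": "cluster-operators", "rules": [
                {"alert": "ClusterNotUpgradeable",
                 "labels": {"severity": "info"}},
            ]},
        ]
    }
}


def alerts_missing_namespace_label(prometheus_rule):
    """Return the names of alerting rules that set no static namespace label."""
    missing = []
    for group in prometheus_rule.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules use "record" instead of "alert"; skip them.
            if "alert" not in rule:
                continue
            if "namespace" not in rule.get("labels", {}):
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    print(alerts_missing_namespace_label(SAMPLE))
```

The same function could be fed the parsed output of `oc get prometheusrules <name> -o json` (via `json.loads`) instead of the trimmed sample.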
      

      Version-Release number of selected component (if applicable):

      4.17.0-0.nightly-2024-07-20-191204

      How reproducible:

      Always in the developer console; the administrator console is not affected.

      Steps to Reproduce:

      1. Trigger alerts in a project other than openshift-monitoring or a user project; see the description for example projects.

      Actual results:

      No alerts are shown under the developer console "Observe - Alerts" tab for some projects.

      Expected results:

      The firing alerts are shown.

      Additional info:

       

              gbernal@redhat.com Gabriel Bernal
              juzhao@redhat.com Junqi Zhao