OpenShift Bugs / OCPBUGS-27476

NoRunningOvnControlPlane alert fired after user workload monitoring is enabled on hypershift management cluster


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • 4.16.0
    • HyperShift
    • Quality / Stability / Reliability
    • False
    • Moderate
    • No
    • Hypershift Sprint 249
    • 1

      Description of problem:

      On a HyperShift management cluster where user workload monitoring is not enabled, the NoRunningOvnControlPlane alert does not fire:

      $ oc -n openshift-user-workload-monitoring get pod
      No resources found in openshift-user-workload-monitoring namespace.
      
      $ token=`oc create token prometheus-k8s -n openshift-monitoring`
      $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS' | jq '.data.result[].metric | {alertname: .alertname, alertstate: .alertstate}'
      {
        "alertname": "AlertmanagerReceiversNotConfigured",
        "alertstate": "firing"
      }
      {
        "alertname": "CannotRetrieveUpdates",
        "alertstate": "firing"
      }
      {
        "alertname": "Watchdog",
        "alertstate": "firing"
      }    
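
The jq filter above reduces each ALERTS series to its alert name and state. A minimal Python sketch of the same reduction, assuming the response body has already been captured as a string; the function name summarize_alerts is illustrative and not part of any OpenShift tooling:

```python
import json


def summarize_alerts(response_body):
    """Return alertname/alertstate pairs from a Prometheus instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("status"))
    return [
        {
            "alertname": series["metric"].get("alertname"),
            "alertstate": series["metric"].get("alertstate"),
        }
        for series in payload["data"]["result"]
    ]


# Sample shaped like the query response; only one series is included here.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "ALERTS",
                    "alertname": "Watchdog",
                    "alertstate": "firing",
                },
                "value": [1705915247.862, "1"],
            }
        ],
    },
})

for alert in summarize_alerts(sample):
    print(alert)
```

Checking whether NoRunningOvnControlPlane appears in the returned list gives a simple before/after comparison for the steps in this report.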

      Enable user workload monitoring on the HyperShift management cluster as follows:

      $ oc create -f - << EOF
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          enableUserWorkload: true
      EOF

      After waiting at least 5 minutes, the NoRunningOvnControlPlane alert fires:

      # oc -n openshift-user-workload-monitoring get pod
      NAME                                   READY   STATUS    RESTARTS   AGE
      prometheus-operator-77b44bfd69-wk5x2   2/2     Running   0          3h2m
      prometheus-user-workload-0             6/6     Running   0          3h2m
      prometheus-user-workload-1             6/6     Running   0          3h2m
      thanos-ruler-user-workload-0           4/4     Running   0          3h2m
      thanos-ruler-user-workload-1           4/4     Running   0          3h2m
      
      $ token=`oc create token prometheus-k8s -n openshift-monitoring`
      $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS' | jq '.data.result[].metric | {alertname: .alertname, alertstate: .alertstate}'
      ...
      {
        "alertname": "NoRunningOvnControlPlane",
        "alertstate": "firing"
      }
      ...

      Querying the alert directly shows its labels:

      $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="NoRunningOvnControlPlane"}' | jq
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {
                "__name__": "ALERTS",
                "alertname": "NoRunningOvnControlPlane",
                "alertstate": "firing",
                "namespace": "clusters-hypershift-ci-12900",
                "severity": "critical"
              },
              "value": [
                1705915247.862,
                "1"
              ]
            }
          ],
          "analysis": {}
        }
      }

      NoRunningOvnControlPlane is defined in the master-rules PrometheusRule in the clusters-hypershift-ci-12900 namespace:

      $ oc -n clusters-hypershift-ci-12900 get prometheusrules master-rules -oyaml 
      ...
          - alert: NoRunningOvnControlPlane
            annotations:
              description: |
                Networking control plane is degraded. Networking configuration updates applied to the cluster will not be
                implemented while there are no OVN Kubernetes pods.
              runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/NoRunningOvnMaster.md
              summary: There is no running ovn-kubernetes control plane.
            expr: |
              absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)
            for: 5m
            labels:
              namespace: clusters-hypershift-ci-12900
              severity: critical

      A namespace with the openshift.io/cluster-monitoring: "true" label is monitored by openshift-monitoring. Any other namespace that does not carry the openshift.io/user-monitoring: "false" label is monitored by openshift-user-workload-monitoring (if user workload monitoring is enabled). By this rule, the clusters-hypershift-ci-12900 namespace is monitored by openshift-user-workload-monitoring. In addition, the namespace value in a PrometheusRule expression is overwritten with the namespace where the PrometheusRule resides, in this case clusters-hypershift-ci-12900, so the expr in NoRunningOvnControlPlane changes from

      absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)

      to

      absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1)
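
The effect of that rewrite can be sketched in Python. This only illustrates the observable behavior on this one expression; it is not the monitoring stack's implementation. The function enforce_namespace and its regex-based approach are assumptions made for the example (a real implementation would parse PromQL rather than pattern-match it).

```python
import re

# Matches a namespace="..." equality matcher inside a PromQL selector.
# The \b keeps labels such as kube_namespace from matching.
_NAMESPACE_MATCHER = re.compile(r'\bnamespace\s*=\s*"[^"]*"')


def enforce_namespace(expr, namespace):
    """Replace every namespace="..." equality matcher with the given namespace."""
    replacement = 'namespace="%s"' % namespace
    # A callable avoids backslash interpretation in the replacement text.
    return _NAMESPACE_MATCHER.sub(lambda _match: replacement, expr)


original = 'absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)'
rewritten = enforce_namespace(original, "clusters-hypershift-ci-12900")
print(rewritten)
```

The printed expression no longer refers to openshift-ovn-kubernetes, so the rule is evaluated against targets in the hosted control plane namespace.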

      The rewritten expression can be confirmed from the Thanos Ruler rules API:

      # oc -n openshift-user-workload-monitoring exec -c thanos-ruler thanos-ruler-user-workload-0  -- curl -k -H "Authorization: Bearer $token" 'https://thanos-ruler.openshift-user-workload-monitoring.svc:9091/api/v1/rules' | jq  '.data.groups[].rules[] | select(.name=="NoRunningOvnControlPlane")'
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100 22309    0 22309    0     0   660k      0 --:--:-- --:--:-- --:--:--  660k
      {
        "state": "firing",
        "name": "NoRunningOvnControlPlane",
        "query": "absent(up{job=\"ovnkube-control-plane\",namespace=\"clusters-hypershift-ci-12900\"} == 1)",
        "duration": 300,
        "labels": {
          "namespace": "clusters-hypershift-ci-12900",
          "severity": "critical",
          "thanos_ruler_replica": "thanos-ruler-user-workload-1"
        },
        "annotations": {
          "description": "Networking control plane is degraded. Networking configuration updates applied to the cluster will not be\nimplemented while there are no OVN Kubernetes pods.\n",
          "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/NoRunningOvnMaster.md",
          "summary": "There is no running ovn-kubernetes control plane."
        }
      ...

      Labels on the clusters-hypershift-ci-12900 namespace:

      # oc get ns clusters-hypershift-ci-12900 -oyaml
      apiVersion: v1
      kind: Namespace
      metadata:
        annotations:
          openshift.io/sa.scc.mcs: s0:c26,c25
          openshift.io/sa.scc.supplemental-groups: 1000700000/10000
          openshift.io/sa.scc.uid-range: 1000700000/10000
        creationTimestamp: "2024-01-22T03:29:49Z"
        labels:
          hypershift.openshift.io/hosted-control-plane: "true"
          hypershift.openshift.io/monitoring: "true"
          kubernetes.io/metadata.name: clusters-hypershift-ci-12900
          pod-security.kubernetes.io/audit: restricted
          pod-security.kubernetes.io/enforce: restricted
          pod-security.kubernetes.io/warn: restricted
          security.openshift.io/scc.podSecurityLabelSync: "false"
        name: clusters-hypershift-ci-12900 
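
Applying the namespace selection rule described earlier to these labels can be sketched as follows. The function monitoring_stack_for is a hypothetical helper written for this report. It encodes only the two label checks stated above and is not the Cluster Monitoring Operator's code.

```python
from typing import Dict, Optional


def monitoring_stack_for(labels: Dict[str, str], user_workload_enabled: bool) -> Optional[str]:
    """Return the monitoring namespace that would pick up a namespace with these labels."""
    if labels.get("openshift.io/cluster-monitoring") == "true":
        return "openshift-monitoring"
    if user_workload_enabled and labels.get("openshift.io/user-monitoring") != "false":
        return "openshift-user-workload-monitoring"
    return None


# Subset of the labels shown on the clusters-hypershift-ci-12900 namespace.
hosted_control_plane_labels = {
    "hypershift.openshift.io/hosted-control-plane": "true",
    "hypershift.openshift.io/monitoring": "true",
    "kubernetes.io/metadata.name": "clusters-hypershift-ci-12900",
}

print(monitoring_stack_for(hosted_control_plane_labels, user_workload_enabled=True))
print(monitoring_stack_for(hosted_control_plane_labels, user_workload_enabled=False))
```

Under this simplified rule the namespace is picked up only once user workload monitoring is enabled, which matches the point at which the alert starts firing.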

      Querying with absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1) returns a sample with value 1, so the alert condition is met:

      # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1)' | jq
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {},
              "value": [
                1705925585.007,
                "1"
              ]
            }
          ],
          "analysis": {}
        }
      }

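      The up series for both ovnkube-control-plane targets in that namespace, scraped by the user workload Prometheus, have value 0: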

      # oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} ' | jq
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {
                "__name__": "up",
                "_id": "f3c9d9e7-4676-42d8-bbd3-2e9337f1c44e",
                "container": "ovnkube-control-plane",
                "endpoint": "metrics",
                "instance": "10.129.2.62:9108",
                "job": "ovnkube-control-plane",
                "namespace": "clusters-hypershift-ci-12900",
                "pod": "ovnkube-control-plane-5fdc875547-pc8pm",
                "prometheus": "openshift-user-workload-monitoring/user-workload",
                "service": "ovn-kubernetes-control-plane"
              },
              "value": [
                1705925726.478,
                "0"
              ]
            },
            {
              "metric": {
                "__name__": "up",
                "_id": "f3c9d9e7-4676-42d8-bbd3-2e9337f1c44e",
                "container": "ovnkube-control-plane",
                "endpoint": "metrics",
                "instance": "10.131.0.39:9108",
                "job": "ovnkube-control-plane",
                "namespace": "clusters-hypershift-ci-12900",
                "pod": "ovnkube-control-plane-5fdc875547-4pd6s",
                "prometheus": "openshift-user-workload-monitoring/user-workload",
                "service": "ovn-kubernetes-control-plane"
              },
              "value": [
                1705925726.478,
                "0"
              ]
            }
          ],
          "analysis": {}
        }
      }
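
Why these samples trigger the alert can be sketched with a small Python model of `absent(up{...} == 1)`. The helpers below are simplified stand-ins written for this illustration, not Prometheus code: the comparison filter keeps only samples equal to 1, and absent yields a single 1 when nothing is left.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def filter_equal(samples: List[Sample], target: float) -> List[Sample]:
    """Model `vector == scalar`: keep only samples whose value equals target."""
    return [sample for sample in samples if sample[1] == target]


def absent(samples: List[Sample]) -> List[float]:
    """Model `absent(v)`: a one-element result with value 1 if v is empty, else empty."""
    return [] if samples else [1.0]


# The two up samples from the query above, both with value 0.
up_samples: List[Sample] = [
    ({"pod": "ovnkube-control-plane-5fdc875547-pc8pm"}, 0.0),
    ({"pod": "ovnkube-control-plane-5fdc875547-4pd6s"}, 0.0),
]

alert_condition = absent(filter_equal(up_samples, 1.0))
print(alert_condition)
```

If either target reported up == 1, the filtered vector would be non-empty and absent would return nothing, so the alert would not fire.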

      The ovnkube-control-plane pods in clusters-hypershift-ci-12900 are nonetheless running:

      # oc -n clusters-hypershift-ci-12900 get pod | grep ovn
      ovnkube-control-plane-5fdc875547-4pd6s                3/3     Running     0          8h
      ovnkube-control-plane-5fdc875547-pc8pm                3/3     Running     0          8h
      

       

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-2024-01-21-154905 hypershift management cluster

      How reproducible:

      Only on the HyperShift management cluster; the issue does not occur on the guest cluster.

      Steps to Reproduce:

      1. See the description above.

      Actual results:

      The NoRunningOvnControlPlane alert fires.

      Expected results:

      The alert does not fire.

      Additional info:

      If this behavior is expected, this bug can be closed.

              rh-ee-bclement Borja Clemente Castanera
              juzhao@redhat.com Junqi Zhao
              Jie Zhao