OpenShift Bugs / OCPBUGS-14674

MCCPoolAlert is not removed when the problem that caused the alert is fixed


      Description of problem:

      When an MCCPoolAlert fires and we then fix the problem that caused it, the alert is not cleared.
       

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2023-06-06-212044   True        False         114m    Cluster version is 4.14.0-0.nightly-2023-06-06-212044
       

      How reproducible:

      Always
       

      Steps to Reproduce:

      1. Create a custom MCP
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: infra
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [master,infra]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra: ""
      
      
      2. Label a master node so that it is included in the new custom MCP
      
      $ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra=""
      
      3. Verify that the alert is firing
      
      alias thanosalerts='curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath={.spec.host})/api/v1/alerts | jq '
      
      $ thanosalerts |grep alertname
        ....
                "alertname": "MCCPoolAlert",
      
      
      4. Remove the label from the node to fix the problem
      
      $ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra-
      
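To check whether the alert clears after step 4, a polling loop along the following lines could be used. This is only a sketch: `fetch_alerts` is the `thanosalerts` alias from step 3 rewritten as a function (without the trailing `jq`), and `wait_alert_cleared` is a hypothetical helper, not part of `oc` or the MCO. The timeout and interval values are arbitrary assumptions.

```shell
# Same request as the thanosalerts alias in step 3, as a function so that it
# can be passed to wait_alert_cleared below.
fetch_alerts() {
  curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" \
    "https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath='{.spec.host}')/api/v1/alerts"
}

# Hypothetical helper: polls COMMAND until its output no longer contains the
# named alert, or until TIMEOUT seconds have been waited.
# Usage: wait_alert_cleared ALERTNAME TIMEOUT INTERVAL COMMAND [ARGS...]
# Caveat: if COMMAND itself fails and prints nothing, the alert is reported
# as cleared, so check that the query works before relying on the result.
wait_alert_cleared() {
  local name=$1
  local timeout=$2
  local interval=$3
  shift 3
  local waited=0
  # Matches both pretty-printed and compact JSON ("alertname": "X" / "alertname":"X").
  while "$@" | grep -q "\"alertname\": *\"$name\""; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "$name is still present after ${waited}s"
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "$name cleared after ${waited}s"
}

# Example (assumed values): wait up to 10 minutes, polling every 30 seconds.
#   wait_alert_cleared MCCPoolAlert 600 30 fetch_alerts
```

With the behavior described in this report, the helper would be expected to time out and return a non-zero status.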
      

      Actual results:

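      The alert is not cleared.
      
      Querying the mcc_pool_alert metric returns 2 series for the infra pool with different "alert" labels. The "Applying custom label for pool" series has value 0, but the "Given both master and custom pools. Defaulting to master: custom infra" series still has value 1.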
      
      alias thanosquery='function __lgb() { unset -f __lgb; oc rsh -n openshift-monitoring prometheus-k8s-0 curl -s -k  -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" --data-urlencode "query=$1" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query | jq -c | jq; }; __lgb'
      
      $ thanosquery mcc_pool_alert
      {
        "status": "success",
        "data": {
          "resultType": "vector",
          "result": [
            {
              "metric": {
                "__name__": "mcc_pool_alert",
                "alert": "Applying custom label for pool",
                "container": "oauth-proxy",
                "endpoint": "metrics",
                "instance": "10.130.0.86:9001",
                "job": "machine-config-controller",
                "namespace": "openshift-machine-config-operator",
                "node": "ip-10-0-129-20.us-east-2.compute.internal",
                "pod": "machine-config-controller-76dbddff49-75ggr",
                "pool": "infra",
                "prometheus": "openshift-monitoring/k8s",
                "service": "machine-config-controller"
              },
              "value": [
                1686137977.158,
                "0"
              ]
            },
            {
              "metric": {
                "__name__": "mcc_pool_alert",
                "alert": "Given both master and custom pools. Defaulting to master: custom infra",
                "container": "oauth-proxy",
                "endpoint": "metrics",
                "instance": "10.130.0.86:9001",
                "job": "machine-config-controller",
                "namespace": "openshift-machine-config-operator",
                "node": "ip-10-0-129-20.us-east-2.compute.internal",
                "pod": "machine-config-controller-76dbddff49-75ggr",
                "pool": "infra",
                "prometheus": "openshift-monitoring/k8s",
                "service": "machine-config-controller"
              },
              "value": [
                1686137977.158,
                "1"
              ]
            }
          ]
        }
      }
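To make the stale series easier to spot, the query output can be reduced to the series whose value is still "1". The following is a minimal sketch that relies only on the response fields shown above (`data.result[].metric.pool`, `.metric.alert`, and `.value[1]`); `active_pool_alerts` is a hypothetical helper name.

```shell
# Hypothetical helper: reads a Prometheus instant-query response on stdin and
# prints "pool<TAB>alert" for every series whose sample value is "1".
active_pool_alerts() {
  jq -r '.data.result[]
         | select(.value[1] == "1")
         | [.metric.pool, .metric.alert]
         | @tsv'
}

# Example, using the thanosquery alias defined above:
#   thanosquery mcc_pool_alert | active_pool_alerts
```

For the output above, this would print a single line for the infra pool with the "Given both master and custom pools. Defaulting to master: custom infra" alert text.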
       

      Expected results:

      The alert should be removed.
       

      Additional info:

      If we delete the machine-config-controller pod, a new mcc_pool_alert series is generated with the correct value and the stale series are removed. After applying this workaround, the alert is cleared.
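The workaround could be scripted roughly as follows. The namespace and the pod name in the example come from the metric labels above. The deployment name `machine-config-controller` is an assumption inferred from the pod name, and `restart_mcc` is a hypothetical helper.

```shell
# OC can be overridden (for example OC=echo) to print the commands instead of
# running them against a cluster.
OC=${OC:-oc}
MCO_NS=openshift-machine-config-operator

# Hypothetical helper: deletes the given controller pod and waits for its
# replacement to become ready.
# Assumption: the pod is managed by a deployment named machine-config-controller.
restart_mcc() {
  local pod=$1
  "$OC" delete pod "$pod" -n "$MCO_NS" &&
    "$OC" rollout status deployment/machine-config-controller -n "$MCO_NS"
}

# Example, using the pod name from the metric labels in this report:
#   restart_mcc machine-config-controller-76dbddff49-75ggr
```

Once the new pod is running, querying mcc_pool_alert again should show whether the stale series is gone.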
      
       

              cdoern@redhat.com Charles Doern
              sregidor@redhat.com Sergio Regidor de la Rosa
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa