OpenShift Bugs: OCPBUGS-5188

Wrong message in MCCDrainError alert


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Affects Version/s: 4.13
    • Severity: Moderate
    • Sprint: MCO Sprint 230, MCO Sprint 231
    • Release Note Type: Bug Fix

      Description of problem:

      When an MCCDrainError alert is triggered, the alert's message says that the drain problem is happening on the wrong node.
      
      

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.0-0.nightly-2022-12-22-120609   True        False         4h59m   Cluster version is 4.13.0-0.nightly-2022-12-22-120609
      
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Create a PodDisruptionBudget resource. With minAvailable: 1 and only one matching pod, evicting that pod would violate the budget, so the drain of the node hosting it will fail.
      
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: dontevict
      spec:
        minAvailable: 1
        selector:
          matchLabels:
            app: dontevict
      
      2. Create a pod matching the PodDisruptionBudget
      
      $ oc run --restart=Never --labels app=dontevict --image=docker.io/busybox dont-evict-this-pod -- sleep 3h
      
      
      3. Create a MC
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: test-file
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
            - contents:
                source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
              mode: 420
              path: /etc/test
      
      4. Wait 1 hour for the MCCDrainError alert to be triggered (a command-line sketch of the whole flow follows below)
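
      A minimal command-line sketch of the whole flow, assuming the PodDisruptionBudget and MachineConfig manifests above are saved as pdb.yaml and mc.yaml (hypothetical file names):

      # Step 1: create the PodDisruptionBudget
      $ oc apply -f pdb.yaml
      $ oc get pdb dontevict

      # Step 2: create the pod protected by the budget and note which worker node it lands on
      $ oc run --restart=Never --labels app=dontevict --image=docker.io/busybox dont-evict-this-pod -- sleep 3h
      $ oc get pod dont-evict-this-pod -o wide

      # Optional: decode the file content carried by the MachineConfig
      $ echo 'c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK' | base64 -d

      # Step 3: create the MachineConfig; the MCO starts rolling it out to the worker pool
      $ oc apply -f mc.yaml

      # Step 4: the drain of the node hosting dont-evict-this-pod stalls; after about an hour
      # the MCCDrainError alert fires
      $ oc get mcp worker -w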
      
      

      Actual results:

      
      The alert looks like this:
      
      $ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring alertmanager-main -o jsonpath={.spec.host})/api/v1/alerts | jq 
      .....
       {
          "activeAt": "2022-12-23T11:24:05.807925776Z",
          "annotations": {
              "message": "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may be blocked. For more details check MachineConfigController pod logs: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller"
          },
          "labels": {
              "alertname": "MCCDrainError",
              "container": "oauth-proxy",
              "endpoint": "metrics",
              "instance": "10.130.0.10:9001",
              "job": "machine-config-controller",
              "namespace": "openshift-machine-config-operator",
              "node": "ip-10-0-193-114.us-east-2.compute.internal",
              "pod": "machine-config-controller-5468769874-44tnt",
              "service": "machine-config-controller",
              "severity": "warning"
          },
          "state": "firing",
          "value": "1e+00"
      }
      
      The alert message is wrong: the node reported in "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may....." is not the node where the drain problem happened, but the node running the machine-config-controller pod.
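
      To confirm the mismatch, compare the node hosting the pod protected by the PodDisruptionBudget (the node whose drain actually fails) with the node hosting the machine-config-controller pod (the node named in the alert message). A quick sketch, assuming the reproduction pod from the steps above is still running; the grep pattern is only illustrative:

      # Node whose drain actually fails: the worker hosting the pod protected by the PDB
      $ oc get pod dont-evict-this-pod -o wide

      # Node named in the alert message: the one running the machine-config-controller pod
      $ oc get pods -n openshift-machine-config-operator -o wide | grep machine-config-controller

      # The controller logs identify the node it is really failing to drain
      $ oc logs -n openshift-machine-config-operator deployment/machine-config-controller -c machine-config-controller | grep -i drain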
      
      

      Expected results:

      
      The alert message should report the node where the drain actually failed. Pointing to the wrong node can mislead the user.
      
      

      Additional info:

      
      

              Assignee: Zack Zlotnik (zzlotnik@redhat.com)
              Reporter: Sergio Regidor de la Rosa (sregidor@redhat.com)