- Bug
- Resolution: Done
- 4.13
- Quality / Stability / Reliability
- Moderate
- MCO Sprint 230, MCO Sprint 231
- 2
- Done
- Bug Fix
Description of problem:
When an MCCDrainError alert fires, the alert's message names the wrong node as the one where the drain problem is happening.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-2022-12-22-120609   True        False         4h59m   Cluster version is 4.13.0-0.nightly-2022-12-22-120609
How reproducible:
Always
Steps to Reproduce:
1. Create a PodDisruptionBudget resource
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
2. Create a pod matching the PodDisruptionBudget
$ oc run --restart=Never --labels app=dontevict --image=docker.io/busybox dont-evict-this-pod -- sleep 3h
3. Create a MC
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        mode: 420
        path: /etc/test
4. Wait 1 hour for the MCCDrainError alert to be triggered
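As a side-check (not a required reproduction step), the base64 Ignition payload in the MachineConfig above can be decoded to see what would be written to /etc/test; the file contents are arbitrary, since any worker MachineConfig change triggers the drain that the PodDisruptionBudget then blocks:

```shell
# Decode the MachineConfig file contents shown in step 3
echo 'c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK' | base64 -d
```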
Actual results:
The firing alert looks like this:
$ curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring alertmanager-main -o jsonpath={.spec.host})/api/v1/alerts | jq
.....
{
  "activeAt": "2022-12-23T11:24:05.807925776Z",
  "annotations": {
    "message": "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may be blocked. For more details check MachineConfigController pod logs: oc logs -f -n openshift-machine-config-operator machine-config-controller-xxxxx -c machine-config-controller"
  },
  "labels": {
    "alertname": "MCCDrainError",
    "container": "oauth-proxy",
    "endpoint": "metrics",
    "instance": "10.130.0.10:9001",
    "job": "machine-config-controller",
    "namespace": "openshift-machine-config-operator",
    "node": "ip-10-0-193-114.us-east-2.compute.internal",
    "pod": "machine-config-controller-5468769874-44tnt",
    "service": "machine-config-controller",
    "severity": "warning"
  },
  "state": "firing",
  "value": "1e+00"
}
The alert message is wrong: the node reported in "Drain failed on ip-10-0-193-114.us-east-2.compute.internal , updates may....." is not the node where the drain problem happened, but the node running the machine-config-controller pod.
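The mismatch is visible in the alert JSON itself: the "node" label matches the node hosting the controller pod rather than the drained node (in OpenShift monitoring that label typically comes from the Prometheus scrape target, not from the drain event). A minimal jq sketch over the labels shown above (the file path /tmp/mcc_alert.json is just for illustration):

```shell
# Reduce the example alert to the two labels that reveal the mismatch
cat > /tmp/mcc_alert.json <<'EOF'
{
  "labels": {
    "alertname": "MCCDrainError",
    "node": "ip-10-0-193-114.us-east-2.compute.internal",
    "pod": "machine-config-controller-5468769874-44tnt"
  }
}
EOF
jq -r '"alert \(.labels.alertname) blames node \(.labels.node), emitted by pod \(.labels.pod)"' /tmp/mcc_alert.json
```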
Expected results:
The alert message should name the node where the drain actually failed; pointing to the wrong node can mislead the user.
Additional info:
- is related to: MCO-420 Move MCD drain alert into the MCC, revisit error modes (Closed)