Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.14.0
Affects Version/s: 4.14
Component/s: Machine Config Operator
Labels:
- mco-triaged
- pre-merge-tested

Severity: Moderate
Regression: No
Sprint: MCO Sprint 237, MCO Sprint 238
sprint_count: 2
Blocked: False
Blocked Reason: None
Target Version: 4.14.0


Description of problem:

When an MCCPoolAlert fires and we then fix the problem that caused it, the alert is not cleared.

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-06-06-212044   True        False         114m    Cluster version is 4.14.0-0.nightly-2023-06-06-212044

How reproducible:

Always

Steps to Reproduce:

1. Create a custom MCP

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [master,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""


2. Label a master node so that it is included in the new custom MCP

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra=""

3. Verify that the alert is fired

alias thanosalerts='curl -s -k -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" https://$(oc get route -n openshift-monitoring thanos-querier -o jsonpath={.spec.host})/api/v1/alerts | jq '

$ thanosalerts |grep alertname
  ....
          "alertname": "MCCPoolAlert",


4. Remove the label from the node to fix the problem

$ oc label node $(oc get nodes -l node-role.kubernetes.io/master -ojsonpath="{.items[0].metadata.name}") node-role.kubernetes.io/infra-

Actual results:

The alert is not removed.

Querying the mcc_pool_alert metric returns two series for the infra pool that differ only in their "alert" label: one with value 0 and one still at value 1.

alias thanosquery='function __lgb() { unset -f __lgb; oc rsh -n openshift-monitoring prometheus-k8s-0 curl -s -k  -H "Authorization: Bearer $(oc -n openshift-monitoring create token prometheus-k8s)" --data-urlencode "query=$1" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query | jq -c | jq; }; __lgb'

$ thanosquery mcc_pool_alert
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Applying custom label for pool",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [
          1686137977.158,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "mcc_pool_alert",
          "alert": "Given both master and custom pools. Defaulting to master: custom infra",
          "container": "oauth-proxy",
          "endpoint": "metrics",
          "instance": "10.130.0.86:9001",
          "job": "machine-config-controller",
          "namespace": "openshift-machine-config-operator",
          "node": "ip-10-0-129-20.us-east-2.compute.internal",
          "pod": "machine-config-controller-76dbddff49-75ggr",
          "pool": "infra",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-controller"
        },
        "value": [
          1686137977.158,
          "1"
        ]
      }
    ]
  }
}
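To make the stale series easier to spot, the query output can be reduced to the pool and "alert" label of every series whose value is still "1". This is only a sketch: firing_pool_alerts is a hypothetical helper name, and it assumes jq is installed and that the input has the same shape as the response shown above.

```shell
# Hypothetical helper: reads a Prometheus instant-query response on stdin
# and prints "<pool><TAB><alert>" for every series whose value is "1".
firing_pool_alerts() {
  jq -r '.data.result[]
         | select(.value[1] == "1")
         | [.metric.pool, .metric.alert]
         | @tsv'
}
```

In an interactive shell where the thanosquery alias is defined, it could be used as `thanosquery mcc_pool_alert | firing_pool_alerts`. After the label is removed, any line still printed for the infra pool corresponds to a series that was not reset.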

Expected results:

The alert should be removed.
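One possible way to check the expected behavior after step 4 is to poll the alerts endpoint until MCCPoolAlert is no longer listed. The sketch below is only an illustration and requires bash: wait_for, mcc_alert_cleared and the ALERTS_CMD variable are hypothetical names, not part of oc or the MCO. ALERTS_CMD is assumed to hold a command that prints the alerts JSON, for example the curl pipeline from the thanosalerts alias.

```shell
# wait_for ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_for() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Succeeds when the output of $ALERTS_CMD (hypothetical variable, see above)
# does not contain an alert named MCCPoolAlert.
mcc_alert_cleared() {
  ! eval "$ALERTS_CMD" | grep -q '"alertname": *"MCCPoolAlert"'
}
```

For example, `wait_for 30 10 mcc_alert_cleared` would retry for about five minutes; the attempt count and delay are arbitrary and should be adjusted to the alerting rule's evaluation interval.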

Additional info:

If we delete the machine-config-controller pod, the new pod reports mcc_pool_alert with the correct value and the stale series are removed. After applying this workaround, the alert is cleared.

links to

openshift/machine-config-operator#3733: OCPBUGS-14674: set pool alert back to zero in more default scenarios.

RHSA-2023:5006 OpenShift Container Platform 4.14.z security update

Assignee: Charles Doern

Reporter: Sergio Regidor de la Rosa

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 4

Created: 2023/06/07 11:51 AM

Updated: 2023/10/31 1:41 PM

Resolved: 2023/10/31 1:41 PM
