Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: 4.16.0
Sprint: Hypershift Sprint 249
Description of problem:
On a HyperShift management cluster with user workload monitoring disabled, there is no NoRunningOvnControlPlane alert:
$ oc -n openshift-user-workload-monitoring get pod
No resources found in openshift-user-workload-monitoring namespace.

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS' | jq '.data.result[].metric | {alertname: .alertname, alertstate: .alertstate}'
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "alertstate": "firing"
}
{
  "alertname": "CannotRetrieveUpdates",
  "alertstate": "firing"
}
{
  "alertname": "Watchdog",
  "alertstate": "firing"
}
Enable user workload monitoring on the HyperShift management cluster as below:
$ oc create -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
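Once the ConfigMap is applied, the cluster-monitoring-operator rolls out the user workload monitoring stack. One way to wait for it to settle (a minimal sketch; the 300s timeout is an arbitrary choice, not from the original report):

$ oc -n openshift-user-workload-monitoring wait --for=condition=Ready pod --all --timeout=300s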
Wait at least 5 minutes; NoRunningOvnControlPlane fires:
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-77b44bfd69-wk5x2   2/2     Running   0          3h2m
prometheus-user-workload-0             6/6     Running   0          3h2m
prometheus-user-workload-1             6/6     Running   0          3h2m
thanos-ruler-user-workload-0           4/4     Running   0          3h2m
thanos-ruler-user-workload-1           4/4     Running   0          3h2m

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS' | jq '.data.result[].metric | {alertname: .alertname, alertstate: .alertstate}'
...
{
  "alertname": "NoRunningOvnControlPlane",
  "alertstate": "firing"
}
...
And querying the alert directly:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="NoRunningOvnControlPlane"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "NoRunningOvnControlPlane",
          "alertstate": "firing",
          "namespace": "clusters-hypershift-ci-12900",
          "severity": "critical"
        },
        "value": [
          1705915247.862,
          "1"
        ]
      }
    ],
    "analysis": {}
  }
}
NoRunningOvnControlPlane is defined in the clusters-hypershift-ci-12900 namespace:
$ oc -n clusters-hypershift-ci-12900 get prometheusrules master-rules -oyaml
...
    - alert: NoRunningOvnControlPlane
      annotations:
        description: |
          Networking control plane is degraded. Networking configuration updates applied to the cluster will not be
          implemented while there are no OVN Kubernetes pods.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/NoRunningOvnMaster.md
        summary: There is no running ovn-kubernetes control plane.
      expr: |
        absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)
      for: 5m
      labels:
        namespace: clusters-hypershift-ci-12900
        severity: critical
A namespace with the openshift.io/cluster-monitoring: "true" label is monitored by openshift-monitoring; a namespace without the openshift.io/user-monitoring: "false" label is monitored by openshift-user-workload-monitoring (if user workload monitoring is enabled). Following this rule, the clusters-hypershift-ci-12900 namespace is monitored by openshift-user-workload-monitoring, and the namespace value in a defined PrometheusRule is overwritten with the namespace where the PrometheusRule resides, in this case clusters-hypershift-ci-12900. So the expr in NoRunningOvnControlPlane is changed from
absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)
to
absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1)
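This rewrite is the Prometheus Operator's enforcedNamespaceLabel enforcement applied to user workload rules. As a sketch of how to verify it (assuming the user workload Prometheus and ThanosRuler custom resources are both named user-workload, as the pod names above suggest; the resource names are an assumption, not from the original report):

$ oc -n openshift-user-workload-monitoring get prometheus user-workload -o jsonpath='{.spec.enforcedNamespaceLabel}{"\n"}'
$ oc -n openshift-user-workload-monitoring get thanosruler user-workload -o jsonpath='{.spec.enforcedNamespaceLabel}{"\n"}'

Both would be expected to print namespace, matching the rewritten query shown below.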
This can also be confirmed from the Thanos Ruler rules API:
# oc -n openshift-user-workload-monitoring exec -c thanos-ruler thanos-ruler-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-ruler.openshift-user-workload-monitoring.svc:9091/api/v1/rules' | jq '.data.groups[].rules[] | select(.name=="NoRunningOvnControlPlane")'
{
  "state": "firing",
  "name": "NoRunningOvnControlPlane",
  "query": "absent(up{job=\"ovnkube-control-plane\",namespace=\"clusters-hypershift-ci-12900\"} == 1)",
  "duration": 300,
  "labels": {
    "namespace": "clusters-hypershift-ci-12900",
    "severity": "critical",
    "thanos_ruler_replica": "thanos-ruler-user-workload-1"
  },
  "annotations": {
    "description": "Networking control plane is degraded. Networking configuration updates applied to the cluster will not be\nimplemented while there are no OVN Kubernetes pods.\n",
    "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/NoRunningOvnMaster.md",
    "summary": "There is no running ovn-kubernetes control plane."
  }
...
Labels on the clusters-hypershift-ci-12900 namespace:
# oc get ns clusters-hypershift-ci-12900 -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c25
    openshift.io/sa.scc.supplemental-groups: 1000700000/10000
    openshift.io/sa.scc.uid-range: 1000700000/10000
  creationTimestamp: "2024-01-22T03:29:49Z"
  labels:
    hypershift.openshift.io/hosted-control-plane: "true"
    hypershift.openshift.io/monitoring: "true"
    kubernetes.io/metadata.name: clusters-hypershift-ci-12900
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    security.openshift.io/scc.podSecurityLabelSync: "false"
  name: clusters-hypershift-ci-12900
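Note the namespace carries hypershift.openshift.io/monitoring: "true" but has neither an openshift.io/cluster-monitoring label nor an openshift.io/user-monitoring: "false" opt-out, which is why it falls to the user workload stack. A quick way to check which namespaces each rule above would select (a sketch using plain label selectors; note != also matches namespaces missing the key):

$ oc get ns -l openshift.io/cluster-monitoring=true
$ oc get ns -l 'openshift.io/user-monitoring!=false' | grep ^clusters-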
Searching with absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1) returns 1:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1705925585.007,
          "1"
        ]
      }
    ],
    "analysis": {}
  }
}
More info: querying up directly shows both targets are scraped but down:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "_id": "f3c9d9e7-4676-42d8-bbd3-2e9337f1c44e",
          "container": "ovnkube-control-plane",
          "endpoint": "metrics",
          "instance": "10.129.2.62:9108",
          "job": "ovnkube-control-plane",
          "namespace": "clusters-hypershift-ci-12900",
          "pod": "ovnkube-control-plane-5fdc875547-pc8pm",
          "prometheus": "openshift-user-workload-monitoring/user-workload",
          "service": "ovn-kubernetes-control-plane"
        },
        "value": [
          1705925726.478,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "up",
          "_id": "f3c9d9e7-4676-42d8-bbd3-2e9337f1c44e",
          "container": "ovnkube-control-plane",
          "endpoint": "metrics",
          "instance": "10.131.0.39:9108",
          "job": "ovnkube-control-plane",
          "namespace": "clusters-hypershift-ci-12900",
          "pod": "ovnkube-control-plane-5fdc875547-4pd6s",
          "prometheus": "openshift-user-workload-monitoring/user-workload",
          "service": "ovn-kubernetes-control-plane"
        },
        "value": [
          1705925726.478,
          "0"
        ]
      }
    ],
    "analysis": {}
  }
}
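Both series exist but report 0, so up{...} == 1 matches nothing and absent(...) evaluates to 1; the alert fires because the scrape is failing, not because the pods are missing. A quick sanity check using the same token and query endpoint as above:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=count(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 0)' | jq '.data.result[0].value[1]'

This should return "2", one per failing scrape target.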
ovnkube pods in clusters-hypershift-ci-12900 are Running:
# oc -n clusters-hypershift-ci-12900 get pod | grep ovn
ovnkube-control-plane-5fdc875547-4pd6s   3/3   Running   0   8h
ovnkube-control-plane-5fdc875547-pc8pm   3/3   Running   0   8h
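The pods are Running while up == 0, which points at the scrape path rather than at the workload itself. Given the linked OCPBUGS-38614 (missing ServiceMonitor for OVN), a plausible next check, sketched here as a suggestion rather than a step from the original report, is the ServiceMonitor and the endpoints of the ovn-kubernetes-control-plane service seen in the up labels:

$ oc -n clusters-hypershift-ci-12900 get servicemonitor | grep -i ovn
$ oc -n clusters-hypershift-ci-12900 get endpoints ovn-kubernetes-control-plane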
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-21-154905 (HyperShift management cluster)
How reproducible:
Only reproducible on a HyperShift management cluster; there is no such issue on a guest cluster.
Steps to Reproduce:
1. See the description above.
Actual results:
The NoRunningOvnControlPlane alert fires on the management cluster.
Expected results:
The alert does not fire.
Additional info:
If this is expected behavior, the bug can be closed.
duplicates:
- OCPBUGS-54533 ARO-HCP: Prometheus metrics scraping failing for ovnkube-control-plane (Closed)

is duplicated by:
- OCPBUGS-38614 Hypershift - missing ServiceMonitor for OVN (Closed)