Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: 4.16.0
Sprint: Hypershift Sprint 249
Description of problem:
On a HyperShift management cluster with user workload monitoring disabled, there is no NoRunningOvnControlPlane alert:
$ oc -n openshift-user-workload-monitoring get pod
No resources found in openshift-user-workload-monitoring namespace.

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS' | jq '.data.result[].metric | {alertname: .alertname, alertstate: .alertstate}'
{
  "alertname": "AlertmanagerReceiversNotConfigured",
  "alertstate": "firing"
}
{
  "alertname": "CannotRetrieveUpdates",
  "alertstate": "firing"
}
{
  "alertname": "Watchdog",
  "alertstate": "firing"
}
Enable user workload monitoring on the HyperShift management cluster as below:
$ oc create -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
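Once the ConfigMap is applied, the cluster-monitoring-operator rolls out the user workload monitoring stack. One way to wait for it to settle (a minimal sketch; the 300s timeout is an arbitrary choice, not from the original report):

$ oc -n openshift-user-workload-monitoring wait --for=condition=Ready pod --all --timeout=300s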
Wait at least 5 minutes; NoRunningOvnControlPlane fires:
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-77b44bfd69-wk5x2   2/2     Running   0          3h2m
prometheus-user-workload-0             6/6     Running   0          3h2m
prometheus-user-workload-1             6/6     Running   0          3h2m
thanos-ruler-user-workload-0           4/4     Running   0          3h2m
thanos-ruler-user-workload-1           4/4     Running   0          3h2m

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS' | jq '.data.result[].metric | {alertname: .alertname, alertstate: .alertstate}'
...
{
  "alertname": "NoRunningOvnControlPlane",
  "alertstate": "firing"
}
...
And querying the alert directly:
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="NoRunningOvnControlPlane"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "NoRunningOvnControlPlane",
          "alertstate": "firing",
          "namespace": "clusters-hypershift-ci-12900",
          "severity": "critical"
        },
        "value": [
          1705915247.862,
          "1"
        ]
      }
    ],
    "analysis": {}
  }
}
NoRunningOvnControlPlane is defined in the clusters-hypershift-ci-12900 namespace:
$ oc -n clusters-hypershift-ci-12900 get prometheusrules master-rules -oyaml
...
    - alert: NoRunningOvnControlPlane
      annotations:
        description: |
          Networking control plane is degraded. Networking configuration updates applied to the cluster will not be
          implemented while there are no OVN Kubernetes pods.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/NoRunningOvnMaster.md
        summary: There is no running ovn-kubernetes control plane.
      expr: |
        absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)
      for: 5m
      labels:
        namespace: clusters-hypershift-ci-12900
        severity: critical
A namespace with the openshift.io/cluster-monitoring: "true" label is monitored by openshift-monitoring; a namespace without the openshift.io/user-monitoring: "false" label is monitored by openshift-user-workload-monitoring (if user workload monitoring is enabled). Following this rule, the clusters-hypershift-ci-12900 namespace is monitored by openshift-user-workload-monitoring, and the namespace value in a defined PrometheusRule is overwritten with the namespace where the PrometheusRule resides, in this case clusters-hypershift-ci-12900. So the expr in NoRunningOvnControlPlane is changed from
absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)
to
absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1)
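This rewrite is the Prometheus Operator's enforcedNamespaceLabel enforcement applied to user workload rules. As a sketch of how to verify it (assuming the user workload Prometheus and ThanosRuler custom resources are both named user-workload, as the pod names above suggest; the resource names are an assumption, not from the original report):

$ oc -n openshift-user-workload-monitoring get prometheus user-workload -o jsonpath='{.spec.enforcedNamespaceLabel}{"\n"}'
$ oc -n openshift-user-workload-monitoring get thanosruler user-workload -o jsonpath='{.spec.enforcedNamespaceLabel}{"\n"}'

Both would be expected to print namespace, matching the rewritten query shown below.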
This can also be confirmed from the Thanos Ruler rules API:
# oc -n openshift-user-workload-monitoring exec -c thanos-ruler thanos-ruler-user-workload-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-ruler.openshift-user-workload-monitoring.svc:9091/api/v1/rules' | jq '.data.groups[].rules[] | select(.name=="NoRunningOvnControlPlane")'
{
  "state": "firing",
  "name": "NoRunningOvnControlPlane",
  "query": "absent(up{job=\"ovnkube-control-plane\",namespace=\"clusters-hypershift-ci-12900\"} == 1)",
  "duration": 300,
  "labels": {
    "namespace": "clusters-hypershift-ci-12900",
    "severity": "critical",
    "thanos_ruler_replica": "thanos-ruler-user-workload-1"
  },
  "annotations": {
    "description": "Networking control plane is degraded. Networking configuration updates applied to the cluster will not be\nimplemented while there are no OVN Kubernetes pods.\n",
    "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/cluster-network-operator/NoRunningOvnMaster.md",
    "summary": "There is no running ovn-kubernetes control plane."
  }
...
Labels on the clusters-hypershift-ci-12900 namespace:
# oc get ns clusters-hypershift-ci-12900 -oyaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c25
    openshift.io/sa.scc.supplemental-groups: 1000700000/10000
    openshift.io/sa.scc.uid-range: 1000700000/10000
  creationTimestamp: "2024-01-22T03:29:49Z"
  labels:
    hypershift.openshift.io/hosted-control-plane: "true"
    hypershift.openshift.io/monitoring: "true"
    kubernetes.io/metadata.name: clusters-hypershift-ci-12900
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    security.openshift.io/scc.podSecurityLabelSync: "false"
  name: clusters-hypershift-ci-12900
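Note the namespace carries hypershift.openshift.io/monitoring: "true" but has neither an openshift.io/cluster-monitoring label nor an openshift.io/user-monitoring: "false" opt-out, which is why it falls to the user workload stack. A quick way to check which namespaces each rule above would select (a sketch using plain label selectors; note != also matches namespaces missing the key):

$ oc get ns -l openshift.io/cluster-monitoring=true
$ oc get ns -l 'openshift.io/user-monitoring!=false' | grep ^clusters-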
Searching with absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1) returns 1:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=absent(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 1)' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {},
        "value": [
          1705925585.007,
          "1"
        ]
      }
    ],
    "analysis": {}
  }
}
More info: querying up directly shows both targets are scraped but down:
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "_id": "f3c9d9e7-4676-42d8-bbd3-2e9337f1c44e",
          "container": "ovnkube-control-plane",
          "endpoint": "metrics",
          "instance": "10.129.2.62:9108",
          "job": "ovnkube-control-plane",
          "namespace": "clusters-hypershift-ci-12900",
          "pod": "ovnkube-control-plane-5fdc875547-pc8pm",
          "prometheus": "openshift-user-workload-monitoring/user-workload",
          "service": "ovn-kubernetes-control-plane"
        },
        "value": [
          1705925726.478,
          "0"
        ]
      },
      {
        "metric": {
          "__name__": "up",
          "_id": "f3c9d9e7-4676-42d8-bbd3-2e9337f1c44e",
          "container": "ovnkube-control-plane",
          "endpoint": "metrics",
          "instance": "10.131.0.39:9108",
          "job": "ovnkube-control-plane",
          "namespace": "clusters-hypershift-ci-12900",
          "pod": "ovnkube-control-plane-5fdc875547-4pd6s",
          "prometheus": "openshift-user-workload-monitoring/user-workload",
          "service": "ovn-kubernetes-control-plane"
        },
        "value": [
          1705925726.478,
          "0"
        ]
      }
    ],
    "analysis": {}
  }
}
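Both series exist but report 0, so up{...} == 1 matches nothing and absent(...) evaluates to 1; the alert fires because the scrape is failing, not because the pods are missing. A quick sanity check using the same token and query endpoint as above:

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=count(up{job="ovnkube-control-plane", namespace="clusters-hypershift-ci-12900"} == 0)' | jq '.data.result[0].value[1]'

This should return "2", one per failing scrape target.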
ovnkube pods in clusters-hypershift-ci-12900 are Running:
# oc -n clusters-hypershift-ci-12900 get pod | grep ovn
ovnkube-control-plane-5fdc875547-4pd6s   3/3   Running   0   8h
ovnkube-control-plane-5fdc875547-pc8pm   3/3   Running   0   8h
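The pods are Running while up == 0, which points at the scrape path rather than at the workload itself. Given the linked OCPBUGS-38614 (missing ServiceMonitor for OVN), a plausible next check, sketched here as a suggestion rather than a step from the original report, is the ServiceMonitor and the endpoints of the ovn-kubernetes-control-plane service seen in the up labels:

$ oc -n clusters-hypershift-ci-12900 get servicemonitor | grep -i ovn
$ oc -n clusters-hypershift-ci-12900 get endpoints ovn-kubernetes-control-plane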
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-21-154905 (HyperShift management cluster)
How reproducible:
Only reproducible on a HyperShift management cluster; there is no such issue on a guest cluster.
Steps to Reproduce:
1. See the description above.
Actual results:
The NoRunningOvnControlPlane alert fires on the management cluster.
Expected results:
The alert does not fire.
Additional info:
If this is expected behavior, the bug can be closed.
duplicates:
- OCPBUGS-54533 ARO-HCP: Prometheus metrics scraping failing for ovnkube-control-plane (Closed)

is duplicated by:
- OCPBUGS-38614 Hypershift - missing ServiceMonitor for OVN (Closed)