Bug
Resolution: Unresolved
Normal
None
4.14
None
Moderate
None
False
Description of problem:
I have 3 Agent-provider hosted clusters installed, all configured with a metallb-ingress. Everything appears to be working as expected; I can access the hosted Prometheus over the ingress, etc. However, the MCE cluster is reporting this alert for all 3 hosted clusters:

    100% of the ovnkube-control-plane/ovn-kubernetes-control-plane targets in Namespace clusters-test[1-3] namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.

Is it possible this is a lingering alert state that was not cleared after it fired during the initial part of the hosted cluster creation, before the metallb-ingress was created? I am not finding any errors in the pod states:

clusters-test1   ovnkube-control-plane-76b46b8b94-2kmlf   3/3   Running   0   6d2h
clusters-test1   ovnkube-control-plane-76b46b8b94-b4p7r   3/3   Running   0   6d2h
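One way to distinguish a stale alert from a live scrape failure is to query the `up` series behind the alert directly on the MCE cluster. This is a minimal sketch, assuming the default openshift-monitoring thanos-querier route and that the alert is the stock TargetDown rule evaluated over `up`; the service label value is taken from the alert text and may differ in practice:

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
# up == 0 means the scrape is failing right now; an empty result would mean
# the target is no longer present in the scrape configuration at all.
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://$HOST/api/v1/query" \
  --data-urlencode 'query=up{namespace=~"clusters-test[1-3]",service="ovn-kubernetes-control-plane"}'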
Must-gathers are here (behind VPN): http://perfscale.perf.eng.bos2.dc.redhat.com/pub/jhopper/OCP4/debug/agent_hcp_mg/
Version-Release number of selected component (if applicable):
OCP 4.14.21
CNV 4.14.6
multicluster-engine.v2.5.3
metallb-operator.v4.14.0-202406111908
How reproducible:
Alert fires for all 3 hosted clusters
Steps to Reproduce:
1. Install baremetal agent-provider Hosted Clusters as per the docs.
2. Configure ingress after the initial HC creation stage.
3. Check alerts on the mgmt/hub cluster (see the sketch below).
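For step 3, a minimal sketch for listing the firing alerts from the hub's Prometheus (assuming the default in-cluster monitoring stack and that curl is available in the prometheus container; the alertname filter is an assumption based on the alert text, which matches the stock TargetDown rule):

# alertname assumed from the alert description text; adjust if the rule differs
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s 'http://localhost:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname == "TargetDown")'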
Actual results:
OVN target-unreachable alerts fire on the MCE cluster for all 3 hosted clusters.
Expected results:
No alert should fire when no ClusterOperator (CO) or OVN pod shows errors.
Additional info:
ex:

Annotations:      hypershift.openshift.io/release-image: quay.io/openshift-release-dev/ocp-release:4.14.21-multi
                  k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.130.0.208/23"],"mac_address":"0a:58:0a:82:00:d0","gateway_ips":["10.130.0.1"],"routes":[{"dest":"10.128.0....
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.130.0.208" ], "mac": "0a:58:0a:82:00:d0", "default": true, "dns": {} }]
                  networkoperator.openshift.io/cluster-network-cidr: 10.132.0.0/14
                  networkoperator.openshift.io/hybrid-overlay-status: disabled
                  networkoperator.openshift.io/ip-family-mode: single-stack
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.130.0.208
IPs:
  IP:             10.130.0.208
Controlled By:    ReplicaSet/ovnkube-control-plane-76b46b8b94

Inside the hosted cluster, all OVN pods look OK:

openshift-ovn-kubernetes   ovnkube-node-67v22   8/8   Running   1 (6d2h ago)   6d2h
openshift-ovn-kubernetes   ovnkube-node-6mqdp   8/8   Running   0              6d2h
openshift-ovn-kubernetes   ovnkube-node-stpxp   8/8   Running   9 (6d1h ago)   6d2h
openshift-ovn-kubernetes   ovnkube-node-wlplr   8/8   Running   0              6d2h

All HC COs are available. Note that the alert continues to fire even after a successful minor-version (.20->.21) upgrade of both the MCE cluster and the Hosted Clusters.

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.21   True        False         False      6d6h
baremetal                                  4.14.21   True        False         False      66d
cloud-controller-manager                   4.14.21   True        False         False      66d
cloud-credential                           4.14.21   True        False         False      66d
cluster-autoscaler                         4.14.21   True        False         False      66d
config-operator                            4.14.21   True        False         False      66d
console                                    4.14.21   True        False         False      22d
control-plane-machine-set                  4.14.21   True        False         False      66d
csi-snapshot-controller                    4.14.21   True        False         False      66d
dns                                        4.14.21   True        False         False      66d
etcd                                       4.14.21   True        False         False      66d
image-registry                             4.14.21   True        False         False      66d
ingress                                    4.14.21   True        False         False      66d
insights                                   4.14.21   True        False         False      34d
kube-apiserver                             4.14.21   True        False         False      66d
kube-controller-manager                    4.14.21   True        False         False      66d
kube-scheduler                             4.14.21   True        False         False      66d
kube-storage-version-migrator              4.14.21   True        False         False      6d6h
machine-api                                4.14.21   True        False         False      66d
machine-approver                           4.14.21   True        False         False      66d
machine-config                             4.14.21   True        False         False      34d
marketplace                                4.14.21   True        False         False      66d
monitoring                                 4.14.21   True        False         False      66d
network                                    4.14.21   True        False         False      66d
node-tuning                                4.14.21   True        False         False      6d7h
openshift-apiserver                        4.14.21   True        False         False      37d
openshift-controller-manager               4.14.21   True        False         False      66d
openshift-samples                          4.14.21   True        False         False      6d7h
operator-lifecycle-manager                 4.14.21   True        False         False      66d
operator-lifecycle-manager-catalog         4.14.21   True        False         False      66d
operator-lifecycle-manager-packageserver   4.14.21   True        False         False      35d
service-ca                                 4.14.21   True        False         False      66d
storage                                    4.14.21   True        False         False      66d

# oc --kubeconfig kubeconfig-test1 describe svc -n openshift-ingress metallb-ingress
Name:                     metallb-ingress
Namespace:                openshift-ingress
Labels:                   <none>
Annotations:              metallb.universe.tf/address-pool: ingress-ip
                          metallb.universe.tf/ip-allocated-from-pool: ingress-ip
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.31.135.76
IPs:                      172.31.135.76
LoadBalancer Ingress:     198.18.10.11
Port:                     http  80/TCP
TargetPort:               80/TCP
NodePort:                 http  30200/TCP
Endpoints:                198.18.10.32:80
Port:                     https  443/TCP
TargetPort:               443/TCP
NodePort:                 https  30136/TCP
Endpoints:                198.18.10.32:443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason        Age                 From             Message
  ----    ------        ----                ----             -------
  Normal  nodeAssigned  123m (x6 over 20h)  metallb-speaker  announcing from node "worker02" with protocol "layer2"
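Since the pods and ClusterOperators all look healthy, the scrape path itself may be worth checking on the MCE cluster. A minimal sketch, assuming the Service/Endpoints object is named after the service label in the alert (ovn-kubernetes-control-plane) and is scraped via a ServiceMonitor in the hosted control-plane namespace:

# An empty Endpoints object while the pods are Running would point at the
# service selector or metrics port rather than at the pods themselves.
oc get endpoints -n clusters-test1 ovn-kubernetes-control-plane -o wide
oc get servicemonitors -n clusters-test1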