Type: Bug
Resolution: Won't Do
Priority: Critical
Affects Version: 4.15
Description of problem:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46064/consoleFull
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46126/console
Version-Release number of selected component (if applicable):
How reproducible:
Two upgrades were attempted; both failed (2/2).
Steps to Reproduce:
Triggered 2 upgrades for template 11_UPI on vSphere 8.0 & FIPS ON & OVN IPSEC & Static Network & Bonding & HW19 & Secureboot (IPSEC E-W only)

1. From 4.13.26-x86_64 -> 4.14.0-0.nightly-2023-12-08-072853 -> 4.15.0-0.nightly-2023-12-09-012410

12-11 16:28:56.968 oc get clusteroperators:
12-11 16:28:56.968 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
12-11 16:28:56.968 authentication 4.15.0-0.nightly-2023-12-09-012410 False False True 104m APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
12-11 16:28:56.968 baremetal 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 cloud-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h43m
12-11 16:28:56.968 cloud-credential 4.15.0-0.nightly-2023-12-09-012410 True False False 5h45m
12-11 16:28:56.968 cluster-autoscaler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 config-operator 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 console 4.15.0-0.nightly-2023-12-09-012410 False False False 107m RouteHealthAvailable: console route is not admitted
12-11 16:28:56.968 control-plane-machine-set 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 csi-snapshot-controller 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 dns 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 etcd 4.15.0-0.nightly-2023-12-09-012410 True False False 5h38m
12-11 16:28:56.968 image-registry 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
12-11 16:28:56.968 ingress 4.15.0-0.nightly-2023-12-09-012410 True False False 108m
12-11 16:28:56.968 insights 4.15.0-0.nightly-2023-12-09-012410 True False False 5h33m
12-11 16:28:56.968 kube-apiserver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h35m
12-11 16:28:56.968 kube-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False True 5h37m GarbageCollectorDegraded: error querying alerts: Post "[https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query]": dial tcp 172.30.77.136:9091: i/o timeout
12-11 16:28:56.968 kube-scheduler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h37m
12-11 16:28:56.968 kube-storage-version-migrator 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
12-11 16:28:56.968 machine-api 4.15.0-0.nightly-2023-12-09-012410 True False False 5h36m
12-11 16:28:56.968 machine-approver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 machine-config 4.14.0-0.nightly-2023-12-08-072853 True False False 5h39m
12-11 16:28:56.968 marketplace 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 monitoring 4.15.0-0.nightly-2023-12-09-012410 False True True 63s UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io federate)
12-11 16:28:56.968 network 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 node-tuning 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
12-11 16:28:56.968 openshift-apiserver 4.15.0-0.nightly-2023-12-09-012410 False False False 97m APIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
12-11 16:28:56.968 openshift-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 openshift-samples 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
12-11 16:28:56.968 operator-lifecycle-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-09-012410 True False False 100m
12-11 16:28:56.968 service-ca 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 storage 4.15.0-0.nightly-2023-12-09-012410 True False False 104m

2. From 4.14.5-x86_64 -> 4.15.0-0.nightly-2023-12-11-033133

% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-12-11-033133 False False True 3h32m APIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
cloud-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h47m
cloud-credential 4.15.0-0.nightly-2023-12-11-033133 True False False 5h50m
cluster-autoscaler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
config-operator 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
console 4.15.0-0.nightly-2023-12-11-033133 False False False 3h30m RouteHealthAvailable: console route is not admitted
control-plane-machine-set 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
csi-snapshot-controller 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
dns 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
etcd 4.15.0-0.nightly-2023-12-11-033133 True False False 5h43m
image-registry 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m
ingress 4.15.0-0.nightly-2023-12-11-033133 True False False 4h22m
insights 4.15.0-0.nightly-2023-12-11-033133 True False False 5h39m
kube-apiserver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m
kube-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False True 5h42m GarbageCollectorDegraded: error fetching rules: Get "[https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules]": dial tcp 172.30.237.96:9091: i/o timeout
kube-scheduler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m
kube-storage-version-migrator 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m
machine-api 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
machine-approver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
machine-config 4.14.5 True False False 5h44m
marketplace 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
monitoring 4.15.0-0.nightly-2023-12-11-033133 False True True 4m32s UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io federate)
network 4.15.0-0.nightly-2023-12-11-033133 True False False 5h44m
node-tuning 4.15.0-0.nightly-2023-12-11-033133 True False False 3h48m
openshift-apiserver 4.15.0-0.nightly-2023-12-11-033133 False False False 11m APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
openshift-samples 4.15.0-0.nightly-2023-12-11-033133 True False False 3h49m
operator-lifecycle-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-11-033133 True False False 2m57s
service-ca 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
storage 4.15.0-0.nightly-2023-12-11-033133 True False False 3h28m

% oc get pods -n openshift-ovn-kubernetes
NAME READY STATUS RESTARTS AGE
ovn-ipsec-host-bn5mm 1/1 Running 0 3h17m
ovn-ipsec-host-dlg5c 1/1 Running 0 3h20m
ovn-ipsec-host-dztzf 1/1 Running 0 3h14m
ovn-ipsec-host-tfflr 1/1 Running 0 3h11m
ovn-ipsec-host-wvkwq 1/1 Running 0 3h10m
ovnkube-control-plane-85b45bf6cf-78tbq 2/2 Running 0 3h30m
ovnkube-control-plane-85b45bf6cf-n5pqn 2/2 Running 0 3h33m
ovnkube-node-4rwk4 8/8 Running 8 3h40m
ovnkube-node-567rz 8/8 Running 8 3h34m
ovnkube-node-c7hv4 8/8 Running 8 3h40m
ovnkube-node-qmw49 8/8 Running 8 3h35m
ovnkube-node-s2nsw 8/8 Running 0 3h36m

Multiple pods on different nodes have connectivity problems.

% oc get pods -n openshift-network-diagnostics -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
network-check-source-5cd74f77cc-mlqvz 1/1 Running 0 134m 10.131.0.25 huirwang-46126-g66cb-compute-0 <none> <none>
network-check-target-824mt 1/1 Running 1 139m 10.130.0.212 huirwang-46126-g66cb-control-plane-2 <none> <none>
network-check-target-dzl7m 1/1 Running 1 140m 10.128.2.46 huirwang-46126-g66cb-compute-1 <none> <none>
network-check-target-l224m 1/1 Running 1 133m 10.129.0.173 huirwang-46126-g66cb-control-plane-1 <none> <none>
network-check-target-qd48q 1/1 Running 1 138m 10.128.0.148 huirwang-46126-g66cb-control-plane-0 <none> <none>
network-check-target-sc8hn 1/1 Running 0 134m 10.131.0.3 huirwang-46126-g66cb-compute-0 <none> <none>

% oc rsh -n openshift-network-diagnostics network-check-source-5cd74f77cc-mlqvz
sh-5.1$ curl 10.130.0.212:8080 --connect-timeout 5
curl: (28) Connection timed out after 5000 milliseconds
sh-5.1$ curl 10.128.2.46:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.129.0.173:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.128.0.148:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.131.0.3:8080 --connect-timeout 5
Hello, 10.131.0.25. You have reached 10.131.0.3 on huirwang-46126-g66cb-compute-0
sh-5.1$
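Only the target pod on the same node as the source (huirwang-46126-g66cb-compute-0) answered; every target on a remote node timed out, which points at east-west pod-to-pod traffic being dropped rather than a single bad pod. Below is a minimal sketch for repeating that check against every network-check-target pod; the namespace and pod names come from the output above, while the loop itself is illustrative and was not part of the original run:

# Hedged re-check sketch: only the namespace and pod names are taken from the output above.
SRC_POD=$(oc -n openshift-network-diagnostics get pods -o name | grep network-check-source | head -n1)
for ip in $(oc -n openshift-network-diagnostics get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' \
    | awk '/^network-check-target/ {print $2}'); do
  echo "== ${SRC_POD} -> ${ip}:8080 =="
  # 5s connect timeout, same as the manual curl checks above
  oc -n openshift-network-diagnostics rsh "${SRC_POD}" curl -sS --connect-timeout 5 "http://${ip}:8080" || true
done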
Actual results:
Upgrade failed.
Expected results:
Upgrade succeeded.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an:
# internal CI failure
# customer issue / SD
# internal Red Hat testing failure
If it is an internal Red Hat testing failure:
* Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).
If it is a CI failure:
* Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
* Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
* Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
* When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
* If it's a connectivity issue,
* What are the srcNode, srcIP, srcNamespace and srcPodName?
* What are the dstNode, dstIP, dstNamespace and dstPodName?
* What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
If it is a customer / SD issue:
* Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
* Don’t presume that Engineering has access to Salesforce.
* Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: [https://access.redhat.com/support/cases/#/case/]<case number>/discussion?attachmentId=<attachment id>
* Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
* Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
** If the issue is in a customer namespace then provide a namespace inspect.
** If it is a connectivity issue:
*** What are the srcNode, srcNamespace, srcPodName and srcPodIP?
*** What are the dstNode, dstNamespace, dstPodName and dstPodIP?
*** What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
*** Please provide the UTC timestamp of the networking outage window from the must-gather
*** Please provide tcpdump pcaps taken during the outage, filtered on the src/dst IPs provided above (see the capture sketch after this list)
** If it is not a connectivity issue:
*** Describe the steps taken so far to analyze the logs from the networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the component where the issue was seen, based on the attached must-gather. Please attach snippets of the relevant logs from around the window in which the problem happened, if any.
* For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
* For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
* Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
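For the tcpdump item above, a minimal capture sketch, assuming RHCOS nodes where tcpdump can be run through oc debug; <node>, <srcPodIP> and <dstPodIP> are placeholders for the values collected from the must-gather, and the output path and 120s duration are arbitrary:

# Hedged capture sketch: run once on the src node and once on the dst node during the outage window.
oc debug node/<node> -- chroot /host \
  timeout 120 tcpdump -i any -nn \
  -w /var/tmp/outage-<node>.pcap \
  host <srcPodIP> and host <dstPodIP>
# Note: with east-west IPsec enabled, inter-node traffic may only be visible as ESP on the node uplink.
# Afterwards copy /var/tmp/outage-<node>.pcap off the node and attach it to the case.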
Depends on:
- SPLAT-1409 [vSphere] ipsec communication is broken after nodes reboot (Closed)
- OCPBUGS-25337 [OVN][IPSEC] ovn-ipsec-host pods got deleted when there is a NotReady node (Closed)

Is blocked by:
- SDN-4501 Impact assesment for OCPBUGS-25312 : Upgrade from 4.14->4.15 failed for Vsphere (Closed)

Is duplicated by:
- RHEL-43455 [OVN IPsec]One master node cannot access the pod on one worker node (Planning)

Relates to:
- OCPBUGS-39438 Configure-ovs doesn't persist ethtool configuration (Verified)
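Given that the linked issues above point at OVN IPsec east-west traffic breaking after node reboots, a hedged triage sketch for inspecting the IPsec state on an affected node follows; the pod name comes from the ovn-ipsec-host list in the reproduction output, and the availability of the libreswan ipsec CLI inside that container is an assumption:

# Hedged IPsec triage sketch: ovn-ipsec-host-bn5mm is one of the pods listed above;
# the libreswan `ipsec` CLI being present in the container is an assumption.
oc -n openshift-ovn-kubernetes exec ovn-ipsec-host-bn5mm -- ipsec status
oc -n openshift-ovn-kubernetes exec ovn-ipsec-host-bn5mm -- ipsec trafficstatus
# Missing or stuck SAs toward peer nodes after a reboot would be consistent with the linked issues.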