Type: Bug
Resolution: Won't Do
Priority: Critical
Affects Version: 4.15
Description of problem:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46064/consoleFull
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46126/console
Version-Release number of selected component (if applicable):
How reproducible:
Two upgrades were attempted; both failed (2/2).
Steps to Reproduce:
Triggered 2 upgrades for template 11_UPI on vSphere 8.0 & FIPS ON & OVN IPSEC & Static Network & Bonding & HW19 & Secureboot (IPSEC E-W only)

1. From 4.13.26-x86_64 -> 4.14.0-0.nightly-2023-12-08-072853 -> 4.15.0-0.nightly-2023-12-09-012410

12-11 16:28:56.968 oc get clusteroperators:
12-11 16:28:56.968 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
12-11 16:28:56.968 authentication 4.15.0-0.nightly-2023-12-09-012410 False False True 104m APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
12-11 16:28:56.968 baremetal 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 cloud-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h43m
12-11 16:28:56.968 cloud-credential 4.15.0-0.nightly-2023-12-09-012410 True False False 5h45m
12-11 16:28:56.968 cluster-autoscaler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 config-operator 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 console 4.15.0-0.nightly-2023-12-09-012410 False False False 107m RouteHealthAvailable: console route is not admitted
12-11 16:28:56.968 control-plane-machine-set 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 csi-snapshot-controller 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 dns 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 etcd 4.15.0-0.nightly-2023-12-09-012410 True False False 5h38m
12-11 16:28:56.968 image-registry 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
12-11 16:28:56.968 ingress 4.15.0-0.nightly-2023-12-09-012410 True False False 108m
12-11 16:28:56.968 insights 4.15.0-0.nightly-2023-12-09-012410 True False False 5h33m
12-11 16:28:56.968 kube-apiserver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h35m
12-11 16:28:56.968 kube-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False True 5h37m GarbageCollectorDegraded: error querying alerts: Post "[https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query]": dial tcp 172.30.77.136:9091: i/o timeout
12-11 16:28:56.968 kube-scheduler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h37m
12-11 16:28:56.968 kube-storage-version-migrator 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
12-11 16:28:56.968 machine-api 4.15.0-0.nightly-2023-12-09-012410 True False False 5h36m
12-11 16:28:56.968 machine-approver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 machine-config 4.14.0-0.nightly-2023-12-08-072853 True False False 5h39m
12-11 16:28:56.968 marketplace 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 monitoring 4.15.0-0.nightly-2023-12-09-012410 False True True 63s UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io federate)
12-11 16:28:56.968 network 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 node-tuning 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
12-11 16:28:56.968 openshift-apiserver 4.15.0-0.nightly-2023-12-09-012410 False False False 97m APIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
12-11 16:28:56.968 openshift-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 openshift-samples 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
12-11 16:28:56.968 operator-lifecycle-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-09-012410 True False False 100m
12-11 16:28:56.968 service-ca 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 storage 4.15.0-0.nightly-2023-12-09-012410 True False False 104m

2. From 4.14.5-x86_64 -> 4.15.0-0.nightly-2023-12-11-033133

% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-12-11-033133 False False True 3h32m APIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
cloud-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h47m
cloud-credential 4.15.0-0.nightly-2023-12-11-033133 True False False 5h50m
cluster-autoscaler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
config-operator 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
console 4.15.0-0.nightly-2023-12-11-033133 False False False 3h30m RouteHealthAvailable: console route is not admitted
control-plane-machine-set 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
csi-snapshot-controller 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
dns 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
etcd 4.15.0-0.nightly-2023-12-11-033133 True False False 5h43m
image-registry 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m
ingress 4.15.0-0.nightly-2023-12-11-033133 True False False 4h22m
insights 4.15.0-0.nightly-2023-12-11-033133 True False False 5h39m
kube-apiserver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m
kube-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False True 5h42m GarbageCollectorDegraded: error fetching rules: Get "[https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules]": dial tcp 172.30.237.96:9091: i/o timeout
kube-scheduler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m
kube-storage-version-migrator 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m
machine-api 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
machine-approver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
machine-config 4.14.5 True False False 5h44m
marketplace 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
monitoring 4.15.0-0.nightly-2023-12-11-033133 False True True 4m32s UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io federate)
network 4.15.0-0.nightly-2023-12-11-033133 True False False 5h44m
node-tuning 4.15.0-0.nightly-2023-12-11-033133 True False False 3h48m
openshift-apiserver 4.15.0-0.nightly-2023-12-11-033133 False False False 11m APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
openshift-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
openshift-samples 4.15.0-0.nightly-2023-12-11-033133 True False False 3h49m
operator-lifecycle-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-11-033133 True False False 2m57s
service-ca 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
storage 4.15.0-0.nightly-2023-12-11-033133 True False False 3h28m

% oc get pods -n openshift-ovn-kubernetes
NAME READY STATUS RESTARTS AGE
ovn-ipsec-host-bn5mm 1/1 Running 0 3h17m
ovn-ipsec-host-dlg5c 1/1 Running 0 3h20m
ovn-ipsec-host-dztzf 1/1 Running 0 3h14m
ovn-ipsec-host-tfflr 1/1 Running 0 3h11m
ovn-ipsec-host-wvkwq 1/1 Running 0 3h10m
ovnkube-control-plane-85b45bf6cf-78tbq 2/2 Running 0 3h30m
ovnkube-control-plane-85b45bf6cf-n5pqn 2/2 Running 0 3h33m
ovnkube-node-4rwk4 8/8 Running 8 3h40m
ovnkube-node-567rz 8/8 Running 8 3h34m
ovnkube-node-c7hv4 8/8 Running 8 3h40m
ovnkube-node-qmw49 8/8 Running 8 3h35m
ovnkube-node-s2nsw 8/8 Running 0 3h36m

Multiple pods on different nodes have connectivity problems.

% oc get pods -n openshift-network-diagnostics -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
network-check-source-5cd74f77cc-mlqvz 1/1 Running 0 134m 10.131.0.25 huirwang-46126-g66cb-compute-0 <none> <none>
network-check-target-824mt 1/1 Running 1 139m 10.130.0.212 huirwang-46126-g66cb-control-plane-2 <none> <none>
network-check-target-dzl7m 1/1 Running 1 140m 10.128.2.46 huirwang-46126-g66cb-compute-1 <none> <none>
network-check-target-l224m 1/1 Running 1 133m 10.129.0.173 huirwang-46126-g66cb-control-plane-1 <none> <none>
network-check-target-qd48q 1/1 Running 1 138m 10.128.0.148 huirwang-46126-g66cb-control-plane-0 <none> <none>
network-check-target-sc8hn 1/1 Running 0 134m 10.131.0.3 huirwang-46126-g66cb-compute-0 <none> <none>

% oc rsh -n openshift-network-diagnostics network-check-source-5cd74f77cc-mlqvz
sh-5.1$ curl 10.130.0.212:8080 --connect-timeout 5
curl: (28) Connection timed out after 5000 milliseconds
sh-5.1$ curl 10.128.2.46:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.129.0.173:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.128.0.148:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.131.0.3:8080 --connect-timeout 5
Hello, 10.131.0.25. You have reached 10.131.0.3 on huirwang-46126-g66cb-compute-0
sh-5.1$
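Only the target pod on the same node as the source (huirwang-46126-g66cb-compute-0) answered; every target on a remote node timed out, which points at east-west pod-to-pod traffic being dropped rather than a single bad pod. Below is a minimal sketch for repeating that check against every network-check-target pod; the namespace and pod names come from the output above, while the loop itself is illustrative and was not part of the original run:

# Hedged re-check sketch: only the namespace and pod names are taken from the output above.
SRC_POD=$(oc -n openshift-network-diagnostics get pods -o name | grep network-check-source | head -n1)
for ip in $(oc -n openshift-network-diagnostics get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.podIP}{"\n"}{end}' \
    | awk '/^network-check-target/ {print $2}'); do
  echo "== ${SRC_POD} -> ${ip}:8080 =="
  # 5s connect timeout, same as the manual curl checks above
  oc -n openshift-network-diagnostics rsh "${SRC_POD}" curl -sS --connect-timeout 5 "http://${ip}:8080" || true
done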
Actual results:
Upgrade failed.
Expected results:
Upgrade succeeded.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an:
# internal CI failure
# customer issue / SD
# internal Red Hat testing failure
If it is an internal Red Hat testing failure:
* Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).
If it is a CI failure:
* Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
* Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
* Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
* When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
* If it's a connectivity issue,
* What are the srcNode, srcIP, srcNamespace and srcPodName?
* What are the dstNode, dstIP, dstNamespace and dstPodName?
* What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
If it is a customer / SD issue:
* Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
* Don’t presume that Engineering has access to Salesforce.
* Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: [https://access.redhat.com/support/cases/#/case/]<case number>/discussion?attachmentId=<attachment id>
* Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
* Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
** If the issue is in a customer namespace then provide a namespace inspect.
** If it is a connectivity issue:
*** What are the srcNode, srcNamespace, srcPodName and srcPodIP?
*** What are the dstNode, dstNamespace, dstPodName and dstPodIP?
*** What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
*** Please provide the UTC timestamp of the networking outage window from the must-gather
*** Please provide tcpdump pcaps taken during the outage, filtered on the src/dst IPs provided above (see the capture sketch after this list)
** If it is not a connectivity issue:
*** Describe the steps taken so far to analyze the logs from the networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the component where the issue was seen, based on the attached must-gather. Please attach snippets of the relevant logs from around the window in which the problem happened, if any.
* For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
* For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
* Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
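For the tcpdump item above, a minimal capture sketch, assuming RHCOS nodes where tcpdump can be run through oc debug; <node>, <srcPodIP> and <dstPodIP> are placeholders for the values collected from the must-gather, and the output path and 120s duration are arbitrary:

# Hedged capture sketch: run once on the src node and once on the dst node during the outage window.
oc debug node/<node> -- chroot /host \
  timeout 120 tcpdump -i any -nn \
  -w /var/tmp/outage-<node>.pcap \
  host <srcPodIP> and host <dstPodIP>
# Note: with east-west IPsec enabled, inter-node traffic may only be visible as ESP on the node uplink.
# Afterwards copy /var/tmp/outage-<node>.pcap off the node and attach it to the case.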
Depends on:
- SPLAT-1409 [vSphere] ipsec communication is broken after nodes reboot (Closed)
- OCPBUGS-25337 [OVN][IPSEC] ovn-ipsec-host pods got deleted when there is a NotReady node (Closed)

Is blocked by:
- SDN-4501 Impact assesment for OCPBUGS-25312 : Upgrade from 4.14->4.15 failed for Vsphere (Closed)

Is duplicated by:
- RHEL-43455 [OVN IPsec]One master node cannot access the pod on one worker node (Planning)

Relates to:
- OCPBUGS-39438 Configure-ovs doesn't persist ethtool configuration (Verified)
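Given that the linked issues above point at OVN IPsec east-west traffic breaking after node reboots, a hedged triage sketch for inspecting the IPsec state on an affected node follows; the pod name comes from the ovn-ipsec-host list in the reproduction output, and the availability of the libreswan ipsec CLI inside that container is an assumption:

# Hedged IPsec triage sketch: ovn-ipsec-host-bn5mm is one of the pods listed above;
# the libreswan `ipsec` CLI being present in the container is an assumption.
oc -n openshift-ovn-kubernetes exec ovn-ipsec-host-bn5mm -- ipsec status
oc -n openshift-ovn-kubernetes exec ovn-ipsec-host-bn5mm -- ipsec trafficstatus
# Missing or stuck SAs toward peer nodes after a reboot would be consistent with the linked issues.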