OCPBUGS-25312

[OVN][IPSEC EW] Upgrade from 4.14 -> 4.15 failed on vSphere

Details

    • Important
    • No
    • SDN Sprint 246, SDN Sprint 247, SDN Sprint 248, SDN Sprint 249, SDN Sprint 250
    • 5
    • Rejected
    • False

    Description

      Description of problem:
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46064/consoleFull
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46126/console

      Version-Release number of selected component (if applicable):

           

      How reproducible:
      Two upgrades were attempted and both failed.
           

      Steps to Reproduce:

       

      Triggered 2 upgrades for template 11_UPI on vSphere 8.0 & FIPS ON & OVN IPSEC & Static Network & Bonding & HW19 & Secureboot (IPSEC E-W only)
      1. From 4.13.26-x86_64 -> 4.14.0-0.nightly-2023-12-08-072853 -> 4.15.0-0.nightly-2023-12-09-012410
      
      12-11 16:28:56.968 oc get clusteroperators:
      12-11 16:28:56.968 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
      12-11 16:28:56.968 authentication 4.15.0-0.nightly-2023-12-09-012410 False False True 104m APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
      12-11 16:28:56.968 baremetal 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
      12-11 16:28:56.968 cloud-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h43m
      12-11 16:28:56.968 cloud-credential 4.15.0-0.nightly-2023-12-09-012410 True False False 5h45m
      12-11 16:28:56.968 cluster-autoscaler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 config-operator 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 console 4.15.0-0.nightly-2023-12-09-012410 False False False 107m RouteHealthAvailable: console route is not admitted
      12-11 16:28:56.968 control-plane-machine-set 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
      12-11 16:28:56.968 csi-snapshot-controller 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 dns 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
      12-11 16:28:56.968 etcd 4.15.0-0.nightly-2023-12-09-012410 True False False 5h38m
      12-11 16:28:56.968 image-registry 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
      12-11 16:28:56.968 ingress 4.15.0-0.nightly-2023-12-09-012410 True False False 108m
      12-11 16:28:56.968 insights 4.15.0-0.nightly-2023-12-09-012410 True False False 5h33m
      12-11 16:28:56.968 kube-apiserver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h35m
      12-11 16:28:56.968 kube-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False True 5h37m GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.77.136:9091: i/o timeout
      12-11 16:28:56.968 kube-scheduler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h37m
      12-11 16:28:56.968 kube-storage-version-migrator 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
      12-11 16:28:56.968 machine-api 4.15.0-0.nightly-2023-12-09-012410 True False False 5h36m
      12-11 16:28:56.968 machine-approver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
      12-11 16:28:56.968 machine-config 4.14.0-0.nightly-2023-12-08-072853 True False False 5h39m
      12-11 16:28:56.968 marketplace 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
      12-11 16:28:56.968 monitoring 4.15.0-0.nightly-2023-12-09-012410 False True True 63s UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io federate)
      12-11 16:28:56.968 network 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 node-tuning 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
      12-11 16:28:56.968 openshift-apiserver 4.15.0-0.nightly-2023-12-09-012410 False False False 97m APIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
      12-11 16:28:56.968 openshift-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
      12-11 16:28:56.968 openshift-samples 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
      12-11 16:28:56.968 operator-lifecycle-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-09-012410 True False False 100m
      12-11 16:28:56.968 service-ca 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
      12-11 16:28:56.968 storage 4.15.0-0.nightly-2023-12-09-012410 True False False 104m
      
      
      2. From 4.14.5-x86_64 -> 4.15.0-0.nightly-2023-12-11-033133
      % oc get co
      NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
      authentication 4.15.0-0.nightly-2023-12-11-033133 False False True 3h32m APIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
      baremetal 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      cloud-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h47m
      cloud-credential 4.15.0-0.nightly-2023-12-11-033133 True False False 5h50m
      cluster-autoscaler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      config-operator 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
      console 4.15.0-0.nightly-2023-12-11-033133 False False False 3h30m RouteHealthAvailable: console route is not admitted
      control-plane-machine-set 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      csi-snapshot-controller 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      dns 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      etcd 4.15.0-0.nightly-2023-12-11-033133 True False False 5h43m
      image-registry 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m
      ingress 4.15.0-0.nightly-2023-12-11-033133 True False False 4h22m
      insights 4.15.0-0.nightly-2023-12-11-033133 True False False 5h39m
      kube-apiserver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m
      kube-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False True 5h42m GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.237.96:9091: i/o timeout
      kube-scheduler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m
      kube-storage-version-migrator 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m
      machine-api 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
      machine-approver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      machine-config 4.14.5 True False False 5h44m
      marketplace 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      monitoring 4.15.0-0.nightly-2023-12-11-033133 False True True 4m32s UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io federate)
      network 4.15.0-0.nightly-2023-12-11-033133 True False False 5h44m
      node-tuning 4.15.0-0.nightly-2023-12-11-033133 True False False 3h48m
      openshift-apiserver 4.15.0-0.nightly-2023-12-11-033133 False False False 11m APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
      openshift-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
      openshift-samples 4.15.0-0.nightly-2023-12-11-033133 True False False 3h49m
      operator-lifecycle-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
      operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-11-033133 True False False 2m57s
      service-ca 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
      storage 4.15.0-0.nightly-2023-12-11-033133 True False False 3h28m
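
The operator tables above are long, so the failing rows are easy to miss. As a minimal sketch (not part of the original report), the Python below parses `oc get co` style output and lists the operators that are unavailable or degraded, plus any still reporting a version other than the upgrade target. The function names and the `SAMPLE` text are illustrative assumptions: the rows are copied from the second upgrade with the long messages shortened, and the parser assumes rows without the Jenkins timestamp prefix seen in the first table.

```python
from typing import List, NamedTuple


class OperatorStatus(NamedTuple):
    name: str
    version: str
    available: bool
    progressing: bool
    degraded: bool
    since: str
    message: str


def parse_cluster_operators(text: str) -> List[OperatorStatus]:
    """Parse whitespace-separated `oc get co` rows, skipping the header."""
    rows = []
    for line in text.splitlines():
        # Split into at most 7 columns so MESSAGE keeps its internal spaces.
        fields = line.split(None, 6)
        if len(fields) < 6 or fields[0] == "NAME":
            continue
        name, version, avail, prog, degr, since = fields[:6]
        message = fields[6] if len(fields) > 6 else ""
        rows.append(OperatorStatus(name, version, avail == "True",
                                   prog == "True", degr == "True",
                                   since, message))
    return rows


def unhealthy(rows: List[OperatorStatus]) -> List[str]:
    """Operators that are not Available or are Degraded."""
    return [r.name for r in rows if not r.available or r.degraded]


def lagging(rows: List[OperatorStatus], target_version: str) -> List[str]:
    """Operators whose reported version differs from the upgrade target."""
    return [r.name for r in rows if r.version != target_version]


# Subset of the second upgrade's table; messages shortened for illustration.
TARGET = "4.15.0-0.nightly-2023-12-11-033133"
SAMPLE = """\
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-12-11-033133 False False True 3h32m APIServicesAvailable: "user.openshift.io.v1" is not ready
console 4.15.0-0.nightly-2023-12-11-033133 False False False 3h30m RouteHealthAvailable: console route is not admitted
etcd 4.15.0-0.nightly-2023-12-11-033133 True False False 5h43m
kube-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False True 5h42m GarbageCollectorDegraded: error fetching rules
machine-config 4.14.5 True False False 5h44m
monitoring 4.15.0-0.nightly-2023-12-11-033133 False True True 4m32s UpdatingAlertmanager: reconciling Alertmanager Route failed
network 4.15.0-0.nightly-2023-12-11-033133 True False False 5h44m
openshift-apiserver 4.15.0-0.nightly-2023-12-11-033133 False False False 11m APIServicesAvailable: "apps.openshift.io.v1" is not ready
"""

rows = parse_cluster_operators(SAMPLE)
print("unhealthy:", unhealthy(rows))
print("not at target version:", lagging(rows, TARGET))
```

On this subset the sketch flags authentication, console, kube-controller-manager, monitoring and openshift-apiserver as unhealthy, and machine-config as the only operator not yet at the target version, which matches a manual reading of the table.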
      
       
      
      % oc get pods -n openshift-ovn-kubernetes
      NAME READY STATUS RESTARTS AGE
      ovn-ipsec-host-bn5mm 1/1 Running 0 3h17m
      ovn-ipsec-host-dlg5c 1/1 Running 0 3h20m
      ovn-ipsec-host-dztzf 1/1 Running 0 3h14m
      ovn-ipsec-host-tfflr 1/1 Running 0 3h11m
      ovn-ipsec-host-wvkwq 1/1 Running 0 3h10m
      ovnkube-control-plane-85b45bf6cf-78tbq 2/2 Running 0 3h30m
      ovnkube-control-plane-85b45bf6cf-n5pqn 2/2 Running 0 3h33m
      ovnkube-node-4rwk4 8/8 Running 8 3h40m
      ovnkube-node-567rz 8/8 Running 8 3h34m
      ovnkube-node-c7hv4 8/8 Running 8 3h40m
      ovnkube-node-qmw49 8/8 Running 8 3h35m
      ovnkube-node-s2nsw 8/8 Running 0 3h36m
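
All OVN pods report Running, so the only visible signal in this listing is the RESTARTS column. A small sketch, assuming the plain five-column `oc get pods` layout shown above (newer clients may append text such as "(3h ago)" to RESTARTS, which this parser would skip), pulls out the pods that restarted. The function name `parse_pods` and the `PODS` constant are hypothetical helpers for illustration.

```python
def parse_pods(text):
    """Parse five-column `oc get pods` rows: NAME READY STATUS RESTARTS AGE."""
    pods = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "NAME":
            continue  # header, blank line, or a layout this sketch does not handle
        name, ready, status, restarts, age = fields
        ready_now, ready_total = (int(n) for n in ready.split("/"))
        pods.append({
            "name": name,
            "ready": ready_now == ready_total,
            "status": status,
            "restarts": int(restarts),
            "age": age,
        })
    return pods


# Rows copied from the listing in this report.
PODS = """\
NAME READY STATUS RESTARTS AGE
ovn-ipsec-host-bn5mm 1/1 Running 0 3h17m
ovn-ipsec-host-dlg5c 1/1 Running 0 3h20m
ovn-ipsec-host-dztzf 1/1 Running 0 3h14m
ovn-ipsec-host-tfflr 1/1 Running 0 3h11m
ovn-ipsec-host-wvkwq 1/1 Running 0 3h10m
ovnkube-control-plane-85b45bf6cf-78tbq 2/2 Running 0 3h30m
ovnkube-control-plane-85b45bf6cf-n5pqn 2/2 Running 0 3h33m
ovnkube-node-4rwk4 8/8 Running 8 3h40m
ovnkube-node-567rz 8/8 Running 8 3h34m
ovnkube-node-c7hv4 8/8 Running 8 3h40m
ovnkube-node-qmw49 8/8 Running 8 3h35m
ovnkube-node-s2nsw 8/8 Running 0 3h36m
"""

pods = parse_pods(PODS)
restarted = [p["name"] for p in pods if p["restarts"] > 0]
print("all ready:", all(p["ready"] for p in pods))
print("restarted:", restarted)
```

For this listing, every pod is ready and four of the five ovnkube-node pods show restarts; the listing alone does not say why they restarted.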
      
      Multiple pods on different nodes have connection problems.
      % oc get pods -n openshift-network-diagnostics -o wide
      NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
      network-check-source-5cd74f77cc-mlqvz 1/1 Running 0 134m 10.131.0.25 huirwang-46126-g66cb-compute-0 <none> <none>
      network-check-target-824mt 1/1 Running 1 139m 10.130.0.212 huirwang-46126-g66cb-control-plane-2 <none> <none>
      network-check-target-dzl7m 1/1 Running 1 140m 10.128.2.46 huirwang-46126-g66cb-compute-1 <none> <none>
      network-check-target-l224m 1/1 Running 1 133m 10.129.0.173 huirwang-46126-g66cb-control-plane-1 <none> <none>
      network-check-target-qd48q 1/1 Running 1 138m 10.128.0.148 huirwang-46126-g66cb-control-plane-0 <none> <none>
      network-check-target-sc8hn 1/1 Running 0 134m 10.131.0.3 huirwang-46126-g66cb-compute-0 <none> <none>
      
      % oc rsh -n openshift-network-diagnostics network-check-source-5cd74f77cc-mlqvz
      sh-5.1$ curl 10.130.0.212:8080 --connect-timeout 5
      curl: (28) Connection timed out after 5000 milliseconds
      sh-5.1$ curl 10.128.2.46:8080 --connect-timeout 5
      curl: (28) Connection timed out after 5001 milliseconds
      sh-5.1$ curl 10.129.0.173:8080 --connect-timeout 5
      curl: (28) Connection timed out after 5001 milliseconds
      sh-5.1$ curl 10.128.0.148:8080 --connect-timeout 5
      curl: (28) Connection timed out after 5001 milliseconds
      sh-5.1$ curl 10.131.0.3:8080 --connect-timeout 5
      Hello, 10.131.0.25. You have reached 10.131.0.3 on huirwang-46126-g66cb-compute-0
      sh-5.1$
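
The curl results above can be cross-referenced with the node column of the pod listing. The sketch below tabulates them by whether the target pod shares a node with the source pod; the IPs, node names and outcomes are taken from this report, while `summarize` and the `PROBES` layout are illustrative assumptions. Reading the result as a node-to-node (east-west) traffic problem is an interpretation, not something the report states.

```python
SOURCE_NODE = "huirwang-46126-g66cb-compute-0"

# (target pod IP, target node, did curl from the source pod succeed?)
PROBES = [
    ("10.130.0.212", "huirwang-46126-g66cb-control-plane-2", False),
    ("10.128.2.46", "huirwang-46126-g66cb-compute-1", False),
    ("10.129.0.173", "huirwang-46126-g66cb-control-plane-1", False),
    ("10.128.0.148", "huirwang-46126-g66cb-control-plane-0", False),
    ("10.131.0.3", "huirwang-46126-g66cb-compute-0", True),
]


def summarize(source_node, probes):
    """Count successes and failures for same-node and cross-node targets."""
    summary = {
        "same_node": {"ok": 0, "failed": 0},
        "cross_node": {"ok": 0, "failed": 0},
    }
    for _ip, node, reachable in probes:
        scope = "same_node" if node == source_node else "cross_node"
        summary[scope]["ok" if reachable else "failed"] += 1
    return summary


summary = summarize(SOURCE_NODE, PROBES)
print(summary)
```

With the data from this report, the one same-node target is reachable and all four cross-node targets time out.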
      

      Actual results:
      Upgrade failed.
           

      Expected results:
      Upgrade succeeded.
           

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:

          Is it an
          # internal CI failure 
          # customer issue / SD
          # internal RedHat testing failure

           

      If it is an internal RedHat testing failure:
          * Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

           https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/252060/artifact/workdir/install-dir/auth/kubeconfig/*view*/

      If it is a CI failure:

           
          * Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
          * Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
          * Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
          * When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
          * If it's a connectivity issue,
          * What is the srcNode, srcIP and srcNamespace and srcPodName?
          * What is the dstNode, dstIP and dstNamespace and dstPodName?
          * What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

           

      If it is a customer / SD issue:

           
          * Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
          * Don’t presume that Engineering has access to Salesforce.
          * Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
          * Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
          * Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
          ** If the issue is in a customer namespace then provide a namespace inspect.
          ** If it is a connectivity issue:
          *** What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          *** What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
          *** What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
          *** Please provide the UTC timestamp of the networking outage window from must-gather
          *** Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
          ** If it is not a connectivity issue:
          *** Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.

          * For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
          * For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
          * Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

            People

              pepalani@redhat.com Periyasamy Palanichamy
              huirwang Huiran Wang
              Huiran Wang Huiran Wang
              Votes: 0
              Watchers: 14