OpenShift Bugs / OCPBUGS-27061

[ARO] OCP Upgrade at load (4.12.25 --> 4.13.24) Failed


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Critical
    • Affects Version: 4.13
    • Severity: Important
    • Sprint: SDN Sprint 247, SDN Sprint 248
    • Story Points: 2

      Description of problem:

      A cluster with 252 worker nodes was loaded with the cluster-density-v2
      workload; a subsequent attempt to upgrade the cluster from OCP 4.12.25 to
      OCP 4.13.24 failed.

      Configuration of the cluster in question:

      Master Nodes: Standard_D32s_v5 x 3
      Infra Nodes:  Standard_E16s_v3 x 3
      Worker Nodes: Standard_D8s_v5  x 252

      Version-Release number of selected component (if applicable):

      From: OCP 4.12.25
      To:   OCP 4.13.24 [channel: fast-4.13]

      Steps to Reproduce:

      1. kube-burner ocp cluster-density-v2 --gc=false --iterations=2268 --churn=false
      2. oc adm upgrade channel fast-4.13
      3. oc adm upgrade --to=4.13.24

      Actual results:

      oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.25   True        True          91m     Unable to apply 4.13.24: wait has exceeded 40 minutes for these operators: network

      Expected results:

      OCP Cluster should have upgraded to 4.13.24

      Additional info:

      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.25   True        True          89m     Unable to apply 4.13.24: wait has exceeded 40 minutes for these operators: network
      $
      ============================================================
      NAME                                       VERSION        AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      aro                                        v20231214.00   True        False         False      46h     
      authentication                             4.13.24        True        False         False      61m     
      cloud-controller-manager                   4.13.24        True        False         False      46h     
      cloud-credential                           4.13.24        True        False         False      46h     
      cluster-autoscaler                         4.13.24        True        False         False      46h     
      config-operator                            4.13.24        True        False         False      46h     
      console                                    4.13.24        True        False         False      15h     
      control-plane-machine-set                  4.13.24        True        False         False      46h     
      csi-snapshot-controller                    4.13.24        True        False         False      46h     
      dns                                        4.12.25        True        False         False      46h     
      etcd                                       4.13.24        True        False         False      46h     
      image-registry                             4.13.24        True        False         False      46h     
      ingress                                    4.13.24        True        False         False      61m     
      insights                                   4.13.24        True        False         False      46h     
      kube-apiserver                             4.13.24        True        False         False      46h     
      kube-controller-manager                    4.13.24        True        False         False      46h     
      kube-scheduler                             4.13.24        True        False         False      46h     
      kube-storage-version-migrator              4.13.24        True        False         False      45h     
      machine-api                                4.13.24        True        False         False      46h     
      machine-approver                           4.13.24        True        False         False      46h     
      machine-config                             4.12.25        True        False         False      37h     
      marketplace                                4.13.24        True        False         False      46h     
      monitoring                                 4.13.24        True        False         False      46h     
      network                                    4.12.25        True        True          True       46h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-27mvw is in CrashLoopBackOff State...
      node-tuning                                4.13.24        True        False         False      61m     
      openshift-apiserver                        4.13.24        True        False         False      46h     
      openshift-controller-manager               4.13.24        True        False         False      46h     
      openshift-samples                          4.13.24        True        False         False      63m     
      operator-lifecycle-manager                 4.13.24        True        False         False      46h     
      operator-lifecycle-manager-catalog         4.13.24        True        False         False      46h     
      operator-lifecycle-manager-packageserver   4.13.24        True        False         False      45h     
      service-ca                                 4.13.24        True        False         False      46h     
      storage                                    4.13.24        True        False         False      46h  
      
      ============================================================
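
      Of the operators above, only dns, machine-config, and network are still at
      4.12.25, with network the one actually blocking. As an aside, a quick way to
      pull the lagging operators out of `oc get co` output is a small Python sketch
      like the following (not part of any OpenShift tooling; the helper name and
      sample table are illustrative):

      ```python
      # Minimal sketch: given the plain-text output of `oc get co`, list the
      # cluster operators whose VERSION column has not reached the upgrade target.
      TARGET = "4.13.24"

      def lagging_operators(co_output: str, target: str = TARGET) -> list[str]:
          lagging = []
          for line in co_output.strip().splitlines()[1:]:  # skip the header row
              fields = line.split()
              # fields[0] is NAME, fields[1] is VERSION; operators like `aro`
              # that carry their own version scheme will also be flagged.
              if len(fields) >= 2 and fields[1] != target:
                  lagging.append(fields[0])
          return lagging

      sample = """\
      NAME            VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
      dns             4.12.25   True        False         False      46h
      etcd            4.13.24   True        False         False      46h
      machine-config  4.12.25   True        False         False      37h
      network         4.12.25   True        True          True       46h
      """
      print(lagging_operators(sample))  # -> ['dns', 'machine-config', 'network']
      ```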
      $ oc get po | grep -i master
      ovnkube-master-27mvw   5/6     CrashLoopBackOff   16 (3m5s ago)   42m
      ovnkube-master-7959l   4/6     CrashLoopBackOff   15 (85s ago)    39m
      ovnkube-master-8k9rc   5/6     CrashLoopBackOff   22 (54s ago)    38m
      ============================================================ 
      
        ovn-dbchecker:
          Container ID:  cri-o://ed05d1834860fe64162db7bb1cb802b61c7d373725d7b33c8391a89a98e89cec
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f
          Port:          <none>
          Host Port:     <none>
          Command:
            /bin/bash
            -c
            set -xe
            if [[ -f "/env/_master" ]]; then
              set -o allexport
              source "/env/_master"
              set +o allexport
            fi
            
            echo "I$(date "+%m%d %H:%M:%S.%N") - ovn-dbchecker - start ovn-dbchecker"
            
            # RAFT clusters need an odd number of members to achieve consensus.
            # The CNO determines which members make up the cluster, so if this container
            # is not supposed to be part of the cluster, wait forever doing nothing
            # (instead of exiting and causing CrashLoopBackoffs for no reason).
            if [[ ! "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" =~ .*":${K8S_NODE_IP}:".* ]] && [[ ! "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" =~ .*":[${K8S_NODE_IP}]:".* ]]; then
              echo "$(date -Iseconds) - not selected as RAFT member; sleeping..."
              sleep 1500d
              exit 0
            fi
            
            exec /usr/bin/ovndbchecker \
              --config-file=/run/ovnkube-config/ovnkube.conf \
              --loglevel "${OVN_KUBE_LOG_LEVEL}" \
              --sb-address "ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642" \
              --sb-client-privkey /ovn-cert/tls.key \
              --sb-client-cert /ovn-cert/tls.crt \
              --sb-client-cacert /ovn-ca/ca-bundle.crt \
              --sb-cert-common-name "ovn" \
              --sb-raft-election-timer "16" \
              --nb-address "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" \
              --nb-client-privkey /ovn-cert/tls.key \
              --nb-client-cert /ovn-cert/tls.crt \
              --nb-client-cacert /ovn-ca/ca-bundle.crt \
              --nb-cert-common-name "ovn" \
              --nb-raft-election-timer "10"
            
          State:       Running
            Started:   Fri, 12 Jan 2024 14:51:02 +0530
          Last State:  Terminated
            Reason:    Error
            Message:   27       1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to upgrade schema, stderr: \"2024-01-12T09:18:48Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-sb.ovsschema: changed 2 columns in 'OVN_Southbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns\\n2024-01-12T09:19:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 30 convert unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' failed: signal: alarm clock"
      E0112 09:20:00.805994       1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to upgrade schema, stderr: \"2024-01-12T09:19:30Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-sb.ovsschema: changed 2 columns in 'OVN_Southbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns\\n2024-01-12T09:20:00Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 30 convert unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' failed: signal: alarm clock"
      E0112 09:20:10.818688       1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to get schema version for NBDB, stderr: \"2024-01-12T09:20:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 10 get-schema-version unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound' failed: signal: alarm clock"
      F0112 09:20:10.818733       1 ovndbmanager.go:54] SBDB Upgrade failed: failed to upgrade db schema: timed out waiting for the condition. Error from last attempt: failed to get schema version for NBDB, stderr: "2024-01-12T09:20:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovsdb-client -t 10 get-schema-version unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound' failed: signal: alarm clock
      
      
            Exit Code:    255
            Started:      Fri, 12 Jan 2024 14:44:08 +0530
            Finished:     Fri, 12 Jan 2024 14:50:10 +0530
          Ready:          True
          Restart Count:  10
          Requests:
            cpu:     10m
            memory:  300Mi
          Environment:
            OVN_KUBE_LOG_LEVEL:  4
            K8S_NODE_IP:          (v1:status.hostIP)
          Mounts:
            /env from env-overrides (rw)
            /etc/openvswitch/ from etc-openvswitch (rw)
            /etc/ovn/ from etc-openvswitch (rw)
            /ovn-ca from ovn-ca (rw)
            /ovn-cert from ovn-cert (rw)
            /run/openvswitch/ from run-openvswitch (rw)
            /run/ovn/ from run-ovn (rw)
            /run/ovnkube-config/ from ovnkube-config (rw)
            /var/lib/openvswitch/ from var-lib-openvswitch (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stb4h (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             True 
        ContainersReady   True 
        PodScheduled      True 
      Volumes:
        systemd-units:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/systemd/system
          HostPathType: 
        etc-openvswitch:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lib/ovn/etc
          HostPathType: 
        var-lib-openvswitch:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lib/ovn/data
          HostPathType: 
        run-openvswitch:
          Type:          HostPath (bare host directory volume)
          Path:          /var/run/openvswitch
          HostPathType: 
        run-ovn:
          Type:          HostPath (bare host directory volume)
          Path:          /var/run/ovn
          HostPathType: 
        ovnkube-config:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      ovnkube-config
          Optional:  false
        env-overrides:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      env-overrides
          Optional:  true
        ovn-ca:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      ovn-ca
          Optional:  false
        ovn-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  ovn-cert
          Optional:    false
        ovn-master-metrics-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  ovn-master-metrics-cert
          Optional:    true
        kube-api-access-stb4h:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              beta.kubernetes.io/os=linux
                                   node-role.kubernetes.io/master=
      Tolerations:                 node-role.kubernetes.io/master op=Exists
                                   node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/network-unavailable op=Exists
                                   node.kubernetes.io/not-ready op=Exists
                                   node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/unreachable op=Exists
                                   node.kubernetes.io/unschedulable:NoSchedule op=Exists
      Events:
        Type     Reason     Age                  From               Message
        ----     ------     ----                 ----               -------
        Normal   Scheduled  72m                  default-scheduler  Successfully assigned openshift-ovn-kubernetes/ovnkube-master-27mvw to krishvoor-v5-ocp-jrq4p-master-0
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Created    72m                  kubelet            Created container northd
        Normal   Started    72m                  kubelet            Started container northd
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Created    72m                  kubelet            Created container nbdb
        Normal   Started    72m                  kubelet            Started container nbdb
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff58f8cff3d9c63906656c10e45f9b61fda02d86165d6de8a4e8c0fc4bbca250" already present on machine
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Started    72m                  kubelet            Started container kube-rbac-proxy
        Normal   Created    72m                  kubelet            Created container kube-rbac-proxy
        Normal   Created    72m                  kubelet            Created container sbdb
        Normal   Started    72m                  kubelet            Started container sbdb
        Normal   Pulled     71m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Started    71m                  kubelet            Started container ovnkube-master
        Normal   Created    71m                  kubelet            Created container ovnkube-master
        Normal   Pulled     71m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Created    71m                  kubelet            Created container ovn-dbchecker
        Normal   Started    71m                  kubelet            Started container ovn-dbchecker
        Warning  BackOff    32m (x134 over 68m)  kubelet            Back-off restarting failed container
        Warning  Unhealthy  22m (x130 over 68m)  kubelet            Readiness probe failed: SB DB Raft leader is unknown to the cluster node.
      + [[ ! ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642 =~ .*:10\.0\.0\.8:.* ]]
      ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
      ++ grep 'Leader: unknown'
      + leader_status='Leader: unknown'
      + [[ ! -z Leader: unknown ]]
      + echo 'SB DB Raft leader is unknown to the cluster node.'
      + exit 1
        Warning  Unhealthy  2m39s (x397 over 70m)  kubelet  Readiness probe failed: + [[ ! ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642 =~ .*:10\.0\.0\.8:.* ]]
      ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
      ++ grep 'Leader: unknown'
      ++ true
      + leader_status=
      ============================================================ 
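
      Per the "terminating with signal 14 (Alarm clock)" lines in the logs above,
      the schema conversion is being killed by ovsdb-client's own `-t` watchdog
      (30 s for the convert, 10 s for get-schema-version), which on this loaded
      database expires before the operation completes. A small Python sketch
      (illustrative only; the `parse_failure` helper is not part of any tooling)
      that pulls the failing command and its timeout out of such an error line:

      ```python
      import re

      # Abridged error line from the ovn-dbchecker "Last State" message above.
      log = ("error: OVN command '/usr/bin/ovsdb-client -t 30 convert "
             "unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' "
             "failed: signal: alarm clock")

      def parse_failure(line: str):
          """Extract the failing OVN command and its -t timeout in seconds."""
          cmd = re.search(r"OVN command '([^']+)'", line)
          timeout = re.search(r"-t (\d+)", line)
          return (cmd.group(1) if cmd else None,
                  int(timeout.group(1)) if timeout else None)

      command, timeout_s = parse_failure(log)
      print(timeout_s)  # -> 30
      ```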
      
      $ 
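
      The readiness probe trace above boils down to grepping the output of
      `ovn-appctl ... cluster/status OVN_Southbound` for "Leader: unknown" and
      failing the probe when it appears. The same check, as a minimal Python
      sketch (the function name and sample status strings are illustrative):

      ```python
      def sbdb_ready(cluster_status: str) -> bool:
          """Mirror of the probe logic traced above: the SB DB raft member is
          treated as ready only when cluster/status does not report an
          unknown leader."""
          return "Leader: unknown" not in cluster_status

      healthy = "Leader: self\nTerm: 4\n"
      unhealthy = "Leader: unknown\nTerm: 4\n"
      print(sbdb_ready(healthy), sbdb_ready(unhealthy))  # -> True False
      ```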

       

              Assignee: Nadia Pinaeva (npinaeva@redhat.com)
              Reporter: Krishna Harsha Voora (rh-ee-krvoora)
              QA Contact: Anurag Saxena