Bug
Resolution: Won't Do
Critical
None
4.13
None
Important
No
SDN Sprint 247, SDN Sprint 248
2
False
Description of problem:
A cluster with 252 worker nodes was loaded with the cluster-density-v2 workload; an attempt to upgrade the cluster from OCP 4.12.25 to 4.13.24 failed.
Here's the OCP config of this cluster:
Master Nodes: Standard_D32s_v5 x 3
Infra Nodes: Standard_E16s_v3 x 3
Worker Nodes: Standard_D8s_v5 x 3
Version-Release number of selected component (if applicable):
From OCP version: 4.12.25
To OCP version: 4.13.24 [channel: fast-4.13]
Steps to Reproduce:
1. kube-burner ocp cluster-density-v2 --gc=false --iterations=2268 --churn=false
2. oc adm upgrade channel fast-4.13
3. oc adm upgrade --to=4.13.24
Actual results:
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.25   True        True          91m     Unable to apply 4.13.24: wait has exceeded 40 minutes for these operators: network
Expected results:
OCP Cluster should have upgraded to 4.13.24
Additional info:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.25   True        True          89m     Unable to apply 4.13.24: wait has exceeded 40 minutes for these operators: network
$
============================================================
$
NAME                                       VERSION        AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
aro                                        v20231214.00   True        False         False      46h
authentication                             4.13.24        True        False         False      61m
cloud-controller-manager                   4.13.24        True        False         False      46h
cloud-credential                           4.13.24        True        False         False      46h
cluster-autoscaler                         4.13.24        True        False         False      46h
config-operator                            4.13.24        True        False         False      46h
console                                    4.13.24        True        False         False      15h
control-plane-machine-set                  4.13.24        True        False         False      46h
csi-snapshot-controller                    4.13.24        True        False         False      46h
dns                                        4.12.25        True        False         False      46h
etcd                                       4.13.24        True        False         False      46h
image-registry                             4.13.24        True        False         False      46h
ingress                                    4.13.24        True        False         False      61m
insights                                   4.13.24        True        False         False      46h
kube-apiserver                             4.13.24        True        False         False      46h
kube-controller-manager                    4.13.24        True        False         False      46h
kube-scheduler                             4.13.24        True        False         False      46h
kube-storage-version-migrator              4.13.24        True        False         False      45h
machine-api                                4.13.24        True        False         False      46h
machine-approver                           4.13.24        True        False         False      46h
machine-config                             4.12.25        True        False         False      37h
marketplace                                4.13.24        True        False         False      46h
monitoring                                 4.13.24        True        False         False      46h
network                                    4.12.25        True        True          True       46h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-27mvw is in CrashLoopBackOff State...
node-tuning                                4.13.24        True        False         False      61m
openshift-apiserver                        4.13.24        True        False         False      46h
openshift-controller-manager               4.13.24        True        False         False      46h
openshift-samples                          4.13.24        True        False         False      63m
operator-lifecycle-manager                 4.13.24        True        False         False      46h
operator-lifecycle-manager-catalog         4.13.24        True        False         False      46h
operator-lifecycle-manager-packageserver   4.13.24        True        False         False      45h
service-ca                                 4.13.24        True        False         False      46h
storage                                    4.13.24        True        False         False      46h
============================================================
$ oc get po | grep -i master
ovnkube-master-27mvw   5/6   CrashLoopBackOff   16 (3m5s ago)   42m
ovnkube-master-7959l   4/6   CrashLoopBackOff   15 (85s ago)    39m
ovnkube-master-8k9rc   5/6   CrashLoopBackOff   22 (54s ago)    38m
============================================================
$
ovn-dbchecker:
  Container ID:  cri-o://ed05d1834860fe64162db7bb1cb802b61c7d373725d7b33c8391a89a98e89cec
  Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f
  Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f
  Port:          <none>
  Host Port:     <none>
  Command:
    /bin/bash
    -c
    set -xe
    if [[ -f "/env/_master" ]]; then
      set -o allexport
      source "/env/_master"
      set +o allexport
    fi
    echo "I$(date "+%m%d %H:%M:%S.%N") - ovn-dbchecker - start ovn-dbchecker"
    # RAFT clusters need an odd number of members to achieve consensus.
    # The CNO determines which members make up the cluster, so if this container
    # is not supposed to be part of the cluster, wait forever doing nothing
    # (instad of exiting and causing CrashLoopBackoffs for no reason).
    if [[ ! "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" =~ .*":${K8S_NODE_IP}:".* ]] && [[ ! "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" =~ .*":[${K8S_NODE_IP}]:".* ]]; then
      echo "$(date -Iseconds) - not selected as RAFT member; sleeping..."
      sleep 1500d
      exit 0
    fi
    exec /usr/bin/ovndbchecker \
      --config-file=/run/ovnkube-config/ovnkube.conf \
      --loglevel "${OVN_KUBE_LOG_LEVEL}" \
      --sb-address "ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642" \
      --sb-client-privkey /ovn-cert/tls.key \
      --sb-client-cert /ovn-cert/tls.crt \
      --sb-client-cacert /ovn-ca/ca-bundle.crt \
      --sb-cert-common-name "ovn" \
      --sb-raft-election-timer "16" \
      --nb-address "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" \
      --nb-client-privkey /ovn-cert/tls.key \
      --nb-client-cert /ovn-cert/tls.crt \
      --nb-client-cacert /ovn-ca/ca-bundle.crt \
      --nb-cert-common-name "ovn" \
      --nb-raft-election-timer "10"
  State:          Running
    Started:      Fri, 12 Jan 2024 14:51:02 +0530
  Last State:     Terminated
    Reason:       Error
    Message:      27 1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to upgrade schema, stderr: \"2024-01-12T09:18:48Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-sb.ovsschema: changed 2 columns in 'OVN_Southbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns\\n2024-01-12T09:19:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 30 convert unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' failed: signal: alarm clock"
      E0112 09:20:00.805994 1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to upgrade schema, stderr: \"2024-01-12T09:19:30Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-sb.ovsschema: changed 2 columns in 'OVN_Southbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns\\n2024-01-12T09:20:00Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 30 convert unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' failed: signal: alarm clock"
      E0112 09:20:10.818688 1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to get schema version for NBDB, stderr: \"2024-01-12T09:20:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 10 get-schema-version unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound' failed: signal: alarm clock"
      F0112 09:20:10.818733 1 ovndbmanager.go:54] SBDB Upgrade failed: failed to upgrade db schema: timed out waiting for the condition. Error from last attempt: failed to get schema version for NBDB, stderr: "2024-01-12T09:20:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovsdb-client -t 10 get-schema-version unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound' failed: signal: alarm clock
    Exit Code:    255
    Started:      Fri, 12 Jan 2024 14:44:08 +0530
    Finished:     Fri, 12 Jan 2024 14:50:10 +0530
  Ready:          True
  Restart Count:  10
  Requests:
    cpu:     10m
    memory:  300Mi
  Environment:
    OVN_KUBE_LOG_LEVEL:  4
    K8S_NODE_IP:         (v1:status.hostIP)
  Mounts:
    /env from env-overrides (rw)
    /etc/openvswitch/ from etc-openvswitch (rw)
    /etc/ovn/ from etc-openvswitch (rw)
    /ovn-ca from ovn-ca (rw)
    /ovn-cert from ovn-cert (rw)
    /run/openvswitch/ from run-openvswitch (rw)
    /run/ovn/ from run-ovn (rw)
    /run/ovnkube-config/ from ovnkube-config (rw)
    /var/lib/openvswitch/ from var-lib-openvswitch (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stb4h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  systemd-units:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/systemd/system
    HostPathType:
  etc-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/ovn/etc
    HostPathType:
  var-lib-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/ovn/data
    HostPathType:
  run-openvswitch:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/openvswitch
    HostPathType:
  run-ovn:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/ovn
    HostPathType:
  ovnkube-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovnkube-config
    Optional:  false
  env-overrides:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      env-overrides
    Optional:  true
  ovn-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ovn-ca
    Optional:  false
  ovn-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-cert
    Optional:    false
  ovn-master-metrics-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ovn-master-metrics-cert
    Optional:    true
  kube-api-access-stb4h:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
                 node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable op=Exists
                 node.kubernetes.io/not-ready op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  72m                    default-scheduler  Successfully assigned openshift-ovn-kubernetes/ovnkube-master-27mvw to krishvoor-v5-ocp-jrq4p-master-0
  Normal   Pulled     72m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
  Normal   Created    72m                    kubelet            Created container northd
  Normal   Started    72m                    kubelet            Started container northd
  Normal   Pulled     72m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
  Normal   Created    72m                    kubelet            Created container nbdb
  Normal   Started    72m                    kubelet            Started container nbdb
  Normal   Pulled     72m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff58f8cff3d9c63906656c10e45f9b61fda02d86165d6de8a4e8c0fc4bbca250" already present on machine
  Normal   Pulled     72m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
  Normal   Started    72m                    kubelet            Started container kube-rbac-proxy
  Normal   Created    72m                    kubelet            Created container kube-rbac-proxy
  Normal   Created    72m                    kubelet            Created container sbdb
  Normal   Started    72m                    kubelet            Started container sbdb
  Normal   Pulled     71m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
  Normal   Started    71m                    kubelet            Started container ovnkube-master
  Normal   Created    71m                    kubelet            Created container ovnkube-master
  Normal   Pulled     71m                    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
  Normal   Created    71m                    kubelet            Created container ovn-dbchecker
  Normal   Started    71m                    kubelet            Started container ovn-dbchecker
  Warning  BackOff    32m (x134 over 68m)    kubelet            Back-off restarting failed container
  Warning  Unhealthy  22m (x130 over 68m)    kubelet            Readiness probe failed: SB DB Raft leader is unknown to the cluster node.
    + [[ ! ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642 =~ .*:10\.0\.0\.8:.* ]]
    ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
    ++ grep 'Leader: unknown'
    + leader_status='Leader: unknown'
    + [[ ! -z Leader: unknown ]]
    + echo 'SB DB Raft leader is unknown to the cluster node.'
    + exit 1
  Warning  Unhealthy  2m39s (x397 over 70m)  kubelet            Readiness probe failed:
    + [[ ! ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642 =~ .*:10\.0\.0\.8:.* ]]
    ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
    ++ grep 'Leader: unknown'
    ++ true
    + leader_status=
============================================================
$
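The repeated readiness-probe failures above come down to a simple check: the probe runs `ovn-appctl ... cluster/status OVN_Southbound` and fails the pod whenever the output contains the line `Leader: unknown`. A minimal sketch of that check in Python (the sample `cluster/status` excerpts below are hypothetical, not output captured from this cluster):

```python
import re

def sbdb_leader_known(cluster_status: str) -> bool:
    """Mirror the probe's `grep 'Leader: unknown'` test: the pod is only
    ready while the SB DB Raft group reports a known leader."""
    return re.search(r"^Leader: unknown$", cluster_status, re.MULTILINE) is None

# Hypothetical excerpts of `ovn-appctl cluster/status OVN_Southbound` output.
healthy = "Name: OVN_Southbound\nRole: follower\nLeader: 9a2f\nTerm: 7\n"
broken = "Name: OVN_Southbound\nRole: candidate\nLeader: unknown\nTerm: 8\n"

print(sbdb_leader_known(healthy))  # True  -> probe passes
print(sbdb_leader_known(broken))   # False -> probe fails, pod stays unready
```

This is why the ovnkube-master pods never report 6/6 ready: the SB DB schema conversion keeps timing out (signal 14, SIGALRM, from `ovsdb-client -t 30`), the Raft group never settles on a leader, and the readiness probe keeps failing.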
is related to:
OCPBUGS-27439 [ARO] 4.13.23 --> 4.14.10 Upgrade Failed at [network] (Closed)