Type: Bug
Resolution: Done-Errata
Version: 4.14
Sprint: SDN Sprint 242, SDN Sprint 243
Description of problem:
OCP upgrades fail with the message "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
Version-Release number of selected component (if applicable):
Currently 4.14.0-rc.1, but we observed the same issue with previous 4.14 nightlies too:
- 4.14.0-0.nightly-2023-09-12-195514
- 4.14.0-0.nightly-2023-09-02-132842
- 4.14.0-0.nightly-2023-08-28-154013
How reproducible:
Roughly 1 out of every 2 upgrade attempts
Steps to Reproduce:
1. Deploy OCP 4.13 with the latest GA release on a baremetal cluster with IPI and OVN-Kubernetes.
2. Upgrade to the latest available 4.14 release (a command sketch for steps 2-5 follows this list).
3. Check the cluster version status during the upgrade; at some point the upgrade stops with the message: "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
4. Check the OVN pods with "oc get pods -n openshift-ovn-kubernetes": some pods run 7 out of 8 containers (missing ovnkube-node) and restart constantly, while others run only 5 containers and show errors connecting to the OVN DBs.
5. Check the cluster operators with "oc get co": dns, network, and machine-config remain at 4.13 and are degraded.
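A minimal sketch of the commands behind steps 2-5, assuming a standard oc client; the release image pullspec below is a placeholder for illustration, not a value taken from this report:
# Step 2: trigger the upgrade to a 4.14 candidate release (placeholder pullspec, substitute the real one):
$ oc adm upgrade --allow-explicit-upgrade --to-image=quay.io/openshift-release-dev/ocp-release:4.14.0-rc.1-x86_64
# Step 3: watch the cluster version for the "MultipleErrors" status message:
$ oc get clusterversion
# Step 4: look for ovnkube-node pods stuck at 7/8 containers and restarting:
$ oc get pods -n openshift-ovn-kubernetes -o wide
# Step 5: confirm which operators are still on 4.13 and degraded:
$ oc get co dns network machine-config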
Actual results:
The upgrade does not complete, and the OVN pods remain in a restart loop with failures.
Expected results:
The upgrade should complete without issues, and the OVN pods should remain in Running status without restarts.
Additional info:
- We have tested this with the latest GA version of 4.13 (as of today, Sep 19: 4.13.13 to 4.14.0-rc.1), but we have been observing the issue for about 20 days, with previous versions of 4.13 and 4.14.
- Our deployments use single-stack IPv4, with one NIC for provisioning and one NIC for baremetal (the machine network).
These are the results from our latest test, upgrading from 4.13.13 to 4.14.0-rc.1:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        True          2h8m    Unable to apply 4.14.0-rc.1: an unknown error has occurred: MultipleErrors

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-ebb1da47ad5cb76c396983decb7df1ea   True      False      False      3              3                   3                     0                      3h41m
worker   rendered-worker-26ccb35941236935a570dddaa0b699db   False     True       True       3              2                   2                     1                      3h41m

$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.14.0-rc.1   True        False         False      2h21m
baremetal                                  4.14.0-rc.1   True        False         False      3h38m
cloud-controller-manager                   4.14.0-rc.1   True        False         False      3h41m
cloud-credential                           4.14.0-rc.1   True        False         False      2h23m
cluster-autoscaler                         4.14.0-rc.1   True        False         False      2h21m
config-operator                            4.14.0-rc.1   True        False         False      3h40m
console                                    4.14.0-rc.1   True        False         False      2h20m
control-plane-machine-set                  4.14.0-rc.1   True        False         False      3h40m
csi-snapshot-controller                    4.14.0-rc.1   True        False         False      2h21m
dns                                        4.13.13       True        True          True       2h9m
etcd                                       4.14.0-rc.1   True        False         False      2h40m
image-registry                             4.14.0-rc.1   True        False         False      2h9m
ingress                                    4.14.0-rc.1   True        True          True       1h14m
insights                                   4.14.0-rc.1   True        False         False      3h34m
kube-apiserver                             4.14.0-rc.1   True        False         False      2h35m
kube-controller-manager                    4.14.0-rc.1   True        False         False      2h30m
kube-scheduler                             4.14.0-rc.1   True        False         False      2h29m
kube-storage-version-migrator              4.14.0-rc.1   False       True          False      2h9m
machine-api                                4.14.0-rc.1   True        False         False      2h24m
machine-approver                           4.14.0-rc.1   True        False         False      3h40m
machine-config                             4.13.13       True        False         True       59m
marketplace                                4.14.0-rc.1   True        False         False      3h40m
monitoring                                 4.14.0-rc.1   False       True          True       2h3m
network                                    4.13.13       True        True          True       2h4m
node-tuning                                4.14.0-rc.1   True        False         False      2h9m
openshift-apiserver                        4.14.0-rc.1   True        False         False      2h20m
openshift-controller-manager               4.14.0-rc.1   True        False         False      2h20m
openshift-samples                          4.14.0-rc.1   True        False         False      2h23m
operator-lifecycle-manager                 4.14.0-rc.1   True        False         False      2h23m
operator-lifecycle-manager-catalog         4.14.0-rc.1   True        False         False      2h18m
operator-lifecycle-manager-packageserver   4.14.0-rc.1   True        False         False      2h20m
service-ca                                 4.14.0-rc.1   True        False         False      2h23m
storage                                    4.14.0-rc.1   True        False         False      3h40m
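Not part of the original report: since the worker pool above shows one degraded machine, a quick way to see which node is stuck and why could be the standard commands below (no cluster-specific values assumed):
$ oc describe mcp worker
$ oc get nodes -o wide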
Some OVN pods are running 7 out of 8 containers (missing ovnkube-node) and restart constantly, while other pods run only 5 containers and show errors connecting to the OVN DBs.
$ oc get pods -n openshift-ovn-kubernetes -o wide
NAME                                     READY   STATUS    RESTARTS   AGE     IP              NODE
ovnkube-control-plane-5f5c598768-czkjv   2/2     Running   0          2h16m   192.168.16.32   dciokd-master-1
ovnkube-control-plane-5f5c598768-kg69r   2/2     Running   0          2h16m   192.168.16.31   dciokd-master-0
ovnkube-control-plane-5f5c598768-prfb5   2/2     Running   0          2h16m   192.168.16.33   dciokd-master-2
ovnkube-node-9hjv9                       5/5     Running   1          3h43m   192.168.16.32   dciokd-master-1
ovnkube-node-fmswc                       7/8     Running   19         2h10m   192.168.16.36   dciokd-worker-2
ovnkube-node-pcjhp                       7/8     Running   20         2h15m   192.168.16.35   dciokd-worker-1
ovnkube-node-q7kcj                       5/5     Running   1          3h43m   192.168.16.33   dciokd-master-2
ovnkube-node-qsngm                       5/5     Running   3          3h27m   192.168.16.34   dciokd-worker-0
ovnkube-node-v2d4h                       7/8     Running   20         2h15m   192.168.16.31   dciokd-master-0

$ oc logs ovnkube-node-9hjv9 -c ovnkube-node -n openshift-ovn-kubernetes | less
...
2023-09-19T03:40:23.112699529Z E0919 03:40:23.112660 5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Northbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
2023-09-19T03:40:23.112699529Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.112699529Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1)
2023-09-19T03:40:23.112699529Z E0919 03:40:23.112677 5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1
2023-09-19T03:40:23.114791313Z E0919 03:40:23.114777 5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_NORTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
2023-09-19T03:40:23.114791313Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.114791313Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 memory/show' failed: exit status 1)
2023-09-19T03:40:23.116492808Z E0919 03:40:23.116478 5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Southbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
2023-09-19T03:40:23.116492808Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.116492808Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1)
2023-09-19T03:40:23.116492808Z E0919 03:40:23.116488 5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1
2023-09-19T03:40:23.118468064Z E0919 03:40:23.118450 5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_SOUTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
2023-09-19T03:40:23.118468064Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.118468064Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 memory/show' failed: exit status 1)
2023-09-19T03:40:25.118085671Z E0919 03:40:25.118056 5883 ovn_northd.go:128] Failed to get ovn-northd status stderr() :(failed to run the command since failed to get ovn-northd's pid: open /var/run/ovn/ovn-northd.pid: no such file or directory)
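As a follow-up (not in the original report), a short sketch of additional data that could be gathered, reusing pod and node names from the output above; --previous and --all-containers are standard oc logs options, and the /var/run/ovn path is taken from the error messages in the log:
$ oc describe pod ovnkube-node-fmswc -n openshift-ovn-kubernetes
$ oc logs ovnkube-node-fmswc -n openshift-ovn-kubernetes --all-containers --previous
# Check from the host whether the OVN DB control sockets the errors point at actually exist:
$ oc debug node/dciokd-worker-2 -- chroot /host ls -l /var/run/ovn/
# Full data collection for the network team:
$ oc adm must-gather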
- blocks: OCPBUGS-19771 OCP upgrade 4.13 to 4.14 fails with: an unknown error has occurred: MultipleErrors (Closed)
- is cloned by: OCPBUGS-19771 OCP upgrade 4.13 to 4.14 fails with: an unknown error has occurred: MultipleErrors (Closed)
- links to: RHEA-2023:7198 rpm