OCPBUGS-19771

OCP upgrade 4.13 to 4.14 fails with: an unknown error has occurred: MultipleErrors


      This is a clone of issue OCPBUGS-19418. The following is the description of the original issue:

      Description of problem:

      OCP Upgrades fail with message "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
      

      Version-Release number of selected component (if applicable):

      Currently 4.14.0-rc.1, but we observed the same issue with previous 4.14 nightlies too: 
      4.14.0-0.nightly-2023-09-12-195514
      4.14.0-0.nightly-2023-09-02-132842
      4.14.0-0.nightly-2023-08-28-154013
      

      How reproducible:

      1 out of 2 upgrades
      

      Steps to Reproduce:

      1. Deploy OCP 4.13 (latest GA) on a bare-metal cluster with IPI and OVN-Kubernetes
      2. Upgrade to the latest available 4.14 build (see the command sketch after this list)
      3. Check the cluster version status during the upgrade; at some point the upgrade stops with the message: "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
      4. Check the OVN pods with "oc get pods -n openshift-ovn-kubernetes": some pods are running 7 out of 8 containers (missing ovnkube-node) and constantly restarting, while others run only 5 containers and show errors connecting to the OVN DBs.
      5. Check the cluster operators with "oc get co": mainly dns, network, and machine-config remain at 4.13 and are degraded.
      

      Actual results:

      The upgrade does not complete, and the OVN pods remain in a restart loop with failures.
      

      Expected results:

      The upgrade should complete without issues, and the OVN pods should remain in Running status without restarts.
      

      Additional info:

      • We have tested this with the latest GA version of 4.13 (as of today, Sep 19: 4.13.13 to 4.14.0-rc.1), but we have been observing this for the last 20 days with previous versions of 4.13 and 4.14.
      • Our deployments are single-stack IPv4, with one NIC for provisioning and one NIC for the baremetal (machine) network.

      These are the results from our latest test, upgrading from 4.13.13 to 4.14.0-rc.1:

      $ oc get clusterversion
      NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
      version           True       True         2h8m   Unable to apply 4.14.0-rc.1: an unknown error has occurred: MultipleErrors
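
      The "MultipleErrors" string is only an aggregate; the individual operator errors behind it can be read from the ClusterVersion Failing condition, e.g. (a sketch):

      $ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="Failing")].message}{"\n"}'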
      
      $ oc get mcp
      NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
      master  rendered-master-ebb1da47ad5cb76c396983decb7df1ea  True     False     False     3             3                  3                    0                     3h41m
      worker  rendered-worker-26ccb35941236935a570dddaa0b699db  False    True      True      3             2                  2                    1                     3h41m
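
      One worker is degraded; to see which node and why, the MachineConfigPool conditions can be queried (a sketch, assuming the usual NodeDegraded condition type):

      $ oc get mcp worker -o jsonpath='{range .status.conditions[?(@.type=="NodeDegraded")]}{.message}{"\n"}{end}'
      $ oc get nodes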
      
      $ oc get co
      NAME                                      VERSION      AVAILABLE  PROGRESSING  DEGRADED  SINCE
      authentication                            4.14.0-rc.1  True       False        False     2h21m
      baremetal                                 4.14.0-rc.1  True       False        False     3h38m
      cloud-controller-manager                  4.14.0-rc.1  True       False        False     3h41m
      cloud-credential                          4.14.0-rc.1  True       False        False     2h23m
      cluster-autoscaler                        4.14.0-rc.1  True       False        False     2h21m
      config-operator                           4.14.0-rc.1  True       False        False     3h40m
      console                                   4.14.0-rc.1  True       False        False     2h20m
      control-plane-machine-set                 4.14.0-rc.1  True       False        False     3h40m
      csi-snapshot-controller                   4.14.0-rc.1  True       False        False     2h21m
      dns                                       4.13.13      True       True         True      2h9m
      etcd                                      4.14.0-rc.1  True       False        False     2h40m
      image-registry                            4.14.0-rc.1  True       False        False     2h9m
      ingress                                   4.14.0-rc.1  True       True         True      1h14m
      insights                                  4.14.0-rc.1  True       False        False     3h34m
      kube-apiserver                            4.14.0-rc.1  True       False        False     2h35m
      kube-controller-manager                   4.14.0-rc.1  True       False        False     2h30m
      kube-scheduler                            4.14.0-rc.1  True       False        False     2h29m
      kube-storage-version-migrator             4.14.0-rc.1  False      True         False     2h9m
      machine-api                               4.14.0-rc.1  True       False        False     2h24m
      machine-approver                          4.14.0-rc.1  True       False        False     3h40m
      machine-config                            4.13.13      True       False        True      59m
      marketplace                               4.14.0-rc.1  True       False        False     3h40m
      monitoring                                4.14.0-rc.1  False      True         True      2h3m
      network                                   4.13.13      True       True         True      2h4m
      node-tuning                               4.14.0-rc.1  True       False        False     2h9m
      openshift-apiserver                       4.14.0-rc.1  True       False        False     2h20m
      openshift-controller-manager              4.14.0-rc.1  True       False        False     2h20m
      openshift-samples                         4.14.0-rc.1  True       False        False     2h23m
      operator-lifecycle-manager                4.14.0-rc.1  True       False        False     2h23m
      operator-lifecycle-manager-catalog        4.14.0-rc.1  True       False        False     2h18m
      operator-lifecycle-manager-packageserver  4.14.0-rc.1  True       False        False     2h20m
      service-ca                                4.14.0-rc.1  True       False        False     2h23m
      storage                                   4.14.0-rc.1  True       False        False     3h40m
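
      The degraded-condition messages of the operators still at 4.13 can be dumped with something like (a sketch):

      $ for co in dns network machine-config; do oc get co "$co" -o jsonpath='{.metadata.name}{": "}{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'; done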
      

      Some OVN pods are running 7 out of 8 containers (missing ovnkube-node) and constantly restarting, while pods running only 5 containers show errors connecting to the OVN DBs.

      $ oc get pods -n openshift-ovn-kubernetes -o wide
      NAME                                    READY  STATUS   RESTARTS  AGE    IP             NODE
      ovnkube-control-plane-5f5c598768-czkjv  2/2    Running  0         2h16m  192.168.16.32  dciokd-master-1
      ovnkube-control-plane-5f5c598768-kg69r  2/2    Running  0         2h16m  192.168.16.31  dciokd-master-0
      ovnkube-control-plane-5f5c598768-prfb5  2/2    Running  0         2h16m  192.168.16.33  dciokd-master-2
      ovnkube-node-9hjv9                      5/5    Running  1         3h43m  192.168.16.32  dciokd-master-1
      ovnkube-node-fmswc                      7/8    Running  19        2h10m  192.168.16.36  dciokd-worker-2
      ovnkube-node-pcjhp                      7/8    Running  20        2h15m  192.168.16.35  dciokd-worker-1
      ovnkube-node-q7kcj                      5/5    Running  1         3h43m  192.168.16.33  dciokd-master-2
      ovnkube-node-qsngm                      5/5    Running  3         3h27m  192.168.16.34  dciokd-worker-0
      ovnkube-node-v2d4h                      7/8    Running  20        2h15m  192.168.16.31  dciokd-master-0
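
      Per-container readiness and restart counts for one of the 7/8 pods can be listed with a plain jsonpath query, e.g. (a sketch):

      $ oc get pod ovnkube-node-fmswc -n openshift-ovn-kubernetes -o jsonpath='{range .status.containerStatuses[*]}{.name}{" ready="}{.ready}{" restarts="}{.restartCount}{"\n"}{end}'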
      
      $ oc logs ovnkube-node-9hjv9 -c ovnkube-node -n openshift-ovn-kubernetes | less
      ...
      2023-09-19T03:40:23.112699529Z E0919 03:40:23.112660    5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Northbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
      2023-09-19T03:40:23.112699529Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
      2023-09-19T03:40:23.112699529Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1)
      2023-09-19T03:40:23.112699529Z E0919 03:40:23.112677    5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1
      2023-09-19T03:40:23.114791313Z E0919 03:40:23.114777    5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_NORTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
      2023-09-19T03:40:23.114791313Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
      2023-09-19T03:40:23.114791313Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 memory/show' failed: exit status 1)
      2023-09-19T03:40:23.116492808Z E0919 03:40:23.116478    5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Southbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
      2023-09-19T03:40:23.116492808Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
      2023-09-19T03:40:23.116492808Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1)
      2023-09-19T03:40:23.116492808Z E0919 03:40:23.116488    5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1
      2023-09-19T03:40:23.118468064Z E0919 03:40:23.118450    5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_SOUTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
      2023-09-19T03:40:23.118468064Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
      2023-09-19T03:40:23.118468064Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 memory/show' failed: exit status 1)
      2023-09-19T03:40:25.118085671Z E0919 03:40:25.118056    5883 ovn_northd.go:128] Failed to get ovn-northd status stderr() :(failed to run the command since failed to get ovn-northd's pid: open /var/run/ovn/ovn-northd.pid: no such file or directory)
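
      To confirm whether the OVN DB control sockets the logs complain about actually exist, the DB container of a failing pod can be checked directly (a sketch; the "nbdb" container name is an assumption, adjust it to whatever the first command reports):

      $ oc get pod ovnkube-node-9hjv9 -n openshift-ovn-kubernetes -o jsonpath='{.spec.containers[*].name}{"\n"}'
      $ oc exec -n openshift-ovn-kubernetes ovnkube-node-9hjv9 -c nbdb -- ls -l /var/run/ovn/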
      
