OpenShift Bugs / OCPBUGS-21837

[release-4.12] Clusters with large numbers of CSVs can CrashLoop the NTO and block upgrades


Details

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Affects Version: 4.12.z
    • Target Version: 4.12.z
    • Component: Node Tuning Operator
    • Labels: None
    • Severity: Important
    • No
    • Sprint: CNF Compute Sprint 244
    • 1
    • Rejected
    • False
    • Release Note Text:
      * Previously, a large number of `ClusterServiceVersion` (CSV) resources caused the Node Tuning Operator (NTO) pod to time out while listing them on startup, restart, and loop on the same error. With this update, the issue is fixed. (link:https://issues.redhat.com/browse/OCPBUGS-21837[*OCPBUGS-21837*])

      *Cause*: On NTO pod start, the CSVs are listed. If there is a large number of them, the operation times out and the pod restarts, looping on the same error.
      *Consequence*: Upgrades are blocked.
      *Fix*: Use the pagination feature of the API client so the list request cannot time out; a minimal sketch follows the Details list below.
      *Result*: The bug no longer occurs.
    • Release Note Type: Bug Fix
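
      For context, a minimal sketch of the paginated listing approach follows. It uses the generic dynamic client from client-go purely for illustration; the function name, page size, and per-item handling are assumptions, not the actual NTO change.

      // csvpager.go: walk all ClusterServiceVersions in pages so that no single
      // List request has to stream thousands of objects before a client timeout.
      package csvpager

      import (
          "context"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
      )

      var csvGVR = schema.GroupVersionResource{
          Group:    "operators.coreos.com",
          Version:  "v1alpha1",
          Resource: "clusterserviceversions",
      }

      // forEachCSV lists CSVs across all namespaces in chunks of `limit` and calls
      // fn for each one, resuming from the server-side continue token between pages.
      func forEachCSV(ctx context.Context, client dynamic.Interface, limit int64, fn func(name, namespace string)) error {
          opts := metav1.ListOptions{Limit: limit}
          for {
              page, err := client.Resource(csvGVR).Namespace(metav1.NamespaceAll).List(ctx, opts)
              if err != nil {
                  return err
              }
              for i := range page.Items {
                  fn(page.Items[i].GetName(), page.Items[i].GetNamespace())
              }
              if page.GetContinue() == "" {
                  return nil // last page reached
              }
              opts.Continue = page.GetContinue() // resume from the server-side cursor
          }
      }

      client-go also ships k8s.io/client-go/tools/pager, which wraps the same Limit/Continue loop behind a helper; either form keeps each request small enough to finish within the client timeout.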

    Description

      This is a clone of issue OCPBUGS-20069. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-20032. The following is the description of the original issue:

      Description of problem:

      Upgrading from OCP 4.11.18 to OCP 4.12.14.
      
      A similar issue was fixed via https://issues.redhat.com/browse/OCPBUGS-2437 in 4.11.17, which this cluster already included.
      
      The error is slightly different though:
      
      The original bug error was: F1016 23:19:57.077167       1 main.go:130] unable to remove Performance addons OLM operator: the server was unable to return a response in the time allotted, but may still be processing the request (get clu
      
      The error in cluster is: unable to remove Performance addons OLM operator: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 515; INTERNAL_ERROR; received from peer
      
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507797       1 main.go:68] Go Version: go1.19.6
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507879       1 main.go:69] Go OS/Arch: linux/amd64
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507882       1 main.go:70] node-tuning Version: v4.12.0-202304070941.p0.ge2f6753.assembly.stream-0-g22a5414-dirty
      2023-05-29T20:15:00.565499739Z I0529 20:15:00.565455       1 request.go:601] Waited for 1.046632334s due to client-side throttling, not priority and fairness, request: GET:https://10.98.0.1:443/apis/wgpolicyk8s.io/v1alpha2?timeout=32s
      2023-05-29T20:16:09.413129360Z F0529 20:16:09.413090       1 main.go:136] unable to remove Performance addons OLM operator: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 515; INTERNAL_ERROR; received from peer
      2023-05-29T20:16:09.413255926Z goroutine 1 [running]:
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.stacks(0x1)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x89
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).output(0x2eb6700, 0x3, 0x0, 0xc0003d6ee0, 0x1, {0x234a333?, 0x1?}, 0xc000100c00?, 0x0)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x685
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).printfDepth(0x2eb6700, 0x622d40?, 0x0, {0x0, 0x0}, 0x0?, {0x1c007f9, 0x34}, {0xc000c84050, 0x1, ...})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).printf(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:612
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.Fatalf(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:1516
      2023-05-29T20:16:09.413255926Z main.operatorRun()
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:136 +0x41f
      2023-05-29T20:16:09.413255926Z main.glob..func1(0x2ddd660?, {0x1bb12fd?, 0x1?, 0x1?})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:77 +0x17
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).execute(0x2ddd660, {0xc000054050, 0x1, 0x1})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).ExecuteC(0x2ddd660)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3bd
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).Execute(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:902
      2023-05-29T20:16:09.413255926Z main.main()
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:267 +0x125
      2023-05-29T20:16:09.413255926Z
      2023-05-29T20:16:09.413255926Z goroutine 98 [chan receive, 1 minutes]:
      2023-05-29T20:16:09.413255926Z k8s.io/klog.(*loggingT).flushDaemon(0xc000128540?)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/klog.go:1010 +0x6a
      2023-05-29T20:16:09.413255926Z created by k8s.io/klog.init.0
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/klog.go:411 +0xef
      2023-05-29T20:16:09.413255926Z
      2023-05-29T20:16:09.413255926Z goroutine 117 [IO wait]:
      2023-05-29T20:16:09.413255926Z internal/poll.runtime_pollWait(0x7f258c0d3018, 0x72)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/runtime/netpoll.go:305 +0x89
      2023-05-29T20:16:09.413255926Z internal/poll.(*pollDesc).wait(0xc000740a80?, 0xc00017a000?, 0x0)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:84 +0x32
      2023-05-29T20:16:09.413255926Z internal/poll.(*pollDesc).waitRead(...)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:89
      2023-05-29T20:16:09.413255926Z internal/poll.(*FD).Read(0xc000740a80, {0xc00017a000, 0x6000, 0x6000})
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_unix.go:167 +0x25a
      2023-05-29T20:16:09.413255926Z net.(*netFD).Read(0xc000740a80, {0xc00017a000?, 0xc0001247e0?, 0xc00017d062?})
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/net/fd_posix.go:55 +0x29
      
      
      The list is still slow even with the label filter that excludes copied CSVs:
      
      $ time oc get csv -l '!olm.copiedFrom' -A | wc -l
      27
      
      real    0m36.593s
      user    0m0.121s
      sys     0m0.028s
      
      Even slower without the filter:
      
      $ time oc get csv -A | wc -l
      19371
      
      real    0m50.996s
      user    0m9.805s
      sys     0m0.719s
      
      $ oc get namespaces | wc -l
      1615
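
      For comparison, the same server-side pagination the fix relies on can be exercised from the CLI with the standard --chunk-size option of oc get; the command below is illustrative only and was not run against this cluster:

      $ time oc get csv -A --chunk-size=500 | wc -l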

      Version-Release number of selected component (if applicable):

      4.12.14

      How reproducible:

      On upgrade

      Steps to Reproduce:

      1. Initiate the upgrade.

      Actual results:

      The upgrade is blocked.

      Expected results:

      The upgrade should progress.

      Additional info:

       


            People

              Assignee: Jose Luis Ojosnegros (jojosneg@redhat.com)
              Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
              QA Contact: Mallapadi Niranjan
              Votes: 1
              Watchers: 16
