OpenShift Bugs / OCPBUGS-21837

[release-4.12] Clusters with large numbers of CSVs can CrashLoop the NTO and block upgrades


Details

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Affects Version: 4.12.z
    • Target Version: 4.12.z
    • Component: Node Tuning Operator
    • Labels: None
    • Severity: Important
    • No
    • Sprint: CNF Compute Sprint 244
    • 1
    • Rejected
    • False
    • Release Note Text:
      * Previously, a large number of `ClusterServiceVersion` (CSV) resources caused the Node Tuning Operator (NTO) pod to time out while listing them on startup, restart, and loop on the same error. With this update, the issue is fixed. (link:https://issues.redhat.com/browse/OCPBUGS-21837[*OCPBUGS-21837*])

      *Cause*: On NTO pod start, the CSVs are listed. If there is a large number of them, the operation times out and the pod restarts, looping on the same error.
      *Consequence*: Upgrades are blocked.
      *Fix*: Use the pagination feature of the API client so the list request cannot time out; a minimal sketch follows the Details list below.
      *Result*: The bug no longer occurs.
    • Release Note Type: Bug Fix
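
      For context, a minimal sketch of the paginated listing approach follows. It uses the generic dynamic client from client-go purely for illustration; the function name, page size, and per-item handling are assumptions, not the actual NTO change.

      // csvpager.go: walk all ClusterServiceVersions in pages so that no single
      // List request has to stream thousands of objects before a client timeout.
      package csvpager

      import (
          "context"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
      )

      var csvGVR = schema.GroupVersionResource{
          Group:    "operators.coreos.com",
          Version:  "v1alpha1",
          Resource: "clusterserviceversions",
      }

      // forEachCSV lists CSVs across all namespaces in chunks of `limit` and calls
      // fn for each one, resuming from the server-side continue token between pages.
      func forEachCSV(ctx context.Context, client dynamic.Interface, limit int64, fn func(name, namespace string)) error {
          opts := metav1.ListOptions{Limit: limit}
          for {
              page, err := client.Resource(csvGVR).Namespace(metav1.NamespaceAll).List(ctx, opts)
              if err != nil {
                  return err
              }
              for i := range page.Items {
                  fn(page.Items[i].GetName(), page.Items[i].GetNamespace())
              }
              if page.GetContinue() == "" {
                  return nil // last page reached
              }
              opts.Continue = page.GetContinue() // resume from the server-side cursor
          }
      }

      client-go also ships k8s.io/client-go/tools/pager, which wraps the same Limit/Continue loop behind a helper; either form keeps each request small enough to finish within the client timeout.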

    Description

      This is a clone of issue OCPBUGS-20069. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-20032. The following is the description of the original issue:

      Description of problem:

      Upgrading from OCP 4.11.18 to OCP 4.12.14.
      
      A similar issue was fixed via https://issues.redhat.com/browse/OCPBUGS-2437 in 4.11.17, which this cluster already included.
      
      The error is slightly different though:
      
      The original bug error was: F1016 23:19:57.077167       1 main.go:130] unable to remove Performance addons OLM operator: the server was unable to return a response in the time allotted, but may still be processing the request (get clu
      
      The error in cluster is: unable to remove Performance addons OLM operator: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 515; INTERNAL_ERROR; received from peer
      
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507797       1 main.go:68] Go Version: go1.19.6
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507879       1 main.go:69] Go OS/Arch: linux/amd64
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507882       1 main.go:70] node-tuning Version: v4.12.0-202304070941.p0.ge2f6753.assembly.stream-0-g22a5414-dirty
      2023-05-29T20:15:00.565499739Z I0529 20:15:00.565455       1 request.go:601] Waited for 1.046632334s due to client-side throttling, not priority and fairness, request: GET:https://10.98.0.1:443/apis/wgpolicyk8s.io/v1alpha2?timeout=32s
      2023-05-29T20:16:09.413129360Z F0529 20:16:09.413090       1 main.go:136] unable to remove Performance addons OLM operator: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 515; INTERNAL_ERROR; received from peer
      2023-05-29T20:16:09.413255926Z goroutine 1 [running]:
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.stacks(0x1)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x89
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).output(0x2eb6700, 0x3, 0x0, 0xc0003d6ee0, 0x1, {0x234a333?, 0x1?}, 0xc000100c00?, 0x0)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x685
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).printfDepth(0x2eb6700, 0x622d40?, 0x0, {0x0, 0x0}, 0x0?, {0x1c007f9, 0x34}, {0xc000c84050, 0x1, ...})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).printf(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:612
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.Fatalf(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:1516
      2023-05-29T20:16:09.413255926Z main.operatorRun()
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:136 +0x41f
      2023-05-29T20:16:09.413255926Z main.glob..func1(0x2ddd660?, {0x1bb12fd?, 0x1?, 0x1?})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:77 +0x17
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).execute(0x2ddd660, {0xc000054050, 0x1, 0x1})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).ExecuteC(0x2ddd660)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3bd
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).Execute(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:902
      2023-05-29T20:16:09.413255926Z main.main()
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:267 +0x125
      2023-05-29T20:16:09.413255926Z
      2023-05-29T20:16:09.413255926Z goroutine 98 [chan receive, 1 minutes]:
      2023-05-29T20:16:09.413255926Z k8s.io/klog.(*loggingT).flushDaemon(0xc000128540?)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/klog.go:1010 +0x6a
      2023-05-29T20:16:09.413255926Z created by k8s.io/klog.init.0
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/klog.go:411 +0xef
      2023-05-29T20:16:09.413255926Z
      2023-05-29T20:16:09.413255926Z goroutine 117 [IO wait]:
      2023-05-29T20:16:09.413255926Z internal/poll.runtime_pollWait(0x7f258c0d3018, 0x72)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/runtime/netpoll.go:305 +0x89
      2023-05-29T20:16:09.413255926Z internal/poll.(*pollDesc).wait(0xc000740a80?, 0xc00017a000?, 0x0)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:84 +0x32
      2023-05-29T20:16:09.413255926Z internal/poll.(*pollDesc).waitRead(...)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:89
      2023-05-29T20:16:09.413255926Z internal/poll.(*FD).Read(0xc000740a80, {0xc00017a000, 0x6000, 0x6000})
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_unix.go:167 +0x25a
      2023-05-29T20:16:09.413255926Z net.(*netFD).Read(0xc000740a80, {0xc00017a000?, 0xc0001247e0?, 0xc00017d062?})
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/net/fd_posix.go:55 +0x29
      
      
      The list is still slow even with the label filter that excludes copied CSVs:
      
      $ time oc get csv -l '!olm.copiedFrom' -A | wc -l
      27
      
      real    0m36.593s
      user    0m0.121s
      sys     0m0.028s
      
      Even slower without the filter:
      
      $ time oc get csv -A | wc -l
      19371
      
      real    0m50.996s
      user    0m9.805s
      sys     0m0.719s
      
      $ oc get namespaces | wc -l
      1615
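
      For comparison, the same server-side pagination the fix relies on can be exercised from the CLI with the standard --chunk-size option of oc get; the command below is illustrative only and was not run against this cluster:

      $ time oc get csv -A --chunk-size=500 | wc -l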

      Version-Release number of selected component (if applicable):

      4.12.14

      How reproducible:

      On upgrade

      Steps to Reproduce:

      1. Initiate the upgrade.

      Actual results:

      The upgrade is blocked.

      Expected results:

      The upgrade should progress.

      Additional info:

       


            People

              Assignee: Jose Luis Ojosnegros (jojosneg@redhat.com)
              Reporter: OpenShift Prow Bot (openshift-crt-jira-prow)
              QA Contact: Mallapadi Niranjan
              Votes: 1
              Watchers: 16
