OpenShift Bugs / OCPBUGS-14241

Clusters with large numbers of CSVs can CrashLoop the NTO and block upgrades

    • Bug
    • Resolution: Done-Errata
    • Undefined
    • None
    • 4.12.z
    • Node Tuning Operator
    • None
    • Important
    • No
    • CNF Compute Sprint 239, CNF Compute Sprint 240, CNF Compute Sprint 241, CNF Compute Sprint 242, CNF Compute Sprint 243, CNF Compute Sprint 244, CNF Compute Sprint 245
    • 7
    • Rejected
    • False
    • * Previously, when the Node Tuning Operator (NTO) pod restarted while there were a large number of CSVs in the cluster, the NTO pod would fail and enter the `CrashLoopBackOff` state. With this update, pagination has been added to the CSV list requests, which avoids the `api-server` timeout that resulted in the `CrashLoopBackOff` state. (link:https://issues.redhat.com/browse/OCPBUGS-14241[*OCPBUGS-14241*])
    • Bug Fix
    • Done
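
      The fix described in the release note above replaces a single cluster-wide CSV list call with paginated requests. Below is a minimal sketch of what paginated listing looks like with client-go's dynamic client; it illustrates the technique under stated assumptions (the 500-item page size is arbitrary, the `!olm.copiedFrom` selector is borrowed from the oc command in the description, and kubeconfig loading is simplified) and is not the actual NTO change.

      package main

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Load the local kubeconfig; inside a pod this would be rest.InClusterConfig().
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          dyn, err := dynamic.NewForConfig(cfg)
          if err != nil {
              panic(err)
          }

          csvGVR := schema.GroupVersionResource{
              Group:    "operators.coreos.com",
              Version:  "v1alpha1",
              Resource: "clusterserviceversions",
          }

          // Page through CSVs 500 at a time instead of asking for all ~19k
          // objects in one LIST, so no single request has to outlive the
          // API server timeout.
          opts := metav1.ListOptions{
              Limit:         500,               // assumed page size
              LabelSelector: "!olm.copiedFrom", // same filter as the oc command in the description
          }
          total := 0
          for {
              page, err := dyn.Resource(csvGVR).List(context.TODO(), opts)
              if err != nil {
                  panic(err)
              }
              total += len(page.Items)
              if page.GetContinue() == "" {
                  break
              }
              opts.Continue = page.GetContinue()
          }
          fmt.Printf("listed %d non-copied CSVs across all namespaces\n", total)
      }

      Each page is an independent request that finishes well within the API server's request timeout, which is why a very large CSV count no longer crashes the operator pod.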

      Description of problem:

      Upgrading from OCP 4.11.18 to OCP 4.12.14.
      
      A similar issue was fixed via https://issues.redhat.com/browse/OCPBUGS-2437 in 4.11.17, which this cluster already had.
      
      The error is slightly different though:
      
      The original bug error was: F1016 23:19:57.077167       1 main.go:130] unable to remove Performance addons OLM operator: the server was unable to return a response in the time allotted, but may still be processing the request (get clu
      
      The error in cluster is: unable to remove Performance addons OLM operator: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 515; INTERNAL_ERROR; received from peer
      
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507797       1 main.go:68] Go Version: go1.19.6
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507879       1 main.go:69] Go OS/Arch: linux/amd64
      2023-05-29T20:14:59.507908366Z I0529 20:14:59.507882       1 main.go:70] node-tuning Version: v4.12.0-202304070941.p0.ge2f6753.assembly.stream-0-g22a5414-dirty
      2023-05-29T20:15:00.565499739Z I0529 20:15:00.565455       1 request.go:601] Waited for 1.046632334s due to client-side throttling, not priority and fairness, request: GET:https://10.98.0.1:443/apis/wgpolicyk8s.io/v1alpha2?timeout=32s
      2023-05-29T20:16:09.413129360Z F0529 20:16:09.413090       1 main.go:136] unable to remove Performance addons OLM operator: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 515; INTERNAL_ERROR; received from peer
      2023-05-29T20:16:09.413255926Z goroutine 1 [running]:
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.stacks(0x1)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x89
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).output(0x2eb6700, 0x3, 0x0, 0xc0003d6ee0, 0x1, {0x234a333?, 0x1?}, 0xc000100c00?, 0x0)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x685
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).printfDepth(0x2eb6700, 0x622d40?, 0x0, {0x0, 0x0}, 0x0?, {0x1c007f9, 0x34}, {0xc000c84050, 0x1, ...})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.(*loggingT).printf(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:612
      2023-05-29T20:16:09.413255926Z k8s.io/klog/v2.Fatalf(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/v2/klog.go:1516
      2023-05-29T20:16:09.413255926Z main.operatorRun()
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:136 +0x41f
      2023-05-29T20:16:09.413255926Z main.glob..func1(0x2ddd660?, {0x1bb12fd?, 0x1?, 0x1?})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:77 +0x17
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).execute(0x2ddd660, {0xc000054050, 0x1, 0x1})
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).ExecuteC(0x2ddd660)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3bd
      2023-05-29T20:16:09.413255926Z github.com/spf13/cobra.(*Command).Execute(...)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/github.com/spf13/cobra/command.go:902
      2023-05-29T20:16:09.413255926Z main.main()
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/cmd/cluster-node-tuning-operator/main.go:267 +0x125
      2023-05-29T20:16:09.413255926Z
      2023-05-29T20:16:09.413255926Z goroutine 98 [chan receive, 1 minutes]:
      2023-05-29T20:16:09.413255926Z k8s.io/klog.(*loggingT).flushDaemon(0xc000128540?)
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/klog.go:1010 +0x6a
      2023-05-29T20:16:09.413255926Z created by k8s.io/klog.init.0
      2023-05-29T20:16:09.413255926Z  /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/klog/klog.go:411 +0xef
      2023-05-29T20:16:09.413255926Z
      2023-05-29T20:16:09.413255926Z goroutine 117 [IO wait]:
      2023-05-29T20:16:09.413255926Z internal/poll.runtime_pollWait(0x7f258c0d3018, 0x72)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/runtime/netpoll.go:305 +0x89
      2023-05-29T20:16:09.413255926Z internal/poll.(*pollDesc).wait(0xc000740a80?, 0xc00017a000?, 0x0)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:84 +0x32
      2023-05-29T20:16:09.413255926Z internal/poll.(*pollDesc).waitRead(...)
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_poll_runtime.go:89
      2023-05-29T20:16:09.413255926Z internal/poll.(*FD).Read(0xc000740a80, {0xc00017a000, 0x6000, 0x6000})
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/internal/poll/fd_unix.go:167 +0x25a
      2023-05-29T20:16:09.413255926Z net.(*netFD).Read(0xc000740a80, {0xc00017a000?, 0xc0001247e0?, 0xc00017d062?})
      2023-05-29T20:16:09.413255926Z  /usr/lib/golang/src/net/fd_posix.go:55 +0x29
      
      
      The CSV list is still slow even with the copied-CSV filter:
      
      $ time oc get csv -l '!olm.copiedFrom' -A | wc -l
      27
      
      real    0m36.593s
      user    0m0.121s
      sys     0m0.028s
      
      Even slower without the filter:
      
      $ time oc get csv -A | wc -l
      19371
      
      real    0m50.996s
      user    0m9.805s
      sys     0m0.719s
      
      $ oc get namespaces | wc -l
      1615

      Version-Release number of selected component (if applicable):

      4.12.14

      How reproducible:

      On upgrade

      Steps to Reproduce:

      1. Initiate an upgrade
      

      Actual results:

      Blocked upgrade

      Expected results:

      Should progress

      Additional info:

      Steps to reproduce:
      - Add around 20k CSVs (a sketch for generating them follows below)
      - Restart the cluster-node-tuning-operator pod
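
      The following is a rough sketch of the first step, assuming the cluster already has at least one non-copied CSV that can be cloned; the load-test-%05d naming, the 20,000 count, and cloning into a single namespace are illustrative choices, not a procedure taken from this bug report. Whether OLM later marks the clones Failed does not matter, since only the number of CSV objects to list is relevant.

      package main

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          dyn, err := dynamic.NewForConfig(cfg)
          if err != nil {
              panic(err)
          }

          csvGVR := schema.GroupVersionResource{
              Group:    "operators.coreos.com",
              Version:  "v1alpha1",
              Resource: "clusterserviceversions",
          }

          // Use one existing CSV as a template for the clones.
          seedList, err := dyn.Resource(csvGVR).List(context.TODO(), metav1.ListOptions{
              Limit:         1,
              LabelSelector: "!olm.copiedFrom",
          })
          if err != nil {
              panic(err)
          }
          if len(seedList.Items) == 0 {
              panic("need at least one existing CSV to clone")
          }
          seed := seedList.Items[0]
          ns := seed.GetNamespace()

          // Create ~20k copies with fresh names so the cluster-wide CSV list
          // becomes large enough to reproduce the slow LIST calls.
          for i := 0; i < 20000; i++ {
              clone := seed.DeepCopy()
              clone.SetName(fmt.Sprintf("load-test-%05d", i)) // hypothetical naming scheme
              clone.SetResourceVersion("")
              clone.SetUID("")
              clone.SetCreationTimestamp(metav1.Time{})
              unstructured.RemoveNestedField(clone.Object, "status")
              if _, err := dyn.Resource(csvGVR).Namespace(ns).Create(context.TODO(), clone, metav1.CreateOptions{}); err != nil {
                  fmt.Printf("create %s failed: %v\n", clone.GetName(), err)
              }
          }
      }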
       

              jojosneg@redhat.com Jose Luis Ojosnegros (Inactive)
              rhn-support-mrobson Matt Robson
              Mallapadi Niranjan
              Votes: 1
              Watchers: 14