- Bug
- Resolution: Done
- Critical
- 4.13
- Critical
- None
- False
Description of problem:
After upgrading the ROSA cluster to version 4.13.43, the customer cannot open pod terminals via oc rsh on some of their pods.
Below are the error messages they see:
ERRO[0000] exec failed: unable to start container process: read init-p: connection reset by peer
command terminated with exit code 255
We have ruled out tight pod limits as the cause. We have also ruled out Twistlock: it was disabled and removed completely, but the customer still faces the same issue. Reference KCS: https://access.redhat.com/solutions/7062219 and https://access.redhat.com/solutions/3335421.
We also set --pod-pids-limit=16384 by following these docs: https://docs.openshift.com/rosa/rosa_cluster_admin/rosa-configuring-pid-limits.html because the "read init-p: connection reset by peer" error message is associated with an exhausted PID limit. That did not fix the issue either.
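To confirm whether PID exhaustion is actually in play, the live PID count of an affected container can be compared against its cgroup limit. The following is a minimal sketch, not a verified procedure for this cluster: it assumes the pids cgroup controller files pids.current and pids.max are readable in some directory, and that directory (which depends on the cgroup version and on the pod) has to be located by the reader. The function names and the 0.9 threshold are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_pids_max(text: str) -> Optional[int]:
    """Return the numeric PID limit, or None when the cgroup reports 'max' (no limit)."""
    value = text.strip()
    return None if value == "max" else int(value)


def pids_usage(cgroup_dir: str) -> Tuple[int, Optional[int]]:
    """Read (current, limit) from a directory holding pids.current and pids.max.

    cgroup_dir is an assumption: pass the directory of the container's
    pids cgroup as found on the node.
    """
    base = Path(cgroup_dir)
    current = int((base / "pids.current").read_text().strip())
    limit = parse_pids_max((base / "pids.max").read_text())
    return current, limit


def is_near_limit(current: int, limit: Optional[int], threshold: float = 0.9) -> bool:
    """True when usage is at or above the given fraction of the limit."""
    if limit is None:
        return False  # unlimited cgroup, nothing to exhaust
    return current >= threshold * limit
```

If is_near_limit returns False for an affected container while oc rsh still fails, that would be consistent with the observation that raising the pod PID limit did not help.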
It would also be worth reviewing the release notes for version 4.13.43, to which the customer upgraded, to see what changed at the RHCOS level.
The 4.12-to-4.13 upgrade moves RHCOS from RHEL 8 to RHEL 9.2 packages: https://docs.openshift.com/container-platform/4.13/release_notes/ocp-4-13-release-notes.html#ocp-4-13-rhcos-rhel-9-2-packages
Version-Release number of selected component (if applicable):
4.13.43
Cluster ID: fc39e80e-d2a5-40d5-8d7d-d91a31e24106
Related OHSS ticket: https://issues.redhat.com/browse/OHSS-35807
Related slack thread on #sd-sre-platform: https://redhat-internal.slack.com/archives/CCX9DB894/p1721063734654469