OpenShift Bugs / OCPBUGS-30978

crio memory usage increases with port-forward

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Fix Version: 4.16.z
    • Affects Versions: 4.12.z, 4.13.z, 4.14.z, 4.15.z, 4.16.z
    • Component: Node / CRI-O
    • Release Note Type: Bug Fix
    • Release Note Text:
      * Previously, when sending multiple failing port-forwarding requests, CRI-O memory usage increased until the node died. With this release, the memory leak when sending a failing port-forward request is fixed and the issue is resolved. (link:https://issues.redhat.com/browse/OCPBUGS-30978[*OCPBUGS-30978*])
    • Status: Done

      Description of problem:

      While sending multiple failing port-forwarding requests, CRI-O memory usage increases until the node dies.

      Version-Release number of selected component (if applicable):

      OCP 4.15.2
      cri-o-1.28.4-4.rhaos4.15.git92d1839.el9.x86_64

      How reproducible:

      Always.

      Steps to Reproduce:

      1. Create this pod:

         apiVersion: v1
         kind: Pod
         metadata:
           name: myclient
         spec:
           restartPolicy: Never
           containers:
           - name: myclient
             image: quay.io/centos/centos:stream9
             ports:
             - containerPort: 80
             args: ["bash", "-c", "sleep 12000"]
             resources:
               requests:
                 cpu: 200m
             imagePullPolicy: IfNotPresent
             readinessProbe:
               failureThreshold: 3000
               httpGet:
                 path: /ready
                 port: 8080
                 scheme: HTTP
               initialDelaySeconds: 1
               periodSeconds: 1
               successThreshold: 1
               timeoutSeconds: 1

      2. Run:

         while sleep 1 ; do kubectl port-forward myclient 15000:8080 & sleep 1 && curl localhost:15000/ && wait; done
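
      While the loop runs, CRI-O's resident set size can be watched from /proc on the node. Below is a minimal Go sketch of such a sampler (Linux-only, illustrative; it samples its own PID here as a stand-in for crio's PID):

      ```go
      package main

      import (
      	"fmt"
      	"os"
      	"strings"
      )

      // rssKB returns the VmRSS value for a PID, read from /proc (Linux-only).
      func rssKB(pid int) (string, error) {
      	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
      	if err != nil {
      		return "", err
      	}
      	for _, line := range strings.Split(string(data), "\n") {
      		if strings.HasPrefix(line, "VmRSS:") {
      			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
      		}
      	}
      	return "", fmt.Errorf("VmRSS not found for pid %d", pid)
      }

      func main() {
      	// In the actual reproduction, the PID would be crio's.
      	rss, err := rssKB(os.Getpid())
      	if err != nil {
      		fmt.Println("error:", err)
      		return
      	}
      	fmt.Println("VmRSS:", rss)
      }
      ```

      Sampling crio's PID this way every few seconds makes the steady RSS growth visible.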
      
          

      Actual results:

      CRI-O memory usage increases for as long as the while loop above is running, until there is no memory left.

      Expected results:

      CRI-O memory usage remains stable.

      Additional info:

      https://issues.redhat.com/browse/OSSM-2344 might be what is causing the many port-forward requests in our case.

          

            [OCPBUGS-30978] crio memory usage increases with port-forward

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.16.1 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:4156


            Sunil Choudhary added a comment - edited

            Fix Version was marked for 4.16.z; I see it is now included in 4.16.1, which is releasing today.


            Peter Hunt added a comment -

            hm I would have expected this to make it into the 4.16.0 errata, why didn't it move? Bug permissions?


            Sunil Choudhary added a comment -

            Checked on 4.16.1, CRI-O memory usage remains stable.

            % oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.16.1    True        False         132m    Cluster version is 4.16.1
            
            % oc get nodes
            NAME                                        STATUS   ROLES                  AGE    VERSION
            ip-10-0-25-203.us-east-2.compute.internal   Ready    worker                 152m   v1.29.5+58452d8
            ip-10-0-34-211.us-east-2.compute.internal   Ready    control-plane,master   159m   v1.29.5+58452d8
            ip-10-0-54-247.us-east-2.compute.internal   Ready    worker                 152m   v1.29.5+58452d8
            ip-10-0-6-197.us-east-2.compute.internal    Ready    control-plane,master   159m   v1.29.5+58452d8
            ip-10-0-87-30.us-east-2.compute.internal    Ready    control-plane,master   158m   v1.29.5+58452d8
            ip-10-0-88-159.us-east-2.compute.internal   Ready    worker                 151m   v1.29.5+58452d8
            
            % cat crio-memory-usage.yaml 
            apiVersion: v1
            kind: Pod
            metadata:
              name: myclient
            spec:
              restartPolicy: Never
              containers:
              - name: myclient
                image: quay.io/centos/centos:stream9
                ports:
                - containerPort: 80
                args: ["bash", "-c", "sleep 12000" ]
                resources:
                  requests:
                    cpu: 200m
                imagePullPolicy: IfNotPresent
                readinessProbe:
                  failureThreshold: 3000
                  httpGet:
                    path: /ready
                    port: 8080
                    scheme: HTTP
                  initialDelaySeconds: 1
                  periodSeconds: 1
                  successThreshold: 1
                  timeoutSeconds: 1 
            
            % oc apply -f crio-memory-usage.yaml 
            ...
            pod/myclient created
            
            % oc get pods
            NAME       READY   STATUS    RESTARTS   AGE
            myclient   0/1     Running   0          9s
            
            % oc get pods -o wide
            NAME       READY   STATUS    RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
            myclient   0/1     Running   0          2m29s   10.131.0.27   ip-10-0-25-203.us-east-2.compute.internal   <none>           <none>
            
            sh-5.1# ps aux | grep -e "/usr/bin/crio" -e PID
            USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
            root        2188  1.0  0.8 2627176 137408 ?      Ssl  05:47   1:47 /usr/bin/crio
            
            sh-5.1# ps aux | grep -e "/usr/bin/crio" -e PID
            USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
            root        2188  1.0  0.8 2627176 137152 ?      Ssl  05:47   1:50 /usr/bin/crio
            


            Hi rh-ee-atokubi,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.


            Ayato Tokubi added a comment - Submitted a PR https://github.com/cri-o/cri-o/pull/8203

            [...]

            I will make use of the existing code and remove emptyStreamOnError to have it run always.

            rh-ee-atokubi, sounds like a good plan! I see no reason not to drain it every time, indeed.


            Ayato Tokubi added a comment -

            rh-ee-kwilczyn,

            It is not run if execution reaches `emptyStreamOnError = false`.

            In our scenario, it failed here https://github.com/cri-o/cri-o/blob/67907babd2b37ae32c9215eb1065f20cebff612a/internal/oci/runtime_oci_linux.go#L39 , which is after that line.

             

            According to the comment, it delegates the draining to PortForwardContainer.

                // defer responsibility of emptying stream to PortForwardContainer
                emptyStreamOnError = false

            https://github.com/cri-o/cri-o/blob/3ea70c49af21d34277525edd26124e7f123786f2/server/container_portforward.go#L62-L63

            However `PortForwardContainer` doesn't.

             

            It seems the draining code was introduced in https://github.com/cri-o/cri-o/commit/7016c3e13fe19272d6a680ca2172df02da75a35f and removed in https://github.com/cri-o/cri-o/commit/7512d31664683c97a30d97b70496c2f850421e28.

             

            I will make use of the existing code and remove emptyStreamOnError to have it run always.
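
            The shape of that change can be sketched as follows. This is a hypothetical simplification with illustrative names, not the actual diff from https://github.com/cri-o/cri-o/pull/8203:

            ```go
            package main

            import (
            	"bytes"
            	"fmt"
            	"io"
            )

            // portForward sketches the fixed flow: the drain runs unconditionally
            // in the defer, so an error after the old "handoff" point can no
            // longer leave unread data queued in the stream.
            func portForward(stream io.Reader) error {
            	defer func() {
            		// Always empty the stream on exit, on every path.
            		io.Copy(io.Discard, stream)
            	}()
            	// Simulate the runtime call failing, as in a failing port-forward.
            	return fmt.Errorf("port forward request failed")
            }

            func main() {
            	stream := bytes.NewBufferString("pending client data")
            	_ = portForward(stream)
            	fmt.Println("bytes left in stream:", stream.Len())
            }
            ```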


            Krzysztof Wilczyński (Inactive) added a comment -

            rh-ee-atokubi, we seem to be doing the draining on error per:

                // if we error in this function before Copying all of the content out of the stream,
                // this stream will eventually get full, which causes leakages and can eventually brick CRI-O
                // ref https://bugzilla.redhat.com/show_bug.cgi?id=1798193
                emptyStreamOnError := true
                defer func() {
                    if emptyStreamOnError && stream != nil {
                        go func() {
                            _, copyError := pools.Copy(io.Discard, stream)
                            log.Errorf(ctx, "Error closing port forward stream after other error: %v", copyError)
                        }()
                    }
                }() 

            This is from server/container_portforward.go#L24-L66. I wonder if this is not working as intended.
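
            Below is a runnable simulation of how that guarded drain can still miss a failure path: once the guard is cleared for the handoff, a later error leaves the stream untouched (illustrative names, not the actual CRI-O code):

            ```go
            package main

            import (
            	"bytes"
            	"fmt"
            	"io"
            )

            // portForward mimics the flow in server/container_portforward.go:
            // a guarded drain is deferred, the guard is cleared before handing
            // responsibility to the runtime, and a failure after that point
            // drains nothing.
            func portForward(stream io.Reader, failAfterHandoff bool) (drained bool) {
            	emptyStreamOnError := true
            	defer func() {
            		if emptyStreamOnError {
            			io.Copy(io.Discard, stream) // empty the stream so the sender cannot block
            			drained = true
            		}
            	}()

            	// "defer responsibility of emptying stream to PortForwardContainer"
            	emptyStreamOnError = false

            	if failAfterHandoff {
            		return // the runtime call fails here; nobody drains the stream now
            	}
            	io.Copy(io.Discard, stream) // happy path: the delegate reads the stream
            	drained = true
            	return
            }

            func main() {
            	stream := bytes.NewBufferString("pending client data")
            	fmt.Println("drained on failure after handoff:", portForward(stream, true))
            }
            ```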


            Krzysztof Wilczyński (Inactive) added a comment - edited

            rh-ee-atokubi, nice detective work! Thank you!

            Re: the following.

            A possible fix on our side is to consume all data before closing the stream, although I wouldn't want to do that because it's kind of magic-ish.

            https://github.com/bitoku/cri-o/commit/7a4e0458844872421da375f734af8d61ab4eaeda

            If this works, then I would consider it a part of the potential fix. Why? There is a very similar "problem" with a very similar "fix" that people employed, which is part of the standard library per:

            From the net/http#Response about the Body member:

            // Body represents the response body.
            //
            // The response body is streamed on demand as the Body field
            // is read. If the network connection fails or the server
            // terminates the response, Body.Read calls return an error.
            //
            // The http Client and Transport guarantee that Body is always
            // non-nil, even on responses without a body or responses with
            // a zero-length body. It is the caller's responsibility to
            // close Body. The default HTTP client's Transport may not
            // reuse HTTP/1.x "keep-alive" TCP connections if the Body is
            // not read to completion and closed.
            //
            // The Body is automatically dechunked if the server replied
            // with a "chunked" Transfer-Encoding.
            //
            // As of Go 1.12, the Body will also implement io.Writer
            // on a successful "101 Switching Protocols" response,
            // as used by WebSockets and HTTP/2's "h2c" mode.
            Body io.ReadCloser

            Especially this fragment:

            // The default HTTP client's Transport may not
            // reuse HTTP/1.x "keep-alive" TCP connections if the Body is
            // not read to completion and closed.

            This leads to people adding code quite similar to (an example):

            res, err := client.Do(req)
            if err != nil {
                // handle error
            }
            io.Copy(io.Discard, res.Body)
            res.Body.Close()

            This drains the response body before it is closed, which helps ensure that connection reuse will take place.
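
            A self-contained version of that pattern, run against a local test server (illustrative; real client code would use its own endpoint):

            ```go
            package main

            import (
            	"fmt"
            	"io"
            	"net/http"
            	"net/http/httptest"
            )

            func main() {
            	// Local server standing in for a real endpoint.
            	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            		fmt.Fprint(w, "hello")
            	}))
            	defer srv.Close()

            	res, err := http.Get(srv.URL)
            	if err != nil {
            		panic(err)
            	}
            	// Drain the body before closing so the keep-alive connection can be reused.
            	if _, err := io.Copy(io.Discard, res.Body); err != nil {
            		fmt.Println("drain error:", err)
            	}
            	res.Body.Close()
            	fmt.Println("status:", res.StatusCode)
            }
            ```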

            In the code snippet linked, the stream variable is of a ReadWriteCloser type, which means you can probably drain it in the same way as the example above, without needing to store the content in a temporary variable.

            This could be added to CRI-O as a fix in the meantime while we wait for the upstream issue to be resolved, provided that this is also safe to do.

            Also, aside from the workaround, adding a sensibly large limit that can be adjusted via a configuration file would be the complete solution here.


              rh-ee-atokubi Ayato Tokubi
              frigault Francois Rigault
              Sunil Choudhary
              Ayato Tokubi, Krzysztof Wilczyński (Inactive)
              Votes: 0
              Watchers: 7
