OpenShift Bugs / OCPBUGS-16904

All nodes Memory cgroup out of memory after stress testing in 4.13.x


    • Critical
    • Rejected
      8/9: customer successfully tested workaround; RHEL pending to analyze m-g/srs; KNIECO-7801
      8/2: telco priority pending clarification of bug in question (DM)

      Description of problem:

      The OCP 4.13.0 MNO (3+5) environment is no longer usable for other testing. All nodes keep printing "Memory cgroup out of memory" on the BMC console/terminal. When logging in to a master node via SSH, even ls -ltr cannot be executed, and kubectl does not work either. After rebooting all nodes from the BMC, the cluster starts working again. This looks like a memory leak issue.
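      On a node that is still responsive, the cgroup OOM kills can be confirmed from the kernel log. A minimal sketch (the node name is a placeholder for one of the affected masters/workers):
      ~~~
      # Kernel messages on the node should show the cgroup OOM killer firing repeatedly
      oc debug node/master-0 -- chroot /host journalctl -k | grep -i "memory cgroup out of memory"
      ~~~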

      Version-Release number of selected component (if applicable):

      OCP 4.13.0

      How reproducible:

      Use the deployment file below and you will be able to reproduce the issue. Increase the number of replicas to 40 and the nodes will gradually start to go down.

      Steps to Reproduce:

      1. Create a deployment.yaml file with the content below:
      ~~~
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: stress-ng-test-limit
        labels:
          k8s-app: stress-ng-test-limit
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: stress-ng-test-limit
        template:
          metadata:
            name: stress-ng-test-limit
            labels:
              app: stress-ng-test-limit
          spec:
            serviceAccount: stress-ng-sa
            containers:
            - name: stress-ng-test-container-limit
              image: quay.io/dmoessne/stress-ng-test:0.2
              command: [ "/bin/bash", "-c", "--" ]
              args: [ "stress-ng -c 1 --vm 32 --vm-bytes 100% --vm-method all --madvise 2"]
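              # NOTE: --vm 32 starts 32 memory stressor workers and --vm-bytes 100% sizes
              # each worker at 100% of available memory, so every pod tries to allocate far
              # more than its 1G limit and is repeatedly OOM-killed and restarted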
              #command: ["sleep", "infinity"]
              resources:
                 requests:
                   memory: 100Mi
                 limits:
                   memory: "1G"
              securityContext:
                seccompProfile:
                  type: RuntimeDefault
                capabilities:
                  drop:
                  - ALL
                privileged: true
      ~~~
      2. [quickcluster@upi-0 ~]$ oc create -f deployment.yaml 
      deployment.apps/stress-ng-test-limit created
      3. [quickcluster@upi-0 ~]$ oc get all
      NAME                                        READY   STATUS    RESTARTS   AGE
      pod/stress-ng-test-limit-59c59dbf65-j6z84   1/1     Running   0          59s

      NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/stress-ng-test-limit   1/1     1            1           59s

      NAME                                              DESIRED   CURRENT   READY   AGE
      replicaset.apps/stress-ng-test-limit-59c59dbf65   1         1         1       59s
      
      4. Increase replicas to 40
      
      oc scale --replicas=40 deployment.apps/stress-ng-test-limit
      
      5. After some time nodes will go down, the OpenShift web console will stop working, and Operators will start behaving abnormally (see the commands sketched below).
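      The degradation described in step 5 can be followed with standard client commands while the deployment scales up; a minimal sketch (oc adm top requires the cluster metrics stack):
      ~~~
      # Watch node readiness flip to NotReady as memory pressure builds
      oc get nodes -w

      # Track per-node memory consumption
      oc adm top nodes

      # Check how often the stress pods are being OOM-killed and restarted
      oc get pods -l app=stress-ng-test-limit
      ~~~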

      Actual results:

      Nodes go down and the cluster becomes unusable.

      Expected results:

      During stress testing, nodes should not go down.
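      Nodes are normally protected from this kind of pod memory pressure by reserving memory for system daemons and by kubelet eviction thresholds. The KubeletConfig below is only an illustrative sketch, not the specific workaround referenced in the status notes above; the name, pool selector, and values are assumptions:
      ~~~
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: worker-memory-reservation        # illustrative name
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""   # assumes the default worker pool
        kubeletConfig:
          systemReserved:
            memory: 3Gi                        # illustrative value
          evictionHard:
            memory.available: "500Mi"          # illustrative value
      ~~~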

      Additional info:

       

              msivak@redhat.com Martin Sivak
              rhn-support-vismishr Vishvranjan Mishra
              Mallapadi Niranjan Mallapadi Niranjan