OpenShift Bugs / OCPBUGS-16904

All nodes Memory cgroup out of memory after stress testing in 4.13.x


    • Critical
    • Rejected
    • 8/9: customer successfully tested workaround; RHEL pending to analyze m-g/srs; KNIECO-7801
    • 8/2: telco priority pending clarification of bug in question (DM)

      Description of problem:

      The OCP 4.13.0 MNO (3+5) environment is not usable for any other testing. All nodes print "Memory cgroup out of memory" on the BMC console/terminal. When logging in to a master node via SSH, not even ls -ltr can be executed, and kubectl does not work either. After rebooting all the nodes from the BMC, the cluster started to work again. This looks like a memory leak issue.
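
      For reference, a minimal sketch (assuming cluster-admin access from a workstation and that the API server still responds) of how the OOM-killer messages and node memory usage can be confirmed; <node-name> is a placeholder for an affected node:

      ~~~
      # List nodes and their current memory usage (needs cluster metrics).
      oc get nodes
      oc adm top nodes

      # Check a node's kernel log for OOM-killer events.
      oc debug node/<node-name> -- chroot /host journalctl -k --no-pager | grep -i "memory cgroup out of memory"
      ~~~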

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Please use the deployment file below (the one I used) and you will be able to reproduce the issue. Increase the number of replicas to 40 and the nodes will slowly start to go down.

      Steps to Reproduce:

      1. Create a deployment.yaml file with the below content:
      ~~~
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: stress-ng-test-limit
        labels:
          k8s-app: stress-ng-test-limit
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: stress-ng-test-limit
        template:
          metadata:
            name: stress-ng-test-limit
            labels:
              app: stress-ng-test-limit
          spec:
            serviceAccount: stress-ng-sa
            containers:
            - name: stress-ng-test-container-limit
              image: quay.io/dmoessne/stress-ng-test:0.2
              command: [ "/bin/bash", "-c", "--" ]
              args: [ "stress-ng -c 1 --vm 32 --vm-bytes 100% --vm-method all --madvise 2"]
              #command: ["sleep", "infinity"]
              resources:
                requests:
                  memory: 100Mi
                limits:
                  memory: "1G"
              securityContext:
                seccompProfile:
                  type: RuntimeDefault
                capabilities:
                  drop:
                  - ALL
                privileged: true
      ~~~
      2. [quickcluster@upi-0 ~]$ oc create -f deployment.yaml 
      deployment.apps/stress-ng-test-limit created
      3. [quickcluster@upi-0 ~]$ oc get all
      NAME                                        READY   STATUS    RESTARTS   AGE
      pod/stress-ng-test-limit-59c59dbf65-j6z84   1/1     Running   0          59s

      NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
      deployment.apps/stress-ng-test-limit   1/1     1            1           59s

      NAME                                              DESIRED   CURRENT   READY   AGE
      replicaset.apps/stress-ng-test-limit-59c59dbf65   1         1         1       59s
      
      4. Increase replicas to 40
      
      oc scale --replicas=40 deployment.apps/stress-ng-test-limit
      
      5. After some time the nodes will go down, the OpenShift web console will stop working, and Operators will start behaving abnormally.
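
      A rough sketch of commands that can be used to watch the nodes degrade while the deployment is scaled up (assumes cluster-admin access and that the API server still answers; <node-name> is a placeholder):

      ~~~
      # Watch node status flip to NotReady as memory is exhausted.
      oc get nodes -w

      # Check whether a node reports MemoryPressure.
      oc describe node <node-name> | grep -A 6 "Conditions:"

      # See which cluster operators start to degrade.
      oc get clusteroperators
      ~~~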

      Actual results:

      Nodes go down and the cluster becomes unusable until the nodes are rebooted from the BMC.

      Expected results:

      During stress testing, nodes should not go down; the stress-ng pods should be contained by their 1G memory limit instead of exhausting node memory.
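
      The workaround the customer tested is not detailed in this report. As a general illustration only, node stability under memory pressure is usually protected by reserving memory for the kubelet and OS via a KubeletConfig; the pool selector and reservation size below are assumptions, not the verified fix for this bug. Save as e.g. kubeletconfig-memory.yaml and apply with oc apply -f kubeletconfig-memory.yaml:

      ~~~
      # Illustrative sketch: reserve memory on worker nodes so system daemons
      # are not starved before pods are evicted or OOM-killed.
      # The 3Gi value is an assumption, not a value verified for this bug.
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: worker-memory-reservation
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""
        kubeletConfig:
          systemReserved:
            memory: "3Gi"
      ~~~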

      Additional info:

       

            msivak@redhat.com Martin Sivak
            rhn-support-vismishr Vishvranjan Mishra
            Mallapadi Niranjan Mallapadi Niranjan