OpenShift Virtualization / CNV-63022

[RFE] Leave info of why a guest machine was killed by EvictionManager


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Major
    • CNV v4.18.z
    • CNV Infrastructure
    • Incidents & Support

      What is the nature and description of the request?
      No information about why a guest machine was restarted is left when it is killed by the EvictionManager.
      As a result, from the user's point of view, it looks as if their guest machine was suddenly and silently restarted.
      The information should be recorded, just as it is when a Pod created by a Deployment is killed.

      Why does the customer need this? (List the business requirements here)

      Without this RFE, it's hard for users to debug why their guest machine was restarted.

      List any affected packages or components

      • kubevirt-hyperconverged-operator

      Additional Info (If needed)

      Please see the following reproduction steps. We tested this with OpenShift 4.18.13 and OpenShift Virtualization 4.18.4.

      Step1. Create the "testvm" namespace.

      $ oc create namespace testvm 

      Step2. Apply the LimitRange below to the "testvm" namespace; an example apply command follows the manifest.

      apiVersion: v1
      kind: LimitRange
      metadata:
        name: limitrange
        namespace: testvm
      spec:
        limits:
        - default:
            cpu: 500m
            memory: 1Gi
            ephemeral-storage: 50Mi
          defaultRequest:
            cpu: 100m
            memory: 128Mi
            ephemeral-storage: 10Mi
          max:
            cpu: "1" 
            memory: 8Gi
            ephemeral-storage: 1Gi
          min:
            cpu: 1m
            memory: 512Ki
            ephemeral-storage: 512Ki
          type: Container 
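
      Assuming the manifest above is saved as limitrange.yaml (a file name chosen only for illustration), it can be applied with:

      $ oc apply -f limitrange.yaml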

      Step3. Create a new guest machine in the "testvm" namespace.

      $ virtctl create vm \
          --name rhel9 \
          --run-strategy RerunOnFailure \
          --namespace testvm \
          --instancetype u1.nano \
          --ssh-key <your ssh pub key> \
          --volume-containerdisk src:registry.redhat.io/rhel9/rhel-guest-image:latest \
          | oc create -f -
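
      With the RerunOnFailure run strategy the VM should start on its own; before continuing, you can wait for the VMI to reach the Running phase with, for example:

      $ oc get vmi -n testvm -w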

      Step4. Create a 100MB file on the guest machine. The write itself succeeds, but the guest machine will be killed suddenly by the EvictionManager within a few minutes, because the data lands in the virt-launcher Pod's ephemeral storage and pushes it over the 50Mi limit set above.

      $ virtctl ssh -n testvm cloud-user@rhel9
      ...
      [cloud-user@rhel9 ~]$ dd if=/dev/zero of=/tmp/bigfile bs=1024k count=100
      100+0 records in
      100+0 records out
      104857600 bytes (105 MB, 100 MiB) copied, 0.877866 s, 119 MB/s
      [cloud-user@rhel9 ~]$ websocket: close 1006 (abnormal closure): unexpected EOF
      client_loop: send disconnect: Broken pipe
      exit status 255 
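
      While waiting, the eviction and the subsequent restart can be observed by watching the Pods in the namespace, for example:

      $ oc get pods -n testvm -w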

      Step5. Wait a few minutes, then run the "oc get vm,vmi,pods -n testvm -o wide" command.

      Actual Results

      From the customer's point of view, it looks as if nothing happened, so they are left confused about why their guest machine was restarted.
      In the output below, only the AGE of the VMI and the virt-launcher Pod (16s, versus 5m16s for the VM) hints that a restart occurred; no event or condition explains why.

      $ oc get vm,vmi,pods -n testvm -o wide
      NAME                                          AGE     STATUS    READY
      virtualmachine.kubevirt.io/rhel9              5m16s   Running   True
      
      NAME                                          AGE   PHASE     IP             NODENAME                     READY   LIVE-MIGRATABLE   PAUSED
      virtualmachineinstance.kubevirt.io/rhel9      16s   Running   10.128.1.66    control-plane3.example.com   True    True
      
      NAME                                   READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
      pod/virt-launcher-rhel9-6fcwk          2/2     Running   0          16s   10.128.1.66    control-plane3.example.com   <none>           1/1 
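
      In line with the report above, the usual places one would check (our suggestion, not part of the original reproduction) do not surface the eviction reason, for example:

      $ oc describe vmi rhel9 -n testvm
      $ oc get events -n testvm --sort-by=.lastTimestamp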

      Expected Results
      We did the same test with a Pod created by `oc create deployment --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n testvm pod -- sleep 3000`.

      As a result, the old Pod was left with "Error" status.

      $ oc get pods -n testvm -o wide
      NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
      pod-bbcbfbbf-br6v9   0/1     Error     0          44m   10.128.1.59   control-plane3.example.com   <none>           <none>
      pod-bbcbfbbf-lvdc9   1/1     Running   0          42m   10.128.1.60   control-plane3.example.com   <none>           <none>

      We can check the events of the old Pod and confirm that it was killed by the EvictionManager.
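
      The events can be retrieved with, for example:

      $ oc describe pod pod-bbcbfbbf-br6v9 -n testvm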

      Events:
        Type     Reason          Age   From               Message
        ----     ------          ----  ----               -------
      ...
        Warning  Evicted         44m   kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 50Mi.
        Normal   Killing         44m   kubelet            Stopping container cli 

      This is the ideal behavior. OpenShift Virtualization should follow the same design and leave a record of why a guest machine was killed by the EvictionManager.
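
      As a purely illustrative sketch (these events do not exist today; the reason, message wording, and emitting component are assumptions, not existing behavior), the request is for something along these lines to be visible on the VirtualMachine or VirtualMachineInstance:

      Events:
        Type     Reason    Age   From          Message
        ----     ------    ----  ----          -------
        Warning  Evicted   44m   (component)   Guest machine was stopped because the virt-launcher Pod exceeded its ephemeral-storage limit of 50Mi.
        Normal   Killing   44m   (component)   Stopping guest machine rhel9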

              rhn-support-mtessun Martin Tessun
              rhn-support-kahara Kazuhisa Hara
              Geetika Kapoor