OpenShift Virtualization / CNV-63022

[RFE] Leave info of why a guest machine was killed by EvictionManager


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Major
    • CNV v4.18.z
    • CNV Infrastructure
    • Incidents & Support

      What is the nature and description of the request?
      No information about why a guest machine was restarted is left when it is killed by the EvictionManager.
      As a result, from the user's point of view, it looks as if their guest machine was suddenly and silently restarted.
      The information should be recorded, just as it is when a Pod created by a Deployment is killed.

      Why does the customer need this? (List the business requirements here)

      Without this RFE, it's hard for users to debug why their guest machine was restarted.

      List any affected packages or components

      • kubevirt-hyperconverged-operator

      Additional Info (If needed)

      Please see the following reproduction steps. We tested this with OpenShift 4.18.13 and OpenShift Virtualization 4.18.4.

      Step1. Create the "testvm" namespace.

      $ oc create namespace testvm 

      Step2. Apply the LimitRange below to the "testvm" namespace; an example apply command follows the manifest.

      apiVersion: v1
      kind: LimitRange
      metadata:
        name: limitrange
        namespace: testvm
      spec:
        limits:
        - default:
            cpu: 500m
            memory: 1Gi
            ephemeral-storage: 50Mi
          defaultRequest:
            cpu: 100m
            memory: 128Mi
            ephemeral-storage: 10Mi
          max:
            cpu: "1" 
            memory: 8Gi
            ephemeral-storage: 1Gi
          min:
            cpu: 1m
            memory: 512Ki
            ephemeral-storage: 512Ki
          type: Container 
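
      Assuming the manifest above is saved as limitrange.yaml (a file name chosen only for illustration), it can be applied with:

      $ oc apply -f limitrange.yaml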

      Step3. Create a new guest machine in the "testvm" namespace.

      $ virtctl create vm \
          --name rhel9 \
          --run-strategy RerunOnFailure \
          --namespace testvm \
          --instancetype u1.nano \
          --ssh-key <your ssh pub key> \
          --volume-containerdisk src:registry.redhat.io/rhel9/rhel-guest-image:latest \
          | oc create -f -
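
      With the RerunOnFailure run strategy the VM should start on its own; before continuing, you can wait for the VMI to reach the Running phase with, for example:

      $ oc get vmi -n testvm -w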

      Step4. Create a 100MB file on the guest machine. The write itself succeeds, but the guest machine will be killed suddenly by the EvictionManager within a few minutes, because the data lands in the virt-launcher Pod's ephemeral storage and pushes it over the 50Mi limit set above.

      $ virtctl ssh -n testvm cloud-user@rhel9
      ...
      [cloud-user@rhel9 ~]$ dd if=/dev/zero of=/tmp/bigfile bs=1024k count=100
      100+0 records in
      100+0 records out
      104857600 bytes (105 MB, 100 MiB) copied, 0.877866 s, 119 MB/s
      [cloud-user@rhel9 ~]$ websocket: close 1006 (abnormal closure): unexpected EOF
      client_loop: send disconnect: Broken pipe
      exit status 255 
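
      While waiting, the eviction and the subsequent restart can be observed by watching the Pods in the namespace, for example:

      $ oc get pods -n testvm -w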

      Step5. Wait a few minutes, then run the "oc get vm,vmi,pods -n testvm -o wide" command.

      Actual Results

      From the customer's point of view, it looks as if nothing happened, so they are left confused about why their guest machine was restarted.
      In the output below, only the AGE of the VMI and the virt-launcher Pod (16s, versus 5m16s for the VM) hints that a restart occurred; no event or condition explains why.

      $ oc get vm,vmi,pods -n testvm -o wide
      NAME                                          AGE     STATUS    READY
      virtualmachine.kubevirt.io/rhel9              5m16s   Running   True
      
      NAME                                          AGE   PHASE     IP             NODENAME                     READY   LIVE-MIGRATABLE   PAUSED
      virtualmachineinstance.kubevirt.io/rhel9      16s   Running   10.128.1.66    control-plane3.example.com   True    True
      
      NAME                                   READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
      pod/virt-launcher-rhel9-6fcwk          2/2     Running   0          16s   10.128.1.66    control-plane3.example.com   <none>           1/1 
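
      In line with the report above, the usual places one would check (our suggestion, not part of the original reproduction) do not surface the eviction reason, for example:

      $ oc describe vmi rhel9 -n testvm
      $ oc get events -n testvm --sort-by=.lastTimestamp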

      Expected Results
      We did the same test with a Pod created by `oc create deployment --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n testvm pod -- sleep 3000`.

      As a result, the old Pod was left with "Error" status.

      $ oc get pods -n testvm -o wide
      NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
      pod-bbcbfbbf-br6v9   0/1     Error     0          44m   10.128.1.59   control-plane3.example.com   <none>           <none>
      pod-bbcbfbbf-lvdc9   1/1     Running   0          42m   10.128.1.60   control-plane3.example.com   <none>           <none>

      We can check the events of the old Pod and confirm that it was killed by the EvictionManager.
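
      The events can be retrieved with, for example:

      $ oc describe pod pod-bbcbfbbf-br6v9 -n testvm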

      Events:
        Type     Reason          Age   From               Message
        ----     ------          ----  ----               -------
      ...
        Warning  Evicted         44m   kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 50Mi.
        Normal   Killing         44m   kubelet            Stopping container cli 

      This is the ideal behavior. OpenShift Virtualization should follow the same design and leave a record of why a guest machine was killed by the EvictionManager.
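
      As a purely illustrative sketch (these events do not exist today; the reason, message wording, and emitting component are assumptions, not existing behavior), the request is for something along these lines to be visible on the VirtualMachine or VirtualMachineInstance:

      Events:
        Type     Reason    Age   From          Message
        ----     ------    ----  ----          -------
        Warning  Evicted   44m   (component)   Guest machine was stopped because the virt-launcher Pod exceeded its ephemeral-storage limit of 50Mi.
        Normal   Killing   44m   (component)   Stopping guest machine rhel9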

              rhn-support-mtessun Martin Tessun
              rhn-support-kahara Kazuhisa Hara
              Geetika Kapoor