Feature Request
Resolution: Unresolved
Major
CNV v4.18.z
Incidents & Support
What is the nature and description of the request?
No information is left about why a guest machine was restarted when it is killed by the EvictionManager.
As a result, from the user's point of view, the guest machine appears to have been silently restarted.
This information should be preserved, just as it is when a Pod created by a Deployment is killed.
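For context, a user would typically look for a restart reason on the VM/VMI objects themselves, e.g. with the commands below (names follow the reproduction in Additional Info); per this report, these do not surface the eviction today:
$ oc describe vm rhel9 -n testvm
$ oc describe vmi rhel9 -n testvm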
Why does the customer need this? (List the business requirements here)
Without this RFE, it's hard for users to debug why their guest machine was restarted.
List any affected packages or components
- kubevirt-hyperconverged-operator
Additional Info (If needed)
Please see the following reproduction steps. We tested this with OpenShift 4.18.13 and OpenShift Virtualization 4.18.4.
Step1. Create the "testvm" namespace.
$ oc create namespace testvm
Step2. Apply LimitRange.
apiVersion: v1
kind: LimitRange
metadata:
  name: limitrange
  namespace: testvm
spec:
  limits:
  - default:
      cpu: 500m
      memory: 1Gi
      ephemeral-storage: 50Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
      ephemeral-storage: 10Mi
    max:
      cpu: "1"
      memory: 8Gi
      ephemeral-storage: 1Gi
    min:
      cpu: 1m
      memory: 512Ki
      ephemeral-storage: 512Ki
    type: Container
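A minimal sketch of applying and verifying the LimitRange, assuming the manifest above is saved as limitrange.yaml (the filename is illustrative):
$ oc apply -f limitrange.yaml
$ oc describe limitrange limitrange -n testvm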
Step3. Create a new guest machine in the "testvm" namespace.
$ virtctl create vm \
    --name rhel9 \
    --run-strategy RerunOnFailure \
    --namespace testvm \
    --instancetype u1.nano \
    --ssh-key <your ssh pub key> \
    --volume-containerdisk src:registry.redhat.io/rhel9/rhel-guest-image:latest \
    | oc create -f -
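Optionally, one can confirm that the LimitRange default (the 50Mi ephemeral-storage limit) was injected into the virt-launcher pod; this sketch assumes the standard kubevirt.io/domain label on launcher pods:
$ oc get pods -n testvm -l kubevirt.io/domain=rhel9 \
    -o jsonpath='{.items[0].spec.containers[*].resources.limits}'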
Step4. Create a 100MB file on the guest machine. The write succeeds, but the guest machine is killed suddenly by the EvictionManager a few minutes later.
$ virtctl ssh -n testvm cloud-user@rhel9
...
[cloud-user@rhel9 ~]$ dd if=/dev/zero of=/tmp/bigfile bs=1024k count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.877866 s, 119 MB/s
[cloud-user@rhel9 ~]$ websocket: close 1006 (abnormal closure): unexpected EOF
client_loop: send disconnect: Broken pipe
exit status 255
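A hedged way to look for eviction-related events in the namespace at this point (output depends on the cluster; the field selector on the event reason is standard oc/kubectl behavior):
$ oc get events -n testvm --sort-by=.lastTimestamp
$ oc get events -n testvm --field-selector reason=Evicted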
Step5. Wait a few minutes, then run the "oc get vm,vmi,pods -n testvm -o wide" command.
Actual Results
From the customer's point of view, it looks as if nothing happened, so they are left confused about why their guest machine was restarted.
$ oc get vm,vmi,pods -n testvm -o wide
NAME                               AGE     STATUS    READY
virtualmachine.kubevirt.io/rhel9   5m16s   Running   True

NAME                                       AGE   PHASE     IP            NODENAME                     READY   LIVE-MIGRATABLE   PAUSED
virtualmachineinstance.kubevirt.io/rhel9   16s   Running   10.128.1.66   control-plane3.example.com   True    True

NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
pod/virt-launcher-rhel9-6fcwk   2/2     Running   0          16s   10.128.1.66   control-plane3.example.com   <none>           1/1
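To double-check that no failed virt-launcher pod was left behind for inspection, one can also filter on the pod phase:
$ oc get pods -n testvm --field-selector=status.phase=Failed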
Expected Results
We did the same test with a Pod created by `oc create deployment --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n testvm pod -- sleep 3000`.
As a result, the old Pod was left with "Error" status.
$ oc get pods -n testvm -o wide
NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
pod-bbcbfbbf-br6v9   0/1     Error     0          44m   10.128.1.59   control-plane3.example.com   <none>           <none>
pod-bbcbfbbf-lvdc9   1/1     Running   0          42m   10.128.1.60   control-plane3.example.com   <none>           <none>
We can check the events of the old Pod and confirm that it was killed by the EvictionManager.
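The events below can be retrieved with something like the following (the pod name comes from the output above):
$ oc describe pod pod-bbcbfbbf-br6v9 -n testvm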
Events:
  Type     Reason   Age   From     Message
  ----     ------   ---   ----     -------
  ...
  Warning  Evicted  44m   kubelet  Pod ephemeral local storage usage exceeds the total limit of containers 50Mi.
  Normal   Killing  44m   kubelet  Stopping container cli
This is the ideal behavior; OpenShift Virtualization should follow the same design.
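Purely as an illustration of the requested behavior (this output does not exist today; the event reason and message are hypothetical), the VMI could carry an equivalent record, for example:
$ oc describe vmi rhel9 -n testvm
...
Events:
  Type     Reason   Age   From     Message
  ----     ------   ---   ----     -------
  Warning  Evicted  44m   kubelet  virt-launcher pod evicted: ephemeral local storage usage exceeds the total limit of containers 50Mi.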
is related to: CNV-63851 Live migrations to contain a reason/trigger (New)