-
Bug
-
Resolution: Won't Do
-
Normal
-
None
-
False
-
-
False
-
CLOSED
-
---
-
---
-
Storage Core Sprint 233, Storage Core Sprint 234, Storage Core Sprint 235, Storage Core Sprint 237, Storage Core Sprint 239
-
None
Description of problem:
We got a case where cdi-deployment wasn't able to start due to PSA.
The deployment controller was clearly stating that in the conditions on the deployment status:
$ oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions
[
  { ... },
  { ... },
  {
    "lastTransitionTime": "2022-09-21T21:31:29Z",
    "lastUpdateTime": "2022-09-21T21:31:29Z",
    "message": "ReplicaSet \"cdi-deployment-6f4888b5cb\" has timed out progressing.",
    "reason": "ProgressDeadlineExceeded",
    "status": "False",
    "type": "Progressing"
  }
]
although cdi-operator was still reporting Progressing=True, Degraded=False on its CR:
$ oc get cdi cdi-kubevirt-hyperconverged -o yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  annotations:
    cdi.kubevirt.io/configAuthority: ""
  creationTimestamp: "2022-09-21T21:21:24Z"
  finalizers:
  - operator.cdi.kubevirt.io
  generation: 2
  labels:
    app: kubevirt-hyperconverged
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: hco-operator
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.11.0
  name: cdi-kubevirt-hyperconverged
  resourceVersion: "39138"
  uid: 8cf09f52-9ad4-48d0-8831-d97923fdfe29
spec:
  certConfig:
    ca:
      duration: 48h0m0s
      renewBefore: 24h0m0s
    server:
      duration: 24h0m0s
      renewBefore: 12h0m0s
  config:
    featureGates:
    - HonorWaitForFirstConsumer
  infra: {}
  uninstallStrategy: BlockUninstallIfWorkloadsExist
  workload: {}
status:
  conditions:
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    message: Started Deployment
    reason: DeployStarted
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Degraded
  operatorVersion: 4.11.0
  phase: Deploying
  targetVersion: 4.11.0
So HCO is going to report Progressing forever as well (HCO reads only the conditions on the CDI CR, not the ones on its operands).
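As an illustration of the gap, the operand-level conditions that such an aggregation would have to surface can be pulled straight from the Deployment (the jq filter here is just one way to select the failing ones):
$ oc get deployments -n openshift-cnv cdi-deployment -o json \
    | jq '.status.conditions[] | select(.type == "ReplicaFailure" or (.type == "Progressing" and .status == "False"))'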
Version-Release number of selected component (if applicable):
4.11.0
How reproducible:
100%
Steps to Reproduce:
1. try to deploy in a way that gets cdi-deployment stuck (for instance CNV 4.11.0 on OCP 4.12.0 with restricted PSA enforced on openshift-cnv; see the sketch after these steps)
2. check `oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions`
3. compare with `oc get cdi cdi-kubevirt-hyperconverged -o json | jq .status.conditions`
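For step 1, one way to get restricted PSA enforced on the namespace (assuming nothing else on the cluster is managing the pod-security labels there) is to label it directly:
$ oc label namespace openshift-cnv \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/enforce-version=latest \
    --overwrite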
Actual results:
a mismatch between cdi-deployment and CDI CR:
on CDI deployment:
status:
  conditions:
  - lastTransitionTime: "2022-09-21T21:21:28Z"
    lastUpdateTime: "2022-09-21T21:21:28Z"
    message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates PodSecurity
      "restricted:latest": allowPrivilegeEscalation != false (container "cdi-controller"
      must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities
      (container "cdi-controller" must set securityContext.capabilities.drop=["ALL"]),
      seccompProfile (pod or container "cdi-controller" must set securityContext.seccompProfile.type
      to "RuntimeDefault" or "Localhost")'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  - lastTransitionTime: "2022-09-21T21:31:29Z"
    lastUpdateTime: "2022-09-21T21:31:29Z"
    message: ReplicaSet "cdi-deployment-6f4888b5cb" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
on CDI CR:
status:
  conditions:
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Available
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    message: Started Deployment
    reason: DeployStarted
    status: "True"
    type: Progressing
  - lastHeartbeatTime: "2022-09-21T21:21:25Z"
    lastTransitionTime: "2022-09-21T21:21:25Z"
    status: "False"
    type: Degraded
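For reference, a container securityContext satisfying the three PodSecurity violations quoted in the ReplicaFailure message above would look roughly like this (illustrative snippet, not CDI's actual manifest):
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  seccompProfile:
    type: RuntimeDefault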
Expected results:
CDI CR conditions correctly aggregate the ones from its operands.
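For example, with the deployment stuck as above, something along these lines on the CDI CR would reflect the operand state (the exact reasons/messages here are only a sketch, not a proposed implementation):
status:
  conditions:
  - status: "False"
    type: Available
  - message: ReplicaSet "cdi-deployment-6f4888b5cb" has timed out progressing.
    reason: ProgressDeadlineExceeded
    status: "False"
    type: Progressing
  - message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates PodSecurity
      "restricted:latest": ...'
    reason: FailedCreate
    status: "True"
    type: Degraded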
Additional info:
Please notice that correctly reporting a failed install/upgrade is actually a prerequisite for the unsafe fail forward upgrades feature:
https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/
Currently we are still not opting in, but that feature will eventually let customers try to recover from stuck upgrades.