OpenShift Virtualization / CNV-21402

[2128906] CDI operator is not always collecting/aggregating Progressing and Degraded conditions from its operands



      Description of problem:
      We got a case where cdi-deployment wasn't able to start due to PSA.
      The deployment controller was clearly reporting this in the conditions on the deployment status:

      $ oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions
      [
        {
          "lastTransitionTime": "2022-09-21T21:21:28Z",
          "lastUpdateTime": "2022-09-21T21:21:28Z",
          "message": "Deployment does not have minimum availability.",
          "reason": "MinimumReplicasUnavailable",
          "status": "False",
          "type": "Available"
        },
        {
          "lastTransitionTime": "2022-09-21T21:21:28Z",
          "lastUpdateTime": "2022-09-21T21:21:28Z",
          "message": "pods \"cdi-deployment-6f4888b5cb-r9f5h\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"cdi-controller\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"cdi-controller\" must set securityContext.capabilities.drop=[\"ALL\"]), seccompProfile (pod or container \"cdi-controller\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")",
          "reason": "FailedCreate",
          "status": "True",
          "type": "ReplicaFailure"
        },
        {
          "lastTransitionTime": "2022-09-21T21:31:29Z",
          "lastUpdateTime": "2022-09-21T21:31:29Z",
          "message": "ReplicaSet \"cdi-deployment-6f4888b5cb\" has timed out progressing.",
          "reason": "ProgressDeadlineExceeded",
          "status": "False",
          "type": "Progressing"
        }
      ]

      although cdi-operator was still reporting Progressing=True, Degraded=False on its CR:

      $ oc get cdi cdi-kubevirt-hyperconverged -o yaml
      apiVersion: cdi.kubevirt.io/v1beta1
      kind: CDI
      metadata:
        annotations:
          cdi.kubevirt.io/configAuthority: ""
        creationTimestamp: "2022-09-21T21:21:24Z"
        finalizers:
        - operator.cdi.kubevirt.io
        generation: 2
        labels:
          app: kubevirt-hyperconverged
          app.kubernetes.io/component: storage
          app.kubernetes.io/managed-by: hco-operator
          app.kubernetes.io/part-of: hyperconverged-cluster
          app.kubernetes.io/version: 4.11.0
        name: cdi-kubevirt-hyperconverged
        resourceVersion: "39138"
        uid: 8cf09f52-9ad4-48d0-8831-d97923fdfe29
      spec:
        certConfig:
          ca:
            duration: 48h0m0s
            renewBefore: 24h0m0s
          server:
            duration: 24h0m0s
            renewBefore: 12h0m0s
        config:
          featureGates:
          - HonorWaitForFirstConsumer
        infra: {}
        uninstallStrategy: BlockUninstallIfWorkloadsExist
        workload: {}
      status:
        conditions:
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Available
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          message: Started Deployment
          reason: DeployStarted
          status: "True"
          type: Progressing
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Degraded
        operatorVersion: 4.11.0
        phase: Deploying
        targetVersion: 4.11.0

      so HCO is going to report Progressing forever as well (HCO reads only the conditions on the CDI CR, not the ones on its operands).
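
      To illustrate the kind of aggregation this bug asks for, here is a minimal sketch in Go, assuming the operator can read its operand Deployment with plain client-go types; operandConditions and aggregateFromDeployment are hypothetical names, not CDI's actual code:

      package main

      import (
          "fmt"

          appsv1 "k8s.io/api/apps/v1"
          corev1 "k8s.io/api/core/v1"
      )

      // operandConditions is a simplified stand-in for the Available/Progressing/
      // Degraded triple that the CDI CR exposes in status.conditions.
      type operandConditions struct {
          Available, Progressing, Degraded bool
          Message                          string
      }

      // aggregateFromDeployment folds an operand Deployment's conditions into
      // operator-level conditions: ReplicaFailure=True or a rollout that hit
      // ProgressDeadlineExceeded surfaces as Degraded=True instead of leaving
      // the CR at Progressing=True forever.
      func aggregateFromDeployment(dep *appsv1.Deployment) operandConditions {
          out := operandConditions{Progressing: true} // "still deploying" by default
          for _, c := range dep.Status.Conditions {
              switch c.Type {
              case appsv1.DeploymentAvailable:
                  out.Available = c.Status == corev1.ConditionTrue
              case appsv1.DeploymentReplicaFailure:
                  if c.Status == corev1.ConditionTrue { // e.g. the PSA FailedCreate above
                      out.Degraded = true
                      out.Message = c.Message
                  }
              case appsv1.DeploymentProgressing:
                  if c.Status == corev1.ConditionFalse && c.Reason == "ProgressDeadlineExceeded" {
                      out.Progressing = false // the rollout is stuck, not merely slow
                      out.Degraded = true
                      out.Message = c.Message
                  }
              }
          }
          return out
      }

      func main() {
          // Conditions mirroring the stuck cdi-deployment from this report.
          dep := &appsv1.Deployment{Status: appsv1.DeploymentStatus{
              Conditions: []appsv1.DeploymentCondition{
                  {Type: appsv1.DeploymentReplicaFailure, Status: corev1.ConditionTrue,
                      Reason: "FailedCreate", Message: "pods ... is forbidden: violates PodSecurity"},
                  {Type: appsv1.DeploymentProgressing, Status: corev1.ConditionFalse,
                      Reason: "ProgressDeadlineExceeded", Message: "ReplicaSet has timed out progressing."},
              },
          }}
          fmt.Printf("%+v\n", aggregateFromDeployment(dep))
          // Prints: {Available:false Progressing:false Degraded:true Message:ReplicaSet has timed out progressing.}
      }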

      Version-Release number of selected component (if applicable):
      4.11.0

      How reproducible:
      100%

      Steps to Reproduce:
      1. try a deployment where cdi-deployment gets stuck (for instance, CNV 4.11.0 on OCP 4.12.0 with restricted PSA enforced on the openshift-cnv namespace)
      2. check `oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions`
      3. compare with `oc get cdi cdi-kubevirt-hyperconverged -o json | jq .status.conditions`

      Actual results:
      a mismatch between cdi-deployment and CDI CR:

      on CDI deployment:

      status:
        conditions:
        - lastTransitionTime: "2022-09-21T21:21:28Z"
          lastUpdateTime: "2022-09-21T21:21:28Z"
          message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates PodSecurity
            "restricted:latest": allowPrivilegeEscalation != false (container "cdi-controller"
            must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities
            (container "cdi-controller" must set securityContext.capabilities.drop=["ALL"]),
            seccompProfile (pod or container "cdi-controller" must set securityContext.seccompProfile.type
            to "RuntimeDefault" or "Localhost")'
          reason: FailedCreate
          status: "True"
          type: ReplicaFailure
        - lastTransitionTime: "2022-09-21T21:31:29Z"
          lastUpdateTime: "2022-09-21T21:31:29Z"
          message: ReplicaSet "cdi-deployment-6f4888b5cb" has timed out progressing.
          reason: ProgressDeadlineExceeded
          status: "False"
          type: Progressing

      on CDI CR:

      status:
        conditions:
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Available
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          message: Started Deployment
          reason: DeployStarted
          status: "True"
          type: Progressing
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Degraded
      Expected results:
      The CDI CR conditions correctly aggregate the ones from its operands.
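
      For the failure above, correct aggregation could look roughly like this on the CDI CR (an illustrative sketch, not the exact output a fix will produce; the Degraded message is assumed to be propagated from the Deployment):

      status:
        conditions:
        - status: "False"
          type: Available
        - status: "False"
          type: Progressing
        - message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates
            PodSecurity "restricted:latest": ...'
          reason: FailedCreate
          status: "True"
          type: Degraded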

      Additional info:
      Please note that correctly reporting a failed install/upgrade is actually a prerequisite for the unsafe fail-forward upgrades feature:
      https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/

      We are currently not opting in to it, but that feature will eventually let customers try to recover from stuck upgrades.
