OpenShift Virtualization / CNV-21402

[2128906] CDI operator is not always collecting/aggregating Progressing and Degraded conditions from its operands



      Description of problem:
      We got a case where cdi-deployment wasn't able to start due to PSA.
      The deployment controller was clearly reporting this in the conditions on the deployment status:

      $ oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions
      [
        {
          "lastTransitionTime": "2022-09-21T21:21:28Z",
          "lastUpdateTime": "2022-09-21T21:21:28Z",
          "message": "Deployment does not have minimum availability.",
          "reason": "MinimumReplicasUnavailable",
          "status": "False",
          "type": "Available"
        },
        {
          "lastTransitionTime": "2022-09-21T21:21:28Z",
          "lastUpdateTime": "2022-09-21T21:21:28Z",
          "message": "pods \"cdi-deployment-6f4888b5cb-r9f5h\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"cdi-controller\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"cdi-controller\" must set securityContext.capabilities.drop=[\"ALL\"]), seccompProfile (pod or container \"cdi-controller\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")",
          "reason": "FailedCreate",
          "status": "True",
          "type": "ReplicaFailure"
        },
        {
          "lastTransitionTime": "2022-09-21T21:31:29Z",
          "lastUpdateTime": "2022-09-21T21:31:29Z",
          "message": "ReplicaSet \"cdi-deployment-6f4888b5cb\" has timed out progressing.",
          "reason": "ProgressDeadlineExceeded",
          "status": "False",
          "type": "Progressing"
        }
      ]

      although cdi-operator was still reporting Progressing=True, Degraded=False on its CR:

      $ oc get cdi cdi-kubevirt-hyperconverged -o yaml
      apiVersion: cdi.kubevirt.io/v1beta1
      kind: CDI
      metadata:
        annotations:
          cdi.kubevirt.io/configAuthority: ""
        creationTimestamp: "2022-09-21T21:21:24Z"
        finalizers:
        - operator.cdi.kubevirt.io
        generation: 2
        labels:
          app: kubevirt-hyperconverged
          app.kubernetes.io/component: storage
          app.kubernetes.io/managed-by: hco-operator
          app.kubernetes.io/part-of: hyperconverged-cluster
          app.kubernetes.io/version: 4.11.0
        name: cdi-kubevirt-hyperconverged
        resourceVersion: "39138"
        uid: 8cf09f52-9ad4-48d0-8831-d97923fdfe29
      spec:
        certConfig:
          ca:
            duration: 48h0m0s
            renewBefore: 24h0m0s
          server:
            duration: 24h0m0s
            renewBefore: 12h0m0s
        config:
          featureGates:
          - HonorWaitForFirstConsumer
        infra: {}
        uninstallStrategy: BlockUninstallIfWorkloadsExist
        workload: {}
      status:
        conditions:
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Available
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          message: Started Deployment
          reason: DeployStarted
          status: "True"
          type: Progressing
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Degraded
        operatorVersion: 4.11.0
        phase: Deploying
        targetVersion: 4.11.0

      so HCO is going to report Progressing forever as well (HCO reads only the conditions on the CDI CR, not the ones on its operands).
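
      To illustrate the kind of aggregation this bug asks for, here is a minimal sketch in Go, assuming the operator can read its operand Deployment with plain client-go types; operandConditions and aggregateFromDeployment are hypothetical names, not CDI's actual code:

      package main

      import (
          "fmt"

          appsv1 "k8s.io/api/apps/v1"
          corev1 "k8s.io/api/core/v1"
      )

      // operandConditions is a simplified stand-in for the Available/Progressing/
      // Degraded triple that the CDI CR exposes in status.conditions.
      type operandConditions struct {
          Available, Progressing, Degraded bool
          Message                          string
      }

      // aggregateFromDeployment folds an operand Deployment's conditions into
      // operator-level conditions: ReplicaFailure=True or a rollout that hit
      // ProgressDeadlineExceeded surfaces as Degraded=True instead of leaving
      // the CR at Progressing=True forever.
      func aggregateFromDeployment(dep *appsv1.Deployment) operandConditions {
          out := operandConditions{Progressing: true} // "still deploying" by default
          for _, c := range dep.Status.Conditions {
              switch c.Type {
              case appsv1.DeploymentAvailable:
                  out.Available = c.Status == corev1.ConditionTrue
              case appsv1.DeploymentReplicaFailure:
                  if c.Status == corev1.ConditionTrue { // e.g. the PSA FailedCreate above
                      out.Degraded = true
                      out.Message = c.Message
                  }
              case appsv1.DeploymentProgressing:
                  if c.Status == corev1.ConditionFalse && c.Reason == "ProgressDeadlineExceeded" {
                      out.Progressing = false // the rollout is stuck, not merely slow
                      out.Degraded = true
                      out.Message = c.Message
                  }
              }
          }
          return out
      }

      func main() {
          // Conditions mirroring the stuck cdi-deployment from this report.
          dep := &appsv1.Deployment{Status: appsv1.DeploymentStatus{
              Conditions: []appsv1.DeploymentCondition{
                  {Type: appsv1.DeploymentReplicaFailure, Status: corev1.ConditionTrue,
                      Reason: "FailedCreate", Message: "pods ... is forbidden: violates PodSecurity"},
                  {Type: appsv1.DeploymentProgressing, Status: corev1.ConditionFalse,
                      Reason: "ProgressDeadlineExceeded", Message: "ReplicaSet has timed out progressing."},
              },
          }}
          fmt.Printf("%+v\n", aggregateFromDeployment(dep))
          // Prints: {Available:false Progressing:false Degraded:true Message:ReplicaSet has timed out progressing.}
      }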

      Version-Release number of selected component (if applicable):
      4.11.0

      How reproducible:
      100%

      Steps to Reproduce:
      1. try a deployment where cdi-deployment gets stuck (for instance, CNV 4.11.0 on OCP 4.12.0 with restricted PSA enforced on the openshift-cnv namespace)
      2. check `oc get deployments -n openshift-cnv cdi-deployment -o json | jq .status.conditions`
      3. compare with `oc get cdi cdi-kubevirt-hyperconverged -o json | jq .status.conditions`

      Actual results:
      a mismatch between cdi-deployment and CDI CR:

      on CDI deployment:

      status:
        conditions:
        - lastTransitionTime: "2022-09-21T21:21:28Z"
          lastUpdateTime: "2022-09-21T21:21:28Z"
          message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates PodSecurity
            "restricted:latest": allowPrivilegeEscalation != false (container "cdi-controller"
            must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities
            (container "cdi-controller" must set securityContext.capabilities.drop=["ALL"]),
            seccompProfile (pod or container "cdi-controller" must set securityContext.seccompProfile.type
            to "RuntimeDefault" or "Localhost")'
          reason: FailedCreate
          status: "True"
          type: ReplicaFailure
        - lastTransitionTime: "2022-09-21T21:31:29Z"
          lastUpdateTime: "2022-09-21T21:31:29Z"
          message: ReplicaSet "cdi-deployment-6f4888b5cb" has timed out progressing.
          reason: ProgressDeadlineExceeded
          status: "False"
          type: Progressing

      on CDI CR:

      status:
        conditions:
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Available
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          message: Started Deployment
          reason: DeployStarted
          status: "True"
          type: Progressing
        - lastHeartbeatTime: "2022-09-21T21:21:25Z"
          lastTransitionTime: "2022-09-21T21:21:25Z"
          status: "False"
          type: Degraded
      Expected results:
      The CDI CR conditions correctly aggregate the ones from its operands.
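
      For the failure above, correct aggregation could look roughly like this on the CDI CR (an illustrative sketch, not the exact output a fix will produce; the Degraded message is assumed to be propagated from the Deployment):

      status:
        conditions:
        - status: "False"
          type: Available
        - status: "False"
          type: Progressing
        - message: 'pods "cdi-deployment-6f4888b5cb-r9f5h" is forbidden: violates
            PodSecurity "restricted:latest": ...'
          reason: FailedCreate
          status: "True"
          type: Degraded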

      Additional info:
      Please note that correctly reporting a failed install/upgrade is actually a prerequisite for the unsafe fail-forward upgrades feature:
      https://olm.operatorframework.io/docs/advanced-tasks/unsafe-fail-forward-upgrades/

      We are currently not opting in to it, but that feature will eventually let customers try to recover from stuck upgrades.
