OpenShift Bugs / OCPBUGS-51365

Deploying a manifest with CRs that need webhook validation causes MicroShift to crash on first start


    • Quality / Stability / Reliability
    • False
    • None
    • 5
    • None
    • None
    • None
    • None
    • uShift Sprint 268, uShift Sprint 269, uShift Sprint 270
    • 3
    • Done
    • Bug Fix
    • Previously, the `kustomizer` sub-service was blocking `microshift.service` readiness by adding a manifest that required a webhook for a Custom Resource (CR) before MicroShift and associated network services had started. This resulted in both `kustomizer` and the network node failing, causing MicroShift to fail. With this release, MicroShift is no longer dependent on the `kustomizer` sub-service, and can start network services and be ready before the application service starts.
    • None
    • None
    • None
    • None

      Description of problem:

      Adding MicroShift manifests that include a webhook for a particular CR and an instance of that CR (in the same manifest or a different one) prevents MicroShift from starting successfully.

      Version-Release number of selected component (if applicable):

          Found on main (4.19), but most likely applicable to all MicroShift versions.

      How reproducible:

          always

      Steps to Reproduce:

          1. Install the `microshift` and `microshift-ai-model-serving` RPMs
          2. Start MicroShift: `systemctl start microshift`

      Actual results:

      `systemctl start microshift` fails.
      
      `journalctl -u microshift` shows these messages:
      "Failed to initialize CSINode after retrying: timed out waiting for the condition" and "microshift.service: Main process exited, code=exited, status=255/EXCEPTION"
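To check captured journal output for this failure, a small helper can look for both quoted messages. This is only a sketch: the function name `matches_failure_signature` is hypothetical, and it assumes the journal text has already been collected into a string (for example from `journalctl -u microshift`).

```python
# The two messages quoted in this report. Both must be present for a match.
FAILURE_MARKERS = (
    "Failed to initialize CSINode after retrying: timed out waiting for the condition",
    "microshift.service: Main process exited, code=exited, status=255/EXCEPTION",
)


def matches_failure_signature(journal_text: str) -> bool:
    """Return True if the text contains both messages quoted in this report.

    Hypothetical helper: it does plain substring matching and does not
    parse timestamps or unit names.
    """
    return all(marker in journal_text for marker in FAILURE_MARKERS)
```

Matching on both messages rather than just the exit status avoids flagging unrelated `status=255` exits.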

      Expected results:

      `systemctl start microshift` succeeds.

      Additional info:

      Investigation findings:
      - When kubelet starts, it first tries to create kubepods.slice before it registers with the API server.
      - Another goroutine in kubelet creates the CSINode, but it needs the v1/Node to exist first and is only allowed to retry for ~27 seconds.
      - kubepods.slice is created only after microshift.service becomes ready (because kubepods.slice has a dependency on microshift.service).
      - Adding a manifest that cannot be applied yet (it contains a CR guarded by a webhook, but the webhook is not running because kubelet is not fully up) makes the kustomizer sub-service take a long time before it fails. Because kustomizer blocks microshift.service readiness, kubepods.slice is not created and the Node is not registered during that time.
      - In the meantime, the CSINode creation times out and kills MicroShift with exit(255).
      
      In the happy flow (i.e. no CR for a webhook to validate):
      - kustomizer doesn't block microshift.service readiness
      - microshift.service becomes ready and kubepods.slice is created
      - kubelet registers the Node, and the CSINode is created before the 27s timeout, so MicroShift is not killed
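The failing and happy flows above can be compared with a rough timing model. This is a minimal sketch, not MicroShift or kubelet code: the 27 second budget comes from the findings, while `registration_delay_s` and the kustomizer durations used in the example are made-up illustrative values.

```python
from dataclasses import dataclass

# Approximate CSINode retry window reported in the findings above.
CSINODE_RETRY_BUDGET_S = 27.0


@dataclass
class FirstStartResult:
    node_registered_at_s: float  # when the Node would be registered
    csinode_created: bool        # False means MicroShift exits with 255


def simulate_first_start(kustomizer_duration_s: float,
                         registration_delay_s: float = 2.0) -> FirstStartResult:
    """Model a first start in which kustomizer blocks readiness.

    Assumed chain: kustomizer finishes -> microshift.service is ready ->
    kubepods.slice is created -> kubelet registers the Node.
    registration_delay_s is a hypothetical value, not a measurement.
    """
    ready_at_s = kustomizer_duration_s
    node_registered_at_s = ready_at_s + registration_delay_s
    csinode_created = node_registered_at_s <= CSINODE_RETRY_BUDGET_S
    return FirstStartResult(node_registered_at_s, csinode_created)


# Happy flow: kustomizer finishes quickly (hypothetical 1 s).
happy = simulate_first_start(kustomizer_duration_s=1.0)
# Failing flow: kustomizer waits on the webhook (hypothetical 120 s).
blocked = simulate_first_start(kustomizer_duration_s=120.0)
```

In this model the only thing that changes between the two flows is how long kustomizer holds back readiness, which matches the observation that removing the ServingRuntimes lets MicroShift start.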
      
      The best solution seems to be to decouple kustomizer from MicroShift readiness, so that it doesn't block microshift.service from becoming ready.
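The idea of taking kustomizer out of the readiness path can be illustrated with a toy start-up sequence. This is a sketch under simplifying assumptions, not the actual MicroShift change: the function `start`, its parameters, and the use of `threading.Event` as a stand-in for service readiness are all hypothetical.

```python
import threading


def start(block_on_kustomizer: bool, kustomizer_timeout_s: float = 0.2) -> dict:
    """Toy start-up sequence; returns which steps completed."""
    service_ready = threading.Event()     # stands in for microshift.service readiness
    manifest_applied = threading.Event()

    def kustomizer() -> None:
        # Assumption: the CR can only be applied once the webhook answers,
        # and the webhook only runs after the service is ready.
        if service_ready.wait(timeout=kustomizer_timeout_s):
            manifest_applied.set()

    worker = threading.Thread(target=kustomizer)
    worker.start()

    if block_on_kustomizer:
        # Blocking variant: readiness waits for kustomizer, which in turn
        # waits for readiness, so kustomizer times out.
        worker.join()
        if manifest_applied.is_set():
            service_ready.set()
    else:
        # Decoupled variant: become ready first, let kustomizer finish later.
        service_ready.set()
        worker.join()

    return {"ready": service_ready.is_set(),
            "manifest_applied": manifest_applied.is_set()}
```

With `block_on_kustomizer=True` neither step completes in this toy model; with `False` both do, because the manifest is applied only after the service is already ready.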
      
      This issue can go unnoticed in tests that do not run `systemctl start` directly: after MicroShift is killed, kubepods.slice is created immediately, so the next time kubelet starts it doesn't have to create the slice and proceeds straight to Node registration.
      
      If I comment out the deployment of ServingRuntimes in ai-model-serving, MicroShift starts normally: creation of ServingRuntimes is no longer blocked by a webhook that is not operational yet, so kustomizer finishes quickly, microshift.service becomes ready, and kubepods.slice is created before the CSINode creation loop kills the binary.

              pmatusza@redhat.com Patryk Matuszak
              pmatusza@redhat.com Patryk Matuszak
              None
              None
              John George
              Shauna Diaz