  1. OpenShift Bugs
  2. OCPBUGS-77363

Component Readiness: [BareMetal] [Installer / openshift-installer] [Other] metal3remediations.infrastructure.cluster.x-k8s.io CRD validation failure causes install failure

    • Critical
    • Proposed

      Several of the latest builds failed:

      level=info msg=Cluster operator openshift-apiserver EvaluationConditionsDetected is Unknown with NoData: 
      level=info msg=Cluster operator openshift-controller-manager EvaluationConditionsDetected is Unknown with NoData: 
      level=info msg=Cluster operator service-ca EvaluationConditionsDetected is Unknown with NoData: 
      level=info msg=Cluster operator storage EvaluationConditionsDetected is Unknown with NoData: 
      level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
      level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
      level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
      level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
      level=error msg=failed to initialize the cluster: Could not update customresourcedefinition "metal3remediations.infrastructure.cluster.x-k8s.io" (283 of 1050): the object is invalid, possibly due to local cluster configuration: timed out waiting for the condition
      

       

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview/2026681892157263872

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview/2026556389605773312

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview/2026423090430349312

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-techpreview/2026681630432694272

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-techpreview/2026556132897591296

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-techpreview/2026423194990153728

      Component Readiness has found a potential regression in the following test:

      install should succeed: cluster creation

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 90.91%.

      Sample (being evaluated) Release: 4.22
      Start Time: 2026-02-19T00:00:00Z
      End Time: 2026-02-26T04:00:00Z
      Success Rate: 90.91%
      Successes: 80
      Failures: 8
      Flakes: 0
      Base (historical) Release: 4.21
      Start Time: 2026-01-04T00:00:00Z
      End Time: 2026-02-03T23:59:59Z
      Success Rate: 100.00%
      Successes: 285
      Failures: 0
      Flakes: 0
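
The figures above can be sanity-checked with a one-sided Fisher's exact test. The sketch below is an illustration only: it assumes the reported "probability of a regression" is roughly 1 minus the one-sided p-value, which may not match how Component Readiness computes it. The function name `fisher_exact_one_sided` is hypothetical, and only the counts come from the report.

```python
from math import comb


def fisher_exact_one_sided(sample_fail, sample_total, base_fail, base_total):
    """Probability that at least `sample_fail` of all observed failures land
    in the sample, assuming sample and base share one failure rate
    (upper tail of the hypergeometric distribution)."""
    total = sample_total + base_total
    fails = sample_fail + base_fail
    denom = comb(total, sample_total)
    upper = min(fails, sample_total)
    return sum(
        comb(fails, k) * comb(total - fails, sample_total - k)
        for k in range(sample_fail, upper + 1)
    ) / denom


# Counts copied from the report above.
sample_pass, sample_fail = 80, 8   # 4.22 sample
base_pass, base_fail = 285, 0      # 4.21 base

pass_rate = 100 * sample_pass / (sample_pass + sample_fail)
p_value = fisher_exact_one_sided(
    sample_fail, sample_pass + sample_fail, base_fail, base_pass + base_fail
)

print(f"sample pass rate: {pass_rate:.2f}%")
print(f"one-sided p-value: {p_value:.2e}")
print(f"1 - p: {100 * (1 - p_value):.2f}%")
```

With 8 failures in 88 sample runs against 0 in 285 base runs, the p-value is small enough that 1 - p rounds to 100.00%, in line with the reported figure.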

      View the test details report for additional context.

       

      The following is the analysis from ai-helper for https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-dualstack-techpreview/2026423090430349312

        The Actual Error

        From the Cluster Version Operator logs, the real error is:

        CustomResourceDefinition.apiextensions.k8s.io "metal3remediations.infrastructure.cluster.x-k8s.io" is invalid:
        status.storedVersions[1]: Invalid value: "v1beta2":
        missing from spec.versions; v1beta2 was previously a storage version, and must remain in spec.versions
        until a storage migration ensures no data remains persisted in v1beta2 and
        removes v1beta2 from status.storedVersions

        What This Means

        This is a CRD version migration issue, NOT a webhook timeout issue as initially suggested by the error message.

        The Problem:
        1. The CRD currently in the cluster has:
          - spec.versions: [v1beta1, v1beta2]
          - status.storedVersions: [v1beta1, v1beta2]
        2. The 4.22 nightly payload is trying to apply a new version of the CRD that:
          - Only includes v1beta1 in spec.versions
          - Removes v1beta2 from the list
        3. Kubernetes API server validation prevents removing a version from spec.versions if:
          - That version is still listed in status.storedVersions
          - This protects against data loss from objects stored in the removed version
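
The rule in step 3 can be sketched as a small check over a CRD expressed as a plain dictionary. This is a simplified model of the invariant, not the API server's implementation. The function name `stored_version_errors` is hypothetical, and the `incoming` object is a hand-built stand-in for the payload's CRD combined with the cluster's recorded status.

```python
def stored_version_errors(crd):
    """Return one message per entry in status.storedVersions that is
    missing from spec.versions."""
    served = {v["name"] for v in crd["spec"]["versions"]}
    stored = crd.get("status", {}).get("storedVersions", [])
    return [
        f'status.storedVersions[{i}]: Invalid value: "{name}": '
        "missing from spec.versions"
        for i, name in enumerate(stored)
        if name not in served
    ]


# State described above: the payload's CRD lists only v1beta1,
# while the cluster still records v1beta2 as a stored version.
incoming = {
    "spec": {"versions": [{"name": "v1beta1", "served": True, "storage": True}]},
    "status": {"storedVersions": ["v1beta1", "v1beta2"]},
}

for err in stored_version_errors(incoming):
    print(err)
```

The single message it prints points at index 1 (`v1beta2`), the same entry flagged in the CVO log excerpt.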

        Why This Causes Install Failure

        The Cluster Version Operator:
        - Repeatedly tries to update the CRD (resource #283 of 1051)
        - Each attempt is rejected by API server validation
        - The update never succeeds
        - Installation gets stuck at 93% waiting for this CRD update
        - Eventually times out after 1 hour
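
Why a permanent validation error surfaces as a timeout can be illustrated with a generic retry loop. This is not the CVO's code. The helper `apply_until_deadline`, the `FakeClock` class, and the 300-second retry interval are assumptions made for the sketch, and only the one-hour limit comes from the analysis above.

```python
def apply_until_deadline(apply_once, timeout_s, interval_s, clock, sleep):
    """Retry apply_once() until it succeeds or timeout_s elapses.
    Returns (ok, attempts, last_error)."""
    deadline = clock() + timeout_s
    attempts, last_error = 0, None
    while clock() < deadline:
        attempts += 1
        try:
            apply_once()
            return True, attempts, None
        except ValueError as exc:  # stand-in for an API validation rejection
            last_error = str(exc)
        sleep(interval_s)
    return False, attempts, last_error


class FakeClock:
    """Simulated clock so the example finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def always_invalid():
    # The same rejection on every attempt: retrying cannot fix it.
    raise ValueError('status.storedVersions[1]: Invalid value: "v1beta2"')


clock = FakeClock()
ok, attempts, last_error = apply_until_deadline(
    always_invalid, timeout_s=3600, interval_s=300,
    clock=clock.time, sleep=clock.sleep,
)
print(ok, attempts, last_error)
```

Every attempt fails for the same reason, so the loop exits only when the deadline passes. A caller that reports just "timed out" hides the validation message kept in `last_error`.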

        Component Status Summary

        ✅ Working Correctly:
        - API Server: Running and enforcing validation
        - capm3-webhook-service: Running, conversion webhook functional
        - Metal3 controllers: Running
        - Webhook communication: No timeout issues

        ❌ The Issue:
        - CRD version migration mismatch between current cluster state and new payload
        - CVO cannot complete the CRD update due to validation failure

        Why This Happened

        This appears to be a regression or issue in the 4.22.0-0.nightly-2026-02-24-222058 payload where:
        1. The Metal3Remediation CRD removed v1beta2 from its supported versions
        2. But the cluster already has data marked as stored in v1beta2
        3. No storage migration was performed to convert v1beta2 objects to v1beta1

        Required Fix

        The payload needs one of these fixes:

        1. Keep v1beta2 in spec.versions until all clusters can migrate
        2. Perform storage version migration before removing v1beta2
        3. Add a storage version migrator to convert existing v1beta2 objects to v1beta1 and update storedVersions
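
Options 2 and 3 both rely on the usual order of operations for retiring a stored CRD version. The sketch below models that order on plain dictionaries. It is an in-memory illustration, not a tool to run against a cluster: `migrate_storage` and the `storedAs` field are hypothetical names. On a real cluster the rewrite would go through the API server and the `storedVersions` change through the CRD status subresource.

```python
def migrate_storage(crd, objects, old, new):
    """In-memory model of retiring a stored version:
    1. make `new` the storage version,
    2. rewrite every object so nothing stays persisted as `old`,
    3. drop `old` from status.storedVersions,
    4. only then remove `old` from spec.versions."""
    names = [v["name"] for v in crd["spec"]["versions"]]
    if new not in names:
        raise ValueError(f"{new} must be in spec.versions before migrating")
    for version in crd["spec"]["versions"]:
        version["storage"] = version["name"] == new
    for obj in objects:
        obj["storedAs"] = new  # models a read followed by a write-back
    crd["status"]["storedVersions"] = [new]
    crd["spec"]["versions"] = [
        v for v in crd["spec"]["versions"] if v["name"] != old
    ]
    return crd


crd = {
    "spec": {"versions": [
        {"name": "v1beta1", "storage": False},
        {"name": "v1beta2", "storage": True},
    ]},
    "status": {"storedVersions": ["v1beta1", "v1beta2"]},
}
objects = [{"name": "example-remediation", "storedAs": "v1beta2"}]

migrate_storage(crd, objects, old="v1beta2", new="v1beta1")
print(crd["status"]["storedVersions"], [v["name"] for v in crd["spec"]["versions"]])
```

The ordering is the point: removing `v1beta2` from `spec.versions` comes last, after `status.storedVersions` no longer lists it. The failing payload skipped straight to that last step.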

        —
        Summary

        This is NOT an installation infrastructure issue - it's a payload/CRD versioning bug in the nightly 4.22 release that prevents cluster initialization from completing.

        The error message "the object is invalid, possibly due to local cluster configuration: timed out waiting for the condition" was misleading: the timeout came from CVO retrying the failed update, not from a webhook timeout.

      =====================

      This issue was first seen on 4.22.0-0.nightly-2026-02-24-222058. Based on the changelog, https://github.com/openshift/cluster-api-provider-metal3/pull/63 looks suspicious.

              hpokorny@redhat.com Honza Pokorny
              yunjiang-1 Yunfei Jiang
              Jad Haj Yahya Jad Haj Yahya