OCPBUGS-27238 (OpenShift Bugs)

Ironic agent inspection fails


    • Critical
    • No
    • 3
    • Metal Platform 248, AI-41
    • 2
    • Rejected
    • False

      Description of problem:
      The BMH never becomes ready and the Agent CR is not generated. The extraworker VMs log the following:

      Jan 16 15:50:19 extraworker-0 podman[1578]: 2024-01-16 15:50:19.908 1 ERROR ironic_python_agent.inspector [-] inspector https://172.22.0.3:5050/v1/continue error 400: {"error":{"message":"Node 7b6b4f3c-6479-49f8-a8ab-4f5d1a0a7087 is not active, its provision state is clean wait"}}
      Jan 16 15:50:19 extraworker-0 podman[1578]: , proceeding with lookup
      Jan 16 15:50:19 extraworker-0 podman[1578]: 2024-01-16 15:50:19.908 1 ERROR ironic_python_agent.agent [-] Failed to perform inspection: stopping inspection, as inspector returned an error: ironic_python_agent.errors.InspectionError: stopping inspection, as inspector returned an error
      Jan 16 17:07:20 extraworker-0 podman[1580]: 2024-01-16 17:07:20.756 1 WARNING root [-] Invalid IP address : '' does not appear to be an IPv4 or IPv6 address
      Jan 16 17:07:22 extraworker-0 podman[1580]: netlink error: Operation not supported

      job link: https://github.com/openshift/release/pull/47643 

      Version-Release number of selected component (if applicable):

          OCP 4.14, MCE 2.4

      How reproducible:

          100%

      Steps to Reproduce:

          1. install MCE 2.4
          2. create AgentServiceConfig     
          3. create HostedCluster
          4. create BMH CR
          

      Actual results:

      The BMH never becomes ready and the Agent CR is not generated.

      Expected results:

      The BMH becomes ready and the Agent CR is generated.

      Additional info:
      slack discussion: https://redhat-internal.slack.com/archives/CTZTHFQRH/p1705334840347489 

      must-gather:  https://drive.google.com/file/d/1gc2nrcnuMSchVPduotwz-UESccoqZL30/view?usp=sharing 

       

      RCA:

      The root cause was an outdated ironic agent image.
      This older image could not register properly with the ironic server on the hub, so the assisted agent never started.

      The assisted service configured the ignition to use the old image because it could not securely determine the ironic agent image from the hub release.
      The failure appears in the assisted-service logs as the following warning, followed by the fallback to the default image:

      2024-01-22T15:54:08.372898307Z time="2024-01-22T15:54:08Z" level=warning msg="Failed to get ironic agent image by release for infraEnv: 2eb7c368ac446a3c8132" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:416" error="failed to inspect image, oc: command 'oc image info --output json --icsp-file=/tmp/icsp-file3513427847 registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393 --registry-config=/tmp/registry-config4076164437' exited with non-zero exit code 1: \nerror: unable to read image registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393: Get \"https://registry.build05.ci.openshift.org/v2/\": x509: certificate signed by unknown authority\n, skopeo: command 'skopeo inspect --raw --no-tags docker://registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393 --authfile /tmp/registry-config383545742' exited with non-zero exit code 1: \ntime=\"2024-01-22T15:54:08Z\" level=fatal msg=\"Error parsing image name \\\"docker://registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393\\\": pinging container registry registry.build05.ci.openshift.org: Get \\\"https://registry.build05.ci.openshift.org/v2/\\\": tls: failed to verify certificate: x509: certificate signed by unknown authority\"\n" go-id=363 preprovisioning_image=ostest-extraworker-1 preprovisioning_image_namespace=local-cluster-2eb7c368ac446a3c8132 request_id=4208e857-54bb-418b-ac21-81ff5fc66798
      2024-01-22T15:54:08.372917168Z time="2024-01-22T15:54:08Z" level=info msg="Setting default ironic agent image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d3f1d4d3cd5fbcf1b9249dd71d01be4b901d337fdc5f8f66569eb71df4d9d446) for infraEnv 2eb7c368ac446a3c8132" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:429" go-id=363 preprovisioning_image=ostest-extraworker-1 preprovisioning_image_namespace=local-cluster-2eb7c368ac446a3c8132 request_id=4208e857-54bb-418b-ac21-81ff5fc66798
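
      To check a must-gather for the same pattern, a small log scan can help. The following is a minimal sketch, not part of assisted-service: it assumes the two message formats shown above, and the function name `find_image_fallbacks` and the result keys are hypothetical.

```python
import re

# Patterns derived from the two assisted-service log lines quoted above.
FAILED_RE = re.compile(
    r'Failed to get ironic agent image by release for infraEnv: (?P<infra_env>[^"\s]+)'
)
DEFAULT_RE = re.compile(
    r'Setting default ironic agent image \((?P<image>[^)]+)\) for infraEnv (?P<infra_env>[^"\s]+)'
)
X509_MARKER = "x509: certificate signed by unknown authority"


def find_image_fallbacks(log_lines):
    """Report each fallback to the default ironic agent image.

    Hypothetical helper: returns one dict per "Setting default ironic agent
    image" line, noting whether an x509 lookup failure was logged earlier
    for the same infraEnv.
    """
    x509_failures = set()
    fallbacks = []
    for line in log_lines:
        failed = FAILED_RE.search(line)
        if failed and X509_MARKER in line:
            x509_failures.add(failed.group("infra_env"))
            continue
        default = DEFAULT_RE.search(line)
        if default:
            infra_env = default.group("infra_env")
            fallbacks.append({
                "infra_env": infra_env,
                "image": default.group("image"),
                "after_x509_failure": infra_env in x509_failures,
            })
    return fallbacks
```

      Feeding it the lines of the assisted-service pod log (for example via `open(path)`) should list each infraEnv that fell back to the default image. An entry with `after_x509_failure` set suggests the certificate trust problem described in this RCA.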
      

      This occurred because, when a mirrorRegistryRef configmap is provided in the AgentServiceConfig, its certs replace the contents of /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem.
      The default bundle contained the certs needed to trust the hub cluster registry, but the replacement bundle does not, which produces the x509 error.

      The proposed fix is for the infrastructure operator to supply the user's mirror registry certs as an additional PEM bundle rather than in place of the existing CA bundle.
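
      The difference between replacing and adding can be illustrated with a short sketch. This is not the infrastructure operator's code; it only models the two behaviours on PEM text, and the function names (`split_pem_certs`, `replace_bundle`, `append_bundle`) are hypothetical. It also assumes "additional" can be modelled as one combined bundle, whereas the actual fix might mount a separate file instead.

```python
import re

# Matches each PEM-encoded certificate block in a bundle.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem_certs(bundle):
    """Return the individual certificate blocks found in a PEM bundle."""
    return [match.group(0).strip() for match in PEM_CERT_RE.finditer(bundle)]


def replace_bundle(default_bundle, mirror_bundle):
    """Behaviour described in the RCA: mirror certs replace the defaults,
    so any CA present only in the default bundle is no longer trusted."""
    return mirror_bundle


def append_bundle(default_bundle, mirror_bundle):
    """Additive behaviour: keep every default cert and add the mirror certs,
    dropping exact duplicates (compared with whitespace removed)."""
    seen = set()
    merged = []
    for cert in split_pem_certs(default_bundle) + split_pem_certs(mirror_bundle):
        key = "".join(cert.split())
        if key not in seen:
            seen.add(key)
            merged.append(cert)
    return "\n".join(merged) + ("\n" if merged else "")
```

      With `replace_bundle`, a CA that exists only in the default bundle (such as the one trusting the hub cluster registry) disappears. With `append_bundle`, it is retained alongside the mirror registry certs.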
       

            ncarboni@redhat.com Nick Carboni
            rhn-support-liangli Liangquan Li