-
Bug
-
Resolution: Done
-
Major
-
None
-
3
-
False
-
None
-
False
-
Metal Platform 248, AI-41, AI-48
-
Critical
-
Rejected
-
No
Description of problem:
The BMH never becomes ready and the Agent CR is not generated. The extraworker VMs log the following:
Jan 16 15:50:19 extraworker-0 podman[1578]: 2024-01-16 15:50:19.908 1 ERROR ironic_python_agent.inspector [-] inspector https://172.22.0.3:5050/v1/continue error 400: {"error":{"message":"Node 7b6b4f3c-6479-49f8-a8ab-4f5d1a0a7087 is not active, its provision state is clean wait"}}
Jan 16 15:50:19 extraworker-0 podman[1578]: , proceeding with lookup
Jan 16 15:50:19 extraworker-0 podman[1578]: 2024-01-16 15:50:19.908 1 ERROR ironic_python_agent.agent [-] Failed to perform inspection: stopping inspection, as inspector returned an error: ironic_python_agent.errors.InspectionError: stopping inspection, as inspector returned an error
Jan 16 17:07:20 extraworker-0 podman[1580]: 2024-01-16 17:07:20.756 1 WARNING root [-] Invalid IP address : '' does not appear to be an IPv4 or IPv6 address
Jan 16 17:07:22 extraworker-0 podman[1580]: netlink error: Operation not supported
job link: https://github.com/openshift/release/pull/47643
Version-Release number of selected component (if applicable):
OCP 4.14, MCE 2.4
How reproducible:
100%
Steps to Reproduce:
1. install MCE 2.4
2. create AgentServiceConfig
3. create HostedCluster
4. create BMH CR
Actual results:
The BMH never becomes ready and the Agent CR is not generated.
Expected results:
The BMH becomes ready and the Agent CR is generated.
Additional info:
slack discussion: https://redhat-internal.slack.com/archives/CTZTHFQRH/p1705334840347489
must-gather: https://drive.google.com/file/d/1gc2nrcnuMSchVPduotwz-UESccoqZL30/view?usp=sharing
RCA:
This issue ended up being caused by an outdated ironic agent image being used.
This older image couldn't register properly with the ironic server on the hub and so the assisted agent never got started.
The assisted service configured the ignition to use the old image because it was unable to securely determine the ironic agent image based on the hub release.
This could be seen in the assisted-service logs as the following error:
2024-01-22T15:54:08.372898307Z time="2024-01-22T15:54:08Z" level=warning msg="Failed to get ironic agent image by release for infraEnv: 2eb7c368ac446a3c8132" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:416" error="failed to inspect image, oc: command 'oc image info --output json --icsp-file=/tmp/icsp-file3513427847 registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393 --registry-config=/tmp/registry-config4076164437' exited with non-zero exit code 1: \nerror: unable to read image registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393: Get \"https://registry.build05.ci.openshift.org/v2/\": x509: certificate signed by unknown authority\n, skopeo: command 'skopeo inspect --raw --no-tags docker://registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393 --authfile /tmp/registry-config383545742' exited with non-zero exit code 1: \ntime=\"2024-01-22T15:54:08Z\" level=fatal msg=\"Error parsing image name \\\"docker://registry.build05.ci.openshift.org/ci-op-t1ztvn7b/release@sha256:77c62041f6134c98acb2686c7db4bb7d4ae15b88f27260aa5a973fe954824393\\\": pinging container registry registry.build05.ci.openshift.org: Get \\\"https://registry.build05.ci.openshift.org/v2/\\\": tls: failed to verify certificate: x509: certificate signed by unknown authority\"\n" go-id=363 preprovisioning_image=ostest-extraworker-1 preprovisioning_image_namespace=local-cluster-2eb7c368ac446a3c8132 request_id=4208e857-54bb-418b-ac21-81ff5fc66798
2024-01-22T15:54:08.372917168Z time="2024-01-22T15:54:08Z" level=info msg="Setting default ironic agent image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d3f1d4d3cd5fbcf1b9249dd71d01be4b901d337fdc5f8f66569eb71df4d9d446) for infraEnv 2eb7c368ac446a3c8132" func="github.com/openshift/assisted-service/internal/controller/controllers.(*PreprovisioningImageReconciler).AddIronicAgentToInfraEnv" file="/remote-source/assisted-service/app/internal/controller/controllers/preprovisioningimage_controller.go:429" go-id=363 preprovisioning_image=ostest-extraworker-1 preprovisioning_image_namespace=local-cluster-2eb7c368ac446a3c8132 request_id=4208e857-54bb-418b-ac21-81ff5fc66798
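To check whether other infraEnvs hit the same fallback, a saved assisted-service log can be scanned for the two messages quoted above. The following is only a rough sketch: the function names are hypothetical, the message fragments are copied from this ticket's log, and the wording may differ in other assisted-service versions.

```python
import re

# Message fragments copied from the assisted-service log in this ticket.
# Assumption: other versions of assisted-service may word these differently.
RELEASE_LOOKUP_FAILED = "Failed to get ironic agent image by release"
X509_ERROR = "x509: certificate signed by unknown authority"

# Matches the "Setting default ironic agent image (<image>) for infraEnv <id>"
# message; \s+ tolerates a line wrap between "image" and the parenthesis.
FALLBACK_RE = re.compile(
    r"Setting default ironic agent image\s+\((?P<image>[^)\s]+)\)\s+"
    r"for infraEnv (?P<infra_env>[0-9A-Za-z-]+)"
)


def find_default_image_fallbacks(log_text):
    """Return (infra_env, image) pairs where the default image was used."""
    return [
        (m.group("infra_env"), m.group("image"))
        for m in FALLBACK_RE.finditer(log_text)
    ]


def release_lookup_failed_on_x509(log_text):
    """True if the log shows the release lookup failing with an x509 error."""
    return RELEASE_LOOKUP_FAILED in log_text and X509_ERROR in log_text
```

If `find_default_image_fallbacks` returns entries and `release_lookup_failed_on_x509` is true for the same log, the symptoms match the ones described in this RCA.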
This occurred because the contents of /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem are replaced with the certs in the mirrorRegistryRef configmap when it is provided in the AgentServiceConfig.
This causes the x509 error because the default content would have been used to trust this hub cluster registry, but the new content doesn't have the required certs.
The proposed fix would be for the infrastructure operator to supply the user mirror registry certs as an additional pem bundle rather than in place of the existing CA bundle.
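The difference between replacing the bundle and supplying the mirror certs in addition to it can be illustrated with a small sketch. This is not the infrastructure operator's code (the operator is not written in Python). The function names are hypothetical and the sketch only shows the append-instead-of-replace idea on PEM text.

```python
import re

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem(bundle):
    """Split a PEM bundle into a list of individual certificate blocks."""
    return [m.group(0).strip() for m in PEM_CERT_RE.finditer(bundle)]


def merge_ca_bundles(default_bundle, mirror_bundle):
    """Keep every default CA and append the mirror registry CAs.

    Hypothetical helper: the default certs come first and duplicates are
    dropped, so trust for the hub cluster registry is preserved.
    """
    merged = []
    seen = set()
    for cert in split_pem(default_bundle) + split_pem(mirror_bundle):
        key = "".join(cert.split())  # ignore whitespace-only differences
        if key not in seen:
            seen.add(key)
            merged.append(cert)
    return "\n".join(merged) + "\n" if merged else ""
```

With the behavior described in this RCA, the resulting bundle contains only the `mirrorRegistryRef` certs. With a merge like the one sketched here, the certs that trust the hub cluster registry stay in place alongside them.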
- blocks
-
OCPBUGS-16189 Dual-Stack Hosted Cluster: IPv6 should not be the default pod/service network IPFamily
- Closed
-
OCPBUGS-19746 Add a network validation to avoid overlapping when you define KAS Advertise Address
- Closed
- is blocked by
-
OCPBUGS-29623 Some oc cli commands don't respect --certificate-authority
- Closed