- Bug
- Resolution: Done-Errata
- Major
- 4.14
- Moderate
- No
- 1
- Metal Platform 239
- 1
- Rejected
- False
Description of problem:
OCP deployments are failing because the machine-api-controller pod is crashing.
Version-Release number of selected component (if applicable):
OCP 4.14.0-ec.3
How reproducible:
Always
Steps to Reproduce:
1. Deploy a baremetal cluster.
2. After bootstrap completes, check the pods running in the openshift-machine-api namespace.
3. Check the machine-api-controllers-* pod status; it flaps between Running and CrashLoopBackOff (see the commands sketched after this list).
4. The deployment eventually times out and stops with only the master nodes deployed.
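A minimal way to observe step 3, assuming access to the cluster kubeconfig (the pod name here is the one from the listing further below and will differ per deployment):
$ oc -n openshift-machine-api get pods -w
$ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller --previous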
Actual results:
The machine-api-controllers-* pod remains in a crash loop and OCP 4.14.0-ec.3 deployments fail.
Expected results:
The machine-api-controllers-* pod remains running and OCP 4.14.0-ec.3 deployments complete successfully.
Additional info:
Jobs with older 4.14 nightly releases are passing, but since Saturday, Jul 10th, our CI jobs have been failing.
$ oc version
Client Version: 4.14.0-ec.3
Kustomize Version: v5.0.1
Kubernetes Version: v1.27.3+e8b13aa

$ oc get nodes
NAME       STATUS   ROLES                  AGE   VERSION
master-0   Ready    control-plane,master   37m   v1.27.3+e8b13aa
master-1   Ready    control-plane,master   37m   v1.27.3+e8b13aa
master-2   Ready    control-plane,master   38m   v1.27.3+e8b13aa

$ oc -n openshift-machine-api get pods -o wide
NAME                                                  READY   STATUS             RESTARTS        AGE   IP              NODE       NOMINATED NODE   READINESS GATES
cluster-autoscaler-operator-75b96869d8-gzthq          2/2     Running            0               48m   10.129.0.6      master-0   <none>           <none>
cluster-baremetal-operator-7c9cb8cd69-6bqcg           2/2     Running            0               48m   10.129.0.7      master-0   <none>           <none>
control-plane-machine-set-operator-6b65b5b865-w996m   1/1     Running            0               48m   10.129.0.22     master-0   <none>           <none>
machine-api-controllers-59694ff965-v4kxb              6/7     CrashLoopBackOff   7 (2m31s ago)   46m   10.130.0.12     master-2   <none>           <none>
machine-api-operator-58b54d7c86-cnx4w                 2/2     Running            0               48m   10.129.0.8      master-0   <none>           <none>
metal3-6ffbb8dcd4-drlq5                               6/6     Running            0               45m   192.168.62.22   master-1   <none>           <none>
metal3-baremetal-operator-bd95b6695-q6k7c             1/1     Running            0               45m   10.130.0.16     master-2   <none>           <none>
metal3-image-cache-4p7ln                              1/1     Running            0               45m   192.168.62.22   master-1   <none>           <none>
metal3-image-cache-lfmb4                              1/1     Running            0               45m   192.168.62.23   master-2   <none>           <none>
metal3-image-cache-txjg5                              1/1     Running            0               45m   192.168.62.21   master-0   <none>           <none>
metal3-image-customization-65cf987f5c-wgqs7           1/1     Running            0               45m   10.128.0.17     master-1   <none>           <none>
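To confirm which of the pod's seven containers is the one restarting, a sketch using a standard jsonpath query (the pod name is taken from the listing above):
$ oc -n openshift-machine-api get pod machine-api-controllers-59694ff965-v4kxb -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
In this case the restart count points at the machine-controller container, whose logs are shown next.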
$ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller | less
...
E0710 15:55:08.230413 1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\"" "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"}
E0710 15:55:14.019930 1 controller.go:210] "msg"="Could not wait for Cache to sync" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced" "controller"="metal3remediation" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="Metal3Remediation"
I0710 15:55:14.020025 1 logr.go:252] "msg"="Stopping and waiting for non leader election runnables"
I0710 15:55:14.020054 1 logr.go:252] "msg"="Stopping and waiting for leader election runnables"
I0710 15:55:14.020095 1 controller.go:247] "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-drain-controller"
I0710 15:55:14.020147 1 controller.go:247] "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machineset-controller"
I0710 15:55:14.020169 1 controller.go:247] "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-controller"
I0710 15:55:14.020184 1 controller.go:249] "msg"="All workers finished" "controller"="machineset-controller"
I0710 15:55:14.020181 1 controller.go:249] "msg"="All workers finished" "controller"="machine-drain-controller"
I0710 15:55:14.020190 1 controller.go:249] "msg"="All workers finished" "controller"="machine-controller"
I0710 15:55:14.020209 1 logr.go:252] "msg"="Stopping and waiting for caches"
I0710 15:55:14.020323 1 logr.go:252] "msg"="Stopping and waiting for webhooks"
I0710 15:55:14.020327 1 reflector.go:225] Stopping reflector *v1alpha1.BareMetalHost (10h53m58.149951981s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
I0710 15:55:14.020393 1 reflector.go:225] Stopping reflector *v1beta1.Machine (9h40m22.116205595s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
I0710 15:55:14.020399 1 logr.go:252] controller-runtime/webhook "msg"="shutting down webhook server"
I0710 15:55:14.020437 1 reflector.go:225] Stopping reflector *v1.Node (10h3m14.461941979s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
I0710 15:55:14.020466 1 logr.go:252] "msg"="Wait completed, proceeding to shutdown the manager"
I0710 15:55:14.020485 1 reflector.go:225] Stopping reflector *v1beta1.MachineSet (10h7m28.391827596s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
E0710 15:55:14.020500 1 main.go:218] baremetal-controller-manager/entrypoint "msg"="unable to run manager" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced"
E0710 15:55:14.020504 1 logr.go:270] "msg"="error received after stop sequence was engaged" "error"="leader election lost"
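The first error indicates the controller is trying to watch the Metal3Remediation kind but finds no matching CRD in the infrastructure.cluster.x-k8s.io/v1beta1 group, so the informer cache never syncs and the manager shuts down. A quick way to check whether that CRD is registered on the cluster (the exact CRD resource name below is an assumption derived from the group/kind in the error, not something confirmed by this report):
$ oc api-resources --api-group=infrastructure.cluster.x-k8s.io
$ oc get crd metal3remediations.infrastructure.cluster.x-k8s.io -o jsonpath='{.spec.versions[*].name}'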
Our CI job logs can be seen here (RedHat SSO): https://www.distributed-ci.io/jobs/7da8ee48-8918-4a97-8e3c-f525d19583b8/files
- is cloned by: OCPBUGS-16084 [4.13] OCP 4.14.0-ec.3 machine-api-controller pod crashing (Closed)
- is depended on by: OCPBUGS-16084 [4.13] OCP 4.14.0-ec.3 machine-api-controller pod crashing (Closed)
- links to: RHEA-2023:5006 rpm