OpenShift Bugs / OCPBUGS-16084

[4.13] OCP 4.14.0-ec.3 machine-api-controller pod crashing

      Description of problem:

      
      OCP bare-metal deployments are failing because the machine-api-controllers pod keeps crashing.
      

      Version-Release number of selected component (if applicable):

      OCP 4.14.0-ec.3 
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Deploy a Baremetal cluster
      2. After bootstrap is completed, check the pods running in the openshift-machine-api namespace
      3. Check the machine-api-controllers-* pod status (it keeps cycling between Running and CrashLoopBackOff; see the command sketch below)
      4. The deployment eventually times out and stops with only the master nodes deployed.
      
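      A rough sketch of the commands we use for step 3 (the pod name is taken from the output in Additional info and will differ per deployment):

      $ oc -n openshift-machine-api get pods -w
      # machine-api-controllers-* cycles between Running and CrashLoopBackOff while its restart count keeps growing
      $ oc -n openshift-machine-api describe pod machine-api-controllers-59694ff965-v4kxb
      # "Last State: Terminated" on the machine-controller container shows why the previous instance exited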

      Actual results:

      The machine-api-controllers-* pod stays in a crash loop and OCP 4.14.0-ec.3 deployments fail.
      

      Expected results:

      The machine-api-controllers-* pod remains running and OCP 4.14.0-ec.3 deployments complete successfully.
      

      Additional info:

      Jobs with older 4.14 nightly releases are passing, but since Saturday Jul 10th our CI jobs have been failing.

      $ oc version
      Client Version: 4.14.0-ec.3
      Kustomize Version: v5.0.1
      Kubernetes Version: v1.27.3+e8b13aa
      
      $ oc get nodes
      NAME       STATUS   ROLES                  AGE   VERSION
      master-0   Ready    control-plane,master   37m   v1.27.3+e8b13aa
      master-1   Ready    control-plane,master   37m   v1.27.3+e8b13aa
      master-2   Ready    control-plane,master   38m   v1.27.3+e8b13aa
      
      $ oc -n openshift-machine-api get pods -o wide
      NAME                                                  READY   STATUS             RESTARTS        AGE   IP              NODE       NOMINATED NODE   READINESS GATES
      cluster-autoscaler-operator-75b96869d8-gzthq          2/2     Running            0               48m   10.129.0.6      master-0   <none>           <none>
      cluster-baremetal-operator-7c9cb8cd69-6bqcg           2/2     Running            0               48m   10.129.0.7      master-0   <none>           <none>
      control-plane-machine-set-operator-6b65b5b865-w996m   1/1     Running            0               48m   10.129.0.22     master-0   <none>           <none>
      machine-api-controllers-59694ff965-v4kxb              6/7     CrashLoopBackOff   7 (2m31s ago)   46m   10.130.0.12     master-2   <none>           <none>
      machine-api-operator-58b54d7c86-cnx4w                 2/2     Running            0               48m   10.129.0.8      master-0   <none>           <none>
      metal3-6ffbb8dcd4-drlq5                               6/6     Running            0               45m   192.168.62.22   master-1   <none>           <none>
      metal3-baremetal-operator-bd95b6695-q6k7c             1/1     Running            0               45m   10.130.0.16     master-2   <none>           <none>
      metal3-image-cache-4p7ln                              1/1     Running            0               45m   192.168.62.22   master-1   <none>           <none>
      metal3-image-cache-lfmb4                              1/1     Running            0               45m   192.168.62.23   master-2   <none>           <none>
      metal3-image-cache-txjg5                              1/1     Running            0               45m   192.168.62.21   master-0   <none>           <none>
      metal3-image-customization-65cf987f5c-wgqs7           1/1     Running            0               45m   10.128.0.17     master-1   <none>           <none>
      
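      Because the container is restarting repeatedly, the log of the previously crashed instance can also be retrieved with --previous (standard oc/kubectl behaviour, nothing specific to this bug):

      $ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller --previous
      # prints the log of the last terminated machine-controller container instance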
      $ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller | less
      ...
      E0710 15:55:08.230413       1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\""  "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"}
      E0710 15:55:14.019930       1 controller.go:210]  "msg"="Could not wait for Cache to sync" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced" "controller"="metal3remediation" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="Metal3Remediation" 
      I0710 15:55:14.020025       1 logr.go:252]  "msg"="Stopping and waiting for non leader election runnables"  
      I0710 15:55:14.020054       1 logr.go:252]  "msg"="Stopping and waiting for leader election runnables"  
      I0710 15:55:14.020095       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-drain-controller" 
      I0710 15:55:14.020147       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machineset-controller" 
      I0710 15:55:14.020169       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-controller" 
      I0710 15:55:14.020184       1 controller.go:249]  "msg"="All workers finished" "controller"="machineset-controller" 
      I0710 15:55:14.020181       1 controller.go:249]  "msg"="All workers finished" "controller"="machine-drain-controller" 
      I0710 15:55:14.020190       1 controller.go:249]  "msg"="All workers finished" "controller"="machine-controller" 
      I0710 15:55:14.020209       1 logr.go:252]  "msg"="Stopping and waiting for caches"  
      I0710 15:55:14.020323       1 logr.go:252]  "msg"="Stopping and waiting for webhooks"  
      I0710 15:55:14.020327       1 reflector.go:225] Stopping reflector *v1alpha1.BareMetalHost (10h53m58.149951981s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      I0710 15:55:14.020393       1 reflector.go:225] Stopping reflector *v1beta1.Machine (9h40m22.116205595s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      I0710 15:55:14.020399       1 logr.go:252] controller-runtime/webhook "msg"="shutting down webhook server"  
      I0710 15:55:14.020437       1 reflector.go:225] Stopping reflector *v1.Node (10h3m14.461941979s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      I0710 15:55:14.020466       1 logr.go:252]  "msg"="Wait completed, proceeding to shutdown the manager"  
      I0710 15:55:14.020485       1 reflector.go:225] Stopping reflector *v1beta1.MachineSet (10h7m28.391827596s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      E0710 15:55:14.020500       1 main.go:218] baremetal-controller-manager/entrypoint "msg"="unable to run manager" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced"  
      E0710 15:55:14.020504       1 logr.go:270]  "msg"="error received after stop sequence was engaged" "error"="leader election lost" 
      
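      The first error above suggests the Metal3Remediation kind is not being served at infrastructure.cluster.x-k8s.io/v1beta1 on this cluster. We have not confirmed the root cause, but the missing kind can be checked with standard oc commands:

      $ oc get crd metal3remediations.infrastructure.cluster.x-k8s.io
      # "Error from server (NotFound)" here would mean the CRD is not installed at all
      $ oc api-resources --api-group=infrastructure.cluster.x-k8s.io
      # lists which kinds and versions of that API group are actually served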

      Our CI job logs can be seen here (RedHat SSO): https://www.distributed-ci.io/jobs/7da8ee48-8918-4a97-8e3c-f525d19583b8/files
