OpenShift Bugs / OCPBUGS-16084

[4.13] OCP 4.14.0-ec.3 machine-api-controller pod crashing

      Description of problem:

      
      OCP bare-metal deployments are failing because the machine-api-controllers pod keeps crashing.
      

      Version-Release number of selected component (if applicable):

      OCP 4.14.0-ec.3 
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Deploy a Baremetal cluster
      2. After bootstrap is completed, check the pods running in the openshift-machine-api namespace
      3. Check the machine-api-controllers-* pod status (it keeps cycling between Running and CrashLoopBackOff; see the command sketch below)
      4. The deployment eventually times out and stops with only the master nodes deployed.
      
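      A rough sketch of the commands we use for step 3 (the pod name is taken from the output in Additional info and will differ per deployment):

      $ oc -n openshift-machine-api get pods -w
      # machine-api-controllers-* cycles between Running and CrashLoopBackOff while its restart count keeps growing
      $ oc -n openshift-machine-api describe pod machine-api-controllers-59694ff965-v4kxb
      # "Last State: Terminated" on the machine-controller container shows why the previous instance exited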

      Actual results:

      The machine-api-controllers-* pod stays in a crash loop and OCP 4.14.0-ec.3 deployments fail.
      

      Expected results:

      The machine-api-controllers-* pod remains running and OCP 4.14.0-ec.3 deployments complete successfully.
      

      Additional info:

      Jobs with older 4.14 nightly releases are passing, but since Saturday Jul 10th our CI jobs have been failing.

      $ oc version
      Client Version: 4.14.0-ec.3
      Kustomize Version: v5.0.1
      Kubernetes Version: v1.27.3+e8b13aa
      
      $ oc get nodes
      NAME       STATUS   ROLES                  AGE   VERSION
      master-0   Ready    control-plane,master   37m   v1.27.3+e8b13aa
      master-1   Ready    control-plane,master   37m   v1.27.3+e8b13aa
      master-2   Ready    control-plane,master   38m   v1.27.3+e8b13aa
      
      $ oc -n openshift-machine-api get pods -o wide
      NAME                                                  READY   STATUS             RESTARTS        AGE   IP              NODE       NOMINATED NODE   READINESS GATES
      cluster-autoscaler-operator-75b96869d8-gzthq          2/2     Running            0               48m   10.129.0.6      master-0   <none>           <none>
      cluster-baremetal-operator-7c9cb8cd69-6bqcg           2/2     Running            0               48m   10.129.0.7      master-0   <none>           <none>
      control-plane-machine-set-operator-6b65b5b865-w996m   1/1     Running            0               48m   10.129.0.22     master-0   <none>           <none>
      machine-api-controllers-59694ff965-v4kxb              6/7     CrashLoopBackOff   7 (2m31s ago)   46m   10.130.0.12     master-2   <none>           <none>
      machine-api-operator-58b54d7c86-cnx4w                 2/2     Running            0               48m   10.129.0.8      master-0   <none>           <none>
      metal3-6ffbb8dcd4-drlq5                               6/6     Running            0               45m   192.168.62.22   master-1   <none>           <none>
      metal3-baremetal-operator-bd95b6695-q6k7c             1/1     Running            0               45m   10.130.0.16     master-2   <none>           <none>
      metal3-image-cache-4p7ln                              1/1     Running            0               45m   192.168.62.22   master-1   <none>           <none>
      metal3-image-cache-lfmb4                              1/1     Running            0               45m   192.168.62.23   master-2   <none>           <none>
      metal3-image-cache-txjg5                              1/1     Running            0               45m   192.168.62.21   master-0   <none>           <none>
      metal3-image-customization-65cf987f5c-wgqs7           1/1     Running            0               45m   10.128.0.17     master-1   <none>           <none>
      
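      Because the container is restarting repeatedly, the log of the previously crashed instance can also be retrieved with --previous (standard oc/kubectl behaviour, nothing specific to this bug):

      $ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller --previous
      # prints the log of the last terminated machine-controller container instance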
      $ oc -n openshift-machine-api logs machine-api-controllers-59694ff965-v4kxb -c machine-controller | less
      ...
      E0710 15:55:08.230413       1 logr.go:270] controller-runtime/source "msg"="if kind is a CRD, it should be installed before calling Start" "error"="no matches for kind \"Metal3Remediation\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\""  "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"Metal3Remediation"}
      E0710 15:55:14.019930       1 controller.go:210]  "msg"="Could not wait for Cache to sync" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced" "controller"="metal3remediation" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="Metal3Remediation" 
      I0710 15:55:14.020025       1 logr.go:252]  "msg"="Stopping and waiting for non leader election runnables"  
      I0710 15:55:14.020054       1 logr.go:252]  "msg"="Stopping and waiting for leader election runnables"  
      I0710 15:55:14.020095       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-drain-controller" 
      I0710 15:55:14.020147       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machineset-controller" 
      I0710 15:55:14.020169       1 controller.go:247]  "msg"="Shutdown signal received, waiting for all workers to finish" "controller"="machine-controller" 
      I0710 15:55:14.020184       1 controller.go:249]  "msg"="All workers finished" "controller"="machineset-controller" 
      I0710 15:55:14.020181       1 controller.go:249]  "msg"="All workers finished" "controller"="machine-drain-controller" 
      I0710 15:55:14.020190       1 controller.go:249]  "msg"="All workers finished" "controller"="machine-controller" 
      I0710 15:55:14.020209       1 logr.go:252]  "msg"="Stopping and waiting for caches"  
      I0710 15:55:14.020323       1 logr.go:252]  "msg"="Stopping and waiting for webhooks"  
      I0710 15:55:14.020327       1 reflector.go:225] Stopping reflector *v1alpha1.BareMetalHost (10h53m58.149951981s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      I0710 15:55:14.020393       1 reflector.go:225] Stopping reflector *v1beta1.Machine (9h40m22.116205595s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      I0710 15:55:14.020399       1 logr.go:252] controller-runtime/webhook "msg"="shutting down webhook server"  
      I0710 15:55:14.020437       1 reflector.go:225] Stopping reflector *v1.Node (10h3m14.461941979s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      I0710 15:55:14.020466       1 logr.go:252]  "msg"="Wait completed, proceeding to shutdown the manager"  
      I0710 15:55:14.020485       1 reflector.go:225] Stopping reflector *v1beta1.MachineSet (10h7m28.391827596s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262
      E0710 15:55:14.020500       1 main.go:218] baremetal-controller-manager/entrypoint "msg"="unable to run manager" "error"="failed to wait for metal3remediation caches to sync: timed out waiting for cache to be synced"  
      E0710 15:55:14.020504       1 logr.go:270]  "msg"="error received after stop sequence was engaged" "error"="leader election lost" 
      
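      The first error above suggests the Metal3Remediation kind is not being served at infrastructure.cluster.x-k8s.io/v1beta1 on this cluster. We have not confirmed the root cause, but the missing kind can be checked with standard oc commands:

      $ oc get crd metal3remediations.infrastructure.cluster.x-k8s.io
      # "Error from server (NotFound)" here would mean the CRD is not installed at all
      $ oc api-resources --api-group=infrastructure.cluster.x-k8s.io
      # lists which kinds and versions of that API group are actually served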

      Our CI job logs can be seen here (RedHat SSO): https://www.distributed-ci.io/jobs/7da8ee48-8918-4a97-8e3c-f525d19583b8/files
