  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-11068

openstack-operator-controller-manager crashes with a nil pointer dereference panic when a deployment job fails


    • 3
    • False
    • False
    • ?
    • ?
    • openstack-operator-container-1.0.4-6
    • ?
    • ?
    • Yes
    • Approved
    • Important

      A failing deployment job causes a nil pointer dereference:

       

      2024-10-30T14:32:11.001Z        INFO    Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference        {"controller": "openstackdataplanedeployment", "controllerGroup": "dataplane.openstack.org", "controllerKind": "OpenStackDataPlaneDeployment", "OpenStackDataPlaneDeployment": {"name":"edpm-deployment","namespace":"openstack"}, "namespace": "openstack", "name": "edpm-deployment", "reconcileID": "242891d0-88cc-4d4a-a5af-0ef99b779d3f"}
      panic: runtime error: invalid memory address or nil pointer dereference [recovered]
              panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x1ac64ca]
                                                                                                                 
      goroutine 903 [running]:                       
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
              /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:116 +0x1e5
      panic({0x1d3d280?, 0x3464d30?})       
              /usr/lib/golang/src/runtime/panic.go:914 +0x21f
      github.com/openstack-k8s-operators/openstack-operator/pkg/dataplane.(*Deployer).ConditionalDeploy(_, {_, _}, {_, _}, {_, _}, {_, _}, {0xc0042b0690, ...}, ...)
              /remote-source/pkg/dataplane/deployment.go:237 +0xe2a
      github.com/openstack-k8s-operators/openstack-operator/pkg/dataplane.(*Deployer).Deploy(0xc00108c420, {0xc003b21100, 0x10, 0x10})
              /remote-source/pkg/dataplane/deployment.go:136 +0x6e5
      github.com/openstack-k8s-operators/openstack-operator/controllers/dataplane.(*OpenStackDataPlaneDeploymentReconciler).Reconcile(0xc000619ad0, {0x23dbe58?, 0xc0031404e0}, {{{0xc004582ae3?, 0x5?}, {0xc004582b20?, 0xc00108dd08?}}})
              /remote-source/controllers/dataplane/openstackdataplanedeployment_controller.go:333 +0x1cea
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x23df248?, {0x23dbe58?, 0xc0031404e0?}, {{{0xc004582ae3?, 0xb?}, {0xc004582b20?, 0x0?}}})
              /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:119 +0xb7
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0006263c0, {0x23dbe90, 0xc000130500}, {0x1e205c0?, 0xc002a08a00?})
              /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:316 +0x3cc
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0006263c0, {0x23dbe90, 0xc000130500})
              /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:266 +0x1af
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
              /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:227 +0x79
      created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 243
              /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:223 +0x565

       

       

      Job status at the time is:

       

      status:
        failed: 2
        ready: 0
        startTime: "2024-10-30T14:32:01Z"
        terminating: 0
        uncountedTerminatedPods: {}

       

       

      Issue occurs here:

      https://github.com/openstack-k8s-operators/openstack-operator/blob/1d1ce65c57c0606d9b8d0ea9a92e56e7f04409ac/pkg/dataplane/deployment.go#L237

      ansibleCondition is nil when the job status has no conditions, or when the condition type is not batchv1.JobFailed, so dereferencing it at that line panics.
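
      The failure mode described above can be illustrated with a minimal sketch. It is not the operator's actual code or the actual fix: the types below are simplified stand-ins for the batchv1 job types so that the example runs without Kubernetes dependencies, and findFailedCondition and failureMessage are hypothetical helper names. The point is only that a lookup which can return nil needs a nil check before its fields are read.

```go
package main

import "fmt"

// JobConditionType and JobCondition are simplified stand-ins for the
// batchv1 types (assumption: only the fields used here are modelled).
type JobConditionType string

const JobFailed JobConditionType = "Failed"

type JobCondition struct {
	Type    JobConditionType
	Message string
}

// JobStatus models only the parts of the job status relevant to this issue.
type JobStatus struct {
	Failed     int32
	Conditions []JobCondition
}

// findFailedCondition (hypothetical helper) returns the JobFailed condition,
// or nil when the status has no conditions or none of type JobFailed.
func findFailedCondition(status JobStatus) *JobCondition {
	for i := range status.Conditions {
		if status.Conditions[i].Type == JobFailed {
			return &status.Conditions[i]
		}
	}
	return nil
}

// failureMessage (hypothetical helper) builds an error message without
// dereferencing a nil condition.
func failureMessage(status JobStatus) string {
	cond := findFailedCondition(status)
	if cond == nil {
		// Pods have failed, but the job has not reached its backoff limit,
		// so no JobFailed condition has been set yet.
		return fmt.Sprintf("job has %d failed pod(s) but no JobFailed condition yet", status.Failed)
	}
	return cond.Message
}

func main() {
	// Mirrors the reported status: failed: 2, no conditions.
	retrying := JobStatus{Failed: 2}
	fmt.Println(failureMessage(retrying))

	// Once a JobFailed condition is present, its message is used.
	exhausted := JobStatus{
		Failed:     3,
		Conditions: []JobCondition{{Type: JobFailed, Message: "backoff limit reached"}},
	}
	fmt.Println(failureMessage(exhausted))
}
```

      With the status reported in this issue (failed: 2 and no conditions), the guarded path returns a message instead of panicking. Whether the operator should requeue, report an error, or wait at that point is a design decision for the real fix.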

      The operator eventually recovers once the job backoff limit is reached and the expected condition is set on the job.

      While the operator is down, its CR cannot be altered because the operator's webhook service has no endpoints, e.g.:

      oc edit -n openstack openstackcontrolplane.core.openstack.org openstack-galera-network-isolation
      error: openstackcontrolplanes.core.openstack.org "openstack-galera-network-isolation" could not be patched: Internal error occurred: failed calling webhook "mopenstackcontrolplane.kb.io": failed to call webhook: Post "https://openstack-operator-controller-manager-service.openstack-operators.svc:443/mutate-core-openstack-org-v1beta1-openstackcontrolplane?timeout=10s": no endpoints available for service "openstack-operator-controller-manager-service"
      You can run `oc replace -f /tmp/oc-edit-4063109714.yaml` to try this update again.

              rhn-support-bshephar Brendan Shephard
              rhn-engineering-owalsh Oliver Walsh
              rhos-dfg-df