  Red Hat OpenStack Services on OpenShift
  OSPRH-16994

service doesn't use updated keystone endpoint information if it changes


    • No Docs Impact
    • openstack-operator-bundle-container-1.0.12-6
    • Impediment
    • rhos-conplat-core-operators
    • Waiting For Release
    • Important

      To Reproduce

      Steps to reproduce the behavior (so far this has only been reproduced reliably using the PR mentioned below, but the same symptom has also been seen in CI jobs):

      1. Reproduce the issue locally with a delay added to openstack-operator. See the details in https://github.com/openstack-k8s-operators/openstack-operator/pull/1457
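
      The linked PR contains the actual change; purely as an illustration, a delay of this kind at the top of a reconcile loop widens the race window so that placement-operator registers its KeystoneEndpoint before openstack-operator has created the placement cert secrets (a minimal sketch under that assumption, not the code from the PR; the reconciler type and delay value are hypothetical):

      // Hypothetical sketch only, not the change from the PR above: an artificial
      // delay at the start of a reconcile makes it much more likely that the
      // placement KeystoneEndpoint is registered before the cert secrets exist.
      package controllers

      import (
          "context"
          "time"

          ctrl "sigs.k8s.io/controller-runtime"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      type delayedReconciler struct {
          client.Client
      }

      func (r *delayedReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
          // artificial delay (arbitrary value), only useful for reproducing the ordering issue locally
          time.Sleep(30 * time.Second)

          // ... the normal reconcile logic would continue here ...
          return ctrl.Result{}, nil
      }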

      Expected behavior

      • Nova caches the API endpoints; when e.g. the placement API endpoint changes, the service should automatically use the new information.

      Bug impact

      • Depending on the scenario (e.g. a switch from non-TLS to TLS), nova components can no longer talk to the affected service, and operations such as migration fail.

      Known workaround

      • Currently, when the issue is seen, a manual restart of the affected pods is required, e.g. `oc delete pod -n openstack nova..`

      Additional context

      • Root cause

      For one situation where this can happen:

      On the initial create of the placement CR, there are no placement k8s services yet. Because of this, the call for the endpoint services in https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/openstack/placement.go#L79 will not return any. As a result, EnsureEndpointConfig(), which takes care of creating the endpoint cert secrets, is never called. Therefore the initial placement CR will not have certSecret set and will not enable TLS initially.
      As soon as placement-operator has called its ensureServiceExpose, https://github.com/openstack-k8s-operators/placement-operator/blob/main/controllers/placementapi_controller.go#L448C33-L448C53 , the openstack-operator will reconcile, create the endpoint cert secrets, and update the placement CR with the certSecret details.
      Placement only creates the keystone endpoint once dbsync, the deployment and its keystoneservice have been created, https://github.com/openstack-k8s-operators/placement-operator/blob/main/controllers/placementapi_controller.go#L456-L474.
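
      The first part of this, skipping the cert handling on the initial create, can be paraphrased as a sketch (hypothetical helper and label names, not the actual openstack-operator code):

      // Sketch of the ordering problem described above: if the placement k8s
      // services do not exist yet when the control plane is reconciled, the cert
      // handling is skipped entirely, so the placement CR keeps an empty
      // certSecret and TLS stays disabled on the initial create.
      package openstack

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // ensurePlacementTLS is a hypothetical stand-in for the logic around
      // pkg/openstack/placement.go#L79 and EnsureEndpointConfig().
      func ensurePlacementTLS(ctx context.Context, c client.Client, namespace string) (string, error) {
          // look up the placement endpoint services by label
          svcList := &corev1.ServiceList{}
          err := c.List(ctx, svcList,
              client.InNamespace(namespace),
              client.MatchingLabels{"service": "placement"}) // hypothetical label
          if err != nil {
              return "", err
          }

          if len(svcList.Items) == 0 {
              // initial create: no services yet, so EnsureEndpointConfig() is never
              // reached, no cert secrets are created and certSecret stays empty
              return "", nil
          }

          // only once the services exist are the endpoint certs created and the
          // placement CR updated with the certSecret details
          return "cert-placement-internal-svc", nil
      }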

      With this, we only hit the described issue when the openstack-operator does not get to the point (e.g. because it is waiting for other services) of creating the placement certs and updating the placement CR with the certSecret details. This can happen when the openstack-operator returns early on any of the services handled before placement in https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/core/openstackcontrolplane_controller.go#L363-L396 . RabbitMQ, Galera and memcached are always set up with TLS right from the initial start because of how disruptive it would be if they had to be restarted.
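
      The early-return part can be sketched similarly (hypothetical stand-in functions, not the actual control plane reconcile loop):

      // Sketch of the early-return behaviour described above (the real loop is in
      // openstackcontrolplane_controller.go#L363-L396): if reconciling a service
      // that is handled before placement fails, the reconcile returns immediately
      // and the placement cert secrets are never created in that pass.
      package openstack

      import (
          "context"
          "errors"

          ctrl "sigs.k8s.io/controller-runtime"
      )

      // hypothetical stand-ins for the per-service reconcile helpers
      func reconcileKeystone(ctx context.Context) error {
          // while the kube-apiserver is unavailable, patching the keystone route fails
          return errors.New("the server is currently unable to handle the request " +
              "(patch routes.route.openshift.io keystone-public)")
      }

      func reconcilePlacement(ctx context.Context) error { return nil }

      func reconcileServices(ctx context.Context) (ctrl.Result, error) {
          for _, fn := range []func(context.Context) error{
              reconcileKeystone,  // fails first while the apiserver is unreachable ...
              reconcilePlacement, // ... so this is never reached and certSecret is never set
          } {
              if err := fn(ctx); err != nil {
                  return ctrl.Result{}, err
              }
          }
          return ctrl.Result{}, nil
      }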

      I tried to correlate the placement-operator and openstack-operator logs from https://sf.apps.int.gpc.ocp-hub.prod.psi.redhat.com/logs/5fe/components-integration/5fe82f465fdf4544b3dc9b1c4ded0ff6/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack-operators/pods/placement-operator-controller-manager-65b54b6b77-dcfhp/logs/ and I think we see exactly this.

      From the placement-operator log we see the KeystoneEndpoint got created at `04:35:38.818Z`:

      2025-05-19T04:35:38.818Z        INFO    Controllers.PlacementAPI        KeystoneEndpoint placement - created    {"controller": "placementapi", "controllerGroup": "placement.openstack.org", "controllerKind": "PlacementAPI", "PlacementAPI": {"name":"placement","namespace":"openstack"}, "namespace": "openstack", "name": "placement", "reconcileID": "c46fa68e-3d6c-431d-8b2c-e4c72445ad28"}
      2025-05-19T04:35:38.830Z        INFO    Controllers.PlacementAPI        Successfully ensured MariaDBAccount placement exists; database username is placement_2304       {"controller": "placementapi", "controllerGroup": "placement.openstack.org", "controllerKind": "PlacementAPI", "PlacementAPI": {"name":"placement","namespace":"openstack"}, "namespace": "openstack", "name": "placement", "reconcileID": "f2e81d60-5cbe-44d3-9b93-a4cb8af20676", "ObjectType": "*v1beta1.MariaDBAccount", "ObjectNamespace": "openstack", "ObjectName": "placement"}
      

      When we check the openstack-operator log, we see that the placement k8s service certs got created after that, at `2025-05-19T04:36:21.302Z`:

      2025-05-19T04:36:21.302Z        INFO    Controllers.OpenStackControlPlane       Route placement-internal-svc - created  {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "c0b55ccc-cf57-4a02-8a48-a1e134d3a6bd"}
      2025-05-19T04:36:21.302Z        INFO    Controllers.OpenStackControlPlane       Secret cert-placement-internal-svc not found, reconcile in 5ns  {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "c0b55ccc-cf57-4a02-8a48-a1e134d3a6bd"}
      

      From `04:34:43.640Z` till `04:36:19.136Z` the openstack-operator fails to update the keystone route and therefore does not continue to reconcile placement with its certs:

      2025-05-19T04:34:43.640Z        INFO    Controllers.OpenStackControlPlane       Error reconciling normal        {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "60b25436-fcfa-492a-97aa-dea3294a8be5", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
      2025-05-19T04:34:43.710Z        ERROR   Reconciler error        {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "60b25436-fcfa-492a-97aa-dea3294a8be5", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:329
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:266
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:227
      ...
      2025-05-19T04:36:19.136Z        INFO    Controllers.OpenStackControlPlane       Error reconciling normal        {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "eb7129fb-7a91-44f0-a6c0-6a518b39102c", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
      2025-05-19T04:36:19.141Z        ERROR   Reconciler error        {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "eb7129fb-7a91-44f0-a6c0-6a518b39102c", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:329
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:266
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
              /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:227
      

      At that time we see errors in the kube-apiserver.log https://sf.apps.int.gpc.ocp-hub.prod.psi.redhat.com/logs/5fe/components-integration/5fe82f465fdf4544b3dc9b1c4ded0ff6/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openshift-kube-apiserver/pods/kube-apiserver-crc/logs/kube-apiserver.log starting at 04:34:21.494871; the apiserver is back operational at 04:36:19.242852.

              rhn-support-mschuppe Martin Schuppert