- Bug
- Resolution: Done-Errata
- Critical
- rhos-18.0.9
- None
- 8
- False
- False
- No Docs Impact
- openstack-operator-bundle-container-1.0.12-6
- Impediment
- rhos-conplat-core-operators
- None
- Waiting For Release
- 1
- Important
To Reproduce
Steps to reproduce the behavior:
So far this has only been reproduced using the PR mentioned below, but it has also been seen in CI jobs.
- Reproduced the issue locally with a delay added to openstack-operator. See the details in https://github.com/openstack-k8s-operators/openstack-operator/pull/1457
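A minimal sketch of the reproduction idea, not the actual change from the PR above: the helper name and delay value are made up for illustration. The point is to artificially delay the openstack-operator's work so that placement-operator registers its Keystone endpoint before the TLS cert secrets exist.

```go
// Hypothetical sketch only: delay a reconcile step so the certSecret update
// arrives after Placement has already created its Keystone endpoint.
package reprodelay

import (
	"context"
	"time"
)

// reproDelay is a made-up knob; the real PR wires the delay in differently.
const reproDelay = 90 * time.Second

// ReconcileWithDelay wraps a reconcile step with an artificial sleep to
// reproduce the race locally.
func ReconcileWithDelay(ctx context.Context, reconcile func(context.Context) error) error {
	select {
	case <-time.After(reproDelay): // simulate the openstack-operator being busy/blocked
	case <-ctx.Done():
		return ctx.Err()
	}
	return reconcile(ctx)
}
```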
Expected behavior
- Nova caches the API endpoints; when e.g. the placement API endpoint changes, the service should automatically pick up and use the new information.
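For illustration only (this is not Nova's code, and all names below are assumptions): the expected pattern is an endpoint cache that can be invalidated and re-resolved from the catalog, so a changed endpoint (e.g. one newly switched to TLS) is picked up without restarting the service.

```go
// Illustrative sketch of an endpoint cache with invalidation; not Nova's
// implementation.
package endpointcache

import (
	"context"
	"sync"
)

// CatalogLookup stands in for a Keystone catalog query; hypothetical signature.
type CatalogLookup func(ctx context.Context, service string) (string, error)

type Cache struct {
	mu     sync.Mutex
	urls   map[string]string
	lookup CatalogLookup
}

func New(lookup CatalogLookup) *Cache {
	return &Cache{urls: map[string]string{}, lookup: lookup}
}

// Endpoint returns the cached URL, or resolves and caches it on first use.
func (c *Cache) Endpoint(ctx context.Context, service string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if url, ok := c.urls[service]; ok {
		return url, nil
	}
	url, err := c.lookup(ctx, service)
	if err != nil {
		return "", err
	}
	c.urls[service] = url
	return url, nil
}

// Invalidate drops the cached URL so the next call re-reads the catalog,
// e.g. after a request against the old endpoint failed.
func (c *Cache) Invalidate(service string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.urls, service)
}
```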
Bug impact
- Depending on the scenario (switching from non-TLS to TLS), Nova components cannot talk to the affected service, and operations such as migration fail.
Known workaround
- Currently, when the issue is seen, a manual restart of the affected pods is required: `oc delete pod -n openstack nova..`
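For completeness, the same workaround could be scripted with client-go instead of `oc`. The namespace mirrors the command above, but the label selector below is an assumption for illustration, not taken from this bug.

```go
// Sketch of the manual-restart workaround via client-go; the selector is
// hypothetical.
package workaround

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// RestartNovaPods deletes the matching pods so their owning controllers
// recreate them with the refreshed endpoint/TLS configuration.
func RestartNovaPods(ctx context.Context, client kubernetes.Interface) error {
	return client.CoreV1().Pods("openstack").DeleteCollection(
		ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "service=nova"}, // assumed selector
	)
}
```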
Additional context
- Root cause (for one situation where this can happen):
On the initial creation of the Placement CR, there are no placement k8s services yet. Because of this, the call that looks up the endpoint services at https://github.com/openstack-k8s-operators/openstack-operator/blob/main/pkg/openstack/placement.go#L79 does not return any. As a result, EnsureEndpointConfig(), which takes care of creating the endpoint cert secrets, is never called. Therefore the initial Placement CR does not have certSecret set and TLS is not enabled initially.
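A paraphrased sketch of that flow, with hypothetical helper names rather than the exact openstack-operator API: when the service lookup returns nothing on the first reconcile, the cert handling is skipped entirely, so the Placement CR ends up without certSecret and without TLS.

```go
// Illustrative only; function names are stand-ins for the flow described above.
package placementflow

import "context"

type Service struct{ Name string }

func ReconcilePlacementTLS(ctx context.Context,
	getEndpointServices func(context.Context) ([]Service, error),
	ensureEndpointConfig func(context.Context, []Service) (map[string]string, error),
	setCertSecrets func(map[string]string),
) error {
	svcs, err := getEndpointServices(ctx) // empty on the very first reconcile
	if err != nil {
		return err
	}
	if len(svcs) == 0 {
		// Nothing to configure yet: the cert secrets are never created,
		// so no certSecret field is set on the Placement CR.
		return nil
	}
	certSecrets, err := ensureEndpointConfig(ctx, svcs)
	if err != nil {
		return err
	}
	setCertSecrets(certSecrets) // patches certSecret into the Placement CR spec
	return nil
}
```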
As soon as placement-operator has called its ensureServiceExpose, https://github.com/openstack-k8s-operators/placement-operator/blob/main/controllers/placementapi_controller.go#L448C33-L448C53 , openstack-operator will reconcile, create them (the endpoint cert secrets), and update the Placement CR with the certSecret details.
Placement will only create the Keystone endpoint once dbsync, the deployment, and its KeystoneService have been created, https://github.com/openstack-k8s-operators/placement-operator/blob/main/controllers/placementapi_controller.go#L456-L474.
With this, we only hit the described issue when openstack-operator does not get to the point (because it is waiting on other services) of creating the placement certs and updating the Placement CR with the certSecret details. This can happen when openstack-operator returns early on any of the services handled before placement in https://github.com/openstack-k8s-operators/openstack-operator/blob/main/controllers/core/openstackcontrolplane_controller.go#L363-L396 . RabbitMQ, Galera, and memcached are always set up with TLS right from the initial start because restarting them would be too disruptive.
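A sketch of that early-return pattern (hypothetical types and names, not the actual controller code): the control plane reconcile walks the services in order and stops at the first one that is not done, so placement's cert handling is postponed while e.g. the keystone route patch keeps failing.

```go
// Illustrative only; shows why a failure on an earlier service delays the
// placement cert/certSecret handling.
package controlplaneflow

import (
	"context"
	"fmt"
)

type serviceReconciler struct {
	name      string
	reconcile func(context.Context) (done bool, err error)
}

func reconcileServices(ctx context.Context, services []serviceReconciler) error {
	for _, s := range services {
		done, err := s.reconcile(ctx)
		if err != nil {
			// e.g. "patch routes.route.openshift.io keystone-public" failing:
			// everything after this service (including placement) waits.
			return fmt.Errorf("reconciling %s: %w", s.name, err)
		}
		if !done {
			return nil // requeue; later services are not reached yet
		}
	}
	return nil
}
```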
I tried to correlate the placement-operator and openstack-operator logs from https://sf.apps.int.gpc.ocp-hub.prod.psi.redhat.com/logs/5fe/components-integration/5fe82f465fdf4544b3dc9b1c4ded0ff6/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack-operators/pods/placement-operator-controller-manager-65b54b6b77-dcfhp/logs/ and I think we see exactly this.
From the placement-operator log we see that the KeystoneEndpoint was created at `04:35:38.818Z`:
2025-05-19T04:35:38.818Z INFO Controllers.PlacementAPI KeystoneEndpoint placement - created {"controller": "placementapi", "controllerGroup": "placement.openstack.org", "controllerKind": "PlacementAPI", "PlacementAPI": {"name":"placement","namespace":"openstack"}, "namespace": "openstack", "name": "placement", "reconcileID": "c46fa68e-3d6c-431d-8b2c-e4c72445ad28"}
2025-05-19T04:35:38.830Z INFO Controllers.PlacementAPI Successfully ensured MariaDBAccount placement exists; database username is placement_2304 {"controller": "placementapi", "controllerGroup": "placement.openstack.org", "controllerKind": "PlacementAPI", "PlacementAPI": {"name":"placement","namespace":"openstack"}, "namespace": "openstack", "name": "placement", "reconcileID": "f2e81d60-5cbe-44d3-9b93-a4cb8af20676", "ObjectType": "*v1beta1.MariaDBAccount", "ObjectNamespace": "openstack", "ObjectName": "placement"}
When we check the openstack-operator log, we see that the placement k8s service certs got created after that, at `2025-05-19T04:36:21.302Z`:
2025-05-19T04:36:21.302Z INFO Controllers.OpenStackControlPlane Route placement-internal-svc - created {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "c0b55ccc-cf57-4a02-8a48-a1e134d3a6bd"}
2025-05-19T04:36:21.302Z INFO Controllers.OpenStackControlPlane Secret cert-placement-internal-svc not found, reconcile in 5ns {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "c0b55ccc-cf57-4a02-8a48-a1e134d3a6bd"}
From `04:34:43.640Z` until `04:36:19.136Z` the openstack-operator fails to update the keystone route and therefore does not continue reconciling placement with its certs:
2025-05-19T04:34:43.640Z INFO Controllers.OpenStackControlPlane Error reconciling normal {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "60b25436-fcfa-492a-97aa-dea3294a8be5", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
2025-05-19T04:34:43.710Z ERROR Reconciler error {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "60b25436-fcfa-492a-97aa-dea3294a8be5", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:227
...
2025-05-19T04:36:19.136Z INFO Controllers.OpenStackControlPlane Error reconciling normal {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "eb7129fb-7a91-44f0-a6c0-6a518b39102c", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
2025-05-19T04:36:19.141Z ERROR Reconciler error {"controller": "openstackcontrolplane", "controllerGroup": "core.openstack.org", "controllerKind": "OpenStackControlPlane", "OpenStackControlPlane": {"name":"controlplane","namespace":"openstack"}, "namespace": "openstack", "name": "controlplane", "reconcileID": "eb7129fb-7a91-44f0-a6c0-6a518b39102c", "error": "the server is currently unable to handle the request (patch routes.route.openshift.io keystone-public)"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.6/pkg/internal/controller/controller.go:227
At that time we see errors in the kube-apiserver.log https://sf.apps.int.gpc.ocp-hub.prod.psi.redhat.com/logs/5fe/components-integration/5fe82f465fdf4544b3dc9b1c4ded0ff6/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openshift-kube-apiserver/pods/kube-apiserver-crc/logs/kube-apiserver.log starting at 04:34:21.494871; the apiserver is back operational at 04:36:19.242852.
- links to: RHSA-2025:152105 Release of containers for RHOSO OpenStack Podified operator
- mentioned on