Hi,
In this uni03gamma-update-sdatko job we are testing OpenStack service availability during the update by spawning and destroying an instance on OpenStack.
We observe two things during the update:
- First, we get 500 responses from the server for a few minutes. This has been observed in several builds.
- Second, we are not able to spawn an instance after the update in less than 20 seconds. This has been seen in one build.
As a reference, we are able to create one instance and then it fails, as seen in the instance-creation log (in this directory):
Thu Oct 17 22:49:50 EDT 2024 (1729219790) 329s SUCCESS (0)
Thu Oct 17 22:55:20 EDT 2024 (1729220120) 135s FAILED (127)
Thu Oct 17 22:57:36 EDT 2024 (1729220256) 392s FAILED (1)
Thu Oct 17 23:04:09 EDT 2024 (1729220649) 288s FAILED (1)
Thu Oct 17 23:08:58 EDT 2024 (1729220938) 301s FAILED (1)
Note that the logs on the compute nodes are +4h, so the first error will be around Oct 18 02:55 there.
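As a quick sanity check of the +4h offset (assuming the compute nodes log in UTC; GNU date on any Linux host):

TZ=UTC date -d 'Thu Oct 17 22:55:20 EDT 2024'
# -> Fri Oct 18 02:55:20 UTC 2024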
Here is the timeline of the steps in the update process:
01 before_update-containers      2024-10-18 02:49
01 before_update-packages        2024-10-18 02:49
01 before_update-pods            2024-10-18 02:50
02 after_ovn_controlplane_update 2024-10-18 02:51
03 after_ovn_dataplane_update    2024-10-18 02:54
04 after_controlplane_update     2024-10-18 03:02
05 after_update                  2024-10-18 03:13
The first 500 error is seen at that time (in this file):
+ openstack compute service list --service nova-compute -f value -c State
I1017 22:55:24.514599 64273 log.go:245] (0xc0006f80b0) Data frame received for 7
I1017 22:55:24.514634 64273 log.go:245] (0xc001149d60) (7) Data frame handling
I1017 22:55:24.514650 64273 log.go:245] (0xc001149d60) (7) Data frame sent
Internal Server Error (HTTP 500)
So that is 02:55:24 on the compute nodes, after the OVN dataplane update and during the controlplane update.
It is associated with this error message (from logs/controller-0/ci-framework-data/logs/compute-zwjjmcmr-1.utility/log/messages):
Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: 2024-10-18 02:55:05.969 2 DEBUG neutronclient.v2_0.client [None req-48ebd31b-210d-4306-adf5-adf5020e4910 - - - - - -] Error message: {"message": "The server is currently unavailable. Please try again at a later time.<br /><br />\nThe Keystone service is temporarily unavailable.\n\n", "code": "503 Service Unavailable", "title": "Service Unavailable"} _handle_fault_response /usr/lib/python3.9/site-packages/neutronclient/v2_0/client.py:262
Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: 2024-10-18 02:55:05.970 2 ERROR nova.compute.manager [None req-48ebd31b-210d-4306-adf5-adf5020e4910 - - - - - -] [instance: 8a844f0e-eb9d-4028-8c61-a5574d854d59] An error occurred while refreshing the network cache.: neutronclient.common.exceptions.ServiceUnavailable: The server is currently unavailable. Please try again at a later time.<br /><br />
Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: The Keystone service is temporarily unavailable.
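For anyone re-checking, that line can be located with a plain grep against the messages file referenced above (path as collected in this job's logs):

grep -n 'Keystone service is temporarily unavailable' \
    logs/controller-0/ci-framework-data/logs/compute-zwjjmcmr-1.utility/log/messages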
Then the service recovers, at around 02:59:01:
+ openstack compute service list --service nova-compute -f value -c State
I1017 22:58:56.574853 64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
I1017 22:58:56.581468 64464 log.go:245] (0xc000948000) Data frame received for 7
I1017 22:58:56.581486 64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
I1017 22:58:56.581493 64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
+ grep -q down
I1017 22:59:01.478058 64464 log.go:245] (0xc000948000) Data frame received for 7
I1017 22:59:01.478087 64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
I1017 22:59:01.478125 64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
Internal Server Error (HTTP 500)
I1017 22:59:01.850213 64464 log.go:245] (0xc000948000) Data frame received for 7
I1017 22:59:01.850239 64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
I1017 22:59:01.850249 64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
+ openstack network agent list -f value -c Alive
+ grep -q false
I1017 22:59:08.387804 64464 log.go:245] (0xc000948000) Data frame received for 5
I1017 22:59:08.387830 64464 log.go:245] (0xc000a312c0) (5) Data frame handling
All compute and networking services are up and running
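From the "+" shell trace above, the recovery check boils down to polling two CLI calls until neither reports a down/dead service. A minimal sketch of that loop (only the two openstack/grep pipelines come from the trace; the loop shape, interval and final message handling are my assumptions, not the actual job script):

# Sketch reconstructed from the trace; not the real job script.
until ! openstack compute service list --service nova-compute -f value -c State | grep -q down &&
      ! openstack network agent list -f value -c Alive | grep -q false; do
    sleep 5
done
echo "All compute and networking services are up and running"

Note that a transient "Internal Server Error (HTTP 500)" makes the openstack command itself fail with an empty listing rather than report a service as down, so a check written this way can pass right after such an error, which matches what the trace shows at 22:59:01.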
But from there we are unable to create a new instance in less than 20 seconds, so from around 03:01 on:
I1017 23:01:36.961721 64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
+ TENANT_NET_ID=7a1eb4ce-a86a-4731-a252-d58f5b5a0131
+ echo 'Creating overcloud instance instance_dcaf13ec78'
+ os_cmd server create --image upgrade_workload_dcaf13ec78 --flavor v1-512M-10G-dcaf13ec78 --security-group allow-icmp-ssh-dcaf13ec78 --key-name userkey_dcaf13ec78 --nic net-id=7a1eb4ce-a86a-4731-a252-d58f5b5a0131 instance_dcaf13ec78
....
I1017 23:03:02.770058 64464 log.go:245] (0xc000948000) Data frame received for 7
I1017 23:03:02.770066 64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
+ INSTANCE_STATUS=ERROR
+ case "${INSTANCE_STATUS}" in
+ echo 'instance_dcaf13ec78 failed'
+ exit 1
+ cleanup_on_exit
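Reading the trace, the failing step is essentially an openstack server create followed by a status check. A hedged reconstruction of what that step seems to do (os_cmd is assumed to wrap the openstack client; the status lookup and the cleanup handling are my assumptions, only the names and options come from the trace):

# Sketch reconstructed from the "+" trace above; not the actual job script.
os_cmd() { openstack "$@"; }   # assumed meaning of os_cmd
SUFFIX=dcaf13ec78
TENANT_NET_ID=7a1eb4ce-a86a-4731-a252-d58f5b5a0131
echo "Creating overcloud instance instance_${SUFFIX}"
os_cmd server create --image upgrade_workload_${SUFFIX} \
       --flavor v1-512M-10G-${SUFFIX} \
       --security-group allow-icmp-ssh-${SUFFIX} \
       --key-name userkey_${SUFFIX} \
       --nic net-id=${TENANT_NET_ID} \
       instance_${SUFFIX}
# How the job reads back the status is not in the trace; something like:
INSTANCE_STATUS=$(os_cmd server show instance_${SUFFIX} -f value -c status)
case "${INSTANCE_STATUS}" in
    ACTIVE) echo "instance_${SUFFIX} created" ;;
    *)      echo "instance_${SUFFIX} failed"
            exit 1 ;;   # cleanup_on_exit in the trace is presumably an EXIT trap
esac

In this run INSTANCE_STATUS came back as ERROR, so the step prints "instance_dcaf13ec78 failed" and exits 1.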
The instance id is 6c822b56-e497-435c-86c6-f30cb84fae85, but I could not find relevant logs around it in the "messages" files.
Logs from computes:
compute-zwjjmcmr-0.utility/
compute-zwjjmcmr-1.utility/
compute-zwjjmcmr-2.utility/
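For anyone wanting to re-check, a recursive grep over the collected compute logs (directory names as listed above) is the quickest way to look for that instance id:

grep -rn '6c822b56-e497-435c-86c6-f30cb84fae85' \
    compute-zwjjmcmr-0.utility/ compute-zwjjmcmr-1.utility/ compute-zwjjmcmr-2.utility/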
is related to: OSPRH-11488 - failover of the galera service exposes disconnected galera node to the openstack clients (Dev Complete)