Red Hat OpenStack Services on OpenShift
OSPRH-10790

Cut in service availability during update and unable to create VM after update


Release note (Known Issue):
      .Control plane temporarily unavailable during minor update

      During the minor update to 18.0 Feature Release 1, the Red Hat OpenStack Platform control plane temporarily becomes unavailable. API requests might fail with HTTP error codes, such as error 500. Alternatively, the API requests might succeed but the underlying life cycle operation fails. For example, a virtual machine (VM) created with the `openstack server create` command during the minor update never reaches the `ACTIVE` state. The control plane outage is temporary and automatically recovers after the minor update is finished. The control plane outage does not affect the already running workload.


      Hi,

      In this uni03gamma-update-sdatko job, we test OpenStack service availability during the update by repeatedly spawning and destroying an instance on OpenStack.
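      For context, the availability probe in that job boils down to repeatedly creating and deleting a small instance and timing the result; the sketch below is not the exact job code (image, flavor and network names are placeholders), but its output would resemble the SUCCESS/FAILED timing lines quoted further down.

      # Minimal sketch of the availability probe; image/flavor/network names are placeholders.
      while true; do
          start=$(date +%s)
          if openstack server create --image cirros --flavor m1.tiny \
                  --network private --wait probe-vm > /dev/null; then
              echo "$(date) $(( $(date +%s) - start ))s SUCCESS (0)"
          else
              rc=$?
              echo "$(date) $(( $(date +%s) - start ))s FAILED (${rc})"
          fi
          openstack server delete --wait probe-vm || true
          sleep 30
      done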

       

      We observe two things during the update:

      1. First, we get HTTP 500 responses from the server for a few minutes. This has been observed in several builds.
      2. Second, we are not able to spawn an instance in less than 20 seconds after the update. This has been seen in one build.

       

      As a reference, we are able to create one instance and then it fails, as seen in the instance creation log (in this directory):

       

      Thu Oct 17 22:49:50 EDT 2024 (1729219790) 329s SUCCESS (0)
      Thu Oct 17 22:55:20 EDT 2024 (1729220120) 135s FAILED (127)
      Thu Oct 17 22:57:36 EDT 2024 (1729220256) 392s FAILED (1)
      Thu Oct 17 23:04:09 EDT 2024 (1729220649) 288s FAILED (1)
      Thu Oct 17 23:08:58 EDT 2024 (1729220938) 301s FAILED (1)

       

      Note that the logs on the compute nodes are +4h, so the first error will be at around Oct 18 02:55 there.
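      For the record, since EDT is UTC-4, the +4h offset means the compute node clocks are effectively UTC, and GNU date can do the conversion directly:

      $ date -u -d 'Thu Oct 17 22:55:20 EDT 2024'
      Fri Oct 18 02:55:20 UTC 2024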

      Here is the timeline of the steps in the update process:

       

      01 before_update-containers     2024-10-18 02:49 
      01 before_update-packages     2024-10-18 02:49 
      01 before_update-pods    2024-10-18 02:50 
      02 after_ovn_controlplane_update     2024-10-18 02:51 
      03 after_ovn_dataplane_update     2024-10-18 02:54 
      04 after_controlplane_update     2024-10-18 03:02 
      05 after_update     2024-10-18 03:13 

       

      The first 500 error is seen at that time (in this file):

       

      + openstack compute service list --service nova-compute -f value -c State
      I1017 22:55:24.514599   64273 log.go:245] (0xc0006f80b0) Data frame received for 7
      I1017 22:55:24.514634   64273 log.go:245] (0xc001149d60) (7) Data frame handling
      I1017 22:55:24.514650   64273 log.go:245] (0xc001149d60) (7) Data frame sent
      Internal Server Error (HTTP 500)

       

      So 02:55:24 on the compute nodes, after ovn_dataplane_update and during controlplane_update.

      It is associated with this error message (from logs/controller-0/ci-framework-data/logs/compute-zwjjmcmr-1.utility/log/messages):

      Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: 2024-10-18 02:55:05.969 2 DEBUG neutronclient.v2_0.client [None req-48ebd31b-210d-4306-adf5-adf5020e4910 - - - - - -] Error message: {"message": "The server is currently unavailable. Please try again at a later time.<br /><br />\nThe Keystone service is temporarily unavailable.\n\n", "code": "503 Service Unavailable", "title": "Service Unavailable"} _handle_fault_response /usr/lib/python3.9/site-packages/neutronclient/v2_0/client.py:262#033[00m
      Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: 2024-10-18 02:55:05.970 2 ERROR nova.compute.manager [None req-48ebd31b-210d-4306-adf5-adf5020e4910 - - - - - -] [instance: 8a844f0e-eb9d-4028-8c61-a5574d854d59] An error occurred while refreshing the network cache.: neutronclient.common.exceptions.ServiceUnavailable: The server is currently unavailable. Please try again at a later time.<br /><br />
      Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: The Keystone service is temporarily unavailable.
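      To find all such occurrences, a plain grep over the compute messages file quoted above is enough:

      $ grep -E 'ServiceUnavailable|Service Unavailable' \
            logs/controller-0/ci-framework-data/logs/compute-zwjjmcmr-1.utility/log/messages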

       

      Then the service recovers there, at around 02:59:01:

       

      + openstack compute service list --service nova-compute -f value -c State
      I1017 22:58:56.574853   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      I1017 22:58:56.581468   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 22:58:56.581486   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      I1017 22:58:56.581493   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      + grep -q down
      I1017 22:59:01.478058   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 22:59:01.478087   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      I1017 22:59:01.478125   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      Internal Server Error (HTTP 500)
      I1017 22:59:01.850213   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 22:59:01.850239   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      I1017 22:59:01.850249   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      + openstack network agent list -f value -c Alive
      + grep -q false
      I1017 22:59:08.387804   64464 log.go:245] (0xc000948000) Data frame received for 5
      I1017 22:59:08.387830   64464 log.go:245] (0xc000a312c0) (5) Data frame handling
      All compute and networking services are up and running
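      For reference, the health check seen in that output amounts to something like the sketch below (not the exact job code; an HTTP 500 from the API itself would also need to be treated as "not healthy yet"):

      # Poll until no nova-compute service is down and no neutron agent reports Alive=false.
      while openstack compute service list --service nova-compute -f value -c State | grep -q down \
            || openstack network agent list -f value -c Alive | grep -q false; do
          sleep 10
      done
      echo "All compute and networking services are up and running"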

       

      But from there on, we are unable to create a new instance in less than 20 seconds, so from around 03:01 onwards:

       

      I1017 23:01:36.961721   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      + TENANT_NET_ID=7a1eb4ce-a86a-4731-a252-d58f5b5a0131
      + echo 'Creating overcloud instance instance_dcaf13ec78'
      + os_cmd server create --image upgrade_workload_dcaf13ec78 --flavor v1-512M-10G-dcaf13ec78 --security-group allow-icmp-ssh-dcaf13ec78 --key-name userkey_dcaf13ec78 --nic net-id=7a1eb4ce-a86a-4731-a252-d58f5b5a0131 instance_dcaf13ec78
      ....
      I1017 23:03:02.770058   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 23:03:02.770066   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      + INSTANCE_STATUS=ERROR
      + case "${INSTANCE_STATUS}" in
      + echo 'instance_dcaf13ec78 failed'
      + exit 1
      + cleanup_on_exit
      

       

      The instance ID is 6c822b56-e497-435c-86c6-f30cb84fae85, but I could not find any relevant log entries around it in the "messages" files.
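      In case it helps the investigation, the fault recorded on the instance itself and a search for the instance ID across the must-gather logs might say more than the compute "messages" files, for example:

      $ openstack server show 6c822b56-e497-435c-86c6-f30cb84fae85 -f value -c fault
      $ grep -r 6c822b56-e497-435c-86c6-f30cb84fae85 logs/controller-0/ci-framework-data/logs/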

       

       

      Must-gather logs: https://sf.hosted.upshift.rdu2.redhat.com/logs/42/742/9674b907727c7f456ce0425dbccfc6936832e35a/check-gitlab-cee/uni03gamma-update-sdatko/79f879a/logs/controller-0/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/registry-proxy-engineering-redhat-com-rh-osbs-rhoso-podified-trunk-openstack-must-gather-rhel9-sha256-170540e2a763cf53611fe62d27cfded951012cd0bb6bffbf1c4c1eb16da08195/

      Logs from computes:
      compute-zwjjmcmr-0.utility/
      compute-zwjjmcmr-1.utility/
      compute-zwjjmcmr-2.utility/

              chjones@redhat.com Chris Jones
              sathlang@redhat.com Sofer Athlan Guyot
              rhos-dfg-ospk8s