Red Hat OpenStack Services on OpenShift
OSPRH-10790

Cut in service availability during update and unable to create VM after update


Release note (Known Issue):
      .Control plane temporarily unavailable during minor update

      During the minor update to 18.0 Feature Release 1, the Red Hat OpenStack Platform control plane temporarily becomes unavailable. API requests might fail with HTTP error codes, such as error 500. Alternatively, the API requests might succeed but the underlying life cycle operation fails. For example, a virtual machine (VM) created with the `openstack server create` command during the minor update never reaches the `ACTIVE` state. The control plane outage is temporary and automatically recovers after the minor update is finished. The control plane outage does not affect the already running workload.


      Hi,

      In this uni03gamma-update-sdatko job, we test OpenStack service availability during the update by repeatedly spawning and destroying an instance on OpenStack.
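      For context, the availability probe in that job boils down to repeatedly creating and deleting a small instance and timing the result; the sketch below is not the exact job code (image, flavor and network names are placeholders), but its output would resemble the SUCCESS/FAILED timing lines quoted further down.

      # Minimal sketch of the availability probe; image/flavor/network names are placeholders.
      while true; do
          start=$(date +%s)
          if openstack server create --image cirros --flavor m1.tiny \
                  --network private --wait probe-vm > /dev/null; then
              echo "$(date) $(( $(date +%s) - start ))s SUCCESS (0)"
          else
              rc=$?
              echo "$(date) $(( $(date +%s) - start ))s FAILED (${rc})"
          fi
          openstack server delete --wait probe-vm || true
          sleep 30
      done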

       

      We observe two things during the update:

      1. First, we get HTTP 500 responses from the server for a few minutes. This has been observed in several builds.
      2. Second, we are not able to spawn an instance in less than 20 seconds after the update. This has been seen in one build.

       

      As a reference, we are able to create one instance and then it fails, as seen in the instance creation log (in this directory):

       

      Thu Oct 17 22:49:50 EDT 2024 (1729219790) 329s SUCCESS (0)
      Thu Oct 17 22:55:20 EDT 2024 (1729220120) 135s FAILED (127)
      Thu Oct 17 22:57:36 EDT 2024 (1729220256) 392s FAILED (1)
      Thu Oct 17 23:04:09 EDT 2024 (1729220649) 288s FAILED (1)
      Thu Oct 17 23:08:58 EDT 2024 (1729220938) 301s FAILED (1)

       

      Note that the logs on the compute nodes are +4h, so the first error will be at around Oct 18 02:55 there.
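      For the record, since EDT is UTC-4, the +4h offset means the compute node clocks are effectively UTC, and GNU date can do the conversion directly:

      $ date -u -d 'Thu Oct 17 22:55:20 EDT 2024'
      Fri Oct 18 02:55:20 UTC 2024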

      Here is the timeline of the steps in the update process:

       

      01 before_update-containers     2024-10-18 02:49 
      01 before_update-packages     2024-10-18 02:49 
      01 before_update-pods    2024-10-18 02:50 
      02 after_ovn_controlplane_update     2024-10-18 02:51 
      03 after_ovn_dataplane_update     2024-10-18 02:54 
      04 after_controlplane_update     2024-10-18 03:02 
      05 after_update     2024-10-18 03:13 

       

      The first 500 error is seen at that time (in this file):

       

      + openstack compute service list --service nova-compute -f value -c State
      I1017 22:55:24.514599   64273 log.go:245] (0xc0006f80b0) Data frame received for 7
      I1017 22:55:24.514634   64273 log.go:245] (0xc001149d60) (7) Data frame handling
      I1017 22:55:24.514650   64273 log.go:245] (0xc001149d60) (7) Data frame sent
      Internal Server Error (HTTP 500)

       

      So 02:55:24 on the compute nodes, after ovn_dataplane_update and during controlplane_update.

      It is associated with this error message (from logs/controller-0/ci-framework-data/logs/compute-zwjjmcmr-1.utility/log/messages):

      Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: 2024-10-18 02:55:05.969 2 DEBUG neutronclient.v2_0.client [None req-48ebd31b-210d-4306-adf5-adf5020e4910 - - - - - -] Error message: {"message": "The server is currently unavailable. Please try again at a later time.<br /><br />\nThe Keystone service is temporarily unavailable.\n\n", "code": "503 Service Unavailable", "title": "Service Unavailable"} _handle_fault_response /usr/lib/python3.9/site-packages/neutronclient/v2_0/client.py:262#033[00m
      Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: 2024-10-18 02:55:05.970 2 ERROR nova.compute.manager [None req-48ebd31b-210d-4306-adf5-adf5020e4910 - - - - - -] [instance: 8a844f0e-eb9d-4028-8c61-a5574d854d59] An error occurred while refreshing the network cache.: neutronclient.common.exceptions.ServiceUnavailable: The server is currently unavailable. Please try again at a later time.<br /><br />
      Oct 18 02:55:05 compute-zwjjmcmr-1 nova_compute[94588]: The Keystone service is temporarily unavailable.
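      To find all such occurrences, a plain grep over the compute messages file quoted above is enough:

      $ grep -E 'ServiceUnavailable|Service Unavailable' \
            logs/controller-0/ci-framework-data/logs/compute-zwjjmcmr-1.utility/log/messages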

       

      Then the service recovers there, at around 02:59:01:

       

      + openstack compute service list --service nova-compute -f value -c State
      I1017 22:58:56.574853   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      I1017 22:58:56.581468   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 22:58:56.581486   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      I1017 22:58:56.581493   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      + grep -q down
      I1017 22:59:01.478058   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 22:59:01.478087   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      I1017 22:59:01.478125   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      Internal Server Error (HTTP 500)
      I1017 22:59:01.850213   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 22:59:01.850239   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      I1017 22:59:01.850249   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      + openstack network agent list -f value -c Alive
      + grep -q false
      I1017 22:59:08.387804   64464 log.go:245] (0xc000948000) Data frame received for 5
      I1017 22:59:08.387830   64464 log.go:245] (0xc000a312c0) (5) Data frame handling
      All compute and networking services are up and running
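      For reference, the health check seen in that output amounts to something like the sketch below (not the exact job code; an HTTP 500 from the API itself would also need to be treated as "not healthy yet"):

      # Poll until no nova-compute service is down and no neutron agent reports Alive=false.
      while openstack compute service list --service nova-compute -f value -c State | grep -q down \
            || openstack network agent list -f value -c Alive | grep -q false; do
          sleep 10
      done
      echo "All compute and networking services are up and running"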

       

      But from there on, we are unable to create a new instance in less than 20 seconds, so from around 03:01 onwards:

       

      I1017 23:01:36.961721   64464 log.go:245] (0xc00060e1e0) (7) Data frame sent
      + TENANT_NET_ID=7a1eb4ce-a86a-4731-a252-d58f5b5a0131
      + echo 'Creating overcloud instance instance_dcaf13ec78'
      + os_cmd server create --image upgrade_workload_dcaf13ec78 --flavor v1-512M-10G-dcaf13ec78 --security-group allow-icmp-ssh-dcaf13ec78 --key-name userkey_dcaf13ec78 --nic net-id=7a1eb4ce-a86a-4731-a252-d58f5b5a0131 instance_dcaf13ec78
      ....
      I1017 23:03:02.770058   64464 log.go:245] (0xc000948000) Data frame received for 7
      I1017 23:03:02.770066   64464 log.go:245] (0xc00060e1e0) (7) Data frame handling
      + INSTANCE_STATUS=ERROR
      + case "${INSTANCE_STATUS}" in
      + echo 'instance_dcaf13ec78 failed'
      + exit 1
      + cleanup_on_exit
      

       

      The instance ID is 6c822b56-e497-435c-86c6-f30cb84fae85, but I could not find any relevant log entries around it in the "messages" files.
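      In case it helps the investigation, the fault recorded on the instance itself and a search for the instance ID across the must-gather logs might say more than the compute "messages" files, for example:

      $ openstack server show 6c822b56-e497-435c-86c6-f30cb84fae85 -f value -c fault
      $ grep -r 6c822b56-e497-435c-86c6-f30cb84fae85 logs/controller-0/ci-framework-data/logs/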

       

       

      Must-gather logs: https://sf.hosted.upshift.rdu2.redhat.com/logs/42/742/9674b907727c7f456ce0425dbccfc6936832e35a/check-gitlab-cee/uni03gamma-update-sdatko/79f879a/logs/controller-0/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/registry-proxy-engineering-redhat-com-rh-osbs-rhoso-podified-trunk-openstack-must-gather-rhel9-sha256-170540e2a763cf53611fe62d27cfded951012cd0bb6bffbf1c4c1eb16da08195/

      Logs from computes:
      compute-zwjjmcmr-0.utility/
      compute-zwjjmcmr-1.utility/
      compute-zwjjmcmr-2.utility/

              chjones@redhat.com Chris Jones
              sathlang@redhat.com Sofer Athlan Guyot
              rhos-dfg-ospk8s