- Bug
- Resolution: Done-Errata
- Critical
Seen in uni-gamma[1] but could happen in other scenarios as well if the underlying condition is triggered.
Tests fail as:
2024-08-07 06:25:54,472 44 INFO [tempest.lib.common.rest_client] Request (FloatingIPPortDetailsTest:test_floatingip_port_details): 503 GET https://neutron-public-openstack.apps.ocp.openstack.lab/v2.0/ports/a3b60cf3-1578-4626-b670-e60cdb21edd5 0.133s
2024-08-07 06:25:54,473 44 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Wed, 07 Aug 2024 06:25:54 GMT', 'server': 'Apache', 'content-length': '299', 'content-type': 'text/html; charset=iso-8859-1', 'set-cookie': '38cec1d9c789c59243203143f8ef839e=c7881776180e421d9393c22bc7768f48; path=/; HttpOnly; Secure; SameSite=None', 'connection': 'close', 'status': '503', 'content-location': 'https://neutron-public-openstack.apps.ocp.openstack.lab/v2.0/ports/a3b60cf3-1578-4626-b670-e60cdb21edd5'}
        Body: b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>503 Service Unavailable</title>\n</head><body>\n<h1>Service Unavailable</h1>\n<p>The server is temporarily unable to service your\nrequest due to maintenance downtime or capacity\nproblems. Please try again later.</p>\n</body></html>\n'
OR
2024-08-07 06:24:45,926 42 INFO [tempest.lib.common.rest_client] Request (StatelessNetworkSecGroupIPv4Test:test_ip_prefix_negative): 502 POST https://neutron-public-openstack.apps.ocp.openstack.lab/v2.0/security-group-rules 0.788s
2024-08-07 06:24:45,927 42 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"security_group_rule": {"protocol": "tcp", "port_range_min": 22, "port_range_max": 22, "ethertype": "IPv4", "remote_ip_prefix": "0.0.0.0/0", "direction": "ingress", "security_group_id": "dc08226f-e745-4a7a-b5c7-e6e2e67fc98a"}}
    Response - Headers: {'date': 'Wed, 07 Aug 2024 06:24:45 GMT', 'server': 'Apache', 'content-length': '341', 'content-type': 'text/html; charset=iso-8859-1', 'set-cookie': '38cec1d9c789c59243203143f8ef839e=c7881776180e421d9393c22bc7768f48; path=/; HttpOnly; Secure; SameSite=None', 'connection': 'close', 'status': '502', 'content-location': 'https://neutron-public-openstack.apps.ocp.openstack.lab/v2.0/security-group-rules'}
        Body: b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>502 Proxy Error</title>\n</head><body>\n<h1>Proxy Error</h1>\n<p>The proxy server received an invalid\r\nresponse from an upstream server.<br />\r\nThe proxy server could not handle the request<p>Reason: <strong>Error reading from remote server</strong></p></p>\n</body></html>\n'
Two tests failed on neutron API calls with 502 Proxy Error and 503 Service Unavailable, and at that time the neutron pod on master-1 restarted due to readiness/liveness probe failures.
The same happened to a couple of other pods during the same period, and no neutron pod logs are available from that time, so something environmental happened at that point:
$ oc get pods -o wide|grep ago
barbican-api-api-656c78c5dd-xpmks 2/2 Running 5 (3h45m ago) 8h 192.168.20.111 master-1 <none> <none>
cinder-c20a1-api-0 2/2 Running 2 (3h45m ago) 8h 192.168.20.135 master-1 <none> <none>
cinder-c20a1-backup-0 2/2 Running 1 (3h46m ago) 8h 192.168.20.150 master-1 <none> <none>
manila-scheduler-0 2/2 Running 1 (3h45m ago) 8h 192.168.20.162 master-1 <none> <none>
manila-share-share1-0 2/2 Running 1 (3h45m ago) 8h 192.168.20.160 master-1 <none> <none>
neutron-7889f8d466-gh455 2/2 Running 3 (3h45m ago) 8h 192.168.20.119 master-1 <none> <none>
nova-metadata-1 2/2 Running 2 (3h46m ago) 8h 192.168.20.145 master-1 <none> <none>
test-operator-controller-manager-869f857c6f-vr6zn 2/2 Running 1 (3h46m ago) 7h48m 192.168.20.171 master-1 <none> <none>
Strangely, the restarts happened only to pods running on the master-1 node (likely related to slowness on the node and its resource utilization at that point in time):
[zuul@controller-0 ~]$ oc get pods -A -o wide|grep master-1 -c
144
[zuul@controller-0 ~]$ oc get pods -A -o wide|grep master-2 -c
130
[zuul@controller-0 ~]$ oc get pods -A -o wide|grep master-0 -c
119
[zuul@controller-0 ~]$ oc adm top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
master-0 3732m 39% 23358Mi 81%
master-1 4880m 51% 21714Mi 75%
master-2 6871m 72% 21491Mi 74%
From the kubelet logs for the failing neutron pod:
Aug 07 06:24:14 master-1 kubenswrapper[3452]: I0807 06:24:03.581899 3452 prober.go:107] "Probe failed" probeType="Liveness" pod="openstack/neutron-7889f8d466-gh455" podUID="0c6882c6-b32c-488c-8e9a-826de0042f9c" containerName="neutron-httpd" probeResult="failure" output="Get \"https://192.168.20.119:9696/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Aug 07 06:24:16 master-1 kubenswrapper[3452]: I0807 06:24:08.828844 3452 prober.go:107] "Probe failed" probeType="Liveness" pod="openstack/neutron-7889f8d466-gh455" podUID="0c6882c6-b32c-488c-8e9a-826de0042f9c" containerName="neutron-httpd" probeResult="failure" output="Get \"https://192.168.20.119:9696/\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:13.933527 3452 prober.go:107] "Probe failed" probeType="Liveness" pod="openstack/neutron-7889f8d466-gh455" podUID="0c6882c6-b32c-488c-8e9a-826de0042f9c" containerName="neutron-api" probeResult="failure" output="Get \"https://192.168.20.119:9696/\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:13.935124 3452 prober.go:107] "Probe failed" probeType="Readiness" pod="openstack/neutron-7889f8d466-gh455" podUID="0c6882c6-b32c-488c-8e9a-826de0042f9c" containerName="neutron-api" probeResult="failure" output="Get \"https://192.168.20.119:9696/\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:13.990118 3452 kubelet.go:2533] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="openstack/neutron-7889f8d466-gh455"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:13.997244 3452 kubelet.go:2533] "SyncLoop (probe)" probe="readiness" status="" pod="openstack/neutron-7889f8d466-gh455"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:13.997329 3452 kubelet.go:2533] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="openstack/neutron-7889f8d466-gh455"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:14.095117 3452 kuberuntime_manager.go:1029] "Message for Container of pod" containerName="neutron-api" containerStatusID={"Type":"cri-o","ID":"ef26fd3dd1a4fa5db1ef29aa73b4960b026e2420409a3391eb59eb99cf8e29e7"} pod="openstack/neutron-7889f8d466-gh455" containerMessage="Container neutron-api failed liveness probe, will be restarted"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:14.104705 3452 kuberuntime_manager.go:1029] "Message for Container of pod" containerName="neutron-httpd" containerStatusID={"Type":"cri-o","ID":"4ce1b0cc23a13051409d52f81cc86c04272b3d65346117611ac7fd00ca35764e"} pod="openstack/neutron-7889f8d466-gh455" containerMessage="Container neutron-httpd failed liveness probe, will be restarted"
Aug 07 06:24:17 master-1 kubenswrapper[3452]: I0807 06:24:14.105843 3452 kuberuntime_container.go:770] "Killing container with a grace period" pod="openstack/neutron-7889f8d466-gh455" podUID="0c6882c6-b32c-488c-8e9a-826de0042f9c" containerName="neutron-api" containerID="cri-o://ef26fd3dd1a4fa5db1ef29aa73b4960b026e2420409a3391eb59eb99cf8e29e7" gracePeriod=30
Last successful probe request as per the neutron-httpd log before the failed probe messages:
2024-08-07T06:23:53.809968490+00:00 stdout F 192.168.20.2 - - [07/Aug/2024:06:23:53 +0000] "GET / HTTP/1.1" 200 123 "-" "kube-probe/1.29"
Something more general also looks wrong: IIUC, the node is processing everything around 14+ seconds late (note the ~14s gap between the journal timestamps and the kubelet timestamps below), and the same slowness likely applies to the slow neutron requests below:
Aug 07 06:24:13 master-1 kubenswrapper[3452]: I0807 06:23:59.413919 3452 kubelet.go:2429] "SyncLoop ADD" source="api" pods=["openshift-console/console-7b78fdcff4-75ww6"]
Aug 07 06:24:13 master-1 kubenswrapper[3452]: I0807 06:23:59.542084 3452 topology_manager.go:215] "Topology Admit Handler" podUID="19a7e938-a9e2-4b33-83c6-4d39704a7171" podNamespace="openshift-console" podName="console-7b78fdcff4-75ww6"
Aug 07 06:24:13 master-1 kubenswrapper[3452]: E0807 06:23:59.563991 3452 cpu_manager.go:396] "RemoveStaleState: removing container" podUID="f7a8723c-d122-473f-a9d9-1454cae70ecb" containerName="console"
Aug 07 06:24:13 master-1 kubenswrapper[3452]: I0807 06:23:59.564088 3452 state_mem.go:107] "Deleted CPUSet assignment" podUID="f7a8723c-d122-473f-a9d9-1454cae70ecb" containerName="console"
Aug 07 06:24:13 master-1 kubenswrapper[3452]: I0807 06:23:59.580832 3452 memory_manager.go:354] "RemoveStaleState removing state" podUID="f7a8723c-d122-473f-a9d9-1454cae70ecb" containerName="console"
Aug 07 06:24:13 master-1 kubenswrapper[3452]: I0807 06:23:59.822865 3452 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"trusted-ca-bundle\" (UniqueName: \"kubernetes.io/configmap/19a7e938-a9e2-4b33-83c6-4d39704a7171-trusted-ca-bundle\") pod \"console-7b78fdcff4-75ww6\" (UID: \"19a7e938-a9e2-4b33-83c6-4d39704a7171\") " pod="openshift-console/console-7b78fdcff4-75ww6"
Also seeing some long-running requests around that time; there could be others as well, but neutron-api logs are not available from that window, so this couldn't be confirmed:
06:24:25.963 https://neutron-public-openstack.apps.ocp.openstack.lab/v2.0/security-groups/3a6f4813-1da8-45d3-a1fc-c09a6980886f 29.520s
By default we configure 2 API workers, and if both are busy processing requests, the HTTP probes can fail (they are configured with periodSeconds=5, timeoutSeconds=5, failureThreshold=3) and lead to container restarts.
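For reference, a probe block matching the settings described above would look roughly like the following; the path/port/scheme come from the kubelet log above, while the exact field layout is an illustrative sketch rather than the operator's actual template:

livenessProbe:
  httpGet:
    path: /
    port: 9696
    scheme: HTTPS
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3   # three consecutive 5s timeouts (~15-30s of busy workers) restart the container
readinessProbe:
  httpGet:
    path: /
    port: 9696
    scheme: HTTPS
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 3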
So it seems we need to improve these probes to avoid unnecessary restarts when the neutron workers are busy.
Also, we run the same HTTP probe (port 9696) for both the neutron-api and neutron-httpd containers; we likely need it only for neutron-httpd, as that container proxies the requests to neutron-api (see the sketch below).
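A minimal sketch of that change, with HTTP probes only on the proxy container (container names are taken from the kubelet log above; the rest of the pod spec is omitted):

containers:
- name: neutron-httpd
  # probe only the TLS proxy that fronts neutron-api
  livenessProbe:
    httpGet: {path: /, port: 9696, scheme: HTTPS}
  readinessProbe:
    httpGet: {path: /, port: 9696, scheme: HTTPS}
- name: neutron-api
  # no HTTP probe here; requests reach this container only through neutron-httpd,
  # so probing port 9696 twice per period just adds load on already busy workers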
We may also have to check whether to increase the probe timeouts for neutron; I see other services such as nova/glance/placement running with higher probe timeouts (periodSeconds=30, timeoutSeconds=30).
Also, in the same env the neutron pod took ~36s to start, which likewise resulted in 3 probe failures and led to a container restart, so slow environments also need to be considered (see the sketch below).
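One possible shape for the relaxed probes, mirroring the periodSeconds=30/timeoutSeconds=30 used by the other services; the startupProbe is an extra option (not part of the current config) to absorb the ~36s startup without loosening the liveness check, and all values are illustrative rather than verified defaults:

livenessProbe:
  httpGet:
    path: /
    port: 9696
    scheme: HTTPS
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /
    port: 9696
    scheme: HTTPS
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /
    port: 9696
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 12   # allows up to ~120s for a slow start before liveness checks take over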
Other observations:
- In total there were 3 restarts of the neutron pod on master-1, and the logs of the 1st restart were not available (checked under /var/log/pods on the master-1 node). So we need to check from a product perspective how this will be handled on a customer setup: is some centralized logging solution enabled by default?
- For the CI case, increasing the api_workers count should also help with the current set of probe configurations until the fixes in neutron-operator land and are available for CI (see the sketch below).
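A rough sketch of how that could be wired for CI, assuming the neutron service is tuned through the OpenStackControlPlane CR and that customServiceConfig is passed through to neutron.conf; the API group, field paths, and worker count below are assumptions, not values verified in this environment:

apiVersion: core.openstack.org/v1beta1
kind: OpenStackControlPlane
metadata:
  name: openstack
spec:
  neutron:
    template:
      customServiceConfig: |
        [DEFAULT]
        # extra API workers so a burst of slow requests does not leave
        # zero free workers to answer the kube-probe requests
        api_workers = 4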
- is blocked by: OSPRH-10039 Openstack edpm update failed on telemetry node_export update task (Closed)
- links to: RHBA-2024:138623 Release of containers for RHOSO OpenStack Podified operator
- mentioned on