- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.17
- None
- Rejected
- False
Investigating kube-apiserver disruption for metal and vsphere jobs shows that during node upgrades / reboots, as the haproxy container is being shut down with a grace period, the kubenswrapper streamwatcher frequently reports `Unexpected EOF during watch stream event decoding: unexpected EOF` across multiple master nodes at the same time.
This leads to an inability to communicate with the kube-apiserver and measured disruption for metal and vsphere jobs. Reviewing the 'Percentage of Non-Zero Disruption' graph shows these jobs record disruption close to 100% of the time for micro upgrades.
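For context, the `Unexpected EOF during watch stream event decoding` message is emitted by the watch-stream decoder that kubelet's reflectors (and any client-go consumer) rely on when a watch connection is cut mid-stream. Below is a minimal sketch of such a watch; the kubeconfig path, namespace, and resource are illustrative assumptions, not values taken from the affected clusters.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path; in the affected jobs the server would be the
	// api-int / api VIP that haproxy fronts.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Open a long-lived watch, similar to the kubelet reflectors seen in the journal.
	w, err := client.CoreV1().ConfigMaps("openshift-kni-infra").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// If the connection is severed mid-stream (for example while haproxy is being
	// stopped), the decoder hits a truncated frame, client-go logs
	// "Unexpected EOF during watch stream event decoding", and the result channel
	// closes, forcing a re-list/re-watch against the VIP.
	for ev := range w.ResultChan() {
		fmt.Printf("event: %s %T\n", ev.Type, ev.Object)
	}
	fmt.Println("watch stream ended")
}
```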
The job periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-bm-upgrade shows kube-api-new-connections disruption for 4s:
Jul 02 14:45:29.271 - 4s E backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/bce4d8a2-bf30-4227-841b-f07418d02cd2 backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: Get "https://api.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/default": Service Unavailable
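For reference, the kube-api-new-connections backend measures requests made over fresh connections, so even a brief gap at the VIP is visible. Below is a rough sketch of that style of check; this is not the openshift-tests implementation, the endpoint is simply taken from the event above, and authentication / CA handling is omitted.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Endpoint from the disruption event above; auth and the cluster CA are
	// omitted, so this only illustrates the probing pattern, not the real check.
	url := "https://api.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/default"

	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// A fresh TCP+TLS connection per request is what distinguishes the
			// "connection/new" backend from the reused-connection one.
			DisableKeepAlives: true,
			TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // sketch only
		},
	}

	for range time.Tick(time.Second) {
		resp, err := client.Get(url)
		if err != nil {
			// Connection refused / EOF here is what gets reported as DisruptionBegan.
			fmt.Printf("%s disrupted: %v\n", time.Now().Format(time.RFC3339Nano), err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode >= 500 {
			fmt.Printf("%s disrupted: HTTP %d\n", time.Now().Format(time.RFC3339Nano), resp.StatusCode)
		}
	}
}
```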
This occurs while host2.cluster16.ocpci.eng.rdu2.redhat.com is being updated, where we see:
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420772 3436 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-kni-infra/haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy" containerID="cri-o://aed0bce542fb53551b0183d55e2653cf1a853a75e5eab5c08f4312fe67cd2e12" gracePeriod=65
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420813 3436 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-kni-infra/haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy-monitor" containerID="cri-o://58403feff7fbcc21c2d2df0168fc6534f9b9463e2a2b4073287492fe0aa80571" gracePeriod=65
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420827 3436 memory_manager.go:354] "RemoveStaleState removing state" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy-monitor"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420839 3436 memory_manager.go:354] "RemoveStaleState removing state" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="verify-api-int-resolvable"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420846 3436 memory_manager.go:354] "RemoveStaleState removing state" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.420895 3436 cpu_manager.go:396] "RemoveStaleState: removing container" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="verify-api-int-resolvable"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420904 3436 state_mem.go:107] "Deleted CPUSet assignment" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="verify-api-int-resolvable"
Jul 02 14:45:28.421254 host2.cluster16.ocpci.eng.rdu2.redhat.com crio[3369]: time="2024-07-02 14:45:28.421071873Z" level=info msg="Stopping container: aed0bce542fb53551b0183d55e2653cf1a853a75e5eab5c08f4312fe67cd2e12 (timeout: 65s)" id=cfb9410d-31a9-49cf-a6d3-a06cb652ab9b name=/runtime.v1.RuntimeService/StopContainer
Jul 02 14:45:28.421254 host2.cluster16.ocpci.eng.rdu2.redhat.com crio[3369]: time="2024-07-02 14:45:28.421087964Z" level=info msg="Stopping container: 58403feff7fbcc21c2d2df0168fc6534f9b9463e2a2b4073287492fe0aa80571 (timeout: 65s)" id=ecb3f0da-8b35-4502-8917-75568fc45a4a name=/runtime.v1.RuntimeService/StopContainer
Jul 02 14:45:28.435879 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.435755 3436 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.435879 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.435773 3436 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.436197 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.435888 3436 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
...
Jul 02 14:45:28.437223 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.436717 3436 event.go:368] "Unable to write event (may retry after sleeping)" err="Post \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-kni-infra/events\": unexpected EOF" event="&Event{ObjectMeta:{haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com.17de6c9ccd84399e openshift-kni-infra 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:openshift-kni-infra,Name:haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com,UID:c2a9f84825eaf139fe73e71af9d76ff1,APIVersion:v1,ResourceVersion:,FieldPath:spec.containers{haproxy-monitor},},Reason:Killing,Message:Stopping container haproxy-monitor,Source:EventSource{Component:kubelet,Host:host2.cluster16.ocpci.eng.rdu2.redhat.com,},FirstTimestamp:2024-07-02 14:45:28.420800926 +0000 UTC m=+7323.559846197,LastTimestamp:2024-07-02 14:45:28.420800926 +0000 UTC m=+7323.559846197,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:host2.cluster16.ocpci.eng.rdu2.redhat.com,}"
...
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441474 3436 reflector.go:150] object-"openshift-ovn-kubernetes"/"env-overrides": Failed to watch *v1.ConfigMap: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Denv-overrides&resourceVersion=82529&timeout=6m8s&timeoutSeconds=368&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441536 3436 reflector.go:150] object-"openshift-multus"/"multus-daemon-config": Failed to watch *v1.ConfigMap: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-multus/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmultus-daemon-config&resourceVersion=82633&timeout=9m21s&timeoutSeconds=561&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441622 3436 reflector.go:150] object-"openshift-monitoring"/"openshift-service-ca.crt": Failed to watch *v1.ConfigMap: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dopenshift-service-ca.crt&resourceVersion=82341&timeout=6m48s&timeoutSeconds=408&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441689 3436 reflector.go:150] object-"openshift-multus"/"metrics-daemon-sa-dockercfg-l5bdr": Failed to watch *v1.Secret: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-multus/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmetrics-daemon-sa-dockercfg-l5bdr&resourceVersion=82551&timeout=7m52s&timeoutSeconds=472&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
...
At the same time, we see similar errors on host3.cluster16.ocpci.eng.rdu2.redhat.com:
Jul 02 14:45:28.450316 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:28.444779 3069 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.450316 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:28.444797 3069 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.450316 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:28.444813 3069 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:29.001194 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.001153 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?resourceVersion=0&timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.001606 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.001591 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.001991 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.001977 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.002331 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.002315 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.002626 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.002608 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.002677 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.002626 3069 kubelet_node_status.go:581] "Unable to update node status" err="update node status exceeds retry count"
Jul 02 14:45:32.245270 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:32.245250 3069 log.go:245] http: TLS handshake error from 10.10.129.132:53494: EOF
Jul 02 14:45:33.112474 host3.cluster16.ocpci.eng.rdu2.redhat.com ovs-vswitchd[1216]: ovs|00320|connmgr|INFO|br-ex<->unix#649: 2 flow_mods in the last 0 s (1 adds, 1 deletes)
Jul 02 14:45:35.901626 host3.cluster16.ocpci.eng.rdu2.redhat.com ovs-vswitchd[1216]: ovs|00321|connmgr|INFO|br-ex<->unix#653: 3 flow_mods in the last 0 s (2 adds, 1 deletes)
Jul 02 14:45:35.918151 host3.cluster16.ocpci.eng.rdu2.redhat.com ovs-vswitchd[1216]: ovs|00322|connmgr|INFO|br-ex<->unix#656: 2 flow_mods in the last 0 s (1 adds, 1 deletes)
Jul 02 14:45:35.942138 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: W0702 14:45:35.942113 3069 reflector.go:470] object-"openshift-marketplace"/"redhat-operators-dockercfg-4vm4l": watch of *v1.Secret ended with: very short watch: object-"openshift-marketplace"/"redhat-operators-dockercfg-4vm4l": Unexpected watch close - watch lasted less than a second and no items received
Jul 02 14:45:35.943871 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: W0702 14:45:35.943847 3069 reflector.go:470] object-"openshift-authentication-operator"/"authentication-operator-dockercfg-hhtfm": watch of *v1.Secret ended with: very short watch: object-"openshift-authentication-operator"/"authentication-operator-dockercfg-hhtfm": Unexpected watch close - watch lasted less than a second and no items received
Jul 02 14:45:35.943994 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: W0702 14:45:35.943890 3069 reflector.go:470] object-"openshift-etcd-operator"/"etcd-operator-config": watch of *v1.ConfigMap ended with: very short watch: object-"openshift-etcd-operator"/"etcd-operator-config": Unexpected watch close - watch lasted less than a second and no items received
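To line the timestamps up across masters, the journals can be reduced to just the haproxy stop events and the streamwatcher EOF bursts. A small helper along these lines makes the correlation easy to eyeball; the patterns are the ones visible in the excerpts above, and feeding it `journalctl` output on stdin is an assumption about how the logs are gathered, not part of the CI tooling.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Patterns visible in the excerpts above: the haproxy shutdown on the node
	// being upgraded and the watch-stream EOF bursts on the other masters.
	patterns := []string{
		"Killing container with a grace period",
		"Stopping container:",
		"Unexpected EOF during watch stream event decoding",
	}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be long
	for sc.Scan() {
		line := sc.Text()
		for _, p := range patterns {
			if strings.Contains(line, p) {
				fmt.Println(line)
				break
			}
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "scan error:", err)
	}
}
```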
A similar pattern is seen for periodic-ci-openshift-release-master-ci-4.17-e2e-vsphere-ovn-upgrade jobs as well, with kube-api-new-connections disruption for 4s:
Jul 06 22:32:20.636 - 4s E backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/9ae8a7ad-6895-40ce-9544-b793939e818b backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: Get "https://api.ci-op-l15qyv3b-6903e.vmc-ci.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 10.5.183.2:6443: connect: connection refused
ci-op-l15qyv3b-6903e-l4mr9-master-0:
Jul 06 22:32:19.908154 ci-op-l15qyv3b-6903e-l4mr9-master-0 kubenswrapper[2708]: I0706 22:32:19.908130 2708 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-vsphere-infra/haproxy-ci-op-l15qyv3b-6903e-l4mr9-master-0" podUID="12c5243d4d3954e578b411314242c9c6" containerName="haproxy" containerID="cri-o://1ef0090fc50f9702d5bc4c5b2327d92c2a318f95840a4673d6d321663bcbfc89" gracePeriod=65
Jul 06 22:32:19.908256 ci-op-l15qyv3b-6903e-l4mr9-master-0 kubenswrapper[2708]: I0706 22:32:19.908237 2708 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-vsphere-infra/haproxy-ci-op-l15qyv3b-6903e-l4mr9-master-0" podUID="12c5243d4d3954e578b411314242c9c6" containerName="haproxy-monitor" containerID="cri-o://0eadce871a6c78451090a3d3eccf81f3621b0cc00afbb38afe51d7b4c2315be6" gracePeriod=65
Jul 06 22:32:19.908953 ci-op-l15qyv3b-6903e-l4mr9-master-0 crio[2646]: time="2024-07-06 22:32:19.908581548Z" level=info msg="Stopping container: 1ef0090fc50f9702d5bc4c5b2327d92c2a318f95840a4673d6d321663bcbfc89 (timeout: 65s)" id=e7385706-b5e8-4645-a7f2-fb76d64f6c10 name=/runtime.v1.RuntimeService/StopContainer
Jul 06 22:32:19.908953 ci-op-l15qyv3b-6903e-l4mr9-master-0 crio[2646]: time="2024-07-06 22:32:19.908581118Z" level=info msg="Stopping container: 0eadce871a6c78451090a3d3eccf81f3621b0cc00afbb38afe51d7b4c2315be6 (timeout: 65s)" id=48316ca3-34a5-48ad-bb20-ccaf7d684a84 name=/runtime.v1.RuntimeService/StopContainer
Jul 06 22:32:19.918699 ci-op-l15qyv3b-6903e-l4mr9-master-0 kubenswrapper[2708]: I0706 22:32:19.918658 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
ci-op-l15qyv3b-6903e-l4mr9-master-2:
Jul 06 22:32:19.923431 ci-op-l15qyv3b-6903e-l4mr9-master-2 kubenswrapper[2708]: I0706 22:32:19.923379 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 06 22:32:19.923868 ci-op-l15qyv3b-6903e-l4mr9-master-2 kubenswrapper[2708]: I0706 22:32:19.923641 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 06 22:32:19.923868 ci-op-l15qyv3b-6903e-l4mr9-master-2 kubenswrapper[2708]: I0706 22:32:19.923719 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF