- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.17
- None
- Rejected
- False
Investigating kube-apiserver disruption for metal and vsphere jobs shows that during node upgrades / reboots, as the haproxy container is being shut down with a grace period, the kubenswrapper streamwatcher frequently reports `Unexpected EOF during watch stream event decoding: unexpected EOF` across multiple master nodes at the same time.
This leads to an inability to communicate with the kube-apiserver and measured disruption for metal and vsphere jobs. Reviewing the 'Percentage of Non-Zero Disruption' graph shows these jobs record disruption close to 100% of the time for micro upgrades.
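For context, the `Unexpected EOF during watch stream event decoding` message is emitted by the watch-stream decoder that kubelet's reflectors (and any client-go consumer) rely on when a watch connection is cut mid-stream. Below is a minimal sketch of such a watch; the kubeconfig path, namespace, and resource are illustrative assumptions, not values taken from the affected clusters.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path; in the affected jobs the server would be the
	// api-int / api VIP that haproxy fronts.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Open a long-lived watch, similar to the kubelet reflectors seen in the journal.
	w, err := client.CoreV1().ConfigMaps("openshift-kni-infra").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// If the connection is severed mid-stream (for example while haproxy is being
	// stopped), the decoder hits a truncated frame, client-go logs
	// "Unexpected EOF during watch stream event decoding", and the result channel
	// closes, forcing a re-list/re-watch against the VIP.
	for ev := range w.ResultChan() {
		fmt.Printf("event: %s %T\n", ev.Type, ev.Object)
	}
	fmt.Println("watch stream ended")
}
```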
The job periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-bm-upgrade shows kube-api-new-connections disruption for 4s:
Jul 02 14:45:29.271 - 4s E backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/bce4d8a2-bf30-4227-841b-f07418d02cd2 backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: Get "https://api.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/default": Service Unavailable
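For reference, the kube-api-new-connections backend measures requests made over fresh connections, so even a brief gap at the VIP is visible. Below is a rough sketch of that style of check; this is not the openshift-tests implementation, the endpoint is simply taken from the event above, and authentication / CA handling is omitted.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Endpoint from the disruption event above; auth and the cluster CA are
	// omitted, so this only illustrates the probing pattern, not the real check.
	url := "https://api.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/default"

	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// A fresh TCP+TLS connection per request is what distinguishes the
			// "connection/new" backend from the reused-connection one.
			DisableKeepAlives: true,
			TLSClientConfig:   &tls.Config{InsecureSkipVerify: true}, // sketch only
		},
	}

	for range time.Tick(time.Second) {
		resp, err := client.Get(url)
		if err != nil {
			// Connection refused / EOF here is what gets reported as DisruptionBegan.
			fmt.Printf("%s disrupted: %v\n", time.Now().Format(time.RFC3339Nano), err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode >= 500 {
			fmt.Printf("%s disrupted: HTTP %d\n", time.Now().Format(time.RFC3339Nano), resp.StatusCode)
		}
	}
}
```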
This occurs while host2.cluster16.ocpci.eng.rdu2.redhat.com is being updated, where we see:
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420772 3436 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-kni-infra/haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy" containerID="cri-o://aed0bce542fb53551b0183d55e2653cf1a853a75e5eab5c08f4312fe67cd2e12" gracePeriod=65
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420813 3436 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-kni-infra/haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy-monitor" containerID="cri-o://58403feff7fbcc21c2d2df0168fc6534f9b9463e2a2b4073287492fe0aa80571" gracePeriod=65
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420827 3436 memory_manager.go:354] "RemoveStaleState removing state" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy-monitor"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420839 3436 memory_manager.go:354] "RemoveStaleState removing state" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="verify-api-int-resolvable"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420846 3436 memory_manager.go:354] "RemoveStaleState removing state" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="haproxy"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.420895 3436 cpu_manager.go:396] "RemoveStaleState: removing container" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="verify-api-int-resolvable"
Jul 02 14:45:28.420990 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.420904 3436 state_mem.go:107] "Deleted CPUSet assignment" podUID="c2a9f84825eaf139fe73e71af9d76ff1" containerName="verify-api-int-resolvable"
Jul 02 14:45:28.421254 host2.cluster16.ocpci.eng.rdu2.redhat.com crio[3369]: time="2024-07-02 14:45:28.421071873Z" level=info msg="Stopping container: aed0bce542fb53551b0183d55e2653cf1a853a75e5eab5c08f4312fe67cd2e12 (timeout: 65s)" id=cfb9410d-31a9-49cf-a6d3-a06cb652ab9b name=/runtime.v1.RuntimeService/StopContainer
Jul 02 14:45:28.421254 host2.cluster16.ocpci.eng.rdu2.redhat.com crio[3369]: time="2024-07-02 14:45:28.421087964Z" level=info msg="Stopping container: 58403feff7fbcc21c2d2df0168fc6534f9b9463e2a2b4073287492fe0aa80571 (timeout: 65s)" id=ecb3f0da-8b35-4502-8917-75568fc45a4a name=/runtime.v1.RuntimeService/StopContainer
Jul 02 14:45:28.435879 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.435755 3436 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.435879 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.435773 3436 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.436197 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: I0702 14:45:28.435888 3436 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
...
Jul 02 14:45:28.437223 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.436717 3436 event.go:368] "Unable to write event (may retry after sleeping)" err="Post \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-kni-infra/events\": unexpected EOF" event="&Event{ObjectMeta:{haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com.17de6c9ccd84399e openshift-kni-infra 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:openshift-kni-infra,Name:haproxy-host2.cluster16.ocpci.eng.rdu2.redhat.com,UID:c2a9f84825eaf139fe73e71af9d76ff1,APIVersion:v1,ResourceVersion:,FieldPath:spec.containers{haproxy-monitor},},Reason:Killing,Message:Stopping container haproxy-monitor,Source:EventSource{Component:kubelet,Host:host2.cluster16.ocpci.eng.rdu2.redhat.com,},FirstTimestamp:2024-07-02 14:45:28.420800926 +0000 UTC m=+7323.559846197,LastTimestamp:2024-07-02 14:45:28.420800926 +0000 UTC m=+7323.559846197,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:kubelet,ReportingInstance:host2.cluster16.ocpci.eng.rdu2.redhat.com,}"
...
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441474 3436 reflector.go:150] object-"openshift-ovn-kubernetes"/"env-overrides": Failed to watch *v1.ConfigMap: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-ovn-kubernetes/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Denv-overrides&resourceVersion=82529&timeout=6m8s&timeoutSeconds=368&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441536 3436 reflector.go:150] object-"openshift-multus"/"multus-daemon-config": Failed to watch *v1.ConfigMap: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-multus/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmultus-daemon-config&resourceVersion=82633&timeout=9m21s&timeoutSeconds=561&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441622 3436 reflector.go:150] object-"openshift-monitoring"/"openshift-service-ca.crt": Failed to watch *v1.ConfigMap: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-monitoring/configmaps?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dopenshift-service-ca.crt&resourceVersion=82341&timeout=6m48s&timeoutSeconds=408&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
Jul 02 14:45:28.443402 host2.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3436]: E0702 14:45:28.441689 3436 reflector.go:150] object-"openshift-multus"/"metrics-daemon-sa-dockercfg-l5bdr": Failed to watch *v1.Secret: Get "https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/namespaces/openshift-multus/secrets?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dmetrics-daemon-sa-dockercfg-l5bdr&resourceVersion=82551&timeout=7m52s&timeoutSeconds=472&watch=true": dial tcp: lookup api-int.cluster16.ocpci.eng.rdu2.redhat.com on 10.11.5.160:53: no such host
...
At the same time, we see similar errors on host3.cluster16.ocpci.eng.rdu2.redhat.com:
Jul 02 14:45:28.450316 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:28.444779 3069 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.450316 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:28.444797 3069 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:28.450316 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:28.444813 3069 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 02 14:45:29.001194 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.001153 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?resourceVersion=0&timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.001606 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.001591 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.001991 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.001977 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.002331 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.002315 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.002626 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.002608 3069 kubelet_node_status.go:594] "Error updating node status, will retry" err="error getting node \"host3.cluster16.ocpci.eng.rdu2.redhat.com\": Get \"https://api-int.cluster16.ocpci.eng.rdu2.redhat.com:6443/api/v1/nodes/host3.cluster16.ocpci.eng.rdu2.redhat.com?timeout=10s\": dial tcp 10.10.129.136:6443: connect: connection refused"
Jul 02 14:45:29.002677 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: E0702 14:45:29.002626 3069 kubelet_node_status.go:581] "Unable to update node status" err="update node status exceeds retry count"
Jul 02 14:45:32.245270 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: I0702 14:45:32.245250 3069 log.go:245] http: TLS handshake error from 10.10.129.132:53494: EOF
Jul 02 14:45:33.112474 host3.cluster16.ocpci.eng.rdu2.redhat.com ovs-vswitchd[1216]: ovs|00320|connmgr|INFO|br-ex<->unix#649: 2 flow_mods in the last 0 s (1 adds, 1 deletes)
Jul 02 14:45:35.901626 host3.cluster16.ocpci.eng.rdu2.redhat.com ovs-vswitchd[1216]: ovs|00321|connmgr|INFO|br-ex<->unix#653: 3 flow_mods in the last 0 s (2 adds, 1 deletes)
Jul 02 14:45:35.918151 host3.cluster16.ocpci.eng.rdu2.redhat.com ovs-vswitchd[1216]: ovs|00322|connmgr|INFO|br-ex<->unix#656: 2 flow_mods in the last 0 s (1 adds, 1 deletes)
Jul 02 14:45:35.942138 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: W0702 14:45:35.942113 3069 reflector.go:470] object-"openshift-marketplace"/"redhat-operators-dockercfg-4vm4l": watch of *v1.Secret ended with: very short watch: object-"openshift-marketplace"/"redhat-operators-dockercfg-4vm4l": Unexpected watch close - watch lasted less than a second and no items received
Jul 02 14:45:35.943871 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: W0702 14:45:35.943847 3069 reflector.go:470] object-"openshift-authentication-operator"/"authentication-operator-dockercfg-hhtfm": watch of *v1.Secret ended with: very short watch: object-"openshift-authentication-operator"/"authentication-operator-dockercfg-hhtfm": Unexpected watch close - watch lasted less than a second and no items received
Jul 02 14:45:35.943994 host3.cluster16.ocpci.eng.rdu2.redhat.com kubenswrapper[3069]: W0702 14:45:35.943890 3069 reflector.go:470] object-"openshift-etcd-operator"/"etcd-operator-config": watch of *v1.ConfigMap ended with: very short watch: object-"openshift-etcd-operator"/"etcd-operator-config": Unexpected watch close - watch lasted less than a second and no items received
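To line the timestamps up across masters, the journals can be reduced to just the haproxy stop events and the streamwatcher EOF bursts. A small helper along these lines makes the correlation easy to eyeball; the patterns are the ones visible in the excerpts above, and feeding it `journalctl` output on stdin is an assumption about how the logs are gathered, not part of the CI tooling.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Patterns visible in the excerpts above: the haproxy shutdown on the node
	// being upgraded and the watch-stream EOF bursts on the other masters.
	patterns := []string{
		"Killing container with a grace period",
		"Stopping container:",
		"Unexpected EOF during watch stream event decoding",
	}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // journal lines can be long
	for sc.Scan() {
		line := sc.Text()
		for _, p := range patterns {
			if strings.Contains(line, p) {
				fmt.Println(line)
				break
			}
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "scan error:", err)
	}
}
```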
A similar pattern is seen for periodic-ci-openshift-release-master-ci-4.17-e2e-vsphere-ovn-upgrade jobs as well, with kube-api-new-connections disruption for 4s:
Jul 06 22:32:20.636 - 4s E backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/9ae8a7ad-6895-40ce-9544-b793939e818b backend-disruption-name/kube-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: Get "https://api.ci-op-l15qyv3b-6903e.vmc-ci.devcluster.openshift.com:6443/api/v1/namespaces/default": dial tcp 10.5.183.2:6443: connect: connection refused
ci-op-l15qyv3b-6903e-l4mr9-master-0:
Jul 06 22:32:19.908154 ci-op-l15qyv3b-6903e-l4mr9-master-0 kubenswrapper[2708]: I0706 22:32:19.908130 2708 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-vsphere-infra/haproxy-ci-op-l15qyv3b-6903e-l4mr9-master-0" podUID="12c5243d4d3954e578b411314242c9c6" containerName="haproxy" containerID="cri-o://1ef0090fc50f9702d5bc4c5b2327d92c2a318f95840a4673d6d321663bcbfc89" gracePeriod=65
Jul 06 22:32:19.908256 ci-op-l15qyv3b-6903e-l4mr9-master-0 kubenswrapper[2708]: I0706 22:32:19.908237 2708 kuberuntime_container.go:779] "Killing container with a grace period" pod="openshift-vsphere-infra/haproxy-ci-op-l15qyv3b-6903e-l4mr9-master-0" podUID="12c5243d4d3954e578b411314242c9c6" containerName="haproxy-monitor" containerID="cri-o://0eadce871a6c78451090a3d3eccf81f3621b0cc00afbb38afe51d7b4c2315be6" gracePeriod=65
Jul 06 22:32:19.908953 ci-op-l15qyv3b-6903e-l4mr9-master-0 crio[2646]: time="2024-07-06 22:32:19.908581548Z" level=info msg="Stopping container: 1ef0090fc50f9702d5bc4c5b2327d92c2a318f95840a4673d6d321663bcbfc89 (timeout: 65s)" id=e7385706-b5e8-4645-a7f2-fb76d64f6c10 name=/runtime.v1.RuntimeService/StopContainer
Jul 06 22:32:19.908953 ci-op-l15qyv3b-6903e-l4mr9-master-0 crio[2646]: time="2024-07-06 22:32:19.908581118Z" level=info msg="Stopping container: 0eadce871a6c78451090a3d3eccf81f3621b0cc00afbb38afe51d7b4c2315be6 (timeout: 65s)" id=48316ca3-34a5-48ad-bb20-ccaf7d684a84 name=/runtime.v1.RuntimeService/StopContainer
Jul 06 22:32:19.918699 ci-op-l15qyv3b-6903e-l4mr9-master-0 kubenswrapper[2708]: I0706 22:32:19.918658 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
ci-op-l15qyv3b-6903e-l4mr9-master-2:
Jul 06 22:32:19.923431 ci-op-l15qyv3b-6903e-l4mr9-master-2 kubenswrapper[2708]: I0706 22:32:19.923379 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 06 22:32:19.923868 ci-op-l15qyv3b-6903e-l4mr9-master-2 kubenswrapper[2708]: I0706 22:32:19.923641 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF
Jul 06 22:32:19.923868 ci-op-l15qyv3b-6903e-l4mr9-master-2 kubenswrapper[2708]: I0706 22:32:19.923719 2708 streamwatcher.go:111] Unexpected EOF during watch stream event decoding: unexpected EOF