-
Story
-
Resolution: Done
-
Normal
-
In TRT-1514 we found an issue where conformance disruption monitors are reporting disruption for service endpoints created by the upgrade monitor tests, as noted here.
That investigation was followed up by looking at another run with similar symptoms.
Reviewing resource-watch, we can observe when the services were created and the values they recorded:
git log | grep -B5 "services/pod-network-service"

commit c43b533b59edc542e4f5dd8e9b37ca21a90dfb23
Author: unknown <ci-monitor@openshift.io>
Date:   Wed Feb 21 05:17:58 2024 +0000

    removed services/pod-network-service -n e2e-pod-network-disruption-test-k92nt
--
commit 96678a29585e1c4b35e34aa898dbab2d27f16d8e
Author: openshift-tests <ci-monitor@openshift.io>
Date:   Wed Feb 21 04:23:45 2024 +0000

    added services/pod-network-service -n e2e-pod-network-disruption-test-k92nt
--
commit 0fb6688ece33c61a1ed7c7f851f5face24a134e5
Author: unknown <ci-monitor@openshift.io>
Date:   Wed Feb 21 04:23:12 2024 +0000

    removed services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
--
commit 08faa9e4e8419275be3281d8f821666a990f9788
Author: openshift-tests <ci-monitor@openshift.io>
Date:   Wed Feb 21 03:13:20 2024 +0000

    added services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
git checkout 08faa9e4e8419275be3281d8f821666a990f9788
HEAD is now at 08faa9e4e8 added services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
cat namespaces/e2e-pod-network-disruption-test-vv2x4/core/services/pod-network-service.yaml
  name: pod-network-service
  namespace: e2e-pod-network-disruption-test-vv2x4
  resourceVersion: "47260"
  uid: e726c46d-562c-4bab-b813-d4afeb857e0b
spec:
  clusterIP: 172.30.136.5
  clusterIPs:
  - 172.30.136.5
git checkout 96678a29585e1c4b35e34aa898dbab2d27f16d8e
Previous HEAD position was 08faa9e4e8 added services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
HEAD is now at 96678a2958 added services/pod-network-service -n e2e-pod-network-disruption-test-k92nt
cat namespaces/e2e-pod-network-disruption-test-k92nt/core/services/pod-network-service.yaml
  name: pod-network-service
  namespace: e2e-pod-network-disruption-test-k92nt
  resourceVersion: "105532"
  uid: 4ec3aaaa-7ea6-4745-9d66-1f64bd1e8a3a
spec:
  clusterIP: 172.30.19.181
  clusterIPs:
  - 172.30.19.181
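As a quicker way to walk that history, something like the following can be used (a rough sketch, run from a checkout of the resource-watch repo gathered in the job artifacts; the pathspec is inferred from the file layout above and assumes one matching file per commit):

# Print every commit that added a pod-network-service, with the clusterIP it recorded at that time.
for sha in $(git log --format=%H --diff-filter=A -- 'namespaces/*/core/services/pod-network-service.yaml'); do
  # path of the service yaml added in this commit
  path=$(git diff-tree --no-commit-id --name-only -r "$sha" -- 'namespaces/*/core/services/pod-network-service.yaml')
  echo "$(git show -s --format=%ci "$sha")  $path"
  # clusterIP as recorded at that point in the watch history
  git show "$sha:$path" | grep 'clusterIP:'
done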
However, our conformance backend-disruption_20240221-042147 file is recording outages for the previous IP:
"Feb 21 04:21:28.000 - 12s E backend-disruption-name/pod-to-service-new-connections connection/new disruption/pod-to-service-to-service-from-node-ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4-to-clusterIP-172.30.136.5 reason/DisruptionBegan request-audit-id/2afb874c-40eb-45c4-8d4f-e05b45e54e5a backend-disruption-name/pod-to-service-new-connections connection/new disruption/pod-to-service-to-service-from-node-ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4-to-clusterIP-172.30.136.5 stopped responding to GET requests over new connections: Get \"http://172.30.136.5:80\": dial tcp 172.30.136.5:80: connect: connection refused",
And loki indicates that the new poller pod in the conformance namespace is initially initializing with the old IP address:
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
2024-02-21 04:21:54 {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
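Tallying those entries makes the split obvious (a small sketch, assuming the loki output above was exported to a file; loki-disruption-poller.log is a hypothetical name):

# count how many "Initializing" lines reference each clusterIP
grep -o 'Initializing to watch clusterIP [0-9.]*' loki-disruption-poller.log | sort | uniq -c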
For brevity (heh) I'm not including the deployments from resource-watch, but they appear to have the correct IP addresses. Our investigation will start with trying to verify that we aren't missing resource updates and that loki is attributing log entries to the proper namespaces, etc.
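As a starting point for those two checks, something along these lines should work (a sketch against the same resource-watch checkout; it assumes resource-watch also records the poller pods, which still needs confirming):

# 1) Were there updates to the new service that we might have missed?
git log --oneline -- namespaces/e2e-pod-network-disruption-test-k92nt/core/services/pod-network-service.yaml
# 2) Which namespace does resource-watch place the poller pod named in the loki entries?
git log --all --format= --name-only | grep 'pod-network-to-service-disruption-poller-65d867cdbd-64wlz' | sort -u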
- relates to
-
TRT-1514 Mass 35s network outage
- Closed
- links to