OCP Technical Release Team / TRT-1530

Investigate pod-network-to-service disruption across monitor test namespaces


    • Type: Story
    • Priority: Normal
    • Resolution: Done

      In TRT-1514 we found an issue where conformance disruption monitors report disruption for service endpoints created by the upgrade monitor tests, as noted here.

      That investigation was followed up by looking at another run with similar symptoms.

      Reviewing resource-watch, we can observe when the services were created and removed, along with their values:

      git log | grep -B5 "services/pod-network-service"
      
      commit c43b533b59edc542e4f5dd8e9b37ca21a90dfb23
      Author: unknown <ci-monitor@openshift.io>
      Date:   Wed Feb 21 05:17:58 2024 +0000
      
          removed services/pod-network-service -n e2e-pod-network-disruption-test-k92nt
      --
      
      commit 96678a29585e1c4b35e34aa898dbab2d27f16d8e
      Author: openshift-tests <ci-monitor@openshift.io>
      Date:   Wed Feb 21 04:23:45 2024 +0000
      
          added services/pod-network-service -n e2e-pod-network-disruption-test-k92nt
      --
      
      commit 0fb6688ece33c61a1ed7c7f851f5face24a134e5
      Author: unknown <ci-monitor@openshift.io>
      Date:   Wed Feb 21 04:23:12 2024 +0000
      
          removed services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
      --
      
      commit 08faa9e4e8419275be3281d8f821666a990f9788
      Author: openshift-tests <ci-monitor@openshift.io>
      Date:   Wed Feb 21 03:13:20 2024 +0000
      
          added services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
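The add/remove pairs in that log can be grouped mechanically. A minimal sketch, with two of the commits above embedded directly; against a real resource-watch clone this text would come from the `git log | grep` command shown:

```python
import re
from datetime import datetime

# Two commits from the `git log | grep -B5 "services/pod-network-service"`
# output above, embedded here so the sketch is self-contained.
LOG = """\
commit 96678a29585e1c4b35e34aa898dbab2d27f16d8e
Author: openshift-tests <ci-monitor@openshift.io>
Date:   Wed Feb 21 04:23:45 2024 +0000

    added services/pod-network-service -n e2e-pod-network-disruption-test-k92nt
--
commit 0fb6688ece33c61a1ed7c7f851f5face24a134e5
Author: unknown <ci-monitor@openshift.io>
Date:   Wed Feb 21 04:23:12 2024 +0000

    removed services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
"""

def service_events(log_text):
    """Yield (timestamp, action, namespace) for each add/remove of the service."""
    pattern = re.compile(
        r"Date:\s+(?P<date>.+?)\n\n\s+(?P<action>added|removed) "
        r"services/pod-network-service -n (?P<ns>\S+)"
    )
    for m in pattern.finditer(log_text):
        ts = datetime.strptime(m.group("date"), "%a %b %d %H:%M:%S %Y %z")
        yield ts, m.group("action"), m.group("ns")

for ts, action, ns in service_events(LOG):
    print(ts.isoformat(), action, ns)
```

Sorting the yielded tuples by timestamp gives each namespace's service lifetime, which is what the disruption intervals below need to be checked against.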
      
      

      git checkout 08faa9e4e8419275be3281d8f821666a990f9788
      HEAD is now at 08faa9e4e8 added services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4

      cat namespaces/e2e-pod-network-disruption-test-vv2x4/core/services/pod-network-service.yaml

        name: pod-network-service
        namespace: e2e-pod-network-disruption-test-vv2x4
        resourceVersion: "47260"
        uid: e726c46d-562c-4bab-b813-d4afeb857e0b
      spec:
        clusterIP: 172.30.136.5
        clusterIPs:
        - 172.30.136.5
      
      

      git checkout 96678a29585e1c4b35e34aa898dbab2d27f16d8e
      Previous HEAD position was 08faa9e4e8 added services/pod-network-service -n e2e-pod-network-disruption-test-vv2x4
      HEAD is now at 96678a2958 added services/pod-network-service -n e2e-pod-network-disruption-test-k92nt

      cat namespaces/e2e-pod-network-disruption-test-k92nt/core/services/pod-network-service.yaml

        name: pod-network-service
        namespace: e2e-pod-network-disruption-test-k92nt
        resourceVersion: "105532"
        uid: 4ec3aaaa-7ea6-4745-9d66-1f64bd1e8a3a
      spec:
        clusterIP: 172.30.19.181
        clusterIPs:
        - 172.30.19.181
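So the two namespaces' services hold distinct clusterIPs. A quick sketch to pull `spec.clusterIP` out of the snapshots (regex-based on purpose, so no YAML parser is needed for a single scalar field; the manifests are abbreviated from the output above):

```python
import re

# The two service snapshots checked out above, abbreviated to the relevant line.
SNAPSHOTS = {
    "e2e-pod-network-disruption-test-vv2x4": "spec:\n  clusterIP: 172.30.136.5",
    "e2e-pod-network-disruption-test-k92nt": "spec:\n  clusterIP: 172.30.19.181",
}

def cluster_ip(manifest_text):
    """Extract spec.clusterIP from a service manifest fragment."""
    m = re.search(r"clusterIP:\s*(\d+\.\d+\.\d+\.\d+)", manifest_text)
    return m.group(1) if m else None

ips = {ns: cluster_ip(text) for ns, text in SNAPSHOTS.items()}
print(ips)
```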
      
      

      However, our conformance backend-disruption_20240221-042147 file records outages for the previous service's IP:

                      "Feb 21 04:21:28.000 - 12s   E backend-disruption-name/pod-to-service-new-connections connection/new disruption/pod-to-service-to-service-from-node-ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4-to-clusterIP-172.30.136.5 reason/DisruptionBegan request-audit-id/2afb874c-40eb-45c4-8d4f-e05b45e54e5a backend-disruption-name/pod-to-service-new-connections connection/new disruption/pod-to-service-to-service-from-node-ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4-to-clusterIP-172.30.136.5 stopped responding to GET requests over new connections: Get \"http://172.30.136.5:80\": dial tcp 172.30.136.5:80: connect: connection refused",
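The clusterIP being dialed is embedded in the disruption locator, so it can be checked mechanically against which service actually owned that IP. A sketch with the locator abbreviated from the record above:

```python
import re

# The disruption record above, abbreviated to its locator.
EVENT = ("disruption/pod-to-service-to-service-from-node-"
         "ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4-to-clusterIP-172.30.136.5 "
         "reason/DisruptionBegan")

# Which namespace's service held which clusterIP, per the resource-watch snapshots.
SERVICE_IP = {
    "e2e-pod-network-disruption-test-vv2x4": "172.30.136.5",
    "e2e-pod-network-disruption-test-k92nt": "172.30.19.181",
}

def dialed_ip(event):
    """Extract the clusterIP the poller was dialing from the locator string."""
    m = re.search(r"clusterIP-(\d+\.\d+\.\d+\.\d+)", event)
    return m.group(1) if m else None

ip = dialed_ip(EVENT)
# The dialed IP maps back to the vv2x4 namespace's service, not the k92nt
# service that this conformance backend should be targeting.
owners = [ns for ns, sip in SERVICE_IP.items() if sip == ip]
print(ip, "owned by", owners)
```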
      

      And Loki indicates that the new pod in the conformance namespace initially initializes its watch with the old IP address:

      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.136.5:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}
      2024-02-21 04:21:54
      {"container":"disruption-poller","host":"ci-op-89h3f2lm-9825d-5bzsl-worker-a-lp7c4","namespace":"e2e-pod-network-disruption-test-k92nt","pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz","_entry":"Initializing to watch clusterIP 172.30.19.181:80"}

      For brevity (heh) I'm not including the deployments from resource-watch, but they appear to have the correct IP addresses. Our investigation will start by verifying that we aren't missing resource updates and that Loki is attributing log entries to the proper namespaces.
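One way to start that verification is to tally the Loki entries per (namespace, clusterIP) pair, which makes the split visible immediately. A sketch with the entries above embedded; in a real check they would come from a Loki query:

```python
import json
from collections import Counter

# The eight Loki entries above, reduced to the fields this check needs; the IP
# sequence matches the log order (19.181 x2, 136.5 x4, 19.181 x2).
TEMPLATE = ('{"namespace":"e2e-pod-network-disruption-test-k92nt",'
            '"pod":"pod-network-to-service-disruption-poller-65d867cdbd-64wlz",'
            '"_entry":"Initializing to watch clusterIP %s:80"}')
IPS = ["172.30.19.181"] * 2 + ["172.30.136.5"] * 4 + ["172.30.19.181"] * 2
LINES = [TEMPLATE % ip for ip in IPS]

counts = Counter()
for line in LINES:
    rec = json.loads(line)
    # "_entry" ends with "clusterIP <ip>:80"; strip the trailing port.
    ip = rec["_entry"].rsplit(" ", 1)[-1].rsplit(":", 1)[0]
    counts[(rec["namespace"], ip)] += 1

# A single pod in a single namespace reports watching two different clusterIPs.
for (ns, ip), n in sorted(counts.items()):
    print(ns, ip, n)
```

If the counts for the foreign IP drop to zero when the entries are re-queried with a stricter namespace selector, the problem is Loki attribution; if they don't, the poller really was initialized with the old IP.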

            Assignee: Forrest Babcock (rh-ee-fbabcock)
            Reporter: Forrest Babcock (rh-ee-fbabcock)