OCPBUGS-25832

Failure to communicate between pods on different hypershift-deployed bare metal nodes

    • Bug
    • Resolution: Test Pending
    • Undefined
    • None
    • 4.14.z
    • None
    • No
    • False
      Description of problem:

      I have a two-node cluster running OpenShift 4.14.6, deployed using hosted control planes. The management cluster is running OpenShift 4.14.5.

      NAME     STATUS   ROLES    AGE     VERSION
      wrk-79   Ready    worker   4h28m   v1.27.8+4fab27b
      wrk-84   Ready    worker   4h49m   v1.27.8+4fab27b
      

      The user-visible problem is that the console UI fails to load. The static HTML content at https://console-openshift-console.apps.vcluster1.int.massopen.cloud loads correctly, but many of the dynamic resources loaded by that page eventually time out, leaving the browser showing nothing but a blank screen.

      In investigating this with Cesar Wong, we discovered this error in the console pod:

      E1221 19:08:04.980783       1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      2023/12/21 19:08:04 http: panic serving 10.132.2.2:40966: runtime error: invalid memory address or nil pointer dereference
      goroutine 19072 [running]:
      net/http.(*conn).serve.func1()
              /usr/lib/golang/src/net/http/server.go:1854 +0xbf
      panic({0x325b720, 0x4fd0150})
              /usr/lib/golang/src/runtime/panic.go:890 +0x263
      github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc00047f1e0, 0x2?, {0xc000b7e151, 0x11}, {0x3a44b00, 0xc0001cc9a0}, 0xa?)
              /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
      github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xfc00000000000010?, {0x3a44b00, 0xc0001cc9a0}, 0xc0016bd300)
              /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
      github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a44b00?, 0xc0001cc9a0?}, 0xc000edc680?)
              /go/src/github.com/openshift/console/pkg/server/server.go:605 +0x33
      net/http.HandlerFunc.ServeHTTP(...)
              /usr/lib/golang/src/net/http/server.go:2122
      github.com/openshift/console/pkg/server.authMiddleware.func1(0x3404d60?, {0x3a44b00?, 0xc0001cc9a0?}, 0xd?)
              /go/src/github.com/openshift/console/pkg/server/middleware.go:27 +0x31
      github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a44b00, 0xc0001cc9a0}, 0xc0016bd300)
              /go/src/github.com/openshift/console/pkg/server/middleware.go:41 +0x24c
      net/http.HandlerFunc.ServeHTTP(...)
              /usr/lib/golang/src/net/http/server.go:2122
      github.com/openshift/console/pkg/server.verifyCSRF.func1({0x3a44b00, 0xc0001cc9a0}, 0xc0016bd300)
              /go/src/github.com/openshift/console/pkg/server/middleware.go:74 +0x205
      net/http.HandlerFunc.ServeHTTP(0xc000f486c0?, {0x3a44b00?, 0xc0001cc9a0?}, 0x7f082eb23a68?)
              /usr/lib/golang/src/net/http/server.go:2122 +0x2f
      net/http.StripPrefix.func1({0x3a44b00, 0xc0001cc9a0}, 0xc0016bd200)
              /usr/lib/golang/src/net/http/server.go:2165 +0x332
      net/http.HandlerFunc.ServeHTTP(0xc0003fac00?, {0x3a44b00?, 0xc0001cc9a0?}, 0xc0009dba00?)
              /usr/lib/golang/src/net/http/server.go:2122 +0x2f
      net/http.(*ServeMux).ServeHTTP(0x3404d60?, {0x3a44b00, 0xc0001cc9a0}, 0xc0016bd200)
              /usr/lib/golang/src/net/http/server.go:2500 +0x149
      github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a44b00, 0xc0001cc9a0}, 0x33075a0?)
              /go/src/github.com/openshift/console/pkg/server/middleware.go:139 +0x3af
      net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a44b00?, 0xc0001cc9a0?}, 0x11dbe6e?)
              /usr/lib/golang/src/net/http/server.go:2122 +0x2f
      net/http.serverHandler.ServeHTTP({0xc000edf980?}, {0x3a44b00, 0xc0001cc9a0}, 0xc0016bd200)
              /usr/lib/golang/src/net/http/server.go:2936 +0x316
      net/http.(*conn).serve(0xc000f48480, {0x3a469d0, 0xc000edf740})
              /usr/lib/golang/src/net/http/server.go:1995 +0x612
      created by net/http.(*Server).Serve
              /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
      

      The console pod is running on node wrk-79:

      $ oc -n openshift-console get pod -o wide
      NAME                         READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
      console-7fcdd9565d-k48c7     1/1     Running   0          4h30m   10.132.0.11   wrk-79   <none>           <none>
      downloads-79c558955f-djrz9   1/1     Running   0          4h30m   10.132.0.12   wrk-79   <none>           <none>
      

      The monitoring-plugin pod is running on node wrk-84:

      $ oc -n openshift-monitoring get pod -o wide -l app.kubernetes.io/name=monitoring-plugin
      NAME                                 READY   STATUS    RESTARTS   AGE     IP            NODE     NOMINATED NODE   READINESS GATES
      monitoring-plugin-6b644876bb-wjvc6   1/1     Running   0          4h54m   10.132.2.15   wrk-84   <none>           <none>
      

      Attempting to access that address from node wrk-79 fails:

      [core@wrk-79 ~]$ curl -k --connect-timeout 10 https://10.132.2.15:9443
      curl: (28) Operation timed out after 10001 milliseconds with 0 out of 0 bytes received
      

      That same pod is accessible from node wrk-84:

      [root@wrk-84 ~]# curl -s -k --connect-timeout 10 --write-out '%{response_code}\n' -o /dev/null https://10.132.2.15:9443
      200
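
The curl error from wrk-79 says only that the operation timed out; it does not say whether the TCP connect or the TLS handshake stalled. The following is a minimal sketch of a probe that separates the two stages. The function name `probe_tls` and the stage labels are made up for illustration, and certificate verification is disabled to mirror `curl -k`.

```python
import socket
import ssl


def probe_tls(host: str, port: int, timeout: float = 10.0) -> str:
    """Report how far a TLS connection gets.

    Returns 'connect_failed' if the TCP connection cannot be established,
    'handshake_failed' if TCP connects but the TLS handshake errors or
    times out, and 'ok' if the handshake completes.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"

    # Like `curl -k`: do not verify the server certificate.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    try:
        # The handshake runs inside wrap_socket and inherits the socket timeout.
        with ctx.wrap_socket(sock):
            return "ok"
    except OSError:  # covers ssl.SSLError and handshake timeouts
        return "handshake_failed"
    finally:
        sock.close()


if __name__ == "__main__":
    # Pod address and port taken from the report above.
    print(probe_tls("10.132.2.15", 9443))
```

A result of `handshake_failed` from one node and `ok` from the other would suggest that small packets cross the overlay while larger ones, such as the server's certificate message, do not.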
      

      Running tcpdump on the geneve interface between the two nodes, we see the following sequence:

      [root@wrk-84 /]# tcpdump -i genev_sys_6081 -n port 9443 or \( host 10.132.0.2 and icmp \)
      dropped privs to tcpdump
      19:24:18.877023 IP 10.132.0.2.34818 > 10.132.2.15.9443: Flags [S], seq 2726740528, win 65280, options [mss 1360,sackOK,TS val 366089489 ecr 0,nop,wscale 7], length 0
      19:24:18.877285 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [S.], seq 732634800, ack 2726740529, win 64704, options [mss 1360,sackOK,TS val 2418241540 ecr 366089489,nop,wscale 7], length 0
      19:24:18.877407 IP 10.132.0.2.34818 > 10.132.2.15.9443: Flags [.], ack 1, win 510, options [nop,nop,TS val 366089490 ecr 2418241540], length 0
      19:24:18.913797 IP 10.132.0.2.34818 > 10.132.2.15.9443: Flags [P.], seq 1:518, ack 1, win 510, options [nop,nop,TS val 366089526 ecr 2418241540], length 517
      19:24:18.913817 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [.], ack 518, win 502, options [nop,nop,TS val 2418241576 ecr 366089526], length 0
      19:24:18.915082 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [P.], seq 1:2472, ack 518, win 502, options [nop,nop,TS val 2418241578 ecr 366089526], length 2471
      19:24:18.915090 IP 10.132.0.2 > 10.132.2.15: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:18.918146 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [P.], seq 1349:2472, ack 518, win 502, options [nop,nop,TS val 2418241581 ecr 366089526], length 1123
      19:24:18.918288 IP 10.132.0.2.34818 > 10.132.2.15.9443: Flags [.], ack 1, win 510, options [nop,nop,TS val 366089531 ecr 2418241576,nop,nop,sack 1 {1349:2472}], length 0
      19:24:18.918323 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [.], seq 1:1349, ack 518, win 502, options [nop,nop,TS val 2418241581 ecr 366089531], length 1348
      19:24:18.918339 IP 10.132.0.2 > 10.132.2.15: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.120168 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [.], seq 1:1349, ack 518, win 502, options [nop,nop,TS val 2418241783 ecr 366089531], length 1348
      19:24:19.120179 IP 10.132.0.2 > 10.132.2.15: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.163652 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.172034 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.172195 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.175306 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.176277 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.368137 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.376150 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.384176 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.528155 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [.], seq 1:1349, ack 518, win 502, options [nop,nop,TS val 2418242191 ecr 366089531], length 1348
      19:24:19.528169 IP 10.132.0.2 > 10.132.2.15: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.776171 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.784142 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:19.792166 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:20.352159 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [.], seq 1:1349, ack 518, win 502, options [nop,nop,TS val 2418243015 ecr 366089531], length 1348
      19:24:20.352166 IP 10.132.0.2 > 10.132.2.15: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:20.608177 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      19:24:20.608202 IP 10.132.0.2 > 10.132.2.22: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
      

      What is causing this communication failure?
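
One way to read the capture, offered as a sketch rather than a diagnosis: the advertised MSS of 1360 corresponds to a 1400-byte MTU (1360 + 20 bytes IPv4 header + 20 bytes TCP header), and the retransmitted 1348-byte segments carry a 12-byte timestamp option, so each one is a 1400-byte IP packet. That is larger than the 1342-byte MTU reported in the ICMP "need to frag" messages. The helper names below are hypothetical, and the header sizes assume IPv4 without options and TCP with only the timestamp option, as printed on the data segments. The script shows the size mismatch only; it does not explain why the path MTU is 1342 or why the sender keeps retransmitting at the same size.

```python
import re

IP_HEADER = 20       # IPv4 header, no options (assumption)
TCP_HEADER = 20      # TCP header, no options
TCP_TS_OPTION = 12   # "nop,nop,TS val ... ecr ..." as printed on the data segments

# Three lines copied from the capture above.
SAMPLE = """\
19:24:18.913797 IP 10.132.0.2.34818 > 10.132.2.15.9443: Flags [P.], seq 1:518, ack 1, win 510, options [nop,nop,TS val 366089526 ecr 2418241540], length 517
19:24:18.918323 IP 10.132.2.15.9443 > 10.132.0.2.34818: Flags [.], seq 1:1349, ack 518, win 502, options [nop,nop,TS val 2418241581 ecr 366089531], length 1348
19:24:18.918339 IP 10.132.0.2 > 10.132.2.15: ICMP 10.132.0.2 unreachable - need to frag (mtu 1342), length 556
"""


def ip_datagram_size(tcp_payload_len):
    """Size of the IPv4 packet carrying a TCP segment with a timestamp option."""
    return IP_HEADER + TCP_HEADER + TCP_TS_OPTION + tcp_payload_len


def oversized_segments(capture):
    """Return (payload, ip_packet_size, mtu) for data segments larger than
    the smallest MTU reported by ICMP 'need to frag' lines in the capture."""
    mtus = [int(m) for m in re.findall(r"need to frag \(mtu (\d+)\)", capture)]
    if not mtus:
        return []
    mtu = min(mtus)
    hits = []
    for line in capture.splitlines():
        # Only TCP lines carry "Flags [...]"; the trailing number is the payload length.
        match = re.search(r"Flags \[.*length (\d+)\s*$", line)
        if not match:
            continue
        payload = int(match.group(1))
        if payload == 0:
            continue  # pure ACK/SYN: other option sets, and never oversized
        size = ip_datagram_size(payload)
        if size > mtu:
            hits.append((payload, size, mtu))
    return hits


if __name__ == "__main__":
    for payload, size, mtu in oversized_segments(SAMPLE):
        print(f"payload {payload} -> IP packet {size} bytes, reported MTU {mtu}")
```

Run against the three sample lines, only the 1348-byte segment is flagged; the 517-byte client segment (569 bytes on the wire) fits under the reported MTU, which matches the observation that the client's data is acknowledged while the server's larger reply is not.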

      Version-Release number of selected component (if applicable):

      Client Version: 4.13.0-202310162157.p0.g717d4a5.assembly.stream-717d4a5
      Kustomize Version: v4.5.7
      Server Version: 4.14.6
      Kubernetes Version: v1.27.8+4fab27b
      

            bbennett@redhat.com Ben Bennett
            lkellogg@redhat.com Lars Kellogg-Stedman
            Zhanqi Zhao