  1. Knative Serving
  2. SRVKS-1158

[scalability] With "full-mesh" setup, 1000 ksvcs on the cluster, activator crashloops, istio-proxy OOMKilled


    • Type: Bug
    • Resolution: Unresolved
    • 1.31.0

      On a "full-mesh" setup with 100 single-namespace tenants of 10 ksvcs each (1000 ksvcs in total), KnativeServing does not become Ready after a reinstall (deleting and re-creating the KnativeServing resource). The activator and autoscaler pods crash-loop:

      $ oc get pods -n knative-serving
      NAME                                                          READY   STATUS             RESTARTS        AGE
      activator-6c49f676cd-7bcwg                                    0/2     CrashLoopBackOff   18 (8s ago)     32m
      activator-6c49f676cd-rhmdh                                    1/2     CrashLoopBackOff   18 (12s ago)    33m
      autoscaler-764884ddb-5zkkl                                    1/2     Running            15 (12s ago)    32m
      autoscaler-764884ddb-pgggm                                    1/2     CrashLoopBackOff   19 (10s ago)    32m
      autoscaler-hpa-54d9bf5777-fb4dg                               1/1     Running            0               32m
      autoscaler-hpa-54d9bf5777-snwsn                               1/1     Running            0               32m
      controller-8476b46bc7-g56jj                                   1/1     Running            0               32m
      controller-8476b46bc7-nchp4                                   1/1     Running            0               32m
      domain-mapping-5fc9879c7b-2w2d2                               1/1     Running            0               32m
      domain-mapping-5fc9879c7b-jsfvr                               1/1     Running            0               32m
      domainmapping-webhook-74f4f67dcb-4wzq7                        1/1     Running            0               32m
      domainmapping-webhook-74f4f67dcb-cv2j2                        1/1     Running            0               32m
      net-istio-controller-656455f449-25hbq                         1/1     Running            0               32m
      net-istio-controller-656455f449-777gb                         1/1     Running            0               32m
      net-istio-webhook-7dcb75cfdf-67zfx                            1/1     Running            0               32m
      net-istio-webhook-7dcb75cfdf-pp44b                            1/1     Running            0               32m
      storage-version-migration-serving-serving-1.10-1.32.0-9sgx4   0/1     CrashLoopBackOff   11 (108s ago)   44m
      storage-version-migration-serving-serving-1.10-1.32.0-n2zwl   1/1     Running            0               9m30s
      webhook-7c79f46b97-25swl                                      1/1     Running            0               32m
      webhook-7c79f46b97-lb99f                                      1/1     Running            0               32m
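
To pick the failing pods out of a listing like the one above, a small filter over the tabular output can help. This is only a sketch: the function name `unready_pods` and the shortened `SAMPLE` text are illustrative, and it assumes the default `oc get pods` column order (NAME, READY, STATUS, ...).

```python
def unready_pods(table: str) -> list[tuple[str, str, str]]:
    """Return (name, ready, status) for pods whose READY count is below the total."""
    result = []
    for line in table.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        have, _, want = ready.partition("/")
        if not (have.isdigit() and want.isdigit()):
            continue
        if int(have) < int(want):
            result.append((name, ready, status))
    return result


# Shortened copy of the listing from this report (assumed column layout).
SAMPLE = """\
NAME                          READY   STATUS             RESTARTS       AGE
activator-6c49f676cd-7bcwg    0/2     CrashLoopBackOff   18 (8s ago)    32m
autoscaler-764884ddb-5zkkl    1/2     Running            15 (12s ago)   32m
controller-8476b46bc7-g56jj   1/1     Running            0              32m
"""

for name, ready, status in unready_pods(SAMPLE):
    print(f"{name}: {ready} {status}")
```

Note that a pod can report STATUS `Running` and still be unready (1/2), as the autoscaler does here, which is why the filter compares the READY counts rather than the STATUS column.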
      

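      The logs of the previous activator container show it failing to reach the Kubernetes API server (connection refused) until it times out and exits: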

      $ oc logs -n knative-serving activator-6c49f676cd-7bcwg -p
      2023/10/10 12:48:13 Registering 3 clients
      2023/10/10 12:48:13 Registering 3 informer factories
      2023/10/10 12:48:13 Registering 4 informers
      2023/10/10 12:48:13 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:14 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:15 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:16 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:17 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:18 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:19 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:20 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:21 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:22 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:23 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:24 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:25 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:26 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:27 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:28 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:29 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:30 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:31 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:32 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:33 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:34 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:35 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:36 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:37 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:38 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:39 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:40 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:41 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:42 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:43 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:44 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:45 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:46 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:47 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:48 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:49 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:50 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:51 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:52 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:53 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:54 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:55 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:56 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:57 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:58 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:48:59 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:00 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:01 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:02 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:03 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:04 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:05 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:06 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:07 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:08 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:09 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:10 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:11 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:12 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:13 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:13 Failed to get k8s version Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
      2023/10/10 12:49:13 Timed out attempting to get k8s version: Get "https://172.30.0.1:443/version": dial tcp 172.30.0.1:443: connect: connection refused
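
The log above retries about once per second from 12:48:13 until the timeout at 12:49:13. A rough way to summarize such a log, instead of reading every line, is to count the retry lines and measure the span between the first and last one. The helper name `summarize_retries` and the three-line `SAMPLE_LOG` are illustrative only, and the timestamp format is assumed to match the one shown in this report.

```python
from datetime import datetime


def summarize_retries(log: str, needle: str = "Failed to get k8s version"):
    """Return (number of retry lines, seconds between the first and last one)."""
    stamps = []
    for line in log.splitlines():
        if needle not in line:
            continue
        parts = line.split()
        # The first two whitespace-separated fields are date and time.
        stamps.append(datetime.strptime(" ".join(parts[:2]), "%Y/%m/%d %H:%M:%S"))
    if not stamps:
        return 0, 0.0
    return len(stamps), (max(stamps) - min(stamps)).total_seconds()


# Abbreviated stand-in for the real log; messages are shortened for readability.
SAMPLE_LOG = """\
2023/10/10 12:48:13 Registering 4 informers
2023/10/10 12:48:13 Failed to get k8s version Get "https://172.30.0.1:443/version": connection refused
2023/10/10 12:48:14 Failed to get k8s version Get "https://172.30.0.1:443/version": connection refused
2023/10/10 12:48:15 Failed to get k8s version Get "https://172.30.0.1:443/version": connection refused
2023/10/10 12:48:15 Timed out attempting to get k8s version: connection refused
"""

count, seconds = summarize_retries(SAMPLE_LOG)
print(f"{count} failed attempts over {seconds:.0f}s")
```

In practice the input would be the saved output of `oc logs -n knative-serving <pod> -p`.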
      

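      According to the pod's container status, the istio-proxy sidecar was OOMKilled (exit code 137) and is now in CrashLoopBackOff, while the activator container itself is running: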

      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2023-10-10T13:09:05Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2023-10-10T13:17:38Z"
          message: 'containers with unready status: [istio-proxy]'
          reason: ContainersNotReady
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2023-10-10T13:17:38Z"
          message: 'containers with unready status: [istio-proxy]'
          reason: ContainersNotReady
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2023-10-10T13:09:05Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - containerID: cri-o://d36ac162e5722f3fe7ba28e9c08c9093df35a6949dabc0d378635f6bcb88a630
          image: registry.ci.openshift.org/openshift/knative-serving-activator:knative-v1.10
          imageID: registry.ci.openshift.org/openshift/knative-serving-activator@sha256:db7f01a85dc533d90f7f22b7c615238a751aefc080375ab10cce42389cfbcfc0
          lastState: {}
          name: activator
          ready: true
          restartCount: 0
          started: true
          state:
            running:
              startedAt: "2023-10-10T13:09:08Z"
        - containerID: cri-o://3b901e60f2906c7b9ed8386403fbe908a1b48ed8b88c9bf83f6ce3d4913bef3f
          image: registry.redhat.io/openshift-service-mesh/proxyv2-rhel8@sha256:0f1b44e867c5ffc1a7d9ab82907ac9eafb44e12cfa9b76967f85ac0c38e5cb7c
          imageID: registry.redhat.io/openshift-service-mesh/proxyv2-rhel8@sha256:0f1b44e867c5ffc1a7d9ab82907ac9eafb44e12cfa9b76967f85ac0c38e5cb7c
          lastState:
            terminated:
              containerID: cri-o://3b901e60f2906c7b9ed8386403fbe908a1b48ed8b88c9bf83f6ce3d4913bef3f
              exitCode: 137
              finishedAt: "2023-10-10T13:17:37Z"
              reason: OOMKilled
              startedAt: "2023-10-10T13:16:51Z"
          name: istio-proxy
          ready: false
          restartCount: 3
          started: false
          state:
            waiting:
              message: back-off 40s restarting failed container=istio-proxy pod=activator-6c49f676cd-nsrvw_knative-serving(c9f0ad07-bdee-41d6-9d10-617c7277bae1)
              reason: CrashLoopBackOff
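
With many pods to check, the OOMKilled condition shown above can be detected programmatically. The sketch below assumes the pod has been fetched as JSON (for example with `oc get pod <name> -n knative-serving -o json`) and loaded into a dict; the function name `oom_killed_containers` and the trimmed `POD` sample are illustrative, with field names taken from the status dump in this report.

```python
def oom_killed_containers(pod: dict) -> list[dict]:
    """List containers whose current or previous termination reason is OOMKilled."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A container may be terminated right now (state) or have been
        # terminated before its latest restart (lastState).
        for key in ("state", "lastState"):
            term = (cs.get(key) or {}).get("terminated")
            if term and term.get("reason") == "OOMKilled":
                hits.append({
                    "container": cs["name"],
                    "exitCode": term.get("exitCode"),
                    "restartCount": cs.get("restartCount"),
                })
                break
    return hits


# Trimmed version of the status above, as it would appear after JSON decoding.
POD = {
    "status": {
        "containerStatuses": [
            {
                "name": "activator",
                "restartCount": 0,
                "lastState": {},
                "state": {"running": {"startedAt": "2023-10-10T13:09:08Z"}},
            },
            {
                "name": "istio-proxy",
                "restartCount": 3,
                "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            },
        ]
    }
}

print(oom_killed_containers(POD))
```

Checking `lastState` as well as `state` matters here, because a crash-looping container spends most of its time in `waiting` and only its previous state records the OOM kill.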
      

              Assignee: Unassigned
              Reporter: Marek Schmidt (maschmid@redhat.com)