
OSSM-9340: Pod healthcheck is not working in ambient mode on OCP 4.16


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Affects Version/s: None
    • Fix Version/s: OSSM 3.0.0
    • Component/s: Sail Operator
    • Labels: None

      This bug is similar to the following one, but on a different OCP version:
      https://issues.redhat.com/browse/OSSM-9053

      The OpenShift cluster is configured with ambient mode.
      The ambient mode configuration is applied at the namespace level via the label:
      "istio.io/dataplane-mode: ambient"
      The pods are stuck in the CrashLoopBackOff state because they are unable to pass the health check.
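
      For reference, a minimal sketch of such a namespace (the name is the test namespace from this run):

      apiVersion: v1
      kind: Namespace
      metadata:
        name: echo-2-4894
        labels:
          # enrolls every pod in this namespace into the ambient mesh
          istio.io/dataplane-mode: ambient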

      Note:
      On OCP 4.16 this issue happens when ambient mode is installed a second time.
      When executing, for example, e2e testing for the first time after a fresh cluster deployment, everything works and the ambient mode pods are able to start.
      But when the same exact e2e flow is executed a second, third, etc. time, the issue reappears. A hedged outline of that flow is shown below.
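
      The commands here are illustrative only; "captured-deployment.yaml" is a stand-in for the actual test workloads:

      oc new-project echo-test
      oc label namespace echo-test istio.io/dataplane-mode=ambient
      oc apply -n echo-test -f captured-deployment.yaml   # 1st run on a fresh cluster: pods start fine
      oc delete project echo-test
      # repeating the same steps on the same cluster: the captured pods fail their startup probes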

      Overall pod status.
      The pods that are in Running state carry the "istio.io/dataplane-mode: none" label (see the label check after the listing).

      NAME                                              READY   STATUS             RESTARTS         AGE
      captured-v1-7987bd7db4-rrrs6                      0/1     CrashLoopBackOff   21 (14s ago)     45m
      captured-v2-674d878bb-rlwz7                       0/1     CrashLoopBackOff   21 (9s ago)      45m
      service-addressed-waypoint-v1-6cbf7b65b7-xgqg8    0/1     CrashLoopBackOff   21 (50s ago)     45m
      service-addressed-waypoint-v2-85d78fd549-m26ls    0/1     CrashLoopBackOff   21 (6s ago)      45m
      sidecar-v1-8644c8b7fc-pdfp9                       2/2     Running            0                45m
      sidecar-v2-9dbbd4d7-sjfp9                         2/2     Running            0                45m
      uncaptured-v1-5b8b4dcb7d-x4ln6                    1/1     Running            0                45m
      uncaptured-v2-6d96cb477b-jcdjf                    1/1     Running            0                45m
      workload-addressed-waypoint-v1-7968cfd7d4-h5bxn   0/1     CrashLoopBackOff   19 (4m41s ago)   45m
      workload-addressed-waypoint-v2-64cb85878-n6dmw    0/1     CrashLoopBackOff   21 (2s ago)      45m 
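
      The dataplane-mode labels can be confirmed directly on the pods, for example:

      oc get pods -n echo-2-4894 --show-labels | grep dataplane-mode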

      Events from one of the failed pods:

      Events:
        Type    Reason                 Age   From                                       Message
        ----    ------                 ----  ----                                       -------
        Normal  Scheduled              66m   default-scheduler                          Successfully assigned echo-2-4894/captured-v2-674d878bb-rlwz7 to ip-10-0-63-201.ec2.internal
        Normal  IPTablesUsageObserved  59m   openshift.io/iptables-deprecation-alerter  This pod appears to have created one or more iptables rules. IPTables is
      deprecated and will no longer be available in RHEL 10 and later. You should
      consider migrating to another API such as nftables or eBPF. See also
      https://access.redhat.com/solutions/6739041
      Example iptables rule seen in this pod:
      -A PREROUTING -j ISTIO_PRERT
        Normal   AddedInterface  66m                  multus   Add eth0 [10.128.2.84/23] from ovn-kubernetes
        Normal   Killing         65m                  kubelet  Container app failed startup probe, will be restarted
        Normal   Created         65m (x2 over 66m)    kubelet  Created container: app
        Normal   Started         65m (x2 over 66m)    kubelet  Started container app
        Warning  Unhealthy       65m (x18 over 66m)   kubelet  Startup probe failed: dial tcp 10.128.2.84:3333: i/o timeout
        Normal   Pulled          21m (x21 over 66m)   kubelet  Container image "image-registry.openshift-image-registry.svc:5000/istio-system/app:istio-testing" already present on machine
        Warning  BackOff         69s (x299 over 65m)  kubelet  Back-off restarting failed container app in pod captured-v2-674d878bb-rlwz7_echo-2-4894(c52eff71-9aba-4c33-a1ab-e8d3aa9c519a) 
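
      The timeout can be reproduced by hand from the worker node hosting the pod; a quick check, assuming nc is available on the node:

      # on ip-10-0-63-201.ec2.internal; times out just like the kubelet probe
      nc -z -w 1 10.128.2.84 3333; echo exit=$?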

      The "startupProbe" points to a "tcp-health-port", which is defined with port 3333.
      The container itself defines the port 3333, during the startup arguments.
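
      The relevant fragment, extracted from the full manifest below:

      startupProbe:
        failureThreshold: 10
        periodSeconds: 1
        tcpSocket:
          port: tcp-health-port   # named port, resolves to containerPort 3333
      ports:
      - containerPort: 3333
        name: tcp-health-port
        protocol: TCP

      The full pod manifest: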

      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          ambient.istio.io/redirection: enabled
          k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.84/23"],"mac_address":"0a:58:0a:80:02:54","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.128.0.0/14","nextHop":"10.128.2.1"},{"dest":"172.30.0.0/16","nextHop":"10.128.2.1"},{"dest":"169.254.169.5/32","nextHop":"10.128.2.1"},{"dest":"100.64.0.0/16","nextHop":"10.128.2.1"}],"ip_address":"10.128.2.84/23","gateway_ip":"10.128.2.1"}}'
          k8s.v1.cni.cncf.io/network-status: |-
            [{
                "name": "ovn-kubernetes",
                "interface": "eth0",
                "ips": [
                    "10.128.2.84"
                ],
                "mac": "0a:58:0a:80:02:54",
                "default": true,
                "dns": {}
            }]
          openshift.io/scc: restricted-v2
          prometheus.io/port: "15014"
          prometheus.io/scrape: "true"
          seccomp.security.alpha.kubernetes.io/pod: runtime/default
        creationTimestamp: "2025-04-09T11:59:20Z"
        generateName: captured-v2-674d878bb-
        labels:
          app: captured
          pod-template-hash: 674d878bb
          test.istio.io/class: captured
          version: v2
        name: captured-v2-674d878bb-rlwz7
        namespace: echo-2-4894
        ownerReferences:
        - apiVersion: apps/v1
          blockOwnerDeletion: true
          controller: true
          kind: ReplicaSet
          name: captured-v2-674d878bb
          uid: 5385f545-65af-4f12-8a7e-783004849582
        resourceVersion: "197479"
        uid: c52eff71-9aba-4c33-a1ab-e8d3aa9c519a
      spec:
        containers:
        - args:
          - --metrics=15014
          - --cluster=cluster-0
          - --port=18080
          - --grpc=17070
          - --port=18085
          - --tcp=19090
          - --port=18443
          - --tls=18443
          - --tcp=16060
          - --server-first=16060
          - --tcp=19091
          - --tcp=16061
          - --server-first=16061
          - --port=18081
          - --grpc=17071
          - --port=19443
          - --tls=19443
          - --port=18082
          - --bind-ip=18082
          - --port=18084
          - --bind-localhost=18084
          - --tcp=19092
          - --port=18083
          - --port=18086
          - --port=18087
          - --proxy-protocol=18087
          - --port=8080
          - --port=3333
          - --version=v2
          - --istio-version=
          - --crt=/cert.crt
          - --key=/cert.key
          env:
          - name: INSTANCE_IPS
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: status.podIPs
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: BIND_FAMILY
          image: image-registry.openshift-image-registry.svc:5000/istio-system/app:istio-testing
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 10
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            tcpSocket:
              port: tcp-health-port
            timeoutSeconds: 1
          name: app
          ports:
          - containerPort: 18080
            protocol: TCP
          - containerPort: 17070
            protocol: TCP
          - containerPort: 18085
            protocol: TCP
          - containerPort: 19090
            protocol: TCP
          - containerPort: 18443
            protocol: TCP
          - containerPort: 16060
            protocol: TCP
          - containerPort: 19091
            protocol: TCP
          - containerPort: 16061
            protocol: TCP
          - containerPort: 18081
            protocol: TCP
          - containerPort: 17071
            protocol: TCP
          - containerPort: 19443
            protocol: TCP
          - containerPort: 18082
            protocol: TCP
          - containerPort: 18084
            protocol: TCP
          - containerPort: 19092
            protocol: TCP
          - containerPort: 18083
            protocol: TCP
          - containerPort: 18086
            protocol: TCP
          - containerPort: 18087
            protocol: TCP
          - containerPort: 8080
            protocol: TCP
          - containerPort: 3333
            name: tcp-health-port
            protocol: TCP
          readinessProbe:
            failureThreshold: 10
            httpGet:
              path: /
              port: 8080
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 2
            successThreshold: 1
            timeoutSeconds: 1
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            runAsNonRoot: true
            runAsUser: 1000900000
          startupProbe:
            failureThreshold: 10
            periodSeconds: 1
            successThreshold: 1
            tcpSocket:
              port: tcp-health-port
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-xqfzr
            readOnly: true
        dnsPolicy: ClusterFirst
        enableServiceLinks: true
        imagePullSecrets:
        - name: captured-dockercfg-tdvxj
        nodeName: ip-10-0-63-201.ec2.internal
        preemptionPolicy: PreemptLowerPriority
        priority: 0
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext:
          fsGroup: 1000900000
          seLinuxOptions:
            level: s0:c30,c15
          seccompProfile:
            type: RuntimeDefault
        serviceAccount: captured
        serviceAccountName: captured
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: 300
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 300
        topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: captured
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
        volumes:
        - name: kube-api-access-xqfzr
          projected:
            defaultMode: 420
            sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
            - configMap:
                items:
                - key: service-ca.crt
                  path: service-ca.crt
                name: openshift-service-ca.crt
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2025-04-09T11:59:21Z"
          status: "True"
          type: PodReadyToStartContainers
        - lastProbeTime: null
          lastTransitionTime: "2025-04-09T11:59:20Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2025-04-09T11:59:20Z"
          message: 'containers with unready status: [app]'
          reason: ContainersNotReady
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2025-04-09T11:59:20Z"
          message: 'containers with unready status: [app]'
          reason: ContainersNotReady
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2025-04-09T11:59:20Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - containerID: cri-o://e2f4e34985755f2465e8b27463e08c4ad0d7cf7b20662d43842c305a263c4ea5
          image: image-registry.openshift-image-registry.svc:5000/istio-system/app:istio-testing
          imageID: image-registry.openshift-image-registry.svc:5000/istio-system/app@sha256:51796092733faeba30645417ef0d45ab1d4ec5457beafa598b03bcbaa4d567e0
          lastState:
            terminated:
              containerID: cri-o://e2f4e34985755f2465e8b27463e08c4ad0d7cf7b20662d43842c305a263c4ea5
              exitCode: 0
              finishedAt: "2025-04-09T13:06:43Z"
              reason: Completed
              startedAt: "2025-04-09T13:06:31Z"
          name: app
          ready: false
          restartCount: 29
          started: false
          state:
            waiting:
              message: back-off 5m0s restarting failed container=app pod=captured-v2-674d878bb-rlwz7_echo-2-4894(c52eff71-9aba-4c33-a1ab-e8d3aa9c519a)
              reason: CrashLoopBackOff
        hostIP: 10.0.63.201
        hostIPs:
        - ip: 10.0.63.201
        phase: Running
        podIP: 10.128.2.84
        podIPs:
        - ip: 10.128.2.84
        qosClass: BestEffort
        startTime: "2025-04-09T11:59:20Z"

      Assignee: Gaddam Sridhar (sgaddam@redhat.com)
      Reporter: Maxim Babushkin (mbabushk@redhat.com)