  1. OpenShift Service Mesh
  2. OSSM-2376

ServiceMesh federation stops working after the restart of istiod pod


    • Sprint 61

      The federation controller uses multiple informers to fetch Kubernetes objects, but it does not check HasSynced() on all of them. This causes a race condition: an informer may be queried before its cache is ready, and object processing then fails.

      Steps to reproduce for QE:
      1. Deploy 2 service meshes.
      2. Federate CA certificates and apply ServiceMeshPeers.
      3. Restart istiod containers.

      Expected result: each istiod starts its federation controllers successfully, without errors like those in the log below.

      Original description:

      The following error appears in istiod logs after istiod has been restarted:

      # oc logs -f istiod-east-mesh-5bc7974588-xjxcg | grep -i "error processing"
      2022-12-21T10:08:29.180591Z     error   federation      Error processing remote-east-mesh-system/west-mesh (will retry): could not get root cert for mesh west-mesh: error getting configmap west-ca-root-cert in namespace remote-east-mesh-system: configmap "west-ca-root-cert" not found  component=federation-discovery-controller
      2022-12-21T10:08:29.186240Z     error   federation      Error processing remote-east-mesh-system/west-mesh (will retry): could not get root cert for mesh west-mesh: error getting configmap west-ca-root-cert in namespace remote-east-mesh-system: configmap "west-ca-root-cert" not found  component=federation-discovery-controller
      2022-12-21T10:08:29.197127Z     error   federation      Error processing remote-east-mesh-system/west-mesh (will retry): could not get root cert for mesh west-mesh: error getting configmap west-ca-root-cert in namespace remote-east-mesh-system: configmap "west-ca-root-cert" not found  component=federation-discovery-controller
      2022-12-21T10:08:29.217542Z     error   federation      Error processing remote-east-mesh-system/west-mesh (will retry): could not get root cert for mesh west-mesh: error getting configmap west-ca-root-cert in namespace remote-east-mesh-system: configmap "west-ca-root-cert" not found  component=federation-discovery-controller
      2022-12-21T10:08:29.257752Z     error   federation      Error processing remote-east-mesh-system/west-mesh (will retry): could not get root cert for mesh west-mesh: error getting configmap west-ca-root-cert in namespace remote-east-mesh-system: configmap "west-ca-root-cert" not found  component=federation-discovery-controller
      2022-12-21T10:08:29.337875Z     error   federation      Error processing remote-east-mesh-system/west-mesh (giving up): could not get root cert for mesh west-mesh: error getting configmap west-ca-root-cert in namespace remote-east-mesh-system: configmap "west-ca-root-cert" not found   component=federation-discovery-controller

       
      Shortly after that, communication with the other ServiceMeshPeer is interrupted.

      The configmap exists, and communication with the other peer was working correctly before the restart.

      # oc version
      Client Version: 4.11.18
      Kustomize Version: v4.5.4
      Server Version: 4.11.20
      Kubernetes Version: v1.24.6+5658434
      
      # oc get smcp -A -o wide
      NAMESPACE                 NAME        READY   STATUS            PROFILES      VERSION   AGE   IMAGE REGISTRY
      remote-east-mesh-system   east-mesh   10/10   ComponentsReady   ["default"]   2.2.4     45h
      # oc get csv
      NAME                           DISPLAY                                          VERSION    REPLACES                     PHASE
      elasticsearch-operator.5.5.5   OpenShift Elasticsearch Operator                 5.5.5                                   Succeeded
      jaeger-operator.v1.39.0-3      Red Hat OpenShift distributed tracing platform   1.39.0-3   jaeger-operator.v1.34.1-5    Succeeded
      kiali-operator.v1.57.3         Kiali Operator                                   1.57.3     kiali-operator.v1.48.3       Succeeded
      servicemeshoperator.v2.3.0     Red Hat OpenShift Service Mesh                   2.3.0-0    servicemeshoperator.v2.2.3   Succeeded

       
      Attaching the must-gather from the mesh where the issue happened. 

              jewertow@redhat.com Jacek Ewertowski
              rhn-support-asolanas Alexis Solanas
              Praneeth Bajjuri
