Maistra / MAISTRA-2378

Problem reconciling SMMR with large number of members when ovs-multitenant is used


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • maistra-
    • maistra-2.0.3, maistra-2.0.4, maistra-2.0.5
    • operator
    • None
    • Sprint 4

      Clusters configured with the ovs-multitenant network plugin experience ServiceMeshMemberRoll (SMMR) reconciliation issues when the number of members exceeds a certain threshold.

      This issue appears only with the combination of ovs-multitenant and the new concurrent reconciliation of member namespaces introduced in 2.0.3.

      When using ovs-multitenant, the istio-operator joins a member namespace to the mesh by adding the `pod.network.openshift.io/multitenant.change-network` annotation to the `netnamespace` object for that member namespace (this is exactly what the `oc adm pod-network join-projects` command does). This annotation is then picked up by OpenShift, which joins the namespace to the correct network and removes the annotation. The istio-operator waits up to 16s for this to happen. Previously, because the namespaces were reconciled sequentially, the 16s timeout was adequate. In 2.0.3+, with a high number of namespaces, that is no longer the case: OpenShift now has to process all of those namespaces (i.e. remove the annotation from each) within 16s. If it fails to do so for even one of the namespaces, the istio-operator considers the reconciliation of that member to have failed and removes the member from SMMR.status.configuredMembers.
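The annotate-and-wait step described above can be sketched as follows. This is an illustrative simulation in Python, not the operator's actual Go code: `NetNamespace`, `join_project`, and the simulated SDN controller (a timer that removes the annotation after a delay) are all hypothetical stand-ins; only the annotation key and the 16s timeout come from the description above.

```python
import threading
import time

# Annotation the operator adds to the netnamespace; OpenShift removes it
# once it has joined the namespace to the target network.
ANNOTATION = "pod.network.openshift.io/multitenant.change-network"

class NetNamespace:
    """Minimal stand-in for an OpenShift NetNamespace object."""
    def __init__(self, name):
        self.name = name
        self.annotations = {}

def join_project(netns, sdn_delay, timeout=16.0, poll=0.1):
    """Annotate the netnamespace, then wait up to `timeout` seconds for the
    (simulated) SDN controller to remove the annotation. Returns True if the
    join was processed in time, False if the operator would give up."""
    netns.annotations[ANNOTATION] = "join:mesh-network"   # hypothetical value
    # Simulated SDN controller: removes the annotation after sdn_delay seconds.
    threading.Timer(sdn_delay, netns.annotations.pop,
                    args=(ANNOTATION, None)).start()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if ANNOTATION not in netns.annotations:
            return True    # OpenShift processed the join in time
        time.sleep(poll)
    return False           # timed out -> member reconcile considered failed
```

With one namespace at a time, a 16s window per join is generous; the failure mode above arises only when many such waits race against a single SDN controller at once.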

      The reconciler then runs again (with backoff, but still almost immediately). Instead of adding the annotation only to the namespaces that are not yet joined to the mesh, the ovs-multitenant implementation in the istio-operator adds it to every member specified in the SMMR, including those that are already joined. This typically causes failures in a different set of namespaces than the previous attempt, and this time it is those namespaces that are removed from the configuredMembers list. The entire process then repeats, which is why the configuredMembers list appears to change randomly.
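The repeating failure pattern can be modeled with a small simulation. This is an assumed, simplified model (not the operator's code): `capacity` stands in for how many annotation removals the SDN controller manages within the 16s window, and which joins get processed in a given pass is modeled as random. The point it illustrates is the one above: re-annotating every member each pass keeps the backlog at full size, so configuredMembers never converges and its contents keep shifting, whereas annotating only unjoined members lets the backlog shrink to zero.

```python
import random

def reconcile(members, joined, capacity, annotate_all, rng):
    """One reconcile pass. `capacity` models how many annotation removals the
    SDN controller completes within the 16s timeout window."""
    if annotate_all:
        to_annotate = list(members)            # buggy: re-annotate everyone
    else:
        to_annotate = [m for m in members if m not in joined]  # fixed variant
    processed = set(rng.sample(to_annotate, min(capacity, len(to_annotate))))
    failed = set(to_annotate) - processed
    # Members whose annotation was not removed in time are dropped from
    # configuredMembers; successfully processed ones are kept.
    return (joined | processed) - failed

members = [f"ns-{i}" for i in range(10)]   # hypothetical member namespaces
rng = random.Random(42)
buggy, fixed, history = set(), set(), []
for _ in range(6):
    buggy = reconcile(members, buggy, capacity=4, annotate_all=True, rng=rng)
    fixed = reconcile(members, fixed, capacity=4, annotate_all=False, rng=rng)
    history.append(frozenset(buggy))
```

After six passes, `fixed` contains all ten members, while `buggy` is still stuck at four and the four it contains differ from pass to pass, matching the randomly changing configuredMembers list observed on affected clusters.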

        1. failure (27 kB)
        2. reproducer.tar (9 kB)
        3. success (26 kB)

            mluksa@redhat.com Marko Luksa