OCPBUGS-25449

[ICNI2] BFD sessions per worker do not scale

    • No
    • SDN Sprint 248, SDN Sprint 249, SDN Sprint 250
    • 3
    • False

      Description of problem:

      This issue is related to the ICNI2/multi-egress OVN feature.

      During PERFSCALE-2388 testing, intermittent BFD flapping was observed between the emulated SPK pods and the application workers.

      The issue can be reproduced with only 2 worker nodes and appears when each worker handles more than roughly 20 to 30 BFD sessions.

      After talking to dceara@redhat.com, we suspect the cause is OVN CoPP (control plane protection).
      The idea behind CoPP is to rate limit control packets hitting ovn-controller.
      It is disabled by default, and ovnkube should configure it with a legitimate value for each protocol. BFD is one of them.
      Currently a single rate is applied to ALL protocols:
      https://github.com/ovn-org/ovn-kubernetes/blob/1bd71cb0dd57740a23c504ede36fee0f85ae3bc5/go-controller/pkg/ovn/copp.go#L57-L60

          band := &nbdb.MeterBand{
              Action: types.MeterAction,
              Rate:   int(25), // hard-coding for now. TODO(tssurya): make this configurable if needed
          }
          for _, protocol := range defaultProtocolNames {
              // format: <OVNSupportedProtocolName>-rate-limiter
              meterName := getMeterNameForProtocol(protocol)
              meterNames[protocol] = meterName
              meter := &nbdb.Meter{
                  Name: meterName,
                  Fair: &meterFairness,
                  Unit: types.PacketsPerSecond,
              }
              ops, err = libovsdbops.CreateOrUpdateMeterOps(nbClient, ops, meter, []*nbdb.MeterBand{band},
                  &meter.Bands, &meter.Fair, &meter.Unit)
              if err != nil {
                  return "", fmt.Errorf("can't create meter %v: %v", meter, err)
              }
          }

      This means that ovn-controller will receive at most 25 BFD packet-ins per second.
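
      To make the arithmetic concrete, here is a minimal Go sketch that compares the aggregate BFD packet rate per worker against the 25 packets-per-second meter. The function names are hypothetical, and the 1000 ms per-session packet interval is an assumption (the interval used in the test is not stated in this issue). The sketch also ignores the meter's fairness setting, bursts, and jitter, so it is only a rough estimate.

```go
package main

import "fmt"

// bfdPacketInRate estimates how many BFD control packets per second reach
// ovn-controller, assuming every session delivers one packet each
// intervalMs milliseconds. (Hypothetical helper, not part of ovn-kubernetes.)
func bfdPacketInRate(sessions, intervalMs int) float64 {
	return float64(sessions) * 1000.0 / float64(intervalMs)
}

// maxSessionsUnderMeter returns the largest number of sessions whose
// aggregate packet rate stays at or below meterPPS.
func maxSessionsUnderMeter(meterPPS, intervalMs int) int {
	return meterPPS * intervalMs / 1000
}

func main() {
	const meterPPS = 25     // from copp.go: Rate: int(25)
	const intervalMs = 1000 // ASSUMPTION: one BFD packet per session per second

	for _, n := range []int{10, 20, 25, 30, 40} {
		rate := bfdPacketInRate(n, intervalMs)
		fmt.Printf("sessions=%d rate=%.1f pps exceedsMeter=%t\n",
			n, rate, rate > meterPPS)
	}
	fmt.Println("max sessions under meter:",
		maxSessionsUnderMeter(meterPPS, intervalMs))
}
```

      Under that assumed interval, the meter would start dropping BFD packets once a worker handles more than 25 sessions, which is in line with the 20 to 30 session range where flapping was observed. A shorter interval would lower the threshold proportionally.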
      Full troubleshooting document can be found here
       
      How reproducible:

      Using the ICNI2/multi-egress annotations (https://docs.openshift.com/container-platform/4.14/networking/ovn_kubernetes_network_provider/configuring-secondary-external-gateway.html), try to establish more than 30 BFD sessions per worker.

            sseethar Surya Seetharaman
            jcastillolema Jose Castillo Lema
            Anurag Saxena Anurag Saxena