OpenShift Bugs / OCPBUGS-42983

Azure 120 scale: OutboundPortsForOutboundRule


    • Quality / Stability / Reliability
    • Rejected

      Please change component if incorrect area, TIA

      Description of problem:

      Unable to scale an Azure cluster to 120 nodes: machines fail and nodes never become ready
      
      NOTE: this scale-up passed on 4.16, and passed on 4.17 with the workaround documented here: https://access.redhat.com/solutions/6982343
      
      Is this workaround now a standing requirement for scaling above 55 nodes?
      
      Machine events show: 
      
      InvalidConfiguration: failed to reconcile machine "scaleci18-8929-xd67z-worker-centralus2-2k44r": network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="SpecifiedAllocatedOutboundPortsForOutboundRuleExceedsTotalNumberOfAvailablePorts" Message="Specified Allocated Outbound Ports 1024 for Outbound Rule /subscriptions/***/resourceGroups/scaleci18-8929-xd67z-rg/providers/Microsoft.Network/loadBalancers/scaleci18-8929-xd67z/outboundRules/OutboundNATAllProtocols exceeds total number of available ports per backend instance of 1008 based upon desired pool size. Reduce allocated ports or increase number of IP addresses for outbound rule." Details=[]
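      The numbers in that error can be reproduced with some quick arithmetic. This is only a sketch under two assumptions that are not stated in this report: that each frontend public IP on the outbound rule provides 64000 SNAT ports, and that Azure rounds the per-instance share down to a multiple of 8. The function names are made up for illustration.

```shell
# Assumption: 64000 SNAT ports per frontend public IP on the outbound rule.
PORTS_PER_FRONTEND_IP=64000

# Ports available to each backend instance for a given pool size.
# Assumption: the share is rounded down to a multiple of 8.
available_ports_per_instance() {
  pool_size="$1"
  frontend_ips="${2:-1}"
  raw=$(( PORTS_PER_FRONTEND_IP * frontend_ips / pool_size ))
  echo $(( raw / 8 * 8 ))
}

# Largest backend pool that a fixed per-instance allocation can support.
max_pool_size() {
  allocated_ports="$1"
  frontend_ips="${2:-1}"
  echo $(( PORTS_PER_FRONTEND_IP * frontend_ips / allocated_ports ))
}
```

      Under those assumptions, `available_ports_per_instance 63` gives 1008, which matches the figure in the error, and `max_pool_size 1024` gives 62, so a rule that allocates 1024 ports with a single frontend IP would stop accepting backend instances well below 120 nodes. That is consistent with the two options the error message offers: reduce allocated ports or add IP addresses to the outbound rule.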
          

      Version-Release number of selected component (if applicable):

       4.18.0-0.nightly-2024-10-08-033830
          

      How reproducible:

      100%
          

      Steps to Reproduce:

          1. Create an Azure OVN cluster on 4.18.0-0.nightly-2024-10-08-033830
          2. Scale all 3 zone machinesets to 40 nodes each (120 nodes total)
      
      This can be done in the console, or by running the command below against each of the 3 machinesets:
      
       oc scale machinesets -n openshift-machine-api ${machineset} --replicas 40
          3. Many machines fail and nodes don't become available
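      For convenience, the scale step can be wrapped in a small loop. This is a sketch only: the function name `scale_machinesets` is hypothetical, and the machineset names have to be passed in by the caller, since they are not listed in this report. It runs the same `oc scale` command shown in step 2 once per machineset.

```shell
# Scale each named machineset in openshift-machine-api to the given replica count.
# Usage: scale_machinesets <replicas> <machineset> [<machineset> ...]
scale_machinesets() {
  replicas="$1"
  shift
  for machineset in "$@"; do
    # Same command as in the reproduction steps, once per machineset.
    oc scale machinesets -n openshift-machine-api "$machineset" --replicas "$replicas"
  done
}
```

      Calling it with 40 followed by the three zone machineset names reproduces step 2.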
          

      Actual results:

      Many machines fail and nodes don't become available 
          

      Expected results:

      All machines reach Running and all nodes become ready
          

      Additional info:

      Did not hit this issue on 4.17.0-nightly**; this is the first time hitting it on 4.18. Currently running in different regions and on newer versions of 4.17 to compare results.
      
      
      Must gather can be found here: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/must-gather/594/artifact/
          

              Assignee: Unassigned
              Reporter: Paige Patton (prubenda)