OpenShift GitOps / GITOPS-6149

1172 of 3672 SNOs are not deployed because the openshift-gitops-application-controller-0 pod crashes with OOMKilled


      Description of Problem

      • I am running a 3500-SNO ZTP scale test with an ACM 2.13 downstream build and OCP 4.18.0-rc4. The SNOs are deployed with the Assisted Installer and Siteconfig v1. Each GitOps application deploys 300 clusters, and 500 clusters are deployed every hour. The first 8 applications work well, but starting with application number 9 the applications are stuck at "out of sync" because the openshift-gitops-application-controller-0 pod crashes with OOMKilled. As a result, only the first 2500 clusters are deployed; the remaining 1172 are not.
        We have done a similar test with ACM 2.12 and OCP 4.17, where we sometimes had an issue with the last 100 clusters in the last application (GITOPS-5664). With ACM 2.13 and OCP 4.18.0 the problem appears to be much worse.
        # oc get pod -n openshift-gitops openshift-gitops-application-controller-0 -w

        NAME                                        READY   STATUS             RESTARTS         AGE
        openshift-gitops-application-controller-0   1/1     Running            90 (8m8s ago)    18h
        openshift-gitops-application-controller-0   0/1     OOMKilled          90 (8m9s ago)    18h
        openshift-gitops-application-controller-0   0/1     CrashLoopBackOff   90 (10s ago)     18h
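
The watch output above can be summarized to show how often the controller pod cycles through each state. The following is a minimal sketch, assuming the watch output has been saved to a file. `summarize_pod_watch` is a hypothetical helper name, not part of `oc`, and the parsing assumes the default column layout shown above (STATUS in column 3, restart count at the start of column 4).

```shell
#!/bin/sh
# Hypothetical helper (not part of oc): summarize saved `oc get pod -w` output.
# Reads watch lines on stdin and prints one "STATUS count" line per status,
# followed by the highest restart count seen.
summarize_pod_watch() {
  awk '
    # Skip blank or short lines, the header row, and "# oc ..." prompt lines.
    NF < 4 || $1 == "NAME" || $1 ~ /^#/ { next }
    {
      count[$3]++                       # column 3 is STATUS
      if ($4 + 0 > max) max = $4 + 0    # column 4 begins with the restart count
    }
    END {
      for (s in count) print s, count[s]
      print "max_restarts", max + 0
    }
  ' | LC_ALL=C sort
}
```

One possible way to use it is to redirect the watch command to a file (for example `watch.log`, an assumed name), stop it after a while, and run `summarize_pod_watch < watch.log`. A high `OOMKilled` count alongside a growing `max_restarts` value would be consistent with the crash loop described in this report.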
        

      The gitops and audit must-gather is here

      Additional Info

      • <Any additional info such as logs, must-gather outputs, etc.>

      Problem Reproduction

      • <How do we reproduce the problem?>

      Reproducibility

      • <Always/Intermittent/Only Once>

      Prerequisites/Environment

      • <OpenShift, managed service (e.g., ROSA, ARO), operators, layered product, and other software versions, build details>

      Steps to Reproduce

      • ...

      Expected Results

      • ...

      Actual Results

      • ...

      Problem Analysis

      • <Completed by engineering team as part of the triage/refinement process>

      Root Cause

      • <What is the root cause of the problem? Or, why is it not a bug?>

      Workaround (If Possible)

      • <Are there any workarounds we can provide to the customers?>

      Fix Approaches

      • <If we decide to fix this bug, how will we do it?>

      Acceptance Criteria

      • ...

      Definition of Done

      • Code Complete:
        • All code has been written, reviewed, and approved.
      • Tested:
        • Unit tests have been written and passed.
        • Ensure code coverage is not reduced with the changes.
        • Integration tests have been automated.
        • System tests have been conducted, and all critical bugs have been fixed.
        • Tested and merged on OpenShift either upstream or downstream on a local build.
      • Documentation:
        • User documentation or release notes have been written (if applicable).
      • Build:
        • Code has been successfully built and integrated into the main repository / project.
        • Midstream changes (if applicable) are done, reviewed, approved and merged.
      • Review:
        • Code has been peer-reviewed and meets coding standards.
        • All acceptance criteria defined in the user story have been met.
        • Tested by reviewer on OpenShift.
      • Deployment:
        • The feature has been deployed on OpenShift cluster for testing.

              Assignee: Unassigned
              Reporter: Ting Xue (rhn-support-txue)