OpenShift Bugs / OCPBUGS-58070

High latency etcd disk writes due to openshift-marketplace pods/OLM


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Version: 4.14.z
    • Component: OLM
    • Quality / Stability / Reliability
    • Severity: Critical
    • Rejected
    • Sprint: Mewtwo Sprint 273

      Description of problem:

       

       

      Platform: ARO
      OCP Version: 4.16.37

      ARO SRE have found that openshift-marketplace pods running on a master node appear to cause disk I/O contention. The following symptoms were observed in one customer cluster:

      - On master-2, a few openshift-marketplace pods were spiking CPU usage, noticeably higher than kube-apiserver or the other usual top CPU consumers.
      - On master-2, etcd request latencies are as high as > 1 s to ~9 s.
      - On master-2, the VM disk queue length and I/O bandwidth are higher than average.

      We suspect this may be a regression of the fix for
      [OCPBUGS-48697] OLMv0: excessive catalog source snapshots cause severe performance regression [openshift-4.15.z] - Red Hat Issue Tracker

      Please investigate or help us rule this out (a Prometheus query sketch that can surface these symptoms follows below). The SRE team needs OLM expertise to confirm whether this bug exists in the customer's cluster.
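      For reference, a minimal sketch of how these symptoms could be pulled from the in-cluster Prometheus. It assumes a reachable prometheus-k8s route and a bearer token with monitoring access (both placeholders, e.g. from oc whoami -t); the metric names are the stock cAdvisor/etcd metrics, nothing OLM-specific, and label sets may differ per cluster.

# Minimal sketch: correlate openshift-marketplace CPU usage with etcd latency
# and disk fsync times on the masters. PROM_URL and TOKEN are placeholders
# (hypothetical environment variables), not part of the bug report.
import os
import requests

PROM_URL = os.environ["PROM_URL"]  # e.g. the prometheus-k8s route of the cluster
TOKEN = os.environ["TOKEN"]        # e.g. output of: oc whoami -t

QUERIES = {
    # CPU (cores) used per openshift-marketplace pod.
    "marketplace_cpu": 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="openshift-marketplace"}[5m]))',
    # p99 etcd WAL fsync latency per member; sustained values far above ~10ms
    # usually point at disk contention.
    "etcd_wal_fsync_p99": 'histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))',
    # p99 etcd unary gRPC handling time, the signal behind etcdGRPCRequestsSlow.
    "etcd_grpc_p99": 'histogram_quantile(0.99, sum by (instance, le) (rate(grpc_server_handling_seconds_bucket{job="etcd", grpc_type="unary"}[5m])))',
}

def instant_query(expr):
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": expr},
        headers={"Authorization": f"Bearer {TOKEN}"},
        verify=False,  # demo only; point at the cluster CA bundle in real use
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for name, expr in QUERIES.items():
        print(f"== {name} ==")
        for series in instant_query(expr):
            labels = {k: v for k, v in series["metric"].items() if k != "__name__"}
            print(f"  {labels}: {float(series['value'][1]):.3f}")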
      

       

      Version-Release number of selected component (if applicable):

          OCP 4.16.37 (ARO)

      How reproducible:

      100%

      Steps to Reproduce:

          1. Install an ARO cluster with version 4.16.37.
          2. Wait for some time; optionally install operators and put average load on etcd, or do anything else that simulates realistic cluster and OLM usage.
          3. In the OpenShift console, go to the Alerting page and observe the etcdGRPCRequestsSlow alert.
          4. The etcdGRPCRequestsSlow alert should be flipping between pending and inactive, or firing.
          5. Grab the alert's query, run it against Prometheus (see the sketch after these steps), and observe the etcd latencies.
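          For step 5, a minimal sketch (same placeholder PROM_URL/TOKEN assumptions as in the description) that looks up the etcdGRPCRequestsSlow rule via the Prometheus rules API, prints its state, and re-runs its expression so the etcd latencies behind the alert can be inspected directly.

# Minimal sketch for step 5: fetch the etcdGRPCRequestsSlow alerting rule and
# evaluate its expression. PROM_URL and TOKEN are placeholders as above.
import os
import requests

PROM_URL = os.environ["PROM_URL"]
TOKEN = os.environ["TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def find_rule(name):
    # Walk all rule groups exposed by Prometheus and return the first rule
    # whose name matches (alerting rules carry "state" and "query" fields).
    data = requests.get(f"{PROM_URL}/api/v1/rules", headers=HEADERS, verify=False).json()
    for group in data["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("name") == name:
                return rule
    return None

rule = find_rule("etcdGRPCRequestsSlow")
if rule is None:
    print("etcdGRPCRequestsSlow rule not found")
else:
    print("state:", rule["state"])  # inactive / pending / firing
    print("expr: ", rule["query"])
    # If the expression gates on a threshold (as alert expressions typically do),
    # only the instances currently violating it are returned here.
    result = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": rule["query"]},
        headers=HEADERS,
        verify=False,  # demo only
    ).json()["data"]["result"]
    for series in result:
        print(series["metric"], "->", series["value"][1], "s")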

      Actual results:

    etcd request latencies are > 1 s and can even reach 5 s to 9 s.

      Expected results:

    etcd latencies should be < 1 s, and the etcdGRPCRequestsSlow alert should be neither pending nor firing.

      Additional info:

      Must-gather (MG) link: https://attachments.access.redhat.com/hydra/rest/cases/04179961/attachments/08b1bb49-3b20-4cbc-b212-94fd3facb1f5?usePresignedUrl=true

              Assignee: Jordan Keister (rh-ee-jkeister)
              Reporter: Jose Gavine Cueto (jcueto@redhat.com)
              QA Contact: Kui Wang