    Description

      We need to continue maintaining specific areas within storage. This issue captures that effort and tracks it across releases.

      Goals

      • To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
      • When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
      • Reduce storage test flakiness so we can spot real bugs in our CI.

      Requirements

      Requirement      Notes   isMvp?
      Telemetry                No
      Certification            No
      API metrics              No

      Out of Scope

      n/a

      Background, and strategic fit
      With the expected scale of our customer base, we want to keep the load of customer tickets / BZs low.

      Assumptions

      Customer Considerations

      Documentation Considerations

      • Target audience: internal
      • Updated content: none at this time.

      Notes

      In progress:

      • CI flakes:
        • Configurable timeouts for e2e tests
          • Azure is slow and times out often
          • Cinder times out formatting volumes
          • AWS resize test times out
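
      One possible shape for configurable e2e timeouts is a per-platform multiplier applied to each test's base timeout. The sketch below is illustrative only: the function name, the E2E_TIMEOUT_MULTIPLIER_<PLATFORM> variable, and the default values are assumptions, not an existing option of the test suite.

```python
import os

# Hypothetical per-platform multipliers; the values are illustrative only.
DEFAULT_MULTIPLIERS = {"azure": 2.0, "openstack": 1.5, "aws": 1.5}


def scaled_timeout(base_seconds, platform, env=None):
    """Return the timeout for `platform`, scaled by a configurable multiplier.

    An environment variable named E2E_TIMEOUT_MULTIPLIER_<PLATFORM>
    (a made-up name for this sketch) overrides the built-in default.
    """
    env = os.environ if env is None else env
    default = DEFAULT_MULTIPLIERS.get(platform.lower(), 1.0)
    raw = env.get(f"E2E_TIMEOUT_MULTIPLIER_{platform.upper()}")
    multiplier = float(raw) if raw else default
    if multiplier <= 0:
        raise ValueError("timeout multiplier must be positive")
    return base_seconds * multiplier
```

      With these assumed defaults, a 300-second base timeout becomes 600 seconds on Azure, and a CI job could raise the AWS value without a code change by setting E2E_TIMEOUT_MULTIPLIER_AWS=3.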

       

      High prio:

      • Environment check tool for VMware: users often misconfigure permissions there and blame OpenShift. A tool they could run themselves might report clearer errors.
        • Should it be part of the installer?
        • Spike exists
      • Add / use cloud API call metrics
        • Helps customers to understand why things are slow
        • Helps the build cop understand a flake
          • With a post-install step that filters data from Prometheus that’s still running in the CI job.
        • Ideas:
          • Cloud is throttling X% of API calls longer than Y seconds
          • Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
        • Capture metrics of operations that are stuck and won’t finish.
          • Sweep the operation map from the operation executor?
          • Report the operation metric into the highest bucket once the bucket threshold has passed (e.g. if 10 minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
          • Ask the monitoring team?
        • Include in CSI drivers too.
          • With alerts too
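
      The "stuck operation" and "X% slower than Y seconds" ideas above can be sketched with a plain histogram and a periodic sweep. This is a minimal model under stated assumptions, not how any existing component does it: the class and function names are hypothetical, time is passed in explicitly, and the overflow (+Inf) bucket stands in for "the highest bucket".

```python
import bisect


class DurationHistogram:
    """counts[i] holds observations <= bounds[i]; counts[-1] is the overflow."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds):
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1


class OperationTracker:
    """Hypothetical tracker that reports stuck operations before they finish."""

    def __init__(self, histogram):
        self.histogram = histogram
        self.started = {}      # operation id -> start time in seconds
        self.reported = set()  # ids already counted while still running

    def start(self, op_id, now):
        self.started[op_id] = now

    def finish(self, op_id, now):
        start = self.started.pop(op_id)
        if op_id in self.reported:
            self.reported.discard(op_id)  # already counted as stuck
            return
        self.histogram.observe(now - start)

    def sweep(self, now):
        """Count each still-running operation older than the last bound once."""
        limit = self.histogram.bounds[-1]
        for op_id, start in self.started.items():
            if op_id not in self.reported and now - start > limit:
                self.histogram.observe(now - start)
                self.reported.add(op_id)


def fraction_slower_than(histogram, bound):
    """Fraction of observations that took longer than `bound` (a bucket bound)."""
    idx = histogram.bounds.index(bound)
    total = sum(histogram.counts)
    return sum(histogram.counts[idx + 1:]) / total if total else 0.0
```

      With bounds of 60, 300 and 600 seconds, an attach that is still running after 700 seconds lands in the overflow bucket on the next sweep, and it is not counted a second time on later sweeps or when it eventually finishes. An alert could then fire when fraction_slower_than(histogram, 600) exceeds a chosen percentage.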

      Unsorted

      • As the number of storage operators grows, it would be useful to have a Grafana board for storage operators
        • CSI driver metrics (from CSI sidecars + the driver itself + its operator?)
        • CSI migration?
      • Get aggregated logs in cluster
        • They're rotated too soon
        • No logs from dead / restarted pods
        • No tools to combine logs from multiple pods (e.g. 3 controller managers)
      • What storage issues do customers have? Storage was 22% of all issues.
        • Insufficient docs?
        • Probably garbage
      • Document basic storage troubleshooting for our support engineers
        • What logs are useful when, what log level to use
        • This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.
      • Common vSphere errors, and how to debug and fix them.
      • Document sig-storage flake handling - not all failed [sig-storage] tests are ours
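
      For the "no tools to combine logs from multiple pods" item above, a small merge by timestamp may already help. The sketch below assumes each line starts with an RFC 3339 UTC timestamp ending in "Z" (as when logs are fetched with timestamps enabled) and that each pod's lines are already in time order. The function name, pod names, and sample lines are made up for illustration.

```python
import heapq
import re

# Timestamp prefix such as "2021-03-04T10:00:00.5Z "; the fraction is optional.
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,9}))?Z ")


def _sort_key(line):
    match = _TS.match(line)
    if match is None:
        raise ValueError(f"line has no UTC timestamp prefix: {line!r}")
    seconds, fraction = match.group(1), match.group(2) or ""
    # Pad the fraction so that "00Z" sorts before "00.5Z".
    return seconds, fraction.ljust(9, "0")


def merge_pod_logs(logs_by_pod):
    """Merge per-pod log lines into one time-ordered list tagged by pod."""
    streams = [
        [(_sort_key(line), pod, line) for line in lines]
        for pod, lines in logs_by_pod.items()
    ]
    # heapq.merge relies on every input stream being sorted already.
    return [f"[{pod}] {line}" for _, pod, line in heapq.merge(*streams)]


logs = {
    "kcm-0": ["2021-03-04T10:00:00Z start", "2021-03-04T10:00:02.5Z attach ok"],
    "kcm-1": ["2021-03-04T10:00:00.5Z lease lost", "2021-03-04T10:00:03Z exit"],
}
merged = merge_pod_logs(logs)
```

      Here merged interleaves the two pods in time order: the kcm-0 start line, the kcm-1 lease line, the kcm-0 attach line, then the kcm-1 exit line. Multi-line entries such as stack traces are not handled; continuation lines without a timestamp raise an error in this sketch.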

          People

            rh-gs-gcharot Gregory Charot
            rhn-support-dhardie Duncan Hardie