Type: Feature
Resolution: Done
Priority: Major
Progress: 0% To Do, 0% In Progress, 100% Done
We need to continue maintaining specific areas within storage; this feature captures that effort and tracks it across releases.
Goals
- To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
- When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
- Reduce storage test flakiness so we can spot real bugs in our CI.
Requirements
| Requirement | Notes | isMvp? |
| --- | --- | --- |
| Telemetry | No | |
| Certification | No | |
| API metrics | No | |
Out of Scope
n/a
Background and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets / BZs low.
Assumptions
Customer Considerations
Documentation Considerations
- Target audience: internal
- Updated content: none at this time.
Notes
In progress:
- CSI certification flakes a lot. We should fix it before we start testing migration.
- In progress (API server restarts...) https://bugzilla.redhat.com/show_bug.cgi?id=1865857
- Get local-storage-operator and AWS EBS CSI driver operator logs in must-gather (OLM-managed operators are not included there)
- In progress for LSO (must-gather script being included in image) https://bugzilla.redhat.com/show_bug.cgi?id=1756096
- CI flakes:
- Configurable timeouts for e2e tests (see the sketch after this list)
- Azure is slow and times out often
- Cinder times out formatting volumes
- AWS resize test times out
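
A minimal sketch of what a configurable e2e timeout could look like; the `-provision-timeout` flag and the `waitForCondition` helper are hypothetical, and the real openshift/origin suites wire their timeouts differently.

```go
package e2e

import (
	"context"
	"errors"
	"flag"
	"testing"
	"time"
)

// Hypothetical flag so slow platforms (Azure, Cinder) can be given more
// time without editing the test itself.
var provisionTimeout = flag.Duration("provision-timeout", 5*time.Minute,
	"how long to wait for a volume to be provisioned")

// waitForCondition polls check until it returns true or ctx expires.
// In a real test, check would query the API server for the PVC phase.
func waitForCondition(ctx context.Context, check func() bool) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for condition")
		case <-ticker.C:
		}
	}
}

func TestProvisionVolume(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), *provisionTimeout)
	defer cancel()

	// Placeholder check; a real test would poll the PVC until it is Bound.
	pvcBound := func() bool { return true }

	if err := waitForCondition(ctx, pvcBound); err != nil {
		t.Fatalf("volume not provisioned within %v: %v", *provisionTimeout, err)
	}
}
```
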
High prio:
- Env. check tool for VMware - users often misconfigure permissions there and blame OpenShift. If we had a tool they could run, it could report better errors.
- Should it be part of the installer?
- Spike exists
- Add / use cloud API call metrics (see the sketch after this list)
- Helps customers to understand why things are slow
- Helps build cop to understand a flake
- With a post-install step that filters data from Prometheus that’s still running in the CI job.
- Ideas:
- Cloud is throttling X% of API calls longer than Y seconds
- Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
- Capture metrics of operations that are stuck and won’t finish.
- Sweep the operation map in the operation executor?
- Report the operation metric into the highest bucket after the bucket threshold (i.e. if 10 minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
- Ask the monitoring team?
- Include in CSI drivers too.
- With alerts too
- Report events for cloud issues
- E.g. cloud API reports weird attach/provision error (e.g. due to outage)
- Which volume plugins do users actually use the most? https://issues.redhat.com/browse/STOR-324
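
A minimal sketch of the cloud API call metric idea above, using a Prometheus histogram; the metric name, bucket layout, and helper functions are assumptions for illustration. Stuck operations are counted into the highest bucket instead of waiting for completion, as proposed.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cloudAPIDuration records how long cloud provider API calls take.
// The buckets end at 10 minutes; an operation that is still running after
// the last bucket boundary is counted there so dashboards and alerts see
// it without waiting for completion. Metric name is hypothetical.
var cloudAPIDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "cloudprovider_api_request_duration_seconds",
		Help:    "Duration of cloud provider API calls (attach, detach, provision, mount, resize, ...).",
		Buckets: []float64{1, 5, 10, 30, 60, 120, 300, 600},
	},
	[]string{"operation"},
)

func init() {
	prometheus.MustRegister(cloudAPIDuration)
}

// ObserveCall times a cloud API call and records its duration.
func ObserveCall(operation string, call func() error) error {
	start := time.Now()
	err := call()
	cloudAPIDuration.WithLabelValues(operation).Observe(time.Since(start).Seconds())
	return err
}

// ReportStuck records an operation that has exceeded the last bucket
// boundary (10 minutes) without finishing.
func ReportStuck(operation string) {
	cloudAPIDuration.WithLabelValues(operation).Observe(600)
}
```
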
Unsorted
- As the number of storage operators grows, it would be good to have a Grafana board for storage operators
- CSI driver metrics (from CSI sidecars + the driver itself + its operator?)
- CSI migration?
- Get aggregated logs in cluster
- They're rotated too soon
- No logs from dead / restarted pods
- No tools to combine logs from multiple pods (e.g. 3 controller managers) - see the sketch at the end of these notes
- What storage issues do customers have? Storage was 22% of all issues.
- Insufficient docs?
- Probably garbage
- Document basic storage troubleshooting for our support teams
- What logs are useful when, what log level to use
- This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.
- Common vSphere errors, how to debug and fix them.
- Document sig-storage flake handling - not all failed [sig-storage] tests are ours
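
A minimal sketch of combining logs from multiple pods with client-go, as mentioned in the "Get aggregated logs" item above; the namespace and label selector are assumptions for illustration.

```go
package main

import (
	"bufio"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	// Namespace and label selector are example values.
	pods, err := client.CoreV1().Pods("openshift-kube-controller-manager").List(ctx,
		metav1.ListOptions{LabelSelector: "app=kube-controller-manager"})
	if err != nil {
		panic(err)
	}

	// Print each pod's logs prefixed with the pod name so the streams can
	// be interleaved and sorted by timestamp afterwards.
	for _, pod := range pods.Items {
		req := client.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name,
			&corev1.PodLogOptions{Timestamps: true})
		stream, err := req.Stream(ctx)
		if err != nil {
			fmt.Printf("%s: cannot get logs: %v\n", pod.Name, err)
			continue
		}
		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			fmt.Printf("%s %s\n", pod.Name, scanner.Text())
		}
		stream.Close()
	}
}
```
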