Type: Feature
Resolution: Done
Priority: Major
Progress: 0% To Do, 0% In Progress, 100% Done
We need to continue maintaining specific areas within storage; this feature captures that effort and tracks it across releases.
Goals
- To allow OCP users and cluster admins to detect problems early and with as little interaction with Red Hat as possible.
- When Red Hat is involved, make sure we have all the information we need from the customer, i.e. in metrics / telemetry / must-gather.
- Reduce storage test flakiness so we can spot real bugs in our CI.
Requirements
| Requirement | Notes | isMvp? |
| --- | --- | --- |
| Telemetry | No | |
| Certification | No | |
| API metrics | No | |
Out of Scope
n/a
Background and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets / BZs low.
Assumptions
Customer Considerations
Documentation Considerations
- Target audience: internal
- Updated content: none at this time.
Notes
In progress:
- CSI certification flakes a lot. We should fix it before we start testing migration.
- In progress (API server restarts...) https://bugzilla.redhat.com/show_bug.cgi?id=1865857
- Get local-storage-operator and AWS EBS CSI driver operator logs in must-gather (OLM-managed operators are not included there)
- In progress for LSO (must-gather script being included in image) https://bugzilla.redhat.com/show_bug.cgi?id=1756096
- CI flakes:
- Configurable timeouts for e2e tests (see the sketch after this list)
- Azure is slow and times out often
- Cinder times out formatting volumes
- AWS resize test times out
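
A minimal sketch of what a configurable e2e timeout could look like; the `-provision-timeout` flag and the `waitForCondition` helper are hypothetical, and the real openshift/origin suites wire their timeouts differently.

```go
package e2e

import (
	"context"
	"errors"
	"flag"
	"testing"
	"time"
)

// Hypothetical flag so slow platforms (Azure, Cinder) can be given more
// time without editing the test itself.
var provisionTimeout = flag.Duration("provision-timeout", 5*time.Minute,
	"how long to wait for a volume to be provisioned")

// waitForCondition polls check until it returns true or ctx expires.
// In a real test, check would query the API server for the PVC phase.
func waitForCondition(ctx context.Context, check func() bool) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for condition")
		case <-ticker.C:
		}
	}
}

func TestProvisionVolume(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), *provisionTimeout)
	defer cancel()

	// Placeholder check; a real test would poll the PVC until it is Bound.
	pvcBound := func() bool { return true }

	if err := waitForCondition(ctx, pvcBound); err != nil {
		t.Fatalf("volume not provisioned within %v: %v", *provisionTimeout, err)
	}
}
```
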
High prio:
- Env. check tool for VMware - users often misconfigure permissions there and blame OpenShift. If we had a tool they could run, it could report better errors.
- Should it be part of the installer?
- Spike exists
- Add / use cloud API call metrics (see the sketch after this list)
- Helps customers to understand why things are slow
- Helps build cop to understand a flake
- With a post-install step that filters data from Prometheus that’s still running in the CI job.
- Ideas:
- Cloud is throttling X% of API calls longer than Y seconds
- Attach / detach / provisioning / deletion / mount / unmount / resize takes longer than X seconds?
- Capture metrics of operations that are stuck and won’t finish.
- Sweep the operation map in the operation executor?
- Report the operation metric into the highest bucket after the bucket threshold (i.e. if 10 minutes is the last bucket, report an operation into this bucket after 10 minutes and don’t wait for its completion)?
- Ask the monitoring team?
- Include in CSI drivers too.
- With alerts too
- Report events for cloud issues
- E.g. cloud API reports weird attach/provision error (e.g. due to outage)
- Which volume plugins do users actually use the most? https://issues.redhat.com/browse/STOR-324
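
A minimal sketch of the cloud API call metric idea above, using a Prometheus histogram; the metric name, bucket layout, and helper functions are assumptions for illustration. Stuck operations are counted into the highest bucket instead of waiting for completion, as proposed.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cloudAPIDuration records how long cloud provider API calls take.
// The buckets end at 10 minutes; an operation that is still running after
// the last bucket boundary is counted there so dashboards and alerts see
// it without waiting for completion. Metric name is hypothetical.
var cloudAPIDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "cloudprovider_api_request_duration_seconds",
		Help:    "Duration of cloud provider API calls (attach, detach, provision, mount, resize, ...).",
		Buckets: []float64{1, 5, 10, 30, 60, 120, 300, 600},
	},
	[]string{"operation"},
)

func init() {
	prometheus.MustRegister(cloudAPIDuration)
}

// ObserveCall times a cloud API call and records its duration.
func ObserveCall(operation string, call func() error) error {
	start := time.Now()
	err := call()
	cloudAPIDuration.WithLabelValues(operation).Observe(time.Since(start).Seconds())
	return err
}

// ReportStuck records an operation that has exceeded the last bucket
// boundary (10 minutes) without finishing.
func ReportStuck(operation string) {
	cloudAPIDuration.WithLabelValues(operation).Observe(600)
}
```
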
Unsorted
- As the number of storage operators grows, it would be good to have a Grafana board for storage operators
- CSI driver metrics (from CSI sidecars + the driver itself + its operator?)
- CSI migration?
- Get aggregated logs in cluster
- They're rotated too soon
- No logs from dead / restarted pods
- No tools to combine logs from multiple pods (e.g. 3 controller managers) - see the sketch at the end of these notes
- What storage issues do customers have? Storage was 22% of all issues.
- Insufficient docs?
- Probably garbage
- Document basic storage troubleshooting for our support teams
- What logs are useful when, what log level to use
- This has been discussed during the GSS weekly team meeting; however, it would be beneficial to have this documented.
- Common vSphere errors, how to debug and fix them.
- Document sig-storage flake handling - not all failed [sig-storage] tests are ours
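
A minimal sketch of combining logs from multiple pods with client-go, as mentioned in the "Get aggregated logs" item above; the namespace and label selector are assumptions for illustration.

```go
package main

import (
	"bufio"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	// Namespace and label selector are example values.
	pods, err := client.CoreV1().Pods("openshift-kube-controller-manager").List(ctx,
		metav1.ListOptions{LabelSelector: "app=kube-controller-manager"})
	if err != nil {
		panic(err)
	}

	// Print each pod's logs prefixed with the pod name so the streams can
	// be interleaved and sorted by timestamp afterwards.
	for _, pod := range pods.Items {
		req := client.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name,
			&corev1.PodLogOptions{Timestamps: true})
		stream, err := req.Stream(ctx)
		if err != nil {
			fmt.Printf("%s: cannot get logs: %v\n", pod.Name, err)
			continue
		}
		scanner := bufio.NewScanner(stream)
		for scanner.Scan() {
			fmt.Printf("%s %s\n", pod.Name, scanner.Text())
		}
		stream.Close()
	}
}
```
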