- Epic
- Resolution: Unresolved
- Normal
- None
- None
- Ensure all HA components are not degraded by design during upgrades
- BU Product Work
- False
- False
- In Progress
- OCPSTRAT-1064 - Improve upgrades - phase 3 - Control plane & worker node independence
- Impediment
- 33% To Do, 33% In Progress, 33% Done
- Undefined
Epic Goal
- Eliminate the gap between measured availability and Available=true
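One way to make this gap concrete is to compare, sample by sample, what an external probe observed with what the operator reported. The sketch below is illustrative only: `Sample` and `availability_gap` are hypothetical names, and it assumes probe results and reported conditions have already been collected at the same points in time.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One observation point (hypothetical structure for illustration)."""
    probe_ok: bool            # an external probe of the component succeeded
    reported_available: bool  # the operator reported Available=True


def availability_gap(samples):
    """Return (measured fraction, reported fraction, mismatch count)."""
    if not samples:
        raise ValueError("at least one sample is required")
    total = len(samples)
    measured = sum(s.probe_ok for s in samples) / total
    reported = sum(s.reported_available for s in samples) / total
    # A mismatch is any sample where the probe and the condition disagree.
    mismatches = sum(s.probe_ok != s.reported_available for s in samples)
    return measured, reported, mismatches


samples = [
    Sample(probe_ok=True, reported_available=True),
    Sample(probe_ok=True, reported_available=False),   # false alarm
    Sample(probe_ok=False, reported_available=True),   # missed outage
    Sample(probe_ok=True, reported_available=True),
]
measured, reported, mismatches = availability_gap(samples)
```

Note that the two fractions can be equal while individual samples still disagree, which is why the sketch counts mismatches separately rather than only subtracting the totals.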
Why is this important?
- Today it is not uncommon, even in CI jobs, for multiple operators to blip through Degraded=True or Available=False conditions
- We should assume that if our CI jobs do this, then customer environments, which have higher levels of chaos, will fare even worse
- Multiple customers have told us that they pursued rolling back upgrades because the cluster reported portions of itself as Degraded or Unavailable when they actually were not
- Since our product is self-hosted, we can reasonably expect that the instability we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way; we are just better at detecting it.
Scenarios
- In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
- Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of the upgrade
- Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (e.g., metal's DHCP server is a singleton)
- Address all identified issues
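A minimal sketch of the condition check such a CI job might perform, assuming the job can obtain the ClusterOperator list as JSON (for example from `oc get clusteroperators -o json`). The `find_violations` helper and the sample data are hypothetical. A single snapshot cannot prove anything about the whole upgrade, so a real job would need to poll or watch for its full duration.

```python
# Expected steady-state values taken from the scenario above.
EXPECTED = {"Degraded": "False", "Available": "True"}


def find_violations(cluster_operators):
    """Return (operator, condition type, observed status) for each mismatch.

    `cluster_operators` is a dict shaped like a ClusterOperator list:
    items[].metadata.name and items[].status.conditions[].{type,status}.
    """
    violations = []
    for item in cluster_operators.get("items", []):
        name = item["metadata"]["name"]
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        for cond_type, want in EXPECTED.items():
            # A missing condition is treated as "Unknown", which also fails.
            got = conditions.get(cond_type, "Unknown")
            if got != want:
                violations.append((name, cond_type, got))
    return violations


# Hypothetical snapshot: dns is blipping Degraded=True, console is healthy.
snapshot = {
    "items": [
        {
            "metadata": {"name": "dns"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "True"},
            ]},
        },
        {
            "metadata": {"name": "console"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Degraded", "status": "False"},
            ]},
        },
    ]
}
violations = find_violations(snapshot)
```

Treating a missing condition as a failure is a design choice made for this sketch; a job could equally decide to report it separately.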
Acceptance Criteria
- openshift/enhancements CONVENTIONS outlines these requirements
- CI - Release blocking jobs include these new/updated tests
- Release Technical Enablement - N/A; if we do this, we should need no docs
- No outstanding identified issues
Dependencies (internal and external)
- ...
Previous Work (Optional):
- Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure all teams addressed them. That list is in the query below. Teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which may be deferred to 4.10.
https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F
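The search criteria encoded in that URL can be read off by decoding its query string. The sketch below uses only the search-related parameters copied from the link above; `describe_search` is a hypothetical helper, and the reading of `casesubstring` as a case-sensitive substring match is an assumption about Bugzilla's operator naming.

```python
from urllib.parse import parse_qs

# Search-related parameters copied from the Bugzilla URL above.
QUERY = (
    "f1=longdesc&f2=cf_environment&j_top=OR"
    "&o1=casesubstring&o2=casesubstring"
    "&v1=should%20not%20change%20condition%2F"
    "&v2=should%20not%20change%20condition%2F"
)


def describe_search(query_string):
    """Return the top-level joiner and a list of (field, operator, value)."""
    params = parse_qs(query_string)
    clauses = []
    index = 1
    # Clauses are numbered f1/o1/v1, f2/o2/v2, and so on.
    while f"f{index}" in params:
        clauses.append((
            params[f"f{index}"][0],
            params[f"o{index}"][0],
            params[f"v{index}"][0],
        ))
        index += 1
    return params["j_top"][0], clauses


joiner, clauses = describe_search(QUERY)
for field, operator, value in clauses:
    print(f"{field} {operator} {value!r}")
```

Under that assumption, the query selects bugs whose `longdesc` or `cf_environment` field contains the string `should not change condition/`.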
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- DEV - Tests in place
- DEV - No outstanding failing tests
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
account is impacted by
- OCPBUGS-38666 clusteroperator/dns blips Degraded=True during upgrade test (New)
- OCPBUGS-38668 clusteroperator/network blips Degraded=True during upgrade test (New)
- OCPBUGS-38675 clusteroperator/authentication blips Degraded=True in CI jobs (New)
- OCPBUGS-38676 clusteroperator/console blips Degraded=True during CI job run (New)
- OCPBUGS-38678 clusteroperator/kube-apiserver blips Degraded=True outside of single node upgrade window (New)
- OCPBUGS-38679 clusteroperator/openshift-samples blips Degraded=True outside of single node upgrade window (New)
- OCPBUGS-38684 clusteroperator/network blips Degraded=True during serial test (New)
- OCPBUGS-38750 clusteroperator/dns blips Degraded=True during non-upgrade test (New)
- OCPBUGS-39026 clusteroperator/monitoring blips Degraded=True during upgrade test (ASSIGNED)
- OCPBUGS-42837 clusteroperator/cloud-controller-manager blips Degraded=True during upgrade test (ASSIGNED)
- OCPBUGS-38749 clusteroperator/machine-config blips Degraded=True during non-upgrade job run (ON_QA)
- OCPBUGS-38667 clusteroperator/image-registry blips Degraded=True during upgrade test (Verified)
- OCPBUGS-39199 clusteroperator/machine-config blips Degraded=True during upgrade test (Verified)
- OCPBUGS-38664 clusteroperator/dns blips Degraded=True during upgrade test (Closed)
- OCPBUGS-41527 ingress operator changed condition/Available to false during non-upgrade job (Closed)
clones
- OTA-700 Ensure availability of all HA components during upgrades (Closed)
is depended on by
- OCPSTRAT-836 Improve OpenShift upgrade progress feedback for cluster administrators (New)
is related to
- OCPBUGS-38659 clusteroperator/etcd blips Degraded=True during upgrade test (New)
- OCPBUGS-38661 clusteroperator/kube-apiserver blips Degraded=True during upgrade test (New)
- OCPBUGS-38662 clusteroperator/kube-controller-manager blips Degraded=True during upgrade test (New)
- OCPBUGS-38663 clusteroperator/kube-scheduler blips Degraded=True during upgrade test (New)
- OCPBUGS-42870 clusteroperator/openshift-controller-manager blips Degraded=True during upgrade test (New)
- OCPBUGS-42872 clusteroperator/cloud-credential blips Degraded=True during upgrade test (New)
- OCPBUGS-45921 clusteroperator/ingress blips Degraded=True during hypershift conformance test (New)
- OCPBUGS-44332 clusteroperator/machine-api blips Degraded=True during CI Job run (ASSIGNED)
- OCPBUGS-42875 clusteroperator/cluster-autoscaler blips Degraded=True during upgrade test (ASSIGNED)
- OTA-980 Is the Failing=True status condition is a good indicator for admins? (To Do)
links to
- QE Tracker | New | Unassigned
- TE Tracker | New | Unassigned