Type: Epic
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels:

Epic Name:
Ensure all HA components are not degraded by design during upgrades
Epic Status:
In Progress
Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-1064 Improve upgrades - phase 3 - Control plane & worker node independence
Hierarchy Progress Bar:

33% To Do, 0% In Progress, 67% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Size:
None

Target Version:
None
Release Blocker:
None

Epic Goal

Eliminate the gap between measured availability and Available=true

Why is this important?

Today it's not uncommon, even in CI jobs, for multiple operators to blip through Degraded=True or Available=False conditions.
If our CI jobs do this, we should assume that things will be even worse in customer environments, which operate with higher levels of chaos.
Multiple customers have told us that they pursued rolling back upgrades because the cluster reported portions of itself as Degraded or Unavailable when they actually were not.
Since our product is self-hosted, we can reasonably expect that the instability we experience on our platform workloads (kube-apiserver, console, authentication, service availability) will also impact customer workloads that run exactly the same way; we're just better at detecting it.

Scenarios

In all of the following, assume standard 3 master 0 worker or 3 master 2+ worker topologies
Add/update CI jobs which ensure 100% Degraded=False and Available=True for the duration of the upgrade
Add/update CI jobs which measure availability of all components which are not explicitly defined as non-HA (e.g., metal's DHCP server is a singleton)
Address all identified issues
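
As a rough illustration of the first scenario, the check a CI job performs could be sketched as follows. This is a minimal sketch, not the actual openshift/origin test: the ConditionSample record, the find_violations function, and the exempt_operators allow-list are all hypothetical names introduced here, and the samples are assumed to have been collected elsewhere (for example, by watching ClusterOperator objects during the upgrade).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConditionSample:
    """One observation of a ClusterOperator condition (hypothetical record type)."""
    operator: str       # e.g. "dns"
    condition: str      # "Available" or "Degraded"
    status: str         # "True" or "False"
    observed_at: datetime


# Healthy value for each condition, per the scenario above.
EXPECTED = {"Available": "True", "Degraded": "False"}


def find_violations(samples, exempt_operators=frozenset()):
    """Return every sample whose status differs from the healthy value.

    exempt_operators is a hypothetical allow-list for components that are
    explicitly defined as non-HA and therefore excluded from the check.
    """
    return [
        s for s in samples
        if s.condition in EXPECTED
        and s.status != EXPECTED[s.condition]
        and s.operator not in exempt_operators
    ]


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0)
    samples = [
        ConditionSample("dns", "Degraded", "False", start),
        ConditionSample("dns", "Degraded", "True", start + timedelta(minutes=5)),
        ConditionSample("dns", "Degraded", "False", start + timedelta(minutes=6)),
        ConditionSample("console", "Available", "True", start),
    ]
    # A job following the scenario would fail if this list is non-empty.
    for v in find_violations(samples):
        print(f"{v.operator}: {v.condition}={v.status} at {v.observed_at.isoformat()}")
```

Under these assumptions, a job would fail whenever find_violations returns a non-empty list for the upgrade interval. Keeping the exemptions as an explicit argument makes every tolerated blip visible and reviewable rather than silently ignored.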

Acceptance Criteria

openshift/enhancements CONVENTIONS outlines these requirements
CI - Release blocking jobs include these new/updated tests
Release Technical Enablement - N/A; if we do this, we should need no docs
No outstanding identified issues

Dependencies (internal and external)

Previous Work (Optional):

Clayton, David, and Trevor identified many issues early in 4.8 development but were unable to ensure that all teams addressed them. That list is in this query. Teams will be asked to address everything on this list as a 4.9 blocker+ bug, and we will re-evaluate status closer to 4.9 code freeze to see which may be deferred to 4.10.
https://bugzilla.redhat.com/buglist.cgi?columnlist=product%2Ccomponent%2Cassigned_to%2Cbug_severity%2Ctarget_release%2Cbug_status%2Cresolution%2Cshort_desc%2Cchangeddate&f1=longdesc&f2=cf_environment&j_top=OR&list_id=12012976&o1=casesubstring&o2=casesubstring&query_based_on=ClusterOperator%20conditions&query_format=advanced&v1=should%20not%20change%20condition%2F&v2=should%20not%20change%20condition%2F

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
DEV - Tests in place
DEV - No outstanding failing tests
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

account is impacted by

OCPBUGS-38666 clusteroperator/dns blips Degraded=True during upgrade test

OCPBUGS-38675 clusteroperator/authentication blips Degraded=True in CI jobs

OCPBUGS-38676 clusteroperator/console blips Degraded=True during CI job run

OCPBUGS-38678 clusteroperator/kube-apiserver blips Degraded=True outside of single node upgrade window

OCPBUGS-38679 clusteroperator/openshift-samples blips Degraded=True outside of single node upgrade window

OCPBUGS-38750 clusteroperator/dns blips Degraded=True during non-upgrade test

CORENET-6605 clusteroperator/network blips Degraded=True during serial test

Review

OCPBUGS-42837 clusteroperator/cloud-controller-manager blips Degraded=True during upgrade test

POST

OCPBUGS-39026 clusteroperator/monitoring blips Degraded=True during upgrade test

Verified

OCPBUGS-38664 clusteroperator/dns blips Degraded=True during upgrade test

Closed

OCPBUGS-38667 clusteroperator/image-registry blips Degraded=True during upgrade test

Closed

OCPBUGS-38749 clusteroperator/machine-config blips Degraded=True during non-upgrade job run

Closed

OCPBUGS-39199 clusteroperator/machine-config blips Degraded=True during upgrade test

Closed

OCPBUGS-41527 ingress operator changed condition/Available to false during non-upgrade job

Closed

blocks

OCPSTRAT-2484 [phase-1]Improve upgrade experience - fix false alarms in ClusterOperator status

In Progress

clones

OTA-700 Ensure availability of all HA components during upgrades

Closed

is depended on by

OCPSTRAT-836 Improve OpenShift upgrade progress feedback for cluster administrators

is related to

OCPBUGS-38661 clusteroperator/kube-apiserver blips Degraded=True during upgrade test

OCPBUGS-38663 clusteroperator/kube-scheduler blips Degraded=True during upgrade test

OCPBUGS-42870 clusteroperator/openshift-controller-manager blips Degraded=True during upgrade test

OCPBUGS-42872 clusteroperator/cloud-credential blips Degraded=True during upgrade test

OCPBUGS-42875 clusteroperator/cluster-autoscaler blips Degraded=True during upgrade test

ASSIGNED

OCPBUGS-45921 clusteroperator/ingress blips Degraded=True during hypershift conformance test

POST

OCPBUGS-38662 clusteroperator/kube-controller-manager blips Degraded=True during upgrade test

POST

OCPBUGS-44332 clusteroperator/machine-api blips Degraded=True during CI Job run

POST

OCPBUGS-38659 clusteroperator/etcd blips Degraded=True during upgrade test

Verified

OCPBUGS-66209 clusteroperator/machine-config blips Degraded=True in CI jobs

Verified

OCPBUGS-66225 clusteroperator/image-registry blips Degraded=True during upgrade test

Verified

OTA-980 Is the Failing=True status condition is a good indicator for admins?

To Do

links to

openshift/api#2469: NO-JIRA: New rules about CO's conditions

openshift/cluster-image-registry-operator#1055: TRT-1576: Ensure affinity rules are applied for >=2 replicas

openshift/origin#28735: TRT-1576: Fail if operator has Available=False unless in upgrade window

openshift/origin#28847: TRT-1691: Revert #28735 "TRT-1576: Fail if operator has Available=False unless in upgrade window"

openshift/origin#28851: Revert "TRT-1691: Revert #28735 "TRT-1576: Fail if operator has Available=False unless in upgrade window""

openshift/origin#29002: TRT-1578: Enable operator degraded check

openshift/origin#29175: TRT-1575: Add more exceptions for operator degraded cases

openshift/origin#29273: TRT-1578: Add more exception for operator degraded cases

openshift/origin#29354: TRT-1575: Use catchall card for operator degraded exceptions for major operators

openshift/origin#29485: TRT-1575: Add a few more exception outside of upgraded window

openshift/origin#29566: TRT-1575: Fail the test when an expected operator goes to degraded

(9 account is impacted by, 1 blocks, 1 clones, 1 is depended on by, 12 is related to, 11 links to)

1.	QE Tracker		New		Unassigned
2.	TE Tracker		New		Unassigned

Assignee: Unassigned

Reporter: Scott Dodson

Need Info From: None

Votes: 1

Watchers: 12

Created: 2023/12/08 6:27 PM

Updated: 2025/12/02 12:09 AM
