
Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.13, 4.12, 4.14, 4.15, 4.16
Component/s: Cluster Version Operator
Labels:
- pre-merge

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
3
Severity:
Important
Regression:
Yes

Target Backport Versions:

4.13, 4.12, 4.14, 4.15
Target Version:

4.16.0
Release Blocker:
Rejected
Sprint:
OTA 246, OTA 247
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

* Previously, Cluster Version Operator (CVO) changes that were introduced in {product-title} 4.15.0, 4.14.0, 4.13.17, and 4.12.43 caused failing risk evaluations to block the CVO from fetching new update recommendations. When the risk evaluations failed, the bug caused the CVO to overlook the update recommendation service. With this release, the CVO continues to poll the update recommendation service, regardless of whether update risks are being successfully evaluated and the issue has been resolved. (link:https://issues.redhat.com/browse/OCPBUGS-25708[*~~OCPBUGS-25708~~*])


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Changes made for faster risk cache-warming (the ~~OCPBUGS-19512~~ series) introduced an unfortunate cycle:

1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like graph-data#4528.
4. Cases:

- (a) Before the cache-warming changes, and also after this bug's fix, clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
- (b) Clusters with the cache-warming changes but without this bug's fix say "it's been a long time since we pulled fresh Cincinnati information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.
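
The difference between cases (a) and (b) comes down to which event resets the fetch throttle. The following is a toy model, not the CVO's actual code. It assumes a single `last_attempt` timestamp gates Cincinnati pulls, a 220-second minimum interval between pulls, and a one-second re-evaluation requeue while a risk is Recommended=Unknown. The function and variable names are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Toy model of the fetch throttle. Names and intervals are illustrative
# assumptions, not taken from the CVO source.
#
# count_fetches MODE SECONDS
#   MODE=buggy: every failed PromQL evaluation also bumps last_attempt.
#   MODE=fixed: only an actual Cincinnati pull bumps last_attempt.
# Prints how many Cincinnati pulls happen across SECONDS one-second ticks.
count_fetches() {
  mode=$1
  duration=$2
  min_interval=220   # assumed minimum seconds between Cincinnati pulls
  last_attempt=-1    # -1 means "never attempted"
  fetches=0
  now=0
  while [ "$now" -lt "$duration" ]; do
    if [ "$last_attempt" -lt 0 ] || [ $((now - last_attempt)) -ge "$min_interval" ]; then
      # Throttle expired (or first run): pull fresh data from Cincinnati.
      fetches=$((fetches + 1))
      last_attempt=$now
    elif [ "$mode" = buggy ]; then
      # Re-evaluating the cached, broken PromQL fails and resets the
      # throttle, so the interval never expires while requeues keep coming.
      last_attempt=$now
    fi
    now=$((now + 1))
  done
  echo "$fetches"
}

echo "buggy: $(count_fetches buggy 1800) pull(s) in 30 minutes"
echo "fixed: $(count_fetches fixed 1800) pull(s) in 30 minutes"
```

Under these assumptions, the buggy mode pulls once at startup and never again, while the fixed mode keeps pulling on schedule regardless of evaluation failures.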

Version-Release number of selected component (if applicable):

The regression went back via:

Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.

How reproducible:

Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.

Steps to Reproduce:

1. Launch a cluster.
2. Point it at dummy Cincinnati data, as described in ~~OTA-520~~. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .

Actual results:

Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).

Expected results:

Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).

Additional info:

Identification

To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:

$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))'
{
  "lastTransitionTime": "2023-12-15T22:00:45Z",
  "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958",
  "reason": "EvaluationFailed",
  "status": "Unknown",
  "type": "Recommended"
}

To confirm exposure in-cluster, and to distinguish this bug from other EvaluationFailed invalid PromQL issues, look for Cincinnati retrieval attempts in the CVO logs. Example from a healthy cluster:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:36:39.783530       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:36:39.831358       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:40:19.674925       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:40:19.727998       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:43:59.567369       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:43:59.620315       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:47:39.457582       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:47:39.509505       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
I1221 20:51:19.348286       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:51:19.401496       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"

This shows fetch lines every few minutes. An exposed cluster, by contrast, shows only evaluation-requeue lines and no fetches:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:50:10.165101       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:11.166170       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:12.166314       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:13.166517       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:14.166847       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:15.167737       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:16.168486       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:17.169417       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:18.169576       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:19.170544       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail
...no hits...
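
The comparison above can be scripted against a saved log. This is a minimal sketch that relies only on the two message fragments visible in the excerpts above. The function name `classify_cvo_log` and the file name `cvo.log` are hypothetical, and a "stalled" result is only a hint that should be checked against a long enough log window.

```shell
#!/bin/sh
# classify_cvo_log FILE
# Prints "fetching" if FILE contains at least one Cincinnati retrieval line,
# "stalled" if it contains evaluation requeues but no retrievals,
# and "inconclusive" if it contains neither.
classify_cvo_log() {
  if grep -q 'request updates from' "$1"; then
    echo fetching
  elif grep -q 'Requeue available-update evaluation' "$1"; then
    echo stalled
  else
    echo inconclusive
  fi
}

# Example usage against a log captured with the oc command shown above:
#   oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m > cvo.log
#   classify_cvo_log cvo.log
```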

Recovery

If bitten, the remediation is to address the invalid PromQL. For example, we fixed the AROBrokenDNSMasq expression in graph-data#4528. After that, the local cluster administrator should restart their CVO, such as with:

$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods

blocks

OCPBUGS-25949 CVO should continue to periodically fetch upstream Cincinnati despite Recommended=Unknown risks

Closed

is cloned by

OCPBUGS-25949 CVO should continue to periodically fetch upstream Cincinnati despite Recommended=Unknown risks

Closed

links to

KCS 7053841: Upgrade path unavailable due to invalid PromQL result in RHOCP4

openshift/cluster-version-operator#1009: OCPBUGS-25708: pkg/cvo/availableupdates: Only bump LastAttempt on Cincinnati pulls

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update
