Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-25708

CVO should continue to periodically fetch upstream Cincinnati despite Recommended=Unknown risks

    XMLWordPrintable

Details

    • Important
    • Yes
    • 3
    • OTA 246, OTA 247
    • 2
    • Rejected
    • False
    • Hide

      None

      Show
      None
    • Hide
      The Cluster Version Operator (CVO) continually retrieves update recommendations and evaluates known conditional update risks against the current cluster state. CVO changes which shipped 4.15.0, 4.14.0, 4.13.17, and 4.12.43 caused failing risk evaluations to block the CVO from fetching new update recommendations. When the risk evaluations were failing because the update recommendation service served a poorly defined update risk, the bug could cause the CVO to fail to notice the update recommendation service serving an improved risk declaration. This release ships a fix, and now the CVO will continue to poll the update recommendation service, regardless of whether update risks are being successfully evaluated or not.
      Show
      The Cluster Version Operator (CVO) continually retrieves update recommendations and evaluates known conditional update risks against the current cluster state. CVO changes which shipped 4.15.0, 4.14.0, 4.13.17, and 4.12.43 caused failing risk evaluations to block the CVO from fetching new update recommendations. When the risk evaluations were failing because the update recommendation service served a poorly defined update risk, the bug could cause the CVO to fail to notice the update recommendation service serving an improved risk declaration. This release ships a fix, and now the CVO will continue to poll the update recommendation service, regardless of whether update risks are being successfully evaluated or not.
    • Bug Fix
    • In Progress

    Description

      Description of problem:

      Changes made for faster risk cache-warming (the OCPBUGS-19512 series) introduced an unfortunate cycle:

      1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
      2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
      3. Cincinnati PromQL fixed, like graph-data#4528.
      4. Cases:

        • (a) Before the cache-warming changes, and also after this bug's fix, Clusters pick up the fixed PromQL, try to evaluate, and start succeeding. Hooray!
        • (b) Clusters with the cache-warming changes but without this bug's fix say "it's been a long time since we pulled fresh Cincinanti information, but it has not been long since my last attempt to eval this broken PromQL, so let me skip the Cincinnati pull and re-eval that old PromQL", which fails. Re-eval-and-fail loop continues.

      Version-Release number of selected component (if applicable):

      The regression went back via:

      Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.

      How reproducible:

      Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.

      Steps to Reproduce:

      1. Launch a cluster.
      2. Point it at dummy Cincinnati data, as described in OTA-520. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
      3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
      4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
      5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
      6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .

      Actual results:

      Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).

      Expected results:

      Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).

      Additional info:

      Identification

      To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:

      $ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))'
      {
        "lastTransitionTime": "2023-12-15T22:00:45Z",
        "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958",
        "reason": "EvaluationFailed",
        "status": "Unknown",
        "type": "Recommended"
      } 

      To confirm in-cluster vs. other EvaluationFailed invalid PromQL issues, you can look for Cincinnati retrieval attempts in CVO logs. Example from a healthy cluster:

      $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
      I1221 20:36:39.783530       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
      I1221 20:36:39.831358       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
      I1221 20:40:19.674925       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
      I1221 20:40:19.727998       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
      I1221 20:43:59.567369       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
      I1221 20:43:59.620315       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
      I1221 20:47:39.457582       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
      I1221 20:47:39.509505       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
      I1221 20:51:19.348286       1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
      I1221 20:51:19.401496       1 promql.go:118] evaluate PromQL cluster condition: "(\n  group(cluster_operator_conditions{name=\"aro\"})\n  or\n  0 * group(cluster_operator_conditions)\n)\n"
      

      showing fetch lines every few minutes. And from an exposed cluster, only showing PromQL eval lines:

      $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
      I1221 20:50:10.165101       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:11.166170       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:12.166314       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:13.166517       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:14.166847       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:15.167737       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:16.168486       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:17.169417       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:18.169576       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      I1221 20:50:19.170544       1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
      $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail
      ...no hits...
      

      Recovery

      If bitten, the remediation is to address the invalid PromQ. For example, we fixed that AROBrokenDNSMasq expression in graph-data#4528. And after that the local cluster administrator should restart their CVO, such as with:

      $ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods
      

      Attachments

        Issue Links

          Activity

            People

              trking W. Trevor King
              trking W. Trevor King
              Yang Yang Yang Yang
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: