OTA-821 (OpenShift Over the Air)

How comfortable are we with channel-clearing?

    • Type: Spike
    • Resolution: Unresolved
    • Priority: Undefined

      Our docs currently have:

      If you do not want the Cluster Version Operator to fetch available updates from the update recommendation service, you can use the oc adm upgrade channel command in the OpenShift CLI to configure an empty channel. This configuration can be helpful if, for example, a cluster has restricted network access and there is no local, reachable update recommendation service.

      The oc adm upgrade channel command and the cluster-version operator's RetrievedUpdates condition are similarly relaxed about an empty channel. But there are risks:

      • The admin is taking responsibility for monitoring for available updates on their own. If they fail to check for updates, they risk running buggy code, and may have CVE exposure, etc.
      • When evaluating updates, the admin is taking responsibility for checking whether the update is supported. For example, 4.11.0 to 4.11.1 is supported, while 4.2.0 to 4.11.1 is not (more or less; regardless of its strict support status, it is clearly less wise).
      • When evaluating supported updates, the admin is taking responsibility for checking and evaluating any declared update risks. For example:
      $ curl -s 'https://api.openshift.com/api/upgrades_info/graph?channel=stable-4.11' | jq '.conditionalEdges[] | .risks as $r | .edges[] | select(.from == "4.10.28" and .to == "4.11.2") | $r'
      [
        {
          "url": "https://issues.redhat.com/browse/OCPBUGS-631",
          "name": "PodmanTermStorageCorruption",
          "message": "BareMetal, Nutanix, OpenStack, oVirt, and VSphere platforms may fail to update nodes in environments where it takes over 20 seconds to retrieve the Machine Config Daemon image.",
          "matchingRules": [
            {
              "type": "PromQL",
              "promql": {
                "promql": "cluster_infrastructure_provider{type=~\"BareMetal|Nutanix|OpenStack|oVirt|VSphere\"}\nor\n0 * cluster_infrastructure_provider\n"
              }
            }
          ]
        },
        {
          "url": "https://issues.redhat.com/browse/OCPBUGS-959",
          "name": "StaleInsightsRunLevelLabel",
          "message": "An 'openshift.io/run-level: 1' annotation on the openshift-insights namespace may cause \"container has runAsNonRoot\" for the insights operator when updating to 4.11.2.",
          "matchingRules": [
            {
              "type": "PromQL",
              "promql": {
                "promql": "kube_namespace_labels{namespace=\"openshift-insights\",label_openshift_io_run_level=\"1\"}\nor\n0 * kube_namespace_labels{namespace=\"openshift-insights\",label_openshift_io_run_level=\"\"}\n"
              }
            }
          ]
        },
        {
          "url": "https://issues.redhat.com/browse/OCPBUGS-595",
          "name": "StaleSELinuxPolicies",
          "message": "Custom SELinux policies, such as those installed by OpenShift Virtualization, may result in kubelet issues when updating to 4.11.2.",
          "matchingRules": [
            {
              "type": "PromQL",
              "promql": {
                "promql": "group(csv_succeeded{name=~\"kubevirt-hyperconverged-operator[.].*\"})\nor\n0 * group(csv_count)\n"
              }
            }
          ]
        }
      ]
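The jq filter above can also be sketched in Python for admins who want to script this check. This is a minimal sketch, not the update service's own tooling: declared_risks and sample_graph are hypothetical names, the sample is trimmed to the name and url fields of two risks from the output above, and the second conditionalEdges entry (HypotheticalRisk, 4.10.0 to 4.11.9) is invented purely to show a non-matching edge. Unlike the jq filter, it flattens the risks into a single list.

```python
def declared_risks(graph, from_version, to_version):
    """Collect the risks declared for one from -> to conditional update."""
    risks = []
    for conditional in graph.get("conditionalEdges", []):
        for edge in conditional.get("edges", []):
            if edge.get("from") == from_version and edge.get("to") == to_version:
                # Every risk in this group applies to the matching edge.
                risks.extend(conditional.get("risks", []))
    return risks


# Trimmed, partly hypothetical sample shaped like the graph response.
sample_graph = {
    "conditionalEdges": [
        {
            "edges": [{"from": "4.10.28", "to": "4.11.2"}],
            "risks": [
                {
                    "name": "PodmanTermStorageCorruption",
                    "url": "https://issues.redhat.com/browse/OCPBUGS-631",
                },
                {
                    "name": "StaleSELinuxPolicies",
                    "url": "https://issues.redhat.com/browse/OCPBUGS-595",
                },
            ],
        },
        {
            # Hypothetical entry, present only to show a non-matching edge.
            "edges": [{"from": "4.10.0", "to": "4.11.9"}],
            "risks": [
                {"name": "HypotheticalRisk", "url": "https://example.com/hypothetical"}
            ],
        },
    ]
}

for risk in declared_risks(sample_graph, "4.10.28", "4.11.2"):
    print(risk["name"], risk["url"])
```

An empty result only means no conditional edge was declared for that pair; it does not by itself say the update is recommended.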
      

      We should figure out what we want the UX to be, and whether we want to update the CVO's RetrievedUpdates NoChannel message, the oc adm upgrade channel help text and/or logged warnings, and/or the OpenShift docs around this.

      Also in this space, the restricted-network/disconnected update docs currently lead with the oc adm upgrade --image ... flow. That was fine for 4.5 and earlier. But since 4.6, and definitely since conditional updates arrived in 4.10, we really want to encourage folks to run a local update service. That way the latter two jobs (evaluating support and declared update risks) are handled automatically, and the admin is only on the hook for the first job (making sure they have current data, this time to feed into their local update service).
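To make the "evaluating support" job concrete, here is a rough Python sketch of the kind of lookup an update service's graph data allows. It assumes the graph JSON carries a nodes list (each with a version), an edges list of [from_index, to_index] pairs into nodes, and conditionalEdges as in the curl output earlier. The function name classify_update, the three labels it returns, and the tiny sample graph are all hypothetical; the sample reuses the versions mentioned in this issue.

```python
def classify_update(graph, from_version, to_version):
    """Classify an update as recommended, conditional, or not recommended."""
    versions = [node["version"] for node in graph.get("nodes", [])]

    # Unconditional edges are index pairs into the nodes list.
    for from_index, to_index in graph.get("edges", []):
        if versions[from_index] == from_version and versions[to_index] == to_version:
            return "recommended"

    # Conditional edges name versions directly and carry declared risks.
    for conditional in graph.get("conditionalEdges", []):
        for edge in conditional.get("edges", []):
            if edge.get("from") == from_version and edge.get("to") == to_version:
                return "conditional"

    # No edge in the graph: the update service does not recommend this hop.
    return "not recommended"


# Hypothetical, heavily trimmed graph for illustration only.
sample_graph = {
    "nodes": [
        {"version": "4.11.0"},
        {"version": "4.11.1"},
        {"version": "4.10.28"},
        {"version": "4.11.2"},
    ],
    "edges": [[0, 1]],
    "conditionalEdges": [
        {
            "edges": [{"from": "4.10.28", "to": "4.11.2"}],
            "risks": [{"name": "StaleSELinuxPolicies"}],
        }
    ],
}

for pair in [("4.11.0", "4.11.1"), ("4.10.28", "4.11.2"), ("4.2.0", "4.11.1")]:
    print(pair, classify_update(sample_graph, *pair))
```

With a cleared channel and no local update service, this lookup is what the admin has to do by hand for every candidate update.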

              Assignee: Unassigned
              Reporter: W. Trevor King