OpenShift Bugs / OCPBUGS-14497

APIServer and Infrastructure CRDs should be actively managed despite LatencySensitive


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Undefined
    • Fix Version: 4.14.0
    • Affects Version: 4.13
    • Component: config-operator
    • Severity: Moderate
    • Release Note Text: Removed LatencySensitive featureset.
    • Release Note Type: Removed Functionality

      Description of problem:

      LatencySensitive was documented through 4.5 as a way to enable Topology Manager.  But Topology Manager also went GA in 4.5.  And in 4.6, LatencySensitive stopped being a trigger for Upgradeable=False 4.y-to-4.(y+1) update blocking.  This combination of events leaves a few Telemetry-submitting clusters that are running modern 4.y but still using the kind-of-forgotten-about LatencySensitive feature set.  On top of this, recent openshift/api +openshift:enable:FeatureSets work has led to some manifests being sharded between the Default and TechPreviewNoUpgrade feature sets, leaving those resources unmanaged for folks in the LatencySensitive set.
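
      For reference, you can see the sharding by enumerating the feature-set annotations across an extracted payload (a quick sketch, assuming you have already run the oc adm release extract commands from the next section; counts vary by release):

      $ grep -rho 'release.openshift.io/feature-set:.*' 4.13 | sort | uniq -c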

      Version-Release number of selected component (if applicable):

      The pivot happened from 4.12 to 4.13:

      $ oc adm release extract --to 4.13 quay.io/openshift-release-dev/ocp-release:4.13.0-x86_64
      $ grep -r 'release.openshift.io/feature-set:.*Default' 4.13
      4.13/0000_10_config-operator_01_apiserver-Default.crd.yaml:    release.openshift.io/feature-set: Default
      4.13/0000_10_config-operator_01_infrastructure-Default.crd.yaml:    release.openshift.io/feature-set: Default
      $ oc adm release extract --to 4.12 quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64
      $ grep -r 'release.openshift.io/feature-set:.*Default' 4.12
      ...no hits...
      

      How reproducible:

      All the time, for LatencySensitive clusters.
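
      To confirm whether a given cluster has opted in (a sketch; the command should print LatencySensitive on an affected cluster, and nothing on a cluster using the default feature set):

      $ oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'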

      Steps to Reproduce:

      1. Launch a 4.12 cluster, e.g. with Cluster Bot launch 4.12.19 gcp.
      2. Set the LatencySensitive feature set:

      $ oc patch featuregate cluster --type=json --patch='[{"op":"add","path":"/spec/featureSet","value":"LatencySensitive"}]'
      

      3. Set a channel that allows updates to 4.13: oc adm upgrade channel fast-4.13
      4. Ack the Kube API removals:

      $ oc -n openshift-config patch configmap admin-acks --type json --patch '[{"op":"add","path":"/data","value":{"ack-4.12-kube-1.26-api-removals-in-4.13":"true"}}]'
      

      5. Update to 4.13: oc adm upgrade --to-latest
      6. Wait for the update to complete, e.g. polling oc adm upgrade until it levels (see the wait sketch after this list).
      7. Check the changed-in-4.13 APIServer property:

      $ oc get -o json customresourcedefinition apiservers.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.encryption.properties.type.enum'
      

      8. Check a changed-in-4.13 Infrastructure property:

      $ oc get -o json customresourcedefinition infrastructures.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.platformSpec.properties.type.enum[-2:]'
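
      For step 6, one way to block until the update levels instead of polling by hand (a sketch; ClusterVersion publishes a Progressing condition that oc wait can watch):

      $ oc wait clusterversion/version --for=condition=Progressing=False --timeout=120m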
      

      Actual results:

      $ oc get -o json clusterversion version | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version'
      2023-06-03T02:52:11Z 2023-06-03T04:20:57Z Completed 4.13.1
      2023-06-03T02:18:46Z 2023-06-03T02:45:09Z Completed 4.12.19
      $ oc get -o json customresourcedefinition apiservers.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.encryption.properties.type.enum'
      ["","identity","aescbc"]
      $  oc get -o json customresourcedefinition infrastructures.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.platformSpec.properties.type.enum[-2:]'
      ["AlibabaCloud","Nutanix"]
      

      The enums still show 4.12's lack of aesgcm and External, despite the cluster claiming to have completed the update to 4.13.

      The exclusion is also visible in the cluster-version operator logs:

      $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep -o 'excluding.*config-operator.*\(apiserver\|infrastructure\).*crd.*' | sort | uniq
      excluding 0000_10_config-operator_01_apiserver-Default.crd.yaml group=apiextensions.k8s.io kind=CustomResourceDefinition namespace= name=apiservers.config.openshift.io: "LatencySensitive" is required, and release.openshift.io/feature-set=Default
      excluding 0000_10_config-operator_01_apiserver-TechPreviewNoUpgrade.crd.yaml group=apiextensions.k8s.io kind=CustomResourceDefinition namespace= name=apiservers.config.openshift.io: "LatencySensitive" is required, and release.openshift.io/feature-set=TechPreviewNoUpgrade
      excluding 0000_10_config-operator_01_infrastructure-Default.crd.yaml group=apiextensions.k8s.io kind=CustomResourceDefinition namespace= name=infrastructures.config.openshift.io: "LatencySensitive" is required, and release.openshift.io/feature-set=Default
      excluding 0000_10_config-operator_01_infrastructure-TechPreviewNoUpgrade.crd.yaml group=apiextensions.k8s.io kind=CustomResourceDefinition namespace= name=infrastructures.config.openshift.io: "LatencySensitive" is required, and release.openshift.io/feature-set=TechPreviewNoUpgrade
      $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep -o 'customresourcedefinition .*' | sort | uniq
      customresourcedefinition "alertmanagerconfigs.monitoring.coreos.com" (396 of 840)
      customresourcedefinition "alertmanagers.monitoring.coreos.com" (397 of 840)
      customresourcedefinition "authentications.config.openshift.io" (36 of 840)
      customresourcedefinition "authentications.operator.openshift.io" (275 of 840)
      customresourcedefinition "baremetalhosts.metal3.io" (224 of 840)
      ...
      customresourcedefinition "imagetagmirrorsets.config.openshift.io" (46 of 840)
      customresourcedefinition "imagetagmirrorsets.config.openshift.io" (46 of 840): CustomResourceDefinition imagetagmirrorsets.config.openshift.io does not declare an Established status condition: []
      customresourcedefinition "ingresscontrollers.operator.openshift.io" (359 of 840)
      ...
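
      Based on the 'excluding ...' log format above, a rough count of how many manifests the CVO is excluding over feature-set annotations (a sketch; on this cluster the count also includes TechPreviewNoUpgrade manifests, which are excluded by design outside that feature set):

      $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep -c 'excluding.*release.openshift.io/feature-set'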
      

      Expected results:

      $ oc get -o json customresourcedefinition apiservers.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.encryption.properties.type.enum'
      ["","identity","aescbc","aesgcm"]
      $ oc get -o json customresourcedefinition infrastructures.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.platformSpec.properties.type.enum[-2:]'
      ["Nutanix","External"]
      

      The enums show 4.13's expected aesgcm and External.
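
      The expected values can also be cross-checked against the 4.13 payload extracted in the version section above (a sketch; the filenames come from the earlier grep output):

      $ grep -n aesgcm 4.13/0000_10_config-operator_01_apiserver-Default.crd.yaml
      $ grep -n External 4.13/0000_10_config-operator_01_infrastructure-Default.crd.yaml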

      Additional info:

      Assuming you do not have a controller that is attempting to manage FeatureGate's spec, you can recover your cluster by moving to the default feature set:

      $ oc patch featuregate cluster --type=json --patch='[{"op":"remove","path":"/spec/featureSet"}]' 

      Shortly after which:

      $ oc get -o json customresourcedefinition apiservers.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.encryption.properties.type.enum'
      ["","identity","aescbc","aesgcm"]
      $ oc get -o json customresourcedefinition infrastructures.config.openshift.io | jq -c '.spec.versions[].schema.openAPIV3Schema.properties.spec.properties.platformSpec.properties.type.enum[-2:]'
      ["Nutanix","External"]
      

      Assignee: Damien Grisonnet (dgrisonn@redhat.com)
      Reporter: W. Trevor King (trking)
      QA Contact: Rahul Gangwar