Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-15728

Machine config drifts when deploying with platform external

XMLWordPrintable

    • No
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      When deploying with external platform, the reported state of the machine config pool is degraded, and we can observe a drift in the configuration:
      
      $ diff /etc/mcs-machine-config-content.json ~/rendered-master-1b6aab788192600896f36c5388d48374
      <                         "contents": "[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service network-online.target\nRequires=crio.service kubelet-auto-node-size.service\nAfter=network-online.target crio.service kubelet-auto-node-size.service\nAfter=ostree-finalize-staged.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state\nExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state\nEnvironmentFile=/etc/os-release\nEnvironmentFile=-/etc/kubernetes/kubelet-workaround\nEnvironmentFile=-/etc/kubernetes/kubelet-env\nEnvironmentFile=/etc/node-sizing.env\n\nExecStart=/usr/local/bin/kubenswrapper \\\n    /usr/bin/kubelet \\\n      --config=/etc/kubernetes/kubelet.conf \\\n      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \\\n      --kubeconfig=/var/lib/kubelet/kubeconfig \\\n      --container-runtime-endpoint=/var/run/crio/crio.sock \\\n      --runtime-cgroups=/system.slice/crio.service \\\n      --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \\\n      --node-ip=${KUBELET_NODE_IP} \\\n      --minimum-container-ttl-duration=6m0s \\\n      --cloud-provider=external \\\n      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \\\n       \\\n      --hostname-override=${KUBELET_NODE_NAME} \\\n      --provider-id=${KUBELET_PROVIDERID} \\\n      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \\\n      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bde9fb486f1e8369b465a8c0aff7152c2a1f5a326385ee492140592b506638d6 \\\n      --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \\\n      --v=${KUBELET_LOG_LEVEL}\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n",
      ---
      >                         "contents": "[Unit]\nDescription=Kubernetes Kubelet\nWants=rpc-statd.service network-online.target\nRequires=crio.service kubelet-auto-node-size.service\nAfter=network-online.target crio.service kubelet-auto-node-size.service\nAfter=ostree-finalize-staged.service\n\n[Service]\nType=notify\nExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests\nExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state\nExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state\nEnvironmentFile=/etc/os-release\nEnvironmentFile=-/etc/kubernetes/kubelet-workaround\nEnvironmentFile=-/etc/kubernetes/kubelet-env\nEnvironmentFile=/etc/node-sizing.env\n\nExecStart=/usr/local/bin/kubenswrapper \\\n    /usr/bin/kubelet \\\n      --config=/etc/kubernetes/kubelet.conf \\\n      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \\\n      --kubeconfig=/var/lib/kubelet/kubeconfig \\\n      --container-runtime-endpoint=/var/run/crio/crio.sock \\\n      --runtime-cgroups=/system.slice/crio.service \\\n      --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \\\n      --node-ip=${KUBELET_NODE_IP} \\\n      --minimum-container-ttl-duration=6m0s \\\n      --cloud-provider= \\\n      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \\\n       \\\n      --hostname-override=${KUBELET_NODE_NAME} \\\n      --provider-id=${KUBELET_PROVIDERID} \\\n      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \\\n      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bde9fb486f1e8369b465a8c0aff7152c2a1f5a326385ee492140592b506638d6 \\\n      --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \\\n      --v=${KUBELET_LOG_LEVEL}\n\nRestart=always\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n",
      
      
      the difference is --cloud-provider=external /--cloud-provider= is the flags passed to the kubelet.
      
      
      We also observe the following log in the MCC:
      W0629 09:57:44.583046       1 warnings.go:70] unknown field "spec.infra.status.platformStatus.external.cloudControllerManager"
      
      
      "spec.infra.status.platformStatus.external.cloudControllerManager" is basically the flag in the Infrastructure object that enables the external platform.

      Version-Release number of selected component (if applicable):

      4.14 nightly

      How reproducible:

      Always when platform is external

      Steps to Reproduce:

      1. Deploy a cluster with the external platform enabled, the featureSet TechPreviewNoUpgrade should be set and the Infrastructure object should look like:
      
      apiVersion: config.openshift.io/v1
      kind: Infrastructure
      metadata:
        creationTimestamp: "2023-06-28T10:37:12Z"
        generation: 1
        name: cluster
        resourceVersion: "538"
        uid: 57e09773-0eca-4767-95ce-8ec7d0f2cdae
      spec:
        cloudConfig:
          name: ""
        platformSpec:
          external:
            platformName: oci
          type: External
      status:
        apiServerInternalURI: https://api-int.test-infra-cluster-3cd17632.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
        apiServerURL: https://api.test-infra-cluster-3cd17632.assisted-ci.oci-rhelcert.edge-sro.rhecoeng.com:6443
        controlPlaneTopology: HighlyAvailable
        cpuPartitioning: None
        etcdDiscoveryDomain: ""
        infrastructureName: test-infra-cluster-3c-pqqqm
        infrastructureTopology: HighlyAvailable
        platform: External
        platformStatus:
          external:
            cloudControllerManager:
              state: External
          type: External
      2. Observe the drift with: oc get mcp
      

      Actual results:

      $ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master                                                      False     True       True       3              0                   0                     3                      138m
      worker   rendered-worker-d48036fe2b657e6c71d5d1275675fefc   True      False      False      3              3                   3                     0                      138m
      

      Expected results:

      $ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-2ff4e25f807ef3b20b7c6e0c6526f05d   True      False      False      3              3                   3                     0                      33m
      worker   rendered-worker-48b7f39d78e3b1d94a0aba1ef4358d01   True      False      False      3              3                   3                     0                      33m

      Additional info:

      https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1688035248716119

              rhn-support-mrbraga Marco Braga
              agentil@redhat.com Adrien Gentil
              Sergio Regidor de la Rosa Sergio Regidor de la Rosa
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Created:
                Updated:
                Resolved: