Bug
Resolution: Unresolved
Minor
4.14
Quality / Stability / Reliability
MCO Sprint 242
New description of problem:
When the MCO installs, there is a small (<1%) chance that the first MCO pod loses leader election partway through its init loop and that the subsequent pod never performs an init. This can degrade the MCO pod with a rendered-master-not-found error, since the initially generated MachineConfig never gets regenerated. The diff that causes this can be one of the following:
1. /etc/containers/registries.conf reverting to empty before being regenerated
2. /etc/kubernetes/kubelet.conf and /var/lib/kubelet/config.json reverting to empty before being regenerated
3. /etc/mco/internal-registry-pull-secret.json being generated on the first run
Technically 1 and 2 are "maskable" errors if 3 isn't an issue, so the fix options are:
- have the MCO pod re-init if the previous one was unsuccessful (this may require changing the init-mode detection mechanism; see the sketch below)
- move the registry pull secret to the certificate path and remove it from MachineConfigs
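A minimal Go sketch of the first option, assuming a hypothetical "mco-init-complete" marker ConfigMap and hypothetical runInitLoop/runSyncLoop stand-ins (the MCO's actual init-mode detection differs). The idea is that init completion is recorded in a persisted marker only after init fully succeeds, so a new leader decides "do I need to init?" from the marker rather than from "am I the first pod?":

package main

import (
	"context"
	"log"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

const (
	ns      = "openshift-machine-config-operator"
	marker  = "mco-init-complete" // hypothetical marker ConfigMap, not a real MCO object
	doneKey = "done"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock, err := resourcelock.New(resourcelock.LeasesResourceLock, ns, "machine-config",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")})
	if err != nil {
		log.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// A leader that died mid-init never wrote the marker,
				// so the successor redoes init instead of skipping it.
				if !initDone(ctx, client) {
					runInitLoop(ctx)          // regenerate MachineConfigs etc.
					markInitDone(ctx, client) // written only after init succeeds
				}
				runSyncLoop(ctx)
			},
			OnStoppedLeading: func() {
				// Exit without writing the marker; the next pod re-runs init.
				log.Fatal("lost leader election; exiting so the next pod re-inits")
			},
		},
	})
}

func initDone(ctx context.Context, c kubernetes.Interface) bool {
	cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, marker, metav1.GetOptions{})
	return err == nil && cm.Data[doneKey] == "true"
}

func markInitDone(ctx context.Context, c kubernetes.Interface) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: marker},
		Data:       map[string]string{doneKey: "true"},
	}
	if _, err := c.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
		log.Printf("recording init completion: %v", err)
	}
}

func runInitLoop(ctx context.Context) {}                // stand-in for the MCO's real init work
func runSyncLoop(ctx context.Context) { <-ctx.Done() } // stand-in for the normal sync loop

Because OnStoppedLeading exits without writing the marker, a pod that loses the lease mid-init leaves initDone false, and the next leader re-enters the init loop instead of assuming init already happened.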
Original description of problem:
While installing many SNOs via ZTP using ACM, a SNO failed to complete install because the MCO was degraded during the install process.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         20h     Error while reconciling 4.14.0-rc.0: the cluster operator machine-config is degraded

# oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-rc.0   True        False         False      173m
baremetal                                  4.14.0-rc.0   True        False         False      20h
cloud-controller-manager                   4.14.0-rc.0   True        False         False      20h
cloud-credential                           4.14.0-rc.0   True        False         False      20h
cluster-autoscaler                         4.14.0-rc.0   True        False         False      20h
config-operator                            4.14.0-rc.0   True        False         False      20h
console                                    4.14.0-rc.0   True        False         False      20h
control-plane-machine-set                  4.14.0-rc.0   True        False         False      20h
csi-snapshot-controller                    4.14.0-rc.0   True        False         False      20h
dns                                        4.14.0-rc.0   True        False         False      20h
etcd                                       4.14.0-rc.0   True        False         False      20h
image-registry                             4.14.0-rc.0   True        False         False      20h
ingress                                    4.14.0-rc.0   True        False         False      20h
insights                                   4.14.0-rc.0   True        False         False      20h
kube-apiserver                             4.14.0-rc.0   True        False         False      20h
kube-controller-manager                    4.14.0-rc.0   True        False         False      20h
kube-scheduler                             4.14.0-rc.0   True        False         False      20h
kube-storage-version-migrator              4.14.0-rc.0   True        False         False      20h
machine-api                                4.14.0-rc.0   True        False         False      20h
machine-approver                           4.14.0-rc.0   True        False         False      20h
machine-config                                           True        True          True       20h     Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
marketplace                                4.14.0-rc.0   True        False         False      20h
monitoring                                 4.14.0-rc.0   True        False         False      20h
network                                    4.14.0-rc.0   True        False         False      20h
node-tuning                                4.14.0-rc.0   True        False         False      20h
openshift-apiserver                        4.14.0-rc.0   True        False         False      114m
openshift-controller-manager               4.14.0-rc.0   True        False         False      114m
openshift-samples                          4.14.0-rc.0   True        False         False      20h
operator-lifecycle-manager                 4.14.0-rc.0   True        False         False      20h
operator-lifecycle-manager-catalog         4.14.0-rc.0   True        False         False      20h
operator-lifecycle-manager-packageserver   4.14.0-rc.0   True        False         False      20h
service-ca                                 4.14.0-rc.0   True        False         False      20h
storage                                    4.14.0-rc.0   True        False         False      20h

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       1              0                   0                     1                      20h
worker   rendered-worker-b1ae085b0af76cd90252035359d08f1e   True      False      False      0              0                   0                     0                      20h

# oc get mc
NAME                                                GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
00-worker                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-master-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-master-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-worker-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-worker-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
04-accelerated-container-startup-master                                                        3.2.0             21h
04-accelerated-container-startup-worker                                                        3.2.0             21h
05-kdump-config-master                                                                         3.2.0             21h
05-kdump-config-worker                                                                         3.2.0             21h
06-kdump-enable-master                                                                         3.2.0             21h
06-kdump-enable-worker                                                                         3.2.0             21h
10-masters-node-ip-hint                                                                        3.1.0             21h
10-workers-node-ip-hint                                                                        3.1.0             21h
50-master-dnsmasq-configuration                                                                3.1.0             21h
50-masters-chrony-configuration                                                                3.1.0             21h
50-workers-chrony-configuration                                                                3.1.0             21h
97-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
97-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
98-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
98-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-crio-disable-wipe-master                                                                    3.2.0             21h
99-crio-disable-wipe-worker                                                                    3.2.0             21h
99-master-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-master-ssh                                                                                  3.2.0             21h
99-worker-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-worker-ssh                                                                                  3.2.0             21h
container-mount-namespace-and-kubelet-conf-master                                              3.2.0             21h
container-mount-namespace-and-kubelet-conf-worker                                              3.2.0             21h
load-sctp-module-master                                                                        2.2.0             21h
load-sctp-module-worker                                                                        2.2.0             21h
rendered-master-048c4c2232a009a3adef2b6b23dff69e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-master-9db98f74fc6921ae0f62e5a586e34842    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-worker-4e06f96d0ae77bdf1dfc8a85022a0a0d    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-worker-b1ae085b0af76cd90252035359d08f1e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
Version-Release number of selected component (if applicable):
Deployed SNO OCP: 4.14.0-rc.0
Hub: 4.13.11
ACM: 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
Rare: 1 out of 3618 installs (~0.03%) in this test; a prior test saw 5 out of 3618 (~0.14%), so the failure may be more frequent than observed here.
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-09-12T20:42:51Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
        f:ownerReferences:
          .:
          k:{"uid":"548d677e-f003-4aab-9492-9a3cb807c476"}:
      f:spec:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2023-09-12T20:42:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
    Manager:      cluster-version-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-09-12T20:42:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:      machine-config-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-09-13T17:53:10Z
  Owner References:
    API Version:  config.openshift.io/v1
    Controller:   true
    Kind:         ClusterVersion
    Name:         version
    UID:          548d677e-f003-4aab-9492-9a3cb807c476
  Resource Version:  369343
  UID:               c037a04b-6bca-4422-9658-e6d939ce648a
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-09-12T20:57:31Z
    Message:               Working towards 4.14.0-rc.0
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-09-12T21:03:31Z
    Message:               Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-09-12T21:03:31Z
    Message:               Cluster has deployed []
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-09-12T21:03:54Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node vm01213 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f8b0d299858740f4ece5a5dabf600cfd\\\" not found\""
    Worker:  all 0 nodes are at latest configuration rendered-worker-b1ae085b0af76cd90252035359d08f1e
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
    Group:
    Name:      openshift-nutanix-infra
    Resource:  namespaces
Events:  <none>
relates to: OCPBUGS-29108 "SNO cluster installation failed on CVO operator timeout" (Closed)