Bug
Resolution: Won't Do
Minor
4.14
Quality / Stability / Reliability
MCO Sprint 242
New description of problem:
When the MCO installs, there is a small (<1%) chance that the first MCO pod loses leader election partway through its init loop and that the subsequent pod never performs an init of its own. This can degrade the MCO with a "rendered-master not found" error, since the initially generated MachineConfig is never regenerated. The diff that triggers this can be any of the following:
1. /etc/containers/registries.conf reverting to empty before being regenerated
2. /etc/kubernetes/kubelet.conf and /var/lib/kubelet/config.json reverting to empty before being regenerated
3. /etc/mco/internal-registry-pull-secret.json being generated on the first run
Technically, 1 and 2 are "maskable" errors if 3 isn't an issue, so the fix options are:
- have the MCO pod re-init if the previous one was unsuccessful (this may require changing the init-mode detection mechanism)
- move the registry pull secret to the certificate path and remove it from MachineConfigs
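The first fix option can be illustrated with a minimal Go sketch. This is not MCO code: the `poolState` type and `shouldRunInit` helper are hypothetical, and the assumption is that init-mode detection would additionally check that the rendered MachineConfig the pool points at actually exists, rather than only checking whether a previous leader started an init.

```go
package main

import "fmt"

// poolState is an illustrative stand-in for what a newly elected leader
// could inspect before deciding whether to run its init loop.
type poolState struct {
	DesiredRendered string          // rendered-master-<hash> the node is pinned to
	Rendered        map[string]bool // rendered MachineConfigs present in the cluster
	PrevInitDone    bool            // whether a previous leader completed its init loop
}

// shouldRunInit returns true when the new leader must (re-)run init:
// either no previous leader ever finished init, or the rendered config
// the pool references was never regenerated after a partial first run.
func shouldRunInit(s poolState) bool {
	if !s.PrevInitDone {
		return true
	}
	return !s.Rendered[s.DesiredRendered]
}

func main() {
	// Partial first run: init was flagged done, but the rendered config
	// the node wants is missing -- the failure mode described above.
	s := poolState{
		DesiredRendered: "rendered-master-f8b0d299858740f4ece5a5dabf600cfd",
		Rendered:        map[string]bool{},
		PrevInitDone:    true,
	}
	fmt.Println(shouldRunInit(s)) // true: re-init to regenerate the config
}
```

Under this sketch, the first pod losing leadership mid-init no longer strands the cluster, because the successor re-runs init instead of assuming the work was finished.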
Original description of problem:
While installing many SNOs via ZTP using ACM, a SNO failed to complete install because the MCO was degraded during the install process.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         20h     Error while reconciling 4.14.0-rc.0: the cluster operator machine-config is degraded

# oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-rc.0   True        False         False      173m
baremetal                                  4.14.0-rc.0   True        False         False      20h
cloud-controller-manager                   4.14.0-rc.0   True        False         False      20h
cloud-credential                           4.14.0-rc.0   True        False         False      20h
cluster-autoscaler                         4.14.0-rc.0   True        False         False      20h
config-operator                            4.14.0-rc.0   True        False         False      20h
console                                    4.14.0-rc.0   True        False         False      20h
control-plane-machine-set                  4.14.0-rc.0   True        False         False      20h
csi-snapshot-controller                    4.14.0-rc.0   True        False         False      20h
dns                                        4.14.0-rc.0   True        False         False      20h
etcd                                       4.14.0-rc.0   True        False         False      20h
image-registry                             4.14.0-rc.0   True        False         False      20h
ingress                                    4.14.0-rc.0   True        False         False      20h
insights                                   4.14.0-rc.0   True        False         False      20h
kube-apiserver                             4.14.0-rc.0   True        False         False      20h
kube-controller-manager                    4.14.0-rc.0   True        False         False      20h
kube-scheduler                             4.14.0-rc.0   True        False         False      20h
kube-storage-version-migrator              4.14.0-rc.0   True        False         False      20h
machine-api                                4.14.0-rc.0   True        False         False      20h
machine-approver                           4.14.0-rc.0   True        False         False      20h
machine-config                                           True        True          True       20h     Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
marketplace                                4.14.0-rc.0   True        False         False      20h
monitoring                                 4.14.0-rc.0   True        False         False      20h
network                                    4.14.0-rc.0   True        False         False      20h
node-tuning                                4.14.0-rc.0   True        False         False      20h
openshift-apiserver                        4.14.0-rc.0   True        False         False      114m
openshift-controller-manager               4.14.0-rc.0   True        False         False      114m
openshift-samples                          4.14.0-rc.0   True        False         False      20h
operator-lifecycle-manager                 4.14.0-rc.0   True        False         False      20h
operator-lifecycle-manager-catalog         4.14.0-rc.0   True        False         False      20h
operator-lifecycle-manager-packageserver   4.14.0-rc.0   True        False         False      20h
service-ca                                 4.14.0-rc.0   True        False         False      20h
storage                                    4.14.0-rc.0   True        False         False      20h

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       1              0                   0                     1                      20h
worker   rendered-worker-b1ae085b0af76cd90252035359d08f1e   True      False      False      0              0                   0                     0                      20h

# oc get mc
NAME                                                GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
00-worker                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-master-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-master-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-worker-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-worker-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
04-accelerated-container-startup-master                                                        3.2.0             21h
04-accelerated-container-startup-worker                                                        3.2.0             21h
05-kdump-config-master                                                                         3.2.0             21h
05-kdump-config-worker                                                                         3.2.0             21h
06-kdump-enable-master                                                                         3.2.0             21h
06-kdump-enable-worker                                                                         3.2.0             21h
10-masters-node-ip-hint                                                                        3.1.0             21h
10-workers-node-ip-hint                                                                        3.1.0             21h
50-master-dnsmasq-configuration                                                                3.1.0             21h
50-masters-chrony-configuration                                                                3.1.0             21h
50-workers-chrony-configuration                                                                3.1.0             21h
97-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
97-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
98-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
98-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-crio-disable-wipe-master                                                                    3.2.0             21h
99-crio-disable-wipe-worker                                                                    3.2.0             21h
99-master-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-master-ssh                                                                                  3.2.0             21h
99-worker-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-worker-ssh                                                                                  3.2.0             21h
container-mount-namespace-and-kubelet-conf-master                                              3.2.0             21h
container-mount-namespace-and-kubelet-conf-worker                                              3.2.0             21h
load-sctp-module-master                                                                        2.2.0             21h
load-sctp-module-worker                                                                        2.2.0             21h
rendered-master-048c4c2232a009a3adef2b6b23dff69e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-master-9db98f74fc6921ae0f62e5a586e34842    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-worker-4e06f96d0ae77bdf1dfc8a85022a0a0d    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-worker-b1ae085b0af76cd90252035359d08f1e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
Version-Release number of selected component (if applicable):
Deployed SNO OCP: 4.14.0-rc.0
Hub: 4.13.11
ACM: 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
Rare: 1 out of 3618 in this test; in a prior test it was 5 out of 3618, so it could be more frequent than observed in this test.
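For reference, the observed failure rates are consistent with the "<1% chance" estimate in the new description. A quick calculation (the `rate` helper below is illustrative, not part of any test tooling):

```go
package main

import "fmt"

// rate converts an observed failure count over a trial count into a percentage.
func rate(failures, trials int) float64 {
	return 100 * float64(failures) / float64(trials)
}

func main() {
	fmt.Printf("%.3f%%\n", rate(1, 3618)) // this test:  0.028%
	fmt.Printf("%.3f%%\n", rate(5, 3618)) // prior test: 0.138%
}
```

Even the higher prior-test rate is roughly an order of magnitude below the 1% upper bound, which matches how rarely the race window is hit.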
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-09-12T20:42:51Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
        f:ownerReferences:
          .:
          k:{"uid":"548d677e-f003-4aab-9492-9a3cb807c476"}:
      f:spec:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2023-09-12T20:42:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
    Manager:      cluster-version-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-09-12T20:42:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:      machine-config-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-09-13T17:53:10Z
  Owner References:
    API Version:     config.openshift.io/v1
    Controller:      true
    Kind:            ClusterVersion
    Name:            version
    UID:             548d677e-f003-4aab-9492-9a3cb807c476
  Resource Version:  369343
  UID:               c037a04b-6bca-4422-9658-e6d939ce648a
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-09-12T20:57:31Z
    Message:               Working towards 4.14.0-rc.0
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-09-12T21:03:31Z
    Message:               Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-09-12T21:03:31Z
    Message:               Cluster has deployed []
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-09-12T21:03:54Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node vm01213 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f8b0d299858740f4ece5a5dabf600cfd\\\" not found\""
    Worker:  all 0 nodes are at latest configuration rendered-worker-b1ae085b0af76cd90252035359d08f1e
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
    Group:
    Name:      openshift-nutanix-infra
    Resource:  namespaces
Events:  <none>
Relates to: OCPBUGS-29108 "SNO cluster installation failed on CVO operator timeout" (Closed)