Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: 4.14
Component/s: Machine Config Operator
Labels:
- perfscale-telco-5g
- telco-5g

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
MCO Sprint 242
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

New description of problem:

When the MCO installs, there is a small chance (<1%) that the first MCO pod loses leader election while running its init loop and that the subsequent pod never performs an init. This can cause the MCO to degrade with a "rendered-master not found" error, since the initially generated MachineConfig never gets regenerated. The diff that triggers this can be one of the following:

1. /etc/containers/registries.conf reverting to empty before being regenerated
2. /etc/kubernetes/kubelet.conf and /var/lib/kubelet/config.json reverting to empty before being regenerated
3. /etc/mco/internal-registry-pull-secret.json being generated on the first run

Technically, 1 and 2 are "maskable" errors if 3 isn't an issue, so the fix options should be:

- have the MCO pod re-init if the previous one was unsuccessful (may have to change the init mode detection mechanism)
- move the registry pull secret to the certificate path and remove it from the MachineConfigs
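
To make the first fix option concrete, here is a toy model of the race, not MCO code. Everything in it is hypothetical: the `ClusterState` fields, the `init_complete` marker, and in particular the presence-based detection rule, which is only an assumption for illustration since this report does not describe how the MCO currently detects init mode.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Toy stand-in for state that outlives any single operator pod."""
    objects_created: bool = False   # first pod started writing objects
    init_complete: bool = False     # hypothetical marker, set only after a full init
    machine_configs: set = field(default_factory=set)


def needs_init_by_presence(state: ClusterState) -> bool:
    # Assumed detection rule: run init only when nothing has been created yet.
    return not state.objects_created


def needs_init_by_marker(state: ClusterState) -> bool:
    # Alternative rule: keep running init until a previous pod recorded success.
    return not state.init_complete


def run_pod(state, needs_init, loses_election_mid_init=False) -> str:
    """Run one operator pod against the shared state and report what it did."""
    if not needs_init(state):
        return "skipped init"
    state.objects_created = True
    if loses_election_mid_init:
        return "init interrupted"   # pod exits before generating everything
    state.machine_configs.add("rendered-master")
    state.init_complete = True
    return "init complete"


def simulate(needs_init) -> bool:
    """True if rendered-master exists after an interrupted first pod and a second pod."""
    state = ClusterState()
    run_pod(state, needs_init, loses_election_mid_init=True)
    run_pod(state, needs_init)
    return "rendered-master" in state.machine_configs


print(simulate(needs_init_by_presence))
print(simulate(needs_init_by_marker))
```

With the presence-based rule the second pod skips init and the rendered config is never produced; with the marker-based rule the second pod repeats the init. Whether a marker like this fits the real operator is an open design question.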

Original description of problem:

While installing many SNOs via ZTP using ACM, a SNO failed to complete install because the MCO was degraded during the install process.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         20h     Error while reconciling 4.14.0-rc.0: the cluster operator machine-config is degraded

# oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.14.0-rc.0   True        False         False      173m    
baremetal                                  4.14.0-rc.0   True        False         False      20h     
cloud-controller-manager                   4.14.0-rc.0   True        False         False      20h     
cloud-credential                           4.14.0-rc.0   True        False         False      20h     
cluster-autoscaler                         4.14.0-rc.0   True        False         False      20h     
config-operator                            4.14.0-rc.0   True        False         False      20h     
console                                    4.14.0-rc.0   True        False         False      20h     
control-plane-machine-set                  4.14.0-rc.0   True        False         False      20h     
csi-snapshot-controller                    4.14.0-rc.0   True        False         False      20h     
dns                                        4.14.0-rc.0   True        False         False      20h     
etcd                                       4.14.0-rc.0   True        False         False      20h     
image-registry                             4.14.0-rc.0   True        False         False      20h     
ingress                                    4.14.0-rc.0   True        False         False      20h     
insights                                   4.14.0-rc.0   True        False         False      20h     
kube-apiserver                             4.14.0-rc.0   True        False         False      20h     
kube-controller-manager                    4.14.0-rc.0   True        False         False      20h     
kube-scheduler                             4.14.0-rc.0   True        False         False      20h     
kube-storage-version-migrator              4.14.0-rc.0   True        False         False      20h     
machine-api                                4.14.0-rc.0   True        False         False      20h     
machine-approver                           4.14.0-rc.0   True        False         False      20h     
machine-config                                           True        True          True       20h     Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
marketplace                                4.14.0-rc.0   True        False         False      20h     
monitoring                                 4.14.0-rc.0   True        False         False      20h     
network                                    4.14.0-rc.0   True        False         False      20h     
node-tuning                                4.14.0-rc.0   True        False         False      20h     
openshift-apiserver                        4.14.0-rc.0   True        False         False      114m    
openshift-controller-manager               4.14.0-rc.0   True        False         False      114m    
openshift-samples                          4.14.0-rc.0   True        False         False      20h     
operator-lifecycle-manager                 4.14.0-rc.0   True        False         False      20h     
operator-lifecycle-manager-catalog         4.14.0-rc.0   True        False         False      20h     
operator-lifecycle-manager-packageserver   4.14.0-rc.0   True        False         False      20h     
service-ca                                 4.14.0-rc.0   True        False         False      20h     
storage                                    4.14.0-rc.0   True        False         False      20h     

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master                                                      False     True       True       1              0                   0                     1                      20h
worker   rendered-worker-b1ae085b0af76cd90252035359d08f1e   True      False      False      0              0                   0                     0                      20h

# oc get mc
NAME                                                GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
00-worker                                           2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-master-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-master-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-worker-container-runtime                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
01-worker-kubelet                                   2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
04-accelerated-container-startup-master                                                        3.2.0             21h
04-accelerated-container-startup-worker                                                        3.2.0             21h
05-kdump-config-master                                                                         3.2.0             21h
05-kdump-config-worker                                                                         3.2.0             21h
06-kdump-enable-master                                                                         3.2.0             21h
06-kdump-enable-worker                                                                         3.2.0             21h
10-masters-node-ip-hint                                                                        3.1.0             21h
10-workers-node-ip-hint                                                                        3.1.0             21h
50-master-dnsmasq-configuration                                                                3.1.0             21h
50-masters-chrony-configuration                                                                3.1.0             21h
50-workers-chrony-configuration                                                                3.1.0             21h
97-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
97-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
98-master-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
98-worker-generated-kubelet                         2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-crio-disable-wipe-master                                                                    3.2.0             21h
99-crio-disable-wipe-worker                                                                    3.2.0             21h
99-master-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-master-ssh                                                                                  3.2.0             21h
99-worker-generated-registries                      2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
99-worker-ssh                                                                                  3.2.0             21h
container-mount-namespace-and-kubelet-conf-master                                              3.2.0             21h
container-mount-namespace-and-kubelet-conf-worker                                              3.2.0             21h
load-sctp-module-master                                                                        2.2.0             21h
load-sctp-module-worker                                                                        2.2.0             21h
rendered-master-048c4c2232a009a3adef2b6b23dff69e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-master-9db98f74fc6921ae0f62e5a586e34842    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-worker-4e06f96d0ae77bdf1dfc8a85022a0a0d    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h
rendered-worker-b1ae085b0af76cd90252035359d08f1e    2ea9a64d51497060e4bef9e87fc9a55baf85b1f0   3.4.0             20h

Version-Release number of selected component (if applicable):

Deployed SNO OCP - 4.14.0-rc.0
Hub 4.13.11
ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

How reproducible:

Rare: 1 out of 3618 in this test. In a prior test it was 5 out of 3618, so it could be more frequent than observed in this test.
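
As a quick sanity check, the two observed counts can be converted to rates. The counts come from this report; the helper name `failure_rate` is just illustrative.

```python
def failure_rate(failures: int, total: int) -> float:
    """Fraction of runs that hit the failure."""
    return failures / total


# Counts taken from the two test runs mentioned above.
for label, failures in (("this test", 1), ("prior test", 5)):
    print(f"{label}: {failure_rate(failures, 3618):.3%}")
```

Both rates are well under 1%, which is consistent with the "<1%" estimate in the new description.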

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-09-12T20:42:51Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
        f:ownerReferences:
          .:
          k:{"uid":"548d677e-f003-4aab-9492-9a3cb807c476"}:
      f:spec:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2023-09-12T20:42:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
    Manager:      cluster-version-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-09-12T20:42:51Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
    Manager:      machine-config-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-09-13T17:53:10Z
  Owner References:
    API Version:     config.openshift.io/v1
    Controller:      true
    Kind:            ClusterVersion
    Name:            version
    UID:             548d677e-f003-4aab-9492-9a3cb807c476
  Resource Version:  369343
  UID:               c037a04b-6bca-4422-9658-e6d939ce648a
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-09-12T20:57:31Z
    Message:               Working towards 4.14.0-rc.0
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2023-09-12T21:03:31Z
    Message:               Unable to apply 4.14.0-rc.0: error during syncRequiredMachineConfigPools: [context deadline exceeded, failed to update clusteroperator: [client rate limiter Wait returned an error: context deadline exceeded, error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 1, ready 0, updated: 0, unavailable: 1)]]
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-09-12T21:03:31Z
    Message:               Cluster has deployed []
    Reason:                AsExpected
    Status:                True
    Type:                  Available
    Last Transition Time:  2023-09-12T21:03:54Z
    Message:               One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading
    Reason:                DegradedPool
    Status:                False
    Type:                  Upgradeable
  Extension:
    Master:  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node vm01213 is reporting: \"machineconfig.machineconfiguration.openshift.io \\\"rendered-master-f8b0d299858740f4ece5a5dabf600cfd\\\" not found\""
    Worker:  all 0 nodes are at latest configuration rendered-worker-b1ae085b0af76cd90252035359d08f1e
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigs
    Group:     
    Name:      
    Resource:  nodes
    Group:     
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-vsphere-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-nutanix-infra
    Resource:  namespaces
Events:        <none>
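
The mismatch between what the node asks for and what exists can be checked mechanically. The sketch below compares the rendered config named in the Extension/Master message with the rendered-master entries from the `oc get mc` listing earlier in this report. The names are copied from that output; the message string is simplified (escaping removed) and the function name is hypothetical.

```python
import re

# rendered-* names copied from the `oc get mc` listing in this report.
existing_mcs = [
    "rendered-master-048c4c2232a009a3adef2b6b23dff69e",
    "rendered-master-9db98f74fc6921ae0f62e5a586e34842",
    "rendered-worker-4e06f96d0ae77bdf1dfc8a85022a0a0d",
    "rendered-worker-b1ae085b0af76cd90252035359d08f1e",
]

# Node error from the ClusterOperator Extension/Master field, with escaping removed.
master_message = (
    'Node vm01213 is reporting: machineconfig.machineconfiguration.openshift.io '
    '"rendered-master-f8b0d299858740f4ece5a5dabf600cfd" not found'
)


def missing_rendered_configs(message: str, existing: list) -> list:
    """Return rendered configs named in the message that are not in the listing."""
    wanted = set(re.findall(r"rendered-(?:master|worker)-[0-9a-f]{32}", message))
    return sorted(wanted - set(existing))


print(missing_rendered_configs(master_message, existing_mcs))
```

Here the node wants a rendered-master config that matches neither of the two rendered-master objects present on the cluster, which is the symptom described at the top of this report.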

Attachments:

must-gather-vm01213-mcodegraded.tar.gz (45.27 MB, 2023/09/13 5:58 PM)
vm01213-mco.log (91 kB, 2023/09/13 6:07 PM)

relates to

OCPBUGS-29108 SNO cluster installation failed on CVO operator timeout

Closed
