[OCPBUGS-33796] SNO DU deployment ends up in degraded mcp status after install

Type: Bug
Resolution: Won't Do
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.16
Component/s: Machine Config Operator
Labels:
- mco-triaged
- perfscale-telco-5g

Severity: Important
Regression: No
Sprint: MCO Sprint 254
sprint_count: 1
Release Blocker: Rejected
Blocked: False

Description of problem:

An intermittent issue occurs when installing SNO (via the Assisted Installer) with the DU profile: the master mcp ends up in a degraded state after the install completes. Reproduced with both 4.16.0-rc.0 and 4.16.0-rc.1.

$ oc get mcp master
NAME     CONFIG   UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master            False     True       True       1              0                   0                     1                      15h


mcp master status:

  Conditions:
    Last Transition Time:  2024-05-14T21:21:30Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2024-05-14T21:21:30Z
    Message:               All nodes are updating to MachineConfig rendered-master-3ebeee8538946014a3f107ea0603d260
    Reason:
    Status:                True
    Type:                  Updating
    Last Transition Time:  2024-05-14T21:21:30Z
    Message:               Node e32-h22-r750 is reporting: "missing MachineConfig rendered-master-49651500230308839606552505f7f484\nmachineconfig.machineconfiguration.openshift.io \"rendered-master-49651500230308839606552505f7f484\" not found"
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2024-05-14T21:21:30Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2024-05-14T21:21:35Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
  Configuration:
  Degraded Machine Count:     1
  Machine Count:              1
  Observed Generation:        2
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:                       <none>
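
The conditions above name two different rendered configs: the pool is updating to rendered-master-3ebeee8538946014a3f107ea0603d260, while the node reports rendered-master-49651500230308839606552505f7f484 as missing. A minimal shell sketch for surfacing that mismatch from status text follows. The helper name rendered_names and the name pattern (rendered-<pool>-<32 hex characters>, inferred from the output above) are assumptions for illustration and are not part of any MCO tooling.

```shell
#!/bin/sh
# List each distinct rendered MachineConfig name found on stdin, one per line.
# Assumption: names look like rendered-<pool>-<32 hex chars>, as in the status above.
rendered_names() {
    grep -oE 'rendered-[a-z0-9-]+-[0-9a-f]{32}' | sort -u
}

# Hypothetical usage against a live cluster (not verified here):
#   oc describe mcp master | rendered_names
```

If more than one name is printed for a single-node pool that is not mid-update, the pool and the node may disagree about which rendered config is current, which matches the symptom in this report.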



machine-config-daemon pod logs:

[2024-05-14T21:38:13Z INFO  nmstatectl] Nmstate version: 2.2.27
[2024-05-14T21:38:13Z INFO  nmstatectl::persist_nic] /etc/systemd/network does not exist, no need to clean up
I0514 21:38:13.197788   50688 daemon.go:1624] In bootstrap mode
E0514 21:38:13.197828   50688 writer.go:226] Marking Degraded due to: missing MachineConfig rendered-master-49651500230308839606552505f7f484
machineconfig.machineconfiguration.openshift.io "rendered-master-49651500230308839606552505f7f484" not found
I0514 21:38:42.173501   50688 certificate_writer.go:340] Certificate was synced from controllerconfig resourceVersion 12044
I0514 21:38:45.205661   50688 daemon.go:1898] Running: /run/machine-config-daemon-bin/nmstatectl persist-nic-names --root / --kargs-out /tmp/nmstate-kargs1344634730 --cleanup


machine-config-daemon pod events:
Events:
  Type     Reason      Age                  From               Message
  ----     ------      ----                 ----               -------
  Normal   Scheduled   28m                  default-scheduler  Successfully assigned openshift-machine-config-operator/machine-config-daemon-29r45 to e32-h22-r750
  Normal   Pulled      28m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:714a42e9eb52ef1bae8a2575ca1a2bfdf733d5a6786f08ceb3b6ff61d59931cf" already present on machine
  Normal   Created     28m                  kubelet            Created container machine-config-daemon
  Normal   Started     28m                  kubelet            Started container machine-config-daemon
  Normal   Pulled      28m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:91bb4f8991ea4b597c9404cec89a984cc3ad3f76a6099d868bc3388dbbd36346" already present on machine
  Normal   Created     28m                  kubelet            Created container kube-rbac-proxy
  Normal   Started     28m                  kubelet            Started container kube-rbac-proxy
  Normal   Created     26m                  kubelet            Created container machine-config-daemon
  Normal   Pulled      26m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:714a42e9eb52ef1bae8a2575ca1a2bfdf733d5a6786f08ceb3b6ff61d59931cf" already present on machine
  Normal   Started     26m                  kubelet            Started container machine-config-daemon
  Normal   Killing     19m (x2 over 22m)    kubelet            Container machine-config-daemon failed liveness probe, will be restarted
  Normal   Created     19m (x2 over 22m)    kubelet            Created container machine-config-daemon
  Normal   Started     19m (x2 over 22m)    kubelet            Started container machine-config-daemon
  Warning  Unhealthy   16m (x9 over 23m)    kubelet            Liveness probe failed: Get "http://127.0.0.1:8798/health": dial tcp 127.0.0.1:8798: connect: connection refused
  Normal   Pulled      13m (x4 over 22m)    kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:714a42e9eb52ef1bae8a2575ca1a2bfdf733d5a6786f08ceb3b6ff61d59931cf" already present on machine
  Warning  ProbeError  8m5s (x17 over 23m)  kubelet            Liveness probe error: Get "http://127.0.0.1:8798/health": dial tcp 127.0.0.1:8798: connect: connection refused
body:
  Warning  BackOff     3m28s (x7 over 4m35s)  kubelet            Back-off restarting failed container machine-config-daemon in pod machine-config-daemon-29r45_openshift-machine-config-operator(9953f60a-c482-4ec5-9f3c-d6ac5a874791)


oc describe node has the following annotation:
                    machineconfiguration.openshift.io/reason:
                      missing MachineConfig rendered-master-49651500230308839606552505f7f484
                      machineconfig.machineconfiguration.openshift.io "rendered-master-49651500230308839606552505f7f484" not found
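
To confirm that the config named in this annotation is really absent, its name can be compared against the list of MachineConfigs on the cluster. The sketch below assumes a shell whose grep supports -o; mc_missing is a hypothetical helper, and the oc invocations in the comments are untested usage suggestions rather than a documented procedure.

```shell
#!/bin/sh
# mc_missing REASON_TEXT
# Reads MachineConfig names on stdin, one per line, with or without a
# "<resource>/" prefix such as the one printed by `oc get mc -o name`.
# Prints the rendered config named in REASON_TEXT.
# Returns 0 if that config is NOT in the list, 1 if it is present,
# and 2 if no rendered config name could be extracted.
mc_missing() {
    name=$(printf '%s\n' "$1" | grep -oE 'rendered-[a-z0-9-]+-[0-9a-f]{32}' | head -n 1)
    [ -n "$name" ] || return 2
    printf '%s\n' "$name"
    # Strip any "<resource>/" prefix, then look for an exact line match.
    if sed 's|.*/||' | grep -qxF "$name"; then
        return 1
    fi
    return 0
}

# Hypothetical usage (node name taken from this report):
#   reason=$(oc get node e32-h22-r750 \
#     -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}')
#   oc get mc -o name | mc_missing "$reason" && echo "rendered config is absent"
```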

Version-Release number of selected component (if applicable):

    OCP 4.16.0-rc.0, 4.16.0-rc.1

How reproducible:

    Intermittent

Steps to Reproduce:

    1. Install SNO with DU profile
    2. Check mcp status after install

Actual results:

    Master mcp is degraded post install

Expected results:

     Master mcp should not be degraded post install

Additional info:

Relates to: OCPBUGS-33229 OCP 4.16 install fails with MCO error "error during syncRequiredMachineConfigPools" (Closed)

Assignee: Team MCO

Reporter: Noreen Chhabra

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 6

Created: 2024/05/16 4:32 PM

Updated: 2024/12/02 11:54 PM

Resolved: 2024/12/02 11:54 PM
