-
Bug
-
Resolution: Won't Do
-
Normal
-
None
-
4.13.z
-
No
-
False
-
Description of problem:
A cluster MCP upgrade got stuck while upgrading from 4.13.15 to 4.13.37. The MCP update was stuck between the current and desired rendered configs:

I0403 03:33:28.453415 1970990 daemon.go:1501] Current config: rendered-master-0e136129d2ccc49e34cb432c35b91b12
I0403 03:33:28.453422 1970990 daemon.go:1502] Desired config: rendered-master-044624087a107de8a42f89256c081c61

Upon checking the machine config daemon logs, it appeared that the upgrade was stuck fetching a file:

Fetched ostree chunk sha256:a34fc3efd200
Fetching ostree chunk sha256:2c9372bf6f68 (91.3 MB)
Fetched ostree chunk sha256:2c9372bf6f68
Fetching ostree chunk sha256:9745cdfb7160 (18.8 MB)
Fetched ostree chunk sha256:9745cdfb7160
Fetching ostree chunk sha256:0af06817481a (12.0 MB)
Fetched ostree chunk sha256:0af06817481a
Fetching ostree chunk sha256:153dcaa5c6b0 (35.1 MB)
Fetched ostree chunk sha256:153dcaa5c6b0
Fetching ostree chunk sha256:d202db8e3938 (12.8 MB)

It was not progressing beyond this point. The last chunk (sha256:d202db8e3938) was never reported as fetched.

When the machine config daemon pod for the affected node was deleted, the new machine config daemon pod got stuck in a deadlock with the following logs:

E0403 04:25:19.518444 2643 on_disk_validation.go:245] content mismatch for file "/etc/crio/crio.conf.d/00-default" (-want +got):
  []uint8(
  	"""
  	[crio]
  	internal_wipe = true
- 	version_file_persist = "/var/lib/crio/version"
  	[crio.api]
  	... // 34 identical lines
  	[crio.image]
  	global_auth_file = "/var/lib/kubelet/config.json"
- 	pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:136862793d2fb6328cbd8a0cd603ef1d0faf2d78a48fe3035a5c82e22f7753bc"
+ 	pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb1a253166b0392323b19592c4b2820a02c3917546849891f5619e0990cb3909"
  	pause_image_auth_file = "/var/lib/kubelet/config.json"
  	pause_command = "/usr/bin/pod"
  	... // 39 identical lines
  	"""
  )

In other words, the on-disk crio config was missing the expected version_file_persist line and pointed at a different pause_image digest than the one the current rendered config expects.
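For triage, the stalled chunk can be picked out of a saved machine config daemon log by pairing the "Fetching ostree chunk" and "Fetched ostree chunk" messages quoted above. The following is only a minimal sketch: the function name find_stalled_chunks is hypothetical, and it assumes the log uses exactly the two message formats shown in this report. It is not part of the machine config daemon or any Red Hat tooling.

```python
import re

# Patterns for the two ostree progress messages quoted in this report.
# Assumption: the digest is printed as a short lowercase hex string.
FETCHING = re.compile(r"Fetching ostree chunk sha256:([0-9a-f]+)")
FETCHED = re.compile(r"Fetched ostree chunk sha256:([0-9a-f]+)")


def find_stalled_chunks(log_text: str) -> list[str]:
    """Return digests that were started but never reported as fetched."""
    started = FETCHING.findall(log_text)
    finished = set(FETCHED.findall(log_text))
    # Keep log order so the most recent stalled chunk is last.
    return [digest for digest in started if digest not in finished]


if __name__ == "__main__":
    # Tail of the log excerpt from this report.
    sample = (
        "Fetching ostree chunk sha256:153dcaa5c6b0 (35.1 MB)\n"
        "Fetched ostree chunk sha256:153dcaa5c6b0\n"
        "Fetching ostree chunk sha256:d202db8e3938 (12.8 MB)\n"
    )
    print(find_stalled_chunks(sample))  # prints ['d202db8e3938']
```

Because the patterns are matched anywhere in the text, the sketch also works if the log has been flattened onto a single line, as in the original paste.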
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Cluster was stuck upgrading
Expected results:
Cluster upgrade should not be blocked
Additional info:
The following KCS unblocked the upgrade: 5315421
Also, multiple KCS articles seem to apply to this issue, and it is unclear which one should be followed for which symptoms:
- https://access.redhat.com/solutions/5244121
- https://access.redhat.com/solutions/5414371
- https://access.redhat.com/solutions/6028851
SOS Report from one of the affected nodes:
https://drive.google.com/file/d/1DahN5oBbNqiaKQ4Jj9S8CQZLnl5oPTo3/view?usp=drive_link
Must gather:
https://drive.google.com/file/d/1fRLWsXJcaiBKzE9s_NUay1R8UG1XIo7O/view?usp=drive_link
- relates to OCPBUGS-43267 pause_image content mismatch for file "/etc/crio/crio.conf.d/00-default" (New)