OCPBUGS-31691

content mismatch for file "/etc/crio/crio.conf.d/00-default"


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version/s: 4.13.z
    • Component/s: RHCOS

      Description of problem:

      A cluster MCP (MachineConfigPool) upgrade got stuck while upgrading from 4.13.15 to 4.13.37.
      
      The MCP update was stuck between the current and desired rendered configs:
      I0403 03:33:28.453415 1970990 daemon.go:1501] Current config: rendered-master-0e136129d2ccc49e34cb432c35b91b12
      I0403 03:33:28.453422 1970990 daemon.go:1502] Desired config: rendered-master-044624087a107de8a42f89256c081c61
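      
      For anyone triaging a similar hang: these two values come from node annotations maintained by the machine-config operator. Below is a minimal sketch (Go with client-go; the node name and kubeconfig path are placeholders, not taken from this cluster) of reading them directly:
      
      package main
      
      import (
          "context"
          "fmt"
          "os"
      
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )
      
      func main() {
          nodeName := "master-0" // placeholder: one of the stuck control-plane nodes
      
          cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
          if err != nil {
              panic(err)
          }
          client, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              panic(err)
          }
      
          node, err := client.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
          if err != nil {
              panic(err)
          }
      
          // The MCD records its progress in these node annotations; a stuck update
          // shows currentConfig != desiredConfig for an extended period.
          cur := node.Annotations["machineconfiguration.openshift.io/currentConfig"]
          des := node.Annotations["machineconfiguration.openshift.io/desiredConfig"]
          state := node.Annotations["machineconfiguration.openshift.io/state"]
      
          fmt.Printf("current: %s\ndesired: %s\nstate: %s\n", cur, des, state)
          if cur != des {
              fmt.Println("node has not converged to the desired rendered config")
          }
      }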
      
      Checking the machine-config-daemon logs, the upgrade appeared to be stuck fetching an ostree chunk:
      Fetched ostree chunk sha256:a34fc3efd200
      Fetching ostree chunk sha256:2c9372bf6f68 (91.3 MB)
      Fetched ostree chunk sha256:2c9372bf6f68
      Fetching ostree chunk sha256:9745cdfb7160 (18.8 MB)
      Fetched ostree chunk sha256:9745cdfb7160
      Fetching ostree chunk sha256:0af06817481a (12.0 MB)
      Fetched ostree chunk sha256:0af06817481a
      Fetching ostree chunk sha256:153dcaa5c6b0 (35.1 MB)
      Fetched ostree chunk sha256:153dcaa5c6b0
      Fetching ostree chunk sha256:d202db8e3938 (12.8 MB)
      
      It was not progressing beyond this point.
      
      When the machine-config-daemon pod for the affected node was deleted, the new machine-config-daemon pod got stuck in a deadlock, logging the following:
      
      E0403 04:25:19.518444    2643 on_disk_validation.go:245] content mismatch for file "/etc/crio/crio.conf.d/00-default" (-want +got):
        []uint8(
              """
              [crio]
              internal_wipe = true
      -       version_file_persist = "/var/lib/crio/version"
        
              [crio.api]
              ... // 34 identical lines
              [crio.image]
              global_auth_file = "/var/lib/kubelet/config.json"
      -       pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:136862793d2fb6328cbd8a0cd603ef1d0faf2d78a48fe3035a5c82e22f7753bc"
      +       pause_image = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb1a253166b0392323b19592c4b2820a02c3917546849891f5619e0990cb3909"
              pause_image_auth_file = "/var/lib/kubelet/config.json"
              pause_command = "/usr/bin/pod"
              ... // 39 identical lines
              """
        )
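      
      For context, this error is a byte-for-byte comparison between the file content the rendered MachineConfig expects and what is actually on disk; the "(-want +got)" formatting is the diff style produced by the go-cmp library. A minimal illustrative sketch of such a check (not the MCD's actual code; the expected content below is a placeholder):
      
      package main
      
      import (
          "fmt"
          "os"
      
          "github.com/google/go-cmp/cmp"
      )
      
      // validateFile compares the bytes a rendered MachineConfig expects with
      // the file actually on disk and reports a "(-want +got)" diff on mismatch.
      func validateFile(path string, want []byte) error {
          got, err := os.ReadFile(path)
          if err != nil {
              return fmt.Errorf("could not read %q: %w", path, err)
          }
          if diff := cmp.Diff(want, got); diff != "" {
              return fmt.Errorf("content mismatch for file %q (-want +got):\n%s", path, diff)
          }
          return nil
      }
      
      func main() {
          // Placeholder expected content; in the real daemon this comes from the
          // rendered MachineConfig, including the release-specific pause_image digest.
          want := []byte("[crio]\ninternal_wipe = true\n")
      
          if err := validateFile("/etc/crio/crio.conf.d/00-default", want); err != nil {
              fmt.Fprintln(os.Stderr, err)
              os.Exit(1)
          }
      }
      
      In the diff above, the divergence is the pause_image digest plus the version_file_persist line, i.e. the file on disk and the content the daemon validated against came from two different rendered configs.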
      
      
      

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

       Cluster was stuck upgrading

      Expected results:

       Cluster upgrade should not be blocked

      Additional info:

       

      The following KCS article unblocked the upgrade: KCS 5315421 (https://access.redhat.com/solutions/5315421).

      Also, multiple KCS articles appear to apply here, and it is unclear which one should be followed for which symptoms:

      1. https://access.redhat.com/solutions/5244121
      2. https://access.redhat.com/solutions/5414371
      3. https://access.redhat.com/solutions/6028851

      SOS Report from one of the affected nodes:

      https://drive.google.com/file/d/1DahN5oBbNqiaKQ4Jj9S8CQZLnl5oPTo3/view?usp=drive_link

      Must gather:

      https://drive.google.com/file/d/1fRLWsXJcaiBKzE9s_NUay1R8UG1XIo7O/view?usp=drive_link

            Assignee: Unassigned
            Reporter: Tafhim Ul Islam
            QA Contact: Sergio Regidor de la Rosa