OpenShift Bugs / OCPBUGS-16651

ostree-finalize-staged.service timeout after 20 mins causing unsuccessful upgrade on some nodes.

      Description of problem:

      The infra MCP is degraded because one of the infra nodes is unable to upgrade due to the following issue:

       

      2023-07-20T05:06:55.045058094Z I0720 05:06:55.045011    2786 update.go:2118] Disk currentConfig rendered-infra-c6d6928bfcd10ab1b440f6a2505bd5d1 overrides node's currentConfig annotation rendered-infra-76583762333a6685c3d4d1b75e14c28b
      2023-07-20T05:06:55.048306566Z I0720 05:06:55.048269    2786 daemon.go:1564] Validating against pending config rendered-infra-c6d6928bfcd10ab1b440f6a2505bd5d1
      2023-07-20T05:06:57.733681234Z E0720 05:06:57.733641    2786 writer.go:200] Marking Degraded due to: unexpected on-disk state validating against rendered-infra-c6d6928bfcd10ab1b440f6a2505bd5d1: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5ef4276442c5174d31f6b62a83aa40e64c719275dd731e5ccb0dc98911f7e57e", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb065c8d91453ce4a3f5518189b34bce94406c01f43957abde01f08165b3a085" ("1ad911e70b7befaad4f3eac5ee14510bbaaecbedb9fb464ffbe3cb38e133576f") 

      Below are the ostree-finalize-staged.service logs. They show the service timing out about 20 minutes after it started copying the /etc changes:

       

      journalctl_--no-pager_--unit_ostree-finalize-staged
      Jul 19 15:22:23 SOINR01CAL0101.raiffeisen.org ostree[372060]: Copying /etc changes: 19 modified, 0 removed, 212 added
      Jul 19 15:42:21 SOINR01CAL0101.raiffeisen.org systemd[1]: ostree-finalize-staged.service: Stopping timed out. Terminating.
      

      The ostree-finalize-staged.service stop timeout is already set to 20 minutes on the RHCOS node:

      $ cat etc/systemd/system/ostree-finalize-staged.service.d/override.conf
      [Service]
      TimeoutStopSec=20m
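
      As a rough cross-check of the two excerpts above, the following Python sketch converts the configured TimeoutStopSec into seconds and compares it with the gap between the two journal timestamps. The helper simple_span_to_seconds is hypothetical and only handles a single number with an optional unit, not the full systemd time span syntax. Both journal lines are assumed to be from the same day, as shown in the log.

```python
import configparser
import re
from datetime import datetime

# Drop-in content copied from override.conf above.
dropin_text = """\
[Service]
TimeoutStopSec=20m
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep key names case-sensitive, as systemd does
parser.read_string(dropin_text)
raw_timeout = parser["Service"]["TimeoutStopSec"]

# Hypothetical helper: handles only "<number><unit>" values such as "20m".
UNIT_SECONDS = {"": 1, "s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600, "hr": 3600}

def simple_span_to_seconds(value):
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", value.strip())
    if match is None or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unsupported time span: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]

timeout_seconds = simple_span_to_seconds(raw_timeout)

# Times copied from the two journal lines (both on Jul 19).
copy_logged = datetime.strptime("15:22:23", "%H:%M:%S")
stop_timed_out = datetime.strptime("15:42:21", "%H:%M:%S")
elapsed_seconds = int((stop_timed_out - copy_logged).total_seconds())

print(f"configured stop timeout: {timeout_seconds} s")
print(f"copy message to timeout: {elapsed_seconds} s")
```

      The gap between the two journal lines is within a few seconds of the configured value, which is consistent with the service still copying /etc changes when the stop timeout expired.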

       

      $ cat rpm-ostree_status_-v
      State: idle
      Warning: failed to finalize previous deployment
               check `journalctl -b -1 -u ostree-finalize-staged.service`
      AutomaticUpdates: disabled
      Deployments:
      ● ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fb065c8d91453ce4a3f5518189b34bce94406c01f43957abde01f08165b3a085 (index: 0)
                          Digest: sha256:fb065c8d91453ce4a3f5518189b34bce94406c01f43957abde01f08165b3a085
                         Version: 412.86.202306271602-0 (2023-07-14T15:33:47Z)
                          Commit: 1ad911e70b7befaad4f3eac5ee14510bbaaecbedb9fb464ffbe3cb38e133576f
                          Staged: no
                       StateRoot: rhcos
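
      The digests in the "Marking Degraded" message can be compared against the rpm-ostree status output. The Python sketch below is only an illustration: the regular expression is an assumption based on the single log line quoted in this report, not a documented machine-config-daemon format, and all values are copied from the excerpts above.

```python
import re

# Tail of the "Marking Degraded" message from the machine-config-daemon log.
degraded_msg = (
    'expected target osImageURL '
    '"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:'
    '5ef4276442c5174d31f6b62a83aa40e64c719275dd731e5ccb0dc98911f7e57e", '
    'have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:'
    'fb065c8d91453ce4a3f5518189b34bce94406c01f43957abde01f08165b3a085" '
    '("1ad911e70b7befaad4f3eac5ee14510bbaaecbedb9fb464ffbe3cb38e133576f")'
)

# Digest and Commit of the booted deployment from rpm-ostree status.
booted_digest = (
    "sha256:fb065c8d91453ce4a3f5518189b34bce94406c01f43957abde01f08165b3a085"
)
booted_commit = "1ad911e70b7befaad4f3eac5ee14510bbaaecbedb9fb464ffbe3cb38e133576f"

# Assumed pattern, derived from the one log line available here.
pattern = re.compile(
    r'expected target osImageURL "[^"@]+@(?P<expected>sha256:[0-9a-f]+)", '
    r'have "[^"@]+@(?P<have>sha256:[0-9a-f]+)" '
    r'\("(?P<commit>[0-9a-f]+)"\)'
)
match = pattern.search(degraded_msg)
if match is None:
    raise ValueError("log line does not match the assumed pattern")

# True when the node is still booted into the image the MCD reports as "have".
still_on_old_image = (
    match.group("have") == booted_digest
    and match.group("commit") == booted_commit
)
# True only if the booted image were the expected target image.
on_target_image = match.group("expected") == booted_digest

print("still on old image:", still_on_old_image)
print("on target image:", on_target_image)
```

      With these values the booted deployment matches the "have" image and commit rather than the expected target, which fits the "failed to finalize previous deployment" warning and "Staged: no".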

      Additional info:

      Every time a minor upgrade is triggered (for example, from 4.12.20 to 4.12.21, 4.12.21 to 4.12.22, or 4.12.23 to 4.12.24), only the infra nodes go into the degraded state.
      
      A simple MCP upgrade, like an update on a machine config for NTP, does not bring the node to a degraded state.

            walters@redhat.com Colin Walters
            rhn-support-dpateriy Divyam Pateriya
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa