OpenShift Bugs / OCPBUGS-7887

[4.12] crio.service should use a more safe restart policy to provide recoverability against concurrency issues

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • 4.12.z
    • 4.13
    • Node / CRI-O
    • Moderate

      Description of problem:

      The default crio.service unit is configured with:
      
      Restart=on-abnormal
      TimeoutStartSec=0
      
      With this configuration, crio.service (and its dependents) can fail permanently at boot if the first start attempt fails.
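
Why a single failed start can be permanent follows from the Restart= table in systemd.service(5): on-abnormal restarts a service after an unclean signal, a timeout, or a watchdog event, but not after a non-zero exit status. The lookup below is a rough sketch of that table. The function name and the cause labels are made up for illustration, and it assumes crio exits with a non-zero status when the bind fails.

```shell
#!/bin/sh
# Sketch of systemd's Restart= decision table (see systemd.service(5)).
# Usage: would_restart POLICY CAUSE   -> prints "yes" or "no"
# The function name and the CAUSE labels (clean, exit-code, signal, timeout,
# watchdog) are illustrative only; they are not systemd identifiers.
would_restart() {
    policy=$1
    cause=$2
    case "$policy:$cause" in
        always:*)
            echo yes ;;
        on-success:clean)
            echo yes ;;
        on-failure:exit-code|on-failure:signal|on-failure:timeout|on-failure:watchdog)
            echo yes ;;
        on-abnormal:signal|on-abnormal:timeout|on-abnormal:watchdog)
            echo yes ;;
        on-abort:signal)
            echo yes ;;
        on-watchdog:watchdog)
            echo yes ;;
        *)
            echo no ;;
    esac
}

# A start that ends with a non-zero exit status is not retried under on-abnormal.
would_restart on-abnormal exit-code   # prints: no
would_restart on-failure exit-code    # prints: yes
would_restart always exit-code        # prints: yes
```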
      
      For example, a possible concurrency issue between the network configuration scripts in OVN Baremetal IPI clusters can cause network-online.target to be reached while the network is still being configured. Once network-online.target is reached, crio and kubelet try to start, and crio can fail with:
      
      "Failed to start streaming server: listen tcp 192.168.90.42:10010: bind: cannot assign requested address"
      
      Eventually, the network configuration converges, and the services can be started manually. This happened in some QE upgrade CI runs and can leave cluster nodes 'paused' and not ready until a system administrator manually restarts the services.
      
      As a follow-up to the discussion in Slack, and although the network issue deserves a separate investigation, we propose changing the restart policy of the crio.service unit to improve safety and recoverability against "byzantine" faults such as ordering violations in the systemd dependency graph.
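
For illustration only, a changed restart policy could be expressed as a systemd drop-in along the following lines. The file name, the helper function, and the values Restart=on-failure and RestartSec=10 are assumptions chosen for this sketch; they are not taken from this report and are not necessarily what was shipped as the fix.

```shell
#!/bin/sh
# Hypothetical sketch: write a drop-in that overrides the crio.service restart policy.
# write_crio_restart_dropin DIR creates DIR/10-restart.conf (name chosen for illustration).
write_crio_restart_dropin() {
    dir=$1
    mkdir -p "$dir"
    cat > "$dir/10-restart.conf" <<'EOF'
[Service]
# Assumed values for illustration, not the shipped fix.
Restart=on-failure
RestartSec=10
EOF
}

# Write to a scratch directory so the result can be inspected safely.
out_dir=$(mktemp -d)
write_crio_restart_dropin "$out_dir/crio.service.d"
cat "$out_dir/crio.service.d/10-restart.conf"
```

On a real node the drop-in directory would be /etc/systemd/system/crio.service.d/, followed by `systemctl daemon-reload`. On OpenShift nodes such a change would normally be delivered through the Machine Config Operator rather than edited by hand.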
       

      How reproducible:

      Not always

      Steps to Reproduce:

      The following steps refer to the example above:
      
      1. Install a 4.11 IPv4-only BM IPI cluster with OVN and a provisioning network (4.11.0-0.nightly-arm64-2022-11-27-133933)
      2. Upgrade to 4.12.0-0.nightly-arm64-2022-11-29-032225
      3. After a worker is rebased and rebooted, it does not reconcile, leaving the MCP degraded.
      (Manual recovery)
      4. Log in to the worker and verify that the network is up and crio is a failed unit.
      5. Start crio manually (you might need to start kubelet as well)
      6. The worker node object reconciles and the node is ready
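
The manual recovery in steps 4 to 6 can be sketched as the commands below, run as root on the affected worker. The helper names `run` and `recover_crio` and the `DRY_RUN` switch are hypothetical conveniences added for this sketch; the underlying `systemctl` and `ip` invocations are standard.

```shell
#!/bin/sh
# Manual recovery sketch for steps 4-6.
# run() and recover_crio() are hypothetical helper names.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_crio() {
    # Step 4: check that addresses are configured and that crio is a failed unit.
    run ip -brief address show
    run systemctl is-failed crio.service
    # Step 5: start crio, then kubelet in case it did not come up either.
    run systemctl start crio.service
    run systemctl start kubelet.service
    # Step 6: print the current state of both services.
    run systemctl is-active crio.service kubelet.service
}

# Dry run: only print what would be executed.
DRY_RUN=1
recover_crio
```

Once both services are active, `oc get nodes` from an administrator workstation should eventually show the worker as Ready.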
      

      Additional info:

      - The issue could be related to a temporary network failure.
      - kubelet is already configured with the `always` restart policy
      
      Slack conversation: https://coreos.slack.com/archives/CK1AE4ZCK/p1669727807409689

       

            pehunt@redhat.com Peter Hunt
            rhn-support-adistefa Alessandro Di Stefano
            Sunil Choudhary Sunil Choudhary