OpenShift Bugs / OCPBUGS-29820

Time drift event affects the ready status of the sdn container

Details

    • Moderate
    • No
    • SDN Sprint 250
    • 1
    • False
    • No mention of the cause for the initial time drift; workaround is to delete sdn- pod in openshift-sdn

    Description

      Description of problem:

      The sdn container in the sdn pod can get stuck with an unready status after a time drift event on the node.
      
      If the node boots with a misconfigured clock and chronyd then steps the time by about one hour, the sdn container stays unready for far longer than normal. The container usually becomes ready within seconds, but after a time drift event it can remain unready for several minutes, in some cases more than 20. The sdn pod recovers after being deleted.
      
      Although the container does not transition to ready, its logs show that it is functioning properly. For example:
      
      - There is the message 'openshift-sdn network plugin ready'
      - The CNI_ADD and CNI_DEL events happen properly    
      
      See the linked must-gather for more detail.
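      One plausible way a readiness check can misbehave like this is if it compares a wall-clock timestamp (e.g. a heartbeat file's mtime) against "now": after chronyd steps the clock, the heartbeat suddenly looks stale. The sketch below simulates that failure mode with the ~3722-second step from this report; it is an assumption for illustration, not the actual sdn readiness code.
      
      ~~~
      # Sketch (assumption, not the actual sdn implementation): a health check
      # that compares a heartbeat file's wall-clock mtime against "now" breaks
      # when the clock steps.
      heartbeat=$(mktemp)
      # Simulate the effect of chronyd stepping the clock ~1h forward by
      # backdating the heartbeat by the 3722s step seen in this report:
      touch -d '-3722 seconds' "$heartbeat"
      age=$(( $(date +%s) - $(date -r "$heartbeat" +%s) ))
      if [ "$age" -gt 60 ]; then
        # The probe would keep failing until a fresh heartbeat is written.
        echo "unready: heartbeat is ${age}s old"
      fi
      rm -f "$heartbeat"
      ~~~
      
      Under this (hypothetical) scheme the check only recovers once the heartbeat is rewritten, which is consistent with deleting the pod clearing the condition.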

      Version-Release number of selected component (if applicable):

      4.12.39    

      How reproducible:

      I was able to reproduce the bug with the steps below:
      
      
      - Cluster version
      
      ~~~
      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.39   True        False         104m    Cluster version is 4.12.39
      
      ~~~
      
      - At the moment of the reboot, I configured the date below on the OCP node: 
      
      ~~~
      $ ssh quickcluster@worker-0.bgomes41239.lab -- sudo date 022008492024
      ~~~
      
      - The sdn pod then gets stuck in the unready state: 
      
      ~~~
      -- Reboot --
      Feb 20 08:52:10 worker-0.bgomes41239.lab. chronyd[974]: Selected source x.x.x.x (clock.redhat.com)
      Feb 20 08:52:10 worker-0.bgomes41239.lab. chronyd[974]: System clock wrong by 3722.629599 seconds <-------
      Feb 20 09:54:12 worker-0.bgomes41239.lab. chronyd[974]: System clock was stepped by 3722.629599 seconds
      
      $ date && oc get pod -n openshift-sdn -owide sdn-9dr2f 
      Tue Feb 20 09:53:05 AM WET 2024 <------ ~3 minutes in the unready state, where the container usually takes seconds to become ready; some events delay more than 20 minutes
      NAME        READY   STATUS    RESTARTS   AGE   IP           NODE                                               NOMINATED NODE   READINESS GATES
      sdn-9dr2f   1/2     Running   14         41m   10.0.89.78   worker-0.bgomes41239.lab.   <none>           <none>
      
        - lastProbeTime: null
          lastTransitionTime: "2024-02-20T09:50:56Z"
          message: 'containers with unready status: [sdn]'
          reason: ContainersNotReady
          status: "False"
      
      
      ~~~
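      The numbers in the log above can be cross-checked with quick date arithmetic (values copied from this report; WET is UTC+0 in February, so the timestamps compare directly):
      
      ~~~
      # chronyd reported the size of the step: ~62 minutes, matching the node
      # date being set roughly one hour behind.
      drift='3722.629599'
      minutes=$(( ${drift%.*} / 60 ))
      echo "clock stepped by ${drift}s (~${minutes} minutes)"
      
      # Time unready since lastTransitionTime, as of the `date` output above:
      transition=$(date -ud '2024-02-20T09:50:56Z' +%s)
      observed=$(date -ud '2024-02-20T09:53:05Z' +%s)
      echo "unready for $(( observed - transition ))s at the time of the snapshot"
      ~~~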
      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

       


          People

            sdn-team-bot sdn-team bot
            rhn-support-bgomes Bruno Gomes
            Zhanqi Zhao Zhanqi Zhao