OCPBUGS-14409 (OpenShift Bugs)

SNO cluster API down after reboot


    • Important
    • No
    • Rejected
    • False
    • 7/5: pending repro --> more/better logs
    • 6/7: telco review pending triage

      Description of problem:

      
      After several consecutive reboots, the SNO cluster API no longer responds:
      
      The connection to the server api.cloudransno-site1.slcm1.bos2.lab:6443 was refused - did you specify the right host or port?
      
      

      Version-Release number of selected component (if applicable):

      4.12.16
      
      

      How reproducible:

      
      We run a test that performs several reboots in a row, and we see this issue at a high rate every time we run it. We saw it 100% of the time in 4.12.16, and it also happened in 4.12.21 the first time we ran the test.
      
      

      Steps to Reproduce:

      1. Reboot the SNO cluster 5 times in a row
      2. Check API
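
The "Check API" step can be automated with a simple TCP probe. The following is a minimal sketch, not the test the reporter ran: the function names `api_reachable` and `wait_for_api`, the retry count, and the delay are all hypothetical choices for illustration. It only checks whether the API port accepts TCP connections, which is enough to distinguish the "connection refused" symptom from a listening server.

```python
import socket
import time


def api_reachable(host: str, port: int = 6443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


def wait_for_api(host: str, port: int = 6443, attempts: int = 60,
                 delay: float = 10.0, timeout: float = 3.0) -> bool:
    """Poll the port until it accepts a connection or attempts run out.

    attempts and delay are illustrative defaults (assumptions), not values
    taken from the original test.
    """
    for attempt in range(attempts):
        if api_reachable(host, port, timeout):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

After each reboot, a call such as `wait_for_api("api.cloudransno-site1.slcm1.bos2.lab")` would return False in the failure case described here, since the port keeps refusing connections.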
      
      

      Actual results:

      
      The node no longer responds. I left it for several hours, but it did not come back.
      
      

      Expected results:

      
      The node recovers properly.
      
      

      Additional info:

      
      System impact: very severe. The node can no longer be used.
      
      ACM reports: 
      
      The kube-apiserver is not ok, status code: 0, Get "https://172.31.0.1:443/livez": dial tcp 172.31.0.1:443: connect: connection refused
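
A health check like the one ACM reports can be approximated with a plain HTTPS GET against the `/livez` path. This is a hedged sketch under several assumptions: `probe_livez` is a hypothetical helper, the `verify_tls` switch exists only because lab clusters often use a private CA, and the 172.31.0.1 address in the message is presumably a cluster-internal service IP, so a probe like this would normally have to run from inside the cluster or target the external API URL instead. It shows how "status code: 0" style failures (no HTTP response at all) differ from an HTTP error status.

```python
import ssl
import urllib.error
import urllib.request


def probe_livez(url: str, timeout: float = 5.0, verify_tls: bool = True) -> str:
    """Return a short description of what the endpoint answered."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption for illustration only: skip certificate checks for
        # clusters signed by a private CA. Not a recommendation.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 401 or 500).
        return f"HTTP {err.code}"
    except urllib.error.URLError as err:
        # No HTTP response at all, for example connection refused.
        return f"unreachable: {err.reason}"
    except OSError as err:
        # Low-level socket errors raised outside urllib's wrapping, such as
        # a timeout while reading the response.
        return f"unreachable: {err}"
```

In the state described in this report, a call against the API server's `/livez` URL would be expected to return an "unreachable" string rather than an HTTP status, because the connection is refused before any request is served.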
      
      oc adm must-gather cannot be run; only an SOS report could be collected. Logs are attached.
      

            rlopezma@redhat.com Rodrigo Lopez Manrique
            rlopezma@redhat.com Rodrigo Lopez Manrique
            Ke Wang