OCPBUGS-30637

[SNO 4.12] kube-apiserver fails to start after power outage: certificate issue


Details

    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • 4.12.z
    • kube-apiserver
    • None
    • Important
    • No
    • False

    Description

      Description of problem:

      After a brief power outage, a single-node OpenShift (SNO) cluster was not able to start properly.
      kube-apiserver was failing, and the error message appears to indicate a certificate issue.
      Only 12 minutes separate the last journalctl line from the boot of the node.
      
      What we would like to know:
      - Are the kube-apiserver certificates stored only in the etcd database?
      - If so, how often are those certificates regenerated?
      - Could an uncommitted write during a certificate regeneration have led to this issue?
      - If the issue happens again, is it possible to restore the SNO without a redeployment, and if so, how?
      - Would having an etcd backup have made a restore plan possible?
      - If the issue is not in etcd, what should be backed up from the filesystem to be able to restore kube-apiserver?
      
      

      Version-Release number of selected component (if applicable):

      4.12.40
      Baremetal
      x86_64
      Disconnected environment    

      How reproducible:

      The customer was not able to reproduce the issue and/or has not tried cutting power on a lab node to reproduce it.
      However, a crash of a node does not lead to this issue.

      Steps to Reproduce:

      N/A

      Actual results:

      kube-apiserver fails to start and the node had to be redeployed.

      Expected results:

      A power outage should not affect the filesystem or etcd content in a way that leads to a non-working SNO.

      Additional info:

      We only have a sosreport.
      We lack a must-gather because kube-apiserver is not up, so oc commands cannot work.
      
      
      From the sosreport we can see:
      
      Last message from the node:
          Mar 02 16:08:34 aaa.bbb.ccc bash[7384]: I0302 16:08:34.147230    7384 kubelet_pods.go:897] "Unable to retrieve pull secret, the image pull may not succeed." pod="mtcil/fmaas-67b8c7f7f6-jlgr8" secret="" err="secret \"mtcil-imagepull-secret\" not found"
      
      The node then boots at:
          Mar 02 16:20:36 localhost kernel: Linux version .....
      
      
      
      We can also see that kube-apiserver is relaunched in a loop, roughly every 50 minutes on average.
      $ ll kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-*|grep -vE '(cert|insecure|check)'
      
      lrwxrwxrwx. 1 yank yank 147 Mar  2 16:46 kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-74a7352e207cea7f4ea4fc4278009e5672f8b4a20f8fdf5a40adc8cb95398b3f.log
      lrwxrwxrwx. 1 yank yank 148 Mar  2 17:22 kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-d712f8d40798bc6631f441f4402a03d515396b5504ba72a4e5a5c207b42a7780.log
      lrwxrwxrwx. 1 yank yank 148 Mar  2 18:48 kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-5b1296ed6e1f2ad741d9f673a5985096b73b1ec2f50ba1b2a39de220b3011b01.log
      lrwxrwxrwx. 1 yank yank 148 Mar  2 19:08 kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-125abcb782940e238cf5b804e074b71d8054656689a1f138cfa8b4898c5cfccc.log
      lrwxrwxrwx. 1 yank yank 148 Mar  2 19:58 kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-a350824e29849492f4979a9b196528315646975493c184d7de8e4860a1d65e74.log
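
The "roughly every 50 minutes" figure can be checked against the listing above. The following is a minimal sketch, assuming the symlink modification times in the sosreport reflect when each container log was started and that all timestamps fall on the same day; the function name `gaps` is hypothetical.

```shell
# gaps HH:MM ...: print the number of minutes between consecutive
# timestamps (all assumed to be on the same day, in ascending order).
gaps() {
  printf '%s\n' "$@" |
    awk -F: '{ now = $1 * 60 + $2; if (NR > 1) print now - prev; prev = now }'
}

# Timestamps taken from the listing above.
gaps 16:46 17:22 18:48 19:08 19:58
# prints 36, 86, 20 and 50, one per line
```

The gaps vary between 20 and 86 minutes; their mean is 48 minutes, which matches the approximate figure given above.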
      
      
      Each occurrence leads to the same issue:
      
      ~~~
      kube-apiserver-aaa.bbb.ccc_openshift-kube-apiserver_kube-apiserver-74a7352e207cea7f4ea4fc4278009e5672f8b4a20f8fdf5a40adc8cb95398b3f.log
      
      2024-03-02T16:46:48.279083255+00:00 stderr F I0302 16:46:48.279046      16 server.go:622] external host was not specified, using 2401:4900:14:434:0:b6:0:b05
      2024-03-02T16:46:48.279620600+00:00 stderr F I0302 16:46:48.279511      16 server.go:202] Version: v1.25.11+1485cc9
      2024-03-02T16:46:48.279665190+00:00 stderr F I0302 16:46:48.279601      16 server.go:204] "Golang settings" GOGC="100" GOMAXPROCS="" GOTRACEBACK=""
      2024-03-02T16:46:48.280313076+00:00 stderr F I0302 16:46:48.280259      16 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/etc/kubernetes/static-pod-certs/secrets/service-network-serving-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/service-network-serving-certkey/tls.key"
      2024-03-02T16:46:48.280575042+00:00 stderr F I0302 16:46:48.280538      16 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/etc/kubernetes/static-pod-certs/secrets/localhost-serving-cert-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/localhost-serving-cert-certkey/tls.key"
      2024-03-02T16:46:48.280948527+00:00 stderr F I0302 16:46:48.280906      16 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/etc/kubernetes/static-pod-certs/secrets/service-network-serving-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/service-network-serving-certkey/tls.key"
      2024-03-02T16:46:48.281291695+00:00 stderr F I0302 16:46:48.281246      16 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/etc/kubernetes/static-pod-certs/secrets/external-loadbalancer-serving-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/external-loadbalancer-serving-certkey/tls.key"
      2024-03-02T16:46:48.281647178+00:00 stderr F I0302 16:46:48.281603      16 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.key"
      2024-03-02T16:46:48.281953998+00:00 stderr F I0302 16:46:48.281906      16 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="sni-serving-cert::/etc/kubernetes/static-pod-resources/secrets/localhost-recovery-serving-certkey/tls.crt::/etc/kubernetes/static-pod-resources/secrets/localhost-recovery-serving-certkey/tls.key"
      2024-03-02T16:46:48.613084229+00:00 stderr F E0302 16:46:48.613012      16 run.go:74] "command failed" err="missing content for CA bundle \"client-ca-bundle::/etc/kubernetes/static-pod-certs/configmaps/client-ca/ca-bundle.crt\""
      2024-03-02T16:46:48.617706173+00:00 stderr F I0302 16:46:48.617664       1 main.go:235] Termination finished with exit code 1
      ~~~
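
The error above says the client CA bundle has no content. If the node's filesystem is still accessible, a quick check could tell whether that file is absent, zero-length, or present but without any PEM certificate. This is only a sketch: the function name `bundle_state` is hypothetical, and the host-side location of the file is an assumption, since the path in the log is the one seen from inside the container.

```shell
# bundle_state FILE: classify a CA bundle file as
# missing, empty, no-pem (no PEM certificate header), or ok.
bundle_state() {
  if [ ! -e "$1" ]; then
    echo missing
  elif [ ! -s "$1" ]; then
    echo empty
  elif ! grep -q -e '-----BEGIN CERTIFICATE-----' "$1"; then
    echo no-pem
  else
    echo ok
  fi
}

# Example call (the host path below is an assumption, not confirmed by the report):
# bundle_state /etc/kubernetes/static-pod-resources/kube-apiserver-certs/configmaps/client-ca/ca-bundle.crt
```

A result of `empty` would be consistent with a write that was interrupted by the power loss, but it would not prove that this is the cause.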
      
      
      
      

       


          People

            Unassigned
            rhn-support-jpeyrard Johann Peyrard
            Ke Wang