Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-64719

Windows hybrid-overlay-node service restarts due to cert rotation error

XMLWordPrintable

    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • 3
    • None
    • None
    • None
    • None
    • WINC - Sprint 280
    • 1
    • In Progress
    • Bug Fix
    • Hide
      Previously, the `hybridOverlay` service was not using the trusted CA bundle when connecting to Kubernetes, as the `--k8s-cacert` option was missing from the service command. Because of this, users could encounter trust issues or failures when the `hybridOverlay` service attempted to communicate securely with Kubernetes clusters using custom or internal CAs. With this fix the `hybridOverlay` service command now includes the `--k8s-cacert flag` pointing to the trusted CA bundle. As a result, the `hybridOverlay` service uses the trusted CA bundle for secure communication, preventing trust issues and ensuring compatibility with the cluster.
      Show
      Previously, the `hybridOverlay` service was not using the trusted CA bundle when connecting to Kubernetes, as the `--k8s-cacert` option was missing from the service command. Because of this, users could encounter trust issues or failures when the `hybridOverlay` service attempted to communicate securely with Kubernetes clusters using custom or internal CAs. With this fix the `hybridOverlay` service command now includes the `--k8s-cacert flag` pointing to the trusted CA bundle. As a result, the `hybridOverlay` service uses the trusted CA bundle for secure communication, preventing trust issues and ensuring compatibility with the cluster.
    • None
    • None
    • None
    • None

      Description of problem:

      hybrid-overlay-node service is restarting due to cert rotation errors
      
      This issue was discussed in OCPBUGS-59637. Problem was expected to be addressed in OCP 4.20/WMCO 10.20. Fix was to include --k8s-cacert 
      
      [root@vm-236-67 ~]# oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.20.2    True        False         24h     Cluster version is 4.20.2
      
      [root@vm-236-67 ~]# oc get configmap windows-services-10.20.0-e3c3dfe  -n openshift-windows-machine-config-operator
      NAME                               DATA   AGE
      windows-services-10.20.0-e3c3dfe   3      24h
      
      administrator@WIN-PKOLG940BKQ C:\k>sc qc hybrid-overlay-node
      [SC] QueryServiceConfig SUCCESS
      
      SERVICE_NAME: hybrid-overlay-node
              TYPE               : 10  WIN32_OWN_PROCESS
              START_TYPE         : 3   DEMAND_START
              ERROR_CONTROL      : 0   IGNORE
              BINARY_PATH_NAME   : C:\k\hybrid-overlay-node.exe --node win-pkolg940bkq --bootstrap-kubeconfig=C:\k\kubeconfig --cert-dir=C:\k\cni\config --cert-duration=24h --windows-service --logfile C:\var\log\hybrid-overlay\hybrid-overlay.log --k8s-cacert C:\k\ca-bundle.crt --hybrid-overlay-vxlan-port 4789 --loglevel 5
              LOAD_ORDER_GROUP   :
              TAG                : 0
              DISPLAY_NAME       : hybrid-overlay-node
              DEPENDENCIES       : kubelet
              SERVICE_START_NAME : LocalSystem
      
      
      hybrid-overlay-node logs
      
      I1105 01:41:32.000965    1660 certificate_manager.go:422] "Certificate rotation is enabled" logger="kubernetes.io/kube-apiserver-client"
      I1105 01:41:32.000965    1660 kube.go:426] Certificate found
      I1105 01:41:32.000965    1660 certificate_manager.go:715] "Certificate rotation deadline determined" logger="kubernetes.io/kube-apiserver-client" expiration="2025-11-06 04:50:54 +0000 UTC" deadline="2025-11-06 01:24:15.547806536 +0000 UTC"
      I1105 01:41:32.000965    1660 certificate_manager.go:431] "Waiting for next certificate rotation" logger="kubernetes.io/kube-apiserver-client" sleep="15h42m43.546840836s"
      --
      --
      I1105 17:24:16.330619    1660 certificate_manager.go:566] "Rotating certificates" logger="kubernetes.io/kube-apiserver-client"
      E1105 17:24:16.341965    1660 certificate_manager.go:596] "Failed while requesting a signed certificate from the control plane" err="cannot create certificate signing request: Post \"http://localhost:8443/apis/certificates.k8s.io/v1/certificatesigningrequests\": dial tcp [::1]:8443: connectex: No connection could be made because the target machine actively refused it." logger="kubernetes.io/kube-apiserver-client.UnhandledError"
      E1105 17:24:16.341965    1660 certificate_manager.go:596] "Failed while requesting a signed certificate from the control plane" err="cannot create certificate signing request: Post \"http://localhost:8443/apis/certificates.k8s.io/v1/certificatesigningrequests\": dial tcp [::1]:8443: connectex: No connection could be made because the target machine actively refused it." logger="kubernetes.io/kube-apiserver-client.UnhandledError"
      --
      --
      I1105 20:50:59.533449    1660 certificate_manager.go:387] "Current certificate is expired" logger="kubernetes.io/kube-apiserver-client"
      E1105 20:50:59.533449    1660 kube.go:437] The current certificate is invalid, exiting.
      E1105 20:50:59.533449    1660 kube.go:437] The current certificate is invalid, exiting.
      E1105 20:50:59.533449    1660 kube.go:437] The current certificate is invalid, exiting.
      
      Workaround
      
      It's a temporary workaround until we address this issue
      
      [Environment]::SetEnvironmentVariable('KUBECONFIG','C:\k\kubeconfig','Machine')
      
      After setting the above environment variable , no more restarts were seen
      
          

      Version-Release number of selected component (if applicable):

      OCP 4.20.2
      WMCO 10.20
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Add a windows node to OCP cluster
          2. Wait for 24 hours and check hybrid-overlay-node logs for errors
          3.
          

      Actual results:

      Certificate rotation fails and hybrid-overlay-node service restarts
          

      Expected results:

      Certificate rotation should be successful avoiding hybrid-overlay-node service restarts
          

      Additional info:

      Application outages are seen which coincides with the hybrid-overlay-node service restarts
          

              jvaldes@redhat.com Jose Valdes
              rhn-support-rrajaram Ranjith Rajaram
              None
              None
              Aharon Rasouli Aharon Rasouli
              None
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated: