OCPBUGS-43481

Machine-API healthz probe failure on SNO Upgrades


    • Important
    • None
    • Proposed
    • False

      [sig-arch] events should not repeat pathologically for ns/openshift-machine-api

      The machine-api pod does not appear to respond to the `/healthz` requests from the kubelet, which causes an increase in probe error events. The pod itself seems to be up, and a preliminary look at Loki shows that the `/healthz` endpoint does start, but the process then loses leader election and restarts before the health probe server comes up again.

      Prow Link
      Loki General Query

      Loki Start/Stop/Query

      (read from bottom up)

      I1016 19:51:31.418815       1 server.go:191] "Starting webhook server" logger="controller-runtime.webhook"
      I1016 19:51:31.418764       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false
      I1016 19:51:31.418703       1 server.go:83] "starting server" name="health probe" addr="[::]:9441"
      I1016 19:51:31.418650       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"		
      2024/10/16 19:51:31 Starting the Cmd.
      
      ...
      
      2024/10/16 19:50:44 leader election lost
      I1016 19:50:44.406280       1 leaderelection.go:297] failed to renew lease openshift-machine-api/cluster-api-provider-machineset-leader: timed out waiting for the condition
      E1016 19:50:44.406230       1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-machineset-leader": context deadline exceeded
      E1016 19:50:37.430054       1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path
      E1016 19:50:04.423920       1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cluster-api-provider-machineset-leader)
      E1016 19:49:04.422237       1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path
      ....
      
      I1016 19:46:21.358989       1 server.go:83] "starting server" name="health probe" addr="[::]:9441"
      I1016 19:46:21.358891       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false
      I1016 19:46:21.358682       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"		
      2024/10/16 19:46:21 Starting the Cmd.
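
Because the excerpt above is listed newest-first and mixes two timestamp formats (klog lines such as `I1016 19:51:31.418815` and plain lines such as `2024/10/16 19:51:31`), the gap between start, lease loss, and restart is easy to misread. The following is a minimal Python sketch that reorders an excerpt chronologically and measures those gaps. The helper names (`parse_timestamp`, `timeline`, `seconds_between`) are hypothetical, the year for klog lines is assumed from the plain lines, and the long error messages are shortened for readability.

```python
from datetime import datetime

# Excerpt of the log lines quoted above, newest first.
# Messages are shortened; timestamps are unchanged.
LOG_LINES = [
    '2024/10/16 19:51:31 Starting the Cmd.',
    '2024/10/16 19:50:44 leader election lost',
    'E1016 19:50:37.430054       1 leaderelection.go:429] Failed to update lock',
    'E1016 19:49:04.422237       1 leaderelection.go:429] Failed to update lock',
    'I1016 19:46:21.358989       1 server.go:83] "starting server" name="health probe"',
    '2024/10/16 19:46:21 Starting the Cmd.',
]

YEAR = 2024  # assumption: klog lines omit the year, so take it from the plain lines


def parse_timestamp(line):
    """Return a datetime for a klog or plain-log line, or None if neither matches."""
    line = line.strip()
    parts = line.split()
    if len(parts) < 2:
        return None
    try:
        # klog format: severity letter, then MMDD, then HH:MM:SS.micros
        if line[0] in "IWEF" and line[1:5].isdigit():
            stamp = f"{YEAR}{parts[0][1:]} {parts[1]}"
            return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")
        # plain Go log format: YYYY/MM/DD HH:MM:SS
        return datetime.strptime(f"{parts[0]} {parts[1]}", "%Y/%m/%d %H:%M:%S")
    except ValueError:
        return None


def timeline(lines):
    """Return (timestamp, line) pairs in chronological order, skipping unparsable lines."""
    events = [(parse_timestamp(l), l.strip()) for l in lines]
    return sorted((e for e in events if e[0] is not None), key=lambda e: e[0])


def seconds_between(events, first_marker, second_marker):
    """Seconds from the first event containing first_marker to the next
    later event containing second_marker."""
    start = next(t for t, msg in events if first_marker in msg)
    end = next(t for t, msg in events if second_marker in msg and t > start)
    return (end - start).total_seconds()


events = timeline(LOG_LINES)
for ts, msg in events:
    print(ts.isoformat(), msg)

print("start -> lease lost:", seconds_between(events, "Starting the Cmd.", "leader election lost"))
print("lease lost -> restart:", seconds_between(events, "leader election lost", "Starting the Cmd."))
```

On this excerpt the process runs for 263 seconds between the first `Starting the Cmd.` and `leader election lost`, and the next `Starting the Cmd.` follows 47 seconds later. Those figures come only from the timestamps quoted in the log and say nothing about what the kubelet probe observed in between.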
      

      Event Filter

            ehila@redhat.com Egli Hila
            ehila@redhat.com Egli Hila
            Neil Hamza Neil Hamza