Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: 4.18.0
Affects Version/s: 4.18.0
Component/s: Cloud Compute / Machine API Providers
Labels:
- edge-payload

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Important
Regression: None
Target Backport Versions: None
Target Version: 4.18.0
Release Blocker: Rejected
Sprint: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Release Note Status: None
Release Note Type: None
Release Note Text: None
Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

[sig-arch] events should not repeat pathologically for ns/openshift-machine-api

The machine-api resource does not seem to be responding to the kubelet's `/healthz` requests, which is causing an increase in probe error events. The pod does appear to be up, and a preliminary look at Loki shows that the `/healthz` endpoint comes up as well, but the process loses leader election in between and then starts the health probe again.

Prow Link
Loki General Query

Loki Start/Stop/Query

(read from bottom up)

I1016 19:51:31.418815       1 server.go:191] "Starting webhook server" logger="controller-runtime.webhook"
I1016 19:51:31.418764       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false
I1016 19:51:31.418703       1 server.go:83] "starting server" name="health probe" addr="[::]:9441"
I1016 19:51:31.418650       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"		
2024/10/16 19:51:31 Starting the Cmd.

...

2024/10/16 19:50:44 leader election lost
I1016 19:50:44.406280       1 leaderelection.go:297] failed to renew lease openshift-machine-api/cluster-api-provider-machineset-leader: timed out waiting for the condition
E1016 19:50:44.406230       1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-api-provider-machineset-leader": context deadline exceeded
E1016 19:50:37.430054       1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path
E1016 19:50:04.423920       1 leaderelection.go:436] error retrieving resource lock openshift-machine-api/cluster-api-provider-machineset-leader: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io cluster-api-provider-machineset-leader)
E1016 19:49:04.422237       1 leaderelection.go:429] Failed to update lock optimitically: rpc error: code = DeadlineExceeded desc = context deadline exceeded, falling back to slow path
....

I1016 19:46:21.358989       1 server.go:83] "starting server" name="health probe" addr="[::]:9441"
I1016 19:46:21.358891       1 server.go:247] "Serving metrics server" logger="controller-runtime.metrics" bindAddress=":8082" secure=false
I1016 19:46:21.358682       1 server.go:208] "Starting metrics server" logger="controller-runtime.metrics"		
2024/10/16 19:46:21 Starting the Cmd.

Event Filter

links to

openshift/machine-api-operator#1299: OCPBUGS-43481: fix: health probes paths and updated timing

RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update

Assignee: Egli Hila

Reporter: Egli Hila

QA Contact: Neil Hamza

Need Info From: None

Votes: 0

Watchers: 6

Created: 2024/10/17 6:34 AM

Updated: 2025/07/20 1:12 PM

Resolved: 2024/12/20 11:56 AM
