- Bug
- Resolution: Done
- Critical
- 4.10.z
- Quality / Stability / Reliability
Description of problem:
The metrics API does not respond, the kube-apiserver logs errors like the ones below, and oc commands take a very long time (>30s) to respond from the bastion.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2022-08-31T13:28:14.786189711+06:00 E0831 07:28:14.786058 18 available_controller.go:546] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1: Get "https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1": dial tcp 10.128.10.197:6443: connect: connection refused
2022-08-31T13:28:14.792678415+06:00 I0831 07:28:14.792496 18 available_controller.go:496] "changing APIService availability" name="v1beta1.metrics.k8s.io" oldStatus=False newStatus=False message="failing or missing response from https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1: Get \"https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1\": dial tcp 10.128.10.197:6443: connect: connection refused" reason="FailedDiscoveryCheck"
2022-08-31T13:28:14.800577820+06:00 E0831 07:28:14.800509 18 available_controller.go:546] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1: Get "https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1": dial tcp 10.128.10.197:6443: connect: connection refused
2022-08-31T13:28:14.801286534+06:00 E0831 07:28:14.801214 18 available_controller.go:546] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1: Get "https://10.128.10.197:6443/apis/metrics.k8s.io/v1beta1": dial tcp 10.128.10.197:6443: connect: connection refused
2022-08-31T13:28:15.660473185+06:00 E0831 07:28:15.660371 18 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
2022-08-31T13:28:15.660473185+06:00 I0831 07:28:15.660385 18 controller.go:129] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
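For reference, one way to check whether the aggregated metrics API has a healthy backend is to inspect the APIService object and the prometheus-adapter service behind it. The commands below are an illustrative sketch that assumes the default openshift-monitoring namespace, service name, and label selector; adjust them to the cluster.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Availability condition reported by the kube-apiserver aggregator
oc get apiservice v1beta1.metrics.k8s.io -o yaml

# prometheus-adapter serves metrics.k8s.io by default; check its pods and endpoints
# (label selector may differ by release; "oc -n openshift-monitoring get pods | grep prometheus-adapter" also works)
oc -n openshift-monitoring get pods -l app.kubernetes.io/name=prometheus-adapter -o wide
oc -n openshift-monitoring get endpoints prometheus-adapter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When the APIService reports Available=False with reason FailedDiscoveryCheck, the pod IP the aggregator is dialing (10.128.10.197 in the log above) would normally correspond to one of those endpoints.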
The following is repeated in the verbose output of oc commands (oc get node -v=10):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I0831 13:14:40.047925 932175 round_trippers.go:466] curl -v -XGET -H "Authorization: Bearer <masked>" -H "Accept: application/json, */*" -H "User-Agent: oc/4.10.0 (linux/amd64) kubernetes/45460a5" 'https://api.prod.banglalinkgsm.com:6443/apis/metrics.k8s.io/v1beta1?timeout=32s'
I0831 13:14:40.054588 932175 round_trippers.go:570] HTTP Statistics: GetConnection 0 ms ServerProcessing 5 ms Duration 6 ms
I0831 13:14:40.054629 932175 round_trippers.go:577] Response Headers:
I0831 13:14:40.054648 932175 round_trippers.go:580] Audit-Id: 67b54a4d-b314-4a99-9f51-040bd3728aee
I0831 13:14:40.054674 932175 round_trippers.go:580] Audit-Id: 67b54a4d-b314-4a99-9f51-040bd3728aee
I0831 13:14:40.054695 932175 round_trippers.go:580] Cache-Control: no-cache, private
I0831 13:14:40.054713 932175 round_trippers.go:580] Cache-Control: no-cache, private
I0831 13:14:40.054732 932175 round_trippers.go:580] Content-Type: text/plain; charset=utf-8
I0831 13:14:40.054772 932175 round_trippers.go:580] X-Kubernetes-Pf-Prioritylevel-Uid: d08ae54f-6aa2-4f0b-8188-05518c09d563
I0831 13:14:40.054795 932175 round_trippers.go:580] Content-Length: 43
I0831 13:14:40.054814 932175 round_trippers.go:580] Date: Wed, 31 Aug 2022 07:14:13 GMT
I0831 13:14:40.054833 932175 round_trippers.go:580] Retry-After: 1
I0831 13:14:40.054852 932175 round_trippers.go:580] X-Content-Type-Options: nosniff
I0831 13:14:40.054871 932175 round_trippers.go:580] X-Kubernetes-Pf-Flowschema-Uid: 373372fa-869f-4ae0-976f-33a6c0da1286
I0831 13:14:40.054958 932175 with_retry.go:171] Got a Retry-After 1s response for attempt 9 to https://api.prod.banglalinkgsm.com:6443/apis/metrics.k8s.io/v1beta1?timeout=32s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
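This shows the client retrying API discovery against the broken metrics.k8s.io group and backing off on each Retry-After: 1 response, which is what makes every oc command slow. An illustrative way to observe the same thing from the bastion (plain oc calls, nothing cluster-specific assumed):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Hit the aggregated discovery endpoint the client keeps retrying; while the backend
# is down this returns a 503 "service unavailable" instead of the API resource list
oc get --raw /apis/metrics.k8s.io/v1beta1

# Lower verbosity is enough to see the retries without the full header dump
oc get nodes -v=6
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~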
The following errors are seen in the kube-controller-manager logs:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2022-08-31T13:28:03.622075963+06:00 E0831 07:28:03.622017 1 horizontal.go:226] failed to compute desired number of replicas based on listed metrics for Deployment/sdp-prod-apps/renewal-datasync: invalid metrics (2 invalid out of 2), first error is: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
2022-08-31T13:28:03.622100808+06:00 I0831 07:28:03.622078 1 event.go:294] "Event occurred" object="sdp-prod-apps/renewal-datasync" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)"
2022-08-31T13:28:03.622117106+06:00 I0831 07:28:03.622104 1 event.go:294] "Event occurred" object="sdp-prod-apps/renewal-datasync" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedComputeMetricsReplicas" message="invalid metrics (2 invalid out of 2), first error is: failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)"
2022-08-31T13:28:09.707854155+06:00 E0831 07:28:09.707746 1 resource_quota_controller.go:413] unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2022-08-31T13:43:14.941377920+06:00 I0831 07:43:14.941290 1 event.go:294] "Event occurred" object="sdp-prod-apps/cdruploader" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)"
2022-08-31T13:52:40.358398516+06:00 I0831 07:52:40.358321 1 event.go:294] "Event occurred" object="sdp-prod-apps/consent-service" kind="HorizontalPodAutoscaler" apiVersion="autoscaling/v2" type="Warning" reason="FailedGetResourceMetric" message="failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Everything started working fine after restarting the prometheus-adapter and thanos-querier pods, and the oc commands also started responding quickly again.
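The impact on autoscaling can be confirmed while the issue is present. The commands below are a sketch; the namespace and HPA names are taken from the log excerpt above and will differ in other clusters.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# HPAs based on resource metrics show <unknown> targets while metrics.k8s.io is unavailable
oc get hpa -A
oc -n sdp-prod-apps describe hpa renewal-datasync

# The FailedGetResourceMetric / FailedComputeMetricsReplicas warnings also appear as events
oc -n sdp-prod-apps get events --field-selector type=Warning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~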
Version-Release number of selected component (if applicable):
4.10.z (the client reports oc/4.10.0 in the verbose output above)
How reproducible:
Medium
Steps to Reproduce:
Not known; the issue happens randomly on its own, and a restart of the prometheus-adapter and thanos-querier pods fixes it temporarily (see the sketch below).
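The temporary workaround can be applied as a rolling restart rather than deleting pods by hand. This is a sketch that assumes the default openshift-monitoring namespace and deployment names (prometheus-adapter, thanos-querier); verify the names first.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Confirm the deployment names before restarting (they can vary between releases)
oc -n openshift-monitoring get deploy

# Temporary workaround: restart the metrics adapter and the querier it talks to
oc -n openshift-monitoring rollout restart deployment/prometheus-adapter
oc -n openshift-monitoring rollout restart deployment/thanos-querier
oc -n openshift-monitoring rollout status deployment/prometheus-adapter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~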
Actual results:
oc commands respond very slowly and the metrics API returns errors
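Both symptoms can be demonstrated with a couple of illustrative checks (standard oc commands, nothing cluster-specific assumed):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Resource-metrics queries fail while the APIService is unavailable
oc adm top nodes
oc adm top pods -A

# Ordinary requests are slowed by the repeated Retry-After responses during discovery
time oc get nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~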
Expected results:
The metrics API should not return such errors, and oc commands should not be slowed down as a result
Additional info:
Must-gather: https://attachments.access.redhat.com/hydra/rest/cases/03302475/attachments/8be4122c-5626-4b8d-a138-f3042e6f3765?usePresignedUrl=true