Details
Bug
Resolution: Unresolved
Normal
None
4.12.z
No
Sprint 246, Sprint 247, Sprint 248, Sprint 249, Sprint 250, Sprint 251, Sprint 252
7
Rejected
False
Description
Description of problem:
The cluster version operator (CVO) cannot resolve the internal API hostname and logs the following warning every two seconds:

$ oc logs pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version
2023-10-02T13:03:53.956713295Z W1002 13:03:53.956647 1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
2023-10-02T13:03:55.956235251Z W1002 13:03:55.956166 1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
2023-10-02T13:03:57.956336455Z W1002 13:03:57.956273 1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
2023-10-02T13:03:59.956560629Z W1002 13:03:59.956518 1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
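For triage, the repeated warning boils down to two facts: which hostname failed to resolve and which resolver was asked. The following is a minimal sketch that extracts both, assuming the warning format shown in this report. `failed_lookups` is a hypothetical helper name, not part of `oc`.

```shell
# failed_lookups: read log lines on stdin and print each distinct
# "<hostname> <resolver-ip>" pair found in messages of the form
# "lookup <host> on <ip>:<port>: no such host".
# Hypothetical helper; it assumes the error format seen in this report.
failed_lookups() {
  sed -n 's/.*lookup \([^ ]*\) on \([0-9.]*\):[0-9]*: no such host.*/\1 \2/p' | sort -u
}
```

Piping the command above into it (`oc logs pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version | failed_lookups`) should collapse the four warnings into one line naming `api-int.os-cluster-prod-01.ats-inc.com` and the resolver `10.136.0.10`.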
Version-Release number of selected component (if applicable):
Cluster ID: 1c087e4c-ab97-442d-9eb2-dace2e252958
Cluster Version: 4.12.4
Desired Version: 4.12.4
Channel: stable-4.12
Previous Version(s): 4.12.3, 4.11.26, 4.10.51, 4.10.6(unverified)

Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 172.17.98.107
apiServerInternalIPs: 172.17.98.107
ingressIP: 172.17.98.108
ingressIPs: 172.17.98.108

Network
-------
Network Type: OpenShiftSDN
httpProxy: None
httpsProxy: None
Cluster network: 10.132.0.0/14
Host prefix: 23
Max nodes: 512
Max pods per node: 510
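The `Max nodes` and `Max pods per node` values follow from the cluster network prefix (/14) and the host prefix (23). A quick check of that arithmetic in shell, using only the numbers reported above (the function names are illustrative):

```shell
# Each node is assigned one /<host_prefix> subnet carved out of the
# /<cluster_prefix> cluster network, so the node count is a power of two.
max_nodes() {            # usage: max_nodes <cluster_prefix> <host_prefix>
  echo $(( 1 << ($2 - $1) ))
}

# Pod addresses per node: size of the per-node subnet minus two reserved addresses.
max_pods_per_node() {    # usage: max_pods_per_node <host_prefix>
  echo $(( (1 << (32 - $1)) - 2 ))
}

max_nodes 14 23          # prints 512
max_pods_per_node 23     # prints 510
```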
How reproducible:
The issue began without any apparent trigger; no significant changes had been made to the cluster for some time.
Troubleshooting steps taken so far:
1. On the master0 node, where the cluster version operator is running, the 'dig' command succeeds:

    sh-4.4# dig api-int.os-cluster-prod-01.ats-inc.com

    ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8_6.1 <<>> api-int.os-cluster-prod-01.ats-inc.com
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59604
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    ;; WARNING: recursion requested but not available

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 512
    ; COOKIE: 1183add7ab786034 (echoed)
    ;; QUESTION SECTION:
    ;api-int.os-cluster-prod-01.ats-inc.com. IN A

    ;; ANSWER SECTION:
    api-int.os-cluster-prod-01.ats-inc.com. 16 IN A 172.17.98.107

    ;; Query time: 0 msec
    ;; SERVER: 172.17.98.163#53(172.17.98.163)
    ;; WHEN: Mon Oct 02 17:05:31 UTC 2023
    ;; MSG SIZE rcvd: 133

2. The nameserver 172.17.98.163 that answered is the IP address of the master0 node itself:

    $ cat ip_addr
    2: ens192 inet 172.17.98.163/24 brd 172.17.98.255 scope global dynamic noprefixroute ens192\ valid_lft 80512sec preferred_lft 80512sec

3. The resolver named in the CVO error, 10.136.0.10, is the cluster IP of the CoreDNS service, so the pod is configured to use the expected DNS server (the DNS Operator configures the kubelet to instruct pods to use the CoreDNS service IP address for name resolution):

    $ oc get svc -n openshift-dns
    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    service/dns-default ClusterIP 10.136.0.10 <none> 53/UDP,53/TCP,9154/TCP 227d

4. Restarting the node-resolver pod did not resolve the issue:

    $ oc delete pod/node-resolver-hmlvs -n openshift-dns

5. Deleting the CVO pod did not resolve the issue either:

    $ oc delete pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version

6. I considered adding a forwarding nameserver to CoreDNS, but was advised that this is not a recommended solution. I am therefore opening this Jira issue to seek assistance.
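Steps 1 to 3 involve two different resolvers. The node-local one at 172.17.98.163 answers the query, while the CVO log shows the lookup against the cluster DNS service at 10.136.0.10 failing with "no such host". The following is a minimal sketch for asking both resolvers the same question side by side. It assumes `dig` is available where it runs (for example in a debug shell on the node) and that the service IP is reachable from there. `resolve` and `compare_resolvers` are hypothetical helper names.

```shell
# resolve: ask one DNS server for the A record of a name.
# Prints only IPv4 answers; prints nothing on NXDOMAIN, timeout, or error.
resolve() {              # usage: resolve <server-ip> <name>
  dig +short +time=2 +tries=1 "@$1" "$2" A 2>/dev/null | grep -E '^[0-9.]+$'
}

# compare_resolvers: print one "<server> -> <answer>" line per server queried.
compare_resolvers() {    # usage: compare_resolvers <name> <server-ip>...
  name=$1
  shift
  for server in "$@"; do
    # Join multiple A records onto one line and trim the trailing space.
    answer=$(resolve "$server" "$name" | tr '\n' ' ' | sed 's/ *$//')
    printf '%s -> %s\n' "$server" "${answer:-NO ANSWER}"
  done
}
```

Example invocation with the addresses from this report: `compare_resolvers api-int.os-cluster-prod-01.ats-inc.com 172.17.98.163 10.136.0.10`. If the symptoms described above hold, the first line would show 172.17.98.107 and the second would show NO ANSWER, which would point at the CoreDNS side rather than the node resolver.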