- Bug
- Resolution: Duplicate
- Undefined
- None
- 4.14.z
- None
- Important
- No
- SDN Sprint 249
- 1
- False
Description of problem:
Upgrade from 4.12.28 --> 4.13.24 (successful). Upgrade from 4.13.24 --> 4.14.10 (partial/failed).

Observed that after all COs upgraded, the node MCP rollout started. One master and one worker restarted, and the hostname for those nodes changed **unintentionally** from ip-<address>.ec2.internal to ip-<address>.<customdomain>.local.

The journal logs show the following:
~~~
journalctl --no-pager:
Feb 03 19:38:46 ip-10-131-136-180 kubenswrapper[1381]: I0203 19:38:46.097670 1381 flags.go:64] FLAG: --hostname-override="ip-10-131-136-180.<customdomain>.local"
~~~
It is unclear why or how this value is being injected. The VPC has a custom domain that matches the value above, but it has been configured that way since 4.10 or earlier with no impact to node health or cluster handling.
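To narrow down where the override comes from, it can help to compare the running kubelet invocation and its systemd drop-ins against what the node itself reports. The commands below are a generic sketch using standard tooling, not output taken from this case:
~~~
# On an affected node (oc debug node/<node> ; chroot /host):

# Confirm the running kubelet was started with --hostname-override and show the value
journalctl --no-pager -u kubelet | grep -m1 -- '--hostname-override'

# Inspect the kubelet unit and every drop-in that could be injecting the flag
systemctl cat kubelet

# Compare against the hostname the OS itself reports
hostnamectl status
cat /proc/sys/kernel/hostname
~~~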
Version-Release number of selected component (if applicable):
4.14.10 is when the issue was first observed.
How reproducible:
Every time.
Steps to Reproduce:
1. Have an AWS cluster and upgrade it from 4.13.24 to 4.14.10.
2. Observe that after the CO rollover succeeds, the master/worker MCPs begin rolling out updates, and the first nodes come back up with new hostnames. (Pause the MCPs to prevent further changes -- see the sketch after these steps.)
3. Observe that hostnames previously issued as <IP>.ec2.internal are now <IP>.<custom-domain>.internal.
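Pausing the pools can be done with the standard MachineConfigPool pause field; this is a generic sketch, not a command taken from this case:
~~~
# Pause MCP rollouts so no further nodes pick up the new hostname
oc patch machineconfigpool master --type merge -p '{"spec":{"paused":true}}'
oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":true}}'

# Check which nodes have already been renamed
oc get nodes -o wide

# Resume rollouts once the hostname issue is understood/resolved
oc patch machineconfigpool master --type merge -p '{"spec":{"paused":false}}'
oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":false}}'
~~~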
Actual results:
Cluster is partially degraded and stuck between versions.
Expected results:
The cluster upgrade should complete normally with no degraded status, and the hostnames on AWS should not change unless explicitly set (note that changing hostnames on AWS as a day-2 operation is not supported anyway). This change is unexpected for the customer, and no DNS modifications have been introduced that would otherwise trigger it.
Additional info:
Note: this is a dev cluster. The customer currently has a possible workaround for logins using kubeconfig placement/token-based auth that skips the IDP (IDP login is currently impacted/unavailable on the cluster while nodes are degraded). However, this cluster serves as a production instance for their developer teams, so they are also blocked from upgrading their prod instance. The cluster is partially stable, but they expect to revert to an etcd backup if no workaround/solution is available before too long.

The journal logs show the hostname being overridden:
Journal logs:

Initial name applied:
~~~
Feb 04 00:41:22 localhost systemd-journald[319]: Journal stopped
Feb 04 00:41:23 ip-10-131-136-180 ostree-prepare-root[633]: sysroot.readonly configuration value: 1
Feb 04 00:41:23 ip-10-131-136-180 systemd[1]: Finished OSTree Prepare OS/.
~~~
... secondary name applied:
~~~
Feb 04 00:41:32 ip-10-131-136-180 crio[1351]: time="2024-02-04 00:41:32.470506414Z" level=info msg="Started container" PID=2382 containerID=e18fbdb96ca1eed480fc9b430c369c30de74d243c121ff76da2261af8d2b88ca description=openshift-etcd/etcd-ip-10-131-136-180.<CUSTOM>.local/etcd-readyz id=faff0ef6-9f56-4b79-86b4-6cc122df2eb4 name=/runtime.v1.RuntimeService/StartContainer sandboxID=d827217abc36e5fcd6494e38c5d9f43bc867b46c5a67c0a8aa9e8cd816000e5f
Feb 04 00:41:32 ip-10-131-136-180 kubenswrapper[1374]: I0204 00:41:32.566016 1374 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-131-136-180.<CUSTOM>.local" is forbidden: User "system:node:ip-10-131-136-180.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Feb 04 00:41:32 ip-10-131-136-180 kubenswrapper[1374]: I0204 00:41:32.734353 1374 kubelet.go:2457] "SyncLoop (PLEG): event for pod" pod="openshift-etcd/etcd-ip-10-131-136-180.<CUSTOM>.local" event=&{ID:1faab27ba692b933fc213b08c8e8cc92 Type:ContainerStarted Data:e18fbdb96ca1eed480fc9b430c369c30de74d243c121ff76da2261af8d2b88ca}
~~~
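The "forbidden" message above indicates the kubelet is still authenticating with its old identity (system:node:ip-10-131-136-180.ec2.internal) while registering objects under the new name. A few generic checks along those lines, using standard oc commands rather than output from this case, are sketched below:
~~~
# Node objects vs. the names the kubelets are trying to register
oc get nodes -o wide

# Any pending kubelet client/serving CSRs issued under the new hostname
oc get csr | grep -i pending

# Grep the affected node's kubelet logs for the identity mismatch
oc adm node-logs <node-name> -u kubelet | grep -i 'forbidden\|hostname-override'
~~~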