OpenShift Bugs / OCPBUGS-29432

OCP 4.14.10 - AWS node hostnames changed unexpectedly from <IP-address>.ec2.internal to <IP-address>.<domain>.internal


Details

    Description

      Description of problem:

          Upgrade from 4.12.28 --> 4.13.24 (successful)
          Upgrade from 4.13.24 --> 4.14.10 (partial/failed) --> Observed that after all cluster operators (COs) had upgraded, the MCP rollout on the nodes started. 1 master and 1 worker restarted, and the hostnames of those nodes changed **unintentionally** from ip-<address>.ec2.internal to ip-<address>.<customdomain>.local.
      
      The following was observed in the journal logs:
      
      journalctl_--no-pager:Feb 03 19:38:46 ip-10-131-136-180 kubenswrapper[1381]: I0203 19:38:46.097670    1381 flags.go:64] FLAG: --hostname-override="ip-10-131-136-180.<customdomain>.local"
      
      It is unclear why or how this value is being injected.
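
      A minimal way to trace where the override comes from (a diagnostic sketch, assuming node access via oc debug; the exact drop-in locations are typical RHCOS paths and are not confirmed by this report):

      ~~~
      # Show the kubelet unit plus any drop-ins, and search systemd config for the flag
      oc debug node/ip-10-131-136-180.ec2.internal -- chroot /host \
        sh -c 'systemctl cat kubelet.service; grep -r "hostname-override" /etc/systemd/system/'
      ~~~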
      
      The VPC has a custom domain name that matches the value above, but it has been set that way since 4.10 or earlier with no impact on node health or cluster handling.
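
      For reference, the domain-name in the VPC's DHCP options set and the hostname EC2 hands the instance can be checked as follows (a sketch; <vpc-id> is a placeholder and IMDSv2 usage is an assumption, not taken from this report):

      ~~~
      # On the node: ask the EC2 instance metadata service (IMDSv2) for the local hostname
      TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
      curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
        http://169.254.169.254/latest/meta-data/local-hostname

      # From a workstation with AWS credentials: show the VPC's DHCP options (domain-name)
      aws ec2 describe-dhcp-options \
        --dhcp-options-ids "$(aws ec2 describe-vpcs --vpc-ids <vpc-id> \
            --query 'Vpcs[0].DhcpOptionsId' --output text)" \
        --query 'DhcpOptions[0].DhcpConfigurations'
      ~~~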

      Version-Release number of selected component (if applicable):

      4.14.10 is when the issue was first observed.

      How reproducible:

          every time

      Steps to Reproduce:

          1. Have an AWS cluster and upgrade it from 4.13.24 to 4.14.10.
          2. After the cluster operator (CO) rollout succeeds, observe that the master/worker MCPs begin updating and that the first rebooted nodes come up with new hostnames. (Pause the MCPs to prevent further changes.)
          3. Observe that nodes that were previously issued hostnames of the form <IP>.ec2.internal now come up as <IP>.<custom-domain>.internal (see the verification sketch below the steps).
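
          A quick cluster-side check of the rename (a sketch, not part of the original report; Pending CSRs for the new name are an assumption based on normal kubelet registration behaviour):

          ~~~
          # Node objects vs. the hostnames the kubelets currently report
          oc get nodes -o wide

          # A renamed kubelet typically requests certificates under the new name
          oc get csr | grep -i pending

          # Compare the node object's name with the host's configured hostname
          oc debug node/<node-name> -- chroot /host hostnamectl status
          ~~~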
          

      Actual results:

      The cluster is partially degraded and stuck between versions.
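
      The stuck/degraded state can be confirmed with standard status commands (a sketch, not taken from the report):

      ~~~
      # Upgrade progress and any degraded operators
      oc get clusterversion
      oc get co

      # MachineConfigPools stuck mid-rollout
      oc get mcp
      ~~~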

      Expected results:

          The cluster upgrade should complete normally with no degraded status, and the hostnames of AWS nodes should not change unless explicitly set (changing hostnames on AWS as a day-2 operation is not supported anyway). The change is unexpected for the customer, and no DNS modifications have been introduced that would otherwise trigger it.

      Additional info:

      Note: this is a dev cluster. The customer currently has a possible login workaround using kubeconfig placement/token-based auth that bypasses the IDP (which is impacted/unavailable for login while the nodes are degraded). However, this cluster serves their developer teams as a production instance, so they are also blocked from upgrading their prod cluster. The cluster is partially stable, but they expect to restore an etcd backup if no workaround or solution becomes available before too long.
      
      
      Observe in the journal logs that the hostname is being overridden.
      
      Initial name applied:
      
      ~~~
      Feb 04 00:41:22 localhost systemd-journald[319]: Journal stopped
      Feb 04 00:41:23 ip-10-131-136-180 ostree-prepare-root[633]: sysroot.readonly configuration value: 1
      Feb 04 00:41:23 ip-10-131-136-180 systemd[1]: Finished OSTree Prepare OS/.
      ~~~
      
      ...
      
      Secondary name applied:
      
      ~~~
      Feb 04 00:41:32 ip-10-131-136-180 crio[1351]: time="2024-02-04 00:41:32.470506414Z" level=info msg="Started container" PID=2382 containerID=e18fbdb96ca1eed480fc9b430c369c30de74d243c121ff76da2261af8d2b88ca description=openshift-etcd/etcd-ip-10-131-136-180.<CUSTOM>.local/etcd-readyz id=faff0ef6-9f56-4b79-86b4-6cc122df2eb4 name=/runtime.v1.RuntimeService/StartContainer sandboxID=d827217abc36e5fcd6494e38c5d9f43bc867b46c5a67c0a8aa9e8cd816000e5f
      Feb 04 00:41:32 ip-10-131-136-180 kubenswrapper[1374]: I0204 00:41:32.566016    1374 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-131-136-180.<CUSTOM>.local" is forbidden: User "system:node:ip-10-131-136-180.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
      Feb 04 00:41:32 ip-10-131-136-180 kubenswrapper[1374]: I0204 00:41:32.734353    1374 kubelet.go:2457] "SyncLoop (PLEG): event for pod" pod="openshift-etcd/etcd-ip-10-131-136-180.<CUSTOM>.local" event=&{ID:1faab27ba692b933fc213b08c8e8cc92 Type:ContainerStarted Data:e18fbdb96ca1eed480fc9b430c369c30de74d243c121ff76da2261af8d2b88ca}
      ~~~
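
      The "forbidden" message above shows the kubelet still authenticating as the old node identity (system:node:ip-10-131-136-180.ec2.internal) while reporting objects under the new name. A quick way to surface the mismatch (a sketch; the node name is the example from the logs above):

      ~~~
      # Node objects currently registered in the API
      oc get nodes

      # The hostname and kubelet override actually in effect on the affected node
      oc debug node/ip-10-131-136-180.ec2.internal -- chroot /host \
        sh -c 'hostname; journalctl -b -u kubelet --no-pager | grep -m1 hostname-override'
      ~~~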
      
      
