OpenShift Bugs / OCPBUGS-29432

OCP 4.14.10 - AWS node hostnames changed unexpectedly from <IP-address>.ec2.internal to <IP-address>.<domain>.internal


Details

    Description

      Description of problem:

          Upgrade from 4.12.28 --> 4.13.24 (successful)
          Upgrade from 4.13.24 --> 4.14.10 (partial/failed) --> Observed that after all cluster operators (COs) had upgraded, the MCP rollout on the nodes started. 1 master and 1 worker restarted, and the hostnames of those nodes changed **unintentionally** from ip-<address>.ec2.internal to ip-<address>.<customdomain>.local.
      
      The following was observed in the journal logs:
      
      journalctl_--no-pager:Feb 03 19:38:46 ip-10-131-136-180 kubenswrapper[1381]: I0203 19:38:46.097670    1381 flags.go:64] FLAG: --hostname-override="ip-10-131-136-180.<customdomain>.local"
      
      It is unclear why or how this value is being injected.
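
      A minimal way to trace where the override comes from (a diagnostic sketch, assuming node access via oc debug; the exact drop-in locations are typical RHCOS paths and are not confirmed by this report):

      ~~~
      # Show the kubelet unit plus any drop-ins, and search systemd config for the flag
      oc debug node/ip-10-131-136-180.ec2.internal -- chroot /host \
        sh -c 'systemctl cat kubelet.service; grep -r "hostname-override" /etc/systemd/system/'
      ~~~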
      
      The VPC has a custom domain name that matches the value above, but it has been set that way since 4.10 or earlier with no impact on node health or cluster handling.
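
      For reference, the domain-name in the VPC's DHCP options set and the hostname EC2 hands the instance can be checked as follows (a sketch; <vpc-id> is a placeholder and IMDSv2 usage is an assumption, not taken from this report):

      ~~~
      # On the node: ask the EC2 instance metadata service (IMDSv2) for the local hostname
      TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
      curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
        http://169.254.169.254/latest/meta-data/local-hostname

      # From a workstation with AWS credentials: show the VPC's DHCP options (domain-name)
      aws ec2 describe-dhcp-options \
        --dhcp-options-ids "$(aws ec2 describe-vpcs --vpc-ids <vpc-id> \
            --query 'Vpcs[0].DhcpOptionsId' --output text)" \
        --query 'DhcpOptions[0].DhcpConfigurations'
      ~~~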

      Version-Release number of selected component (if applicable):

      4.14.10 is when the issue was first observed.

      How reproducible:

          every time

      Steps to Reproduce:

          1. Have an AWS cluster and upgrade it from 4.13.24 to 4.14.10.
          2. After the cluster operator (CO) rollout succeeds, observe that the master/worker MCPs begin updating and that the first rebooted nodes come up with new hostnames. (Pause the MCPs to prevent further changes.)
          3. Observe that nodes that were previously issued hostnames of the form <IP>.ec2.internal now come up as <IP>.<custom-domain>.internal (see the verification sketch below the steps).
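
          A quick cluster-side check of the rename (a sketch, not part of the original report; Pending CSRs for the new name are an assumption based on normal kubelet registration behaviour):

          ~~~
          # Node objects vs. the hostnames the kubelets currently report
          oc get nodes -o wide

          # A renamed kubelet typically requests certificates under the new name
          oc get csr | grep -i pending

          # Compare the node object's name with the host's configured hostname
          oc debug node/<node-name> -- chroot /host hostnamectl status
          ~~~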
          

      Actual results:

      The cluster is partially degraded and stuck between versions.
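
      The stuck/degraded state can be confirmed with standard status commands (a sketch, not taken from the report):

      ~~~
      # Upgrade progress and any degraded operators
      oc get clusterversion
      oc get co

      # MachineConfigPools stuck mid-rollout
      oc get mcp
      ~~~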

      Expected results:

          The cluster upgrade should complete normally with no degraded status, and the hostnames of AWS nodes should not change unless explicitly set (changing hostnames on AWS as a day-2 operation is not supported anyway). The change is unexpected for the customer, and no DNS modifications have been introduced that would otherwise trigger it.

      Additional info:

      Note: this is a dev cluster. The customer currently has a possible login workaround using kubeconfig placement/token-based auth that bypasses the IDP (which is impacted/unavailable for login while the nodes are degraded). However, this cluster serves their developer teams as a production instance, so they are also blocked from upgrading their prod cluster. The cluster is partially stable, but they expect to restore an etcd backup if no workaround or solution becomes available before too long.
      
      
      Observe in the journal logs that the hostname is being overridden.
      
      Initial name applied:
      
      ~~~
      Feb 04 00:41:22 localhost systemd-journald[319]: Journal stopped
      Feb 04 00:41:23 ip-10-131-136-180 ostree-prepare-root[633]: sysroot.readonly configuration value: 1
      Feb 04 00:41:23 ip-10-131-136-180 systemd[1]: Finished OSTree Prepare OS/.
      ~~~
      
      ...
      
      Secondary name applied:
      
      ~~~
      Feb 04 00:41:32 ip-10-131-136-180 crio[1351]: time="2024-02-04 00:41:32.470506414Z" level=info msg="Started container" PID=2382 containerID=e18fbdb96ca1eed480fc9b430c369c30de74d243c121ff76da2261af8d2b88ca description=openshift-etcd/etcd-ip-10-131-136-180.<CUSTOM>.local/etcd-readyz id=faff0ef6-9f56-4b79-86b4-6cc122df2eb4 name=/runtime.v1.RuntimeService/StartContainer sandboxID=d827217abc36e5fcd6494e38c5d9f43bc867b46c5a67c0a8aa9e8cd816000e5f
      Feb 04 00:41:32 ip-10-131-136-180 kubenswrapper[1374]: I0204 00:41:32.566016    1374 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-131-136-180.<CUSTOM>.local" is forbidden: User "system:node:ip-10-131-136-180.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
      Feb 04 00:41:32 ip-10-131-136-180 kubenswrapper[1374]: I0204 00:41:32.734353    1374 kubelet.go:2457] "SyncLoop (PLEG): event for pod" pod="openshift-etcd/etcd-ip-10-131-136-180.<CUSTOM>.local" event=&{ID:1faab27ba692b933fc213b08c8e8cc92 Type:ContainerStarted Data:e18fbdb96ca1eed480fc9b430c369c30de74d243c121ff76da2261af8d2b88ca}
      ~~~
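
      The "forbidden" message above shows the kubelet still authenticating as the old node identity (system:node:ip-10-131-136-180.ec2.internal) while reporting objects under the new name. A quick way to surface the mismatch (a sketch; the node name is the example from the logs above):

      ~~~
      # Node objects currently registered in the API
      oc get nodes

      # The hostname and kubelet override actually in effect on the affected node
      oc debug node/ip-10-131-136-180.ec2.internal -- chroot /host \
        sh -c 'hostname; journalctl -b -u kubelet --no-pager | grep -m1 hostname-override'
      ~~~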
      
      
