OCPBUGS-42110

Node scaling failed due to misconfigurations in on-prem-resolv-prepender.service in RHOCP4


    • Important
    • No
    • False
    • Previously, the Machine Config Operator (MCO) vSphere resolv-prepender script used systemd directives that were not compatible with old boot image versions of OpenShift Container Platform 4. With this release, nodes can scale in one of the following ways: with a boot image from 4.13 or later, with manual intervention, or by upgrading to a release with this fix.
      ====
      This fixes a bug where the MCO's vSphere resolv-prepender script uses systemd directives that are not compatible with old boot image versions in OCP 4. Nodes should be able to scale with either a newer boot image version (4.13+), manual intervention, or an upgrade to a release with this fix.
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-42109. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-42108. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-38012. The following is the description of the original issue:

      Description of problem:

      Customers are unable to scale up OCP nodes when the cluster was initially installed with OCP 4.8/4.9 and later upgraded to 4.15.22/4.15.23.
      
      At first, the customer observed that the node scale-up failed and that /etc/resolv.conf was empty on the new nodes.
      As a workaround, the customer copied the content of a correct resolv.conf into the empty file, after which setup of the new node continued.
      
      The customer then inspected the rendered MachineConfig, which is assembled with the 00-worker, and suspected that something was wrong with the on-prem-resolv-prepender.service definition.
      As a second workaround, the customer manually changed this service definition, which allowed new nodes to scale up.
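
      The first symptom described above, an empty /etc/resolv.conf on the new node, can be checked with a small helper. This is a minimal sketch, not part of the original report: the function name `has_nameserver` and the sample address 192.0.2.10 (a documentation-only range) are hypothetical, and the check assumes that a usable resolv.conf contains at least one `nameserver` line.

```python
def has_nameserver(resolv_conf_text):
    """Return True if the text contains at least one 'nameserver <address>' line."""
    for raw in resolv_conf_text.splitlines():
        line = raw.strip()
        # Skip comment lines; resolv.conf accepts '#' and ';' as comment markers.
        if line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            return True
    return False


if __name__ == "__main__":
    # Hypothetical sample contents, not taken from an affected node.
    empty_file = ""
    healthy_file = "search example.com\nnameserver 192.0.2.10\n"
    print(has_nameserver(empty_file))
    print(has_nameserver(healthy_file))
```

      On a node, the text could be read from /etc/resolv.conf and passed to the helper; an empty file yields False.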

      Version-Release number of selected component (if applicable):

      4.15, 4.16

      How reproducible:

      100%

      Steps to Reproduce:

      1. Install an OCP vSphere IPI cluster at version 4.8 or 4.9.
      2. Check the "on-prem-resolv-prepender.service" service definition.
      3. Upgrade the cluster to 4.15.22 or 4.15.23.
      4. Check whether node scaling works.
      5. Check the "on-prem-resolv-prepender.service" service definition again.

      Actual results:

      Nodes cannot be scaled up with the default service definition. After manual changes to the service definition, scaling works.

      Expected results:

      Node scaling should work without any manual changes to the service definition.

      Additional info:

      on-prem-resolv-prepender.service content on clusters built with version 4.8 / 4.9 and then upgraded to 4.15.22 / 4.15.23:
      ~~~
      [Unit]
      Description=Populates resolv.conf according to on-prem IPI needs
      # Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
      After=crio-wipe.service
      [Service]
      Type=oneshot
      Restart=on-failure
      RestartSec=10
      StartLimitIntervalSec=0
      ExecStart=/usr/local/bin/resolv-prepender.sh
      EnvironmentFile=/run/resolv-prepender/env
      ~~~
      
      After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
      ~~~
      [Unit]
      Description=Populates resolv.conf according to on-prem IPI needs
      # Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
      After=crio-wipe.service
      # Changed: StartLimitIntervalSec=0 moved here from [Service]
      StartLimitIntervalSec=0
      [Service]
      Type=oneshot
      # Changed: Restart=on-failure commented out
      #Restart=on-failure
      RestartSec=10
      ExecStart=/usr/local/bin/resolv-prepender.sh
      EnvironmentFile=/run/resolv-prepender/env
      ~~~
      
      Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 cluster, where scaling works:
      ~~~
      [Unit]
      Description=Populates resolv.conf according to on-prem IPI needs
      # Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
      After=crio-wipe.service
      StartLimitIntervalSec=0
      [Service]
      Type=oneshot
      Restart=on-failure
      RestartSec=10
      ExecStart=/usr/local/bin/resolv-prepender.sh
      EnvironmentFile=/run/resolv-prepender/env
      ~~~
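
      The difference between the upgraded and the freshly installed definitions is easy to miss by eye. The following is a minimal sketch, assuming the unit text is available as a string, that parses both definitions and reports directives that sit in a different section. The helper names `parse_unit` and `moved_directives` are hypothetical; the two unit texts are copied from the listings in this report.

```python
def parse_unit(text):
    """Map each [Section] name to a dict of directive -> value, skipping comments."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
    return sections


def moved_directives(old, new):
    """Return sorted (directive, old_section, new_section) for keys whose section differs."""
    where_old = {key: sec for sec, items in old.items() for key in items}
    where_new = {key: sec for sec, items in new.items() for key in items}
    return sorted(
        (key, where_old[key], where_new[key])
        for key in where_old
        if key in where_new and where_old[key] != where_new[key]
    )


# Definition seen on clusters installed with 4.8 / 4.9 and then upgraded.
UPGRADED = """\
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
"""

# Definition seen on a freshly installed 4.15.23 cluster.
FRESH = """\
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
# Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
"""

if __name__ == "__main__":
    for key, old_sec, new_sec in moved_directives(parse_unit(UPGRADED), parse_unit(FRESH)):
        print(f"{key}: [{old_sec}] -> [{new_sec}]")
```

      For the two listings above, the only directive reported is StartLimitIntervalSec, which sits in [Service] on the upgraded cluster and in [Unit] on the fresh install. The sketch compares placement only; it does not judge which directives a given systemd version accepts.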
      
      This was observed in the rendered MachineConfig, which is assembled with the 00-worker.

              mkowalsk@redhat.com Mat Kowalski
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhanqi Zhao Zhanqi Zhao