OpenShift Bugs / OCPBUGS-42109

Node scaling failed due to misconfigurations in on-prem-resolv-prepender.service in RHOCP4


    • Important
    • None
    • False
    • None
    • This fixes a bug where the MCO's vSphere resolv-prepender script uses systemd directives that are not compatible with old bootimage versions in OCP 4. Nodes should be able to scale with a newer bootimage version (4.13+), with manual intervention, or after upgrading to a release with this fix.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-42108. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-38012. The following is the description of the original issue:

      Description of problem:

      Customers are unable to scale up OCP nodes when the cluster was initially installed with OCP 4.8/4.9 and then upgraded to 4.15.22/4.15.23.

      At first, the customer observed that the node scale-up failed and that /etc/resolv.conf was empty on the new nodes.
      As a workaround, the customer copied the contents of a correct resolv.conf into place, after which setup of the new node continued.

      The customer then inspected the rendered MachineConfig, which is assembled with 00-worker, and suspected that something was wrong with the on-prem-resolv-prepender.service definition.
      As a further workaround, the customer manually changed this service definition, which allowed new nodes to scale up.

      Version-Release number of selected component (if applicable):

      4.15, 4.16

      How reproducible:

      100%

      Steps to Reproduce:

      1. Install an OCP vSphere IPI cluster, version 4.8 or 4.9.
      2. Check the "on-prem-resolv-prepender.service" service definition.
      3. Upgrade the cluster to 4.15.22 or 4.15.23.
      4. Check whether node scaling works.
      5. Check the "on-prem-resolv-prepender.service" service definition again.
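
The comparison in steps 2 and 5 can be sketched as a line-based diff of two saved service definitions. This is a minimal illustration, not part of the original report: the function name `changed_directives` is hypothetical, and because the report does not include the pre-upgrade definition, the sample inputs are the upgraded-cluster and fresh-install definitions quoted under "Additional info" (with the comment line omitted).

```python
import difflib
from typing import List


def changed_directives(before: str, after: str) -> List[str]:
    """Return only the added ('+') and removed ('-') lines between two unit files."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        # keep content changes, skip the '---'/'+++' file headers and '@@' hunk markers
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Definition reported on a cluster installed with 4.8/4.9 and then upgraded.
UPGRADED = """\
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
After=crio-wipe.service
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=0
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
"""

# Definition reported on a freshly installed 4.15.23 cluster.
FRESH = """\
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
After=crio-wipe.service
StartLimitIntervalSec=0
[Service]
Type=oneshot
Restart=on-failure
RestartSec=10
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
"""

for line in changed_directives(UPGRADED, FRESH):
    print(line)
```

A plain diff only shows that a directive moved; it does not say which section it moved between, so the section headers still need to be read alongside the output.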

      Actual results:

      Unable to scale up nodes with the default service definition. After manually changing the service definition, scaling works.

      Expected results:

      Node scaling should work without any manual changes to the service definition.

      Additional info:

      on-prem-resolv-prepender.service content on clusters built with version 4.8 / 4.9 and then upgraded to 4.15.22 / 4.15.23:
      ~~~
      [Unit]
      Description=Populates resolv.conf according to on-prem IPI needs
      # Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
      After=crio-wipe.service
      [Service]
      Type=oneshot
      Restart=on-failure
      RestartSec=10
      StartLimitIntervalSec=0
      ExecStart=/usr/local/bin/resolv-prepender.sh
      EnvironmentFile=/run/resolv-prepender/env
      ~~~
      
      After manually correcting the service definition as below, scaling works on 4.15.22 / 4.15.23:
      ~~~
      [Unit]
      Description=Populates resolv.conf according to on-prem IPI needs
      # Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
      After=crio-wipe.service
      StartLimitIntervalSec=0                -----------> this
      [Service]
      Type=oneshot
      #Restart=on-failure                    -----------> this
      RestartSec=10
      ExecStart=/usr/local/bin/resolv-prepender.sh
      EnvironmentFile=/run/resolv-prepender/env
      ~~~
      
      Below is the on-prem-resolv-prepender.service on a freshly installed 4.15.23 cluster, where scaling works fine:
      ~~~
      [Unit]
      Description=Populates resolv.conf according to on-prem IPI needs
      # Per https://issues.redhat.com/browse/OCPBUGS-27162 there is a problem if this is started before crio-wipe
      After=crio-wipe.service
      StartLimitIntervalSec=0
      [Service]
      Type=oneshot
      Restart=on-failure
      RestartSec=10
      ExecStart=/usr/local/bin/resolv-prepender.sh
      EnvironmentFile=/run/resolv-prepender/env
      ~~~
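
The differences between the three definitions above can be checked mechanically. The following is a rough sketch, assuming Python's `configparser` is a close enough approximation of unit-file syntax for these simple files (it is not a real systemd parser). The names `parse_unit` and `check_unit` are hypothetical. The two checks only mirror what differs between the listings; whether either directive is the actual trigger on a given bootimage is an assumption, not something this report confirms.

```python
import configparser
from typing import List


def parse_unit(text: str) -> configparser.ConfigParser:
    """Parse unit-file text with the standard library INI parser (approximation)."""
    parser = configparser.ConfigParser(
        delimiters=("=",),   # unit files use '=' between directive and value
        strict=False,        # tolerate repeated directives
        interpolation=None,  # treat '%' specifiers as literal text
    )
    parser.optionxform = str  # keep directive names case-sensitive
    parser.read_string(text)
    return parser


def check_unit(text: str) -> List[str]:
    """Flag the two points where the upgraded unit differs from the manual workaround."""
    unit = parse_unit(text)
    findings = []
    if unit.has_option("Service", "StartLimitIntervalSec"):
        findings.append("StartLimitIntervalSec is under [Service], not [Unit]")
    if (unit.get("Service", "Type", fallback="") == "oneshot"
            and unit.has_option("Service", "Restart")):
        findings.append("Restart= is set on a Type=oneshot service")
    return findings


HEADER = """\
[Unit]
Description=Populates resolv.conf according to on-prem IPI needs
After=crio-wipe.service
"""
EXEC = """\
ExecStart=/usr/local/bin/resolv-prepender.sh
EnvironmentFile=/run/resolv-prepender/env
"""

# Upgraded cluster: StartLimitIntervalSec sits in [Service], Restart is active.
UPGRADED = (HEADER + "[Service]\nType=oneshot\nRestart=on-failure\n"
            "RestartSec=10\nStartLimitIntervalSec=0\n" + EXEC)
# Manual workaround: directive moved to [Unit], Restart commented out.
WORKAROUND = (HEADER + "StartLimitIntervalSec=0\n[Service]\nType=oneshot\n"
              "#Restart=on-failure\nRestartSec=10\n" + EXEC)
# Fresh 4.15.23 install: directive in [Unit], Restart still active.
FRESH = (HEADER + "StartLimitIntervalSec=0\n[Service]\nType=oneshot\n"
         "Restart=on-failure\nRestartSec=10\n" + EXEC)

for name, text in [("upgraded", UPGRADED), ("workaround", WORKAROUND), ("fresh", FRESH)]:
    print(name, check_unit(text))
```

Note that the fresh-install definition still triggers the second check even though scaling works there, which is consistent with the release note's statement that the problem depends on the bootimage version rather than on the unit text alone.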
      
      This was observed in the rendered MachineConfig, which is assembled with 00-worker.
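
For reference, a small sketch of pulling the unit text out of a rendered MachineConfig that has been saved as JSON. It assumes the usual layout in which the Ignition config sits under `spec.config` and units are listed under `systemd.units` with `name` and `contents` fields. The function name `unit_contents` and the sample object (including the name `rendered-worker-example`) are hypothetical and heavily trimmed.

```python
import json
from typing import Optional


def unit_contents(machine_config: dict, unit_name: str) -> Optional[str]:
    """Return the contents of the named systemd unit, or None if it is absent."""
    units = (
        machine_config.get("spec", {})
        .get("config", {})
        .get("systemd", {})
        .get("units", [])
    )
    for unit in units:
        if unit.get("name") == unit_name:
            return unit.get("contents")
    return None


# Hypothetical, trimmed stand-in for a rendered MachineConfig exported as JSON.
SAMPLE = json.loads("""
{
  "kind": "MachineConfig",
  "metadata": {"name": "rendered-worker-example"},
  "spec": {
    "config": {
      "systemd": {
        "units": [
          {
            "name": "on-prem-resolv-prepender.service",
            "contents": "[Unit]\\nAfter=crio-wipe.service\\n[Service]\\nType=oneshot\\n"
          }
        ]
      }
    }
  }
}
""")

print(unit_contents(SAMPLE, "on-prem-resolv-prepender.service"))
```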

            mkowalsk@redhat.com Mat Kowalski
            openshift-crt-jira-prow OpenShift Prow Bot
            Zhanqi Zhao Zhanqi Zhao