OpenShift Bugs / OCPBUGS-36536

Disconnected ARO clusters fail to add new nodes after upgrading to 4.14

    • No
    • MCO Sprint 256
    • 1
    • False
    • Fixed a rare bug in which disconnected ARO installs could not scale up new nodes after upgrading to affected versions.

      This regressed in 4.14 due to a change of dependency targets, but was not caught until 4.17. The patch will be backported to cover 4.14 through 4.17.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-35300. The following is the description of the original issue:

      Description of problem:

      An ARO cluster fails to install when networking is disconnected.
      Master nodes hang during boot on machine-config-daemon-pull.service. The service's logs indicate that it cannot reach the public IP of the image registry. In ARO, traffic to image registries must go through a proxy. Dnsmasq is used to inject the proxy DNS answers, but machine-config-daemon-pull.service starts before ARO's dnsmasq.service.
      

      Version-Release number of selected component (if applicable):

      4.14.16
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      For Fresh Install:
      1. Create the required ARO vnet and subnets
      2. Attach a route table to the subnets with a blackhole route 0.0.0.0/0
      3. Create a 4.14 ARO cluster with --apiserver-visibility=Private --ingress-visibility=Private --outbound-type=UserDefinedRouting
      
      [OR]
      
      Post Upgrade to 4.14:
      1. Create an ARO 4.13 UDR (UserDefinedRouting) cluster.
      2. Upgrade the cluster from 4.13 to 4.14; the upgrade succeeds.
      3. Create a new node (scale up); the same issue occurs.

      Actual results:

      For Fresh Install of 4.14:
      ERROR: (InternalServerError) Deployment failed.
      
      [OR]
      
      Post Upgrade to 4.14:
      The new node never reaches the Ready state, and its Machine is stuck in Provisioned status.

      Expected results:

      Succeeded 

      Additional info:
      The node logs show that machine-config-daemon-pull.service cannot reach the image registry because ARO's dnsmasq.service has not started yet.
      Previously, systemd ordering was set so that ovs-configuration.service starts after ARO's dnsmasq.service. Perhaps that ordering should have been applied to machine-config-daemon-pull.service.
      See https://issues.redhat.com/browse/OCPBUGS-25406.
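
The ordering problem described above can be illustrated with a small toy model. This is only a sketch, not how systemd or the MCO computes ordering: the `starts_after` helper and the two dictionaries are hypothetical, only the unit names come from this report, and the "suggested" mapping assumes the existing ovs-configuration.service ordering is kept alongside the new one.

```python
def starts_after(after, unit, dependency):
    """Return True if `unit` is ordered (directly or transitively) after `dependency`.

    `after` maps a unit name to the set of units it must start after,
    loosely mirroring systemd's After= directive.
    """
    seen = set()
    stack = [unit]
    while stack:
        current = stack.pop()
        for dep in after.get(current, ()):
            if dep == dependency:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


# Ordering as described in the report: only ovs-configuration.service
# waits for ARO's dnsmasq.service.
reported = {
    "ovs-configuration.service": {"dnsmasq.service"},
    "machine-config-daemon-pull.service": set(),
}

# Ordering suggested in the report (assumption: the ovs-configuration
# ordering is kept and machine-config-daemon-pull.service gains the same one).
suggested = {
    "ovs-configuration.service": {"dnsmasq.service"},
    "machine-config-daemon-pull.service": {"dnsmasq.service"},
}

pull = "machine-config-daemon-pull.service"
print(starts_after(reported, pull, "dnsmasq.service"))
print(starts_after(suggested, pull, "dnsmasq.service"))
```

In the reported mapping nothing forces the pull service to wait for dnsmasq.service, so it may start before the proxy DNS answers are in place; in the suggested mapping the wait is explicit.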

              team-mco Team MCO
              openshift-crt-jira-prow OpenShift Prow Bot
              Hilliary Lipsig Hilliary Lipsig