OpenShift Bugs · OCPBUGS-19081

internal-registry-pull-secret.json updates and causes nodes to be stuck



      Description of problem:

      On ci/prow/e2e-gcp-ovn-techpreview jobs, we noticed that the installation would always fail. This also happens when launching jobs from ClusterBot with https://github.com/openshift/cloud-provider-gcp/pull/35 and the TechPreview feature flag set.
      
      A sample job is https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cloud-provider-gcp/35/pull-ci-openshift-cloud-provider-gcp-master-e2e-gcp-ovn-techpreview/1700229284032942080
      
      In the above job, the machine configs change from "rendered-worker-d5ca314d6412a630a0df72eb28b88543" to "rendered-worker-3eb45ca2b54008d2d1d5a6701a84bd7e", but no nodes are ever given "rendered-worker-3eb45ca2b54008d2d1d5a6701a84bd7e" as their desired configuration. Diffing the two configs, the only notable change is that `/etc/mco/internal-registry-pull-secret.json` goes from empty to populated.
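One way to surface that kind of drift is to pull both rendered MachineConfigs and compare their Ignition file entries. A minimal Python sketch of the comparison — the two configs below are hypothetical, trimmed stand-ins for `oc get mc <name> -o json` output, not the actual rendered configs from the job:

```python
# Sketch: find Ignition file entries that differ between two rendered
# MachineConfigs. Both dicts are hypothetical, trimmed stand-ins; only
# the "storage.files" section matters for this comparison.

old_mc = {
    "spec": {"config": {"storage": {"files": [
        {"path": "/etc/mco/internal-registry-pull-secret.json",
         "contents": {"source": "data:,"}},  # empty in the old config
    ]}}}
}
new_mc = {
    "spec": {"config": {"storage": {"files": [
        {"path": "/etc/mco/internal-registry-pull-secret.json",
         "contents": {"source": "data:;base64,eyJhdXRocyI6e319"}},  # populated
    ]}}}
}

def files_by_path(mc):
    """Map each Ignition file path to its contents source."""
    files = mc["spec"]["config"]["storage"]["files"]
    return {f["path"]: f["contents"]["source"] for f in files}

old_files, new_files = files_by_path(old_mc), files_by_path(new_mc)
changed = sorted(p for p in set(old_files) | set(new_files)
                 if old_files.get(p) != new_files.get(p))
print(changed)  # → ['/etc/mco/internal-registry-pull-secret.json']
```

In the failing job, this is the only path that changes between the two rendered configs.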
      
      
      Within the Machine Config Operator controller logs of this job, we see the new machine config being generated, but not assigned.
      
      ```
      ☸ ocp/api-ci-l2s4-p1-openshiftapps-com:6443/nbrubake (ocp) in Downloads/artifacts/pods
      ❯ rg rendered-worker-3
      openshift-machine-config-operator_machine-config-controller-7c754bffdd-84k5s_machine-config-controller.log
      184:I0908 19:59:48.796617       1 render_controller.go:510] Generated machineconfig rendered-worker-3eb45ca2b54008d2d1d5a6701a84bd7e from 7 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-container-runtime  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  } {MachineConfig  97-worker-generated-kubelet  machineconfiguration.openshift.io/v1  } {MachineConfig  98-worker-generated-kubelet  machineconfiguration.openshift.io/v1  } {MachineConfig  99-worker-generated-registries  machineconfiguration.openshift.io/v1  } {MachineConfig  99-worker-ssh  machineconfiguration.openshift.io/v1  }]
      185:I0908 19:59:48.797076       1 event.go:298] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"openshift-machine-config-operator", Name:"worker", UID:"3416d468-710b-4c19-a956-00dead3dec84", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"25805", FieldPath:""}): type: 'Normal' reason: 'RenderedConfigGenerated' rendered-worker-3eb45ca2b54008d2d1d5a6701a84bd7e successfully generated (release version: 4.15.0-0.ci.test-2023-09-08-193239-ci-op-h57nt20x-latest, controller version: 5b821a279c88fee1cc1886a6cf1ec774891a2258)
      187:I0908 19:59:48.872593       1 render_controller.go:536] Pool worker: now targeting: rendered-worker-3eb45ca2b54008d2d1d5a6701a84bd7e
      ```
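The gap is between the pool's target and what each node is told to run, which is recorded in the node's `machineconfiguration.openshift.io/desiredConfig` annotation. A small Python sketch of that check — the annotation key is the real one MCO sets, but the node list below is made-up illustration, not data from this job:

```python
# Sketch: detect worker nodes that were never retargeted to the pool's
# new rendered config. Node metadata here is hypothetical; the
# annotation key is the one the Machine Config Operator sets on nodes.

DESIRED = "machineconfiguration.openshift.io/desiredConfig"
pool_target = "rendered-worker-3eb45ca2b54008d2d1d5a6701a84bd7e"

nodes = [  # imagined output of listing worker nodes
    {"name": "worker-a",
     "annotations": {DESIRED: "rendered-worker-d5ca314d6412a630a0df72eb28b88543"}},
    {"name": "worker-b",
     "annotations": {DESIRED: "rendered-worker-d5ca314d6412a630a0df72eb28b88543"}},
]

# In this bug, every worker stays pinned to the old rendered config
# even though the pool now targets the new one.
stuck = [n["name"] for n in nodes
         if n["annotations"].get(DESIRED) != pool_target]
print(stuck)  # → ['worker-a', 'worker-b']
```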
      
      We have _not_ seen this when deploying onto GCP manually, however.
      
      A similar ClusterBot failure is here: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp-modern/1702401781788577792

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      So far, 100% of the time with prow-based deployments 

      Steps to Reproduce:

      1. Launch ci/prow/e2e-gcp-ovn-techpreview, or create a GCP cluster with TechPreviewNoUpgrade from ClusterBot
      2. Wait for the launch to fail
      

      Actual results:

      Worker nodes get restarted, causing services such as OLM, ingress, and the image registry to become unavailable.

      Expected results:

      The image pull secret doesn't change during an install, or, if it does, the change doesn't leave workers stuck.

      Additional info:

      No Machines or MachineSets appear in the gather-extra output, but that is likely because this is meant to be a Cluster API-managed cluster, which uses a different API group than the one the gather scripts query.
      
      Also, worker nodes do appear in the gather-extra output, but they are marked unschedulable due to missing network routes.

        People: David Joshy (djoshy), Nolan Brubaker (rh-ee-nbrubake), Sergio Regidor de la Rosa