OpenShift Bugs · OCPBUGS-31502

Console operator progressing forever on single-master cluster


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version/s: 4.14.z
    • Component/s: Management Console
    • Category: Quality / Stability / Reliability

      I am using OCP deployed on AWS:

      $ oc version
      Client Version: 4.14.3
      Kustomize Version: v5.0.1
      Server Version: 4.14.11
      Kubernetes Version: v1.27.10+28ed2d7
      

      My cluster has a single master node and multiple worker nodes:

      $ oc get node
      NAME                                        STATUS   ROLES                  AGE   VERSION
      ip-10-0-17-251.us-west-2.compute.internal   Ready    worker                 10h   v1.27.10+28ed2d7
      ip-10-0-19-110.us-west-2.compute.internal   Ready    control-plane,master   10h   v1.27.10+28ed2d7
      ip-10-0-47-136.us-west-2.compute.internal   Ready    worker                 10h   v1.27.10+28ed2d7
      ip-10-0-66-68.us-west-2.compute.internal    Ready    worker                 10h   v1.27.10+28ed2d7
      ip-10-0-81-138.us-west-2.compute.internal   Ready    worker                 9h    v1.27.10+28ed2d7
      

      Accordingly, in the Infrastructure object, the fields describing the cluster topology are controlPlaneTopology: SingleReplica and infrastructureTopology: HighlyAvailable:

      $ oc get infrastructures.config.openshift.io cluster -o yaml
      apiVersion: config.openshift.io/v1
      kind: Infrastructure
      metadata:
        ...
        name: cluster
        ...
      spec:
        cloudConfig:
          key: config
          name: cloud-provider-config
        platformSpec:
          aws: {}
          type: AWS
      status:
        apiServerInternalURI: https://api-int.mycluster10a.sandbox1452.opentlc.com:6443
        apiServerURL: https://api.mycluster10a.sandbox1452.opentlc.com:6443
        controlPlaneTopology: SingleReplica
        cpuPartitioning: None
        etcdDiscoveryDomain: ""
        infrastructureName: mycluster10a-ddr8m
        infrastructureTopology: HighlyAvailable
        platform: AWS
        platformStatus:
          aws:
            region: us-west-2
          type: AWS
      

      The console operator schedules three console replicas:

      $ oc get po -n openshift-console
      NAME                         READY   STATUS    RESTARTS   AGE
      console-58cc755947-ld47m     0/1     Pending   0          9h
      console-58cc755947-s9skd     0/1     Pending   0          9h
      console-69cc447fd4-b5nmt     1/1     Running   2          10h
      downloads-5545fcd8f7-8mflq   1/1     Running   2          10h
      downloads-5545fcd8f7-gg2k4   1/1     Running   2          10h
      

      Two of the three scheduled replicas remain Pending forever. The console deployment's node selector places the console pods on master nodes, and a required pod anti-affinity prevents placing more than one console pod on the same node:

      $ oc get deploy -n openshift-console console -o yaml
      ...
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                    - key: component
                      operator: In
                      values:
                      - ui
                  topologyKey: kubernetes.io/hostname
      ...
            nodeSelector:
              node-role.kubernetes.io/master: ""
      ...
      

      Since my cluster has a single master node, it's not possible to schedule all three console pods. As a result, the console operator never finishes reconciling:

      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.14.11   True        False         False      9h
      baremetal                                  4.14.11   True        False         False      10h
      cloud-controller-manager                   4.14.11   True        False         False      10h
      cloud-credential                           4.14.11   True        False         False      10h
      cluster-autoscaler                         4.14.11   True        False         False      10h
      config-operator                            4.14.11   True        False         False      10h
      console                                    4.14.11   True        True          False      9h      SyncLoopRefreshProgressing: Working toward version 4.14.11, 1 replicas available
      control-plane-machine-set                  4.14.11   True        False         False      10h
      csi-snapshot-controller                    4.14.11   True        False         False      10h
      dns                                        4.14.11   True        False         False      10h
      etcd                                       4.14.11   True        False         False      10h
      ...
      
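      The scheduling constraint above can be sketched in a few lines. This is a hypothetical helper for illustration only (not operator code): the master-only node selector restricts console pods to master nodes, and the required anti-affinity on kubernetes.io/hostname allows at most one console pod per node, so the number of placeable pods is capped by the master-node count.

```go
package main

import "fmt"

// schedulableConsolePods is a hypothetical helper illustrating the constraint:
// with a master-only node selector and a required pod anti-affinity on
// kubernetes.io/hostname, at most one console pod fits per master node.
func schedulableConsolePods(desiredReplicas, masterNodes int) int {
	if masterNodes < desiredReplicas {
		return masterNodes
	}
	return desiredReplicas
}

func main() {
	// Three desired replicas but a single master: only one pod can be
	// placed, leaving the other two Pending, as observed above.
	fmt.Println(schedulableConsolePods(3, 1)) // 1
}
```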

      It looks like the console operator consults the Infrastructure.status.infrastructureTopology field, which is set to HighlyAvailable, and based on that creates three console replicas.

      The console operator should perhaps also check the Infrastructure.status.controlPlaneTopology field. It is set to SingleReplica, indicating that it's not possible to schedule three console replicas on master nodes.
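      A minimal sketch of the suggested behavior, assuming the replica count were derived from controlPlaneTopology. The topology values are the ones defined by the config.openshift.io/v1 Infrastructure API; the helper name and the HA default of three replicas (as observed above) are assumptions, not the operator's actual code:

```go
package main

import "fmt"

// consoleReplicas is a hypothetical sketch of the suggested fix: since console
// pods can only land on master nodes, derive the replica count from
// controlPlaneTopology rather than infrastructureTopology.
func consoleReplicas(controlPlaneTopology string) int32 {
	if controlPlaneTopology == "SingleReplica" {
		return 1 // a single master can host only one console pod
	}
	return 3 // HA control plane: keep the default observed above
}

func main() {
	fmt.Println(consoleReplicas("SingleReplica"))   // 1
	fmt.Println(consoleReplicas("HighlyAvailable")) // 3
}
```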

              Assignee: Jakub Hadvig
              Reporter: Ales Nosek
              QA Contact: YaDan Pei