  OpenShift Bugs / OCPBUGS-64582

SNO cluster upgrade is blocked with frr-k8s-webhook-server deployment pending schedule


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • 4.19.z, 4.20.z
    • Networking / FRR-K8s
    • Quality / Stability / Reliability
    • False
    • Low

      Description of problem:

      When upgrading a single-node OpenShift (SNO) cluster from 4.19 to 4.20, the upgrade gets stuck on the network cluster operator.

      The new frr-k8s-webhook-server pod cannot be scheduled. It requests host port 9123/TCP, and the node has no free port for it. The previous webhook pod, which is still running, appears to hold that port.

      Version-Release number of selected component (if applicable):

      Upgrade from OCP 4.19.16 to OCP 4.20.1

      How reproducible:

      Always

      Steps to Reproduce:

      1. Deploy an SNO cluster running 4.19.
      2. Upgrade it to 4.20.
          

      Actual results:

      The upgrade is blocked:
      
      $ oc get co network
      NAME      VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      network   4.19.16   True        True          False      111d    Deployment "/openshift-frr-k8s/frr-k8s-webhook-server" update is rolling out (1 out of 2 updated)
      
      
      The new pod is Pending:
      
      $ oc get pod -n openshift-frr-k8s
      NAME                                     READY   STATUS    RESTARTS   AGE
      frr-k8s-fnqzh                            7/7     Running   0          35m
      frr-k8s-webhook-server-bc5694f79-w49mc   0/1     Pending   0          10m
      frr-k8s-webhook-server-bc7675998-s77lz   1/1     Running   2          14d
      
      The deployment has a single replica and uses the RollingUpdate strategy:
      
      $ oc get deployment -n openshift-frr-k8s frr-k8s-webhook-server -o yaml
      
      spec:
        replicas: 1
        strategy:
          rollingUpdate:
            maxSurge: 25%
            maxUnavailable: 25%
          type: RollingUpdate
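With `replicas: 1`, the percentages in this strategy resolve to whole pods. Kubernetes rounds `maxSurge` up and `maxUnavailable` down. So 25% of one replica allows one extra pod and zero unavailable pods. The controller therefore has to start the new pod and wait for it to become ready before it may remove the old one. On a single node where both pods request host port 9123/TCP, that new pod can never be placed.

The bash sketch below only reproduces the rounding arithmetic. The function names are hypothetical, and it does not model the Deployment controller itself.

```shell
#!/usr/bin/env bash
# Sketch of how a Deployment converts percentage rollingUpdate values
# into pod counts. Function names are hypothetical (illustration only).

# max_surge REPLICAS PERCENT: maxSurge rounds UP, i.e. ceil(REPLICAS*PERCENT/100).
max_surge() {
  local replicas=$1 pct=$2
  echo $(( (replicas * pct + 99) / 100 ))
}

# max_unavailable REPLICAS PERCENT: maxUnavailable rounds DOWN, i.e. floor(REPLICAS*PERCENT/100).
max_unavailable() {
  local replicas=$1 pct=$2
  echo $(( replicas * pct / 100 ))
}

replicas=1   # spec.replicas of frr-k8s-webhook-server
pct=25       # maxSurge and maxUnavailable are both 25%

surge=$(max_surge "$replicas" "$pct")
unavailable=$(max_unavailable "$replicas" "$pct")

echo "maxSurge=$surge maxUnavailable=$unavailable"
# Upper bound on pods during the rollout, and how many must stay available.
echo "max pods=$(( replicas + surge )) min available=$(( replicas - unavailable ))"
```

This prints `maxSurge=1 maxUnavailable=0` and `max pods=2 min available=1`. The rollout therefore needs two webhook pods to coexist, which one node cannot satisfy when both bind the same host port.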
      
      
      
      The new pod is not scheduled because the node has no free port for its requested host port:
      
      $ oc describe pod -n openshift-frr-k8s frr-k8s-webhook-server-bc5694f79-w49mc
      Name:                 frr-k8s-webhook-server-bc5694f79-w49mc
      Namespace:            openshift-frr-k8s
      Priority:             2000000000
      Priority Class Name:  system-cluster-critical
      Service Account:      frr-k8s-daemon
      Node:                 <none>
      Labels:               component=frr-k8s-webhook-server
                            pod-template-hash=bc5694f79
      Annotations:          openshift.io/required-scc: privileged
                            openshift.io/scc: privileged
                            security.openshift.io/validated-scc-subject-type: serviceaccount
      Status:               Pending
      IP:                   
      IPs:                  <none>
      Controlled By:        ReplicaSet/frr-k8s-webhook-server-bc5694f79
      Containers:
        frr-k8s-webhook-server:
          Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6a974f04d4aefdb39bf2d4649b24e7e0e87685afa3d07ca46234f1a0c5688e4b
          Port:       9123/TCP
          Host Port:  9123/TCP
          Command:
            /frr-k8s
          Args:
            --log-level=info
            --webhook-mode=onlywebhook
            --disable-cert-rotation=true
            --namespace=$(NAMESPACE)
            --webhook-port=9123
          Requests:
            cpu:      10m
            memory:   50Mi
          Liveness:   http-get https://:webhook/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
          Readiness:  http-get https://:webhook/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
          Environment:
            NAMESPACE:  openshift-frr-k8s (v1:metadata.namespace)
          Mounts:
            /tmp/k8s-webhook-server/serving-certs from cert (ro)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q5tpd (ro)
      Conditions:
        Type           Status
        PodScheduled   False 
      Volumes:
        cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  frr-k8s-webhook-server-cert
          Optional:    false
        kube-api-access-q5tpd:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          Optional:                false
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          Optional:                false
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                                   node-role.kubernetes.io/master:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason            Age                  From               Message
        ----     ------            ----                 ----               -------
        Warning  FailedScheduling  10m                  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
        Warning  FailedScheduling  17s (x2 over 5m17s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports.
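To see which pod currently owns the port, the webhook pods can be listed with their phase, node and requested host ports. The wrapper name `show_webhook_host_ports` is hypothetical. The selector `component=frr-k8s-webhook-server` comes from the pod labels above, and the sketch assumes the older pod carries the same label.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the frr-k8s webhook pods with phase, node and
# host ports, to show which pod is holding 9123/TCP on the node.
show_webhook_host_ports() {
  oc get pod -n openshift-frr-k8s \
    -l component=frr-k8s-webhook-server \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName,HOSTPORT:.spec.containers[*].ports[*].hostPort'
}

# Usage (requires a logged-in oc session):
#   show_webhook_host_ports
```

In this situation one would expect the Running pod to show the node name and 9123, and the Pending pod to show `<none>` for the node. That expectation is an assumption, not output captured in this report.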

      Expected results:

      No manual intervention required during the upgrade.

      Additional info:

      Deleting the running frr-k8s-webhook-server pod (frr-k8s-webhook-server-bc7675998-s77lz in the output above) unblocks the upgrade.
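The manual workaround can be scripted as below. This is a sketch of the workaround, not a fix, and it rests on a few assumptions:

- The function name is hypothetical.
- The old pod is assumed to carry the same `component=frr-k8s-webhook-server` label as the pending one.
- The webhook is unavailable between deleting the old pod and the new one becoming ready.

Selecting on `status.phase=Running` targets only the old pod, because the replacement is still Pending.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the manual workaround in this report: free host
# port 9123/TCP by deleting the Running (old) webhook pod, then wait for the
# Pending replacement to be scheduled and the rollout to complete.
unblock_webhook_rollout() {
  local ns=openshift-frr-k8s
  # Only the old pod is Running; the new one is still Pending.
  oc delete pod -n "$ns" \
    -l component=frr-k8s-webhook-server \
    --field-selector=status.phase=Running || return
  # Block until the Deployment reports the rollout as finished (or time out).
  oc rollout status deployment/frr-k8s-webhook-server -n "$ns" --timeout=5m
}

# Usage (requires a logged-in oc session with sufficient permissions):
#   unblock_webhook_rollout
```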

              fpaoline@redhat.com Federico Paolinelli
              rhn-support-jortialc Juan Orti