OpenShift Bugs / OCPBUGS-63743

HostedCluster ovnkube-node and multus pods crashLooping after kubelet restart


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.21
    • Quality / Stability / Reliability
    • Sprint: CORENET Sprint 279

      Description of problem:
      In a HostedCluster configured with a custom networking config:

      apiVersion: hypershift.openshift.io/v1beta1
      kind: HostedCluster
      metadata:
        annotations:
          hypershift.openshift.io/control-plane-operator-image: quay.io/jparrill/hypershift:OCPBUGS-59649-v67
      spec:
      ....
      ....
      ....
        operatorConfiguration:
          clusterNetworkOperator:
            disableMultiNetwork: false
            ovnKubernetesConfig:
              ipv4:
                internalJoinSubnet: 100.99.0.0/16
                internalTransitSwitchSubnet: 100.100.0.0/16
      

      I'm using a concrete image build, quay.io/jparrill/hypershift:OCPBUGS-59649-v67, because this feature is still in development; make sure you use it when reproducing the issue.

      Version-Release number of selected component (if applicable):

      • 4.21.0-0.ci-2025-10-31-105038-test-ci-op-ts6w8gjy-latest

      How reproducible:

      Steps to Reproduce:

      1. Create a HostedCluster in AWS with the hypershift CLI, using the --render and --render-sensitive flags to print the manifests to STDOUT, and redirect the output to a file with > (a sketch of steps 1-3 follows below)
      2. Edit the manifest and add the configuration set above
      3. Create the cluster
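
      A rough sketch of steps 1-3 (cluster name, base domain, region, credential paths, and replica count here are illustrative; the flags are those of the hypershift CLI):

      # 1. Render the manifests to a file instead of applying them directly
      hypershift create cluster aws \
        --name jparrill-hosted \
        --base-domain example.dev \
        --pull-secret ~/pull-secret.json \
        --aws-creds ~/.aws/credentials \
        --region us-east-1 \
        --node-pool-replicas 2 \
        --control-plane-operator-image quay.io/jparrill/hypershift:OCPBUGS-59649-v67 \
        --render --render-sensitive > hostedcluster.yaml

      # 2. Add the spec.operatorConfiguration block shown above to the HostedCluster
      #    object inside hostedcluster.yaml

      # 3. Create the cluster from the edited manifests
      oc apply -f hostedcluster.yaml
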
      4. Once the cluster is up, access the HostedCluster via its kubeconfig, for example:

      oc get secret -n clusters jparrill-hosted-admin-kubeconfig -o jsonpath='{.data.kubeconfig}'| base64 -d > /Users/jparrill/RedHat/RedHat_Engineering/hypershift/hosted_clusters/clusters-jparrill-hosted/kubeconfig
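
      Optionally, with that kubeconfig you can confirm the custom subnets were propagated into the hosted cluster's network operator config (a minimal check; the field paths mirror the HostedCluster snippet above, but verify them against your CNO version):

      # Should print the ipv4 block with internalJoinSubnet: 100.99.0.0/16 and
      # internalTransitSwitchSubnet: 100.100.0.0/16
      oc --kubeconfig=<path-to-hosted-kubeconfig> get networks.operator.openshift.io cluster \
        -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipv4}'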
      

      5. Create the additional pull secret to trigger the kubelet restart; you can use these files:

      ### create-additional-user-ps.sh
      #!/usr/bin/env bash
      # Creates the additional pull secret from the dockerconfigjson file passed as $1
      if [[ -z "$1" ]]; then
          echo "give me a secret"
          exit 1
      fi
      
      kubectl create secret generic additional-pull-secret \
        --from-file=.dockerconfigjson="$1" \
        --type=kubernetes.io/dockerconfigjson \
        --namespace=kube-system
      
      
      ### dockerps-1
      {
              "auths": {
                      "docker.io": {
                              "auth": "cGFkYWp1YW46ZGNrcl9wYXRfdnFWbTVxWGtRb2ZMbnJCZHFFYVlxSm9kQk1Z"
                      }
              }
      }
      
      ## Then execute
      ./create-additional-user-ps.sh dockerps-1
      

      6. This should trigger reconciliation of the globalPullSecret controller and, after that, a restart of the kubelet on each node.
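
      To confirm the kubelet actually restarted, one option (a sketch; run it with the hosted cluster kubeconfig from step 4, the node name comes from the outputs below, and the exact journal messages may vary by RHCOS version) is:

      # Look for a recent "Stopping ... Kubelet" / "Started ... Kubelet" pair in the journal
      oc debug node/ip-10-0-14-138.ec2.internal -- chroot /host \
        journalctl -u kubelet --no-pager | grep -iE 'stopping|started' | tail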

      Actual results:

      λ static oc get pod -n openshift-ovn-kubernetes
      NAME                 READY   STATUS             RESTARTS        AGE
      ovnkube-node-2kgtm   7/8     CrashLoopBackOff   14 (160m ago)   3h27m
      ovnkube-node-dkjjf   7/8     CrashLoopBackOff   18 (161m ago)   3h27m
      
      λ static oc get pod -n openshift-multus
      NAME                                  READY   STATUS             RESTARTS        AGE
      multus-additional-cni-plugins-qwsxk   1/1     Running            0               3h37m
      multus-q6shp                          0/1     CrashLoopBackOff   12 (161m ago)   3h37m
      network-metrics-daemon-tppdv          2/2     Running            0               3h37m
      multus-additional-cni-plugins-blr75   1/1     Running            0               3h36m
      multus-gfqt8                          0/1     CrashLoopBackOff   16 (162m ago)   3h36m
      network-metrics-daemon-m7nx8          2/2     Running            0               3h36m
      

      Expected results:
      No crashloop pods

      Additional info:

      Slack thread: https://redhat-internal.slack.com/archives/CK1AE4ZCK/p1761899088565229

      Affected Platforms:

      It was found internally during the development of a feature. This is the PR: https://github.com/openshift/hypershift/pull/6745.

      If it is a CI failure:

      • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
      • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
      NAME                          STATUS   ROLES    AGE     VERSION
      ip-10-0-3-104.ec2.internal    Ready    worker   3h37m   v1.34.1
      ip-10-0-14-138.ec2.internal   Ready    worker   3h36m   v1.34.1
      
      Error:
      2025-10-31T12:07:29.491809789Z + exec /usr/bin/ovnkube --init-ovnkube-controller ip-10-0-14-138.ec2.internal --init-node ip-10-0-14-138.ec2.internal --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --metrics-enable-config-duration --export-ovs-metrics --disable-snat-multiple-gws --enable-multi-network --enable-network-segmentation --enable-preconfigured-udn-addresses --enable-admin-network-policy --enable-multicast --zone ip-10-0-14-138.ec2.internal --enable-interconnect --acl-logging-rate-limit 20 --disable-forwarding --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h --gateway-v4-join-subnet 100.99.0.0/16 --gateway-v4-masquerade-subnet 169.254.0.0/17 --gateway-v6-masquerade-subnet fd69::/112 --cluster-manager-v4-transit-switch-subnet 100.100.0.0/16 --enable-egress-ip=true --enable-egress-firewall=true --enable-egress-qos=true --enable-egress-service=true --enable-multi-external-gateway=true
      2025-10-31T12:07:29.523946921Z Incorrect Usage: flag provided but not defined: -cluster-manager-v4-transit-switch-subnet
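
      The crash comes from ovnkube rejecting --cluster-manager-v4-transit-switch-subnet. To check whether the ovnkube binary shipped in the deployed image knows this flag at all, something like the following can be used (a sketch; the pod name is taken from the output above and the ovn-controller container runs the same ovn-kubernetes image as the crashing ovnkube-controller container, but adjust names as needed):

      # No output means the flag is not defined in this ovnkube build, matching the error above
      oc -n openshift-ovn-kubernetes exec ovnkube-node-2kgtm -c ovn-controller -- \
        /usr/bin/ovnkube --help 2>&1 | grep -i 'transit-switch-subnet'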
      

        Assignee: Patryk Diak (pdiak@redhat.com)
        Reporter: Juan Manuel Parrilla Madrid (jparrill@redhat.com)
        QA Contact: Anurag Saxena