OpenShift Bugs / OCPBUGS-63743

HostedCluster ovnkube-node and multus pods crashLooping after kubelet restart


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 4.21
    • Quality / Stability / Reliability
    • Sprint: CORENET Sprint 279

      Description of problem:
      In a HostedCluster configured with a custom networking config:

      apiVersion: hypershift.openshift.io/v1beta1
      kind: HostedCluster
      metadata:
        annotations:
          hypershift.openshift.io/control-plane-operator-image: quay.io/jparrill/hypershift:OCPBUGS-59649-v67
      spec:
      ....
      ....
      ....
        operatorConfiguration:
          clusterNetworkOperator:
            disableMultiNetwork: false
            ovnKubernetesConfig:
              ipv4:
                internalJoinSubnet: 100.99.0.0/16
                internalTransitSwitchSubnet: 100.100.0.0/16
      

      I'm using a concrete image build, quay.io/jparrill/hypershift:OCPBUGS-59649-v67, because this feature is still in development; make sure you use it when reproducing the issue.

      Version-Release number of selected component (if applicable):

      • 4.21.0-0.ci-2025-10-31-105038-test-ci-op-ts6w8gjy-latest

      How reproducible:

      Steps to Reproduce:

      1. Create a HostedCluster in AWS with the hypershift CLI, using the --render and --render-sensitive flags to print the manifests to STDOUT, and redirect the output to a file with > (a sketch of steps 1-3 follows below)
      2. Edit the manifest and add the configuration set above
      3. Create the cluster
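
      A rough sketch of steps 1-3 (cluster name, base domain, region, credential paths, and replica count here are illustrative; the flags are those of the hypershift CLI):

      # 1. Render the manifests to a file instead of applying them directly
      hypershift create cluster aws \
        --name jparrill-hosted \
        --base-domain example.dev \
        --pull-secret ~/pull-secret.json \
        --aws-creds ~/.aws/credentials \
        --region us-east-1 \
        --node-pool-replicas 2 \
        --control-plane-operator-image quay.io/jparrill/hypershift:OCPBUGS-59649-v67 \
        --render --render-sensitive > hostedcluster.yaml

      # 2. Add the spec.operatorConfiguration block shown above to the HostedCluster
      #    object inside hostedcluster.yaml

      # 3. Create the cluster from the edited manifests
      oc apply -f hostedcluster.yaml
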
      4. Once the cluster is up, access the HostedCluster via its kubeconfig, for example:

      oc get secret -n clusters jparrill-hosted-admin-kubeconfig -o jsonpath='{.data.kubeconfig}'| base64 -d > /Users/jparrill/RedHat/RedHat_Engineering/hypershift/hosted_clusters/clusters-jparrill-hosted/kubeconfig
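
      Optionally, with that kubeconfig you can confirm the custom subnets were propagated into the hosted cluster's network operator config (a minimal check; the field paths mirror the HostedCluster snippet above, but verify them against your CNO version):

      # Should print the ipv4 block with internalJoinSubnet: 100.99.0.0/16 and
      # internalTransitSwitchSubnet: 100.100.0.0/16
      oc --kubeconfig=<path-to-hosted-kubeconfig> get networks.operator.openshift.io cluster \
        -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.ipv4}'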
      

      5. Create the additional pull secret to trigger the kubelet restart; you can use these files:

      ### create-additional-user-ps.sh
      #!/usr/bin/env bash
      # Creates the additional pull secret from the dockerconfigjson file passed as $1
      if [[ -z "$1" ]]; then
          echo "give me a secret"
          exit 1
      fi
      
      kubectl create secret generic additional-pull-secret \
        --from-file=.dockerconfigjson="$1" \
        --type=kubernetes.io/dockerconfigjson \
        --namespace=kube-system
      
      
      ### dockerps-1
      {
              "auths": {
                      "docker.io": {
                              "auth": "cGFkYWp1YW46ZGNrcl9wYXRfdnFWbTVxWGtRb2ZMbnJCZHFFYVlxSm9kQk1Z"
                      }
              }
      }
      
      ## Then execute
      ./create-additional-user-ps.sh dockerps-1
      

      6. This should trigger reconciliation of the globalPullSecret controller and, after that, a restart of the kubelet on each node.
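
      To confirm the kubelet actually restarted, one option (a sketch; run it with the hosted cluster kubeconfig from step 4, the node name comes from the outputs below, and the exact journal messages may vary by RHCOS version) is:

      # Look for a recent "Stopping ... Kubelet" / "Started ... Kubelet" pair in the journal
      oc debug node/ip-10-0-14-138.ec2.internal -- chroot /host \
        journalctl -u kubelet --no-pager | grep -iE 'stopping|started' | tail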

      Actual results:

      λ static oc get pod -n openshift-ovn-kubernetes
      NAME                 READY   STATUS             RESTARTS        AGE
      ovnkube-node-2kgtm   7/8     CrashLoopBackOff   14 (160m ago)   3h27m
      ovnkube-node-dkjjf   7/8     CrashLoopBackOff   18 (161m ago)   3h27m
      
      λ static oc get pod -n openshift-multus
      NAME                                  READY   STATUS             RESTARTS        AGE
      multus-additional-cni-plugins-qwsxk   1/1     Running            0               3h37m
      multus-q6shp                          0/1     CrashLoopBackOff   12 (161m ago)   3h37m
      network-metrics-daemon-tppdv          2/2     Running            0               3h37m
      multus-additional-cni-plugins-blr75   1/1     Running            0               3h36m
      multus-gfqt8                          0/1     CrashLoopBackOff   16 (162m ago)   3h36m
      network-metrics-daemon-m7nx8          2/2     Running            0               3h36m
      

      Expected results:
      No crashloop pods

      Additional info:

      Slack thread: https://redhat-internal.slack.com/archives/CK1AE4ZCK/p1761899088565229

      Affected Platforms:

      It was found internally during the development of a feature. This is the PR: https://github.com/openshift/hypershift/pull/6745.

      If it is a CI failure:

      • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
      • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
      NAME                          STATUS   ROLES    AGE     VERSION
      ip-10-0-3-104.ec2.internal    Ready    worker   3h37m   v1.34.1
      ip-10-0-14-138.ec2.internal   Ready    worker   3h36m   v1.34.1
      
      Error:
      2025-10-31T12:07:29.491809789Z + exec /usr/bin/ovnkube --init-ovnkube-controller ip-10-0-14-138.ec2.internal --init-node ip-10-0-14-138.ec2.internal --config-file=/run/ovnkube-config/ovnkube.conf --ovn-empty-lb-events --loglevel 4 --inactivity-probe=180000 --gateway-mode shared --gateway-interface br-ex --metrics-bind-address 127.0.0.1:29103 --ovn-metrics-bind-address 127.0.0.1:29105 --metrics-enable-pprof --metrics-enable-config-duration --export-ovs-metrics --disable-snat-multiple-gws --enable-multi-network --enable-network-segmentation --enable-preconfigured-udn-addresses --enable-admin-network-policy --enable-multicast --zone ip-10-0-14-138.ec2.internal --enable-interconnect --acl-logging-rate-limit 20 --disable-forwarding --bootstrap-kubeconfig=/var/lib/kubelet/kubeconfig --cert-dir=/etc/ovn/ovnkube-node-certs --cert-duration=24h --gateway-v4-join-subnet 100.99.0.0/16 --gateway-v4-masquerade-subnet 169.254.0.0/17 --gateway-v6-masquerade-subnet fd69::/112 --cluster-manager-v4-transit-switch-subnet 100.100.0.0/16 --enable-egress-ip=true --enable-egress-firewall=true --enable-egress-qos=true --enable-egress-service=true --enable-multi-external-gateway=true
      2025-10-31T12:07:29.523946921Z Incorrect Usage: flag provided but not defined: -cluster-manager-v4-transit-switch-subnet
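
      The crash comes from ovnkube rejecting --cluster-manager-v4-transit-switch-subnet. To check whether the ovnkube binary shipped in the deployed image knows this flag at all, something like the following can be used (a sketch; the pod name is taken from the output above and the ovn-controller container runs the same ovn-kubernetes image as the crashing ovnkube-controller container, but adjust names as needed):

      # No output means the flag is not defined in this ovnkube build, matching the error above
      oc -n openshift-ovn-kubernetes exec ovnkube-node-2kgtm -c ovn-controller -- \
        /usr/bin/ovnkube --help 2>&1 | grep -i 'transit-switch-subnet'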
      

        Assignee: Patryk Diak (pdiak@redhat.com)
        Reporter: Juan Manuel Parrilla Madrid (jparrill@redhat.com)
        QA Contact: Anurag Saxena