OpenShift Bugs / OCPBUGS-27061

[ARO] OCP Upgrade at load (4.12.25 --> 4.13.24) Failed


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Critical
    • Affects Version: 4.13
    • Severity: Important
    • Sprint: SDN Sprint 247, SDN Sprint 248
    • Story Points: 2

      Description of problem:

      A cluster with 252 worker nodes was loaded with the cluster-density-v2
      workload; a subsequent attempt to upgrade the cluster from OCP 4.12.25 to
      OCP 4.13.24 failed.

      Configuration of the cluster in question:

      Master Nodes: Standard_D32s_v5 x 3
      Infra Nodes:  Standard_E16s_v3 x 3
      Worker Nodes: Standard_D8s_v5  x 252

      Version-Release number of selected component (if applicable):

      From: OCP 4.12.25
      To:   OCP 4.13.24 [channel: fast-4.13]

      Steps to Reproduce:

      1. kube-burner ocp cluster-density-v2 --gc=false --iterations=2268 --churn=false
      2. oc adm upgrade channel fast-4.13
      3. oc adm upgrade --to=4.13.24

      Actual results:

      oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.25   True        True          91m     Unable to apply 4.13.24: wait has exceeded 40 minutes for these operators: network

      Expected results:

      OCP Cluster should have upgraded to 4.13.24

      Additional info:

      $ oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.25   True        True          89m     Unable to apply 4.13.24: wait has exceeded 40 minutes for these operators: network
      $
      ============================================================
      NAME                                       VERSION        AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      aro                                        v20231214.00   True        False         False      46h     
      authentication                             4.13.24        True        False         False      61m     
      cloud-controller-manager                   4.13.24        True        False         False      46h     
      cloud-credential                           4.13.24        True        False         False      46h     
      cluster-autoscaler                         4.13.24        True        False         False      46h     
      config-operator                            4.13.24        True        False         False      46h     
      console                                    4.13.24        True        False         False      15h     
      control-plane-machine-set                  4.13.24        True        False         False      46h     
      csi-snapshot-controller                    4.13.24        True        False         False      46h     
      dns                                        4.12.25        True        False         False      46h     
      etcd                                       4.13.24        True        False         False      46h     
      image-registry                             4.13.24        True        False         False      46h     
      ingress                                    4.13.24        True        False         False      61m     
      insights                                   4.13.24        True        False         False      46h     
      kube-apiserver                             4.13.24        True        False         False      46h     
      kube-controller-manager                    4.13.24        True        False         False      46h     
      kube-scheduler                             4.13.24        True        False         False      46h     
      kube-storage-version-migrator              4.13.24        True        False         False      45h     
      machine-api                                4.13.24        True        False         False      46h     
      machine-approver                           4.13.24        True        False         False      46h     
      machine-config                             4.12.25        True        False         False      37h     
      marketplace                                4.13.24        True        False         False      46h     
      monitoring                                 4.13.24        True        False         False      46h     
      network                                    4.12.25        True        True          True       46h     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-27mvw is in CrashLoopBackOff State...
      node-tuning                                4.13.24        True        False         False      61m     
      openshift-apiserver                        4.13.24        True        False         False      46h     
      openshift-controller-manager               4.13.24        True        False         False      46h     
      openshift-samples                          4.13.24        True        False         False      63m     
      operator-lifecycle-manager                 4.13.24        True        False         False      46h     
      operator-lifecycle-manager-catalog         4.13.24        True        False         False      46h     
      operator-lifecycle-manager-packageserver   4.13.24        True        False         False      45h     
      service-ca                                 4.13.24        True        False         False      46h     
      storage                                    4.13.24        True        False         False      46h  
      
      ============================================================
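
      Of the operators above, only dns, machine-config, and network are still at
      4.12.25, with network the one actually blocking. As an aside, a quick way to
      pull the lagging operators out of `oc get co` output is a small Python sketch
      like the following (not part of any OpenShift tooling; the helper name and
      sample table are illustrative):

      ```python
      # Minimal sketch: given the plain-text output of `oc get co`, list the
      # cluster operators whose VERSION column has not reached the upgrade target.
      TARGET = "4.13.24"

      def lagging_operators(co_output: str, target: str = TARGET) -> list[str]:
          lagging = []
          for line in co_output.strip().splitlines()[1:]:  # skip the header row
              fields = line.split()
              # fields[0] is NAME, fields[1] is VERSION; operators like `aro`
              # that carry their own version scheme will also be flagged.
              if len(fields) >= 2 and fields[1] != target:
                  lagging.append(fields[0])
          return lagging

      sample = """\
      NAME            VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
      dns             4.12.25   True        False         False      46h
      etcd            4.13.24   True        False         False      46h
      machine-config  4.12.25   True        False         False      37h
      network         4.12.25   True        True          True       46h
      """
      print(lagging_operators(sample))  # -> ['dns', 'machine-config', 'network']
      ```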
      $ oc get po | grep -i master
      ovnkube-master-27mvw   5/6     CrashLoopBackOff   16 (3m5s ago)   42m
      ovnkube-master-7959l   4/6     CrashLoopBackOff   15 (85s ago)    39m
      ovnkube-master-8k9rc   5/6     CrashLoopBackOff   22 (54s ago)    38m
      ============================================================ 
      
        ovn-dbchecker:
          Container ID:  cri-o://ed05d1834860fe64162db7bb1cb802b61c7d373725d7b33c8391a89a98e89cec
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f
          Port:          <none>
          Host Port:     <none>
          Command:
            /bin/bash
            -c
            set -xe
            if [[ -f "/env/_master" ]]; then
              set -o allexport
              source "/env/_master"
              set +o allexport
            fi
            
            echo "I$(date "+%m%d %H:%M:%S.%N") - ovn-dbchecker - start ovn-dbchecker"
            
            # RAFT clusters need an odd number of members to achieve consensus.
            # The CNO determines which members make up the cluster, so if this container
            # is not supposed to be part of the cluster, wait forever doing nothing
            # (instead of exiting and causing CrashLoopBackoffs for no reason).
            if [[ ! "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" =~ .*":${K8S_NODE_IP}:".* ]] && [[ ! "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" =~ .*":[${K8S_NODE_IP}]:".* ]]; then
              echo "$(date -Iseconds) - not selected as RAFT member; sleeping..."
              sleep 1500d
              exit 0
            fi
            
            exec /usr/bin/ovndbchecker \
              --config-file=/run/ovnkube-config/ovnkube.conf \
              --loglevel "${OVN_KUBE_LOG_LEVEL}" \
              --sb-address "ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642" \
              --sb-client-privkey /ovn-cert/tls.key \
              --sb-client-cert /ovn-cert/tls.crt \
              --sb-client-cacert /ovn-ca/ca-bundle.crt \
              --sb-cert-common-name "ovn" \
              --sb-raft-election-timer "16" \
              --nb-address "ssl:10.0.0.8:9641,ssl:10.0.0.10:9641,ssl:10.0.0.9:9641" \
              --nb-client-privkey /ovn-cert/tls.key \
              --nb-client-cert /ovn-cert/tls.crt \
              --nb-client-cacert /ovn-ca/ca-bundle.crt \
              --nb-cert-common-name "ovn" \
              --nb-raft-election-timer "10"
            
          State:       Running
            Started:   Fri, 12 Jan 2024 14:51:02 +0530
          Last State:  Terminated
            Reason:    Error
            Message:   27       1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to upgrade schema, stderr: \"2024-01-12T09:18:48Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-sb.ovsschema: changed 2 columns in 'OVN_Southbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns\\n2024-01-12T09:19:18Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 30 convert unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' failed: signal: alarm clock"
      E0112 09:20:00.805994       1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to upgrade schema, stderr: \"2024-01-12T09:19:30Z|00001|ovsdb|WARN|/usr/share/ovn/ovn-sb.ovsschema: changed 2 columns in 'OVN_Southbound' database from ephemeral to persistent, including 'status' column in 'Connection' table, because clusters do not support ephemeral columns\\n2024-01-12T09:20:00Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 30 convert unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' failed: signal: alarm clock"
      E0112 09:20:10.818688       1 ovndbmanager.go:354] "OVN_Southbound scheme upgrade failed" err="failed to get schema version for NBDB, stderr: \"2024-01-12T09:20:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\\n\", error: OVN command '/usr/bin/ovsdb-client -t 10 get-schema-version unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound' failed: signal: alarm clock"
      F0112 09:20:10.818733       1 ovndbmanager.go:54] SBDB Upgrade failed: failed to upgrade db schema: timed out waiting for the condition. Error from last attempt: failed to get schema version for NBDB, stderr: "2024-01-12T09:20:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n", error: OVN command '/usr/bin/ovsdb-client -t 10 get-schema-version unix:/var/run/ovn/ovnsb_db.sock OVN_Southbound' failed: signal: alarm clock
      
      
            Exit Code:    255
            Started:      Fri, 12 Jan 2024 14:44:08 +0530
            Finished:     Fri, 12 Jan 2024 14:50:10 +0530
          Ready:          True
          Restart Count:  10
          Requests:
            cpu:     10m
            memory:  300Mi
          Environment:
            OVN_KUBE_LOG_LEVEL:  4
            K8S_NODE_IP:          (v1:status.hostIP)
          Mounts:
            /env from env-overrides (rw)
            /etc/openvswitch/ from etc-openvswitch (rw)
            /etc/ovn/ from etc-openvswitch (rw)
            /ovn-ca from ovn-ca (rw)
            /ovn-cert from ovn-cert (rw)
            /run/openvswitch/ from run-openvswitch (rw)
            /run/ovn/ from run-ovn (rw)
            /run/ovnkube-config/ from ovnkube-config (rw)
            /var/lib/openvswitch/ from var-lib-openvswitch (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-stb4h (ro)
      Conditions:
        Type              Status
        Initialized       True 
        Ready             True 
        ContainersReady   True 
        PodScheduled      True 
      Volumes:
        systemd-units:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/systemd/system
          HostPathType: 
        etc-openvswitch:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lib/ovn/etc
          HostPathType: 
        var-lib-openvswitch:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lib/ovn/data
          HostPathType: 
        run-openvswitch:
          Type:          HostPath (bare host directory volume)
          Path:          /var/run/openvswitch
          HostPathType: 
        run-ovn:
          Type:          HostPath (bare host directory volume)
          Path:          /var/run/ovn
          HostPathType: 
        ovnkube-config:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      ovnkube-config
          Optional:  false
        env-overrides:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      env-overrides
          Optional:  true
        ovn-ca:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      ovn-ca
          Optional:  false
        ovn-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  ovn-cert
          Optional:    false
        ovn-master-metrics-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  ovn-master-metrics-cert
          Optional:    true
        kube-api-access-stb4h:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              beta.kubernetes.io/os=linux
                                   node-role.kubernetes.io/master=
      Tolerations:                 node-role.kubernetes.io/master op=Exists
                                   node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/network-unavailable op=Exists
                                   node.kubernetes.io/not-ready op=Exists
                                   node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/unreachable op=Exists
                                   node.kubernetes.io/unschedulable:NoSchedule op=Exists
      Events:
        Type     Reason     Age                  From               Message
        ----     ------     ----                 ----               -------
        Normal   Scheduled  72m                  default-scheduler  Successfully assigned openshift-ovn-kubernetes/ovnkube-master-27mvw to krishvoor-v5-ocp-jrq4p-master-0
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Created    72m                  kubelet            Created container northd
        Normal   Started    72m                  kubelet            Started container northd
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Created    72m                  kubelet            Created container nbdb
        Normal   Started    72m                  kubelet            Started container nbdb
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ff58f8cff3d9c63906656c10e45f9b61fda02d86165d6de8a4e8c0fc4bbca250" already present on machine
        Normal   Pulled     72m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Started    72m                  kubelet            Started container kube-rbac-proxy
        Normal   Created    72m                  kubelet            Created container kube-rbac-proxy
        Normal   Created    72m                  kubelet            Created container sbdb
        Normal   Started    72m                  kubelet            Started container sbdb
        Normal   Pulled     71m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Started    71m                  kubelet            Started container ovnkube-master
        Normal   Created    71m                  kubelet            Created container ovnkube-master
        Normal   Pulled     71m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:09719f2468f20098bcfed60af5394956311830885df86edbe40add8704c8703f" already present on machine
        Normal   Created    71m                  kubelet            Created container ovn-dbchecker
        Normal   Started    71m                  kubelet            Started container ovn-dbchecker
        Warning  BackOff    32m (x134 over 68m)  kubelet            Back-off restarting failed container
        Warning  Unhealthy  22m (x130 over 68m)  kubelet            Readiness probe failed: SB DB Raft leader is unknown to the cluster node.
      + [[ ! ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642 =~ .*:10\.0\.0\.8:.* ]]
      ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
      ++ grep 'Leader: unknown'
      + leader_status='Leader: unknown'
      + [[ ! -z Leader: unknown ]]
      + echo 'SB DB Raft leader is unknown to the cluster node.'
      + exit 1
        Warning  Unhealthy  2m39s (x397 over 70m)  kubelet  Readiness probe failed: + [[ ! ssl:10.0.0.8:9642,ssl:10.0.0.10:9642,ssl:10.0.0.9:9642 =~ .*:10\.0\.0\.8:.* ]]
      ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=3 cluster/status OVN_Southbound
      ++ grep 'Leader: unknown'
      ++ true
      + leader_status=
      ============================================================ 
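
      Per the "terminating with signal 14 (Alarm clock)" lines in the logs above,
      the schema conversion is being killed by ovsdb-client's own `-t` watchdog
      (30 s for the convert, 10 s for get-schema-version), which on this loaded
      database expires before the operation completes. A small Python sketch
      (illustrative only; the `parse_failure` helper is not part of any tooling)
      that pulls the failing command and its timeout out of such an error line:

      ```python
      import re

      # Abridged error line from the ovn-dbchecker "Last State" message above.
      log = ("error: OVN command '/usr/bin/ovsdb-client -t 30 convert "
             "unix:/var/run/ovn/ovnsb_db.sock /usr/share/ovn/ovn-sb.ovsschema' "
             "failed: signal: alarm clock")

      def parse_failure(line: str):
          """Extract the failing OVN command and its -t timeout in seconds."""
          cmd = re.search(r"OVN command '([^']+)'", line)
          timeout = re.search(r"-t (\d+)", line)
          return (cmd.group(1) if cmd else None,
                  int(timeout.group(1)) if timeout else None)

      command, timeout_s = parse_failure(log)
      print(timeout_s)  # -> 30
      ```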
      
      $ 
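
      The readiness probe trace above boils down to grepping the output of
      `ovn-appctl ... cluster/status OVN_Southbound` for "Leader: unknown" and
      failing the probe when it appears. The same check, as a minimal Python
      sketch (the function name and sample status strings are illustrative):

      ```python
      def sbdb_ready(cluster_status: str) -> bool:
          """Mirror of the probe logic traced above: the SB DB raft member is
          treated as ready only when cluster/status does not report an
          unknown leader."""
          return "Leader: unknown" not in cluster_status

      healthy = "Leader: self\nTerm: 4\n"
      unhealthy = "Leader: unknown\nTerm: 4\n"
      print(sbdb_ready(healthy), sbdb_ready(unhealthy))  # -> True False
      ```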

       

              Assignee: Nadia Pinaeva (npinaeva@redhat.com)
              Reporter: Krishna Harsha Voora (rh-ee-krvoora)
              QA Contact: Anurag Saxena