-
Bug
-
Resolution: Done-Errata
-
Major
-
4.13, 4.14
-
Yes
-
Rejected
-
False
-
This is a clone of issue OCPBUGS-17391. The following is the description of the original issue:
—
the pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration job started failing recently when the
ovnkube-master daemonset would not finish rolling out after 360s.
taking the must gather to debug which happens a few minutes after the test
failure you can see that the daemonset is still not ready, so I believe that
increasing the timeout is not the answer.
some debug info:
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk get daemonsets -A NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE openshift-cluster-csi-drivers aws-ebs-csi-driver-node 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-cluster-node-tuning-operator tuned 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-dns dns-default 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-dns node-resolver 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-image-registry node-ca 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-ingress-canary ingress-canary 3 3 3 3 3 kubernetes.io/os=linux 8h openshift-machine-api machine-api-termination-handler 0 0 0 0 0 kubernetes.io/os=linux,machine.openshift.io/interruptible-instance= 8h openshift-machine-config-operator machine-config-daemon 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-machine-config-operator machine-config-server 3 3 3 3 3 node-role.kubernetes.io/master= 8h openshift-monitoring node-exporter 6 6 6 6 6 kubernetes.io/os=linux 8h openshift-multus multus 6 6 6 6 6 kubernetes.io/os=linux 9h openshift-multus multus-additional-cni-plugins 6 6 6 6 6 kubernetes.io/os=linux 9h openshift-multus network-metrics-daemon 6 6 6 6 6 kubernetes.io/os=linux 9h openshift-network-diagnostics network-check-target 6 6 6 6 6 beta.kubernetes.io/os=linux 9h openshift-ovn-kubernetes ovnkube-master 3 3 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/master= 9h openshift-ovn-kubernetes ovnkube-node 6 6 6 6 6 beta.kubernetes.io/os=linux 9h Name: ovnkube-master Selector: app=ovnkube-master Node-Selector: beta.kubernetes.io/os=linux,node-role.kubernetes.io/master= Labels: networkoperator.openshift.io/generates-operator-status=stand-alone Annotations: deprecated.daemonset.template.generation: 3 kubernetes.io/description: This daemonset launches the ovn-kubernetes controller (master) networking components. networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14 networkoperator.openshift.io/hybrid-overlay-status: disabled networkoperator.openshift.io/ip-family-mode: single-stack release.openshift.io/version: 4.14.0-0.ci.test-2023-08-04-123014-ci-op-c6fp05f4-latest Desired Number of Nodes Scheduled: 3 Current Number of Nodes Scheduled: 3 Number of Nodes Scheduled with Up-to-date Pods: 2 Number of Nodes Scheduled with Available Pods: 2 Number of Nodes Misscheduled: 0 Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed Pod Template: Labels: app=ovnkube-master component=network kubernetes.io/os=linux openshift.io/component=network ovn-db-pod=true type=infra Annotations: networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14 networkoperator.openshift.io/hybrid-overlay-status: disabled networkoperator.openshift.io/ip-family-mode: single-stack target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"} Service Account: ovn-kubernetes-controller
it seems there is one pod that is not coming up all the way and that pod has
two containers not ready (sbdb and nbdb). logs from those containers below:
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk describe pod ovnkube-master-7qlm5 -n openshift-ovn-kubernetes | rg '^ [a-z].*:|Ready' northd: Ready: True nbdb: Ready: False kube-rbac-proxy: Ready: True sbdb: Ready: False ovnkube-master: Ready: True ovn-dbchecker: Ready: True ➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c sbdb 2023-08-04T13:08:49.127480354Z + [[ -f /env/_master ]] 2023-08-04T13:08:49.127562165Z + trap quit TERM INT 2023-08-04T13:08:49.127609496Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes 2023-08-04T13:08:49.127637926Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt' 2023-08-04T13:08:49.127637926Z + transport=ssl 2023-08-04T13:08:49.127645167Z + ovn_raft_conn_ip_url_suffix= 2023-08-04T13:08:49.127682687Z + [[ 10.0.42.108 == \: ]] 2023-08-04T13:08:49.127690638Z + db=sb 2023-08-04T13:08:49.127690638Z + db_port=9642 2023-08-04T13:08:49.127712038Z + ovn_db_file=/etc/ovn/ovnsb_db.db 2023-08-04T13:08:49.127854181Z + [[ ! ssl:10.0.102.2:9642,ssl:10.0.42.108:9642,ssl:10.0.74.128:9642 =~ .:10\.0\.42\.108:. ]] 2023-08-04T13:08:49.128199437Z ++ bracketify 10.0.42.108 2023-08-04T13:08:49.128237768Z ++ case "$1" in 2023-08-04T13:08:49.128265838Z ++ echo 10.0.42.108 2023-08-04T13:08:49.128493242Z + OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt' 2023-08-04T13:08:49.128535253Z + CLUSTER_INITIATOR_IP=10.0.102.2 2023-08-04T13:08:49.128819438Z ++ date -Iseconds 2023-08-04T13:08:49.130157063Z 2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2 2023-08-04T13:08:49.130170893Z + echo '2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2' 2023-08-04T13:08:49.130170893Z + initialize=false 2023-08-04T13:08:49.130179713Z + [[ ! -e /etc/ovn/ovnsb_db.db ]] 2023-08-04T13:08:49.130318475Z + [[ false == \t\r\u\e ]] 2023-08-04T13:08:49.130406657Z + wait 9 2023-08-04T13:08:49.130493659Z + exec /usr/share/ovn/scripts/ovn-ctl -db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-sb-log=-vconsole:info -vfile:off -vPATTERN:console:%D {%Y-%m-%dT%H:%M:%S.###Z} |%05N|%c%T|%p|%m' run_sb_ovsdb 2023-08-04T13:08:49.208399304Z 2023-08-04T13:08:49.208Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log 2023-08-04T13:08:49.213507987Z ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (No such file or directory) 2023-08-04T13:08:49.224890005Z 2023-08-04T13:08:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting... 2023-08-04T13:08:49.224912156Z 2023-08-04T13:08:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt failed (No such file or directory) 2023-08-04T13:08:49.255474964Z 2023-08-04T13:08:49.255Z|00002|raft|INFO|local server ID is 7f92 2023-08-04T13:08:49.333342909Z 2023-08-04T13:08:49.333Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2 2023-08-04T13:08:49.348948944Z 2023-08-04T13:08:49.348Z|00004|reconnect|INFO|ssl:10.0.102.2:9644: connecting... 2023-08-04T13:08:49.349002565Z 2023-08-04T13:08:49.348Z|00005|reconnect|INFO|ssl:10.0.74.128:9644: connecting... 2023-08-04T13:08:49.352510569Z 2023-08-04T13:08:49.352Z|00006|reconnect|INFO|ssl:10.0.102.2:9644: connected 2023-08-04T13:08:49.353870484Z 2023-08-04T13:08:49.353Z|00007|reconnect|INFO|ssl:10.0.74.128:9644: connected 2023-08-04T13:08:49.889326777Z 2023-08-04T13:08:49.889Z|00008|raft|INFO|server 2501 is leader for term 5 2023-08-04T13:08:49.890316765Z 2023-08-04T13:08:49.890Z|00009|raft|INFO|rejecting append_request because previous entry 5,1538 not in local log (mismatch past end of log) 2023-08-04T13:08:49.891199951Z 2023-08-04T13:08:49.891Z|00010|raft|INFO|rejecting append_request because previous entry 5,1539 not in local log (mismatch past end of log) 2023-08-04T13:08:50.225632838Z 2023-08-04T13:08:50Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting... 2023-08-04T13:08:50.225677739Z 2023-08-04T13:08:50Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected 2023-08-04T13:08:50.227772827Z Waiting for OVN_Southbound to come up. 2023-08-04T13:08:55.716284614Z 2023-08-04T13:08:55.716Z|00011|raft|INFO|ssl:10.0.74.128:43498: learned server ID 3dff 2023-08-04T13:08:55.716323395Z 2023-08-04T13:08:55.716Z|00012|raft|INFO|ssl:10.0.74.128:43498: learned remote address ssl:10.0.74.128:9644 2023-08-04T13:08:55.724570375Z 2023-08-04T13:08:55.724Z|00013|raft|INFO|ssl:10.0.102.2:47804: learned server ID 2501 2023-08-04T13:08:55.724599466Z 2023-08-04T13:08:55.724Z|00014|raft|INFO|ssl:10.0.102.2:47804: learned remote address ssl:10.0.102.2:9644 2023-08-04T13:08:59.348572779Z 2023-08-04T13:08:59.348Z|00015|memory|INFO|32296 kB peak resident set size after 10.1 seconds 2023-08-04T13:08:59.348648190Z 2023-08-04T13:08:59.348Z|00016|memory|INFO|atoms:35959 cells:31476 monitors:0 n-weak-refs:749 raft-connections:4 raft-log:1543 txn-history:100 txn-history-atoms:7100 ➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c nbdb 2023-08-04T13:08:48.779743434Z + [[ -f /env/_master ]] 2023-08-04T13:08:48.779743434Z + trap quit TERM INT 2023-08-04T13:08:48.779825516Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes 2023-08-04T13:08:48.779825516Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt' 2023-08-04T13:08:48.779825516Z + transport=ssl 2023-08-04T13:08:48.779825516Z + ovn_raft_conn_ip_url_suffix= 2023-08-04T13:08:48.779825516Z + [[ 10.0.42.108 == \: ]] 2023-08-04T13:08:48.779825516Z + db=nb 2023-08-04T13:08:48.779825516Z + db_port=9641 2023-08-04T13:08:48.779825516Z + ovn_db_file=/etc/ovn/ovnnb_db.db 2023-08-04T13:08:48.779887606Z + [[ ! ssl:10.0.102.2:9641,ssl:10.0.42.108:9641,ssl:10.0.74.128:9641 =~ .:10\.0\.42\.108:. ]] 2023-08-04T13:08:48.780159182Z ++ bracketify 10.0.42.108 2023-08-04T13:08:48.780167142Z ++ case "$1" in 2023-08-04T13:08:48.780172102Z ++ echo 10.0.42.108 2023-08-04T13:08:48.780314224Z + OVN_ARGS='--db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt' 2023-08-04T13:08:48.780314224Z + CLUSTER_INITIATOR_IP=10.0.102.2 2023-08-04T13:08:48.780518588Z ++ date -Iseconds 2023-08-04T13:08:48.781738820Z 2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108 2023-08-04T13:08:48.781753021Z + echo '2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108' 2023-08-04T13:08:48.781753021Z + initialize=false 2023-08-04T13:08:48.781753021Z + [[ ! -e /etc/ovn/ovnnb_db.db ]] 2023-08-04T13:08:48.781816342Z + [[ false == \t\r\u\e ]] 2023-08-04T13:08:48.781936684Z + wait 9 2023-08-04T13:08:48.781974715Z + exec /usr/share/ovn/scripts/ovn-ctl -db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D {%Y-%m-%dT%H:%M:%S.###Z} |%05N|%c%T|%p|%m' run_nb_ovsdb 2023-08-04T13:08:48.851644059Z 2023-08-04T13:08:48.851Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log 2023-08-04T13:08:48.852091247Z ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory) 2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting... 2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory) 2023-08-04T13:08:48.875126148Z 2023-08-04T13:08:48.875Z|00002|raft|INFO|local server ID is c503 2023-08-04T13:08:48.911846610Z 2023-08-04T13:08:48.911Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2 2023-08-04T13:08:48.918864408Z 2023-08-04T13:08:48.918Z|00004|reconnect|INFO|ssl:10.0.102.2:9643: connecting... 2023-08-04T13:08:48.918934490Z 2023-08-04T13:08:48.918Z|00005|reconnect|INFO|ssl:10.0.74.128:9643: connecting... 2023-08-04T13:08:48.923439162Z 2023-08-04T13:08:48.923Z|00006|reconnect|INFO|ssl:10.0.102.2:9643: connected 2023-08-04T13:08:48.925166154Z 2023-08-04T13:08:48.925Z|00007|reconnect|INFO|ssl:10.0.74.128:9643: connected 2023-08-04T13:08:49.861650961Z 2023-08-04T13:08:49Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting... 2023-08-04T13:08:49.861747153Z 2023-08-04T13:08:49Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected 2023-08-04T13:08:49.875272530Z 2023-08-04T13:08:49.875Z|00008|raft|INFO|server fccb is leader for term 6 2023-08-04T13:08:49.875302480Z 2023-08-04T13:08:49.875Z|00009|raft|INFO|rejecting append_request because previous entry 6,1732 not in local log (mismatch past end of log) 2023-08-04T13:08:49.876027164Z Waiting for OVN_Northbound to come up. 2023-08-04T13:08:55.694760761Z 2023-08-04T13:08:55.694Z|00010|raft|INFO|ssl:10.0.74.128:57122: learned server ID d382 2023-08-04T13:08:55.694800872Z 2023-08-04T13:08:55.694Z|00011|raft|INFO|ssl:10.0.74.128:57122: learned remote address ssl:10.0.74.128:9643 2023-08-04T13:08:55.706904913Z 2023-08-04T13:08:55.706Z|00012|raft|INFO|ssl:10.0.102.2:43230: learned server ID fccb 2023-08-04T13:08:55.706931733Z 2023-08-04T13:08:55.706Z|00013|raft|INFO|ssl:10.0.102.2:43230: learned remote address ssl:10.0.102.2:9643 2023-08-04T13:08:58.919567770Z 2023-08-04T13:08:58.919Z|00014|memory|INFO|21944 kB peak resident set size after 10.1 seconds 2023-08-04T13:08:58.919643762Z 2023-08-04T13:08:58.919Z|00015|memory|INFO|atoms:8471 cells:7481 monitors:0 n-weak-refs:200 raft-connections:4 raft-log:1737 txn-history:72 txn-history-atoms:8165 ➜ static-kas git:(master)
This seems to happen very frequently now, but was not happening before around July 21st.
- blocks
-
OCPBUGS-20197 [4.13] Rollout of ovnk pods is taking more time
- Closed
- clones
-
OCPBUGS-19808 Rollout of ovnk pods is taking more time
- Closed
- is blocked by
-
OCPBUGS-19808 Rollout of ovnk pods is taking more time
- Closed
- is cloned by
-
OCPBUGS-20197 [4.13] Rollout of ovnk pods is taking more time
- Closed
-
OCPBUGS-22292 [4.13] Rollout of ovnk pods is taking more time
- Closed
- is related to
-
OCPBUGS-22293 [4.13] CNO fails to apply ovnkube-master daemonset during upgrade
- Closed
- links to
-
RHBA-2023:5672 OpenShift Container Platform 4.13.z bug fix update