Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-19808

Rollout of ovnk pods is taking more time

    XMLWordPrintable

Details

    • Yes
    • Approved
    • False
    • Hide

      None

      Show
      None

    Description

      This is a clone of issue OCPBUGS-17391. The following is the description of the original issue:

      the pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration job started failing recently when the
      ovnkube-master daemonset would not finish rolling out after 360s.

      taking the must gather to debug which happens a few minutes after the test
      failure you can see that the daemonset is still not ready, so I believe that
      increasing the timeout is not the answer.

      some debug info:

       

      static-kas git:(master) oc --kubeconfig=/tmp/kk get daemonsets -A 
      NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
      openshift-cluster-csi-drivers aws-ebs-csi-driver-node 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-cluster-node-tuning-operator tuned 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-dns dns-default 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-dns node-resolver 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-image-registry node-ca 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-ingress-canary ingress-canary 3 3 3 3 3 kubernetes.io/os=linux 8h
      openshift-machine-api machine-api-termination-handler 0 0 0 0 0 kubernetes.io/os=linux,machine.openshift.io/interruptible-instance= 8h
      openshift-machine-config-operator machine-config-daemon 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-machine-config-operator machine-config-server 3 3 3 3 3 node-role.kubernetes.io/master= 8h
      openshift-monitoring node-exporter 6 6 6 6 6 kubernetes.io/os=linux 8h
      openshift-multus multus 6 6 6 6 6 kubernetes.io/os=linux 9h
      openshift-multus multus-additional-cni-plugins 6 6 6 6 6 kubernetes.io/os=linux 9h
      openshift-multus network-metrics-daemon 6 6 6 6 6 kubernetes.io/os=linux 9h
      openshift-network-diagnostics network-check-target 6 6 6 6 6 beta.kubernetes.io/os=linux 9h
      openshift-ovn-kubernetes ovnkube-master 3 3 2 2 2 beta.kubernetes.io/os=linux,node-role.kubernetes.io/master= 9h
      openshift-ovn-kubernetes ovnkube-node 6 6 6 6 6 beta.kubernetes.io/os=linux 9h
      Name: ovnkube-master
      Selector: app=ovnkube-master
      Node-Selector: beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=
      Labels: networkoperator.openshift.io/generates-operator-status=stand-alone
      Annotations: deprecated.daemonset.template.generation: 3
      kubernetes.io/description: This daemonset launches the ovn-kubernetes controller (master) networking components.
      networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
      networkoperator.openshift.io/hybrid-overlay-status: disabled
      networkoperator.openshift.io/ip-family-mode: single-stack
      release.openshift.io/version: 4.14.0-0.ci.test-2023-08-04-123014-ci-op-c6fp05f4-latest
      Desired Number of Nodes Scheduled: 3
      Current Number of Nodes Scheduled: 3
      Number of Nodes Scheduled with Up-to-date Pods: 2
      Number of Nodes Scheduled with Available Pods: 2
      Number of Nodes Misscheduled: 0
      Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
      Pod Template:
      Labels: app=ovnkube-master
      component=network
      kubernetes.io/os=linux
      openshift.io/component=network
      ovn-db-pod=true
      type=infra
      Annotations: networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
      networkoperator.openshift.io/hybrid-overlay-status: disabled
      networkoperator.openshift.io/ip-family-mode: single-stack
      target.workload.openshift.io/management:
      {"effect": "PreferredDuringScheduling"}
      Service Account: ovn-kubernetes-controller
      

       

      it seems there is one pod that is not coming up all the way and that pod has
      two containers not ready (sbdb and nbdb). logs from those containers below:

       

      static-kas git:(master) oc --kubeconfig=/tmp/kk describe pod ovnkube-master-7qlm5 -n openshift-ovn-kubernetes | rg '^ [a-z].*:|Ready'
      northd:
      Ready: True
      nbdb:
      Ready: False
      kube-rbac-proxy:
      Ready: True
      sbdb:
      Ready: False
      ovnkube-master:
      Ready: True
      ovn-dbchecker:
      Ready: True
      ➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c sbdb
      2023-08-04T13:08:49.127480354Z + [[ -f /env/_master ]]
      2023-08-04T13:08:49.127562165Z + trap quit TERM INT
      2023-08-04T13:08:49.127609496Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
      2023-08-04T13:08:49.127637926Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
      2023-08-04T13:08:49.127637926Z + transport=ssl
      2023-08-04T13:08:49.127645167Z + ovn_raft_conn_ip_url_suffix=
      2023-08-04T13:08:49.127682687Z + [[ 10.0.42.108 == \: ]]
      2023-08-04T13:08:49.127690638Z + db=sb
      2023-08-04T13:08:49.127690638Z + db_port=9642
      2023-08-04T13:08:49.127712038Z + ovn_db_file=/etc/ovn/ovnsb_db.db
      2023-08-04T13:08:49.127854181Z + [[ ! ssl:10.0.102.2:9642,ssl:10.0.42.108:9642,ssl:10.0.74.128:9642 =~ .:10\.0\.42\.108:. ]]
      2023-08-04T13:08:49.128199437Z ++ bracketify 10.0.42.108
      2023-08-04T13:08:49.128237768Z ++ case "$1" in
      2023-08-04T13:08:49.128265838Z ++ echo 10.0.42.108
      2023-08-04T13:08:49.128493242Z + OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
      2023-08-04T13:08:49.128535253Z + CLUSTER_INITIATOR_IP=10.0.102.2
      2023-08-04T13:08:49.128819438Z ++ date -Iseconds
      2023-08-04T13:08:49.130157063Z 2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2
      2023-08-04T13:08:49.130170893Z + echo '2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2'
      2023-08-04T13:08:49.130170893Z + initialize=false
      2023-08-04T13:08:49.130179713Z + [[ ! -e /etc/ovn/ovnsb_db.db ]]
      2023-08-04T13:08:49.130318475Z + [[ false == \t\r\u\e ]]
      2023-08-04T13:08:49.130406657Z + wait 9
      2023-08-04T13:08:49.130493659Z + exec /usr/share/ovn/scripts/ovn-ctl -db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-sb-log=-vconsole:info -vfile:off -vPATTERN:console:%D
      {%Y-%m-%dT%H:%M:%S.###Z}
      |%05N|%c%T|%p|%m' run_sb_ovsdb
      2023-08-04T13:08:49.208399304Z 2023-08-04T13:08:49.208Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
      2023-08-04T13:08:49.213507987Z ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (No such file or directory)
      2023-08-04T13:08:49.224890005Z 2023-08-04T13:08:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
      2023-08-04T13:08:49.224912156Z 2023-08-04T13:08:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt failed (No such file or directory)
      2023-08-04T13:08:49.255474964Z 2023-08-04T13:08:49.255Z|00002|raft|INFO|local server ID is 7f92
      2023-08-04T13:08:49.333342909Z 2023-08-04T13:08:49.333Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
      2023-08-04T13:08:49.348948944Z 2023-08-04T13:08:49.348Z|00004|reconnect|INFO|ssl:10.0.102.2:9644: connecting...
      2023-08-04T13:08:49.349002565Z 2023-08-04T13:08:49.348Z|00005|reconnect|INFO|ssl:10.0.74.128:9644: connecting...
      2023-08-04T13:08:49.352510569Z 2023-08-04T13:08:49.352Z|00006|reconnect|INFO|ssl:10.0.102.2:9644: connected
      2023-08-04T13:08:49.353870484Z 2023-08-04T13:08:49.353Z|00007|reconnect|INFO|ssl:10.0.74.128:9644: connected
      2023-08-04T13:08:49.889326777Z 2023-08-04T13:08:49.889Z|00008|raft|INFO|server 2501 is leader for term 5
      2023-08-04T13:08:49.890316765Z 2023-08-04T13:08:49.890Z|00009|raft|INFO|rejecting append_request because previous entry 5,1538 not in local log (mismatch past end of log)
      2023-08-04T13:08:49.891199951Z 2023-08-04T13:08:49.891Z|00010|raft|INFO|rejecting append_request because previous entry 5,1539 not in local log (mismatch past end of log)
      2023-08-04T13:08:50.225632838Z 2023-08-04T13:08:50Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
      2023-08-04T13:08:50.225677739Z 2023-08-04T13:08:50Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
      2023-08-04T13:08:50.227772827Z Waiting for OVN_Southbound to come up.
      2023-08-04T13:08:55.716284614Z 2023-08-04T13:08:55.716Z|00011|raft|INFO|ssl:10.0.74.128:43498: learned server ID 3dff
      2023-08-04T13:08:55.716323395Z 2023-08-04T13:08:55.716Z|00012|raft|INFO|ssl:10.0.74.128:43498: learned remote address ssl:10.0.74.128:9644
      2023-08-04T13:08:55.724570375Z 2023-08-04T13:08:55.724Z|00013|raft|INFO|ssl:10.0.102.2:47804: learned server ID 2501
      2023-08-04T13:08:55.724599466Z 2023-08-04T13:08:55.724Z|00014|raft|INFO|ssl:10.0.102.2:47804: learned remote address ssl:10.0.102.2:9644
      2023-08-04T13:08:59.348572779Z 2023-08-04T13:08:59.348Z|00015|memory|INFO|32296 kB peak resident set size after 10.1 seconds
      2023-08-04T13:08:59.348648190Z 2023-08-04T13:08:59.348Z|00016|memory|INFO|atoms:35959 cells:31476 monitors:0 n-weak-refs:749 raft-connections:4 raft-log:1543 txn-history:100 txn-history-atoms:7100
      ➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c nbdb 
      2023-08-04T13:08:48.779743434Z + [[ -f /env/_master ]]
      2023-08-04T13:08:48.779743434Z + trap quit TERM INT
      2023-08-04T13:08:48.779825516Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
      2023-08-04T13:08:48.779825516Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
      2023-08-04T13:08:48.779825516Z + transport=ssl
      2023-08-04T13:08:48.779825516Z + ovn_raft_conn_ip_url_suffix=
      2023-08-04T13:08:48.779825516Z + [[ 10.0.42.108 == \: ]]
      2023-08-04T13:08:48.779825516Z + db=nb
      2023-08-04T13:08:48.779825516Z + db_port=9641
      2023-08-04T13:08:48.779825516Z + ovn_db_file=/etc/ovn/ovnnb_db.db
      2023-08-04T13:08:48.779887606Z + [[ ! ssl:10.0.102.2:9641,ssl:10.0.42.108:9641,ssl:10.0.74.128:9641 =~ .:10\.0\.42\.108:. ]]
      2023-08-04T13:08:48.780159182Z ++ bracketify 10.0.42.108
      2023-08-04T13:08:48.780167142Z ++ case "$1" in
      2023-08-04T13:08:48.780172102Z ++ echo 10.0.42.108
      2023-08-04T13:08:48.780314224Z + OVN_ARGS='--db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
      2023-08-04T13:08:48.780314224Z + CLUSTER_INITIATOR_IP=10.0.102.2
      2023-08-04T13:08:48.780518588Z ++ date -Iseconds
      2023-08-04T13:08:48.781738820Z 2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108
      2023-08-04T13:08:48.781753021Z + echo '2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108'
      2023-08-04T13:08:48.781753021Z + initialize=false
      2023-08-04T13:08:48.781753021Z + [[ ! -e /etc/ovn/ovnnb_db.db ]]
      2023-08-04T13:08:48.781816342Z + [[ false == \t\r\u\e ]]
      2023-08-04T13:08:48.781936684Z + wait 9
      2023-08-04T13:08:48.781974715Z + exec /usr/share/ovn/scripts/ovn-ctl -db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D
      {%Y-%m-%dT%H:%M:%S.###Z}
      |%05N|%c%T|%p|%m' run_nb_ovsdb
      2023-08-04T13:08:48.851644059Z 2023-08-04T13:08:48.851Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
      2023-08-04T13:08:48.852091247Z ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
      2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
      2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
      2023-08-04T13:08:48.875126148Z 2023-08-04T13:08:48.875Z|00002|raft|INFO|local server ID is c503
      2023-08-04T13:08:48.911846610Z 2023-08-04T13:08:48.911Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
      2023-08-04T13:08:48.918864408Z 2023-08-04T13:08:48.918Z|00004|reconnect|INFO|ssl:10.0.102.2:9643: connecting...
      2023-08-04T13:08:48.918934490Z 2023-08-04T13:08:48.918Z|00005|reconnect|INFO|ssl:10.0.74.128:9643: connecting...
      2023-08-04T13:08:48.923439162Z 2023-08-04T13:08:48.923Z|00006|reconnect|INFO|ssl:10.0.102.2:9643: connected
      2023-08-04T13:08:48.925166154Z 2023-08-04T13:08:48.925Z|00007|reconnect|INFO|ssl:10.0.74.128:9643: connected
      2023-08-04T13:08:49.861650961Z 2023-08-04T13:08:49Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
      2023-08-04T13:08:49.861747153Z 2023-08-04T13:08:49Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
      2023-08-04T13:08:49.875272530Z 2023-08-04T13:08:49.875Z|00008|raft|INFO|server fccb is leader for term 6
      2023-08-04T13:08:49.875302480Z 2023-08-04T13:08:49.875Z|00009|raft|INFO|rejecting append_request because previous entry 6,1732 not in local log (mismatch past end of log)
      2023-08-04T13:08:49.876027164Z Waiting for OVN_Northbound to come up.
      2023-08-04T13:08:55.694760761Z 2023-08-04T13:08:55.694Z|00010|raft|INFO|ssl:10.0.74.128:57122: learned server ID d382
      2023-08-04T13:08:55.694800872Z 2023-08-04T13:08:55.694Z|00011|raft|INFO|ssl:10.0.74.128:57122: learned remote address ssl:10.0.74.128:9643
      2023-08-04T13:08:55.706904913Z 2023-08-04T13:08:55.706Z|00012|raft|INFO|ssl:10.0.102.2:43230: learned server ID fccb
      2023-08-04T13:08:55.706931733Z 2023-08-04T13:08:55.706Z|00013|raft|INFO|ssl:10.0.102.2:43230: learned remote address ssl:10.0.102.2:9643
      2023-08-04T13:08:58.919567770Z 2023-08-04T13:08:58.919Z|00014|memory|INFO|21944 kB peak resident set size after 10.1 seconds
      2023-08-04T13:08:58.919643762Z 2023-08-04T13:08:58.919Z|00015|memory|INFO|atoms:8471 cells:7481 monitors:0 n-weak-refs:200 raft-connections:4 raft-log:1737 txn-history:72 txn-history-atoms:8165
      ➜ static-kas git:(master)
      

      This seems to happen very frequently now, but was not happening before around July 21st.

      https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration?buildId=1684628739427667968

       

      Attachments

        Issue Links

          Activity

            People

              jluhrsen Jamo Luhrsen
              openshift-crt-jira-prow OpenShift Prow Bot
              Anurag Saxena Anurag Saxena
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: