  1. OpenShift Bugs
  2. OCPBUGS-5543

Inconsistent output reporting which database is the leader in OVN-Kubernetes


    • Bug
    • Resolution: Not a Bug
    • Undefined
    • None
    • 4.12
    • None
    • Moderate
    • None
    • Rejected
    • False

      Description of problem:

      The leader reported in the annotations of the ovn-kubernetes-master ConfigMap (oc get cm -n openshift-ovn-kubernetes ovn-kubernetes-master -o json | jq '.metadata.annotations') does not match the northbound database leader reported by oc exec -n openshift-ovn-kubernetes ovnkube-master-hc4tc -- /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound

      Version-Release number of selected component (if applicable):

      I tested this against 4.12, but I assume it affects other versions as well.

      How reproducible:

      Intermittent. The mismatch does not occur every time, but I have observed it twice on a 4.12 cluster.

      Steps to Reproduce:

      1. Run:
      oc get cm -n openshift-ovn-kubernetes ovn-kubernetes-master -o json | jq '.metadata.annotations'
      It reports:
      {
        "control-plane.alpha.kubernetes.io/leader": "{\"holderIdentity\":\"ip-10-0-223-230.us-west-1.compute.internal\",\"leaseDurationSeconds\":60,\"acquireTime\":\"2023-01-10T09:18:16Z\",\"renewTime\":\"2023-01-10T10:53:33Z\",\"leaderTransitions\":2}"
      }
      This output indicates that ip-10-0-223-230.us-west-1.compute.internal is the leader. In the pod listing below, the ovnkube-master pod on that node is ovnkube-master-hc4tc:
      
      $ oc get po -o wide -n openshift-ovn-kubernetes
      NAME                   READY   STATUS    RESTARTS       AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
      ovnkube-master-2msln   6/6     Running   1 (111m ago)   117m   10.0.159.67    ip-10-0-159-67.us-west-1.compute.internal    <none>           <none>
      ovnkube-master-hc4tc   6/6     Running   0              117m   10.0.223.230   ip-10-0-223-230.us-west-1.compute.internal   <none>           <none>
      ovnkube-master-w7p9l   6/6     Running   1 (103m ago)   117m   10.0.162.177   ip-10-0-162-177.us-west-1.compute.internal   <none>           <none>
      ovnkube-node-4ggb2     5/5     Running   0              117m   10.0.159.67    ip-10-0-159-67.us-west-1.compute.internal    <none>           <none>
      ovnkube-node-54wmz     5/5     Running   0              108m   10.0.146.216   ip-10-0-146-216.us-west-1.compute.internal   <none>           <none>
      ovnkube-node-7j7rl     5/5     Running   0              117m   10.0.162.177   ip-10-0-162-177.us-west-1.compute.internal   <none>           <none>
      ovnkube-node-j2tqd     5/5     Running   0              107m   10.0.171.199   ip-10-0-171-199.us-west-1.compute.internal   <none>           <none>
      ovnkube-node-k4fxw     5/5     Running   1 (108m ago)   108m   10.0.212.66    ip-10-0-212-66.us-west-1.compute.internal    <none>           <none>
      ovnkube-node-srks9     5/5     Running   0              117m   10.0.223.230   ip-10-0-223-230.us-west-1.compute.internal   <none>           <none>
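
The lookup above can be sketched as two small shell helpers, so the lease holder is resolved to a pod without reading the listing by eye. The function names lease_holder and master_pod_on_node are hypothetical. The sketch assumes the annotation value has already been extracted as a raw string (for example with jq -r) and that each pod row contains the node name as a whole field, as in the listing above.

```shell
# Hypothetical helper: print the holderIdentity from a leader-election
# record read on stdin, e.g. {"holderIdentity":"...","leaseDurationSeconds":60,...}
lease_holder() {
  sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p'
}

# Hypothetical helper: given "oc get po -o wide" output on stdin and a node
# name as $1, print the ovnkube-master pod scheduled on that node.
# Fields are scanned rather than indexed because the RESTARTS column can
# contain spaces, e.g. "1 (111m ago)".
master_pod_on_node() {
  awk -v node="$1" '
    $1 ~ /^ovnkube-master-/ {
      for (i = 2; i <= NF; i++) {
        if ($i == node) { print $1; break }
      }
    }'
}

# Example usage against a live cluster (not run here):
#   holder=$(oc get cm -n openshift-ovn-kubernetes ovn-kubernetes-master -o json \
#     | jq -r '.metadata.annotations["control-plane.alpha.kubernetes.io/leader"]' \
#     | lease_holder)
#   oc get po -o wide -n openshift-ovn-kubernetes | master_pod_on_node "$holder"
```

With the values from this report, the first helper yields the node name ip-10-0-223-230.us-west-1.compute.internal and the second yields ovnkube-master-hc4tc.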
      
      2. I ran a status check on this pod. It reports Role: follower and Leader: 9155. Server 9155 is at 10.0.159.67, which suggests the leader is ovnkube-master-2msln. The two outputs are therefore inconsistent, and one of them appears to be reporting the leader incorrectly.
      $ oc exec -n openshift-ovn-kubernetes ovnkube-master-hc4tc -- /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
      Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker
      12c4
      Name: OVN_Northbound
      Cluster ID: cecc (cecc7ea8-9fbc-457d-889d-c01fd278aae5)
      Server ID: 12c4 (12c43ba4-90bc-48b2-9cd5-d599266b20f6)
      Address: ssl:10.0.223.230:9643
      Status: cluster member
      Role: follower
      Term: 2
      Leader: 9155
      Vote: unknown
      
      Election timer: 10000
      Log: [2, 2333]
      Entries not yet committed: 0
      Entries not yet applied: 0
      Connections: ->0000 <-9155 <-7c91 ->7c91
      Disconnections: 0
      Servers:
          7c91 (7c91 at ssl:10.0.162.177:9643) last msg 7114020 ms ago
          12c4 (12c4 at ssl:10.0.223.230:9643) (self)
          9155 (9155 at ssl:10.0.159.67:9643) last msg 2486 ms ago
      
      According to this output the leader is 10.0.159.67, which does not match the leader reported by oc get cm -n openshift-ovn-kubernetes ovn-kubernetes-master -o json | jq '.metadata.annotations'
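
The manual step of matching the Leader ID against the Servers list can be sketched with awk. The function name raft_leader_address is hypothetical, and the sketch assumes the cluster/status layout shown above: a "Leader: <id>" line and server rows of the form "<id> (<id> at <address>) ...". If the Leader field is not an ID listed under Servers, it prints nothing.

```shell
# Hypothetical helper: read cluster/status output on stdin and print the
# address of the server whose ID appears in the "Leader:" field.
raft_leader_address() {
  awk '
    # Remember the short ID of the current leader, e.g. "9155".
    $1 == "Leader:" { leader = $2 }

    # Server rows look like: 9155 (9155 at ssl:10.0.159.67:9643) last msg ...
    $2 ~ /^\(/ && $3 == "at" {
      addr = $4
      sub(/\).*/, "", addr)   # drop the closing parenthesis
      servers[$1] = addr
    }

    END { if (leader in servers) print servers[leader] }
  '
}

# Example usage against a live cluster (not run here):
#   oc exec -n openshift-ovn-kubernetes ovnkube-master-hc4tc -c nbdb -- \
#     /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 \
#     cluster/status OVN_Northbound | raft_leader_address
```

The printed address can then be compared with the IP column of the pod listing to identify the pod that hosts the leader database.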

      Actual results:

      The two commands report different leaders.

      Expected results:

      I would expect both commands to report the same leader. 

      Additional info:

      NOTE: That command (the ovn-appctl cluster/status check) runs as a readiness probe in the ovnkube-master pods. You can view the probe definition with: oc get pods/ovnkube-master-jljfs -n openshift-ovn-kubernetes -o json | jq '.spec.containers[] | select(.name=="nbdb") | .readinessProbe'

              bbennett@redhat.com Ben Bennett
              rhn-support-kquinn Kevin Quinn
              Anurag Saxena Anurag Saxena