  1. OpenShift Virtualization
  2. CNV-32947

[1812856] knmstate: Inconsistency between NNCE and NNCP status report


    • High
    • No

      Description of problem:
      When a destructive configuration policy is applied, i.e. one that involves all physical NICs of a node and is expected to break the node's connectivity, the NNCE reports one status while the NNCP reports another.

      Version-Release number of selected component (if applicable):
      kubernetes-nmstate-handler-rhel8@sha256:4a1379bf1223cf064e54419721045ca1275ae57a04433db78d4a54e1269acee1
      CNAO: sha256_379cfaaba59bae6089af24bb25c104e399e867b6732e5c8a33caf235

      How reproducible:
      Most of the time (the bug doesn't always occur).

      Steps to Reproduce:
      1. Apply a valid NNCP that affects all physical NICs of a node.
      In the example given here, I set all the NICs, which originally had dynamic IPs, to static IPs. For each NIC I used the same IP that the DHCP server had assigned to it (to avoid IP conflicts).
      apiVersion: nmstate.io/v1alpha1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: static-nics
      spec:
        desiredState:
          interfaces:
          - name: ens3
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: 172.16.0.33
                prefix-length: 24
              dhcp: false
              enabled: true
          - name: ens6
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: 172.16.0.19
                prefix-length: 24
              dhcp: false
              enabled: true
          - name: ens7
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: 172.16.0.49
                prefix-length: 24
              dhcp: false
              enabled: true
          - name: ens8
            type: ethernet
            state: up
            ipv4:
              address:
              - ip: 172.16.0.14
                prefix-length: 24
              dhcp: false
              enabled: true
        nodeSelector:
          kubernetes.io/hostname: "host-172-16-0-33"

      2. After a long enough timeout (~5 minutes), check the IP addresses of all the NICs that were set in this NNCP:
      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core@172.16.0.33 ip addr show dev ens3
      2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
      link/ether fa:16:3e:dc:3f:f6 brd ff:ff:ff:ff:ff:ff
      inet 172.16.0.33/24 brd 172.16.0.255 scope global dynamic noprefixroute ens3
      valid_lft 86195sec preferred_lft 86195sec
      inet6 fe80::f816:3eff:fedc:3ff6/64 scope link noprefixroute
      valid_lft forever preferred_lft forever

      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core@172.16.0.33 ip addr show dev ens6
      3: ens6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
      link/ether fa:16:3e:2e:76:aa brd ff:ff:ff:ff:ff:ff
      inet 172.16.0.19/24 brd 172.16.0.255 scope global dynamic noprefixroute ens6
      valid_lft 86192sec preferred_lft 86192sec
      inet6 fe80::f816:3eff:fe2e:76aa/64 scope link noprefixroute
      valid_lft forever preferred_lft forever

      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core@172.16.0.33 ip addr show dev ens7
      4: ens7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
      link/ether fa:16:3e:9d:e8:a3 brd ff:ff:ff:ff:ff:ff
      inet 172.16.0.49/24 brd 172.16.0.255 scope global dynamic noprefixroute ens7
      valid_lft 86189sec preferred_lft 86189sec

      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ ssh core@172.16.0.33 ip addr show dev ens8
      5: ens8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
      link/ether fa:16:3e:2d:ef:00 brd ff:ff:ff:ff:ff:ff
      inet 172.16.0.14/24 brd 172.16.0.255 scope global dynamic noprefixroute ens8
      valid_lft 86186sec preferred_lft 86186sec

      In all cases, the address line contains the word "dynamic", which implies that the intended policy configuration was considered destructive and was therefore rolled back.
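The manual check above can be automated. The following is a minimal sketch, not part of the original report, that scans `ip addr show` output for IPv4 ("inet") lines and reports whether each carries the "dynamic" flag. The function name `inet_addresses` and the `SAMPLE` constant are hypothetical; the sample text is copied from the ens3 output in step 2.

```python
def inet_addresses(ip_addr_output: str) -> list[tuple[str, bool]]:
    """Return (address/prefix, is_dynamic) for every IPv4 'inet' line."""
    results = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # IPv4 address lines start with the literal token "inet";
        # "inet6" lines are deliberately ignored.
        if len(fields) >= 2 and fields[0] == "inet":
            results.append((fields[1], "dynamic" in fields[2:]))
    return results


# Sample copied from the ens3 output in this report.
SAMPLE = """\
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
link/ether fa:16:3e:dc:3f:f6 brd ff:ff:ff:ff:ff:ff
inet 172.16.0.33/24 brd 172.16.0.255 scope global dynamic noprefixroute ens3
valid_lft 86195sec preferred_lft 86195sec
inet6 fe80::f816:3eff:fedc:3ff6/64 scope link noprefixroute
valid_lft forever preferred_lft forever
"""

if __name__ == "__main__":
    for address, is_dynamic in inet_addresses(SAMPLE):
        print(address, "dynamic" if is_dynamic else "static")
```

An address still flagged as dynamic after the policy was applied suggests that the static configuration is not in effect on that NIC.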

      3. Check the status of both NNCP and NNCE:
      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ oc get nncp static-nics
      NAME STATUS
      static-nics SuccessfullyConfigured
      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$
      [cnv-qe-jenkins@cnv-executor-ysegev-4-3 yossi]$ oc get nnce host-172-16-0-33.static-nics
      NAME STATUS
      host-172-16-0-33.static-nics ConfigurationProgressing
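A minimal sketch of comparing the two reports programmatically is given below. It assumes the two-column NAME/STATUS layout shown above (other versions may print additional columns, which this sketch does not handle). The names `parse_status_table` and `statuses_agree` are hypothetical helpers, and the sample strings are copied from the output in this step.

```python
def parse_status_table(output: str) -> dict[str, str]:
    """Map resource name to status from a two-column NAME/STATUS table."""
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and rows that are not NAME STATUS.
        if len(fields) != 2 or fields == ["NAME", "STATUS"]:
            continue
        statuses[fields[0]] = fields[1]
    return statuses


def statuses_agree(nncp_output: str, nnce_output: str) -> bool:
    """True only if every NNCP and NNCE row reports the same single status."""
    seen = set(parse_status_table(nncp_output).values())
    seen |= set(parse_status_table(nnce_output).values())
    return len(seen) == 1


# Samples copied from the `oc get` output in this report.
NNCP_OUTPUT = """\
NAME STATUS
static-nics SuccessfullyConfigured
"""
NNCE_OUTPUT = """\
NAME STATUS
host-172-16-0-33.static-nics ConfigurationProgressing
"""

if __name__ == "__main__":
    print("consistent" if statuses_agree(NNCP_OUTPUT, NNCE_OUTPUT) else "inconsistent")
```

With the outputs captured here, the comparison reports an inconsistency, which is the symptom described in this bug.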

      Actual results:
      <BUG> The NNCP and the NNCE show different statuses ("SuccessfullyConfigured" and "ConfigurationProgressing", respectively), and both are wrong.
      In addition, the NNCE description ("oc get nnce host-172-16-0-33.static-nics -o yaml") doesn't include a rollback message.

      Expected results:
      1. The status of both NNCP and NNCE should be "ConfigurationFailed".
      2. The current status condition in the NNCE should include a rollback message (search for the string "rollback" to verify).
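The two expectations above could be verified with a sketch like the following. It is an illustration only: `check_expected` is a hypothetical helper, the statuses and the NNCE YAML are assumed to have been collected beforehand (for example with the `oc get` commands from step 3), and the rollback check is the plain case-insensitive string search suggested above rather than a parse of the status conditions.

```python
EXPECTED_STATUS = "ConfigurationFailed"


def check_expected(nncp_status: str, nnce_status: str, nnce_yaml: str) -> list[str]:
    """Return a list of deviations from the expected results (empty if none)."""
    problems = []
    if nncp_status != EXPECTED_STATUS:
        problems.append(f"NNCP status is {nncp_status}, expected {EXPECTED_STATUS}")
    if nnce_status != EXPECTED_STATUS:
        problems.append(f"NNCE status is {nnce_status}, expected {EXPECTED_STATUS}")
    # Plain substring search for "rollback" in the NNCE description.
    if "rollback" not in nnce_yaml.lower():
        problems.append("NNCE description contains no rollback message")
    return problems


if __name__ == "__main__":
    # Statuses observed in this report; the YAML text is a placeholder,
    # not real NNCE output.
    observed = check_expected(
        "SuccessfullyConfigured",
        "ConfigurationProgressing",
        "placeholder NNCE description without the expected message",
    )
    for problem in observed:
        print(problem)
```

For the state captured in this report, all three checks fail.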

      Additional info:
      This bug also happened in other scenarios, e.g. when the static IPs in the policy were different from those already assigned dynamically by the DHCP server.
      However, in that scenario the bug did not occur consistently, and in some cases the behavior was as expected (i.e. both NNCE and NNCP showed status "ConfigurationFailed", and the NNCE description included a rollback message).

      The node's journalctl output is attached, with nmstate in TRACE log-level. It includes the timeline since just before applying the policy.

              phoracek@redhat.com Petr Horacek
              ysegev@redhat.com Yossi Segev
              Nir Rozen Nir Rozen