OpenShift Bugs / OCPBUGS-11567

Master node hits "annotation not found" errors when upgrading from 4.12 to 4.13


Details

    • Moderate
    • No
    • SDN Sprint 235
    • 1
    • Rejected
    • False
    • None

    Description

      Description of problem:

      During an upgrade from 4.12 to 4.13, a master node reports that its k8s.ovn.org annotations are not found, and gateway creation for the node then fails. This blocks the upgrade.
      Events:
        Type     Reason                     Age                    From                 Message
        ----     ------                     ----                   ----                 -------
        Warning  ErrorReconcilingNode       3h12m (x2 over 3h12m)  controlplane         [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-207-132.ec2.internal, macAddress annotation not found for node "ip-10-0-207-132.ec2.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-207-132.ec2.internal"]
        Warning  ErrorReconcilingNode       3h12m                  controlplane         error creating gateway for node ip-10-0-207-132.ec2.internal: failed to init shared interface gateway: failed to create MAC Binding for dummy nexthop ip-10-0-207-132.ec2.internal: error getting datapath GR_ip-10-0-207-132.ec2.internal: object not found
        Warning  ErrorReconcilingNode       41m (x10 over 52m)     controlplane         error creating gateway for node ip-10-0-207-132.ec2.internal: failed to init shared interface gateway: failed to sync stale SNATs on node ip-10-0-207-132.ec2.internal: unable to fetch podIPs for pod openshift-cluster-machine-approver/machine-approver-7747c58fc8-rbwcf
        Warning  ErrorReconcilingNode       36m (x5 over 50m)      controlplane         error creating gateway for node ip-10-0-207-132.ec2.internal: failed to init shared interface gateway: failed to sync stale SNATs on node ip-10-0-207-132.ec2.internal: unable to fetch podIPs for pod openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-64f55669d9-9p6x4
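The first warning names the node annotations that could not be found. A minimal Python sketch for checking a node for them, assuming the node object has been exported with `oc get node <name> -o json`. The helper `missing_ovn_annotations` is hypothetical, and it checks only the two annotation keys spelled out in the event; the macAddress annotation is skipped because the event does not give its full key.

```python
import json

# Annotation keys named verbatim in the ErrorReconcilingNode event above.
# The macAddress annotation is left out because the event does not give its full key.
REQUIRED_ANNOTATIONS = (
    "k8s.ovn.org/node-chassis-id",
    "k8s.ovn.org/l3-gateway-config",
)


def missing_ovn_annotations(node):
    """Return the checked annotation keys that are absent from a Node object.

    `node` is assumed to be the parsed output of `oc get node <name> -o json`.
    """
    annotations = node.get("metadata", {}).get("annotations") or {}
    return [key for key in REQUIRED_ANNOTATIONS if key not in annotations]


# Illustrative, trimmed Node object (made-up data, not from the affected cluster).
sample = json.loads(
    '{"metadata": {"name": "example-node",'
    ' "annotations": {"k8s.ovn.org/node-chassis-id": "example-id"}}}'
)
print(sample["metadata"]["name"], "missing:", missing_ovn_annotations(sample))
```

An empty list means both checked annotations are present; it does not prove the annotation values are valid.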

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-arm64-2023-04-04-094521 -> 4.13.0-0.nightly-arm64-2023-04-04-091502 (profile: 03_aarch64_Disconnected IPI on AWS & HTTP_PROXY & OVN IPSec)
      
      The issue was also hit twice on arm64 builds when upgrading from 4.12 to 4.13, with these profiles:
      aws-ipi-ovn-ipsec
      azure-ipi-ovn-ipsec

      How reproducible:

      Hit 3 times so far.

      Steps to Reproduce:

      1. Upgrade a cluster from 4.12 to 4.13.
      

      Actual results:

      1. The upgrade failed with a timeout. The master node hit the annotation-not-found issue, the kubelet eventually went down, and as a result the master MachineConfigPool (MCP) could not finish rolling out.
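One way to confirm that the master pool is what stalls is to read the MachineConfigPool status. A minimal Python sketch, assuming the pool was exported with `oc get mcp master -o json`. The helper `mcp_rollout_summary` is hypothetical, and treating the rollout as finished only when the `Updated` condition is `True` and every machine is updated is an assumption of this sketch, not a rule taken from the Machine Config Operator.

```python
import json


def mcp_rollout_summary(mcp):
    """Summarize rollout progress of a MachineConfigPool object.

    Reads status.machineCount, status.updatedMachineCount,
    status.degradedMachineCount and the "Updated" condition.
    """
    status = mcp.get("status", {})
    conditions = {c.get("type"): c.get("status") for c in status.get("conditions") or []}
    total = status.get("machineCount", 0)
    updated = status.get("updatedMachineCount", 0)
    return {
        "name": mcp.get("metadata", {}).get("name"),
        "updated": f"{updated}/{total}",
        "degraded": status.get("degradedMachineCount", 0),
        # Assumption: "finished" means the Updated condition is True and no machine lags behind.
        "finished": conditions.get("Updated") == "True" and updated == total,
    }


# Illustrative, trimmed pool status (made-up data, not from the affected cluster).
sample = json.loads("""
{"metadata": {"name": "master"},
 "status": {"machineCount": 3, "updatedMachineCount": 2, "degradedMachineCount": 0,
            "conditions": [{"type": "Updated", "status": "False"},
                           {"type": "Updating", "status": "True"}]}}
""")
print(mcp_rollout_summary(sample))
```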
      
      Events:
        Type     Reason                     Age                    From                 Message
        ----     ------                     ----                   ----                 -------
        Normal   NodeHasNoDiskPressure      3h14m (x8 over 3h14m)  kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasNoDiskPressure
        Normal   NodeHasSufficientMemory    3h14m (x8 over 3h14m)  kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasSufficientMemory
        Normal   RegisteredNode             3h14m                  node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Warning  ErrorReconcilingNode       3h12m (x2 over 3h12m)  controlplane         [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-207-132.ec2.internal, macAddress annotation not found for node "ip-10-0-207-132.ec2.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-207-132.ec2.internal"]
        Warning  ErrorReconcilingNode       3h12m                  controlplane         error creating gateway for node ip-10-0-207-132.ec2.internal: failed to init shared interface gateway: failed to create MAC Binding for dummy nexthop ip-10-0-207-132.ec2.internal: error getting datapath GR_ip-10-0-207-132.ec2.internal: object not found
        Normal   Uncordon                   3h11m                  machineconfigdaemon  Update completed for config rendered-master-b7734793514e431d2be272b7cc277200 and node has been uncordoned
        Normal   NodeDone                   3h11m                  machineconfigdaemon  Setting node ip-10-0-207-132.ec2.internal, currentConfig rendered-master-b7734793514e431d2be272b7cc277200 to Done
        Normal   ConfigDriftMonitorStarted  3h11m                  machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-b7734793514e431d2be272b7cc277200
        Normal   RegisteredNode             3h6m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             3h6m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             3h2m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             178m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             177m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             174m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   OSUpdateStaged             173m                   machineconfigdaemon  Changes to OS staged
        Normal   PendingConfig              173m                   machineconfigdaemon  Written pending config rendered-master-8bf64f293f301538822add4805050730
        Normal   SkipReboot                 173m                   machineconfigdaemon  Config changes do not require reboot.
        Normal   Uncordon                   173m                   machineconfigdaemon  Update completed for config rendered-master-8bf64f293f301538822add4805050730 and node has been uncordoned
        Normal   NodeDone                   173m                   machineconfigdaemon  Setting node ip-10-0-207-132.ec2.internal, currentConfig rendered-master-8bf64f293f301538822add4805050730 to Done
        Normal   ConfigDriftMonitorStarted  173m                   machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-8bf64f293f301538822add4805050730
        Normal   OSUpdateStaged             171m                   machineconfigdaemon  Changes to OS staged
        Normal   PendingConfig              171m                   machineconfigdaemon  Written pending config rendered-master-786aa385eb74bdb14138faccccd9e82b
        Normal   Uncordon                   171m                   machineconfigdaemon  Update completed for config rendered-master-786aa385eb74bdb14138faccccd9e82b and node has been uncordoned
        Normal   NodeDone                   171m                   machineconfigdaemon  Setting node ip-10-0-207-132.ec2.internal, currentConfig rendered-master-786aa385eb74bdb14138faccccd9e82b to Done
        Normal   ConfigDriftMonitorStarted  171m                   machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-786aa385eb74bdb14138faccccd9e82b
        Normal   SkipReboot                 159m (x2 over 171m)    machineconfigdaemon  Config changes do not require reboot. Service crio was reloaded.
        Normal   OSUpdateStaged             159m                   machineconfigdaemon  Changes to OS staged
        Normal   PendingConfig              159m                   machineconfigdaemon  Written pending config rendered-master-6b7cd9aa5bbf5900088adda3ec249647
        Normal   NodeSchedulable            159m                   kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeSchedulable
        Normal   ConfigDriftMonitorStarted  159m                   machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-6b7cd9aa5bbf5900088adda3ec249647
        Normal   Uncordon                   159m                   machineconfigdaemon  Update completed for config rendered-master-6b7cd9aa5bbf5900088adda3ec249647 and node has been uncordoned
        Normal   NodeDone                   159m                   machineconfigdaemon  Setting node ip-10-0-207-132.ec2.internal, currentConfig rendered-master-6b7cd9aa5bbf5900088adda3ec249647 to Done
        Normal   Drain                      149m (x2 over 161m)    machineconfigdaemon  Draining node to update config.
        Normal   Cordon                     149m (x2 over 161m)    machineconfigdaemon  Cordoned node to apply update
        Normal   ConfigDriftMonitorStopped  149m (x4 over 173m)    machineconfigdaemon  Config Drift Monitor stopped
        Normal   NodeNotSchedulable         149m (x2 over 160m)    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeNotSchedulable
        Normal   OSUpdateStaged             148m                   machineconfigdaemon  Changes to OS staged
        Normal   PendingConfig              148m                   machineconfigdaemon  Written pending config rendered-master-0b26a7484328389fe70296e0e44a9465
        Normal   Reboot                     148m                   machineconfigdaemon  Node will reboot into config rendered-master-0b26a7484328389fe70296e0e44a9465
        Normal   OSUpdateStarted            148m (x4 over 173m)    machineconfigdaemon  
        Normal   RegisteredNode             148m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   NodeNotReady               147m                   node-controller      Node ip-10-0-207-132.ec2.internal status is now: NodeNotReady
        Normal   NodeAllocatableEnforced    145m                   kubelet              Updated Node Allocatable limit across pods
        Normal   NodeReady                  145m                   kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeReady
        Normal   Starting                   145m                   kubelet              Starting kubelet.
        Normal   NodeHasSufficientMemory    145m (x2 over 145m)    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasSufficientMemory
        Normal   NodeHasNoDiskPressure      145m (x2 over 145m)    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasNoDiskPressure
        Normal   NodeHasSufficientPID       145m (x2 over 145m)    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasSufficientPID
        Warning  Rebooted                   145m                   kubelet              Node ip-10-0-207-132.ec2.internal has been rebooted, boot id: f4598387-0bf8-4394-884f-a0866124f25a
        Normal   NodeSchedulable            145m                   kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeSchedulable
        Normal   Uncordon                   145m                   machineconfigdaemon  Update completed for config rendered-master-0b26a7484328389fe70296e0e44a9465 and node has been uncordoned
        Normal   NodeDone                   145m                   machineconfigdaemon  Setting node ip-10-0-207-132.ec2.internal, currentConfig rendered-master-0b26a7484328389fe70296e0e44a9465 to Done
        Normal   ConfigDriftMonitorStarted  145m                   machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-0b26a7484328389fe70296e0e44a9465
        Normal   RegisteredNode             136m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             104m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   RegisteredNode             102m                   node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   ConfigDriftMonitorStarted  69m                    machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-0b26a7484328389fe70296e0e44a9465
        Normal   ConfigDriftMonitorStopped  67m                    machineconfigdaemon  Config Drift Monitor stopped
        Normal   Drain                      67m                    machineconfigdaemon  Draining node to update config.
        Normal   Cordon                     67m                    machineconfigdaemon  Cordoned node to apply update
        Normal   NodeNotSchedulable         67m (x2 over 145m)     kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeNotSchedulable
        Normal   OSUpdateStarted            66m                    machineconfigdaemon  Upgrading OS; Changing kernel arguments
        Normal   InClusterUpgrade           66m                    machineconfigdaemon  Updating from oscontainer quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:792af500fec0c1ddd4888cd9acb1e990561668744f9e8a4831e9998083fe01e5
        Normal   OSUpgradeApplied           65m                    machineconfigdaemon  OS upgrade applied; new MachineConfig (rendered-master-b80b40f8bbea2f6e5ba8e96f7c1ccfa1) has new OS image (quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:792af500fec0c1ddd4888cd9acb1e990561668744f9e8a4831e9998083fe01e5)
        Normal   Reboot                     65m                    machineconfigdaemon  Node will reboot into config rendered-master-b80b40f8bbea2f6e5ba8e96f7c1ccfa1
        Normal   OSUpdateStaged             65m                    machineconfigdaemon  Changes to OS staged
        Normal   PendingConfig              65m                    machineconfigdaemon  Written pending config rendered-master-b80b40f8bbea2f6e5ba8e96f7c1ccfa1
        Normal   RegisteredNode             64m                    node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Normal   Starting                   62m                    kubelet              Starting kubelet.
        Warning  Rebooted                   62m (x2 over 62m)      kubelet              Node ip-10-0-207-132.ec2.internal has been rebooted, boot id: 36b706c0-93f7-4219-a25c-59e4919ec5e9
        Normal   NodeHasNoDiskPressure      62m (x3 over 62m)      kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasNoDiskPressure
        Normal   NodeReady                  62m                    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeReady
        Normal   NodeNotSchedulable         62m                    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeNotSchedulable
        Normal   NodeNotReady               62m (x2 over 62m)      kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeNotReady
        Normal   NodeAllocatableEnforced    62m                    kubelet              Updated Node Allocatable limit across pods
        Normal   NodeHasSufficientPID       62m (x3 over 62m)      kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasSufficientPID
        Normal   NodeHasSufficientMemory    62m (x3 over 62m)      kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeHasSufficientMemory
        Normal   ConfigDriftMonitorStarted  61m                    machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-b80b40f8bbea2f6e5ba8e96f7c1ccfa1
        Normal   NodeDone                   61m                    machineconfigdaemon  Setting node ip-10-0-207-132.ec2.internal, currentConfig rendered-master-b80b40f8bbea2f6e5ba8e96f7c1ccfa1 to Done
        Normal   Uncordon                   61m                    machineconfigdaemon  Update completed for config rendered-master-b80b40f8bbea2f6e5ba8e96f7c1ccfa1 and node has been uncordoned
        Normal   NodeSchedulable            61m                    kubelet              Node ip-10-0-207-132.ec2.internal status is now: NodeSchedulable
        Normal   NodeNotReady               60m (x2 over 64m)      node-controller      Node ip-10-0-207-132.ec2.internal status is now: NodeNotReady
        Normal   RegisteredNode             53m                    node-controller      Node ip-10-0-207-132.ec2.internal event: Registered Node ip-10-0-207-132.ec2.internal in Controller
        Warning  ErrorReconcilingNode       41m (x10 over 52m)     controlplane         error creating gateway for node ip-10-0-207-132.ec2.internal: failed to init shared interface gateway: failed to sync stale SNATs on node ip-10-0-207-132.ec2.internal: unable to fetch podIPs for pod openshift-cluster-machine-approver/machine-approver-7747c58fc8-rbwcf
        Warning  ErrorReconcilingNode       36m (x5 over 50m)      controlplane         error creating gateway for node ip-10-0-207-132.ec2.internal: failed to init shared interface gateway: failed to sync stale SNATs on node ip-10-0-207-132.ec2.internal: unable to fetch podIPs for pod openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-64f55669d9-9p6x4
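The two most recent warnings show gateway initialization failing while syncing stale SNATs, because podIPs could not be fetched for specific pods. A small Python sketch that pulls those pod references out of event lines (for example, copied from `oc describe node`), to list which pods the sync trips over. The regular expression is written against the message text shown above and is an assumption, not a documented message format; `pods_without_ips` is a hypothetical helper.

```python
import re

# Matches the tail of the stale-SNAT sync errors shown above:
# "... unable to fetch podIPs for pod <namespace>/<name>".
POD_IP_ERROR = re.compile(r"unable to fetch podIPs for pod (?P<ns>[^/\s]+)/(?P<pod>\S+)")


def pods_without_ips(event_lines):
    """Return unique (namespace, pod) pairs named in podIP-fetch errors, in first-seen order."""
    seen = []
    for line in event_lines:
        match = POD_IP_ERROR.search(line)
        if match:
            pair = (match.group("ns"), match.group("pod"))
            if pair not in seen:
                seen.append(pair)
    return seen


# Event messages abbreviated from the listing above.
events = [
    "Warning  ErrorReconcilingNode  41m (x10 over 52m)  controlplane  failed to sync stale SNATs "
    "on node ip-10-0-207-132.ec2.internal: unable to fetch podIPs for pod "
    "openshift-cluster-machine-approver/machine-approver-7747c58fc8-rbwcf",
    "Warning  ErrorReconcilingNode  36m (x5 over 50m)  controlplane  failed to sync stale SNATs "
    "on node ip-10-0-207-132.ec2.internal: unable to fetch podIPs for pod "
    "openshift-cloud-controller-manager-operator/cluster-cloud-controller-manager-operator-64f55669d9-9p6x4",
    "Normal   NodeReady  62m  kubelet  Node ip-10-0-207-132.ec2.internal status is now: NodeReady",
]
for ns, pod in pods_without_ips(events):
    print(f"{ns}/{pod}")
```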

      Expected results:

      1. The upgrade succeeds and the master node reports no errors.

      Additional info:

      Job details (including Prow and Jenkins jobs):
      1) https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-arm64-stable-4.13-upgrade-from-stable-4.12-aws-ipi-ovn-ipsec-p1-f14/1642004799069097984

      2) https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-arm64-nightly-4.13-upgrade-from-stable-4.12-azure-ipi-ovn-ipsec-p1-f14/1642489498921078784

      3) https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/36070/consoleFull

    People

      Ben Pickard (bpickard@redhat.com)
      Min Li (rhn-support-minmli)
      Min Li
      Surya Seetharaman