OpenShift Bugs / OCPBUGS-44302

pod cannot be ready during live migration

    • Critical
    • Yes
    • False
    • Release Note Not Required
    • In Progress

      Description of problem:

      After NetworkPolicy objects are applied to the namespace and a live migration is started, pods cannot become ready once the route table MTU has been updated and the node has rebooted.

       

      cat <<EOF | oc create -f -
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: default-deny-ingress
        namespace: z3
      spec:
        podSelector: {}
        policyTypes:
        - Ingress
      ---
      kind: NetworkPolicy
      apiVersion: networking.k8s.io/v1
      metadata:
        name: allow-all-ingress
        namespace: z3
      spec:
        ingress:
          - from:
            - namespaceSelector:
                matchLabels:
                  team: qe
              podSelector:
                matchLabels:
                  name: test
        policyTypes:
          - Ingress
      ---
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-from-openshift-ingress
        namespace: z3
      spec:
        ingress:
        - from:
          - namespaceSelector:
              matchLabels:
                policy-group.network.openshift.io/ingress: ""
        podSelector: {}
        policyTypes:
        - Ingress
      EOF
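For readers triaging this report, the intent of the three policies can be modelled in a few lines. This is a simplified sketch, not how openshift-sdn or OVN-Kubernetes evaluates policy: it covers only `namespaceSelector` and `podSelector` peers with `matchLabels`, and it assumes that the missing `spec.podSelector` in `allow-all-ingress` behaves like an empty selector. The destination pod label `name: hello` used in the usage note is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Labels = Dict[str, str]


def selects(selector: Labels, labels: Labels) -> bool:
    """An empty matchLabels selector matches everything."""
    return all(labels.get(k) == v for k, v in selector.items())


@dataclass
class Peer:
    # Selectors inside one `from` entry are combined with logical AND.
    namespace_selector: Labels
    pod_selector: Optional[Labels] = None  # None: any pod in the namespace


@dataclass
class Policy:
    name: str
    pod_selector: Labels                      # which pods the policy applies to
    peers: List[Peer] = field(default_factory=list)  # empty: allow nothing


POLICIES = [
    Policy("default-deny-ingress", {}),
    # Assumption: absent spec.podSelector is treated as {} (all pods).
    Policy("allow-all-ingress", {}, [Peer({"team": "qe"}, {"name": "test"})]),
    Policy("allow-from-openshift-ingress", {},
           [Peer({"policy-group.network.openshift.io/ingress": ""})]),
]


def ingress_allowed(dst_pod: Labels, src_ns: Labels, src_pod: Labels) -> bool:
    """Policies are additive: traffic passes if any selecting policy allows it."""
    selecting = [p for p in POLICIES if selects(p.pod_selector, dst_pod)]
    if not selecting:
        return True  # a pod selected by no policy accepts all ingress
    return any(
        selects(peer.namespace_selector, src_ns)
        and (peer.pod_selector is None or selects(peer.pod_selector, src_pod))
        for policy in selecting
        for peer in policy.peers
    )
```

Under this model, `ingress_allowed({"name": "hello"}, {"team": "qe"}, {"name": "test"})` is allowed, while the same call with a source pod labelled `name: other` is denied, because both selectors in the single `from` entry must match.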

       

      Events:
        Type     Reason                  Age                     From     Message
        ----     ------                  ----                    ----     -------
        Warning  FailedCreatePodSandBox  6m25s (x4579 over 19h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_hello-hdx8h_z3_8e3a0595-fabd-4953-a460-5c014290122d_0(383f4845fa3cc790f58c5d1a755fa46cc69c220a3669c65422a0423293c9863a): error adding pod z3_hello-hdx8h to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"383f4845fa3cc790f58c5d1a755fa46cc69c220a3669c65422a0423293c9863a" Netns:"/var/run/netns/cbba5e98-ae28-4199-a573-ef1c24013442" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=z3;K8S_POD_NAME=hello-hdx8h;K8S_POD_INFRA_CONTAINER_ID=383f4845fa3cc790f58c5d1a755fa46cc69c220a3669c65422a0423293c9863a;K8S_POD_UID=8e3a0595-fabd-4953-a460-5c014290122d" Path:"" ERRORED: error configuring pod [z3/hello-hdx8h] networking: [z3/hello-hdx8h/8e3a0595-fabd-4953-a460-5c014290122d:openshift-sdn]: error adding container to network "openshift-sdn": failed to add route to 10.128.0.2/14 via SDN: invalid argument
      ': StdinData: {"binDir":"/var/lib/cni/bin","clusterNetwork":"/host/run/multus/cni/net.d/80-openshift-network.conf","cniVersion":"0.3.1","daemonSocketDir":"/run/multus/socket","globalNamespaces":"default,openshift-multus,openshift-sriov-network-operator","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","namespaceIsolation":true,"type":"multus-shim"}
        Normal  AddedInterface  2m4s  multus  Add eth0 [10.128.0.209/23] from openshift-sdn
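One detail in the message above may help triage: the route destination `10.128.0.2/14` has host bits set, whereas a /14 route would normally be written against the network address. Linux generally rejects a route whose prefix has host bits set with "invalid argument", which matches the error text. This is only an observation from the logged values, not a confirmed root cause. A minimal sketch using Python's standard `ipaddress` module to check the two addresses taken from the events:

```python
import ipaddress

# Values copied from the events above.
ROUTE_DST = "10.128.0.2/14"      # destination in the failed route add
POD_IFACE = "10.128.0.209/23"    # address reported by the AddedInterface event


def has_host_bits(cidr: str) -> bool:
    """Return True if the address part of `cidr` is not the network address."""
    try:
        ipaddress.ip_network(cidr, strict=True)
        return False
    except ValueError:
        # strict=True raises ValueError when host bits are set
        return True


# The network that the same address and prefix length fall into.
normalized = ipaddress.ip_network(ROUTE_DST, strict=False)

# The pod's own subnet, derived from its interface address.
pod_subnet = ipaddress.ip_interface(POD_IFACE).network

print(has_host_bits(ROUTE_DST))          # True
print(normalized)                        # 10.128.0.0/14
print(pod_subnet.subnet_of(normalized))  # True
```

If the observation holds, the expected destination would be `10.128.0.0/14`, which also contains the pod subnet `10.128.0.0/23`.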

       

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1.  Set up a 4.16 cluster.

      2.  Create a namespace and pods, then apply the NetworkPolicy objects shown in the description.

      3.  Start the live migration.

      Actual results:

      After the route table MTU is updated and the worker node reboots, the pods on that worker cannot become ready and fail with the error shown in the description.

       

      Expected results:

      Additional info:

              pliurh Peng Liu
              zzhao1@redhat.com Zhanqi Zhao
              Zhanqi Zhao Zhanqi Zhao