OpenShift Bugs / OCPBUGS-29664

When node shutdown, the Pod whereabouts IP cannot be released (for a stateless application)

    • Moderate
    • CNF Network Sprint 257
    • Customer Escalated, Customer Facing

      Description of problem:

      Created a net-attach-def with 2 IPs in its range, then created a deployment with 2 replicas using that net-attach-def. The whereabouts reconciler daemonset is deployed, and its cronjob is configured to reconcile every minute.
      When I power off the node on which one of the pods is deployed, either gracefully (poweroff) or ungracefully (poweroff --force), a new pod is created on a healthy node and gets stuck in ContainerCreating state.

      Version-Release number of selected component (if applicable):

          4.14.11

      How reproducible:

      - Create the whereabouts reconciler daemonset with the help of the [documentation](https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network)
      - Update the reconciler_cron_expression to: "*/1 * * * *"
      - Create a net-attach-def with 2 IPs in range
      - Create a deployment with 2 replicas
      - Power off the node on which one of the pods is running
      - A new pod is spawned on a healthy node and is stuck in ContainerCreating status.

      Steps to Reproduce:

      1. On fresh cluster with version 4.14.11
      2. Create the whereabouts reconciler daemonset with the help of the documentation
      3. Update the reconciler_cron_expression to: "*/1 * * * *"
      $ oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/1 * * * *"
      
      4. Create new project
      $ oc new-project nadtesting
      
      5. Apply the nad.yaml below
      $ cat nad.yaml 
      apiVersion: "k8s.cni.cncf.io/v1"
      kind: NetworkAttachmentDefinition
      metadata:
        name: macvlan-net-attach1
      spec:
        config: '{
            "cniVersion": "0.3.1",
            "type": "macvlan",
            "master": "br-ex",
            "mode": "bridge",
            "ipam": {
              "type": "whereabouts",
              "datastore": "kubernetes",
              "range": "172.17.20.0/24",
              "range_start": "172.17.20.11",
              "range_end": "172.17.20.12"
            }
          }'
      
      6. Create a deployment using the net-attach-def with two replicas,
      $ cat naddeployment.yaml 
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: deployment1
        labels:
          app: macvlan1
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: macvlan1
        template:
          metadata:
            annotations:
              k8s.v1.cni.cncf.io/networks: macvlan-net-attach1
            labels:
              app: macvlan1
          spec:
            containers:
            - name: google
              image: gcr.io/google-samples/kubernetes-bootcamp:v1
              ports:
              - containerPort: 8080
      
      7. Two pods will be created
      $ oc get pods -o wide
      NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
      deployment1-fbfdf5cbc-d6sgr   1/1     Running   0          15m   10.129.2.9    ci-ln-xvfy762-c1627-h7xzk-worker-0-qvzq2   <none>           <none>
      deployment1-fbfdf5cbc-njkpz   1/1     Running   0          15m   10.128.2.16   ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh   <none>           <none>
      
      8. Power off the node using debug
      $ oc debug node/ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh 
      # chroot /host
      # shutdown
      
      9. Wait some time; a new pod will be created on a healthy node and get stuck in ContainerCreating
      $ oc get pod -o wide
      NAME                          READY   STATUS              RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
      deployment1-fbfdf5cbc-6cb8d   0/1     ContainerCreating   0          9m53s   <none>        ci-ln-xvfy762-c1627-h7xzk-worker-0-blzlk   <none>           <none>
      deployment1-fbfdf5cbc-d6sgr   1/1     Running             0          28m     10.129.2.9    ci-ln-xvfy762-c1627-h7xzk-worker-0-qvzq2   <none>           <none>
      deployment1-fbfdf5cbc-njkpz   1/1     Terminating         0          28m     10.128.2.16   ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh   <none>           <none>
      
      10. Node status just for reference,
      $ oc get nodes  
      NAME                                       STATUS     ROLES                  AGE   VERSION
      ci-ln-xvfy762-c1627-h7xzk-master-0         Ready      control-plane,master   59m   v1.27.10+28ed2d7
      ci-ln-xvfy762-c1627-h7xzk-master-1         Ready      control-plane,master   59m   v1.27.10+28ed2d7
      ci-ln-xvfy762-c1627-h7xzk-master-2         Ready      control-plane,master   58m   v1.27.10+28ed2d7
      ci-ln-xvfy762-c1627-h7xzk-worker-0-8bdfh   NotReady   worker                 43m   v1.27.10+28ed2d7
      ci-ln-xvfy762-c1627-h7xzk-worker-0-blzlk   Ready      worker                 43m   v1.27.10+28ed2d7
      ci-ln-xvfy762-c1627-h7xzk-worker-0-qvzq2   Ready      worker                 43m   v1.27.10+28ed2d
      
      

      Actual results:

      The shut-down node's pod is stuck in Terminating state and does not release its IP. The new pod is stuck in ContainerCreating status.

      Expected results:

      The new pod should start smoothly on the new node.

      Additional info:

      - Just for information: if I follow the manual approach below, the issue is resolved:
      1. Remove the Terminating pod's IP from the overlapping range reservations
      $ oc delete overlappingrangeipreservations.whereabouts.cni.cncf.io <IP>
      
      2. Remove the same IP from ippools.whereabouts.cni.cncf.io
      $ oc edit ippools.whereabouts.cni.cncf.io <IP Pool> 
      Remove the stale IP from the allocations list
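      The manual cleanup above can be sketched as a small script. The namespace, pool name, and stale IP below are assumptions taken from the outputs later in this report (openshift-multus, 172.17.20.0-24); note that in this particular pool the allocation key happens to match the last octet of the IP, as the IPPool dump shown later illustrates.

```shell
# Hypothetical cleanup sketch for a stale whereabouts allocation.
# Assumes: namespace openshift-multus, pool 172.17.20.0-24, and that the
# allocation key equals the stale IP's last octet (true for this pool).
STALE_IP="172.17.20.12"
POOL="172.17.20.0-24"
INDEX="${STALE_IP##*.}"   # "12" - key under spec.allocations

# 1. Delete the overlapping-range reservation for the stale IP
oc delete overlappingrangeipreservations.whereabouts.cni.cncf.io \
  "${STALE_IP}" -n openshift-multus

# 2. Drop the stale allocation entry from the IPPool (instead of oc edit)
oc patch ippools.whereabouts.cni.cncf.io "${POOL}" -n openshift-multus \
  --type=json -p "[{\"op\":\"remove\",\"path\":\"/spec/allocations/${INDEX}\"}]"
```

      This requires cluster-admin access to the openshift-multus namespace; `oc patch --type=json` avoids the interactive `oc edit` step.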
      
      Also, the whereabouts-reconciler logs on the Terminating pod's node report:
      2024-02-19T10:48:00Z [debug] Added IP 172.17.20.12 for pod nadtesting/deployment1-fbfdf5cbc-njkpz
      2024-02-19T10:48:00Z [debug] the IP reservation: IP: 172.17.20.12 is reserved for pod: nadtesting/deployment1-fbfdf5cbc-njkpz
      2024-02-19T10:48:00Z [debug] pod reference nadtesting/deployment1-fbfdf5cbc-njkpz matches allocation; Allocation IP: 172.17.20.12; PodIPs: map[172.17.20.12:{}]
      2024-02-19T10:48:00Z [debug] no IP addresses to cleanup
      2024-02-19T10:48:00Z [verbose] reconciler success
      
      i.e. it fails to recognize the need to remove the allocation.
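      To see the state the reconciler is matching against, the pool and reservations can be inspected directly (namespace and pool name taken from the outputs elsewhere in this report):

```shell
# Show current allocations in the whereabouts IP pool; the Terminating pod's
# podref still appears under spec.allocations, which is why the reconciler
# considers the allocation valid.
oc get ippools.whereabouts.cni.cncf.io 172.17.20.0-24 \
  -n openshift-multus -o yaml

# Show the per-IP reservations; the stale IP is still listed here.
oc get overlappingrangeipreservations.whereabouts.cni.cncf.io \
  -n openshift-multus
```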

       


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: OpenShift Container Platform 4.17.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:3718


            Weibin Liang added a comment -

            Tested and verified in 4.17.0-0.nightly-2024-07-29-061317

            OpenShift Jira Bot added a comment -

            Hi rh-ee-marguerr,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            Marcelo Guerrero Viveros added a comment -

            Reopening bug to track fix - https://issues.redhat.com/browse/RFE-5374

            Carlos Goncalves added a comment - edited

            pliurh's comment is partially correct for graceful node shutdowns.

            Upstream Kubernetes added support for graceful node shutdown, which enables the kubelet to gracefully evict pods during a node shutdown. This feature is beta in Kubernetes 1.21 and enabled by default. See https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown

            However, OpenShift does not support this yet and has it explicitly disabled:
            https://issues.redhat.com/browse/RFE-2579
            https://issues.redhat.com/browse/OCPNODE-549
            https://github.com/openshift/machine-config-operator/pull/4208

            OpenShift documentation of this feature had been published (TELCODOCS-903) but was later unpublished due to known open issues (OCPBUGS-17478).
            There is a KB Solution page on how to enable it, but note that this is not a supported method to enable and use the feature:
            https://access.redhat.com/solutions/6998877

            As for non-graceful node shutdowns, upstream Kubernetes introduced this feature in 1.26 as beta and promoted it to GA in 1.28. OpenShift 4.14 is based on Kubernetes 1.27, so it is not supported at this point in time.
            https://kubernetes.io/docs/concepts/architecture/nodes/#non-graceful-node-shutdown
            https://kubernetes.io/blog/2022/12/16/kubernetes-1-26-non-graceful-node-shutdown-beta/
            https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/
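            For reference, the upstream graceful-node-shutdown feature is driven by two kubelet settings. A minimal sketch of the relevant KubeletConfiguration fragment follows; the field names and durations are from the upstream Kubernetes docs, and enabling this on OpenShift is not a supported configuration (see the KB solution above).

```yaml
# Sketch of the upstream kubelet settings behind graceful node shutdown.
# Durations here are illustrative, not recommendations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Total time the node delays shutdown so pods can terminate.
shutdownGracePeriod: "2m"
# Portion of shutdownGracePeriod reserved for critical pods.
shutdownGracePeriodCriticalPods: "30s"
```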

            Carlos Goncalves added a comment -

            I was now able to reproduce on 4.14.0-0.nightly-2024-02-22-202606 with a graceful power off.

            This cluster build includes https://github.com/openshift/cluster-network-operator/pull/2257.

            My previous clusters were provisioned by clusterbot on AWS, and it turned out that powered-off nodes are automatically removed there (a new one is provisioned within minutes). When nodes are removed, all resources allocated to them are freed, resulting in a free IP in the whereabouts IP pool that is later assigned to the new pod. Powered-off nodes on vSphere and GCP are not removed.

            $ oc get pods -o wide -w
            NAME                          READY   STATUS    RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
            deployment1-fbfdf5cbc-qsf2n   1/1     Running   0          4m59s   10.131.0.21   ci-ln-x3wh0qk-72292-vqklf-worker-a-xkvzn   <none>           <none>
            deployment1-fbfdf5cbc-qxwj7   1/1     Running   0          4m59s   10.128.2.18   ci-ln-x3wh0qk-72292-vqklf-worker-b-fmchd   <none>           <none>
            deployment1-fbfdf5cbc-qsf2n   1/1     Running   0          7m32s   10.131.0.21   ci-ln-x3wh0qk-72292-vqklf-worker-a-xkvzn   <none>           <none>
            deployment1-fbfdf5cbc-qsf2n   1/1     Terminating   0          7m32s   10.131.0.21   ci-ln-x3wh0qk-72292-vqklf-worker-a-xkvzn   <none>           <none>
            deployment1-fbfdf5cbc-kjdpl   0/1     Pending       0          0s      <none>        <none>                                     <none>           <none>
            deployment1-fbfdf5cbc-kjdpl   0/1     Pending       0          0s      <none>        ci-ln-x3wh0qk-72292-vqklf-worker-c-sd42q   <none>           <none>
            deployment1-fbfdf5cbc-kjdpl   0/1     Pending       0          0s      <none>        ci-ln-x3wh0qk-72292-vqklf-worker-c-sd42q   <none>           <none>
            deployment1-fbfdf5cbc-kjdpl   0/1     ContainerCreating   0          0s      <none>        ci-ln-x3wh0qk-72292-vqklf-worker-c-sd42q   <none>           <none>
            
            $ oc describe ippools 172.17.20.0-24 -n openshift-multus
            Name:         172.17.20.0-24
            Namespace:    openshift-multus
            Labels:       <none>
            Annotations:  <none>
            API Version:  whereabouts.cni.cncf.io/v1alpha1
            Kind:         IPPool
            Metadata:
              Creation Timestamp:  2024-02-28T14:18:05Z
              Generation:          3
              Resource Version:    39593
              UID:                 e11c8ec1-2823-4d44-b2c8-400b8c6a54b4
            Spec:
              Allocations:
                11:
                  Id:      f5f481a1ff576efe358e44238bd7f9aefbd9ed7e57b855d5c84362bf521f5d81
                  Podref:  nadtesting/deployment1-fbfdf5cbc-qsf2n
                12:
                  Id:      07e8e278c96640288896ce55d956ae69b5b041919bde7d8c6bc8cf101d58ad97
                  Podref:  nadtesting/deployment1-fbfdf5cbc-qxwj7
              Range:       172.17.20.0/24
            Events:        <none> 


            Carlos Goncalves added a comment -

            Deployed 4.14.11 (previously deployed nightly 4.14). Still cannot reproduce the issue on graceful or ungraceful power off.

            Graceful power off:

            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.14.11   True        False         16m     Cluster version is 4.14.11
            
            $ oc debug node/ip-10-0-105-127.us-west-1.compute.internal
            $ chroot /host
            $ poweroff
            
            $ oc get pods -w -o wide
            NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
            deployment1-fbfdf5cbc-9dz4n   1/1     Running   0          14s   10.128.2.15   ip-10-0-44-32.us-west-1.compute.internal     <none>           <none>
            deployment1-fbfdf5cbc-dxw25   1/1     Running   0          14s   10.131.0.22   ip-10-0-105-127.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-dxw25   1/1     Running   0          103s   10.131.0.22   ip-10-0-105-127.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-dxw25   1/1     Failed    0          5m44s   10.131.0.22   ip-10-0-105-127.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-dxw25   1/1     Terminating   0          5m44s   10.131.0.22   ip-10-0-105-127.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-dxw25   1/1     Terminating   0          5m44s   10.131.0.22   ip-10-0-105-127.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-s487f   0/1     Pending       0          0s      <none>        <none>                                       <none>           <none>
            deployment1-fbfdf5cbc-s487f   0/1     Pending       0          0s      <none>        ip-10-0-47-229.us-west-1.compute.internal    <none>           <none>
            deployment1-fbfdf5cbc-s487f   0/1     Pending       0          0s      <none>        ip-10-0-47-229.us-west-1.compute.internal    <none>           <none>
            deployment1-fbfdf5cbc-s487f   0/1     ContainerCreating   0          0s      <none>        ip-10-0-47-229.us-west-1.compute.internal    <none>           <none>
            deployment1-fbfdf5cbc-s487f   0/1     ContainerCreating   0          55s     <none>        ip-10-0-47-229.us-west-1.compute.internal    <none>           <none>
            deployment1-fbfdf5cbc-s487f   1/1     Running             0          60s     10.129.2.15   ip-10-0-47-229.us-west-1.compute.internal    <none>           <none> 
            
            $ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -n openshift-multus
            NAME           AGE
            172.17.20.11   6m43s
            172.17.20.12   5s
            
            $ oc rsh deployment1-fbfdf5cbc-s487f ip -o -4 a
            1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
            2: eth0    inet 10.129.2.15/23 brd 10.129.3.255 scope global eth0\       valid_lft forever preferred_lft forever
            3: net1    inet 172.17.20.12/24 brd 172.17.20.255 scope global net1\       valid_lft forever preferred_lft forever

            Ungraceful power off:

            $ oc debug node/ip-10-0-83-161.us-west-1.compute.internal
            $ chroot /host
            $ echo o > /proc/sysrq-trigger
            
            $ oc get pods -w -o wide
            NAME                          READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
            deployment1-fbfdf5cbc-2wzb7   1/1     Running   0          4m1s   10.130.2.9    ip-10-0-83-161.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-9dz4n   1/1     Running   0          20m    10.128.2.15   ip-10-0-44-32.us-west-1.compute.internal    <none>           <none>
            deployment1-fbfdf5cbc-2wzb7   1/1     Running   0          4m26s   10.130.2.9    ip-10-0-83-161.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-2wzb7   1/1     Failed    0          5m22s   10.130.2.9    ip-10-0-83-161.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-2wzb7   1/1     Terminating   0          5m22s   10.130.2.9    ip-10-0-83-161.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-2wzb7   1/1     Terminating   0          5m22s   10.130.2.9    ip-10-0-83-161.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-mf9rw   0/1     Pending       0          0s      <none>        <none>                                      <none>           <none>
            deployment1-fbfdf5cbc-mf9rw   0/1     Pending       0          0s      <none>        ip-10-0-27-193.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-mf9rw   0/1     Pending       0          0s      <none>        ip-10-0-27-193.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-mf9rw   0/1     ContainerCreating   0          0s      <none>        ip-10-0-27-193.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-mf9rw   0/1     ContainerCreating   0          46s     <none>        ip-10-0-27-193.us-west-1.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-mf9rw   1/1     Running             0          51s     10.131.2.13   ip-10-0-27-193.us-west-1.compute.internal   <none>           <none>
            
            $ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -n openshift-multus -w
            NAME           AGE
            172.17.20.11   22m
            172.17.20.12   5m57s
            172.17.20.12   5m59s
            172.17.20.12   0s
            
            $ oc rsh  deployment1-fbfdf5cbc-mf9rw  ip -o -4 a
            1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
            2: eth0    inet 10.131.2.13/23 brd 10.131.3.255 scope global eth0\       valid_lft forever preferred_lft forever
            3: net1    inet 172.17.20.12/24 brd 172.17.20.255 scope global net1\       valid_lft forever preferred_lft forever


            Carlos Goncalves added a comment - edited

            I followed the exact reproducer on the latest 4.14. I could not reproduce this issue.

            $ oc get pod -o wide
            NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
            deployment1-fbfdf5cbc-8fw9g   1/1     Running   0          9s    10.129.2.21   ip-10-0-34-156.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-qf8jw   1/1     Running   0          9s    10.131.0.17   ip-10-0-108-60.us-east-2.compute.internal   <none>           <none>
            
            $ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -n openshift-multus
            NAME           AGE
            172.17.20.11   45s
            172.17.20.12   44s
            
            $ oc debug node/ip-10-0-34-156.us-east-2.compute.internal
            Temporary namespace openshift-debug-f246j is created for debugging node...
            Starting pod/ip-10-0-34-156us-east-2computeinternal-debug-xmvhf ...
            To use host binaries, run `chroot /host`
            Pod IP: 10.0.34.156
            If you don't see a command prompt, try pressing enter.
            sh-4.4# chroot /host
            sh-5.1# shutdown
            Shutdown scheduled for Tue 2024-02-27 14:51:42 UTC, use 'shutdown -c' to cancel.
            
            $ oc get node 
            NAME                                         STATUS     ROLES                  AGE   VERSION
            ip-10-0-108-60.us-east-2.compute.internal    Ready      worker                 44m   v1.27.10+c79e5e2
            ip-10-0-112-198.us-east-2.compute.internal   Ready      worker                 44m   v1.27.10+c79e5e2
            ip-10-0-122-57.us-east-2.compute.internal    Ready      control-plane,master   50m   v1.27.10+c79e5e2
            ip-10-0-123-176.us-east-2.compute.internal   Ready      control-plane,master   50m   v1.27.10+c79e5e2
            ip-10-0-34-156.us-east-2.compute.internal    NotReady   worker                 41m   v1.27.10+c79e5e2
            ip-10-0-54-145.us-east-2.compute.internal    Ready      control-plane,master   51m   v1.27.10+c79e5e2
            
            $ oc get deployment
            NAME          READY   UP-TO-DATE   AVAILABLE   AGE
            deployment1   1/2     2            1           6m41s
            
            $ oc get pod -o wide -w
            NAME                          READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
            deployment1-fbfdf5cbc-8fw9g   1/1     Running   0          2m1s   10.129.2.21   ip-10-0-34-156.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-qf8jw   1/1     Running   0          2m1s   10.131.0.17   ip-10-0-108-60.us-east-2.compute.internal   <none>           <none>
                 deployment1-fbfdf5cbc-8fw9g   1/1     Running   0          4m51s   10.129.2.21   ip-10-0-34-156.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-8fw9g   1/1     Failed    0          8m16s   10.129.2.21   ip-10-0-34-156.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-8fw9g   1/1     Terminating   0          8m16s   10.129.2.21   ip-10-0-34-156.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-8fw9g   1/1     Terminating   0          8m16s   10.129.2.21   ip-10-0-34-156.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-58vdv   0/1     Pending       0          0s      <none>        <none>                                      <none>           <none>
            deployment1-fbfdf5cbc-58vdv   0/1     Pending       0          0s      <none>        ip-10-0-112-198.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-58vdv   0/1     Pending       0          0s      <none>        ip-10-0-112-198.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-58vdv   0/1     ContainerCreating   0          0s      <none>        ip-10-0-112-198.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-58vdv   0/1     ContainerCreating   0          27s     <none>        ip-10-0-112-198.us-east-2.compute.internal   <none>           <none>
            deployment1-fbfdf5cbc-58vdv   1/1     Running             0          32s     10.128.2.16   ip-10-0-112-198.us-east-2.compute.internal   <none>           <none>
            
            $ oc -n openshift-multus logs ds/whereabouts-reconciler
            2024-02-27T14:56:00Z [debug] NewReconcileLooper - inferred connection data
            2024-02-27T14:56:00Z [debug] listing IP pools
            2024-02-27T14:56:00Z [debug] Added IP 172.17.20.12 for pod nadtesting/deployment1-fbfdf5cbc-qf8jw
            2024-02-27T14:56:00Z [debug] the IP reservation: IP: 172.17.20.11 is reserved for pod: nadtesting/deployment1-fbfdf5cbc-8fw9g
            2024-02-27T14:56:00Z [debug] pod ref nadtesting/deployment1-fbfdf5cbc-8fw9g is not listed in the live pods list
            2024-02-27T14:56:00Z [debug] the IP reservation: IP: 172.17.20.12 is reserved for pod: nadtesting/deployment1-fbfdf5cbc-qf8jw
            2024-02-27T14:56:00Z [debug] pod reference nadtesting/deployment1-fbfdf5cbc-qf8jw matches allocation; Allocation IP: 172.17.20.12; PodIPs: map[172.17.20.12:{}]
            2024-02-27T14:56:00Z [debug] pod ref nadtesting/deployment1-fbfdf5cbc-8fw9g is not listed in the live pods list
            2024-02-27T14:56:00Z [debug] pod reference nadtesting/deployment1-fbfdf5cbc-qf8jw matches allocation; Allocation IP: 172.17.20.12; PodIPs: map[172.17.20.12:{}]
            2024-02-27T14:56:00Z [debug] Going to update the reserve list to: [IP: 172.17.20.12 is reserved for pod: nadtesting/deployment1-fbfdf5cbc-qf8jw]
            2024-02-27T14:56:00Z [debug] successfully cleanup IPs: [172.17.20.11]
            2024-02-27T14:56:00Z [verbose] removed stale overlappingIP allocation [172.17.20.11]
            2024-02-27T14:56:00Z [verbose] reconciler success
            
            $ oc get deployment
            NAME          READY   UP-TO-DATE   AVAILABLE   AGE
            deployment1   2/2     2            2           13m
            
            $ oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -n openshift-multus
            NAME           AGE
            172.17.20.11   4m23s
            172.17.20.12   13m
            
            $ oc rsh deployment1-fbfdf5cbc-58vdv ip -o -4 a
            1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
            2: eth0    inet 10.128.2.16/23 brd 10.128.3.255 scope global eth0\       valid_lft forever preferred_lft forever
            3: net1    inet 172.17.20.11/24 brd 172.17.20.255 scope global net1\       valid_lft forever preferred_lft forever

            The old Pod was gracefully terminated, and a new Pod was created and assigned the expected IP address on the net1 interface.
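            For anyone re-running this reproducer, a quick way to confirm the reconciler actually released the stale allocation is to inspect the whereabouts CRs directly (a sketch; the exact IPPool resource name depends on the configured range):

            ```shell
            # List the whereabouts IPPool objects; each pool's spec.allocations
            # maps offsets in the range to the pod refs that hold them.
            oc get ippools.whereabouts.cni.cncf.io -n openshift-multus -o yaml

            # The cluster-wide reservations can be listed the same way as in
            # the transcript above; the stale entry should disappear after the
            # reconciler's next cron run.
            oc get overlappingrangeipreservations.whereabouts.cni.cncf.io -n openshift-multus
            ```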

            Carlos Goncalves added a comment - edited: I followed the exact reproducer on the latest 4.14. I could not reproduce this issue.

            Peng Liu added a comment -

            rhn-support-snalawad The behavior you describe is expected. In Kubernetes, a pod on an unreachable node is not deleted until the node becomes available again, and the resources allocated to the pod (including its IP address) cannot be released until the pod is deleted. If you don't want to wait for the node to come back, you can force delete the stale pod. The whereabouts reconciler will then be able to revoke the allocated IP.
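            Peng Liu's suggestion can be applied as follows (pod and namespace names taken from the transcript above; `--force --grace-period=0` is the standard way to remove a pod object stuck on a NotReady node):

            ```shell
            # Force delete the stale pod left behind on the powered-off node.
            # WARNING: force deletion removes the API object without waiting for
            # kubelet confirmation; only use it when the node will not return,
            # or after you have verified the workload is no longer running there.
            oc delete pod deployment1-fbfdf5cbc-8fw9g -n nadtesting --force --grace-period=0

            # On its next run (every minute with the reconciler_cron_expression
            # from the reproducer), the whereabouts reconciler can release the
            # IP that was reserved for the deleted pod.
            ```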


              rh-ee-marguerr Marcelo Guerrero Viveros
              rhn-support-klakhwar Ketan Lakhwara
              Weibin Liang Weibin Liang
              Carlos Goncalves