Bug
Resolution: Not a Bug
Critical
4.10.z
Quality / Stability / Reliability
Rejected
Description of problem:
The customer (CU) has the AAP operator installed in an OCP 4.10.51 cluster. When running automation jobs, the pods are only scheduled to 3 nodes and are not evenly distributed to the other worker nodes, even though other worker nodes with low utilization are available.
Version-Release number of selected component (if applicable):
OCP version: 4.10.51
How reproducible:
Step 1: Installed the AAP operator v2.2 in an OCP cluster, version 4.10.51.
Step 2: Ran the sample job multiple times; the automation job pods were only scheduled to one worker node.
Additional Info about the cluster :
Cluster specific Details:
- Using the default LowNodeUtilization scheduler profile:
spec:
  mastersSchedulable: false
  policy:
    name: ""
status: {}
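For context, a minimal sketch (an assumption based on the OCP 4.10 docs, not taken from this cluster) of the cluster Scheduler resource with the profile set explicitly; an empty policy name as above means the default profile is in effect:
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
  # Scheduler profiles available in OCP 4.10: LowNodeUtilization (the default),
  # HighNodeUtilization, NoScoring.
  profile: LowNodeUtilization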
$ oc get nodes
NAME STATUS ROLES AGE VERSION
master-0.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com Ready master 6d7h v1.23.12+8a6bfe4
master-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com Ready master 6d7h v1.23.12+8a6bfe4
master-2.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com Ready master 6d7h v1.23.12+8a6bfe4
worker-0.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com Ready worker 6d6h v1.23.12+8a6bfe4
worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com Ready worker 6d6h v1.23.12+8a6bfe4
worker-2.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com Ready worker 6d6h v1.23.12+8a6bfe4
3 worker machines.
Current node utilization (columns: NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%):
worker-0.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com 857m 24% 4820Mi 70%
worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com 500m 14% 5793Mi 84%
worker-2.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com 303m 8% 3209Mi 46%
Additional test done on the cluster with a sample deployment:
1. Created a new project:
$ oc new-project dep-test
Now using project "dep-test" on server "https://api.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com:6443".
You can add applications to this project with the 'new-app' command. For example, try:
oc new-app rails-postgresql-example
to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:
kubectl create deployment hello-node --image=k8s.gcr.io/e2e-test-images/agnhost:2.33 -- /agnhost serve-hostname
2. Created a sample deploy:
[quicklab@upi-0 deploy]$ oc new-app httpd
--> Found image f339827 (4 weeks old) in image stream "openshift/httpd" under tag "2.4-el8" for "httpd"
Apache httpd 2.4
----------------
Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites.
Tags: builder, httpd, httpd-24
--> Creating resources ...
deployment.apps "httpd" created
service "httpd" created
--> Success
Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
'oc expose service/httpd'
Run 'oc status' to view your app.
3. Checked the pods and the node they were scheduled on:
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd-795f6dddf9-c8vrj 1/1 Running 0 10s 10.131.0.35 worker-2.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
4. Scaled the deployment to 3 replicas, since there are three worker nodes:
$ oc scale deploy httpd --replicas=3
deployment.apps/httpd scaled
[quicklab@upi-0 deploy]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd-795f6dddf9-2ldbm 0/1 ContainerCreating 0 3s <none> worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
httpd-795f6dddf9-c44tw 0/1 ContainerCreating 0 3s <none> worker-0.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
httpd-795f6dddf9-c8vrj 1/1 Running 0 38s 10.131.0.35 worker-2.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
[quicklab@upi-0 deploy]$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd-795f6dddf9-2ldbm 1/1 Running 0 119s 10.129.2.54 worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
httpd-795f6dddf9-c44tw 1/1 Running 0 119s 10.128.2.186 worker-0.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
httpd-795f6dddf9-c8vrj 1/1 Running 0 2m34s 10.131.0.35 worker-2.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com <none> <none>
Thus the default profile schedules the deployment pods evenly across the three worker nodes!
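One plausible explanation for the difference (an assumption, not verified in this report): the default scheduler applies built-in pod topology spread defaults that are keyed off a pod's owning controller selector (ReplicaSet, StatefulSet, service), so Deployment replicas get spread, while the bare automation-job pods created by the AAP operator have no such selector and receive no default spreading. The built-in defaults, expressed as explicit PodTopologySpread args in a KubeSchedulerConfiguration fragment, look like this:
# Mirrors the kube-scheduler's documented internal default constraints
defaultConstraints:
  - maxSkew: 3
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - maxSkew: 5
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
defaultingType: List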
Actual results:
- The automation job pods are only scheduled to one node.
- From the scheduler logs, all worker nodes are evaluated during scheduling, yet the pods keep landing on the same node:
automation-job-12-qgpjg   1/1   Running   0   2s   10.129.2.49   worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com   <none>   <none>
automation-job-16-ftmhz   1/1   Running   0   3s   10.129.2.52   worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com   <none>   <none>
automation-job-17-82fdj   1/1   Running   0   4s   10.129.2.51   worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com   <none>   <none>
automation-job-13-gzclq   1/1   Running   0   3s   10.129.2.53   worker-1.sharedocp4upi410ovn.lab.psi.pnq2.redhat.com   <none>   <none>
Expected results:
The AAP automation job pods should also be distributed evenly across the nodes.
Additional info:
- The default instance group was used in AAP; its job pod spec is similar to the following:
apiVersion: v1
kind: Pod
metadata:
  namespace: ansible-automation-platform
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - image: >-
        registry.redhat.io/ansible-automation-platform-22/ee-supported-rhel8@sha256:a77ac9d7fd9f73a07aa5f771d546bd50281495f9f39d5a34c4ecf2888a1a70c0
      name: worker
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
- When pod topology spread constraints are added, the pods are scheduled almost evenly across the nodes (see the sketch after this list).
Doc Reference for ansible : https://access.redhat.com/documentation/en-us/red_hat_ansible_automation_platform/2.3/html/red_hat_ansible_automation_platform_performance_considerations_for_operator_based_installations/assembly-specify-dedicted-nodes#doc-wrapper
Doc Reference for OCP: https://docs.openshift.com/container-platform/4.11/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html#nodes-scheduler-pod-topology-spread-constraints-about_nodes-scheduler-pod-topology-spread-constraints
- When this was suggested to the CU, they were not satisfied with the solution, as they consider it more of a workaround.
- Since this is reproducible, we need to check and confirm whether adding topology spread constraints is the recommended solution here.
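For illustration, a minimal sketch of the instance group custom pod spec extended with a topology spread constraint, following the two docs linked above. The app: automation-job label is hypothetical; the selector must match whatever labels the automation-job pods actually carry:
apiVersion: v1
kind: Pod
metadata:
  namespace: ansible-automation-platform
  labels:
    app: automation-job            # hypothetical label; the selector below must match it
spec:
  topologySpreadConstraints:
    - maxSkew: 1                            # allow at most 1 pod of difference between nodes
      topologyKey: kubernetes.io/hostname   # spread across individual nodes
      whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but never block scheduling
      labelSelector:
        matchLabels:
          app: automation-job
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - image: >-
        registry.redhat.io/ansible-automation-platform-22/ee-supported-rhel8@sha256:a77ac9d7fd9f73a07aa5f771d546bd50281495f9f39d5a34c4ecf2888a1a70c0
      name: worker
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
With whenUnsatisfiable: ScheduleAnyway the constraint is a soft preference, so job pods are still scheduled when it cannot be met; DoNotSchedule enforces it strictly but can leave job pods Pending on a loaded cluster.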
Business Impact:
The few nodes where the jobs are frequently scheduled sometimes become over-utilized, which results in failures of the automation job pods and blocks the customer's workflow.