-
Bug
-
Resolution: Not a Bug
-
Normal
-
4.15.0
-
Quality / Stability / Reliability
-
False
-
Moderate
The customer has an OCP 4.15.18 AWS UPI private cluster in which the worker, infra, and other non-master nodes are deregistered from the ingress AWS Classic Load Balancer (CLB) whenever the cloud-controller-manager (CCM) reconciles. Only the master nodes stay registered.
2024-10-16T02:16:11.263027499Z I1016 02:16:11.262989 1 controller.go:716] Syncing backends for all LB services.
2024-10-16T02:16:11.263073222Z I1016 02:16:11.263058 1 controller.go:720] Successfully updated 1 out of 1 load balancers to direct traffic to the updated set of nodes
2024-10-16T02:16:11.675344914Z W1016 02:16:11.675297 1 aws.go:3631] Found multiple subnets in AZ "ap-southeast-1b"; choosing "subnet-020xxxxxx" between subnets "subnet-020xxxxxx" and "subnet-0542dxxxxxx"
2024-10-16T02:16:11.675344914Z W1016 02:16:11.675333 1 aws.go:3626] Found multiple subnets in AZ "ap-southeast-1a"; choosing "subnet-0439xxxxxx" between subnets "subnet-0ddxxxxxx" and "subnet-0439xxxxxx"
2024-10-16T02:16:11.675385024Z W1016 02:16:11.675346 1 aws.go:3631] Found multiple subnets in AZ "ap-southeast-1b"; choosing "subnet-020xxxxxx" between subnets "subnet-020xxxxxx" and "subnet-0d27bxxxxxxx"
2024-10-16T02:16:11.675385024Z W1016 02:16:11.675359 1 aws.go:3631] Found multiple subnets in AZ "ap-southeast-1b"; choosing "subnet-020xxxxxx" between subnets "subnet-020xxxxxx" and "subnet-0ba4axxxxxxxx"
2024-10-16T02:16:11.675390833Z W1016 02:16:11.675382 1 aws.go:3626] Found multiple subnets in AZ "ap-southeast-1a"; choosing "subnet-04353xxxxxx" between subnets "subnet-0439xxxxxx" and "subnet-04353xxxxxx"
2024-10-16T02:16:11.675407903Z W1016 02:16:11.675397 1 aws.go:3631] Found multiple subnets in AZ "ap-southeast-1a"; choosing "subnet-04353xxxxxx" between subnets "subnet-04353xxxxxx" and "subnet-0ad250xxxxx"
2024-10-16T02:16:11.675426130Z W1016 02:16:11.675417 1 aws.go:3626] Found multiple subnets in AZ "ap-southeast-1a"; choosing "subnet-0053xxxxxx" between subnets "subnet-04353xxxxxx" and "subnet-0053xxxxxx"
2024-10-16T02:16:11.915384178Z I1016 02:16:11.913555 1 aws.go:3212] Existing security group ingress: sg-06axxxxxx [{
2024-10-16T02:16:11.915384178Z FromPort: 80,
2024-10-16T02:16:11.915384178Z IpProtocol: "tcp",
2024-10-16T02:16:11.915384178Z IpRanges: [{
2024-10-16T02:16:11.915384178Z CidrIp: "0.0.0.0/0"
2024-10-16T02:16:11.915384178Z }],
2024-10-16T02:16:11.915384178Z ToPort: 80
2024-10-16T02:16:11.915384178Z } {
2024-10-16T02:16:11.915384178Z FromPort: 3,
2024-10-16T02:16:11.915384178Z IpProtocol: "icmp",
2024-10-16T02:16:11.915384178Z IpRanges: [{
2024-10-16T02:16:11.915384178Z CidrIp: "0.0.0.0/0"
2024-10-16T02:16:11.915384178Z }],
2024-10-16T02:16:11.915384178Z ToPort: 4
2024-10-16T02:16:11.915384178Z } {
2024-10-16T02:16:11.915384178Z FromPort: 443,
2024-10-16T02:16:11.915384178Z IpProtocol: "tcp",
2024-10-16T02:16:11.915384178Z IpRanges: [{
2024-10-16T02:16:11.915384178Z CidrIp: "0.0.0.0/0"
2024-10-16T02:16:11.915384178Z }],
2024-10-16T02:16:11.915384178Z ToPort: 443
2024-10-16T02:16:11.915384178Z }]
2024-10-16T02:16:11.990159340Z I1016 02:16:11.990123 1 aws_loadbalancer.go:1574] Creating proxy protocol policy on load balancer
2024-10-16T02:16:12.052159081Z I1016 02:16:12.052115 1 aws_loadbalancer.go:1198] Creating additional load balancer tags for a348d4f0xxxxxxxxx
2024-10-16T02:16:12.100943358Z I1016 02:16:12.100903 1 aws_loadbalancer.go:1225] Updating load-balancer attributes for "a348d4f0xxxxxxxxx"
2024-10-16T02:16:12.580245637Z I1016 02:16:12.580204 1 node_controller.go:267] Update 10 nodes status took 1.329172494s.
2024-10-16T02:16:12.586699146Z I1016 02:16:12.586651 1 aws.go:4671] Removing rule for traffic from the load balancer (sg-06a06xxxxxxxx) to instance (sg-046exxxxxxx)
2024-10-16T02:16:12.647426008Z W1016 02:16:12.647381 1 aws.go:4698] Revoking ingress was not needed; concurrent change? groupId=sg-046exxxxxxx
===============================================
2024-10-16T02:16:12.712758269Z I1016 02:16:12.710969 1 aws_loadbalancer.go:1485] Instances removed from load-balancer a348d4f0xxxxxxxxx
===============================================
I found the following code on GitHub for the function that produced the log line above when removing the nodes from the CLB:
https://github.com/openshift/cloud-provider-aws/blob/fd77d92ced47559dadf53fb8c97d1cbeb64dde8c/pkg/providers/v1/aws_loadbalancer.go#L1437
I also checked internally with Miciah Masters and understood that the following criteria decide which nodes are added to the CLB:
- The node must not have the node.kubernetes.io/exclude-from-external-load-balancers label.
- The node must not have the ToBeDeletedByClusterAutoscaler taint.
- For older versions of OpenShift, the node must be ready (i.e., its "Ready" status condition must have status "True").
- For newer versions of OpenShift, the node must not be marked for deletion (i.e., its deletionTimestamp is null).
cloud-provider-aws then applies these additional criteria:
- The node's spec.providerID must be set.
- If the service's service.beta.kubernetes.io/aws-load-balancer-target-node-labels annotation is set, instances are selected based on the labels it lists.
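The criteria listed above can be written down as a small checker, which may help when walking through the node objects in a must-gather by hand. This is only an illustrative sketch in Python, not the cloud-provider-aws implementation: the `Node` class, `parse_target_labels`, and `exclusion_reasons` are hypothetical names, the `check_ready` flag stands in for the older-versus-newer OpenShift behaviour described above, and the annotation is assumed to hold comma-separated `key=value` pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

EXCLUDE_LABEL = "node.kubernetes.io/exclude-from-external-load-balancers"
AUTOSCALER_TAINT = "ToBeDeletedByClusterAutoscaler"
TARGET_LABELS_ANNOTATION = (
    "service.beta.kubernetes.io/aws-load-balancer-target-node-labels"
)


@dataclass
class Node:
    """Hypothetical, simplified view of the node fields the criteria look at."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[str] = field(default_factory=list)  # taint keys only
    provider_id: str = ""                            # spec.providerID
    deletion_timestamp: Optional[str] = None         # metadata.deletionTimestamp
    ready: bool = True                               # "Ready" condition is "True"


def parse_target_labels(value: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict (assumed annotation format)."""
    wanted = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, _, val = pair.partition("=")
        wanted[key.strip()] = val.strip()
    return wanted


def exclusion_reasons(node: Node,
                      service_annotations: Dict[str, str],
                      check_ready: bool = False) -> List[str]:
    """Return every listed criterion the node fails; empty means eligible."""
    reasons = []
    if EXCLUDE_LABEL in node.labels:
        reasons.append("has label " + EXCLUDE_LABEL)
    if AUTOSCALER_TAINT in node.taints:
        reasons.append("has taint " + AUTOSCALER_TAINT)
    if check_ready:
        # Older behaviour: the node must be Ready.
        if not node.ready:
            reasons.append("Ready condition is not True")
    elif node.deletion_timestamp is not None:
        # Newer behaviour: the node must not be marked for deletion.
        reasons.append("deletionTimestamp is set")
    if not node.provider_id:
        reasons.append("spec.providerID is not set")
    wanted = parse_target_labels(
        service_annotations.get(TARGET_LABELS_ANNOTATION, ""))
    for key, val in wanted.items():
        if node.labels.get(key) != val:
            reasons.append("missing target label " + key + "=" + val)
    return reasons


if __name__ == "__main__":
    # Made-up example node; the instance ID is a placeholder.
    worker = Node(
        name="worker-0",
        labels={"node-role.kubernetes.io/worker": ""},
        provider_id="aws:///ap-southeast-1a/i-0123456789abcdef0",
    )
    print(worker.name, exclusion_reasons(worker, {}))
```

A node that returns an empty list passes every criterion named in this report. If such a node is still deregistered, the cause lies outside these checks.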
I have checked the must-gather and can confirm that the nodes meet all of the above criteria, but it is still unclear why the CCM removes the worker nodes, and not the master nodes, from the CLB. We therefore need the engineering team's help with this issue.