Bug
Resolution: Unresolved
Normal
4.21
OCPEDGE Sprint 284, OCPEDGE Sprint 285
Description:
The cluster-etcd-operator fails to create the tnf-after-setup-job when the cluster nodes have long hostnames (FQDNs). The operator attempts to use the full node name as a label value in the Job's pod template, which violates the Kubernetes 63-character limit for label values.
During the installation of Two-Node Fencing (TNF) on OpenShift (specifically observed in version 4.21 via ZTP), the etcd ClusterOperator remains in the Progressing: True state indefinitely.
The output of oc describe co etcd shows: Message: tnf-after-setup-job-<hostname> Progressing: Job is running
However, the Job and its associated Pods never appear in the openshift-etcd namespace. The cluster-etcd-operator logs contain a JobCreateFailed warning because the generated label value exceeds the maximum allowed length of 63 characters.
The operator logs show the following validation error:
I0205 13:20:14.344362 1 event.go:377] Event(...): type: 'Warning' reason: 'JobCreateFailed' Failed to create Job.batch/tnf-after-setup-job-worker-01.cnf77.se-lab.eng.rdu2.dc.redhat.com -n openshift-etcd: Job.batch "tnf-after-setup-job-worker-01.cnf77.se-lab.eng.rdu2.dc.redhat.com" is invalid: spec.template.labels: Invalid value: "tnf-after-setup-job-worker-01.cnf77.se-lab.eng.rdu2.dc.redhat.com": must be no more than 63 characters
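The arithmetic behind this error can be checked with a short Go sketch. It assumes, based only on the rejected value in the log line above, that the label value is the prefix tnf-after-setup-job- followed by the full node name; the function names are hypothetical and not taken from the operator source.

```go
package main

import "fmt"

// maxLabelValueLen is the Kubernetes limit on the length of a label value.
const maxLabelValueLen = 63

// jobNamePrefix is the prefix visible in the rejected value in the log.
const jobNamePrefix = "tnf-after-setup-job-"

// labelValueFor builds the value as prefix + node name. This mirrors the
// string in the error message; it is an assumption about the operator logic.
func labelValueFor(nodeName string) string {
	return jobNamePrefix + nodeName
}

// maxNodeNameLen returns the longest node name that still fits the limit.
func maxNodeNameLen() int {
	return maxLabelValueLen - len(jobNamePrefix)
}

func main() {
	node := "worker-01.cnf77.se-lab.eng.rdu2.dc.redhat.com"
	v := labelValueFor(node)
	fmt.Printf("node name length:   %d\n", len(node))
	fmt.Printf("label value length: %d (limit %d)\n", len(v), maxLabelValueLen)
	fmt.Printf("longest node name that fits: %d\n", maxNodeNameLen())
}
```

For the node in the log, the name is 45 characters and the prefix is 20, so the resulting value is 65 characters, two over the limit. Any node name longer than 43 characters would trigger the same failure under this assumption.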
Expected Results:
The operator should handle long hostnames by either:
- Truncating the hostname used in labels.
- Using a hash of the hostname for the label value.
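- Ensuring the label value conforms to the label value syntax defined in the official Kubernetes documentation: at most 63 characters, beginning and ending with an alphanumeric character, with only alphanumerics, dashes, underscores, and dots in between.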
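To make the last option concrete, the following is a minimal Go sketch of a label value check using only the standard library. The regular expression follows the documented label value syntax; the function name isValidLabelValue is chosen here for illustration. In operator code, the existing validation.IsValidLabelValue function from k8s.io/apimachinery/pkg/util/validation would likely be a better fit than a hand-written check.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxLabelValueLen is the Kubernetes limit on the length of a label value.
const maxLabelValueLen = 63

// labelValueRe accepts the empty string, or a value that starts and ends
// with an alphanumeric and contains only alphanumerics, '-', '_' and '.'.
var labelValueRe = regexp.MustCompile(`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`)

// isValidLabelValue reports whether v satisfies both the length limit and
// the character rules for a label value.
func isValidLabelValue(v string) bool {
	return len(v) <= maxLabelValueLen && labelValueRe.MatchString(v)
}

func main() {
	candidates := []string{
		"tnf-after-setup-job-worker-01",
		"tnf-after-setup-job-worker-01.cnf77.se-lab.eng.rdu2.dc.redhat.com",
		"ends-with-dash-",
	}
	for _, v := range candidates {
		fmt.Printf("%q valid=%t\n", v, isValidLabelValue(v))
	}
}
```

Only the first candidate passes: the second is too long, and the third ends in a dash. The third case matters for truncation, because cutting a hostname at an arbitrary position can leave a trailing dot or dash.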
Steps to Reproduce:
- Deploy an OpenShift cluster (version 4.21) with Two-Node Fencing enabled.
- Use hostnames/FQDNs longer than 43 characters, so that the prefix tnf-after-setup-job- (20 characters) plus the hostname exceeds 63 characters.
- Monitor the cluster-etcd-operator logs (openshift-etcd-operator namespace) and the etcd ClusterOperator status.
Suggested Fix:
The logic within the cluster-etcd-operator that generates the Job manifest for TNF needs a helper function that sanitizes and truncates the label values under spec.template.labels.
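One possible shape for such a helper is sketched below in Go. It is not the operator's actual code: the name safeLabelValue, the 8-character hash suffix, and the choice of SHA-256 are all assumptions made for illustration. The sketch combines truncation with a short hash so that two long node names sharing a prefix do not collapse to the same value. It assumes the input is ASCII and starts with an alphanumeric, as the tnf-after-setup-job- prefix does.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

const (
	maxLabelValueLen = 63 // Kubernetes limit on label value length
	hashLen          = 8  // hex characters of the digest kept as a suffix
)

// safeLabelValue is a hypothetical helper. Values that already fit are
// returned unchanged. Longer values are truncated and suffixed with a short
// hash of the full original value, so the result is stable and stays
// distinct for distinct inputs.
func safeLabelValue(v string) string {
	if len(v) <= maxLabelValueLen {
		return v
	}
	sum := sha256.Sum256([]byte(v))
	suffix := hex.EncodeToString(sum[:])[:hashLen]
	// Reserve room for "-" plus the hash, then trim trailing separators so
	// the kept part does not end in '-', '_' or '.'.
	head := strings.TrimRight(v[:maxLabelValueLen-hashLen-1], "-_.")
	return head + "-" + suffix
}

func main() {
	long := "tnf-after-setup-job-worker-01.cnf77.se-lab.eng.rdu2.dc.redhat.com"
	got := safeLabelValue(long)
	fmt.Printf("%s (%d characters)\n", got, len(got))
}
```

The result is never longer than 63 characters and always ends in a hexadecimal digit. If the rejected label turns out to be derived from the Job name itself rather than set explicitly by the operator, the same shortening would presumably need to be applied to the Job name; that should be confirmed against the operator source.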