-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
1.3, 1.3.1, 1.3.2, 1.3.3, 1.4
-
None
-
False
-
-
False
-
-
Known Issue
-
Done
-
-
Description of problem:
It doesn't seem possible to have 2 replicas of a Helm-based RHDH instance running on different nodes, as one would expect for typical HA deployments.
Prerequisites (if any, like setup, operators/versions):
- Helm Chart 1.3.0 and 1.4.0
- Tested on a ROSA 4.17 cluster with at least 2 nodes
Steps to Reproduce:
- Check that you have at least 2 nodes available in the cluster, e.g.:
$ oc get nodes
NAME                        STATUS   ROLES    AGE   VERSION
ip-10-0-1-11.ec2.internal   Ready    worker   72m   v1.30.6
ip-10-0-1-71.ec2.internal   Ready    worker   72m   v1.30.6
- Create a values file with a topology spread constraint that enforces the scheduling of those replicas on different nodes:
# my-values-topology-spread-constraints.yaml
upstream:
  backstage:
    replicas: 2
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: my-backstage-1
- Deploy RHDH using Helm, providing the values file above, e.g.:
$ git clone https://github.com/redhat-developer/rhdh-chart.git && cd rhdh-chart
$ helm upgrade --install my-backstage-1 \
    charts/backstage \
    --set global.clusterRouterBase=`oc get ingress.config.openshift.io/cluster '-o=jsonpath={.spec.domain}'` \
    --values my-values-topology-spread-constraints.yaml
Actual results:
Only 1 replica will be running; the second one gets stuck with a Multi-Attach error:
$ oc get deploy my-backstage-1
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
my-backstage-1   1/2     2            1           23m

Relevant excerpt from the description of the stuck pod:

Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=my-backstage-1
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Warning  FailedScheduling    8m26s  default-scheduler        0/2 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 1 node(s) didn't match pod topology spread constraints. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling    8m25s  default-scheduler        0/2 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 1 node(s) didn't match pod topology spread constraints. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
  Normal   Scheduled           8m23s  default-scheduler        Successfully assigned my-ns/my-backstage-1-744b9f4bb-8t8h5 to ip-10-0-1-71.ec2.internal
  Warning  FailedAttachVolume  8m19s  attachdetach-controller  Multi-Attach error for volume "pvc-319f0c25-2f15-4320-83db-5f55e7a2c2fb" Volume is already used by pod(s) my-backstage-1-744b9f4bb-bzfdk
The list of PVCs confirms that the affected volume is the dynamic plugins root PVC:
$ oc get pvc
NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
data-my-backstage-1-postgresql-0      Bound    pvc-aecafeaf-cc58-45bb-a421-19f5624a4e0a   1Gi        RWO            gp3-csi        <unset>                 30m
my-backstage-1-dynamic-plugins-root   Bound    pvc-319f0c25-2f15-4320-83db-5f55e7a2c2fb   5Gi        RWO            gp3-csi        <unset>                 30m
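The ReadWriteOnce access mode can also be read directly from the PVC spec (expected output based on the table above):
$ oc get pvc my-backstage-1-dynamic-plugins-root -o jsonpath='{.spec.accessModes[*]}'
ReadWriteOnce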
Expected results:
Both replicas should be running.
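For reference, a healthy deployment would report something along these lines (illustrative output only):
$ oc get deploy my-backstage-1
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
my-backstage-1   2/2     2            2           23m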
Reproducibility (Always/Intermittent/Only Once):
Always.
A user reported a similar issue when upgrading a 1.3 Helm release (see RHDHBUGS-135). We also noticed something similar when upgrading a Helm-based instance from 1.3 to 1.4 (see https://redhat-internal.slack.com/archives/C04CUSD4JSG/p1734467111777559).
This can happen when the cluster scheduler assigns the new pod created during the upgrade to a different node than the one running the existing pod.
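To confirm where each replica landed in such a case, the node assignment of the pods can be checked, e.g. (assuming the pods carry the app.kubernetes.io/instance=my-backstage-1 label targeted by the constraint above):
$ oc get pods -l app.kubernetes.io/instance=my-backstage-1 -o wide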
Build Details:
Helm 1.3 and 1.4
Additional info (Such as Logs, Screenshots, etc):
This seems to be caused by mounting the dynamic plugins root PVC as RWO by default (RHIDP-3572).
The Operator is not affected because it does not create a dynamic plugins root PVC out of the box.
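At the Kubernetes level, an RWO volume can only be attached to one node at a time, so a second replica scheduled onto a different node hits the Multi-Attach error regardless of the topology spread constraint. For illustration only (not a chart-supported override), a PVC shareable across nodes would need an RWX access mode backed by an RWX-capable provisioner, which the EBS-backed gp3-csi class used above does not offer:

# illustration-rwx-pvc.yaml -- hypothetical, for comparison only
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-dynamic-plugins-root   # hypothetical name, not created by the chart
spec:
  accessModes:
    - ReadWriteMany                    # the chart default is ReadWriteOnce, which cannot span nodes
  storageClassName: <an RWX-capable storage class, e.g. CephFS- or NFS-backed>
  resources:
    requests:
      storage: 5Gi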
- is caused by
-
RHIDP-3572 update helm chart and operator to use non ephemeral PVC
- Closed
- is duplicated by
-
RHIDP-5344 Multi-Attached error for Volume (PVC)
- Closed