Loading...

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 1.5.0
Affects Version/s: 1.3.0, 1.3.1, 1.3.2, 1.3.3, 1.3.4, 1.3.5, 1.4.0, 1.4.1, 1.4.2
Component/s: Customer, Helm Chart
Labels:
- Customer

Story Points:
3
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Release Note Text:

Hide
= Support for multiple replicas across cluster nodes

Previously, the {product-short} Helm Chart did not support running multiple replicas on different cluster nodes due to the automatic creation of a dynamic plugins root Persistent Volume Claim (PVC). This issue has been resolved by reverting the PVC creation and switching to an ephemeral volume by default, allowing multiple replicas to function properly.

Show
= Support for multiple replicas across cluster nodes Previously, the {product-short} Helm Chart did not support running multiple replicas on different cluster nodes due to the automatic creation of a dynamic plugins root Persistent Volume Claim (PVC). This issue has been resolved by reverting the PVC creation and switching to an ephemeral volume by default, allowing multiple replicas to function properly.
Release Note Type:
Bug Fix
Release Note Status:
In Progress
Intelligence Requested:
Market:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

It doesn't seem possible to have 2 replicas of a Helm-based RHDH instance running on different nodes, as one would expect for typical HA deployments.

Prerequisites (if any, like setup, operators/versions):

Helm Chart 1.3.0 and 1.4.0
Tested an a ROSA 4.17 cluster with at least 2 nodes

Steps to Reproduce

Check that you have at least 2 nodes available in the cluster, e.g.:

$ oc get nodes                                  
NAME                        STATUS   ROLES    AGE   VERSION
ip-10-0-1-11.ec2.internal   Ready    worker   72m   v1.30.6
ip-10-0-1-71.ec2.internal   Ready    worker   72m   v1.30.6

Create a values file with a topology spread constraint that enforces the scheduling of those replicas on different nodes:

# my-values-topology-spread-constraints.yaml
upstream:
  backstage:
    replicas: 2
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/instance: my-backstage-1

deploy RHDH using Helm and provide the specified values file, e.g.:

$ git clone https://github.com/redhat-developer/rhdh-chart.git && cd rhdh-chart
$ helm upgrade --install my-backstage-1 \
    charts/backstage \
    --set global.clusterRouterBase=`oc get ingress.config.openshift.io/cluster '-o=jsonpath={.spec.domain}'` \
    --values my-values-topology-spread-constraints.yaml

Actual results:

Only 1 replica will be running. The second one will get stuck on a Multi-Attach Error:

$ oc get deploy my-backstage-1                                                                                   
NAME             READY   UP-TO-DATE   AVAILABLE   AGE
my-backstage-1   1/2     2            1           23m

Topology Spread Constraints:  kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=my-backstage-1
Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Warning  FailedScheduling    8m26s  default-scheduler        0/2 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 1 node(s) didn't match pod topology spread constraints. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
  Warning  FailedScheduling    8m25s  default-scheduler        0/2 nodes are available: 1 node(s) didn't find available persistent volumes to bind, 1 node(s) didn't match pod topology spread constraints. preemption: 0/2 nodes are available: 1 No preemption victims found for incoming pod, 1 Preemption is not helpful for scheduling.
  Normal   Scheduled           8m23s  default-scheduler        Successfully assigned my-ns/my-backstage-1-744b9f4bb-8t8h5 to ip-10-0-1-71.ec2.internal
  Warning  FailedAttachVolume  8m19s  attachdetach-controller  Multi-Attach error for volume "pvc-319f0c25-2f15-4320-83db-5f55e7a2c2fb" Volume is already used by pod(s) my-backstage-1-744b9f4bb-bzfdk

The list of PVCs, to confirm that this is about the dynamic plugins root PVC:

$ oc get pvc                  
NAME                                        STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
data-my-backstage-1-postgresql-0            Bound         pvc-aecafeaf-cc58-45bb-a421-19f5624a4e0a   1Gi        RWO            gp3-csi        <unset>                 30m
my-backstage-1-dynamic-plugins-root         Bound         pvc-319f0c25-2f15-4320-83db-5f55e7a2c2fb   5Gi        RWO            gp3-csi        <unset>                 30m

Expected results:

Both replicas should be running.

Reproducibility (Always/Intermittent/Only Once):

Always.

A user reported a similar issue when upgrading a 1.3 Helm release (see RHDHBUGS-135). We also noticed something similar when upgrading an Helm-based instance from 1.3 to 1.4 (see https://redhat-internal.slack.com/archives/C04CUSD4JSG/p1734467111777559 ).

This might happen if the cluster scheduler assigns the new pod to a different node.

Build Details:

Helm 1.3 and 1.4

Additional info (Such as Logs, Screenshots, etc):

This seems to be caused by mounting the dynamic plugins root PVC as RWO by default (~~RHIDP-3572~~).

The Operator is not affected because it does not create a dynamic plugins root PVC out of the box.

is caused by

RHIDP-3572 update helm chart and operator to use non ephemeral PVC

Closed

is duplicated by

RHIDP-5344 Multi-Attached error for Volume (PVC)

Backlog

is incorporated by

RHIDP-5516 [Helm] Disable dynamic plugin cache PVC by default

In Progress

relates to

RHIDP-5573 [Helm] Dynamic plugins PVC Storage class and access mode are not configurable

Release Pending

RHIDP-5839 In the Helm chart, the audit log PVC Storage class and Access mode are not configurable

Closed

Details

Description

Description of problem:

Prerequisites (if any, like setup, operators/versions):

Steps to Reproduce

Actual results:

Expected results:

Reproducibility (Always/Intermittent/Only Once):

Build Details:

Additional info (Such as Logs, Screenshots, etc):

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates