JBoss Enterprise Application Platform
JBEAP-24814

Can't scale down from 3 to 1 pods when Tx recovery is ongoing


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: OP-2.3.10.GA
    • Component/s: OpenShift, Operator

      I am trying to run the application at https://github.com/kabir/eap-operator-tx-recovery-demo/tree/JBEAP-24814 (an archive of it is also attached).

      I deploy everything as described in the application README, scale the application to three pods, and then make sure that pods eap7-app-1 and eap7-app-2 are in a transaction (steps below).

      After almost an hour (five hours in yesterday's run) I still have three pods, with eap7-app-1 and eap7-app-2 stuck in the SCALING_DOWN_RECOVERY_INVESTIGATION state:

      % oc get wfly eap7-app --template={{.status}} -w
      map[hosts:[eap7-app-route-myproject.apps.cluster-l4fgj.l4fgj.sandbox309.opentlc.com] pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] map[name:eap7-app-1 podIP:10.128.2.13 state:ACTIVE] map[name:eap7-app-2 podIP:10.131.0.35 state:ACTIVE]] replicas:3 scalingdownPods:0 selector:app.kubernetes.io/name=eap7-app]
      map[hosts:[eap7-app-route-myproject.apps.cluster-l4fgj.l4fgj.sandbox309.opentlc.com] pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] map[name:eap7-app-1 podIP: state:ACTIVE] map[name:eap7-app-2 podIP:10.131.0.35 state:ACTIVE]] replicas:3 scalingdownPods:0 selector:app.kubernetes.io/name=eap7-app]
      map[hosts:[eap7-app-route-myproject.apps.cluster-l4fgj.l4fgj.sandbox309.opentlc.com] pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] map[name:eap7-app-1 podIP:10.128.2.14 state:ACTIVE] map[name:eap7-app-2 podIP:10.131.0.35 state:ACTIVE]] replicas:3 scalingdownPods:0 selector:app.kubernetes.io/name=eap7-app]
      
      
      --- SCALING TO 1 ----
      
      map[hosts:[eap7-app-route-myproject.apps.cluster-l4fgj.l4fgj.sandbox309.opentlc.com] pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] map[name:eap7-app-1 podIP:10.128.2.14 state:ACTIVE] map[name:eap7-app-2 podIP:10.131.0.35 state:ACTIVE]] replicas:3 scalingdownPods:0 selector:app.kubernetes.io/name=eap7-app]
      map[hosts:[eap7-app-route-myproject.apps.cluster-l4fgj.l4fgj.sandbox309.opentlc.com] pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] map[name:eap7-app-1 podIP:10.128.2.14 state:SCALING_DOWN_RECOVERY_INVESTIGATION] map[name:eap7-app-2 podIP:10.131.0.35 state:SCALING_DOWN_RECOVERY_INVESTIGATION]] replicas:3 scalingdownPods:2 selector:app.kubernetes.io/name=eap7-app]
      
      
      map[hosts:[eap7-app-route-myproject.apps.cluster-l4fgj.l4fgj.sandbox309.opentlc.com] pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] map[name:eap7-app-1 podIP:10.128.2.40 state:SCALING_DOWN_RECOVERY_INVESTIGATION] map[name:eap7-app-2 podIP:10.128.2.41 state:SCALING_DOWN_RECOVERY_INVESTIGATION]] replicas:3 scalingdownPods:2 selector:app.kubernetes.io/name=eap7-app]
      
      

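      Eyeballing those concatenated Go-map dumps is error-prone; a small helper can pull out just the pod name/state pairs. A sketch in Python (the regex assumes the `map[name:... podIP:... state:...]` layout seen in the output above, including the occasionally empty podIP):

```python
import re

def pod_states(status_text):
    """Extract (pod name, state) pairs from `oc get wfly --template={{.status}}` output."""
    # Each pod appears as: map[name:eap7-app-N podIP:... state:STATE]
    # podIP may be empty while a pod is restarting, hence \S* rather than \S+.
    return re.findall(r"map\[name:(\S+) podIP:\S* state:([A-Z_]+)\]", status_text)

# Abbreviated sample in the shape of the watch output above.
sample = ("map[hosts:[eap7-app-route-myproject.example.com] "
          "pods:[map[name:eap7-app-0 podIP:10.129.2.23 state:ACTIVE] "
          "map[name:eap7-app-1 podIP: state:SCALING_DOWN_RECOVERY_INVESTIGATION]] "
          "replicas:3 scalingdownPods:1 selector:app.kubernetes.io/name=eap7-app]")
print(pod_states(sample))
# -> [('eap7-app-0', 'ACTIVE'), ('eap7-app-1', 'SCALING_DOWN_RECOVERY_INVESTIGATION')]
```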
      As I understand it, this should take only a minute.

      The commands to get to this stage are:

      [~/sourcecontrol/eap-operator-tx-recovery-demo] 
      % ./demo.sh add one
      --- SNIP ---
      > 
      * Mark bundle as not supporting multiuse
      < HTTP/1.1 202 Accepted
      --- SNIP ---
      eap7-app-1%                      
      

      The request above was handled by pod 1. Now we try another:

      [~/sourcecontrol/eap-operator-tx-recovery-demo] 
      % ./demo.sh add two
      --- SNIP ---
      < HTTP/1.1 202 Accepted
      --- SNIP ---
      eap7-app-2%                                                                                                                                                        
      

      The request above was handled by pod 2. Now we try another add:

      % ./demo.sh add three
      ---- SNIP ----
      < HTTP/1.1 409 Conflict
      ---- SNIP ----
      

      The request above hit one of the pods that already holds a transaction (pod 1 or 2), so we try again:

      % ./demo.sh add three
      --- SNIP ---
      < HTTP/1.1 202 Accepted
      --- SNIP ---
      eap7-app-0%                                                                                                                                                        
      

      This worked and was handled by pod 0. Now we release pod 0's transaction (pods 1 and 2 are still hanging):

      % ./demo.sh release 0
      

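      Side note: the repeat-until-202 dance above (retry whenever the route lands on a pod that already holds a transaction) can be scripted. A minimal sketch, with the HTTP call injected as a callable so nothing here depends on the demo's actual endpoint (all names below are illustrative, not part of the demo):

```python
def add_with_retry(post, max_attempts=10):
    """Repeat `post()` until some pod answers 202 Accepted.

    `post` is any callable returning an HTTP status code. A 409 Conflict
    means the route happened to hit a pod that already holds a transaction,
    so we simply try again; any other status is unexpected.
    """
    for attempt in range(1, max_attempts + 1):
        status = post()
        if status == 202:
            return attempt
        if status != 409:
            raise RuntimeError(f"unexpected HTTP status {status}")
    raise RuntimeError(f"no free pod found after {max_attempts} attempts")

# Simulate the run above: two busy pods answer 409, then a free pod answers 202.
responses = iter([409, 409, 202])
print(add_with_retry(lambda: next(responses)))  # -> 3
```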
      Now I try to scale the pods to 1 (from 3).

      Looking at the logs for pod 1, it looks like the pod is terminated before the EAP instance has a chance to come up. The attached log (pod 1.log) contains a few attempts at running oc logs -f eap7-app-1 (the stream disconnects when the pod is terminated). Look for

         ERROR *** WildFly wrapper process (1) received TERM signal ***

      to see where OpenShift stops the pod.
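      If the TERM signal really arrives before the recovery-scan EAP instance can boot, one thing worth trying is a longer termination grace period on the pod template. A hypothetical fragment (terminationGracePeriodSeconds is a standard Kubernetes podSpec field; whether and where the operator exposes it is exactly what would need checking):

```yaml
# Sketch only: give the scaling-down pod more time before SIGKILL,
# so the WildFly wrapper is not torn down mid-recovery.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 600   # Kubernetes default is 30s
```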

      Attachments: pod 1.log (84 kB, uploaded by Kabir Khan)

      Assignee: Unassigned
      Reporter: Kabir Khan (kkhan1@redhat.com)