Red Hat OpenStack Services on OpenShift
OSPRH-18259

After update from 18.0.4 to FR3, Galera pods are down


      To Reproduce
      Steps to reproduce the behavior:
      This was seen when running the job described in the Google document "Run Trunk uni03gamma Update using testproject" linked to the ticket.

      All Galera pods are down:

      pod/openstack-cell1-galera-0                                          0/1     Running            3 (2m21s ago)    3h31m
      pod/openstack-cell1-galera-1                                          0/1     Running            2 (2m7s ago)     3h31m
      pod/openstack-cell1-galera-2                                          0/1     Running            4 (3m24s ago)    17m
      pod/openstack-galera-0                                                0/1     Running            5 (95s ago)      3h31m
      pod/openstack-galera-1                                                0/1     Running            4 (2m49s ago)    15m
      pod/openstack-galera-2                                                0/1     Running            2 (4m12s ago)    17m

       

      From e.g. the galera-0 pod describe output: https://sf.apps.int.gpc.ocp-hub.prod.psi.redhat.com/logs/03c/components-integration/03cadb3677924a64bacd3abc64905d7c/logs/controller-0/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack/pods/openstack-galera-0/openstack-galera-0-describe

      Events:
        Type     Reason     Age   From     Message
        ----     ------     ----  ----     -------
        Warning  Unhealthy  67m   kubelet  Readiness probe failed: + mysql -uroot -sNEe 'show status like '\''wsrep_local_state_comment'\'';'
      + grep -w -e Synced
      + tail -1
        Warning  Unhealthy  25m (x29 over 3h14m)  kubelet  Readiness probe failed: command timed out
        Warning  Unhealthy  20m (x28 over 155m)   kubelet  Liveness probe failed: command timed out
        Warning  Unhealthy  2m49s (x93 over 19m)  kubelet  Startup probe failed: /var/lib/operator-scripts/mysql_probe.sh: line 187: $2: unbound variable
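The `$2: unbound variable` startup-probe failure is the classic symptom of a bash script running under `set -u` (nounset) that references a positional parameter it was never passed. A minimal sketch of the failure mode and the usual defensive fix; the function names here are illustrative, not taken from `mysql_probe.sh`:

```shell
#!/bin/bash
set -u   # 'nounset': referencing an unset variable/parameter aborts the script

probe() {
    # Referencing $2 unconditionally fails with "$2: unbound variable"
    # whenever the caller passes fewer than two arguments:
    echo "arg2 is: $2"
}

safe_probe() {
    local arg2="${2:-}"   # default to empty instead of aborting
    echo "arg2 is: '${arg2}'"
}

safe_probe startup       # prints: arg2 is: ''
```

Under `set -u`, a call like `probe startup` aborts the whole probe script, which is consistent with the repeated startup-probe failures in the events above.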

      From the pod log, it is waiting for the gcomm URI to be configured:

      Running command: '/usr/local/bin/detect_gcomm_and_start.sh'
      ++ [[ -n '' ]]
      ++ [[ -n '' ]]
      + echo 'Running command: '\''/usr/local/bin/detect_gcomm_and_start.sh'\'''
      + umask 0022
      + exec /usr/local/bin/detect_gcomm_and_start.sh
      Waiting for gcomm URI to be configured for this POD
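The entrypoint blocks until the mariadb-operator publishes a gcomm URI for the pod, so while the pods stay unconfigured the startup probe keeps failing. A hedged sketch of that wait pattern, assuming the URI arrives via a file (the path is a hypothetical stand-in; the real mechanism lives in `detect_gcomm_and_start.sh`):

```shell
#!/bin/bash
# Sketch only: the operator's actual signaling mechanism may differ.
GCOMM_FILE="${GCOMM_FILE:-/tmp/gcomm_uri}"   # hypothetical location

wait_for_gcomm() {
    # Block until a non-empty gcomm URI shows up, then print it.
    until [ -s "$GCOMM_FILE" ]; do
        echo "Waiting for gcomm URI to be configured for this POD"
        sleep 1
    done
    cat "$GCOMM_FILE"
}
```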

      Another pod shows:

      Events:
        Type     Reason          Age                 From               Message
        ----     ------          ----                ----               -------
        Normal   Scheduled       20m                 default-scheduler  Successfully assigned openstack/openstack-galera-2 to master-0
        Normal   AddedInterface  19m                 multus             Add eth0 [192.168.17.0/23] from ovn-kubernetes
        Normal   Pulled          19m                 kubelet            Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:13119fe0ec56263a2bff3fc9c3892ea6386b837955280d8488f05d8ea6c4e44d" already present on machine
        Normal   Created         19m                 kubelet            Created container mysql-bootstrap
        Normal   Started         19m                 kubelet            Started container mysql-bootstrap
        Normal   Pulled          19m                 kubelet            Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:13119fe0ec56263a2bff3fc9c3892ea6386b837955280d8488f05d8ea6c4e44d" already present on machine
        Normal   Created         19m                 kubelet            Created container galera
        Normal   Started         19m                 kubelet            Started container galera
        Warning  Unhealthy       15m (x6 over 17m)   kubelet            Readiness probe failed: command timed out
        Warning  Unhealthy       15m (x5 over 17m)   kubelet            Liveness probe failed: command timed out
        Warning  Unhealthy       14m (x13 over 16m)  kubelet            Readiness probe failed: wsrep_local_state_comment (Donor/Desynced) differs from Synced
        Warning  Unhealthy       8m51s               kubelet            Readiness probe failed: wsrep_local_state_comment (Initialized) differs from Synced
        Warning  Unhealthy       2m20s               kubelet            Startup probe failed: waiting for SST to finish
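The readiness events above show the gate the probe applies: the pod only becomes Ready once `wsrep_local_state_comment` equals `Synced`; `Donor/Desynced` and `Initialized` both fail it. A small sketch of that comparison, taking the raw status-query output on stdin so it can be exercised without a live server (the real probe logic is in `mysql_probe.sh`):

```shell
#!/bin/bash
# Sketch of the Synced gate; mirrors the probe's tail/grep pipeline.
is_synced() {
    # stdin: probe query output whose last line is the wsrep state value
    tail -1 | grep -qw -e Synced
}
```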

      Manually `rsh`-ing into a Galera pod and running `detect_last_commit.sh` (https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/templates/galera/bin/detect_last_commit.sh), which the operator runs, showed that it failed to get the seqno (https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/templates/galera/bin/detect_last_commit.sh#L110).
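For context: Galera records its last committed position in `/var/lib/mysql/grastate.dat` (cluster uuid plus seqno). When that file is missing or unreadable, for example on a bad PV, the seqno cannot be read from disk and has to be recovered another way (typically by running the server with `--wsrep-recover`). A hedged sketch of the on-disk read, not the actual script:

```shell
#!/bin/bash
# Sketch: read the last committed seqno from grastate.dat. A value of -1
# means an unclean shutdown; a missing file means no recoverable state.
get_seqno() {
    local grastate="${1:-/var/lib/mysql/grastate.dat}"
    [ -r "$grastate" ] || return 1              # e.g. bad/empty PV
    awk '/^seqno:/ {print $2}' "$grastate"
}
```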

      The current suspicion is that this was an environment issue with a bad storage PV backing the pods.

      Expected behavior
      Update completes successfully.

      Bug impact

      • ctlplane services are blocked because the DB is down

      Known workaround

      • `rsh` into the pods which fail to get the seqno and delete the files holding the DB data.
      • restart the pods afterwards
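The workaround steps above can be sketched as the following `oc` sequence. The pod name is an example; adapt it to the pods that fail seqno detection. This is destructive: deleting the data directory forces a full SST from a healthy donor when the pod restarts. `DRY_RUN=1` (the default here) only prints the commands:

```shell
#!/bin/bash
# Hedged sketch of the workaround; pod name and DRY_RUN guard are
# illustrative. Deleting /var/lib/mysql/* discards the local DB copy,
# so on restart the pod rejoins via a full SST from a donor node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"      # dry run: show the command instead of running it
    else
        "$@"
    fi
}

clean_and_restart() {
    local pod="$1"
    run oc rsh "$pod" -- bash -c 'rm -rf /var/lib/mysql/*'
    run oc delete pod "$pod"   # the StatefulSet recreates the pod
}

clean_and_restart openstack-galera-0
```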

        rhn-support-lmiccini Luca Miccini
        rhn-support-mschuppe Martin Schuppert
        rhos-dfg-pidone