Type: Bug
Resolution: Cannot Reproduce
Priority: Major
Version: rhos-18.0.4
Sprint: Sprint 3
Severity: Important
To Reproduce
Steps to reproduce the behavior:
This was seen when running the job as described in the Google document "Run Trunk uni03gamma Update using testproject" linked to the ticket.
All galera pods are down:
pod/openstack-cell1-galera-0   0/1   Running   3 (2m21s ago)   3h31m
pod/openstack-cell1-galera-1   0/1   Running   2 (2m7s ago)    3h31m
pod/openstack-cell1-galera-2   0/1   Running   4 (3m24s ago)   17m
pod/openstack-galera-0         0/1   Running   5 (95s ago)     3h31m
pod/openstack-galera-1         0/1   Running   4 (2m49s ago)   15m
pod/openstack-galera-2         0/1   Running   2 (4m12s ago)   17m
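A listing like the one above can be gathered with (assuming the control plane runs in the default `openstack` namespace):

```bash
# List the galera pods of both the default and the cell1 cluster.
oc get pods -n openstack | grep galera
```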
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  67m                   kubelet  Readiness probe failed: + mysql -uroot -sNEe 'show status like '\''wsrep_local_state_comment'\'';' + grep -w -e Synced + tail -1
  Warning  Unhealthy  25m (x29 over 3h14m)  kubelet  Readiness probe failed: command timed out
  Warning  Unhealthy  20m (x28 over 155m)   kubelet  Liveness probe failed: command timed out
  Warning  Unhealthy  2m49s (x93 over 19m)  kubelet  Startup probe failed: /var/lib/operator-scripts/mysql_probe.sh: line 187: $2: unbound variable
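The probe history above comes from the pod events; it can be pulled for any of the failing replicas with:

```bash
# Show events, probe results, and container state for one replica.
oc describe pod openstack-galera-0 -n openstack
```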
From the pod log, it is waiting for the gcomm URI to be configured:
Running command: '/usr/local/bin/detect_gcomm_and_start.sh'
++ [[ -n '' ]]
++ [[ -n '' ]]
+ echo 'Running command: '\''/usr/local/bin/detect_gcomm_and_start.sh'\'''
+ umask 0022
+ exec /usr/local/bin/detect_gcomm_and_start.sh
Waiting for gcomm URI to be configured for this POD
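A sketch of how to pull this log, assuming the main container is named `galera` (as the events below suggest):

```bash
# Tail the galera container log of a stuck replica.
oc logs -n openstack openstack-galera-0 -c galera --tail=50
```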
Another pod shows:
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       20m                  default-scheduler  Successfully assigned openstack/openstack-galera-2 to master-0
  Normal   AddedInterface  19m                  multus             Add eth0 [192.168.17.0/23] from ovn-kubernetes
  Normal   Pulled          19m                  kubelet            Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:13119fe0ec56263a2bff3fc9c3892ea6386b837955280d8488f05d8ea6c4e44d" already present on machine
  Normal   Created         19m                  kubelet            Created container mysql-bootstrap
  Normal   Started         19m                  kubelet            Started container mysql-bootstrap
  Normal   Pulled          19m                  kubelet            Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:13119fe0ec56263a2bff3fc9c3892ea6386b837955280d8488f05d8ea6c4e44d" already present on machine
  Normal   Created         19m                  kubelet            Created container galera
  Normal   Started         19m                  kubelet            Started container galera
  Warning  Unhealthy       15m (x6 over 17m)    kubelet            Readiness probe failed: command timed out
  Warning  Unhealthy       15m (x5 over 17m)    kubelet            Liveness probe failed: command timed out
  Warning  Unhealthy       14m (x13 over 16m)   kubelet            Readiness probe failed: wsrep_local_state_comment (Donor/Desynced) differs from Synced
  Warning  Unhealthy       8m51s                kubelet            Readiness probe failed: wsrep_local_state_comment (Initialized) differs from Synced
  Warning  Unhealthy       2m20s                kubelet            Startup probe failed: waiting for SST to finish
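The readiness check in these events is the `wsrep_local_state_comment` query; it can be run by hand to watch a replica's state (this mirrors the probe command shown in the events above, not a documented interface):

```bash
# Ask the node for its current Galera state (Synced, Donor/Desynced,
# Initialized, ...), exactly as the readiness probe does.
oc rsh -n openstack openstack-galera-2 \
  mysql -uroot -sNEe "show status like 'wsrep_local_state_comment';"
```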
Manually `rsh`-ing into a galera pod and running `detect_last_commit.sh` (https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/templates/galera/bin/detect_last_commit.sh), which the operator runs, also fails to get the seqno: https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/templates/galera/bin/detect_last_commit.sh#L110
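The seqno the script tries to recover ultimately comes from the Galera state file; a minimal manual check, assuming the standard data directory `/var/lib/mysql`:

```bash
# grastate.dat holds the cluster UUID and the last committed seqno;
# seqno: -1 means the node did not shut down cleanly and the seqno
# has to be recovered (which is what detect_last_commit.sh attempts).
oc rsh -n openstack openstack-galera-0 cat /var/lib/mysql/grastate.dat
```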
Current suspicion is that this was an environment issue with a bad storage PV for the pods.
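If a bad PV is suspected, checking the claims bound to the galera pods is a cheap first step:

```bash
# Check that every galera PVC is Bound and inspect the backing PVs.
oc get pvc -n openstack | grep galera
oc get pv
```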
Expected behavior
The update completes successfully.
Bug impact
- ctlplane services are blocked because the DB is down
Known workaround
- `rsh` into the pods that fail to get the seqno and delete the files holding the DB data.
- Restart the pods afterwards (see the sketch below).
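A sketch of the workaround, assuming the DB data lives under the standard `/var/lib/mysql`; this is destructive for the local replica, which will rejoin via SST from a healthy donor:

```bash
# DESTRUCTIVE: run only on the replica that fails to recover its seqno.
oc rsh -n openstack openstack-galera-0 bash -c 'rm -rf /var/lib/mysql/*'
# Delete the pod so the StatefulSet recreates it; the fresh replica
# resyncs from the cluster via SST.
oc delete pod -n openstack openstack-galera-0
```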