Red Hat OpenStack Services on OpenShift
OSPRH-18259

After update from 18.0.4 to FR3, Galera pods are down


      To Reproduce
      Steps to reproduce the behavior:
      This was seen when running the job described in the Google document "Run Trunk uni03gamma Update using testproject" linked to the ticket.

      All Galera pods are down:

      pod/openstack-cell1-galera-0                                          0/1     Running            3 (2m21s ago)    3h31m
      pod/openstack-cell1-galera-1                                          0/1     Running            2 (2m7s ago)     3h31m
      pod/openstack-cell1-galera-2                                          0/1     Running            4 (3m24s ago)    17m
      pod/openstack-galera-0                                                0/1     Running            5 (95s ago)      3h31m
      pod/openstack-galera-1                                                0/1     Running            4 (2m49s ago)    15m
      pod/openstack-galera-2                                                0/1     Running            2 (4m12s ago)    17m

       

      From e.g. the galera-0 pod describe output: https://sf.apps.int.gpc.ocp-hub.prod.psi.redhat.com/logs/03c/components-integration/03cadb3677924a64bacd3abc64905d7c/logs/controller-0/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack/pods/openstack-galera-0/openstack-galera-0-describe

      Events:
        Type     Reason     Age   From     Message
        ----     ------     ----  ----     -------
        Warning  Unhealthy  67m   kubelet  Readiness probe failed: + mysql -uroot -sNEe 'show status like '\''wsrep_local_state_comment'\'';'
      + grep -w -e Synced
      + tail -1
        Warning  Unhealthy  25m (x29 over 3h14m)  kubelet  Readiness probe failed: command timed out
        Warning  Unhealthy  20m (x28 over 155m)   kubelet  Liveness probe failed: command timed out
        Warning  Unhealthy  2m49s (x93 over 19m)  kubelet  Startup probe failed: /var/lib/operator-scripts/mysql_probe.sh: line 187: $2: unbound variable
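The `$2: unbound variable` startup-probe failure is the classic symptom of a bash script running under `set -u` (nounset) that references a positional parameter it was never passed. A minimal sketch of the failure mode and the usual defensive fix; the function names here are illustrative, not taken from `mysql_probe.sh`:

```shell
#!/bin/bash
set -u   # 'nounset': referencing an unset variable/parameter aborts the script

probe() {
    # Referencing $2 unconditionally fails with "$2: unbound variable"
    # whenever the caller passes fewer than two arguments:
    echo "arg2 is: $2"
}

safe_probe() {
    local arg2="${2:-}"   # default to empty instead of aborting
    echo "arg2 is: '${arg2}'"
}

safe_probe startup       # prints: arg2 is: ''
```

Under `set -u`, a call like `probe startup` aborts the whole probe script, which is consistent with the repeated startup-probe failures in the events above.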

      From the pod log, it is waiting for the gcomm URI to be configured:

      Running command: '/usr/local/bin/detect_gcomm_and_start.sh'
      ++ [[ -n '' ]]
      ++ [[ -n '' ]]
      + echo 'Running command: '\''/usr/local/bin/detect_gcomm_and_start.sh'\'''
      + umask 0022
      + exec /usr/local/bin/detect_gcomm_and_start.sh
      Waiting for gcomm URI to be configured for this POD
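The entrypoint blocks until the mariadb-operator publishes a gcomm URI for the pod, so while the pods stay unconfigured the startup probe keeps failing. A hedged sketch of that wait pattern, assuming the URI arrives via a file (the path is a hypothetical stand-in; the real mechanism lives in `detect_gcomm_and_start.sh`):

```shell
#!/bin/bash
# Sketch only: the operator's actual signaling mechanism may differ.
GCOMM_FILE="${GCOMM_FILE:-/tmp/gcomm_uri}"   # hypothetical location

wait_for_gcomm() {
    # Block until a non-empty gcomm URI shows up, then print it.
    until [ -s "$GCOMM_FILE" ]; do
        echo "Waiting for gcomm URI to be configured for this POD"
        sleep 1
    done
    cat "$GCOMM_FILE"
}
```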

      Another pod shows:

      Events:
        Type     Reason          Age                 From               Message
        ----     ------          ----                ----               -------
        Normal   Scheduled       20m                 default-scheduler  Successfully assigned openstack/openstack-galera-2 to master-0
        Normal   AddedInterface  19m                 multus             Add eth0 [192.168.17.0/23] from ovn-kubernetes
        Normal   Pulled          19m                 kubelet            Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:13119fe0ec56263a2bff3fc9c3892ea6386b837955280d8488f05d8ea6c4e44d" already present on machine
        Normal   Created         19m                 kubelet            Created container mysql-bootstrap
        Normal   Started         19m                 kubelet            Started container mysql-bootstrap
        Normal   Pulled          19m                 kubelet            Container image "registry.redhat.io/rhoso/openstack-mariadb-rhel9@sha256:13119fe0ec56263a2bff3fc9c3892ea6386b837955280d8488f05d8ea6c4e44d" already present on machine
        Normal   Created         19m                 kubelet            Created container galera
        Normal   Started         19m                 kubelet            Started container galera
        Warning  Unhealthy       15m (x6 over 17m)   kubelet            Readiness probe failed: command timed out
        Warning  Unhealthy       15m (x5 over 17m)   kubelet            Liveness probe failed: command timed out
        Warning  Unhealthy       14m (x13 over 16m)  kubelet            Readiness probe failed: wsrep_local_state_comment (Donor/Desynced) differs from Synced
        Warning  Unhealthy       8m51s               kubelet            Readiness probe failed: wsrep_local_state_comment (Initialized) differs from Synced
        Warning  Unhealthy       2m20s               kubelet            Startup probe failed: waiting for SST to finish
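The readiness events above show the gate the probe applies: the pod only becomes Ready once `wsrep_local_state_comment` equals `Synced`; `Donor/Desynced` and `Initialized` both fail it. A small sketch of that comparison, taking the raw status-query output on stdin so it can be exercised without a live server (the real probe logic is in `mysql_probe.sh`):

```shell
#!/bin/bash
# Sketch of the Synced gate; mirrors the probe's tail/grep pipeline.
is_synced() {
    # stdin: probe query output whose last line is the wsrep state value
    tail -1 | grep -qw -e Synced
}
```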

      Manually `rsh`-ing into a Galera pod and running `detect_last_commit.sh` (https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/templates/galera/bin/detect_last_commit.sh), which the operator runs, showed that it failed to get the seqno (https://github.com/openstack-k8s-operators/mariadb-operator/blob/main/templates/galera/bin/detect_last_commit.sh#L110).
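For context: Galera records its last committed position in `/var/lib/mysql/grastate.dat` (cluster uuid plus seqno). When that file is missing or unreadable, for example on a bad PV, the seqno cannot be read from disk and has to be recovered another way (typically by running the server with `--wsrep-recover`). A hedged sketch of the on-disk read, not the actual script:

```shell
#!/bin/bash
# Sketch: read the last committed seqno from grastate.dat. A value of -1
# means an unclean shutdown; a missing file means no recoverable state.
get_seqno() {
    local grastate="${1:-/var/lib/mysql/grastate.dat}"
    [ -r "$grastate" ] || return 1              # e.g. bad/empty PV
    awk '/^seqno:/ {print $2}' "$grastate"
}
```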

      The current suspicion is that this was an environment issue with a bad storage PV backing the pods.

      Expected behavior
      Update completes successfully.

      Bug impact

      • ctlplane services are blocked because the DB is down

      Known workaround

      • `rsh` into the pods which fail to get the seqno and delete the files holding the DB data.
      • restart the pods afterwards
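The workaround steps above can be sketched as the following `oc` sequence. The pod name is an example; adapt it to the pods that fail seqno detection. This is destructive: deleting the data directory forces a full SST from a healthy donor when the pod restarts. `DRY_RUN=1` (the default here) only prints the commands:

```shell
#!/bin/bash
# Hedged sketch of the workaround; pod name and DRY_RUN guard are
# illustrative. Deleting /var/lib/mysql/* discards the local DB copy,
# so on restart the pod rejoins via a full SST from a donor node.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"      # dry run: show the command instead of running it
    else
        "$@"
    fi
}

clean_and_restart() {
    local pod="$1"
    run oc rsh "$pod" -- bash -c 'rm -rf /var/lib/mysql/*'
    run oc delete pod "$pod"   # the StatefulSet recreates the pod
}

clean_and_restart openstack-galera-0
```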

        rhn-support-lmiccini Luca Miccini
        rhn-support-mschuppe Martin Schuppert
        rhos-dfg-pidone