Details
- Type: Bug
- Resolution: Duplicate
- Priority: Normal
Description
Deploy type
Manually deployed Kfdef
Version
v1.2.0
Environment
K8s Version: v1.25.11+1485cc9
OCP Version: 4.12.26_1555
Current Behavior
Sometimes the DSPA deployment gets stalled at:
```
2023-09-07T14:37:45Z INFO Performing Database Health Check {"namespace": "mission-code-data", "dspa_name": "pipelines-definition"}
```
Due to https://github.com/opendatahub-io/data-science-pipelines-operator/issues/280, after some time we see:

```
2023-09-07T14:39:57Z INFO Unable to connect to Database {"namespace": "mission-code-data", "dspa_name": "pipelines-definition"}
```
This message just repeats. The mariadb pod was the default one deployed by the operator, and I was able to access the pod, connect to the `mlpipeline` database (the default), and run `SELECT 1`, which appears to be the query we use to check the db connection.
```
~ $ oc port-forward -n mission-code-data service/mariadb-pipelines-definition 3306
Forwarding from 127.0.0.1:3306 -> 3306
Forwarding from [::1]:3306 -> 3306
Handling connection for 3306
~ $ mysql --host=127.0.0.1 --port=3306 --user=root

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mlpipeline         |
| mysql              |
| performance_schema |
+--------------------+
4 rows in set (0.082 sec)

MariaDB [(none)]> use mlpipeline
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

MariaDB [mlpipeline]> select 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.054 sec)
```
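For context on why the two log lines above repeat: the operator retries its health check until a connection timeout elapses. The loop behaves roughly like the following sketch (illustrative Python only; the actual operator is written in Go, and the timeout and interval values here are assumptions, not the operator's real parameters):

```python
import time

def wait_for_db(probe, timeout_s=120.0, interval_s=5.0):
    """Poll `probe` (a callable that runs `SELECT 1` against the database
    and returns True on success) until it succeeds or `timeout_s` elapses.
    Returns True if the database became reachable in time, else False."""
    deadline = time.monotonic() + timeout_s
    while True:
        print("Performing Database Health Check")
        if probe():
            return True
        if time.monotonic() >= deadline:
            print("Unable to connect to Database")
            return False
        time.sleep(interval_s)
```

In this bug, the operator's probe keeps failing even though the same `SELECT 1` succeeds when run manually against the pod, which is what makes the stall surprising.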
Expected Behavior
The DSPA comes up cleanly when the default mariadb connects, with no connection failures. Seeing "Performing Database Health Check" a couple of times is acceptable if the mariadb pod is still coming up, but once the pod is available this check should succeed quickly, within seconds.
Steps To Reproduce
Seems flaky; we do not yet know how to reproduce it consistently.
Workaround (if any)
Log in to your cluster via terminal using `oc login`, then execute the following:
```
# Set this to your dspa namespace (if using standalone) or your data science project (if using odh)
namespace=my-ds-project
dspa=pipelines-definition
patch='{"spec":{"database":{"disableHealthCheck":true}}}'
oc -n ${namespace} patch dspa ${dspa} --type=merge -p "${patch}"

# Can wait for the db connection timeout (takes ~5 min), or just delete the dsp operator pod:
oc delete -n odh-applications pod $(oc get pods -n odh-applications -l app.kubernetes.io/name=data-science-pipelines-operator --no-headers=true | awk '{print $1}')
```
Anything else
Note that when using the odh-dashboard, users are faced with the following prompt:

> We encountered an error creating or loading your pipeline server. To continue, delete this pipeline server and create a new one. Deleting this pipeline server will delete all of its resources, including pipelines, runs, and jobs.
Migrated from GitHub: https://github.com/opendatahub-io/data-science-pipelines-operator/issues/320