  1. Red Hat OpenShift AI Engineering
  2. RHOAIENG-1657

[Bug]: DSPO getting stuck at "Performing Database Health Check"

    Description

      Deploy type

      Manually deployed Kfdef

      Version

      v1.2.0

      Environment

      K8s Version: v1.25.11+1485cc9
      OCP Version: 4.12.26_1555

      Current Behavior

      Sometimes the DSPA deployment stalls at:

      2023-09-07T14:37:45Z	INFO	Performing Database Health Check	{"namespace": "mission-code-data", "dspa_name": "pipelines-definition"}
      

      Because of https://github.com/opendatahub-io/data-science-pipelines-operator/issues/280, after some time we see:

      2023-09-07T14:39:57Z	INFO	Unable to connect to Database	{"namespace": "mission-code-data", "dspa_name": "pipelines-definition"}
      

      This then repeats indefinitely. The MariaDB pod was the one deployed by default. I was able to connect to it, select the default `mlpipeline` database, and run `SELECT 1`, which appears to be the query the operator uses to test the database connection:

      ~ $ oc port-forward -n mission-code-data service/mariadb-pipelines-definition 3306
      Forwarding from 127.0.0.1:3306 -> 3306
      Forwarding from [::1]:3306 -> 3306
      Handling connection for 3306
      
      ~ $ mysql --host=127.0.0.1 --port=3306 --user=root
      MariaDB [(none)]> show databases;
      +--------------------+
      | Database           |
      +--------------------+
      | information_schema |
      | mlpipeline         |
      | mysql              |
      | performance_schema |
      +--------------------+
      4 rows in set (0.082 sec)
      
      MariaDB [(none)]> use mlpipeline
      Reading table information for completion of table and column names
      You can turn off this feature to get a quicker startup with -A
      
      Database changed
      MariaDB [mlpipeline]> select 1;
      +---+
      | 1 |
      +---+
      | 1 |
      +---+
      1 row in set (0.054 sec)
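
      The manual check above can also be repeated non-interactively. The following is a minimal sketch, assuming the port-forward from the transcript is still running and the `mysql` client is installed. `retry_until` is a hypothetical helper written for this report, not part of the operator, and its retry counts are arbitrary.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or ATTEMPTS runs have failed.
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

      For example, `retry_until 5 2 mysql --host=127.0.0.1 --port=3306 --user=root --execute='SELECT 1' mlpipeline` would try the same query up to five times, two seconds apart, and exit non-zero only if every attempt fails.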
      

      Expected Behavior

      The DSPA comes up without connection failures when the default MariaDB is used. "Performing Database Health Check" may appear a couple of times if the MariaDB pod is still starting, but once the pod is available the check should succeed within seconds.

      Steps To Reproduce

      Seems flaky, do not yet know how to consistently reproduce.
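
      Since the failure is intermittent, one way to spot an affected DSPA is to count the failure messages in the operator log. This is a rough sketch, assuming the log lines have the same shape as the two quoted under Current Behavior; `count_db_failures` is a hypothetical helper name.

```shell
# count_db_failures DSPA_NAME
# Reads operator log lines on stdin and prints how many
# "Unable to connect to Database" entries mention DSPA_NAME.
# Assumes the JSON suffix is formatted as in the log lines quoted in this report.
count_db_failures() {
  grep -F 'Unable to connect to Database' \
    | grep -c -F "\"dspa_name\": \"$1\"" || true
}
```

      It could be fed from the operator pod, for example `oc logs -n odh-applications -l app.kubernetes.io/name=data-science-pipelines-operator --tail=-1 | count_db_failures pipelines-definition`, using the namespace and label selector from the workaround in this report.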

      Workaround (if any)

      Log in to your cluster from a terminal using `oc login`, then run the following:

      # Set this to your DSPA namespace (if using standalone) or your data science project (if using ODH)
      namespace=my-ds-project
      
      # Name of the DSPA resource to patch
      dspa=pipelines-definition
      
      patch='{"spec":{"database":{"disableHealthCheck":true}}}'
      oc -n "${namespace}" patch dspa "${dspa}" --type=merge -p "${patch}"
      
      # Either wait for the DB connection timeout (~5 min) or delete the DSP operator pod
      oc delete -n odh-applications pod $(oc get pods -n odh-applications -l app.kubernetes.io/name=data-science-pipelines-operator --no-headers=true | awk '{print $1}')
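
      To confirm that the patch took effect, the field can be read back from the resource. This is a minimal sketch, assuming the `spec.database.disableHealthCheck` field from the patch above; `health_check_disabled` and the overridable `OC` variable are hypothetical names introduced here.

```shell
# Command used to talk to the cluster. Overridable so the function can be
# exercised without a live cluster.
OC="${OC:-oc}"

# health_check_disabled NAMESPACE DSPA_NAME
# Exits 0 if spec.database.disableHealthCheck reads back as "true",
# 1 if it is unset or false, and 2 if the lookup itself fails.
health_check_disabled() {
  value="$("$OC" -n "$1" get dspa "$2" \
    -o jsonpath='{.spec.database.disableHealthCheck}')" || return 2
  [ "$value" = "true" ]
}
```

      With the variables from the workaround, `health_check_disabled "${namespace}" "${dspa}" && echo "health check disabled"` prints the message only when the patch has been applied.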
      
      

      Anything else

      Note that users of the odh-dashboard see the following prompt:

      We encountered an error creating or loading your pipeline server. To continue, delete this pipeline server and create a new one. Deleting this pipeline server will delete all of its resources, including pipelines, runs, and jobs.
      

      Migrated from GitHub: https://github.com/opendatahub-io/data-science-pipelines-operator/issues/320

          People

            Unassigned Unassigned
            humairkhan Humair Khan
            RHOAI Data Science Pipelines