RHODS-7507 (Red Hat OpenShift Data Science)

Notebook controller raises many 409 errors

    • False
    • None
    • False
    • Testable
    • No
    • 1.26.0
    • No
    • No
    • Pending
    • None
    • RHODS 1.26

      Description of problem:

      During the notebook scale test execution, we noticed that the Kube APIServer reports many HTTP 409 errors (green line):

      In the APIServer audit logs on the master nodes, we can see that most of them (373 out of 413) are caused by this event:

      {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "level": "Metadata",
        "auditID": "fe485de8-b4f4-4f4d-abb0-dced2a59b39e",
        "stage": "ResponseComplete",
        "requestURI": "/api/v1/namespaces/rhods-notebooks/configmaps",
        "verb": "create",
        "user": {
          "username": "system:serviceaccount:redhat-ods-applications:odh-notebook-controller-manager",
          "uid": "c2174df4-f0be-46d4-ae36-4e19d77f51d4",
          "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:redhat-ods-applications",
            "system:authenticated"
          ],
          "extra": {
            "authentication.kubernetes.io/pod-name": [
              "odh-notebook-controller-manager-5f75589659-9jfcv"
            ],
            "authentication.kubernetes.io/pod-uid": [
              "ef723cfc-1cf7-4246-bef6-9de3506f1f49"
            ]
          }
        },
        "sourceIPs": [
          "10.0.168.13"
        ],
        "userAgent": "manager/v0.0.0 (linux/amd64) kubernetes/$Format",
        "objectRef": {
          "resource": "configmaps",
          "namespace": "rhods-notebooks",
          "name": "trusted-ca",
          "apiVersion": "v1"
        },
        "responseStatus": {
          "metadata": {},
          "status": "Failure",
          "message": "configmaps \"trusted-ca\" already exists",
          "reason": "AlreadyExists",
          "details": {
            "name": "trusted-ca",
            "kind": "configmaps"
          },
          "code": 409
        },
        "requestReceivedTimestamp": "2023-03-09T13:23:24.287913Z",
        "stageTimestamp": "2023-03-09T13:23:24.294380Z",
        "annotations": {
          "authorization.k8s.io/decision": "allow",
          "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"odh-notebook-controller-manager-rolebinding\" of ClusterRole \"odh-notebook-controller-manager-role\" to ServiceAccount \"odh-notebook-controller-manager/redhat-ods-applications\""
        }
      }
      

      For better troubleshooting, this error should be avoided whenever possible.

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      1. create notebooks
      2. check the client error in prometheus:
      'sum by (code) (increase(apiserver_request_total{code=~"4.."}[2m]))'
      
      3. check the kube-apiserver logs with:
      oc debug node/<master nodes>
      chroot /host
      tail -f /var/log/kube-apiserver/audit.log | grep '"code":409' | grep trusted-ca
      

      Actual results:

      see many matching lines

      Expected results:

      see no matching line

      Reproducibility (Always/Intermittent/Only Once):

      always

      Build Details:

      Workaround:

      Additional info:

      This is likely caused by this unconditional Create call.

      	trustedCAConfigMap := &corev1.ConfigMap{
      		ObjectMeta: metav1.ObjectMeta{
      			Name:      "trusted-ca",
      			Namespace: notebook.Namespace,
      			Labels:    map[string]string{"config.openshift.io/inject-trusted-cabundle": "true"},
      		},
      	}
      
      	err := r.Client.Create(ctx, trustedCAConfigMap)
      	if err != nil {
      		if apierrs.IsAlreadyExists(err) {
      			return nil
      		}
      	}
      

      The Create call could be turned into:

      • get the resource
      • if it doesn't exist, create it
      • if it exists and is not identical, update it

      This last check (currently missing) is part of the controller's duty.
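
      The get / create / update flow can be sketched as below. This is only an illustration under stated assumptions, not the controller's code: ConfigMap, Store, MemStore, the two error values and EnsureConfigMap are hypothetical stand-ins, so that the logic can run without a cluster. In the real controller the same three calls would go through the Kubernetes client. Only the labels are compared here, on the assumption that the ConfigMap data is filled in by another component because of the inject-trusted-cabundle label.

```go
package main

import (
	"errors"
	"reflect"
)

// ConfigMap is a hypothetical, stripped-down stand-in for corev1.ConfigMap.
type ConfigMap struct {
	Name      string
	Namespace string
	Labels    map[string]string
}

// Sentinel errors playing the role of the NotFound / AlreadyExists API errors.
var (
	ErrNotFound      = errors.New("not found")
	ErrAlreadyExists = errors.New("already exists")
)

// Store is a hypothetical stand-in for the client used by the controller.
type Store interface {
	Get(namespace, name string) (*ConfigMap, error)
	Create(cm *ConfigMap) error
	Update(cm *ConfigMap) error
}

// EnsureConfigMap reads first, creates only when the object is missing,
// and updates only when the stored labels differ from the desired ones.
func EnsureConfigMap(s Store, desired *ConfigMap) error {
	found, err := s.Get(desired.Namespace, desired.Name)
	if errors.Is(err, ErrNotFound) {
		err = s.Create(desired)
		if errors.Is(err, ErrAlreadyExists) {
			return nil // another reconcile created it between Get and Create
		}
		return err
	}
	if err != nil {
		return err // unexpected errors are returned, not swallowed
	}
	if reflect.DeepEqual(found.Labels, desired.Labels) {
		return nil // already in the desired state: no write request is sent
	}
	found.Labels = desired.Labels
	return s.Update(found)
}

// MemStore is an in-memory Store that counts write requests.
type MemStore struct {
	Objects map[string]*ConfigMap
	Creates int
	Updates int
}

func NewMemStore() *MemStore {
	return &MemStore{Objects: make(map[string]*ConfigMap)}
}

func (m *MemStore) Get(namespace, name string) (*ConfigMap, error) {
	cm, ok := m.Objects[namespace+"/"+name]
	if !ok {
		return nil, ErrNotFound
	}
	copied := *cm // hand out a copy so callers do not edit the stored object
	return &copied, nil
}

func (m *MemStore) Create(cm *ConfigMap) error {
	m.Creates++ // counts every create request, including rejected ones
	key := cm.Namespace + "/" + cm.Name
	if _, ok := m.Objects[key]; ok {
		return ErrAlreadyExists
	}
	m.Objects[key] = cm
	return nil
}

func (m *MemStore) Update(cm *ConfigMap) error {
	m.Updates++
	key := cm.Namespace + "/" + cm.Name
	if _, ok := m.Objects[key]; !ok {
		return ErrNotFound
	}
	m.Objects[key] = cm
	return nil
}
```

      With this shape, repeated reconciles of notebooks in the same namespace issue one create request in total and only reads afterwards, so the steady state produces no 409 at all. A concurrent reconcile can still lose the race between the get and the create, which is why the AlreadyExists case is still tolerated.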

        1. screenshot-4.png (87 kB)
        2. screenshot-3.png (86 kB)
        3. screenshot-2.png (169 kB)
        4. screenshot-1.png (126 kB)
        5. image-2023-03-10-10-11-05-095.png (93 kB)

              vhire Vaishnavi Hire
              kpouget2 Kevin Pouget
              Kevin Pouget Kevin Pouget