OpenShift Bugs / OCPBUGS-75881

jobset-controller-manager can't be ready with ProbeError and Unhealthy

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.19.z, 4.21
    • Component/s: JobSet
    • Severity: Moderate
    • Status: In Progress
    • Release Note Type: Bug Fix
    • Clearing the `openshift.io/node-selector` annotation disables the defaultNodeSelector if it is configured in the cluster, because the `oc adm restart-kubelet` and `oc adm copy-to-node` commands need to run on any node type.

      Description of problem:

      The jobset-controller-manager pod never becomes ready; the kubelet reports Unhealthy and ProbeError events:

      Events:
        Type     Reason          Age                    From               Message
        ----     ------          ----                   ----               -------
        Normal   Scheduled       29m                    default-scheduler  Successfully assigned openshift-jobset-operator/jobset-controller-manager-59b8f68c49-mdgnj to ip-10-0-73-53.us-east-2.compute.internal
        Normal   AddedInterface  29m                    multus             Add eth0 [10.128.8.18/23] from ovn-kubernetes
        Normal   Pulling         29m                    kubelet            Pulling image "quay.io/zhouying7780/jobset:js01"
        Normal   Pulled          29m                    kubelet            Successfully pulled image "quay.io/zhouying7780/jobset:js01" in 16.398s (16.398s including waiting). Image size: 76074147 bytes.
        Normal   Created         29m                    kubelet            Created container: manager
        Normal   Started         29m                    kubelet            Started container manager
        Warning  Unhealthy       27m (x12 over 29m)     kubelet            Readiness probe failed: Get "http://10.128.8.18:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        Warning  ProbeError      3m58s (x160 over 29m)  kubelet            Readiness probe error: Get "http://10.128.8.18:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
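
      The failing readiness endpoint can also be probed manually; this is a diagnostic sketch that assumes cluster access, with the node name and pod IP taken from the events above:

      ```shell
      # Probe the controller's /readyz endpoint from the node that runs the pod;
      # a 5s client timeout reproduces the kubelet's probe failure condition.
      oc debug node/ip-10-0-73-53.us-east-2.compute.internal -- \
        chroot /host curl -sS -m 5 http://10.128.8.18:8081/readyz
      ```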


      Version-Release number of selected component (if applicable):

      •  main branch

      How reproducible:

      Always

      Steps to Reproduce:

      Step1. Build the jobset operator from the main branch.

      Step2. Build the operand image.

      Step3. Update the `.spec.install.spec.deployments[0].spec.template.spec.containers[0].image` field in the JobSet CSV under `manifests/jobset-operator.clusterserviceversion.yaml` to point to the newly built image.

      Step4. Build the bundle and index image.

      Step5. Use the index image to create a CatalogSource.

      Step6. From the console, install the cert-manager and JobSet operators.
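
      Step5 can be sketched as a CatalogSource manifest; the catalog name, display name, and index image reference below are hypothetical placeholders, not values from this report:

      ```yaml
      # Hypothetical example: substitute the index image built in Step4.
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      metadata:
        name: jobset-test-catalog           # placeholder name
        namespace: openshift-marketplace
      spec:
        sourceType: grpc
        image: quay.io/example/jobset-index:latest   # placeholder index image
        displayName: JobSet Test Catalog
      ```

      Once applied with `oc apply -f`, OLM serves the catalog and the operator appears in the console's OperatorHub for Step6.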

      Actual results:

      The jobset-controller-manager pod is not ready:

      (Same Unhealthy and ProbeError readiness-probe events as shown in the description above.)

      Expected results:

      The jobset-controller-manager pod becomes ready and runs with no errors.

      Additional information:

      Logs from the jobset-controller-manager pod show an RBAC "forbidden" error:
      2026-02-04T03:31:49Z ERROR controller-runtime.cache.UnhandledError Failed to watch {"reflector": "sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:114", "type": "*v1.Pod", "error": "failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:openshift-jobset-operator:jobset-controller-manager\" cannot list resource \"pods\" in API group \"\" at the cluster scope"}
      k8s.io/apimachinery/pkg/util/runtime.logError
      /workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:221
      k8s.io/apimachinery/pkg/util/runtime.handleError
      /workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:212
      k8s.io/apimachinery/pkg/util/runtime.HandleErrorWithContext
      /workspace/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:198
      k8s.io/client-go/tools/cache.DefaultWatchErrorHandler
      /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:204
      k8s.io/client-go/tools/cache.(*Reflector).RunWithContext.func1
      /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:370
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
      /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:233
      k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext.func1
      /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:255
      k8s.io/apimachinery/pkg/util/wait.BackoffUntilWithContext
      /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:256
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil
      /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:233
      k8s.io/client-go/tools/cache.(*Reflector).RunWithContext
      /workspace/vendor/k8s.io/client-go/tools/cache/reflector.go:368
      k8s.io/client-go/tools/cache.(*controller).RunWithContext.(*Group).StartWithContext.func3
      /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:63
      k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
      /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72
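
      The forbidden error above can be confirmed independently of the controller by impersonating its service account (a diagnostic sketch; requires cluster access):

      ```shell
      # Check whether the jobset-controller-manager service account may list pods
      # cluster-wide; "no" confirms the missing RBAC rule reported in the log.
      oc auth can-i list pods \
        --as=system:serviceaccount:openshift-jobset-operator:jobset-controller-manager
      ```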
      Workaround (suggested by an AI assistant): applying the following additional ClusterRole and ClusterRoleBinding fixes the issue:
      cat <<EOF | oc apply -f -
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: jobset-controller-manager-full-access
      rules:
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: ["jobset.x-k8s.io"]
        resources: ["jobsets", "jobsets/status", "jobsets/finalizers"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: [""]
        resources: ["pods", "services", "events", "configmaps"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: jobset-controller-manager-full-binding
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: jobset-controller-manager-full-access
      subjects:
      - kind: ServiceAccount
        name: jobset-controller-manager
        namespace: openshift-jobset-operator
      EOF
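
      After applying the workaround, readiness can be verified (a sketch assuming the namespace and deployment name from this report):

      ```shell
      # Wait for the deployment to report Ready once the RBAC fix is in place,
      # then list the pods to confirm the READY column.
      oc rollout status deployment/jobset-controller-manager \
        -n openshift-jobset-operator --timeout=120s
      oc get pods -n openshift-jobset-operator
      ```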

              Assignee: Unassigned
              Reporter: Ying Zhou (yinzhou@redhat.com)