
[bug] The RBAC used on Sonobuoy SA stuck the cluster upgrades on Y-stream

    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: opct-v0.3.0
    • Affects Version: None
    • Component: CLI

      BUG Description

      The RBAC created by the CLI is blocking cluster upgrades, which is impacting development of the 'cluster upgrade' feature.

      Steps to reproduce:

      • Run OPCT
      • Run the cluster upgrade (manually, via openshift-tests run-upgrade, or through the CLI with the development feature)
      • The cluster operator service-ca gets stuck on "Progressing..."
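
      A quick way to observe the stuck operator (standard oc commands, noted here only as a hint; the PROGRESSING column stays True while the upgrade hangs):

      $ oc get clusteroperator service-ca
      $ oc describe clusteroperator service-ca   # inspect the Progressing condition for details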

      I was also able to reproduce it in different scenarios:

      • S1) Running upgrade with existing permissions [1]
      • S2) Running upgrade without setting permissions
      • S3) Running the upgrade with the SCC setup used by kube-cert [2] (without [1])

      S1 and S3 produced the same errors.

      In S2, the cluster upgraded successfully, but Sonobuoy got stuck (another blocker, handled by SPLAT-876).

      As described in the KCS [3], the ClusterOperator service-ca getting stuck can be caused by changes to the system groups. As shown in [1], the CLI associates the group system:serviceaccounts with the anyuid SCC:
      $ oc adm policy who-can use scc anyuid | grep serviceaccounts
              system:serviceaccounts
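
      For illustration only (a reconstruction based on the output above, not the exact object created by the CLI), a ClusterRoleBinding of this shape is what grants the anyuid SCC to every service account in the cluster:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: opct-anyuid-group            # hypothetical name, for illustration only
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:openshift:scc:anyuid
      subjects:
      - apiGroup: rbac.authorization.k8s.io
        kind: Group
        name: system:serviceaccounts       # matches every service account in the cluster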
       

      ENGINEERING DETAILS

      [1] https://github.com/redhat-openshift-ecosystem/provider-certification-tool/blob/main/pkg/run/run.go#L193-L249

      [2] https://github.com/cncf/k8s-conformance/tree/master/v1.24/openshift#run-conformance-tests

      [3] https://access.redhat.com/solutions/5875621 

       

            [OPCT-6] [bug] The RBAC used on Sonobuoy SA stuck the cluster upgrades on Y-stream

            Marco Braga added a comment -

            PR #34 fixes the *cluster upgrade* RBAC when running 4.10->4.11. Extensive testing has been done using both the regular execution and the upgrade feature.

            Closing this card.


            Marco Braga added a comment -

            The PR #34 resolves the cluster upgrade issues.

            The PR creates the custom resources (ServiceAccount, ClusterRole, and ClusterRoleBinding) and tells the backend (Sonobuoy) not to create them.

            The ClusterRole is highly privileged by default, as Sonobuoy must run with elevated privileges. (This can be revisited in the future once pod security admission is fully reviewed upstream - see the open reference issue https://github.com/vmware-tanzu/sonobuoy/issues/1858)
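
            After the fix, a quick check of the binding scope (the same who-can query used in the bug description; assuming the fix is applied, the system:serviceaccounts group should no longer be listed):

            $ oc adm policy who-can use scc anyuid | grep serviceaccount
            # expected: the namespaced sonobuoy-serviceaccount subject only, not the cluster-wide system:serviceaccounts group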


            Marco Braga added a comment -

            The issue described in the last comment, regarding the number of failures, is handled by the card https://issues.redhat.com/browse/SPLAT-909. There I detail the number of failures compared with the previous version (v0.1).

            The current card should stay focused on the RBAC changes needed in OPCT to avoid getting stuck when upgrading the cluster, mainly 4.10->4.11. The proposal in the PR mentioned above (provider-certification-tool/pull/34) resolves the stuck problem and allows the cluster to finish the upgrade; we still need to assess the impact and make sure it does not introduce regressions.


            Marco Braga added a comment -

            Running the full execution against the latest CLI version (main branch), the results do not look good:

            • Current PR (exec timestamp 202211220441)

             

            $ ./openshift-provider-cert-linux-amd64-main results  202211220441_sonobuoy_bd6c044b-f8b7-42a7-980a-c0bb70a08890.tar.gz  |grep 'Run Details' -A 3
            Run Details:
            API Server version: v1.23.5+012e945
            Node health: 7/7 (100%)
            Pods health: 244/248 (98%)
            
            $ ./openshift-provider-cert-linux-amd64-main results -p 20-openshift-conformance-validated 202211220441_sonobuoy_bd6c044b-f8b7-42a7-980a-c0bb70a08890.tar.gz  |head -n 7
            Plugin: 20-openshift-conformance-validated
            Status: failed
            Total: 3375
            Passed: 1295
            Failed: 82
            Skipped: 1998
             

             

            • `main` branch - ran after PR version (existing cluster - exec timestamp 202211221130)

             

            $ ./openshift-provider-cert-linux-amd64-main results  202211221130_sonobuoy_74c57454-0955-4617-9ce8-888c772a0713.tar.gz  |grep 'Run Details' -A 3
            Run Details:
            API Server version: v1.23.5+012e945
            Node health: 7/7 (100%)
            Pods health: 243/247 (98%)
            
            $ ./openshift-provider-cert-linux-amd64-main results -p openshift-conformance-validated 202211221130_sonobuoy_74c57454-0955-4617-9ce8-888c772a0713.tar.gz  |head -n 7
            Plugin: openshift-conformance-validated
            Status: failed
            Total: 3375
            Passed: 1346
            Failed: 30
            Skipped: 1999
            

             

            We need to dig into the failures to understand what caused them to increase almost 3x. It seems the new RBAC may impact the runtime of the openshift-tests utility.

            Some action items I can see in this case:

            • check the failure details - mainly using the main branch as a baseline
            • check which RBAC the upgrade job (openshift-tests run-upgrade) is using on CI
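
            One possible way to compare the failed tests between the two archives, assuming the sonobuoy CLI is available locally and that its results command supports the --plugin and --mode flags (not verified against the OPCT wrapper):

            $ sonobuoy results 202211220441_sonobuoy_bd6c044b-f8b7-42a7-980a-c0bb70a08890.tar.gz \
                --plugin 20-openshift-conformance-validated --mode detailed \
                | jq -r 'select(.status=="failed") | .name' | sort > failed-pr.txt
            $ sonobuoy results 202211221130_sonobuoy_74c57454-0955-4617-9ce8-888c772a0713.tar.gz \
                --plugin openshift-conformance-validated --mode detailed \
                | jq -r 'select(.status=="failed") | .name' | sort > failed-main.txt
            $ diff failed-pr.txt failed-main.txt   # failures only in the PR run are candidates for RBAC-related breakage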

             


            Marco Braga added a comment - - edited

            https://github.com/redhat-openshift-ecosystem/provider-certification-tool/pull/34

            PR submitted, doing the equivalent of the following commands:

            $ oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount
            $ oc adm policy add-scc-to-user privileged -z sonobuoy-serviceaccount 

            Tracing what `oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount` does under the hood:

            $ oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount --loglevel 10 -n openshift-provider-certification
                    round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, *\/*" -H "User-Agent: oc/4.11.0 (linux/amd64) kubernetes/5e53738"
                        'https://api.mrbnone4103018.devcluster.openshift.com:6443/apis/rbac.authorization.k8s.io/v1/clusterroles/system:openshift:scc:anyuid'
                    round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, *\/*" -H "User-Agent: oc/4.11.0 (linux/amd64) kubernetes/5e53738"
                        'https://api.mrbnone4103018.devcluster.openshift.com:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-provider-certification/rolebindings/system:openshift:scc:anyuid'
                    request.go:1073] Request Body: {
                        "kind":"RoleBinding",
                        "apiVersion":"rbac.authorization.k8s.io/v1",
                        "metadata":{
                            "name":"system:openshift:scc:anyuid",
                            "namespace":"openshift-provider-certification",
                            "creationTimestamp":null
                        },
                        "subjects":[{
                            "kind":"ServiceAccount",
                            "name":"sonobuoy-serviceaccount",
                            "namespace":"openshift-provider-certification"
                            }],
                        "roleRef":{
                            "apiGroup":"rbac.authorization.k8s.io",
                            "kind":"ClusterRole",
                            "name":"system:openshift:scc:anyuid"
                        }}
                    round_trippers.go:466] curl -v -XPOST  -H "Accept: application/json, *\/*" -H "Content-Type: application/json" -H "User-Agent: oc/4.11.0 (linux/amd64) kubernetes/5e53738"
                        'https://api.mrbnone4103018.devcluster.openshift.com:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-provider-certification/rolebindings'
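
            For reference, the request body above corresponds to this declarative RoleBinding (the same content, rewritten as YAML):

            apiVersion: rbac.authorization.k8s.io/v1
            kind: RoleBinding
            metadata:
              name: system:openshift:scc:anyuid
              namespace: openshift-provider-certification
            subjects:
            - kind: ServiceAccount
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: system:openshift:scc:anyuid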
             


            Marco Braga added a comment - - edited

            I can confirm that the problem does not happen on z-stream updates, only on y-stream updates.

            I can also see that the service-ca pod is not restarted during z-stream updates. I wonder if the projected service account token is being renewed, generating the errors on the aggregator pod (sonobuoy), since the workloads in the certification namespace are not touched while the update runs. If confirmed, we may have a blocker for running the certification environment inside the cluster, even when pausing the MCP for the dedicated node.
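
            One way to check whether the aggregator pod is using a projected, auto-rotated ServiceAccount token (standard oc commands, added only as a verification hint):

            $ oc get pod sonobuoy -n openshift-provider-certification -o yaml | grep -B2 -A3 serviceAccountToken
            # a projected serviceAccountToken source with an expirationSeconds field means the kubelet rotates the token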

            I also needed to avoid using the system:serviceaccounts group on the ClusterRoleBinding, as it seems to get service-ca stuck (similar behavior is detailed in the KCS [1]):

             

            oc adm policy who-can use scc anyuid | grep serviceaccounts
                    system:serviceaccounts 

             

            Steps to reproduce and get a successful upgrade - while still getting stuck with the permission errors in the aggregator logs (described in the body of this card):

            • Create a paused MCP selecting nodes labeled node-role.kubernetes.io/tests='' (sketch below)
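            A minimal sketch of such a paused pool, assuming a worker-derived MCP named tests (the exact pool definition is not part of this card):
            cat << EOF | oc create -f -
            apiVersion: machineconfiguration.openshift.io/v1
            kind: MachineConfigPool
            metadata:
              name: tests
            spec:
              paused: true   # prevent the MCO from updating/rebooting the dedicated node during the upgrade
              machineConfigSelector:
                matchExpressions:
                - key: machineconfiguration.openshift.io/role
                  operator: In
                  values: [worker, tests]
              nodeSelector:
                matchLabels:
                  node-role.kubernetes.io/tests: ""
            EOF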
            • Pre-create the namespace, service account, and cluster roles
            cat << EOF | oc create -f -
            ---
            apiVersion: v1
            kind: Namespace
            metadata:
              name: openshift-provider-certification
              annotations:
                openshift.io/node-selector: node-role.kubernetes.io/tests=
                scheduler.alpha.kubernetes.io/defaultTolerations: '[{"key":"node-role.kubernetes.io/tests","operator":"Exists","effect":"NoSchedule"}]'
            ---
            apiVersion: v1
            kind: ServiceAccount
            metadata:
              labels:
                component: sonobuoy
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: opct-scc-anyuid
            rules:
            - apiGroups:
              - security.openshift.io
              resourceNames:
              - anyuid
              resources:
              - securitycontextconstraints
              verbs:
              - use
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: opct-scc-privileged
            rules:
            - apiGroups:
              - security.openshift.io
              resourceNames:
              - privileged
              resources:
              - securitycontextconstraints
              verbs:
              - use
            EOF
            • Add the service account to the anyuid and privileged SCCs
            oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount 
            oc adm policy add-scc-to-user privileged -z sonobuoy-serviceaccount 
            • Patch the CLI to allow reusing existing namespaces
            • Patch the CLI to create ClusterRoleBindings for the recently created ClusterRoles and ServiceAccount, instead of the system:serviceaccounts and system:authenticated groups (example below)
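            A sketch of the bindings such a patch would create, binding the ClusterRoles defined above to only the Sonobuoy ServiceAccount (a manual equivalent; not necessarily identical to what PR #34 ended up generating):
            cat << EOF | oc create -f -
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRoleBinding
            metadata:
              name: opct-scc-anyuid
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: opct-scc-anyuid
            subjects:
            - kind: ServiceAccount
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRoleBinding
            metadata:
              name: opct-scc-privileged
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: opct-scc-privileged
            subjects:
            - kind: ServiceAccount
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            EOF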
            • Run the patched CLI to perform a y-stream update (example from a 4.10.30 cluster):
            $ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.4 -o jsonpath={.image}) 
            • The CLI status reporting gets stuck after the upgrade has started:
            Fri, 18 Nov 2022 19:09:01 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete), waiting on kube-apiserver
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[51/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
            Fri, 18 Nov 2022 19:09:11 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete)
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[53/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
            (...) 
            Fri, 18 Nov 2022 21:36:41 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete)
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[65/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
            
            • The sonobuoy aggregator reports the errors in its logs:
            $ oc logs sonobuoy  -n openshift-provider-certification  --tail 10 -f 
            
            time="2022-11-19T00:37:03Z" level=info msg="couldn't annotate sonobuoy pod" error="couldn't patch pod annotation: pods \"sonobuoy\" is forbidden: unable to validate against any security context constraint: [provider restricted-v2: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, spec.containers[0].securityContext.runAsUser: Invalid value: 1000: must be in the ranges: [1000650000, 1000659999], provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider machine-api-termination-handler: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, spec.volumes[0]: Invalid value: \"configMap\": configMap volumes are not allowed to be used, spec.volumes[1]: Invalid value: \"configMap\": configMap volumes are not allowed to be used, spec.volumes[2]: Invalid value: \"emptyDir\": emptyDir volumes are not allowed to be used, spec.volumes[3]: Invalid value: \"projected\": projected volumes are not allowed to be used, provider hostnetwork-v2: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider hostnetwork: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider hostaccess: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group]"
            
            time="2022-11-19T00:37:07Z" level=info msg="received request" client_cert="[20-openshift-conformance-validated]" method=POST plugin_name=20-openshift-conformance-validated url=/api/v1/progress/global/20-openshift-conformance-validated
            
            time="2022-11-19T00:37:10Z" level=info msg="received request" client_cert="[99-openshift-artifacts-collector]" method=POST plugin_name=99-openshift-artifacts-collector url=/api/v1/progress/global/99-openshift-artifacts-collector
            • Nevertheless, the upgrade continues normally and finishes, and the cluster-upgrade plugin completes successfully.
            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.11.4    True        False         125m    Cluster version is 4.11.4
            • The next plugin gets stuck, as the aggregator can't update the plugins' status (a dependency of the plugin blocker engine):
            $ oc get pods -n openshift-provider-certification
            NAME                                                               READY   STATUS      RESTARTS   AGE
            sonobuoy                                                           1/1     Running     0          159m
            sonobuoy-05-openshift-cluster-upgrade-job-62b87a8c593344b9         0/3     Completed   0          159m
            sonobuoy-10-openshift-kube-conformance-job-4aaafda465f04268        3/3     Running     0          159m
            sonobuoy-20-openshift-conformance-validated-job-dd49f61f180b47ef   3/3     Running     0          159m
            sonobuoy-99-openshift-artifacts-collector-job-1e6c9cd771754cac     3/3     Running     0          159m
            
            $ ./openshift-provider-cert-linux-amd64 status
            Fri, 18 Nov 2022 21:37:02 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete)
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[65/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
             
            • NOTE: After destroying the environment, manually create the Namespace, ServiceAccount, and ClusterRoles (as described above), then run the tool again to perform a z-stream update:
            $ ./openshift-provider-cert-linux-amd64 destroy
            # create resources manually (NS, SA, ClusterRoles)
            $ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.7 -o jsonpath={.image}) 
            • The upgrade finished correctly:
            $ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.7 -o jsonpath={.image})
            (...)
            Fri, 18 Nov 2022 22:09:08 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=4.11.4=upgrade-progressing-False           
            10-openshift-kube-conformance      | running    |            | 0/352 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[0/100]
            20-openshift-conformance-validated | running    |            | 0/3487 (0 failures)       | status=waiting-for=10-openshift-kube-conformance=(0/0/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=waiting-for=20-openshift-conformance-validated=(0/0/0)=[0/100]
            (...)
            Fri, 18 Nov 2022 22:36:58 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.7: 635 of 803 done (79% complete)
            10-openshift-kube-conformance      | running    |            | 0/352 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[99/100]
            20-openshift-conformance-validated | running    |            | 0/3487 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-352/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3487/0)=[0/100]
            (...)
            INFO[2022-11-18T23:13:38-03:00] Waiting for post-processor...                
            Fri, 18 Nov 2022 23:16:48 -03> Global Status: complete
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | complete   | passed     | 0/0 (0 failures)          | Total tests processed: 36 (36 pass / 0 failed)    
            10-openshift-kube-conformance      | complete   | failed     | 20/20 (0 failures)        | Total tests processed: 23 (22 pass / 1 failed)    
            20-openshift-conformance-validated | complete   | failed     | 20/20 (0 failures)        | Total tests processed: 12 (11 pass / 1 failed)    
            99-openshift-artifacts-collector   | complete   | passed     | 0/0 (0 failures)          | Total tests processed: 3 (3 pass / 0 failed)      
            INFO[2022-11-18T23:16:48-03:00] The execution has completed! Use retrieve command to collect the results and share the archive with your Red Hat partner. 
            
            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.11.7    True        False         17m     Cluster version is 4.11.7
            

             

            [1] https://access.redhat.com/solutions/5875621


              Assignee: Marco Braga (rhn-support-mrbraga)
              Reporter: Marco Braga (rhn-support-mrbraga)