
[bug] The RBAC used on Sonobuoy SA stuck the cluster upgrades on Y-stream

    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: opct-v0.3.0
    • Affects Version: None
    • Component: CLI

      BUG Description

      The RBAC created by the CLI is blocking cluster upgrades, which is impacting development of the 'cluster upgrade' feature.

      Steps to reproduce:

      • Run OPCT
      • Run the cluster upgrade (manually, via openshift-tests run-upgrade, or through the CLI with the development feature)
      • The cluster operator service-ca gets stuck on "Progressing..."
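
      A quick way to observe the stuck operator (standard oc commands, noted here only as a hint; the PROGRESSING column stays True while the upgrade hangs):

      $ oc get clusteroperator service-ca
      $ oc describe clusteroperator service-ca   # inspect the Progressing condition for details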

      I was also able to reproduce it in different scenarios:

      • S1) Running upgrade with existing permissions [1]
      • S2) Running upgrade without setting permissions
      • S3) Running the upgrade with the SCC setup used by kube-cert [2] (without [1])

      S1 and S3 produced the same errors.

      In S2, the cluster upgraded successfully, but Sonobuoy got stuck (another blocker, handled by SPLAT-876).

      As described in the KCS [3], the ClusterOperator service-ca getting stuck can be caused by changes to the system groups. As shown in [1], the CLI associates the group system:serviceaccounts with the anyuid SCC:
      $ oc adm policy who-can use scc anyuid | grep serviceaccounts
              system:serviceaccounts
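
      For illustration only (a reconstruction based on the output above, not the exact object created by the CLI), a ClusterRoleBinding of this shape is what grants the anyuid SCC to every service account in the cluster:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: opct-anyuid-group            # hypothetical name, for illustration only
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:openshift:scc:anyuid
      subjects:
      - apiGroup: rbac.authorization.k8s.io
        kind: Group
        name: system:serviceaccounts       # matches every service account in the cluster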
       

      ENGINEERING DETAILS

      [1] https://github.com/redhat-openshift-ecosystem/provider-certification-tool/blob/main/pkg/run/run.go#L193-L249

      [2] https://github.com/cncf/k8s-conformance/tree/master/v1.24/openshift#run-conformance-tests

      [3] https://access.redhat.com/solutions/5875621 

       

            [OPCT-6] [bug] The RBAC used on Sonobuoy SA stuck the cluster upgrades on Y-stream

            Marco Braga added a comment -

            PR #34 fixes the *cluster upgrade* RBAC when running 4.10->4.11. Extensive testing has been done using both the regular execution and the upgrade feature.

            Closing this card.


            Marco Braga added a comment -

            The PR #34 resolves the cluster upgrade issues.

            The PR creates the custom resources (ServiceAccount, ClusterRole, and ClusterRoleBinding) and tells the backend (Sonobuoy) not to create them.

            The ClusterRole is highly privileged by default, as Sonobuoy must run with elevated privileges. (This can be revisited in the future once pod security admission is fully reviewed upstream - see the open reference issue https://github.com/vmware-tanzu/sonobuoy/issues/1858)
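
            After the fix, a quick check of the binding scope (the same who-can query used in the bug description; assuming the fix is applied, the system:serviceaccounts group should no longer be listed):

            $ oc adm policy who-can use scc anyuid | grep serviceaccount
            # expected: the namespaced sonobuoy-serviceaccount subject only, not the cluster-wide system:serviceaccounts group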


            Marco Braga added a comment -

            The issue described in the last comment, regarding the number of failures, is handled by the card https://issues.redhat.com/browse/SPLAT-909. There I detail the number of failures compared with the previous version (v0.1).

            The current card should stay focused on the RBAC changes needed in OPCT to avoid getting stuck when upgrading the cluster, mainly 4.10->4.11. The proposal in the PR mentioned above (provider-certification-tool/pull/34) resolves the stuck problem and allows the cluster to finish the upgrade; we still need to assess the impact and make sure it does not introduce regressions.


            Marco Braga added a comment -

            Running the full execution against the latest CLI version (main branch), the results do not look good:

            • Current PR (exec timestamp 202211220441)

             

            $ ./openshift-provider-cert-linux-amd64-main results  202211220441_sonobuoy_bd6c044b-f8b7-42a7-980a-c0bb70a08890.tar.gz  |grep 'Run Details' -A 3
            Run Details:
            API Server version: v1.23.5+012e945
            Node health: 7/7 (100%)
            Pods health: 244/248 (98%)
            
            $ ./openshift-provider-cert-linux-amd64-main results -p 20-openshift-conformance-validated 202211220441_sonobuoy_bd6c044b-f8b7-42a7-980a-c0bb70a08890.tar.gz  |head -n 7
            Plugin: 20-openshift-conformance-validated
            Status: failed
            Total: 3375
            Passed: 1295
            Failed: 82
            Skipped: 1998
             

             

            • `main` branch - ran after PR version (existing cluster - exec timestamp 202211221130)

             

            $ ./openshift-provider-cert-linux-amd64-main results  202211221130_sonobuoy_74c57454-0955-4617-9ce8-888c772a0713.tar.gz  |grep 'Run Details' -A 3
            Run Details:
            API Server version: v1.23.5+012e945
            Node health: 7/7 (100%)
            Pods health: 243/247 (98%)
            
            $ ./openshift-provider-cert-linux-amd64-main results -p openshift-conformance-validated 202211221130_sonobuoy_74c57454-0955-4617-9ce8-888c772a0713.tar.gz  |head -n 7
            Plugin: openshift-conformance-validated
            Status: failed
            Total: 3375
            Passed: 1346
            Failed: 30
            Skipped: 1999
            

             

            We need to dig into the failures to understand what caused them to increase almost 3x. It seems the new RBAC may impact the runtime of the openshift-tests utility.

            Some action items I can see in this case:

            • check the failure details - mainly using the main branch as a baseline
            • check which RBAC the upgrade job (openshift-tests run-upgrade) is using on CI
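
            One possible way to compare the failed tests between the two archives, assuming the sonobuoy CLI is available locally and that its results command supports the --plugin and --mode flags (not verified against the OPCT wrapper):

            $ sonobuoy results 202211220441_sonobuoy_bd6c044b-f8b7-42a7-980a-c0bb70a08890.tar.gz \
                --plugin 20-openshift-conformance-validated --mode detailed \
                | jq -r 'select(.status=="failed") | .name' | sort > failed-pr.txt
            $ sonobuoy results 202211221130_sonobuoy_74c57454-0955-4617-9ce8-888c772a0713.tar.gz \
                --plugin openshift-conformance-validated --mode detailed \
                | jq -r 'select(.status=="failed") | .name' | sort > failed-main.txt
            $ diff failed-pr.txt failed-main.txt   # failures only in the PR run are candidates for RBAC-related breakage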

             


            Marco Braga added a comment - - edited

            https://github.com/redhat-openshift-ecosystem/provider-certification-tool/pull/34

            PR submitted, doing the equivalent of the following commands:

            $ oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount
            $ oc adm policy add-scc-to-user privileged -z sonobuoy-serviceaccount 

            Tracing what `oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount` does under the hood:

            $ oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount --loglevel 10 -n openshift-provider-certification
                    round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, *\/*" -H "User-Agent: oc/4.11.0 (linux/amd64) kubernetes/5e53738"
                        'https://api.mrbnone4103018.devcluster.openshift.com:6443/apis/rbac.authorization.k8s.io/v1/clusterroles/system:openshift:scc:anyuid'
                    round_trippers.go:466] curl -v -XGET  -H "Accept: application/json, *\/*" -H "User-Agent: oc/4.11.0 (linux/amd64) kubernetes/5e53738"
                        'https://api.mrbnone4103018.devcluster.openshift.com:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-provider-certification/rolebindings/system:openshift:scc:anyuid'
                    request.go:1073] Request Body: {
                        "kind":"RoleBinding",
                        "apiVersion":"rbac.authorization.k8s.io/v1",
                        "metadata":{
                            "name":"system:openshift:scc:anyuid",
                            "namespace":"openshift-provider-certification",
                            "creationTimestamp":null
                        },
                        "subjects":[{
                            "kind":"ServiceAccount",
                            "name":"sonobuoy-serviceaccount",
                            "namespace":"openshift-provider-certification"
                            }],
                        "roleRef":{
                            "apiGroup":"rbac.authorization.k8s.io",
                            "kind":"ClusterRole",
                            "name":"system:openshift:scc:anyuid"
                        }}
                    round_trippers.go:466] curl -v -XPOST  -H "Accept: application/json, *\/*" -H "Content-Type: application/json" -H "User-Agent: oc/4.11.0 (linux/amd64) kubernetes/5e53738"
                        'https://api.mrbnone4103018.devcluster.openshift.com:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-provider-certification/rolebindings'
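
            For reference, the request body above corresponds to this declarative RoleBinding (the same content, rewritten as YAML):

            apiVersion: rbac.authorization.k8s.io/v1
            kind: RoleBinding
            metadata:
              name: system:openshift:scc:anyuid
              namespace: openshift-provider-certification
            subjects:
            - kind: ServiceAccount
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: system:openshift:scc:anyuid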
             


            Marco Braga added a comment - - edited

            I can confirm that the problem does not happen on z-stream updates, only on y-stream updates.

            I can also see that the service-ca pod is not restarted during z-stream updates. I wonder if the projected service account token is being renewed, generating the errors on the aggregator pod (sonobuoy), since the workloads in the certification namespace are not touched while the update runs. If confirmed, we may have a blocker for running the certification environment inside the cluster, even when pausing the MCP for the dedicated node.
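
            One way to check whether the aggregator pod is using a projected, auto-rotated ServiceAccount token (standard oc commands, added only as a verification hint):

            $ oc get pod sonobuoy -n openshift-provider-certification -o yaml | grep -B2 -A3 serviceAccountToken
            # a projected serviceAccountToken source with an expirationSeconds field means the kubelet rotates the token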

            I also needed to avoid using the system:serviceaccounts group on the ClusterRoleBinding, as it seems to get service-ca stuck (similar behavior is detailed in the KCS [1]):

             

            oc adm policy who-can use scc anyuid | grep serviceaccounts
                    system:serviceaccounts 

             

            Steps to reproduce and get a successful upgrade - while still getting stuck with the permission errors in the aggregator logs (described in the body of this card):

            • Create a paused MCP selecting nodes labeled node-role.kubernetes.io/tests='' (sketch below)
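            A minimal sketch of such a paused pool, assuming a worker-derived MCP named tests (the exact pool definition is not part of this card):
            cat << EOF | oc create -f -
            apiVersion: machineconfiguration.openshift.io/v1
            kind: MachineConfigPool
            metadata:
              name: tests
            spec:
              paused: true   # prevent the MCO from updating/rebooting the dedicated node during the upgrade
              machineConfigSelector:
                matchExpressions:
                - key: machineconfiguration.openshift.io/role
                  operator: In
                  values: [worker, tests]
              nodeSelector:
                matchLabels:
                  node-role.kubernetes.io/tests: ""
            EOF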
            • Pre-create the namespace, service account, and cluster roles
            cat << EOF | oc create -f -
            ---
            apiVersion: v1
            kind: Namespace
            metadata:
              name: openshift-provider-certification
              annotations:
                openshift.io/node-selector: node-role.kubernetes.io/tests=
                scheduler.alpha.kubernetes.io/defaultTolerations: '[{"key":"node-role.kubernetes.io/tests","operator":"Exists","effect":"NoSchedule"}]'
            ---
            apiVersion: v1
            kind: ServiceAccount
            metadata:
              labels:
                component: sonobuoy
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: opct-scc-anyuid
            rules:
            - apiGroups:
              - security.openshift.io
              resourceNames:
              - anyuid
              resources:
              - securitycontextconstraints
              verbs:
              - use
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRole
            metadata:
              name: opct-scc-privileged
            rules:
            - apiGroups:
              - security.openshift.io
              resourceNames:
              - privileged
              resources:
              - securitycontextconstraints
              verbs:
              - use
            EOF
            • Add the service account to the anyuid and privileged SCCs
            oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount 
            oc adm policy add-scc-to-user privileged -z sonobuoy-serviceaccount 
            • Patch the CLI to allow reusing existing namespaces
            • Patch the CLI to create ClusterRoleBindings for the recently created ClusterRoles and ServiceAccount, instead of the system:serviceaccounts and system:authenticated groups (example below)
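            A sketch of the bindings such a patch would create, binding the ClusterRoles defined above to only the Sonobuoy ServiceAccount (a manual equivalent; not necessarily identical to what PR #34 ended up generating):
            cat << EOF | oc create -f -
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRoleBinding
            metadata:
              name: opct-scc-anyuid
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: opct-scc-anyuid
            subjects:
            - kind: ServiceAccount
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            ---
            apiVersion: rbac.authorization.k8s.io/v1
            kind: ClusterRoleBinding
            metadata:
              name: opct-scc-privileged
            roleRef:
              apiGroup: rbac.authorization.k8s.io
              kind: ClusterRole
              name: opct-scc-privileged
            subjects:
            - kind: ServiceAccount
              name: sonobuoy-serviceaccount
              namespace: openshift-provider-certification
            EOF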
            • Run the patched CLI to perform a y-stream update (example from a 4.10.30 cluster):
            $ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.4 -o jsonpath={.image}) 
            • The CLI status reporting gets stuck after the upgrade has started:
            Fri, 18 Nov 2022 19:09:01 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete), waiting on kube-apiserver
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[51/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
            Fri, 18 Nov 2022 19:09:11 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete)
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[53/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
            (...) 
            Fri, 18 Nov 2022 21:36:41 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete)
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[65/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
            
            • The sonobuoy aggregator reports the errors in its logs:
            $ oc logs sonobuoy  -n openshift-provider-certification  --tail 10 -f 
            
            time="2022-11-19T00:37:03Z" level=info msg="couldn't annotate sonobuoy pod" error="couldn't patch pod annotation: pods \"sonobuoy\" is forbidden: unable to validate against any security context constraint: [provider restricted-v2: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, spec.containers[0].securityContext.runAsUser: Invalid value: 1000: must be in the ranges: [1000650000, 1000659999], provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider machine-api-termination-handler: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, spec.volumes[0]: Invalid value: \"configMap\": configMap volumes are not allowed to be used, spec.volumes[1]: Invalid value: \"configMap\": configMap volumes are not allowed to be used, spec.volumes[2]: Invalid value: \"emptyDir\": emptyDir volumes are not allowed to be used, spec.volumes[3]: Invalid value: \"projected\": projected volumes are not allowed to be used, provider hostnetwork-v2: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider hostnetwork: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider hostaccess: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group]"
            
            time="2022-11-19T00:37:07Z" level=info msg="received request" client_cert="[20-openshift-conformance-validated]" method=POST plugin_name=20-openshift-conformance-validated url=/api/v1/progress/global/20-openshift-conformance-validated
            
            time="2022-11-19T00:37:10Z" level=info msg="received request" client_cert="[99-openshift-artifacts-collector]" method=POST plugin_name=99-openshift-artifacts-collector url=/api/v1/progress/global/99-openshift-artifacts-collector
            • Nevertheless, the upgrade continues normally and finishes, and the cluster-upgrade plugin completes successfully.
            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.11.4    True        False         125m    Cluster version is 4.11.4
            • The next plugin gets stuck, as the aggregator can't update the plugins' status (a dependency of the plugin blocker engine):
            $ oc get pods -n openshift-provider-certification
            NAME                                                               READY   STATUS      RESTARTS   AGE
            sonobuoy                                                           1/1     Running     0          159m
            sonobuoy-05-openshift-cluster-upgrade-job-62b87a8c593344b9         0/3     Completed   0          159m
            sonobuoy-10-openshift-kube-conformance-job-4aaafda465f04268        3/3     Running     0          159m
            sonobuoy-20-openshift-conformance-validated-job-dd49f61f180b47ef   3/3     Running     0          159m
            sonobuoy-99-openshift-artifacts-collector-job-1e6c9cd771754cac     3/3     Running     0          159m
            
            $ ./openshift-provider-cert-linux-amd64 status
            Fri, 18 Nov 2022 21:37:02 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.4: 106 of 803 done (13% complete)
            10-openshift-kube-conformance      | running    |            | 0/345 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[65/100]
            20-openshift-conformance-validated | running    |            | 0/3251 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
             
            • NOTE: After destroying the environment, manually create the Namespace, ServiceAccount, and ClusterRoles (as described above), then run the tool again to perform a z-stream update:
            $ ./openshift-provider-cert-linux-amd64 destroy
            # create resources manually (NS, SA, ClusterRoles)
            $ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.7 -o jsonpath={.image}) 
            • The upgrade finished correctly:
            $ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.7 -o jsonpath={.image})
            (...)
            Fri, 18 Nov 2022 22:09:08 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=4.11.4=upgrade-progressing-False           
            10-openshift-kube-conformance      | running    |            | 0/352 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[0/100]
            20-openshift-conformance-validated | running    |            | 0/3487 (0 failures)       | status=waiting-for=10-openshift-kube-conformance=(0/0/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=waiting-for=20-openshift-conformance-validated=(0/0/0)=[0/100]
            (...)
            Fri, 18 Nov 2022 22:36:58 -03> Global Status: running
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | running    |            | 0/0 (0 failures)          | status=Working towards 4.11.7: 635 of 803 done (79% complete)
            10-openshift-kube-conformance      | running    |            | 0/352 (0 failures)        | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[99/100]
            20-openshift-conformance-validated | running    |            | 0/3487 (0 failures)       | status=blocked-by=10-openshift-kube-conformance=(0/-352/0)=[0/100]
            99-openshift-artifacts-collector   | running    |            | 0/0 (0 failures)          | status=blocked-by=20-openshift-conformance-validated=(0/-3487/0)=[0/100]
            (...)
            INFO[2022-11-18T23:13:38-03:00] Waiting for post-processor...                
            Fri, 18 Nov 2022 23:16:48 -03> Global Status: complete
            JOB_NAME                           | STATUS     | RESULTS    | PROGRESS                  | MESSAGE                                           
            05-openshift-cluster-upgrade       | complete   | passed     | 0/0 (0 failures)          | Total tests processed: 36 (36 pass / 0 failed)    
            10-openshift-kube-conformance      | complete   | failed     | 20/20 (0 failures)        | Total tests processed: 23 (22 pass / 1 failed)    
            20-openshift-conformance-validated | complete   | failed     | 20/20 (0 failures)        | Total tests processed: 12 (11 pass / 1 failed)    
            99-openshift-artifacts-collector   | complete   | passed     | 0/0 (0 failures)          | Total tests processed: 3 (3 pass / 0 failed)      
            INFO[2022-11-18T23:16:48-03:00] The execution has completed! Use retrieve command to collect the results and share the archive with your Red Hat partner. 
            
            $ oc get clusterversion
            NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.11.7    True        False         17m     Cluster version is 4.11.7
            

             

            [1] https://access.redhat.com/solutions/5875621


              Assignee: Marco Braga (rhn-support-mrbraga)
              Reporter: Marco Braga (rhn-support-mrbraga)