I can confirm that the problem does not happen on z-stream updates, only when a y-stream update is in progress.
I can also see that the service-ca pod is not restarted during z-stream updates. I wonder whether the projected service account token is being renewed, generating the errors on the aggregator pod (sonobuoy), since the workloads in the certification namespace are not touched while the update runs. If this is confirmed, we may have a blocker for running the certification environment inside the cluster, even when pausing the MCP for the dedicated node.
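One way to sanity-check this assumption (illustrative commands only, not part of the original reproduction) is to compare the service-ca pod age/restarts before and after the update, and to look at the projected token volume on the aggregator pod:
# service-ca pod restarts/age
oc get pods -n openshift-service-ca
# projected service account token volume on the aggregator pod
oc get pod sonobuoy -n openshift-provider-certification -o yaml | grep -A5 projected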
I also needed to avoid using the system:serviceaccounts group on the ClusterRoleBinding, as it seems to get the service-ca stuck (a similar behavior is detailed in the KCS article [1]):
oc adm policy who-can use scc anyuid | grep serviceaccounts
system:serviceaccounts
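If that group grant had been created with oc adm policy add-scc-to-group, it could be backed out with the command below (illustrative; if it was created as a custom ClusterRoleBinding, the binding itself has to be removed instead):
oc adm policy remove-scc-from-group anyuid system:serviceaccounts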
Steps to reproduce the issue (the upgrade itself succeeds, but the aggregator gets stuck with permission errors in its logs, as described in the body of this card):
- Create a paused MCP selecting nodes labeled node-role.kubernetes.io/tests='' (see the sketch below)
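For reference, a minimal sketch of such a paused MCP; the pool name tests and the worker base role are assumptions, not taken from this card:
cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: tests
spec:
  paused: true
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, tests]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/tests: ""
EOF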
- Pre-create the namespace, service account, and cluster roles:
cat << EOF | oc create -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-provider-certification
  annotations:
    openshift.io/node-selector: node-role.kubernetes.io/tests=
    scheduler.alpha.kubernetes.io/defaultTolerations: '[{"key":"node-role.kubernetes.io/tests","operator":"Exists","effect":"NoSchedule"}]'
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    component: sonobuoy
  name: sonobuoy-serviceaccount
  namespace: openshift-provider-certification
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opct-scc-anyuid
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - anyuid
  resources:
  - securitycontextconstraints
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opct-scc-privileged
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
EOF
- Add the service account to the anyuid and privileged SCCs:
oc adm policy add-scc-to-user anyuid -z sonobuoy-serviceaccount
oc adm policy add-scc-to-user privileged -z sonobuoy-serviceaccount
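The grants can be verified with the same who-can check used above (illustrative; depending on where oc adm policy created the binding, it may need to run with -n openshift-provider-certification):
oc adm policy who-can use scc anyuid | grep sonobuoy
oc adm policy who-can use scc privileged | grep sonobuoy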
- Patch the CLI to allow reusing existing namespaces
- Patch the CLI to create ClusterRoleBindings for the recently created ClusterRoles and ServiceAccount, instead of the system:serviceaccounts and system:authenticated groups (a sketch of the expected binding is shown below)
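A minimal sketch of the binding the patched CLI is expected to create, one per ClusterRole (the binding name is illustrative):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: opct-scc-anyuid
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: opct-scc-anyuid
subjects:
- kind: ServiceAccount
  name: sonobuoy-serviceaccount
  namespace: openshift-provider-certification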
- Run the patched CLI in upgrade mode targeting the next y-stream (example from a 4.10.30 cluster):
$ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.4 -o jsonpath={.image})
- The CLI will get stuck on status reporting after the upgrade has started:
Fri, 18 Nov 2022 19:09:01 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | running | | 0/0 (0 failures) | status=Working towards 4.11.4: 106 of 803 done (13% complete), waiting on kube-apiserver
10-openshift-kube-conformance | running | | 0/345 (0 failures) | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[51/100]
20-openshift-conformance-validated | running | | 0/3251 (0 failures) | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
99-openshift-artifacts-collector | running | | 0/0 (0 failures) | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
Fri, 18 Nov 2022 19:09:11 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | running | | 0/0 (0 failures) | status=Working towards 4.11.4: 106 of 803 done (13% complete)
10-openshift-kube-conformance | running | | 0/345 (0 failures) | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[53/100]
20-openshift-conformance-validated | running | | 0/3251 (0 failures) | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
99-openshift-artifacts-collector | running | | 0/0 (0 failures) | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
(...)
Fri, 18 Nov 2022 21:36:41 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | running | | 0/0 (0 failures) | status=Working towards 4.11.4: 106 of 803 done (13% complete)
10-openshift-kube-conformance | running | | 0/345 (0 failures) | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[65/100]
20-openshift-conformance-validated | running | | 0/3251 (0 failures) | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
99-openshift-artifacts-collector | running | | 0/0 (0 failures) | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
- The sonobuoy aggregator will report the errors in its logs:
$ oc logs sonobuoy -n openshift-provider-certification --tail 10 -f
time="2022-11-19T00:37:03Z" level=info msg="couldn't annotate sonobuoy pod" error="couldn't patch pod annotation: pods \"sonobuoy\" is forbidden: unable to validate against any security context constraint: [provider restricted-v2: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, spec.containers[0].securityContext.runAsUser: Invalid value: 1000: must be in the ranges: [1000650000, 1000659999], provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider machine-api-termination-handler: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, spec.volumes[0]: Invalid value: \"configMap\": configMap volumes are not allowed to be used, spec.volumes[1]: Invalid value: \"configMap\": configMap volumes are not allowed to be used, spec.volumes[2]: Invalid value: \"emptyDir\": emptyDir volumes are not allowed to be used, spec.volumes[3]: Invalid value: \"projected\": projected volumes are not allowed to be used, provider hostnetwork-v2: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider hostnetwork: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group, provider hostaccess: .spec.securityContext.fsGroup: Invalid value: []int64{2000}: 2000 is not an allowed group]"
time="2022-11-19T00:37:07Z" level=info msg="received request" client_cert="[20-openshift-conformance-validated]" method=POST plugin_name=20-openshift-conformance-validated url=/api/v1/progress/global/20-openshift-conformance-validated
time="2022-11-19T00:37:10Z" level=info msg="received request" client_cert="[99-openshift-artifacts-collector]" method=POST plugin_name=99-openshift-artifacts-collector url=/api/v1/progress/global/99-openshift-artifacts-collector
- The upgrade itself will continue normally and finish, and the plugin (cluster-upgrade) will then finish successfully:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.4 True False 125m Cluster version is 4.11.4
- The next plugin will get stuck, as the aggregator can't update the plugins' status (a dependency of the plugin blocker engine):
$ oc get pods -n openshift-provider-certification
NAME READY STATUS RESTARTS AGE
sonobuoy 1/1 Running 0 159m
sonobuoy-05-openshift-cluster-upgrade-job-62b87a8c593344b9 0/3 Completed 0 159m
sonobuoy-10-openshift-kube-conformance-job-4aaafda465f04268 3/3 Running 0 159m
sonobuoy-20-openshift-conformance-validated-job-dd49f61f180b47ef 3/3 Running 0 159m
sonobuoy-99-openshift-artifacts-collector-job-1e6c9cd771754cac 3/3 Running 0 159m
$ ./openshift-provider-cert-linux-amd64 status
Fri, 18 Nov 2022 21:37:02 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | running | | 0/0 (0 failures) | status=Working towards 4.11.4: 106 of 803 done (13% complete)
10-openshift-kube-conformance | running | | 0/345 (0 failures) | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[65/100]
20-openshift-conformance-validated | running | | 0/3251 (0 failures) | status=blocked-by=10-openshift-kube-conformance=(0/-345/0)=[0/100]
99-openshift-artifacts-collector | running | | 0/0 (0 failures) | status=blocked-by=20-openshift-conformance-validated=(0/-3251/0)=[0/100]
- NOTE: when destroying the environment, manually create the Namespace, ServiceAccount, and ClusterRoles again (as described above), then run the tool to update the z-stream:
$ ./openshift-provider-cert-linux-amd64 destroy
# create resources manually (NS, SA, ClusterRoles)
$ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.7 -o jsonpath={.image})
- The z-stream upgrade finished correctly:
$ ./openshift-provider-cert-linux-amd64 run -w --dedicated --mode upgrade --upgrade-to-image $(oc adm release info 4.11.7 -o jsonpath={.image})
(...)
Fri, 18 Nov 2022 22:09:08 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | running | | 0/0 (0 failures) | status=4.11.4=upgrade-progressing-False
10-openshift-kube-conformance | running | | 0/352 (0 failures) | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[0/100]
20-openshift-conformance-validated | running | | 0/3487 (0 failures) | status=waiting-for=10-openshift-kube-conformance=(0/0/0)=[0/100]
99-openshift-artifacts-collector | running | | 0/0 (0 failures) | status=waiting-for=20-openshift-conformance-validated=(0/0/0)=[0/100]
(...)
Fri, 18 Nov 2022 22:36:58 -03> Global Status: running
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | running | | 0/0 (0 failures) | status=Working towards 4.11.7: 635 of 803 done (79% complete)
10-openshift-kube-conformance | running | | 0/352 (0 failures) | status=waiting-for=05-openshift-cluster-upgrade=(0/0/0)=[99/100]
20-openshift-conformance-validated | running | | 0/3487 (0 failures) | status=blocked-by=10-openshift-kube-conformance=(0/-352/0)=[0/100]
99-openshift-artifacts-collector | running | | 0/0 (0 failures) | status=blocked-by=20-openshift-conformance-validated=(0/-3487/0)=[0/100]
(...)
INFO[2022-11-18T23:13:38-03:00] Waiting for post-processor...
Fri, 18 Nov 2022 23:16:48 -03> Global Status: complete
JOB_NAME | STATUS | RESULTS | PROGRESS | MESSAGE
05-openshift-cluster-upgrade | complete | passed | 0/0 (0 failures) | Total tests processed: 36 (36 pass / 0 failed)
10-openshift-kube-conformance | complete | failed | 20/20 (0 failures) | Total tests processed: 23 (22 pass / 1 failed)
20-openshift-conformance-validated | complete | failed | 20/20 (0 failures) | Total tests processed: 12 (11 pass / 1 failed)
99-openshift-artifacts-collector | complete | passed | 0/0 (0 failures) | Total tests processed: 3 (3 pass / 0 failed)
INFO[2022-11-18T23:16:48-03:00] The execution has completed! Use retrieve command to collect the results and share the archive with your Red Hat partner.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.7 True False 17m Cluster version is 4.11.7
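For reference, collecting the results afterwards would use the retrieve command mentioned in the tool output above:
$ ./openshift-provider-cert-linux-amd64 retrieve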
[1] https://access.redhat.com/solutions/5875621
PR #34 fixes the *cluster upgrade* RBAC when running 4.10->4.11. A lot of testing has been done using both the regular execution and the upgrade feature.
Closing this card.