OpenShift Bugs / OCPBUGS-18969

SNO fails install because image-registry operator is degraded - "Degraded: The registry is removed..."

    • Sprint 243
    • Release Note Type: Bug Fix
    • Release Note Text:
      * Previously, the Image Registry pruner relied on a cluster role that was managed by the openshift-apiserver. This could cause the pruner job to intermittently fail during an upgrade. Now, the Image Registry Operator is responsible for creating the pruner cluster role, which resolves the issue. (link:https://issues.redhat.com/browse/OCPBUGS-18969[*OCPBUGS-18969*])
    • 9/19: telco prioritization pending triage

      Description of problem:

      While installing many SNOs via ZTP using ACM, two SNOs failed to complete install because the image-registry was degraded during the install process.
      
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion"
      vm01831
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
      vm02740
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
      
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co image-registry"
      vm01831
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
      vm02740
      NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
      
      Both clusters showed the image-pruner job pod in an error state:
      # cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get po -n openshift-image-registry"
      vm01831
      NAME                                               READY   STATUS    RESTARTS   AGE
      cluster-image-registry-operator-5d497944d4-czn64   1/1     Running   0          18h
      image-pruner-28242720-w6jmv                        0/1     Error     0          18h
      node-ca-vtfj8                                      1/1     Running   0          18h
      vm02740
      NAME                                               READY   STATUS    RESTARTS      AGE
      cluster-image-registry-operator-5d497944d4-lbtqw   1/1     Running   1 (18h ago)   18h
      image-pruner-28242720-ltqzk                        0/1     Error     0             18h
      node-ca-4fntj                                      1/1     Running   0             18h
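
      The failed job itself can be inspected with the usual commands; a small diagnostic sketch, assuming the job name follows the usual CronJob naming (the pod name above minus its random suffix):

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig -n openshift-image-registry describe job image-pruner-28242720
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig -n openshift-image-registry logs image-pruner-28242720-w6jmv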

       

      Version-Release number of selected component (if applicable):

      Deployed SNO OCP - 4.14.0-rc.0
      Hub 4.13.11
      ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

      Rare; only 2 of the many clusters deployed were found in this state after the test.

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Seems like some permissions might have been lacking:
      
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig logs -n openshift-image-registry image-pruner-28242720-w6jmv
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #1 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #2 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #3 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #4 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
      attempt #5 has failed (exit code 1), going to make another attempt...
      Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
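
      The role and service account named in the error can be checked directly; a quick diagnostic sketch using only the names from the message above:

      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig get clusterrole system:image-pruner
      # oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig auth can-i list pods --as=system:serviceaccount:openshift-image-registry:pruner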
      
      


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.15.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:7198


            Wen Wang added a comment -

            Did a regression test with version 4.15.0-0.nightly-2023-10-09-101435; the upgrade from 4.14 to 4.15 succeeded, so closing this.


            Wen Wang added a comment - edited

            fmissi, I will do the upgrade and regression test when a 4.15 build is available; the latest build does not yet include the fix.


            OpenShift Jira Bot added a comment -

            Hi fmissi,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.


            Flavian Missi added a comment -

            Hi wewang@redhat.com, we're about to merge the second PR that should fix this.

            The change moves creation of the "system:image-pruner" cluster role from the openshift-apiserver into the image registry operator.

            Since this is a difficult bug to reproduce, I think we can focus on regression tests for the pruner. We need to cover both the fresh install and upgrade cases.

            Let me know if you need more info, and thanks!
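
            A manual pruner run is an easy way to exercise this without waiting for the schedule; a sketch, assuming the pruner CronJob keeps its usual name, image-pruner (inferred from the pod names above):

            # oc -n openshift-image-registry create job --from=cronjob/image-pruner image-pruner-manual
            # oc -n openshift-image-registry wait --for=condition=complete job/image-pruner-manual --timeout=300s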


            Flavian Missi added a comment - edited

            Hi bzvonar@redhat.com, I wouldn't expect anyone other than the registry team to get this done, but we are happy to take in contributions.

            I'm also uncertain about backporting the change once it lands on the main branch; it's not something I would do without a lot of consideration.


            Flavian Missi added a comment -

            Thanks akrzos@redhat.com!

            The problem here is that the openshift-apiserver is currently responsible for creating a cluster role that the pruner relies on. This cluster role should instead be created by the registry operator (which also manages the pruner), so that there is no interdependence between components that might be updated simultaneously.
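
            For context, the RBAC pieces involved can be listed with standard commands; a minimal sketch using the names from the pruner error:

            # oc get clusterrole system:image-pruner
            # oc -n openshift-image-registry get serviceaccount pruner
            # oc get clusterrolebinding -o wide | grep image-pruner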

            Alex Krzos added a comment -

            fmissi I attached a successfully installed SNO must-gather (vm00001, version 4.14.0-rc.1). Let me know if you need anything else.

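
            For reference, a must-gather like the one attached can be collected as follows (the destination directory is just an example):

            # oc --kubeconfig /root/hv-vm/kc/vm00001/kubeconfig adm must-gather --dest-dir=/tmp/must-gather-vm00001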

            Flavian Missi added a comment -

            Okay, so it looks like the system:image-pruner cluster role is created by openshift-apiserver:

            2023-09-13T00:13:35.527957614Z I0913 00:13:35.527791       1 storage_rbac.go:226] created clusterrole.rbac.authorization.k8s.io/system:image-pruner 

            And the pruner job tries to run for the first time:

            2023-09-13T00:01:35.766589593Z Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found 

            And the last try:

            2023-09-13T00:09:07.529941880Z Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found 

            So the required role for the pruner to work is created AFTER the pruner job runs and gives up retrying.
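
            That ordering can be confirmed on an affected cluster by comparing the role's creation time against the pruner pod's log timestamps; a small sketch:

            # oc get clusterrole system:image-pruner -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
            # oc -n openshift-image-registry logs image-pruner-28242720-w6jmv --timestamps | tail -n 2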


            Flavian Missi added a comment -

            So the pruner should run whether the registry is removed or not.

            As you pointed out, the problem does seem to be a missing ClusterRole (system:image-pruner).

            It should be created by openshift-apiserver; I'm currently trying to find out why that didn't happen.
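
            On a fresh install, the moment the role appears can be recorded with a simple poll loop; a minimal sketch:

            # while ! oc get clusterrole system:image-pruner 2>/dev/null; do sleep 10; done; echo "role created: $(date -u)"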

