Description of problem:
MCE is in an Error phase after what appears to have been an upgrade to ACM 2.6.6 and MCE 2.1.7. Since July 14th the MCE resource has been reporting:
rpc error: code = Unknown desc = malformed header: missing HTTP content-type
On July 24th the MCE operator was re-installed following the steps at https://access.redhat.com/solutions/6459071, but MCE has not recovered. The MCE operator pod logs show repeating stream errors such as:
2023-07-24T17:36:58.288825855Z 1.6902202182887614e+09 DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/validate-multicluster-openshift-io-v1-multiclusterengine", "code": 200, "reason": "", "UID": "beee0c21-e368-465a-99d9-b7b8da16b1be", "allowed": true}
2023-07-24T17:37:01.897230172Z W0724 17:37:01.897170 1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: failed to list *v1.ConfigMap: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1463; INTERNAL_ERROR; received from peer
2023-07-24T17:37:01.897300566Z I0724 17:37:01.897233 1 trace.go:205] Trace[590526907]: "Reflector ListAndWatch" name:sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262 (24-Jul-2023 17:36:00.718) (total time: 61178ms):
2023-07-24T17:37:01.897300566Z Trace[590526907]: ---"Objects listed" error:stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1463; INTERNAL_ERROR; received from peer 61178ms (17:37:01.897)
2023-07-24T17:37:01.897300566Z Trace[590526907]: [1m1.178279099s] [1m1.178279099s] END
2023-07-24T17:37:01.897300566Z E0724 17:37:01.897251 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:262: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: stream error when reading response body, may be caused by closed connection. Please retry. Original error: stream error: stream ID 1463; INTERNAL_ERROR; received from peer
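For triage, the repeating failures in the excerpt above can be tallied from a saved copy of the operator log. The following is a minimal sketch, not part of the original report: the function name summarize_stream_errors and the regular expression are hypothetical, and the pattern assumes the reflector messages keep the wording quoted above ("failed to list"/"Failed to watch", followed by "stream ID <n>; <CODE>"). It counts failures per resource kind and HTTP/2 error code, which may help show whether only ConfigMap lists are affected.

```python
import re
from collections import Counter

# Hypothetical pattern, modelled only on the reflector messages quoted above:
#   "... failed to list *v1.ConfigMap: stream error ... stream ID 1463; INTERNAL_ERROR; ..."
# "." does not match newlines, so a match never spans two log lines.
STREAM_ERROR = re.compile(
    r"[Ff]ailed to (?:list|watch) (?P<kind>\*\S+?): .*?"
    r"stream ID (?P<stream_id>\d+); (?P<code>[A-Z_]+)"
)


def summarize_stream_errors(log_text):
    """Count reflector stream errors per (resource kind, error code)."""
    counts = Counter()
    for match in STREAM_ERROR.finditer(log_text):
        counts[(match.group("kind"), match.group("code"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): python summarize.py < mce-operator.log
    for (kind, code), n in summarize_stream_errors(sys.stdin.read()).most_common():
        print(f"{n:6d}  {kind}  {code}")
```

A count dominated by a single kind would point at one large or slow list call rather than a general API connectivity problem, though that interpretation is an assumption and would need confirming against the API server and etcd logs.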
The shift support team investigated due to concerns over the cluster health and reported:
etcd response rate for the cluster is pretty bad, and they seem to have an issue with volumes for ODF, but I don't see any signs of an issue with cluster health. All nodes are ready, minimum specs met, MCP up to date, all pods Ready or Completed.
Version-Release number of selected component (if applicable):
ACM 2.6.6 / MCE 2.1.7
How reproducible:
Have not seen in lab
Steps to Reproduce:
- ...
Actual results:
Expected results:
Additional info:
- clones ACM-6550 MCE in Error Phase after upgrade to ACM 2.6.6 MCE 2.1.7 (Closed)