Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.15, 4.16
Component/s: Machine Config Operator
Labels:
- mco-triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The Machine Config Operator is stuck in a degraded state because the MachineConfigPool (MCP) is degraded: the newly generated rendered MachineConfig is larger than 1.5 MB, which is the request size limit for etcd.

The logs show that the nodes were updated to this latest rendered config. However, the MCP is stuck in a degraded state with the following error:
~~~
  message: 'Failed to render configuration for pool master: etcdserver: request is too large'
  reason: ""
  status: "True"
  type: RenderDegraded
~~~

Also, the machine-config-controller logs are full of "etcdserver: request is too large" failures:
~~~
2025-09-29T15:11:54.420173338Z E0929 15:11:54.420119       1 render_controller.go:460] Error syncing Generated MCFG: %!w(*errors.StatusError=&{{{ } {   <nil>} Failure etcdserver: request is too large  <nil> 500}})
2025-09-29T15:11:54.427327865Z E0929 15:11:54.427288       1 render_controller.go:396] etcdserver: request is too large
2025-09-29T15:11:54.427327865Z I0929 15:11:54.427311       1 render_controller.go:397] Dropping machineconfigpool "master" out of the queue: etcdserver: request is too large
[..]
2025-09-30T10:31:29.086682700Z I0930 10:31:29.086599       1 render_controller.go:391] Error syncing machineconfigpool master: etcdserver: request is too large
2025-09-30T10:31:39.917029591Z E0930 10:31:39.916935       1 render_controller.go:460] Error syncing Generated MCFG: %!w(*errors.StatusError=&{{{ } {   <nil>} Failure etcdserver: request is too large  <nil> 500}})
2025-09-30T10:31:39.924226014Z I0930 10:31:39.924182       1 render_controller.go:391] Error syncing machineconfigpool master: etcdserver: request is too large
~~~

Checking the size of the latest rendered MachineConfig and of the MachineConfigs it is built from, we found that they are very large (1556422 bytes).

Inspecting the contents of the MachineConfig confirms that the registries file ("/etc/containers/registries.conf"), generated from all the configured mirrors, is huge.
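To put the reported figure in context, here is a small arithmetic sketch. It assumes the cluster's etcd runs with the upstream default --max-request-bytes of 1572864 bytes (1.5 MiB); the variable names are only illustrative. Under that assumption, 1556422 bytes is already above 1.5 MB in decimal units and leaves very little headroom below 1.5 MiB, so any additional content or request overhead could plausibly push a write over the limit.

```shell
#!/usr/bin/env bash
# Assumption: etcd uses its default --max-request-bytes (1.5 MiB).
ETCD_MAX_REQUEST_BYTES=$((1536 * 1024))   # 1572864 bytes
MB_DECIMAL=1500000                        # 1.5 MB in decimal units
MC_BYTES=1556422                          # size reported in this issue

# How far the reported size is above 1.5 MB (decimal).
over_decimal=$((MC_BYTES - MB_DECIMAL))

# How much room is left below the assumed etcd limit.
headroom=$((ETCD_MAX_REQUEST_BYTES - MC_BYTES))

echo "limit=${ETCD_MAX_REQUEST_BYTES} size=${MC_BYTES}"
echo "over_1.5MB=${over_decimal} headroom_to_1.5MiB=${headroom}"
```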
~~~
$ omc get mc 99-master-generated-registries -o json | jq -r '.spec.config.storage.files[0].contents.source' | cut -d',' -f2 | base64 -d | less
~~~

This in turn makes the rendered MachineConfig huge and leaves the MCP stuck in a degraded state.
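To quantify the file rather than page through it, the same decoding steps can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, it assumes the source field is a base64 data URL (as the cut and base64 steps in the command above imply), and it assumes mirrors appear as [[registry.mirror]] tables in registries.conf.

```shell
#!/usr/bin/env bash
# Read an Ignition data URL ("data:...;base64,<payload>") on stdin and
# print the decoded payload, using the same steps as the omc pipeline.
decode_source() {
  cut -d',' -f2 | base64 -d
}

# Print the decoded size in bytes and the number of mirror tables.
# Hypothetical helper, not part of omc, oc, or the MCO.
summarize_registries() {
  local tmp
  tmp=$(mktemp)
  decode_source > "$tmp"
  echo "bytes=$(wc -c < "$tmp" | tr -d ' ')"
  echo "mirrors=$(grep -c '^[[:space:]]*\[\[registry\.mirror]]' "$tmp")"
  rm -f "$tmp"
}
```

For example, the output of the jq step in the command above could be piped into summarize_registries instead of into cut.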

For now, we cannot think of any workaround other than reducing the size of the content of the MCP. Because this MC is generated by the controller itself, it is difficult for the customer to remove anything from it. We are looking for a workaround and a permanent fix for this issue.

In my opinion, there should be a check that the rendered MachineConfig does not exceed the 1.5 MB limit in etcd. I am open to discussing this further to explore alternatives.
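As a rough illustration of the kind of check meant here, the sketch below compares the size of an exported object against a limit before it would be written. It is not an existing MCO or oc feature: fits_etcd_limit is a hypothetical helper, and the default of 1572864 bytes assumes etcd's upstream default --max-request-bytes.

```shell
#!/usr/bin/env bash
# Assumption: etcd default --max-request-bytes of 1.5 MiB; override via env.
ETCD_MAX_REQUEST_BYTES=${ETCD_MAX_REQUEST_BYTES:-1572864}

# fits_etcd_limit FILE: succeed if FILE is not larger than the limit.
# Hypothetical helper for illustration only.
fits_etcd_limit() {
  local size
  size=$(wc -c < "$1" | tr -d ' ')
  if [ "$size" -gt "$ETCD_MAX_REQUEST_BYTES" ]; then
    echo "too large: ${size} > ${ETCD_MAX_REQUEST_BYTES} bytes" >&2
    return 1
  fi
  echo "ok: ${size} <= ${ETCD_MAX_REQUEST_BYTES} bytes"
}
```

The file size of an exported object is only an approximation of the request etcd actually receives, so a real check would probably want to leave some margin below the limit.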

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1.  Have a cluster with many MachineConfigs, including a large ImageDigestMirrorSet object, such that the combined rendered MachineConfig is larger than 1.5 MB.
    2.  Upgrade the cluster and check that the new rendered MachineConfig is larger than 1.5 MB.
    3.  Check that the machine-config operator is stuck in a degraded state due to 'Failed to render configuration for pool master: etcdserver: request is too large'.
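When reproducing this, it can help to see which MachineConfigs contribute most to the rendered size. The sketch below assumes each MachineConfig has already been exported to its own JSON file in one directory; that export layout and the rank_by_size name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# rank_by_size DIR: print "<bytes> <file>" for every *.json in DIR,
# largest first. Assumes one exported MachineConfig per file.
rank_by_size() {
  local f
  for f in "$1"/*.json; do
    [ -e "$f" ] || continue   # skip the literal pattern if DIR has no matches
    printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done | sort -rn
}
```

For example, rank_by_size ./machineconfigs would list the largest exported objects first, where ./machineconfigs is a hypothetical directory name.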

Actual results:

     The MachineConfigOperator is stuck in a degraded state because the MachineConfigPool (MCP) is degraded, due to the large size of the new rendered MachineConfig generated during the cluster upgrade.

Expected results:

    The cluster should avoid such blockers during an upgrade. Ideally, there should be a check on the size of the newly generated rendered MachineConfig to ensure that it does not exceed etcd's default limit of 1.5 MiB.

Additional info:

     This happened during a cluster upgrade from OCP 4.14 to 4.16, but it can happen during any cluster upgrade or any machine-config update.

links to

MCP in Degraded State due to etcdserver: request is too large during cluster upgrade.

Assignee:: Team MCO

Reporter:: Nirupma Nirupma

Need Info From:: None

Contributors:: None

QA Contact:: Sergio Regidor de la Rosa

Doc Contact:: None

Votes:: 0

Watchers:: 3

Created:: 2025/10/01 11:45 AM

Updated:: 2025/10/02 9:55 PM
