[OCPBUGS-18640] Cluster fails to install at day-0 with PerformanceProfile - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.14.z
Affects Version/s: 4.14
Component/s: Node Tuning Operator
Labels:
None

Regression:
No
Sprint:
CNF Compute Sprint 251
sprint_count:
1
Blocked:
False
Blocked Reason:

None
Release Note Text:
* Currently, applying a performance profile at day-0 is not supported.
Release Note Type:
Known Issue
Release Note Status:
Done
Latest Status Summary:

2024-03-20: all fixes merged in the 4.14 branch
5/3 : https://github.com/openshift/cluster-node-tuning-operator/pull/963 should be the last fix

RH Private Keywords:
Target Version:

4.14.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

I picked up 4.14-ec-4 (which uses cgroups v1 by default) and tried to create a cluster with the following PerformanceProfile (and the corresponding MachineConfigPool) by placing both in the manifests folder:

 
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: clusterbotpp
spec:
  cpu:
    isolated: "1-3"
    reserved: "0"
  realTimeKernel:
    enabled: false
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""

and,

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker 
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
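
As a quick sanity check on the profile itself, the `isolated` and `reserved` values use the usual CPU-list notation (comma-separated ids and ranges). The sketch below expands both lists and confirms they do not overlap, which to my understanding is what the operator expects of a PerformanceProfile. The helper name `parse_cpu_list` is made up for this illustration and is not part of the Node Tuning Operator.

```python
def parse_cpu_list(spec):
    """Expand a CPU list such as "0,2-4" into a set of CPU ids.

    Hypothetical helper for illustration only; it handles plain ids and
    inclusive ranges, nothing more.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values taken from the PerformanceProfile in this report.
isolated = parse_cpu_list("1-3")
reserved = parse_cpu_list("0")
overlap = isolated & reserved

print("isolated:", sorted(isolated))
print("reserved:", sorted(reserved))
print("overlap:", sorted(overlap))
```

For this profile the two sets are disjoint, so the CPU partitioning in the manifest does not look like the source of the install failure.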

The cluster often fails to install because bootkube spends a long time retrying on this error:

 
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: [#1717] failed to create some manifests:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: "clusterbotpp_kubeletconfig.yaml": failed to update status for kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 597dfcf3-012d-4730-912a-78efabb920ba, UID in object meta:
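
To make the retries easier to compare, here is a minimal sketch that pulls the precondition UID out of each failing line and groups the distinct values per object. The function name `precondition_uids` and the regular expression are assumptions made for this illustration, and the sample lines are abbreviated copies of the messages above, not the full journal entries.

```python
import re

# Matches the object name and the UID reported in the failed precondition.
UID_RE = re.compile(
    r'kubeletconfigs\.machineconfiguration\.openshift\.io "(?P<name>[^"]+)"'
    r'.*?UID in precondition: (?P<uid>[0-9a-f-]{36})'
)


def precondition_uids(lines):
    """Return {object name: [distinct precondition UIDs, in order seen]}."""
    seen = {}
    for line in lines:
        match = UID_RE.search(line)
        if not match:
            continue
        uids = seen.setdefault(match.group("name"), [])
        if match.group("uid") not in uids:
            uids.append(match.group("uid"))
    return seen


# Abbreviated versions of the bootkube messages quoted in this report.
SAMPLE = [
    'Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io '
    '"performance-clusterbotpp": StorageError: invalid object, Code: 4, '
    'Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, '
    'UID in object meta:',
    'Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io '
    '"performance-clusterbotpp": StorageError: invalid object, Code: 4, '
    'Precondition failed: UID in precondition: 597dfcf3-012d-4730-912a-78efabb920ba, '
    'UID in object meta:',
]

for name, uids in precondition_uids(SAMPLE).items():
    print(name, len(uids), uids)
```

On the excerpt above this reports two different precondition UIDs for `performance-clusterbotpp` within the same second, so the UID bootkube sends changes between attempts. The excerpt alone does not establish why.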

As a result, the worker nodes do not become ready in time and the installer marks the cluster installation as failed. Ironically, even after the installer returns with a failure, I have sometimes observed that, given enough time, the cluster eventually reconciles and the worker nodes get provisioned.

I am attaching the installation logs from one such run with this issue.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Often

Steps to Reproduce:

1. Try to install a new cluster by placing a PerformanceProfile in the manifests folder

Actual results:

Cluster installation failed.

Expected results:

Cluster installation should succeed.

Additional info:

Also, I didn't observe this occurring in 4.13.9.


log-bundle-20230906143243.tar.gz
5.43 MB
2023/09/06 9:45 PM

blocks

OCPBUGS-17859 Avoid extra reboot with cgroups v2 at day-0 for PerformanceProfile

Closed

depends on

OCPBUGS-29752 day-0 with PerformanceProfile manifest renderer uses invalid uid

Closed

is cloned by

OCPBUGS-29751 day-0 with PerformanceProfile manifest renderer uses invalid uid

Closed

OCPBUGS-25116 Cluster fails to install at day-0 with PerformanceProfile

Closed

is depended on by

OCPBUGS-25115 [4.14] Cluster fails to install at day-0 with PerformanceProfile

Closed

OCPBUGS-25116 Cluster fails to install at day-0 with PerformanceProfile

Closed

is related to

OCPBUGS-19352 Node in NotReady state as unified_cgroup_hierarchy=1 are set

Closed

links to

https://github.com/openshift/cluster-node-tuning-operator/pull/963

openshift/cluster-node-tuning-operator#854: OCPBUGS-18640: Fix Racing Machine Configs and add Day 0 Support

openshift/cluster-node-tuning-operator#871: OCPBUGS-18640: Fix Racing Machine Configs and add Day 0 Support (#854)

openshift/cluster-node-tuning-operator#989: OCPBUGS-18640: [release-4.14][manual] backport performance profile owner reference ehnancements

PR 935

RHBA-2024:1458 OpenShift Container Platform 4.14.z bug fix update

