
Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.z
Affects Version/s: 4.15
Component/s: Node Tuning Operator
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
No
Latest Status Summary:
5/3: The PR is ready but depends on another PR that was blocked by a general CI issue until now. Trending towards merging all required dependencies.

Target Backport Versions:
None
Target Version:

4.15.z
Release Blocker:
Proposed
Sprint:
None

RH Private Keywords:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
Done
Release Note Type:
Known Issue
Release Note Text:

* Providing a performance profile as an extra manifest at Day 0 did not work in 4.15.0 but is now possible in 4.15.2 with the following limitation.

The installation of OCP might fail when a performance profile is present in the extra manifests folder and targets the primary or worker pools. This is caused by the internal install ordering that processes the performance profile before the default primary and worker `MachineConfigPools` are created. It is possible to work around this issue by including a copy of the stock primary or worker `MachineConfigPools` in the extra manifests folder. (link:https://issues.redhat.com/browse/OCPBUGS-27948[*OCPBUGS-27948*], link:https://issues.redhat.com/browse/OCPBUGS-29752[*OCPBUGS-29752*])


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

I picked up 4.14-ec-4 (which uses cgroups v1 as default) and tried to create a cluster with the following PerformanceProfile (and the corresponding MachineConfigPool) by placing both in the manifests folder:

 
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: clusterbotpp
spec:
  cpu:
    isolated: "1-3"
    reserved: "0"
  realTimeKernel:
    enabled: false
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""

and,

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker 
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
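
The release note above recommends shipping a copy of the targeted MachineConfigPool next to the PerformanceProfile. A minimal sketch of a pre-install check for that pairing follows. It is not part of the installer or the Node Tuning Operator: the function names are hypothetical, the regexes are a line-based heuristic rather than a YAML parser, and taking the pool name from the suffix of the `pools.operator.machineconfiguration.openshift.io/<name>` selector key is an assumption. Multi-document YAML files are not handled.

```python
import re
from pathlib import Path

# Heuristic patterns (assumption: one resource per file, block-style YAML).
KIND_RE = re.compile(r"^kind:\s*(\S+)\s*$", re.MULTILINE)
NAME_RE = re.compile(
    r"^metadata:\s*\n(?:[ \t]+.*\n)*?[ \t]+name:\s*(\S+)", re.MULTILINE
)
POOL_KEY_RE = re.compile(
    r"pools\.operator\.machineconfiguration\.openshift\.io/([\w-]+)\s*:"
)


def load_manifests(folder: str) -> dict[str, str]:
    """Read every .yaml/.yml file in the folder into a {filename: text} map."""
    return {p.name: p.read_text() for p in Path(folder).glob("*.y*ml")}


def pools_missing_mcp(manifests: dict[str, str]) -> set[str]:
    """Return pool names targeted by a PerformanceProfile that have no
    MachineConfigPool manifest of the same name in the given files."""
    targeted: set[str] = set()
    defined: set[str] = set()
    for text in manifests.values():
        kind = KIND_RE.search(text)
        if not kind:
            continue
        if kind.group(1) == "PerformanceProfile":
            # Assumption: the pool name is the suffix of the selector key.
            targeted.update(POOL_KEY_RE.findall(text))
        elif kind.group(1) == "MachineConfigPool":
            name = NAME_RE.search(text)
            if name:
                defined.add(name.group(1))
    return targeted - defined
```

Calling `pools_missing_mcp(load_manifests("manifests"))` on a folder holding the two manifests above should return an empty set; a non-empty result names the pools for which the workaround copy is missing. Note that in this report the MachineConfigPool was already present and the install still failed, so the check only covers the limitation described in the release note.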

The cluster often fails to install because bootkube spends a long time retrying on this error:

 
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: [#1717] failed to create some manifests:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: "clusterbotpp_kubeletconfig.yaml": failed to update status for kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 597dfcf3-012d-4730-912a-78efabb920ba, UID in object meta:
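
In the excerpt above, the UID in the precondition changes between attempts (`11f98d74-...` and then `597dfcf3-...`) while the UID in the object meta is empty. A minimal sketch for tallying these failures from a saved bootkube log is shown below; the function name is hypothetical, and the pattern is based only on the line format visible in this excerpt, so other log variants may need adjustments.

```python
import re
from collections import Counter

# Matches only the "Failed to update status for the ..." lines, not the
# lowercase per-manifest summary that repeats the same error.
UID_RE = re.compile(
    r'Failed to update status for the "(?P<manifest>[^"]+)".*?'
    r"UID in precondition: (?P<uid>[0-9a-f-]{36})"
)


def precondition_uids(log_text: str) -> Counter:
    """Count (manifest, precondition UID) pairs seen in bootkube failures."""
    hits: Counter = Counter()
    for line in log_text.splitlines():
        match = UID_RE.search(line)
        if match:
            hits[(match.group("manifest"), match.group("uid"))] += 1
    return hits
```

If one manifest shows several distinct UIDs, as here, that is consistent with the object being created again between retries, although the log alone does not prove it.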

As a result, the worker nodes do not become ready in time, and the installer marks the cluster installation as failed. Ironically, even after the installer returns with a failure, I have sometimes observed that, given enough time, the cluster eventually reconciles and the worker nodes get provisioned.

I am attaching the installation logs from one such run with this issue.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Often

Steps to Reproduce:

1. Try to install a new cluster with a PerformanceProfile placed in the manifests folder.

Actual results:

Cluster installation failed.

Expected results:

Cluster installation should succeed.

Additional info:

Also, I didn't observe this occurring in 4.13.9.

clones

OCPBUGS-29751 day-0 with PerformanceProfile manifest renderer uses invalid uid

Closed

depends on

OCPBUGS-29751 day-0 with PerformanceProfile manifest renderer uses invalid uid

Closed

is depended on by

OCPBUGS-18640 Cluster fails to install at day-0 with PerformanceProfile

Closed

links to

openshift/cluster-node-tuning-operator#963: OCPBUGS-29752: [release-4.15][manual] backport performance profile owner reference ehnancements

RHSA-2024:1210 OpenShift Container Platform 4.15.z security update
