[OCPBUGS-26062] Day 0 PerformanceProfile is failing for SNO and Compact clusters - Red Hat Issue Tracker

Type: Bug
Resolution: Won't Do
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.15.0
Component/s: Node Tuning Operator
Labels: None
Severity: Important
Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

Description of problem:

Since https://github.com/openshift/cluster-node-tuning-operator/pull/854, the preferred way to create a PerformanceProfile is to do it at Day 0.

However, this does not seem to work for SNO and compact clusters when the PerformanceProfile references the master MachineConfigPool (MCP).

Version-Release number of selected component (if applicable):

OpenShift v4.15.0-rc.0

How reproducible:

Tested on BM IPI and SNO BM deployments.

Steps to Reproduce:

1. * Create an install-config.yaml file to deploy a bare-metal IPI OpenShift 4.15.0-rc.0 cluster with compute.workers.replicas set to 0,
   * or create an install-config.yaml file to deploy a bare-metal SNO cluster using the manual method described in the OpenShift documentation (https://docs.openshift.com/container-platform/latest/installing/installing_sno/install-sno-installing-sno.html#install-sno-installing-sno-manually).

2. After running the command {{openshift-install create manifests}}, create the following manifests at Day 0 (they are similar to the ones referenced in https://issues.redhat.com/browse/OCPBUGS-18640):

---
kind: MachineConfigPool
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: master
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: master
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/master: ""

---
kind: PerformanceProfile
apiVersion: performance.openshift.io/v2
metadata:
  name: dpdk
spec:
  cpu:
    isolated: "1-3"
    reserved: "0"
  hugepages:
    defaultHugepagesSize: 2M
    pages:
      - size: 2M
        count: 32
  net:
    userLevelNetworking: true
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: false
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/master: ""
  nodeSelector:
    node-role.kubernetes.io/master: ""
 
3. Deploy the cluster
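
For step 1, the compact (zero-worker) case can be sketched as the install-config.yaml fragment below. This is a minimal sketch, not the reporter's actual file: it assumes the installer's usual schema, in which compute is a list of machine pools and the replicas field of the pool named worker is what the steps above call compute.workers.replicas. All other required fields (baseDomain, platform, pullSecret, and so on) are omitted.

```yaml
# Hypothetical fragment of install-config.yaml (not taken from the report).
# Compact cluster: three control-plane nodes and no dedicated workers,
# so the control-plane nodes also run workloads.
controlPlane:
  name: master
  replicas: 3      # assumption: would be 1 for a single-node (SNO) install
compute:
  - name: worker
    replicas: 0    # no worker nodes, so only the master MCP has nodes
```

With zero workers, the master pool is the only MachineConfigPool that contains nodes, which is why the PerformanceProfile in step 2 selects it.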

Actual results:

The cluster deployment fails at the bootstrapping stage.

For SNO clusters, most of the time the bootkube logs are flooded with the following error:

> journalctl -b -u bootkube.service

bootkube.sh[7451]: [#4] failed to create some manifests:
bootkube.sh[7451]: "performance_profile_dpdk.yaml": failed to create performanceprofiles.v2.performance.openshift.io/dpdk -n : Internal error occurred: failed calling webhook "vwb.performance.openshift.io": failed to call webhook: Post "https://performance-addon-operator-service.openshift-cluster-node-tuning-operator.svc:443/validate-performance-openshift-io-v2-performanceprofile?timeout=10s": no endpoints available for service "performance-addon-operator-service"

For compact clusters (and for SNO clusters when they do not already fail with the error above), the machine-config-controller logs are flooded with the following errors:

> oc -n openshift-machine-config-operator logs deployment/machine-config-controller -c machine-config-controller

I1223 14:10:24.299182       1 kubelet_config_controller.go:491] KubeletConfig performance-dpdk has been deleted
W1223 14:10:25.095025       1 kubelet_config_controller.go:462] error updating the kubelet config with annotation key "machineconfiguration.openshift.io/mc-name-suffix" and value "": kubeletconfig.machineconfiguration.openshift.io "performance-dpdk" not found
W1223 14:10:25.095050       1 kubelet_config_controller.go:429] error updating kubeletconfig status: kubeletconfig.machineconfiguration.openshift.io "performance-dpdk" not found
I1223 14:10:25.095060       1 kubelet_config_controller.go:332] Error syncing kubeletconfig performance-dpdk: kubeletconfig.machineconfiguration.openshift.io "performance-dpdk" not found
I1223 14:10:25.133332       1 node_controller.go:1035] No nodes available for updates
I1223 14:10:25.133603       1 status.go:224] Degraded Machine: cnvqe-08.lab.eng.tlv2.redhat.com and Degraded Reason: machineconfig.machineconfiguration.openshift.io "rendered-master-82d8570749169c031983cc3e9151d030" not found

Additional info:

Simply creating a Tuned resource at Day 0 also seems to fail for SNO and compact clusters:

---
kind: Tuned
apiVersion: tuned.openshift.io/v1
metadata:
  name: hugepages
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
    - name: openshift-node-hugepages
      data: |
        [main]
        summary=Boot time configuration for hugepages
        include=openshift-node
        [bootloader]
        cmdline_openshift_node_hugepages=default_hugepagesz=2M hugepages=32
  recommend:
    - machineConfigLabels:
        machineconfiguration.openshift.io/role: "master"
      priority: 25
      profile: openshift-node-hugepages

> oc -n openshift-machine-config-operator logs deployment/machine-config-controller -c machine-config-controller

I1222 21:35:08.908410       1 status.go:224] Degraded Machine: cnvqe-03.lab.eng.tlv2.redhat.com and Degraded Reason: machineconfig.machineconfiguration.openshift.io "rendered-master-f3b3143b5d67b2efcb405cb1051662a4" not found

> oc -n openshift-machine-config-operator logs daemonset/machine-config-daemon -c machine-config-daemon

I1222 21:26:28.144081   15114 node.go:52] Setting initial node config: rendered-master-f3b3143b5d67b2efcb405cb1051662a4
I1222 21:26:28.152814   15114 daemon.go:1495] In bootstrap mode
E1222 21:26:28.152954   15114 writer.go:226] Marking Degraded due to: machineconfig.machineconfiguration.openshift.io "rendered-master-f3b3143b5d67b2efcb405cb1051662a4" not found

duplicates: OCPBUGS-22095 PerformanceProfile render fails at Day-0 because the master/worker pools are not yet present (Closed)

is related to: OCPBUGS-25300 OCP SNO RAN DU deployment has additional reboot (Closed)
