Details
-
Bug
-
Resolution: Not a Bug
-
Normal
-
None
-
4.13, 4.14
-
No
-
False
-
Description
Description of problem:
All nodes in the latest OCP 4.13 and 4.14 nightly releases consistently report kernel taint value 4.
Version-Release number of selected component (if applicable):
We started detecting this issue with the following OCP 4.13-4.14 releases:
- OpenShift 4.14 nightly 2023-03-08 19:42
- OpenShift 4.13 nightly 2023-03-14 05:37
This started happening once the nodes began using the CentOS Stream CoreOS operating system instead of RHCOS, which appears to be the default for these latest OCP releases.
How reproducible:
100%
Steps to Reproduce:
1. Deploy the latest OCP 4.13-4.14 nightly version in a cluster with 3 master nodes and 3 worker nodes.
2. On each node, check the /proc/sys/kernel/tainted file.
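Step 2 can be automated across the cluster. The following is only a rough sketch, not the procedure used in the DCI jobs: it assumes the `oc` client is logged in with enough privileges to run `oc debug` against nodes, and the helper names `run` and `tainted_by_node` are hypothetical.

```python
import subprocess


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def tainted_by_node(runner=run):
    """Map each node (as 'node/<name>') to the value in its tainted file."""
    # 'oc get nodes -o name' prints one 'node/<name>' entry per line.
    nodes = runner(["oc", "get", "nodes", "-o", "name"]).split()
    result = {}
    for node in nodes:
        out = runner(["oc", "debug", node, "--", "chroot", "/host",
                      "cat", "/proc/sys/kernel/tainted"])
        # Assumption: the file content is the last line printed on stdout.
        result[node] = int(out.strip().splitlines()[-1])
    return result


# Example use against a live cluster (assumed, not run here):
#   for node, value in tainted_by_node().items():
#       print(node, value)
```

The `runner` parameter exists only so the function can be exercised without a cluster by passing a stand-in that returns canned output.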
Actual results:
All nodes report kernel taint value 4: "kernel running on an out of specification system" (taint bit 2).
Expected results:
Under normal circumstances (i.e. nodes without taints), all nodes should report a kernel taint value of 0.
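For reference, the value in /proc/sys/kernel/tainted is a bitmask, so 4 means that only bit 2 is set. A minimal sketch of decoding it follows; the `TAINT_FLAGS` table is an illustrative subset written for this report and `decode_taint` is a hypothetical helper, not part of any tool used in the jobs.

```python
# Illustrative subset of kernel taint flags (bit position -> meaning).
TAINT_FLAGS = {
    0: "proprietary module was loaded",
    1: "module was force loaded",
    2: "kernel running on an out of specification system",
    3: "module was force unloaded",
    4: "processor reported a Machine Check Exception (MCE)",
    9: "kernel issued warning",
    12: "externally-built (out-of-tree) module was loaded",
    13: "unsigned module was loaded",
}


def decode_taint(value):
    """Return (bit, meaning) pairs for every bit set in the taint value."""
    flags = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            flags.append((bit, TAINT_FLAGS.get(bit, "not listed in this sketch")))
        bit += 1
    return flags


if __name__ == "__main__":
    for bit, meaning in decode_taint(4):
        print(f"bit {bit}: {meaning}")
```

An untainted node yields an empty list, which matches the expected value of 0.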
Additional info:
We test OCP deployments with Distributed-CI, deploying the cluster through an IPI installation on baremetal nodes. After the deployment, we ran the CNF Cert Suite to validate the deployed workloads, and the test related to kernel taints failed because of this. Some DCI jobs where this effect was observed:
- OCP 4.13 installation: https://www.distributed-ci.io/jobs/db4c5de3-5c24-4de3-a298-23759f606a7d/jobStates
- CNF Cert Suite execution: https://www.distributed-ci.io/jobs/a229f9d1-0438-47d5-acc9-e77e4751c521/jobStates?sort=date
- OCP 4.14 installation: https://www.distributed-ci.io/jobs/aa8cf280-4f41-4e7b-b756-451b79d8de5b/jobStates
- CNF Cert Suite execution: https://www.distributed-ci.io/jobs/bab0a0a1-5c6a-4c23-821a-148ed76c9386/jobStates?sort=date
Note that I am not able to upload files to this Jira card. All the files related to the deployments can be seen on the Distributed-CI jobs we have launched:
- OCP 4.13: https://www.distributed-ci.io/jobs/db4c5de3-5c24-4de3-a298-23759f606a7d/files -> then, check must_gather.tar.gz, journal<node>.log, <node>-console.log and openshift_install.log
- OCP 4.14: https://www.distributed-ci.io/jobs/aa8cf280-4f41-4e7b-b756-451b79d8de5b/files -> the same
More OS-related information, extracted from a similar case I opened some time ago (OCPBUGS-3083):
OCP Version at Install Time: First appeared in OpenShift 4.14 nightly 2023-03-08 19:42 / OpenShift 4.13 nightly 2023-03-14 05:37
(for the following data, I'll focus on a deployment based on OpenShift 4.13 nightly 2023-03-14 05:37, but the same applies to OCP 4.14)
CentOS Stream CoreOS (NOT COREOS) Version at Install Time: (from /etc/os-release) -> CentOS Stream CoreOS 413.92.202303061740-0 (Plow) / (kernel version) -> 5.14.0-282.el9.x86_64
OCP Version after Upgrade (if applicable): -
RHCOS Version after Upgrade (if applicable): -
Platform (AWS, Azure, bare metal, GCP, vSphere, etc.): baremetal - we're following IPI deployment using baremetal-deployment
Architecture (x86_64, ppc64le, s390x, etc.): x86_64
If you're having problems booting/installing CentOS Stream, please provide: -> All nodes using CentOS Stream have kernel taint # 4.
- Reproduction steps that work with a single CentOS Stream node
Checking the tainted file shows the following (it should be 0 on a healthy node):
$ cat /proc/sys/kernel/tainted
4
- The full contents of the serial console showing disk initialization, network configuration, and Ignition stages. See this article for information about configuring your serial console. Screenshots or a video recording of the console is usually not sufficient -> please check the console logs attached to this case.
- Ignition JSON -> I was not able to retrieve it, but in OCPBUGS-3083, it was not really needed in the end.
- Output of journalctl -b -> please check journal files attached.
If you're having problems post-upgrade, please provide: -> not having it, but also providing must-gather just in case.
- A complete must-gather (oc adm must-gather) -> check must_gather provided.
If you're having SELinux related issues, please provide: -> not having SELinux issues.
- The full /var/log/audit/audit.log file
- Were any SELinux modules or booleans changed from the default configuration?
- The output of ostree admin config-diff | grep selinux/targeted on impacted nodes
Please add anything else that might be useful, for example:
- Kernel command line (cat /proc/cmdline)
# example with a master node, but it applies to the other nodes:
$ cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-8bb3298191b10a91e3d87a8f67872865cb6d42a8ba72cbcfd865b42b77396813/vmlinuz-5.14.0-282.el9.x86_64 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/8bb3298191b10a91e3d87a8f67872865cb6d42a8ba72cbcfd865b42b77396813/0 ip=dhcp root=UUID=9d8b9428-4760-4385-b1ef-7541c3b01a6c rw rootflags=prjquota boot=UUID=61b17ccf-b51f-4659-823b-1efcf6cb6f42 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1
- Contents of /etc/NetworkManager/system-connections/
# example with a master node, but it applies to the other nodes:
$ ls /etc/NetworkManager/system-connections/
bond0.360.nmconnection bond0.nmconnection ens1f0.nmconnection ens1f1.nmconnection
$ sudo cat /etc/NetworkManager/system-connections/bond0.360.nmconnection
[connection]
id=bond0.360
type=vlan
interface-name=bond0.360
autoconnect=true
autoconnect-priority=99
[vlan]
parent=bond0
id=360
[ethernet]
mtu=9000
[ipv4]
method=auto
dhcp-timeout=2147483647
never-default=true
[ipv6]
method=disabled
$ sudo cat /etc/NetworkManager/system-connections/bond0.nmconnection
[connection]
id=bond0
type=bond
interface-name=bond0
autoconnect=true
connection.autoconnect-slaves=1
autoconnect-priority=99
[ethernet]
mtu=9000
[bond]
mode=802.3ad
[ipv4]
method=auto
dhcp-timeout=2147483647
[ipv6]
method=disabled
$ sudo cat /etc/NetworkManager/system-connections/ens1f0.nmconnection
[connection]
id=ens1f0
type=ethernet
interface-name=ens1f0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99
[ethernet]
$ sudo cat /etc/NetworkManager/system-connections/ens1f1.nmconnection
[connection]
id=ens1f1
type=ethernet
interface-name=ens1f1
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99
[ethernet]
- Contents of /etc/sysconfig/network-scripts/
# example with a master node, but it applies to the other nodes:
$ ls /etc/sysconfig/network-scripts/
# empty