Bug
Resolution: Done
Normal
4.14.z
Quality / Stability / Reliability
False
x86_64, ppc64le, s390x, aarch64
While testing my restoration of the PR for the RT team's jobs, I discovered that my s390x node didn't recover after I crashed it. I ssh'd to the other node and discovered that the kdump service, while enabled, didn't run successfully because no memory was allocated for the crash kernel.
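A minimal sketch of how that state could be checked on a node, assuming a Linux host that exposes `/proc/cmdline` and `/sys/kernel/kexec_crash_size`. The helper names (`crashkernel_args`, `summarize`, `read_node_state`) are hypothetical and not part of any kdump tooling.

```python
from __future__ import annotations

from pathlib import Path


def crashkernel_args(cmdline: str) -> list[str]:
    """Return every crashkernel= token found on a kernel command line."""
    return [tok for tok in cmdline.split() if tok.startswith("crashkernel=")]


def summarize(cmdline: str, kexec_crash_size_text: str) -> str:
    """Describe whether the running kernel reserved memory for a crash kernel."""
    args = crashkernel_args(cmdline)
    reserved = int(kexec_crash_size_text.strip())
    if not args:
        return "no crashkernel= argument on the kernel command line"
    if reserved == 0:
        return f"{args[-1]} is present but no memory reserved"
    return f"{args[-1]} reserved {reserved} bytes"


def read_node_state() -> str:
    """Read the live values; only meaningful when run on the node itself."""
    cmdline = Path("/proc/cmdline").read_text()
    size_text = Path("/sys/kernel/kexec_crash_size").read_text()
    return summarize(cmdline, size_text)
```

Calling `print(read_node_state())` on the affected node would show whether the `crashkernel=` argument reached the kernel at all, or whether it arrived but produced a zero-byte reservation.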
I suspect this is a configuration problem. Here is the MCO resource definition:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-kdump
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,cGF0aCAvdmFyL2NyYXNoCmNvcmVfY29sbGVjdG9yIG1ha2VkdW1wZmlsZSAtbCAtLW1lc3NhZ2UtbGV2ZWwgNyAtZCAzMQo=
          mode: 420
          overwrite: true
          path: /etc/kdump.conf
        - contents:
            source: data:text/plain;charset=utf-8;base64,S0RVTVBfQ09NTUFORExJTkVfUkVNT1ZFPSJodWdlcGFnZXMgaHVnZXBhZ2VzeiBzbHViX2RlYnVnIHF1aWV0IGxvZ19idWZfbGVuIHN3aW90bGIgaHVnZXRsYl9jbWEgaWduaXRpb24uZmlyc3Rib290IHJkLm11bHRpcGF0aD1kZWZhdWx0IgpLRFVNUF9DT01NQU5ETElORV9BUFBFTkQ9ImlycXBvbGwgbWF4Y3B1cz0xIG5vaXJxZGlzdHJpYiByZXNldF9kZXZpY2VzIGNncm91cF9kaXNhYmxlPW1lbW9yeSBudW1hPW9mZiB1ZGV2LmNoaWxkcmVuLW1heD0yIGVoZWEudXNlX21jcz0wIHBhbmljPTEwIGt2bV9jbWFfcmVzdl9yYXRpbz0wIHRyYW5zcGFyZW50X2h1Z2VwYWdlPW5ldmVyIG5vdm1jb3JlZGQgaHVnZXRsYl9jbWE9MCBzcmN1dHJlZS5iaWdfY3B1X2xpbT0wIgpLRVhFQ19BUkdTPSItLWR0LW5vLW9sZC1yb290IC1zIgpLRFVNUF9JTUc9InZtbGludXoiCg==
          mode: 420
          overwrite: true
          path: /etc/sysconfig/kdump
    systemd:
      units:
        - enabled: true
          name: kdump.service
  kernelArguments:
    - crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G"
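The file contents in the MachineConfig above are base64 `data:` URLs, so they are not readable at a glance. The following sketch decodes the `/etc/kdump.conf` payload for review; `decode_data_url` is a hypothetical helper name, and the same call should work on the `/etc/sysconfig/kdump` payload.

```python
import base64


def decode_data_url(url: str) -> str:
    """Decode a base64 data: URL (as used by Ignition file sources) to text."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data: URL")
    return base64.b64decode(payload).decode("utf-8")


# Source string copied from the /etc/kdump.conf entry in the MachineConfig.
KDUMP_CONF_SOURCE = (
    "data:text/plain;charset=utf-8;base64,"
    "cGF0aCAvdmFyL2NyYXNoCmNvcmVfY29sbGVjdG9yIG1ha2VkdW1wZmlsZSAtbCAtLW1lc3Nh"
    "Z2UtbGV2ZWwgNyAtZCAzMQo="
)

print(decode_data_url(KDUMP_CONF_SOURCE))
# path /var/crash
# core_collector makedumpfile -l --message-level 7 -d 31
```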
I discovered this on s390x, but I suspect it affects all jobs using the current kdump steps.
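One way to gauge which nodes might be affected is to evaluate the `crashkernel` range expression against each node's memory size. The sketch below follows the documented `start-[end]:size` range syntax, where the first range containing the RAM size wins. It is an approximation: the kernel's own accounting of system RAM may differ, the function names are hypothetical, and whether the ranges are the cause of this bug has not been established.

```python
import re

_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text: str) -> int:
    """Convert a size such as '384M' or '2G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def crashkernel_reservation(spec: str, ram_bytes: int) -> int:
    """Return the bytes the range spec would reserve for ram_bytes, or 0."""
    # The MachineConfig value carries literal double quotes; drop them here.
    spec = spec.strip().strip('"')
    for entry in spec.split(","):
        range_part, size_part = entry.split(":")
        start_text, end_text = range_part.split("-")
        start = parse_size(start_text)
        end = parse_size(end_text) if end_text else None  # open-ended range
        if ram_bytes >= start and (end is None or ram_bytes < end):
            return parse_size(size_part)
    return 0  # no range matched, so nothing would be reserved


SPEC = '"2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G"'
GIB = 1 << 30
for ram_gib in (1, 2, 8, 256):
    reserved = crashkernel_reservation(SPEC, ram_gib * GIB)
    print(f"{ram_gib} GiB RAM -> {reserved >> 20} MiB reserved")
```

Under these assumptions, a node with less than 2G of RAM matches no range and would get no reservation, which is one configuration worth ruling out when comparing architectures.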