MicroShift Runtime Problem Analysis - CORRECTED
================================================================================

Job Information:
- Job ID: 2000096273457221632
- Pull Request: #5901 (RHOAI 2.25 bump)
- Job Type: pull-ci-openshift-microshift-release-4.20-e2e-aws-ai-model-serving
- Status: FAILURE
- MicroShift Version: 4.20.0_0.nightly_2025_12_10_143047_20251211075838_2fcb37f94

SUMMARY
================================================================================

MicroShift is crashing with a segmentation fault (SIGSEGV) in the kustomize
library while processing the built-in KServe AI Model Serving manifest. The
crash occurs in the PatchTransformer phase, immediately after kustomize warns
about deprecated syntax ('vars' and 'commonLabels') in that kustomization. The
result is a crash-restart loop roughly every 26 seconds, which prevents
MicroShift from completing startup.

CRITICAL FINDINGS
================================================================================

Problem 1: Kustomize Crash Processing KServe Manifest
--------------------------------------------------------------------------------
Severity: Critical
Component: MicroShift - Kustomize / KServe AI Model Serving Manifests

**CRASHING MANIFEST IDENTIFIED:**
/usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve

Evidence from logs:

  Dec 14 07:43:27 ... Applying kustomization at /usr/lib/microshift/manifests.d/003-microshift-observability was successful.
  Dec 14 07:43:27 ... Applying kustomization at /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve
  # Warning: 'vars' is deprecated. Please use 'replacements' instead.
  # Warning: 'commonLabels' is deprecated. Please use 'labels' instead.
  Dec 14 07:43:28 ... [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x39cdd25]

Deprecated Kustomize Features Detected:
- 'vars' (should use 'replacements')
- 'commonLabels' (should use 'labels')

Stack Trace (consistent across all 66 crashes):

  goroutine [running]:
  sigs.k8s.io/kustomize/kyaml/yaml.(*RNode).Content(...)
      vendor/sigs.k8s.io/kustomize/kyaml/yaml/rnode.go:724
  sigs.k8s.io/kustomize/kyaml/yaml.(*RNode).getMapFieldValue
      vendor/sigs.k8s.io/kustomize/kyaml/yaml/rnode.go:437 +0x45
  sigs.k8s.io/kustomize/kyaml/yaml.(*RNode).GetApiVersion(...)
      vendor/sigs.k8s.io/kustomize/kyaml/yaml/rnode.go:419
  sigs.k8s.io/kustomize/kyaml/resid.GvkFromNode
      vendor/sigs.k8s.io/kustomize/kyaml/resid/gvk.go:32 +0x4b
  sigs.k8s.io/kustomize/api/internal/builtins.(*PatchTransformerPlugin).transformStrategicMerge
      vendor/sigs.k8s.io/kustomize/api/internal/builtins/PatchTransformer.go:112 +0x2e2

Root Cause Analysis:
The crash occurs in the kustomize library while it processes the KServe AI
Model Serving kustomization. The manifest uses deprecated kustomize features
('vars' and 'commonLabels'), and the PatchTransformer hits a nil pointer
dereference during strategic merge patch processing. The crash location at
rnode.go:724 indicates that the RNode object is nil, or has nil internal
state, when kustomize tries to read the YAML content of the resource being
patched.

**This is directly related to PR #5901 (RHOAI 2.25 bump)**, which likely:
1. Updated the KServe manifests to the ones shipped with RHOAI 2.25
2. Introduced or modified kustomization patches
3. Triggered a bug in the kustomize library when processing these patches

The deprecated syntax may also be interacting poorly with the kustomize
version vendored in MicroShift 4.20.

Crash Statistics:
- Total MicroShift restarts: 66
- Crash interval: approximately every 26 seconds
- Consistent failure point: always while processing the KServe kustomization
- Processing order: crashes after the observability manifests
  (003-microshift-observability) have been applied successfully

Recommendations:

1. **Immediate Fix**: Update PR #5901 to fix the KServe kustomization:
   - Convert 'vars' to 'replacements'
   - Convert 'commonLabels' to 'labels'
   - Run 'kustomize edit fix' on the kustomization
   - Test the kustomization standalone before integrating

2. **Validate Kustomization**: Test the KServe manifest independently:
   ```bash
   # On a test system with the manifests from the PR build installed
   cd /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve
   kustomize build .
   ```

3. **Check Kustomize Version Compatibility**: Verify that the RHOAI 2.25
   KServe manifests are compatible with the kustomize version vendored in
   MicroShift 4.20.

4. **Temporary Workaround** (for testing only): Disable KServe by removing or
   renaming the directory:
   ```bash
   mv /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve \
      /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve.disabled
   systemctl restart microshift
   ```

5. **PR #5901 Review Focus**:
   - Review all changes to KServe manifests
   - Check kustomization.yaml for deprecated features
   - Verify patches are correctly formatted
   - Test manifest processing in isolation

Problem 2: Greenboot Healthcheck Timeout (Secondary Issue)
--------------------------------------------------------------------------------
Severity: High
Component: Greenboot

Evidence:
From the test log (build-log.txt):

  ERROR: /home/ec2-user/microshift/scripts/ci-ai-model-serving/tests/02-wait-for-greenboot.sh failed

Root Cause Analysis:
The greenboot healthcheck timeout is a SYMPTOM, not the root cause. Greenboot
failed because MicroShift never started successfully due to the KServe
manifest crash.

Timeline:
- Test started at 06:24:51 UTC
- Greenboot wait script started at ~07:28:00 UTC
- Script timed out after 30 retries (15 minutes) at ~07:43:20 UTC
- MicroShift was crash-looping on the KServe manifest throughout this window
  (66 restarts recorded in total)

Conclusion:
This issue will resolve automatically once the KServe kustomization is fixed.
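
To find the deprecated fields that need converting, a scan along the following lines may help. This is a minimal sketch, not part of MicroShift or kustomize: the function name `find_deprecated_fields` is hypothetical, and it assumes the deprecated keys appear as top-level, unindented fields in files with one of the standard kustomization file names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list kustomization files under a root directory that
# still use the deprecated top-level 'vars' or 'commonLabels' fields.
find_deprecated_fields() {
  local root="$1"
  # grep -E: extended regex, -n: print line numbers, -H: always print file name.
  # The pattern only matches unindented (top-level) keys.
  find "$root" -type f \
    \( -name 'kustomization.yaml' -o -name 'kustomization.yml' -o -name 'Kustomization' \) \
    -exec grep -EnH '^(vars|commonLabels):' {} +
}

# Usage (path taken from the analysis above):
#   find_deprecated_fields /usr/lib/microshift/manifests.d
```

Each hit is printed as `file:line:field`. The exit status is not a reliable signal here (grep returns non-zero for files without a match), so it is safer to check whether any output was produced.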

CORRECTED DIAGNOSIS
================================================================================

Initial Diagnosis (INCORRECT):
- Suspected: NVIDIA device plugin in
  /etc/microshift/manifests.d/10-nvidia-device-plugin/
- Reason: Only custom manifest found in SOS report

Corrected Diagnosis (CORRECT):
- Actual culprit: KServe AI Model Serving in
  /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve
- Reason: Log analysis shows the crash occurs immediately after MicroShift
  starts processing this specific kustomization
- Evidence: Warnings about deprecated 'vars' and 'commonLabels' appear just
  before the crash

The NVIDIA device plugin is NOT involved in this crash. It lives in a
different directory (/etc/microshift/manifests.d/) and would be processed
after the built-in manifests if MicroShift ever got that far.

SYSTEM STATE AT SOS REPORT COLLECTION
================================================================================

Collection Time: Dec 14 07:43:46 UTC (only 18 seconds into the 66th MicroShift
startup attempt)

MicroShift Service Status:
  Active: activating (start) since Sun 2025-12-14 07:43:28 UTC; 18s ago
  Main PID: 12371 (microshift)
  Tasks: 14
  Memory: 337.4M
  CPU: 10.751s

Manifest Processing Order (before crash):
1. /usr/lib/microshift/manifests.d/003-microshift-observability - SUCCESS
2. /usr/lib/microshift/manifests.d/010-microshift-ai-model-serving-kserve - CRASH

Configuration:
- Network: Multus enabled, ClusterNetwork: 10.42.0.0/16, ServiceNetwork: 10.43.0.0/16
- No configuration errors detected

NEXT STEPS
================================================================================

1. **FIX PR #5901**: Update the KServe kustomization to remove deprecated syntax
2. **Run kustomize edit fix**: Automatically update the kustomization.yaml
3. **Test manifest independently**: Validate with standalone kustomize before
   integrating
4. **Verify kustomize version**: Ensure the RHOAI 2.25 KServe manifests are
   compatible with the kustomize version vendored in MicroShift 4.20
5. **Add CI validation**: Include kustomize validation in PR checks
6. **Review RHOAI 2.25 changes**: Identify what changed in KServe manifests

DIRECT LINK TO ROOT CAUSE
================================================================================

The crash is caused by PR #5901's RHOAI 2.25 bump, which introduced or modified
the KServe kustomization in a way that triggers a bug when processed by the
kustomize library. The use of deprecated syntax ('vars' and 'commonLabels') is
a strong indicator of where the problem lies.

This is a regression introduced by PR #5901 and must be fixed in that PR before
it can merge.

TEST FAILURE CORRELATION
================================================================================

The AI Model Serving E2E test failed because:
1. PR #5901 introduced broken KServe manifests
2. MicroShift could not start because kustomize crashed while processing those
   manifests
3. The greenboot healthcheck timed out waiting for a service that would never
   become healthy
4. The test correctly identified this as a failure

This IS an AI Model Serving issue - specifically a problem with the RHOAI 2.25
KServe manifest kustomization introduced in PR #5901.
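
The CI validation proposed under NEXT STEPS could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing MicroShift check: the function name `validate_kustomizations` and the `KUSTOMIZE_BUILD` override are hypothetical, and it assumes that `kustomize build` prints its deprecation warnings on stderr and exits non-zero when it fails or panics.

```bash
#!/usr/bin/env bash
# Hypothetical pre-merge check: render every kustomization directly under a
# root directory and fail on build errors or deprecation warnings.

# Command used to render one directory. It is overridable so the check can be
# exercised without kustomize installed (an assumption made for this sketch).
KUSTOMIZE_BUILD="${KUSTOMIZE_BUILD:-kustomize build}"

validate_kustomizations() {
  local root="$1" dir warnings status rc=0
  for dir in "$root"/*/; do
    # Skip directories that do not contain a kustomization.
    [ -f "${dir}kustomization.yaml" ] || continue
    # Capture stderr only; the rendered YAML on stdout is discarded.
    if warnings="$($KUSTOMIZE_BUILD "$dir" 2>&1 >/dev/null)"; then
      status=0
    else
      status=$?
    fi
    if [ "$status" -ne 0 ]; then
      echo "FAIL (exit $status): $dir"
      rc=1
    elif printf '%s\n' "$warnings" | grep -q "is deprecated"; then
      echo "DEPRECATED: $dir"
      rc=1
    else
      echo "OK: $dir"
    fi
  done
  return "$rc"
}

# Usage (path taken from the analysis above):
#   validate_kustomizations /usr/lib/microshift/manifests.d
```

Treating deprecation warnings as failures is a deliberately strict choice for this sketch; a real check might only report them.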

ARTIFACTS
================================================================================

Job URL:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_microshift/5901/pull-ci-openshift-microshift-release-4.20-e2e-aws-ai-model-serving/2000096273457221632

SOS Report:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_microshift/5901/pull-ci-openshift-microshift-release-4.20-e2e-aws-ai-model-serving/2000096273457221632/artifacts/e2e-aws-ai-model-serving/openshift-microshift-infra-sos-aws/artifacts/sosreport-i-0be41bcaac1ad25a8-2025-12-14-rxhffej.tar.xz

Build Log:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_microshift/5901/pull-ci-openshift-microshift-release-4.20-e2e-aws-ai-model-serving/2000096273457221632/artifacts/e2e-aws-ai-model-serving/openshift-microshift-e2e-bare-metal-tests/build-log.txt

Pull Request:
https://github.com/openshift/microshift/pull/5901

MicroShift Journal (search for "010-microshift-ai-model-serving-kserve"):
Shows the exact moment of the crash during KServe manifest processing

Analysis generated: 2025-12-14
Corrected after identifying actual crashing manifest from logs
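
For the journal search listed under ARTIFACTS, a small filter like the one below may make the culprit easier to spot. It is a sketch only: the function name `crash_culprits` is hypothetical, and it assumes the journal contains the "Applying kustomization at ..." and "signal SIGSEGV" strings exactly as quoted in the evidence above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read journal text on stdin and count, per kustomization
# path, how often a SIGSEGV followed an "Applying kustomization" line that had
# not yet been reported as successful.
crash_culprits() {
  awk '
    /Applying kustomization at / {
      path = $0
      # Keep only what follows the marker text.
      sub(/.*Applying kustomization at /, "", path)
      if (path ~ / was successful\.?$/) {
        # This kustomization finished, so nothing is pending.
        pending = ""
      } else {
        # Drop any trailing text after the path itself.
        sub(/[ \t].*$/, "", path)
        pending = path
      }
      next
    }
    /signal SIGSEGV/ && pending != "" {
      count[pending]++
      pending = ""
    }
    END {
      for (p in count) printf "%d %s\n", count[p], p
    }
  '
}

# Usage on the affected host or on the journal extracted from the SOS report:
#   journalctl -u microshift --no-pager | crash_culprits
```

Each output line is a crash count followed by the kustomization that was being applied when the crash happened.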