-
Bug
-
Resolution: Not a Bug
-
Undefined
-
None
-
4.14
-
Important
-
No
-
CNF RAN Sprint 239
-
1
-
False
-
-
-
7/18: suspecting an automation or lab issue - working on a reproduction.
Description of problem:
Following node restart, openshift-bare-metal-events consumer pod is running but does not receive any new events. Prior to reboot, events are received normally, even after recovery from hw-event-proxy app restart, cloud-event-sidecar crash and consumer app restart. This does not occur when tested on Dell hardware.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
On HPE bare metal SNO running bare metal event relay 4.14:
1. Deploy consumer app
2. Confirm events are received by consumer app:
$ oc logs -n openshift-bare-metal-events consumer-766b977d84-6jbb6 cloud-event-consumer -f [...] time="2023-07-11T19:01:13Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:13.741275348Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:12Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.FanNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}" time="2023-07-11T19:01:14Z" level=info msg="Latency for the event: 0 ms" time="2023-07-11T19:01:14Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:14.148292967Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:13Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.DiskNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}" time="2023-07-11T19:01:14Z" level=info msg="Latency for the event: 0 ms" time="2023-07-11T19:01:14Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:14.594085811Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:13Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.PowerNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}" time="2023-07-11T19:01:14Z" level=info msg="Latency for the event: 0 ms" time="2023-07-11T19:01:14Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:14.818322017Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:13Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.MemoeryNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}" time="2023-07-11T19:01:46Z" level=info msg="Latency for the event: 1 ms" time="2023-07-11T19:01:46Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:46.272037861Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:45Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.TempNotif1\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}"
3. Reboot SNO. When system has rebooted, cluster and pods are running, check consumer for events:
$ oc logs -n openshift-bare-metal-events consumer-766b977d84-6jbb6 cloud-event-consumer -f time="2023-07-11T19:08:20Z" level=info msg="checking for rest service health\n" time="2023-07-11T19:08:20Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:20Z" level=warning msg="try 0, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:22Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:22Z" level=warning msg="try 1, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:24Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:24Z" level=warning msg="try 2, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:26Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:26Z" level=warning msg="try 3, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:28Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:28Z" level=warning msg="try 4, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:30Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:30Z" level=warning msg="try 5, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:32Z" level=info msg="checking for rest service health\n" time="2023-07-11T19:08:32Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:32Z" level=warning msg="try 0, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:34Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:34Z" level=warning msg="try 1, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:36Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:36Z" level=warning msg="try 2, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:38Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:38Z" level=warning msg="try 3, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:40Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:40Z" level=warning msg="try 4, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:42Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:42Z" level=warning msg="try 5, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:44Z" level=info msg="checking for rest service health\n" time="2023-07-11T19:08:44Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:44Z" level=warning msg="try 0, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:46Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:46Z" level=warning msg="try 1, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:48Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:48Z" level=warning msg="try 2, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:50Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:50Z" level=warning msg="try 3, return health check of the rest service for error Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused" time="2023-07-11T19:08:52Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health " time="2023-07-11T19:08:52Z" level=info msg="rest service returned healthy status" time="2023-07-11T19:08:54Z" level=info msg="created subscription: EndpointURI: http://127.0.0.1:9089/event\n URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n ID: 4171d66b-7279-4444-b366-35808f8c41d2\n Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n" time="2023-07-11T19:08:54Z" level=info msg="created subscription: EndpointURI: http://127.0.0.1:9089/event\n URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n ID: 4171d66b-7279-4444-b366-35808f8c41d2\n Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n" time="2023-07-11T19:08:54Z" level=info msg="created subscription: EndpointURI: http://127.0.0.1:9089/event\n URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n ID: 4171d66b-7279-4444-b366-35808f8c41d2\n Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n" time="2023-07-11T19:08:54Z" level=info msg="created subscription: EndpointURI: http://127.0.0.1:9089/event\n URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n ID: 4171d66b-7279-4444-b366-35808f8c41d2\n Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n" time="2023-07-11T19:08:54Z" level=info msg="created subscription: EndpointURI: http://127.0.0.1:9089/event\n URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n ID: 4171d66b-7279-4444-b366-35808f8c41d2\n Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n" time="2023-07-11T19:08:54Z" level=info msg="waiting for events"
Actual results:
consumer is running but not receiving new events
Expected results:
consumer recovers and receives events, just as it does after recovery from hw-event-proxy app restart, cloud-event-sidecar crash and consumer app restart.
Additional info:
spoke must-gather here: https://drive.google.com/file/d/1znSKQiCT7WFMRd2A-VtVUhVBLaUeZspn/view?usp=sharing