Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-16062

bmer consumer does not collect events after node restart on HPE bare metal SNO

XMLWordPrintable

    • Important
    • No
    • CNF RAN Sprint 239
    • 1
    • False
    • Hide

      None

      Show
      None
    • 7/18: suspecting an automation or lab issue - working on a reproduction.

      Description of problem:

      Following node restart, openshift-bare-metal-events consumer pod is running but does not receive any new events.
      
      Prior to reboot, events are received normally, even after recovery from hw-event-proxy app restart, cloud-event-sidecar crash and consumer app restart.
      
      This does not occur when tested on Dell hardware.

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      100%

      Steps to Reproduce:

      On HPE bare metal SNO running bare metal event relay 4.14:
      1. Deploy consumer app
      2. Confirm events are received by consumer app:

      $ oc logs -n openshift-bare-metal-events consumer-766b977d84-6jbb6  cloud-event-consumer -f
      [...]
      time="2023-07-11T19:01:13Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:13.741275348Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:12Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.FanNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}"
      time="2023-07-11T19:01:14Z" level=info msg="Latency for the event: 0 ms"
      time="2023-07-11T19:01:14Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:14.148292967Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:13Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.DiskNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}"
      time="2023-07-11T19:01:14Z" level=info msg="Latency for the event: 0 ms"
      time="2023-07-11T19:01:14Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:14.594085811Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:13Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.PowerNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}"
      time="2023-07-11T19:01:14Z" level=info msg="Latency for the event: 0 ms"
      time="2023-07-11T19:01:14Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:14.818322017Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:13Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.MemoeryNotif2\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}"
      time="2023-07-11T19:01:46Z" level=info msg="Latency for the event: 1 ms"
      time="2023-07-11T19:01:46Z" level=info msg="received event {\"id\":\"2e30e91a-01d2-4524-8468-099fc288108a\",\"type\":\"event.redfish.alert\",\"source\":\"/cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\",\"dataContentType\":\"application/json\",\"time\":\"2023-07-11T19:01:46.272037861Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/redfish/v1/Systems\",\"dataType\":\"notification\",\"valueType\":\"redfish-event\",\"value\":{\"@odata.context\":\"/redfish/v1/$metadata#Event.Event\",\"@odata.type\":\"#Event.v1_0_0.Event\",\"Events\":[{\"EventGroupId\":0,\"EventId\":\"TestEventId\",\"EventTimestamp\":\"2023-07-11T15:01:45Z\",\"Message\":\"Test Event\",\"MessageArgs\":[\"NoAMS\",\"Busy\",\"Cached\"],\"OriginOfCondition\":\"/redfish/v1/Systems/1/\",\"Severity\":\"OK\",\"MessageId\":\"iLOEvents.2.1.TempNotif1\",\"EventType\":\"Alert\"}],\"Name\":\"Events\"}}]}}"
      

      3. Reboot SNO.  When system has rebooted, cluster and pods are running, check consumer for events:

       

      $ oc logs -n openshift-bare-metal-events consumer-766b977d84-6jbb6  cloud-event-consumer -f 
      time="2023-07-11T19:08:20Z" level=info msg="checking for rest service health\n"
      time="2023-07-11T19:08:20Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:20Z" level=warning msg="try 0, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:22Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:22Z" level=warning msg="try 1, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:24Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:24Z" level=warning msg="try 2, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:26Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:26Z" level=warning msg="try 3, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:28Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:28Z" level=warning msg="try 4, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:30Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:30Z" level=warning msg="try 5, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:32Z" level=info msg="checking for rest service health\n"
      time="2023-07-11T19:08:32Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:32Z" level=warning msg="try 0, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:34Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:34Z" level=warning msg="try 1, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:36Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:36Z" level=warning msg="try 2, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:38Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:38Z" level=warning msg="try 3, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:40Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:40Z" level=warning msg="try 4, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:42Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:42Z" level=warning msg="try 5, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:44Z" level=info msg="checking for rest service health\n"
      time="2023-07-11T19:08:44Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:44Z" level=warning msg="try 0, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:46Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:46Z" level=warning msg="try 1, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:48Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:48Z" level=warning msg="try 2, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:50Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:50Z" level=warning msg="try 3, return health check of the rest service for error  Get \"http://127.0.0.1:9085/api/ocloudNotifications/v1/health\": dial tcp 127.0.0.1:9085: connect: connection refused"
      time="2023-07-11T19:08:52Z" level=info msg="health check http://127.0.0.1:9085/api/ocloudNotifications/v1/health "
      time="2023-07-11T19:08:52Z" level=info msg="rest service returned healthy status"
      time="2023-07-11T19:08:54Z" level=info msg="created subscription:   EndpointURI: http://127.0.0.1:9089/event\n  URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n  ID: 4171d66b-7279-4444-b366-35808f8c41d2\n  Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n"
      time="2023-07-11T19:08:54Z" level=info msg="created subscription:   EndpointURI: http://127.0.0.1:9089/event\n  URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n  ID: 4171d66b-7279-4444-b366-35808f8c41d2\n  Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n"
      time="2023-07-11T19:08:54Z" level=info msg="created subscription:   EndpointURI: http://127.0.0.1:9089/event\n  URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n  ID: 4171d66b-7279-4444-b366-35808f8c41d2\n  Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n"
      time="2023-07-11T19:08:54Z" level=info msg="created subscription:   EndpointURI: http://127.0.0.1:9089/event\n  URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n  ID: 4171d66b-7279-4444-b366-35808f8c41d2\n  Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n"
      time="2023-07-11T19:08:54Z" level=info msg="created subscription:   EndpointURI: http://127.0.0.1:9089/event\n  URILocation: http://localhost:9085/api/ocloudNotifications/v1/subscriptions/4171d66b-7279-4444-b366-35808f8c41d2\n  ID: 4171d66b-7279-4444-b366-35808f8c41d2\n  Resource: /cluster/node/sno.kni-qe-25.lab.eng.rdu2.redhat.com/redfish/v1/Systems\n"
      time="2023-07-11T19:08:54Z" level=info msg="waiting for events"
       

       

       

      Actual results:

      consumer is running but not receiving new events

      Expected results:

      consumer recovers and receives events, just as it does after recovery from hw-event-proxy app restart, cloud-event-sidecar crash and consumer app restart.

      Additional info:

      spoke must-gather here: https://drive.google.com/file/d/1znSKQiCT7WFMRd2A-VtVUhVBLaUeZspn/view?usp=sharing

            jacding@redhat.com Jack Ding
            bblock@redhat.com Bonnie Block
            Bonnie Block Bonnie Block
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: