AI Platform Core Components / AIPCC-6334

nn.CrossEntropyLoss overflow with FP16 and minibatch

    • Type: Story
    • Resolution: Unresolved
    • Priority: Undefined
    • Component: PyTorch
    • Sprints: PyTorch Sprint 18, PyTorch Sprint 19, PyTorch Sprint 20, PyTorch Sprint 21, PyTorch Sprint 22, PyTorch Sprint 23, PyTorch Sprint 24

          🐛 Describe the bug

      Using nn.CrossEntropyLoss with FP16 and a long sequence is stable. However, introducing a minibatch dimension leads to overflow, and `CrossEntropyLoss` outputs `inf`.

      To reproduce:

      ```Python
      import torch

      # Case 1: FP16 loss with a minibatch dimension -> overflows to inf.
      # Input shape is (batch=20, classes=14749, sequence=1025).
      ce = torch.nn.CrossEntropyLoss().cuda().half()

      inp = torch.rand((20, 14749, 1025))
      inp = inp.cuda().half()
      t = torch.randint(low=0, high=14749, size=[20, 1025]).cuda()

      loss = ce(inp, t)
      print(loss)  # inf

      # Case 2: the same computation in FP32 -> finite, correct loss.
      ce = torch.nn.CrossEntropyLoss().cuda()
      inp = torch.rand((20, 14749, 1025))
      inp = inp.cuda()

      loss = ce(inp, t)
      print(loss)  # finite

      # Case 3: FP16 again, but with batch and sequence flattened into a
      # single dimension, giving input shape (20 * 1025, 14749) -> finite.
      ce.half()
      inp = inp.cuda().half()
      inp = inp.transpose(1, 2)
      inp = inp.flatten(start_dim=0, end_dim=1)
      t = t.flatten(start_dim=0, end_dim=1)

      loss = ce(inp, t)
      print(loss)  # finite
      ```

      The first loss is `inf`; the second and third are finite and correct.
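      A plausible explanation (my reasoning, not confirmed in the report): each per-token loss is roughly log(14749) ≈ 9.6, and the default `mean` reduction first sums 20 × 1025 = 20500 of them, about 2 × 10⁵, which exceeds FP16's maximum finite value of 65504. Any kernel that holds that intermediate sum in half precision therefore overflows; the flattened 2-D path is presumably the one that accumulates in FP32. A quick check of the magnitudes:

      ```Python
      import torch

      # Back-of-the-envelope check: the intermediate sum does not fit in FP16.
      per_token = torch.full((20 * 1025,), 9.6)   # ~log(14749) per token, FP32
      total = per_token.sum()                     # ~196800
      print(torch.finfo(torch.float16).max)       # 65504.0
      print(total.half())                         # inf: overflows if kept in FP16
      print((total / (20 * 1025)).half())         # ~9.6: fine after the division
      ```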

          Versions

      I tested on PyTorch 1.8.2 and 1.12.1; both behave the same.
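      A workaround that avoids the overflow on both versions (my suggestion, not part of the original report) is to upcast the FP16 logits to FP32 before the loss:

      ```Python
      import torch

      # Workaround sketch: keep the loss computation in FP32 by upcasting
      # the FP16 logits before the reduction.
      ce = torch.nn.CrossEntropyLoss().cuda()    # leave the loss module in FP32

      inp = torch.rand((20, 14749, 1025)).cuda().half()
      t = torch.randint(low=0, high=14749, size=[20, 1025]).cuda()

      loss = ce(inp.float(), t)                  # upcast: loss is finite
      print(loss)
      ```

      `torch.cuda.amp.autocast` does the same thing automatically: `cross_entropy` is on its float32 op list, so under autocast the loss never runs in half precision.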

      cc @albanD @mruberry @jbschlosser @walterddr @kshitij12345 @saketh-are

              Assignee: rh-ee-visgoyal Vishal Goyal
              Reporter: rh-ee-visgoyal Vishal Goyal
              Team: PyTorch Core
              Votes: 0
              Watchers: 3