Feature
Resolution: Done

*Description*
There’s an incorrect code example in the docstring of:
*File:* `torch/onnx/_internal/exporter/_torchlib/ops/nn.py`
*Location:* Around line 123

*Current (Broken) Example*
```python
attn_weight = torch.softmax(
(Q @ K.transpose(-2, -1) * attn_mask, dim=-1
)
```
This example does not parse (the parenthesis opened before `Q` is never closed), omits the `1 / sqrt(d_k)` scale factor, and multiplies the scores by the attention mask instead of adding it.

*Suggested Fix*
The example should follow the correct scaled dot-product attention formula used in PyTorch’s `MultiheadAttention`.
```python
import math

# Corrected version
scale_factor = 1.0 / math.sqrt(Q.size(-1))
attn_weight = torch.softmax(
    (Q @ K.transpose(-2, -1) * scale_factor) + attn_mask, dim=-1
)
```
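As a quick sanity check (a sketch, not part of the docstring: the tensor shapes and the comparison against `torch.nn.functional.scaled_dot_product_attention` are my own; `Q`/`K`/`V` are made-up `(batch, heads, seq, head_dim)` tensors and `attn_mask` is an additive float mask), the corrected formula should agree with PyTorch's built-in op:
```python
import math

import torch
import torch.nn.functional as F

# Made-up shapes for illustration: (batch, heads, seq_len, head_dim)
Q = torch.randn(2, 4, 8, 16)
K = torch.randn(2, 4, 8, 16)
V = torch.randn(2, 4, 8, 16)

# Additive float mask: 0.0 keeps a key position, -inf hides it.
attn_mask = torch.zeros(8, 8)
attn_mask[:, 4:] = float("-inf")

scale_factor = 1.0 / math.sqrt(Q.size(-1))
attn_weight = torch.softmax(
    (Q @ K.transpose(-2, -1) * scale_factor) + attn_mask, dim=-1
)
manual = attn_weight @ V

builtin = F.scaled_dot_product_attention(Q, K, V, attn_mask=attn_mask)
print(torch.allclose(manual, builtin, atol=1e-5))  # expected: True
```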

*Explanation*
1. *Adds missing scale factor*
The dot product between Q and K should be scaled by `1 / sqrt(d_k)` to prevent large values before softmax.
2. *Fixes syntax error*
The original snippet never closes the parenthesis opened before `Q`, so it does not parse; closing it after the scaled scores lets `dim=-1` be passed to `torch.softmax()` as a keyword argument.
3. *Corrects mask usage*
The mask should be *added*, not multiplied.
Adding large negative values (e.g., `-inf`) drives the masked logits to `-inf`, so those positions get exactly zero probability after softmax (a short demo follows this list).
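A minimal demo of point 3, using made-up logits (the names `scores`, `additive_mask`, and `keep_mask` are illustrative, not from the docstring):
```python
import torch

# Made-up attention logits for one query over four keys.
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])

# Additive mask: 0.0 keeps a key, -inf hides it.
additive_mask = torch.tensor([[0.0, 0.0, float("-inf"), float("-inf")]])
print(torch.softmax(scores + additive_mask, dim=-1))
# ~[[0.73, 0.27, 0.00, 0.00]] -- hidden keys get exactly zero weight

# Multiplying by a 0/1 keep-mask instead only zeroes the logits,
# so the hidden keys still receive non-zero probability after softmax.
keep_mask = torch.tensor([[1.0, 1.0, 0.0, 0.0]])
print(torch.softmax(scores * keep_mask, dim=-1))
# ~[[0.61, 0.22, 0.08, 0.08]]
```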
cc @justinchuby @titaiwangms