Red Hat Enterprise Linux AI / RHELAI-4284

[instructlab/training] Misleading overall_throughput calculation

    • Type: Task
    • Resolution: Unresolved
    • Component: InstructLab - Training
      [2789517500] Upstream Reporter: Tanya Osokin
      Upstream issue status: Open
      Upstream description:

      The "overall_throughput" calculated in https://github.com/instructlab/training/blob/main/src/instructlab/training/main_ds.py#L422 uses args.samples_per_gpu as the batch size instead of the actual "micro_batch_size". The batch size differs at each step, but overall_throughput is computed from a constant value. Below is part of a log showing steps with batch_size values of 125, 112, and 121:

      Epoch 0:  97%|██████████| 76/78 [03:54<00:05, 2.94s/it]
      { "epoch": 0, "step": 76, "rank": 0, "overall_throughput": 44.94857943825548, "lr": 2.0000000000000003e-06, "cuda_mem_allocated": 1.2444758415222168, "cuda_malloc_retries": 0, "num_loss_counted_tokens": 25623, "batch_size": 125, "total_loss": 3.9130468719509817, "samples_seen": 9661, "timestamp": "2024-12-20T13:51:34.253834" }
      Epoch: 0, Step: 77, Rank: 3, loss = 0.95703125
      Epoch: 0, Step: 77, Rank: 1, loss = 0.71484375
      Epoch: 0, Step: 77, Rank: 2, loss = 0.64453125
      Epoch: 0, Step: 77, Rank: 5, loss = 2.953125
      Epoch: 0, Step: 77, Rank: 7, loss = 12.5
      Epoch: 0, Step: 77, Rank: 6, loss = 10.5

      Epoch: 0, Step: 77, Rank: 4, loss = 1.765625

      Epoch: 0, Step: 77, Rank: 0, loss = 0.921875

      Epoch 0:  99%|██████████| 77/78 [03:57<00:02, 2.89s/it]
      { "epoch": 0, "step": 77, "rank": 0, "overall_throughput": 47.957271498777644, "lr": 2.0000000000000003e-06, "cuda_mem_allocated": 1.2483596801757812, "cuda_malloc_retries": 0, "num_loss_counted_tokens": 23046, "batch_size": 112, "total_loss": 3.8774624663716044, "samples_seen": 9773, "timestamp": "2024-12-20T13:51:37.052739" }
      Epoch: 0, Step: 78, Rank: 0, loss = 0.8671875
      Epoch: 0, Step: 78, Rank: 5, loss = 2.15625
      Epoch: 0, Step: 78, Rank: 7, loss = 12.75
      Epoch: 0, Step: 78, Rank: 3, loss = 0.72265625
      Epoch: 0, Step: 78, Rank: 4, loss = 1.1640625
      Epoch: 0, Step: 78, Rank: 6, loss = 14.8125
      Epoch: 0, Step: 78, Rank: 2, loss = 0.57421875
      Epoch: 0, Step: 78, Rank: 1, loss = 0.2314453125

      Epoch 0: 100%|██████████| 78/78 [04:00<00:00, 2.91s/it]
      { "epoch": 0, "step": 78, "rank": 0, "overall_throughput": 45.40726680806918, "lr": 2.0000000000000003e-06, "cuda_mem_allocated": 1.2466816902160645, "cuda_malloc_retries": 0, "num_loss_counted_tokens": 27044, "batch_size": 121, "total_loss": 4.160331311936104, "samples_seen": 9894, "timestamp": "2024-12-20T13:51:39.872213" }
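
      The distinction the report makes can be sketched as follows. This is a minimal illustration, not the actual main_ds.py code: the function names, the world_size parameter, and the step times are assumptions; only the batch sizes (125, 112, 121) come from the log above.

      ```python
      # Hypothetical sketch of the reported issue: throughput derived from the
      # constant configured args.samples_per_gpu vs. the actual per-step batch size.

      def misleading_throughput(samples_per_gpu: int, world_size: int, step_time: float) -> float:
          # Mirrors the reported behavior: every step uses the same nominal
          # sample count, regardless of how many samples were really processed.
          return samples_per_gpu * world_size / step_time

      def actual_throughput(batch_size: int, step_time: float) -> float:
          # Uses the real number of samples processed in this step, so the
          # metric tracks the varying batch sizes seen in the log.
          return batch_size / step_time

      # batch_size values taken from the log excerpt above; the step times
      # (~2.8-2.9 s/it) are illustrative, not measured.
      steps = [(125, 2.82), (112, 2.80), (121, 2.82)]
      for batch_size, step_time in steps:
          print(f"batch_size={batch_size}: {actual_throughput(batch_size, step_time):.2f} samples/s")
      ```

      With a varying batch size, the two formulas diverge on every step whose micro-batch differs from the configured value, which is why the logged "overall_throughput" is misleading.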


      Upstream URL: https://github.com/instructlab/training/issues/392

            Assignee: Unassigned
            Labels: upstream-sync