multiplies the loss by a constant S before backpropagation, scaling gradients up so they fall into the representable range of FP16. After backprop, gradients are divided by S before the optimizer step.
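The mechanics can be sketched in a few lines of NumPy. This is a minimal illustration, not a framework's actual implementation: the scale factor S = 32768 is chosen here for the example (libraries like PyTorch's `GradScaler` adjust S dynamically), and the "gradient" is just a small constant standing in for a real backward pass.

```python
import numpy as np

# A gradient this small underflows to zero when cast to FP16
# (FP16's smallest subnormal is ~5.96e-8).
grad = np.float32(1e-8)
assert np.float16(grad) == 0.0

# Loss scaling: multiply by S before the FP16 cast so the value
# lands inside FP16's representable range...
S = np.float32(32768.0)
grad_fp16 = np.float16(grad * S)
assert grad_fp16 != 0.0

# ...then divide by S in FP32 before the optimizer step.
grad_unscaled = np.float32(grad_fp16) / S
print(grad_unscaled)  # close to the original 1e-8
```

Because the scaled value sits in FP16's normal range, the round trip loses only the usual FP16 rounding error rather than collapsing to zero.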
| Format | Exponent Bits | Mantissa Bits | Dynamic Range (approx) |
|--------|---------------|---------------|------------------------|
| FP16   | 5             | 10            | 5.96e-8 to 65504       |
| BF16   | 8             | 7             | 1.18e-38 to 3.4e38     |
BF16 has the same 8 exponent bits as FP32 — and therefore essentially the same dynamic range — so gradients rarely underflow, even without loss scaling. The tradeoff is precision: 7 mantissa bits versus FP16's 10. For most deep learning workloads, BF16's precision is sufficient.
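Both sides of that tradeoff can be demonstrated by emulating BF16 in NumPy, which has no native bfloat16 type. The sketch below truncates an FP32 value's low 16 bits, which keeps the sign, the full 8-bit exponent, and the top 7 mantissa bits — a simplification, since real hardware rounds to nearest rather than truncating.

```python
import numpy as np

def to_bf16(x: float) -> np.float32:
    """Emulate BF16 by zeroing the low 16 bits of an FP32 value.
    (Truncation; real BF16 hardware rounds to nearest even.)"""
    bits = np.float32(x).view(np.uint32)
    return np.uint32(bits & 0xFFFF0000).view(np.float32)

# Range: a tiny gradient survives in BF16 but underflows in FP16,
# because BF16 keeps FP32's 8 exponent bits.
grad = np.float32(1e-8)
assert to_bf16(grad) != 0.0
assert np.float16(grad) == 0.0

# Precision: with only 7 mantissa bits, values closer together
# than ~2^-7 of each other collapse to the same BF16 number.
assert to_bf16(1.001) == to_bf16(1.0)
```

The two assertions mirror the table: the exponent bits govern what underflows, the mantissa bits govern what stays distinguishable.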