Loss Functions and Gradient Descent 101
While learning about large language models, the issue of vanishing gradients came up. What is a gradient? I attempted to describe it as the difference between where you are and where you want to be (the target), which is given by the loss function. This led to the question of what exactly a loss function is. The video below from IBM Technology explains loss functions: a loss function is an evaluation metric (how well is the model performing) and/or a guide that directs the model’s learning process.
The primary reason for calculating the loss function is to guide the model’s learning process. It provides a numeric value that indicates how far off the model’s predictions are from the actual results. By analyzing the loss, the model’s parameters can be adjusted (optimization) since the loss function is a feedback mechanism to the model, telling it how well it is performing and where it needs to improve. – What is a Loss Function? Understanding How AI Models Learn
A smaller value of the loss function indicates that the performance of the model has improved.
A loss function can also be used as input to an algorithm that influences the model parameters to minimize loss, e.g. gradient descent. – What is a Loss Function? Understanding How AI Models Learn
The gradient of the loss function is useful because it enables algorithms to determine which adjustments (e.g. to weights) will result in a smaller loss. The next video on Gradient descent, how neural networks learn is a helpful introduction to how loss functions are used to guide learning.
Backpropagation is the algorithm used to compute the gradient. This video from 3Blue1Brown is a helpful explanation of what backpropagation is:
Two important phenomena in gradient descent are the problems of vanishing and exploding gradients. The Vanishing & Exploding Gradient explained | A problem resulting from backpropagation video describes these problems as follows: vanishing gradients mean that updated weights earlier in the network barely change (stuck) which means that the rest of the network cannot really minimize the loss function (i.e. learn). Exploding gradients mean that the earlier weights now increase so much that the optimal value of the loss function will never be achieved because weights become too big too quickly.