Categories: Artificial Intelligence

Learning about Large Language Models – Part 2

This post lists some key questions and concepts from Chapter 4 of Sebastian Raschka’s Build a Large Language Model from Scratch book. It is a continuation of the Learning about Large Language Models post.

  1. About how many parameters are in a small GPT-2 model? See Language Models are Unsupervised Multitask Learners at https://mng.bz/yoBq. p94
  2. What does the term “parameters” refer to in the context of deep learning/LLMs? p93
  3. What are logits? p99
  4. Describe the challenges of vanishing or exploding gradients in training DNNs with many layers. What do these problems lead to?
  5. Why is layer normalization used?
  6. What is the main idea behind layer normalization (what does it do to the outputs)?
  7. What is a ReLU? p100
  8. What is the torch.var keepdim parameter used for? p101
  9. What about the correction parameter (formerly unbiased)? How does it relate to Bessel’s correction? p103
  10. Explain the difference between layer normalization and batch normalization. p104
  11. When is layer normalization advantageous? p105
  12. What is a GELU? How can it lead to better optimization properties during training (compared to a ReLU)?
  13. What is one advantage of small non-zero outputs on negative inputs to a GELU? p106
  14. Explain the role of a FeedForward module in enhancing the model’s ability to learn from and generalize data. Compare with Feedforward neural network. p108
  15. Why were shortcut connections originally proposed for deep networks in computer vision? p109
  16. Which PyTorch method computes loss gradients? p112
  17. Explain Pre-LayerNorm and Post-LayerNorm. p115
  18. What does preservation of shape throughout the transformer block architecture enable? p116
  19. Explain weight-tying as used in the original GPT-2 architecture and its advantages. p121
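Questions 5–9 concern layer normalization and the keepdim/correction parameters of torch.var. The following is a minimal sketch of the idea (not the book's exact implementation), showing how keepdim preserves the shape needed for broadcasting and how the biased variance (divide by N rather than N − 1) is used:

```python
import torch

torch.manual_seed(0)

# A toy batch: 2 samples, 5 features each.
x = torch.randn(2, 5)

# keepdim=True keeps shape (2, 1) instead of (2,), so the result
# broadcasts cleanly against x when we subtract and divide.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased: divide by N

# Layer normalization: each sample's features get mean 0, variance 1.
# The small epsilon guards against division by zero.
x_norm = (x - mean) / torch.sqrt(var + 1e-5)

print(x_norm.mean(dim=-1))                  # ~0 per sample
print(x_norm.var(dim=-1, unbiased=False))   # ~1 per sample
```

Note that normalization happens per sample across the feature dimension, which is what distinguishes layer normalization from batch normalization (question 10), where statistics are computed across the batch dimension instead.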

This is the author’s video corresponding to chapter 4 of the book.

Build an LLM from Scratch 4: Implementing a GPT model from Scratch To Generate Text

Math Concepts

The author mentioned that OpenAI used the biased variance option when training their GPT-2 model. The reason Bessel’s correction is usually used in statistics is explained well in this video:

Why We Divide by N-1 in the Sample Variance (The Bessel’s Correction)
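The difference is easy to see numerically. This sketch compares the two options of torch.var (the correction parameter replaced the older unbiased flag in recent PyTorch versions, as question 9 notes):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
# Sum of squared deviations from the mean (2.5) is 5.0.

# Biased variance: divide by N. This is the option used for GPT-2's layer norm.
biased = torch.var(x, correction=0)      # equivalent to unbiased=False

# Bessel-corrected (unbiased) variance: divide by N - 1. PyTorch's default.
unbiased = torch.var(x, correction=1)

print(biased.item())    # 5.0 / 4 = 1.25
print(unbiased.item())  # 5.0 / 3 ≈ 1.667
```

For the large feature dimensions in an LLM the two estimates are nearly identical, so the choice has practically no effect on training.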

I didn’t think the book expounded on why we need activation functions. This final video provides a great explanation.

Why Do We Need Activation Functions in Neural Networks?
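The core point of the video can be demonstrated in a few lines: without a nonlinearity between them, stacked linear layers collapse into a single linear layer, so depth adds no expressive power. A small sketch (my own illustration, not from the book):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Two stacked Linear layers with NO activation in between...
f = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))

# ...compose into one affine map: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
W1, b1 = f[0].weight, f[0].bias
W2, b2 = f[1].weight, f[1].bias
W = W2 @ W1
b = W2 @ b1 + b2

x = torch.randn(5, 4)
# The two-layer network and the collapsed single layer give identical outputs.
assert torch.allclose(f(x), x @ W.T + b, atol=1e-5)
```

Inserting a nonlinearity such as GELU or ReLU between the layers breaks this collapse, which is what lets deep networks model non-linear functions.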
