Learning about Large Language Models – Part 2
This post lists some key questions and concepts from Chapter 4 of Sebastian Raschka’s Build a Large Language Model from Scratch book. It is a continuation of the Learning about Large Language Models post.
- About how many parameters are in a small GPT-2 model? See Language Models are Unsupervised Multitask Learners at https://mng.bz/yoBq. p94
- What does the term “parameters” refer to in the context of deep learning/LLMs? p93
- What are logits? p99
- Describe the challenges of vanishing or exploding gradients in training DNNs with many layers. What do these problems lead to?
- Why is layer normalization used?
- What is the main idea behind layer normalization (what does it do to the outputs)?
- What is a ReLU? p100
- What is the keepdim parameter of torch.var used for? p101
- What about the correction parameter (formerly unbiased)? How does it relate to Bessel’s correction? p103
- Explain the difference between layer normalization and batch normalization. p104
- When is layer normalization advantageous? p105
- What is a GELU? How can it lead to better optimization properties during training (compared to a ReLU)?
- What is one advantage of small non-zero outputs on negative inputs to a GELU? p106
- Explain the role of a FeedForward module in enhancing the model’s ability to learn from and generalize data. Compare with Feedforward neural network. p108
- Why were shortcut connections originally proposed for deep networks in computer vision? p109
- Which PyTorch method computes loss gradients? p112
- Explain Pre-LayerNorm and Post-LayerNorm. p115
- What does preservation of shape throughout the transformer block architecture enable? p116
- Explain weight-tying as used in the original GPT-2 architecture and its advantages. p121
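Several of the questions above concern the GELU activation and why its small non-zero outputs on negative inputs matter. As a minimal sketch (not code from the book), here is the exact GELU alongside the tanh approximation used in GPT-2, in plain Python:

```python
import math

# Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Tanh approximation, as used in the original GPT-2 implementation.
def gelu_tanh(x):
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  approx={gelu_tanh(x):+.4f}")
```

Note that, unlike ReLU, negative inputs produce small negative outputs rather than exactly zero, so gradients can still flow through those neurons during training.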
This is the author’s video corresponding to chapter 4 of the book.
Math Concepts
The author mentioned that OpenAI used the biased variance option when training their GPT-2 model. The reason Bessel’s correction is usually used in statistics is explained well in this video:
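The difference between the two variance options is just the divisor. A quick sketch in plain Python (mirroring the correction parameter of torch.var, where correction=0 is biased and correction=1 applies Bessel’s correction):

```python
# Biased vs. Bessel-corrected sample variance.
def variance(xs, correction=0):
    # correction=0 -> divide by n (biased, as in GPT-2's layer norm)
    # correction=1 -> divide by n - 1 (unbiased, Bessel's correction)
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - correction)

data = [2.0, 4.0, 6.0, 8.0]
print(variance(data, correction=0))  # 5.0
print(variance(data, correction=1))  # 6.666...
```

For the large embedding dimensions in an LLM, n is so big that dividing by n versus n − 1 makes a negligible difference, which is why the biased option is harmless there.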
I didn’t think the reason we need activation functions was expounded upon in the book. This final video provides a great explanation.