Categories: Artificial Intelligence

Learning about Large Language Models – Part 2

This post lists some key questions and concepts from Chapter 4 of Sebastian Raschka’s Build a Large Language Model from Scratch book. It is a continuation of the Learning about Large Language Models post.

  1. About how many parameters are in a small GPT-2 model? See Language Models are Unsupervised Multitask Learners at https://mng.bz/yoBq. p94
  2. What does the term “parameters” refer to in the context of deep learning/LLMs? p93
  3. What are logits? p99
  4. Describe the challenges of vanishing or exploding gradients in training DNNs with many layers. What do these problems lead to?
  5. Why is layer normalization used?
  6. What is the main idea behind layer normalization (what does it do to the outputs)?
  7. What is a ReLU? p100
  8. What is the torch.var keepdim parameter used for? p101
  9. What about the correction parameter (formerly unbiased)? How does it relate to Bessel’s correction? p103
  10. Explain the difference between layer normalization and batch normalization. p104
  11. When is layer normalization advantageous? p105
  12. What is a GELU? How can it lead to better optimization properties during training (compared to a ReLU)?
  13. What is one advantage of small non-zero outputs on negative inputs to a GELU? p106
  14. Explain the role of a FeedForward module in enhancing the model’s ability to learn from and generalize data. Compare with Feedforward neural network. p108
  15. Why were shortcut connections originally proposed for deep networks in computer vision? p109
  16. Which PyTorch method computes loss gradients? p112
  17. Explain Pre-LayerNorm and Post-LayerNorm. p115
  18. What does preservation of shape throughout the transformer block architecture enable? p116
  19. Explain weight-tying as used in the original GPT-2 architecture and its advantages. p121
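Questions 5–9 concern layer normalization and the keepdim/correction parameters of torch.var. The following is a minimal sketch of the idea (not the book's exact implementation), showing how keepdim preserves the shape needed for broadcasting and how the biased variance (divide by N rather than N − 1) is used:

```python
import torch

torch.manual_seed(0)

# A toy batch: 2 samples, 5 features each.
x = torch.randn(2, 5)

# keepdim=True keeps shape (2, 1) instead of (2,), so the result
# broadcasts cleanly against x when we subtract and divide.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased: divide by N

# Layer normalization: each sample's features get mean 0, variance 1.
# The small epsilon guards against division by zero.
x_norm = (x - mean) / torch.sqrt(var + 1e-5)

print(x_norm.mean(dim=-1))                  # ~0 per sample
print(x_norm.var(dim=-1, unbiased=False))   # ~1 per sample
```

Note that normalization happens per sample across the feature dimension, which is what distinguishes layer normalization from batch normalization (question 10), where statistics are computed across the batch dimension instead.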

This is the author’s video corresponding to chapter 4 of the book.

Build an LLM from Scratch 4: Implementing a GPT model from Scratch To Generate Text

Math Concepts

The author mentioned that OpenAI used the biased variance option when training their GPT-2 model. The reason Bessel’s correction is usually used in statistics is explained well in this video:

Why We Divide by N-1 in the Sample Variance (The Bessel’s Correction)
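The difference is easy to see numerically. This sketch compares the two options of torch.var (the correction parameter replaced the older unbiased flag in recent PyTorch versions, as question 9 notes):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])
# Sum of squared deviations from the mean (2.5) is 5.0.

# Biased variance: divide by N. This is the option used for GPT-2's layer norm.
biased = torch.var(x, correction=0)      # equivalent to unbiased=False

# Bessel-corrected (unbiased) variance: divide by N - 1. PyTorch's default.
unbiased = torch.var(x, correction=1)

print(biased.item())    # 5.0 / 4 = 1.25
print(unbiased.item())  # 5.0 / 3 ≈ 1.667
```

For the large feature dimensions in an LLM the two estimates are nearly identical, so the choice has practically no effect on training.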

I didn’t think the book expounded on why we need activation functions. This final video provides a great explanation.

Why Do We Need Activation Functions in Neural Networks?
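The core point of the video can be demonstrated in a few lines: without a nonlinearity between them, stacked linear layers collapse into a single linear layer, so depth adds no expressive power. A small sketch (my own illustration, not from the book):

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Two stacked Linear layers with NO activation in between...
f = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 3))

# ...compose into one affine map: W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
W1, b1 = f[0].weight, f[0].bias
W2, b2 = f[1].weight, f[1].bias
W = W2 @ W1
b = W2 @ b1 + b2

x = torch.randn(5, 4)
# The two-layer network and the collapsed single layer give identical outputs.
assert torch.allclose(f(x), x @ W.T + b, atol=1e-5)
```

Inserting a nonlinearity such as GELU or ReLU between the layers breaks this collapse, which is what lets deep networks model non-linear functions.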
