Learning about Large Language Models
I am reading Sebastian Raschka’s Build a Large Language Model (From Scratch) book with a few colleagues at work. This video is mentioned in chapter 1 of the book.
We have quizzes to see how well we understand the material. These are the questions & notes I have jotted down from my reading so far.
Chapter 1
- What is an LLM? p2
- What are 2 dimensions that “large” refers to?
- Which architecture do LLMs utilize? p3
- Why are LLMs often referred to as generative AI/genAI?
- What is the relationship between AI, ML, deep learning, LLMs, and genAI?
- Give a difference between traditional ML and deep learning.
- What are other approaches to AI apart from ML and deep learning? p4
- List 5 applications of LLMs.
- What are 3 advantages of custom-built LLMs? p5
- What are the 2 general steps in creating an LLM? p6
- What is a base/foundation model? Give an example. p7
- What are the few-shot capabilities of a base model?
- What are 2 categories of fine-tuning LLMs?
- Which architecture did “Attention Is All You Need” introduce?
- Describe the transformer architecture.
Part II
- What are the 2 submodules of a transformer? p7
- What is the purpose of the self-attention mechanism?
- What is BERT? What do the initials stand for? p8
- What does GPT stand for?
- What is the difference between BERT and GPT? Which submodule of the original transformer does each focus on?
- List a real-world application of BERT. p9
- What is the difference between zero-shot and few-shot capabilities?
- What are applications of transformers (other than LLMs)? p10
- Give 2 examples of architectures (other than transformers) that LLMs can be based on.
- See “Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research” for a publicly available training dataset (may contain copyrighted works). p11
- Why are models like GPT-3 called base or foundation models?
- What is an estimate of the cost of training GPT-3? See https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/
- What type of learning is next-word prediction? p12
- What is an autoregressive model? Why is GPT one? (see the sketch after this list)
- How many transformer layers and parameters does GPT-3 have? p13
- When was GPT-3 introduced?
- Which task was the original transformer model explicitly designed for? p14
- What is emergent behavior?
- What are the 3 main stages of coding an LLM in this book?
- What is the key idea of the transformer architecture? p15
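Since a couple of the questions above touch on autoregressive next-word prediction, here is a minimal sketch of what that loop looks like in PyTorch. The greedy argmax decoding and the random stand-in model are my own illustration, not code from the book.

```python
import torch

def generate(model, token_ids, max_new_tokens):
    # Autoregressive loop: each predicted token is appended to the
    # input and fed back in, so earlier outputs become later inputs.
    for _ in range(max_new_tokens):
        logits = model(token_ids)                 # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids

# Stand-in "model" returning random logits, just so the loop runs.
torch.manual_seed(123)
dummy_model = lambda ids: torch.rand(1, ids.shape[1], 50257)
print(generate(dummy_model, torch.tensor([[1, 2, 3]]), max_new_tokens=5))
```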
GPT References
- “Improving Language Understanding by Generative Pre-Training” p12
- “Training language models to follow instructions with human feedback”
Chapter 2
- What is embedding? p18
- What is retrieval-augmented generation? p19
- Which embeddings are popular for RAG?
- What is Word2Vec? What is the main idea behind it?
- What is an advantage of high dimensionality in word embeddings? A disadvantage?
- What is an advantage of optimizing embeddings as part of LLM training instead of using Word2Vec?
- What is the embedding size of the smaller GPT-2 models? The largest GPT-3 models? (see the sketch after this list)
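As a reminder of what an embedding layer actually does, here is a minimal sketch using PyTorch’s nn.Embedding. The 768 dimensions match the smaller GPT-2 models mentioned above (the largest GPT-3 model uses 12,288); the token IDs are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
vocab_size = 50257  # GPT-2's BPE vocabulary size
embed_dim = 768     # embedding size of the smaller GPT-2 models

# Each row of the layer's weight matrix is the trainable vector
# for one token ID; a lookup is just row selection.
token_embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([1, 5, 42])     # made-up token IDs
print(token_embedding(token_ids).shape)  # torch.Size([3, 768])
```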
Chapter 3
- Why can’t we simply translate a text from one language to another word by word? p52
- How can this challenge be addressed using a deep neural network?
- What is a recurrent neural network?
- What was the most popular encoder-decoder architecture before the advent of transformers?
- Explain how an encoder-decoder RNN works. p53
- What is the big limitation of encoder-decoder RNNs?
- What is the Bahdanau attention mechanism? p54
- What is self-attention? p55
- What serves as the cornerstone of every LLM based on the transformer architecture?
- What does the self in self-attention refer to? p56
- What is a context vector? p57
- Why are context vectors essential in LLMs?
- Why is the dot product a measure of similarity? p59
- Give 2 reasons why the attention scores are normalized.
- Why is it advisable to use the softmax function for normalization in practice? p60
- Why is it advisable to use the PyTorch implementation of softmax in particular (instead of your own)?
- What is the difference between attention scores and attention weights? p62
- How are context vectors computed from attention weights? p63 (see the simplified self-attention sketch after this list)
- Which are the 3 weight matrices in self-attention with trainable weights? p65
- How are these matrices initialized? How are they used?
- What is the difference between weight parameters (matrices) and attention weights?
- How are the attention scores computed in the self-attention with trainable weights technique?
- What about the attention weights? p68
- What is scaled dot-product attention? p69
- Why do we scale by the square root of the embedding dimension?
- How does the softmax function behave as the dot products increase?
- How is the context vector computed? (see the scaled dot-product sketch after this list)
- What is nn.Module? p71
- What is a significant advantage of using nn.Linear instead of nn.Parameter(torch.rand(…))?
- What is causal attention?
- How can the tril function be used to create a mask where the values above the diagonal are 0?
- Explain a more efficient masking trick for computing the masked attention weights (see the causal-mask sketch after this list).
- What is dropout in deep learning?
- Which are the two specific times when dropout is typically applied in the transformer architecture?
- Why does nn.Dropout scale the remaining values? p79-80 (see the dropout sketch after this list)
- What are some advantages of using register_buffer? p81
- What is multi-head attention? p82
- How can multiple heads be processed in parallel? p85 (see the multi-head sketch after this list)
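Some sketches for the attention questions above. First, a minimal version of simplified self-attention (no trainable weights), with toy dimensions of my choosing: dot products give attention scores, softmax turns each row into weights, and each context vector is a weighted sum of all inputs.

```python
import torch

torch.manual_seed(123)
inputs = torch.rand(6, 3)  # 6 tokens with 3-dimensional embeddings (toy sizes)

# Attention scores: pairwise dot products, a measure of similarity
# (aligned vectors produce large positive products).
attn_scores = inputs @ inputs.T  # shape (6, 6)

# Attention weights: softmax makes each row positive and sum to 1,
# so the weights are interpretable and gradients stay well behaved.
# torch.softmax is also numerically stable for extreme scores, one
# reason to prefer it over a hand-rolled exp/sum implementation.
attn_weights = torch.softmax(attn_scores, dim=-1)

# Context vectors: each row is a weighted sum of all input vectors.
context_vecs = attn_weights @ inputs  # shape (6, 3)
print(context_vecs)
```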
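Next, a sketch of self-attention with the three trainable weight matrices, scaled dot-product style. The class name and dimensions are illustrative, not the book’s exact listing.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention with trainable W_q, W_k, W_v."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # nn.Linear initializes its weights with a better scheme than
        # nn.Parameter(torch.rand(...)) and uses an optimized matmul.
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T  # dot products of queries and keys
        # Dividing by sqrt(d_k) keeps large dot products from pushing
        # softmax toward one-hot outputs with near-zero gradients.
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values  # context vectors

torch.manual_seed(123)
sa = SelfAttention(d_in=3, d_out=2)
print(sa(torch.rand(6, 3)).shape)  # torch.Size([6, 2])
```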
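A sketch of causal masking, with random placeholder scores. The first version uses torch.tril to zero out weights above the diagonal and then renormalizes; the more efficient trick fills future positions with -inf before softmax, so exp(-inf) = 0 and each row comes out masked and normalized in a single pass.

```python
import torch

torch.manual_seed(123)
context_length = 6
attn_scores = torch.rand(context_length, context_length)  # placeholder scores

# torch.tril keeps the lower triangle, so everything above the
# diagonal becomes 0 -- multiplying the softmaxed weights by this
# mask works, but each row must then be renormalized.
mask_simple = torch.tril(torch.ones(context_length, context_length))
weights_simple = torch.softmax(attn_scores, dim=-1) * mask_simple
weights_simple = weights_simple / weights_simple.sum(dim=-1, keepdim=True)

# More efficient trick: set future positions to -inf *before* softmax.
# exp(-inf) = 0, so masking and normalization happen in one pass.
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked_scores = attn_scores.masked_fill(mask.bool(), float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)
print(torch.allclose(weights_simple, attn_weights))  # True
```

Inside an attention module, such a mask is typically stored via self.register_buffer(...), so it is saved with the model’s state and moved to the GPU together with the parameters without being treated as a trainable parameter.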
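A quick demonstration of nn.Dropout’s rescaling: with p = 0.5, surviving values are doubled so the expected magnitude of the activations matches what the layer produces at inference time, when dropout is switched off.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
dropout = nn.Dropout(p=0.5)  # modules are in training mode by default
x = torch.ones(6, 6)

# Surviving entries are scaled to 1 / (1 - p) = 2.0; the rest are 0.
# The rescaling keeps the expected activation magnitude the same as
# at inference time, when dropout.eval() disables the masking.
print(dropout(x))
```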
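Finally, a sketch of processing multiple heads in parallel: one large projection is split into heads with view and transpose, so a single batched matrix multiplication computes every head’s attention at once. Causal masking and dropout are omitted for brevity, and the names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project once, then split the last dimension into heads:
        # (b, tokens, d_out) -> (b, heads, tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        # One batched matmul computes all heads' scores in parallel.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        # Recombine: (b, heads, tokens, head_dim) -> (b, tokens, d_out)
        return (attn @ v).transpose(1, 2).reshape(b, num_tokens, -1)

torch.manual_seed(123)
mha = MultiHeadAttention(d_in=3, d_out=4, num_heads=2)
print(mha(torch.rand(2, 6, 3)).shape)  # torch.Size([2, 6, 4])
```

Splitting one wide projection across heads this way is equivalent to running num_heads separate single-head attentions side by side, but it avoids a Python loop over head modules.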