Learning about Large Language Models
I am reading Sebastian Raschka’s Build a Large Language Model (From Scratch) book with a few colleagues at work. This video is mentioned in chapter 1 of the book.
We have quizzes to see how well we understand the material. These are the questions & notes I have jotted down from my reading so far.
Chapter 1
- What is an LLM? p2
- What are 2 dimensions that “large” refers to?
- Which architecture do LLMs utilize? p3
- Why are LLMs often referred to as generative AI/genAI?
- What is the relationship between AI, ML, deep learning, LLMs, and genAI?
- Give a difference between traditional ML and deep learning.
- What are other approaches to AI apart from ML and deep learning? p4
- List 5 applications of LLMs.
- What are 3 advantages of custom-built LLMs? p5
- What are the 2 general steps in creating an LLM? p6
- What is a base/foundation model? Give an example. p7
- What are the few-shot capabilities of a base model?
- What are 2 categories of fine-tuning LLMs?
- Which architecture did “Attention Is All You Need” introduce?
- Describe the transformer architecture.
Part II
- What are the 2 submodules of a transformer? p7
- What is the purpose of the self-attention mechanism?
- What is BERT? What do the initials stand for? p8
- What does GPT stand for?
- What is the difference between BERT and GPT? Which submodule of the original transformer does each focus on?
- List a real-world application of BERT. p9
- What is the difference between zero-shot and few-shot capabilities?
- What are applications of transformers (other than LLMs)? p10
- Give 2 examples of architectures (other than transformers) that LLMs can be based on.
- See “Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research” for a publicly available training dataset (may contain copyrighted works). p11
- Why are models like GPT-3 called base or foundation models?
- What is an estimate of the cost of training GPT-3? See https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/
- What type of learning is next-word prediction? p12
- What is an autoregressive model? Why is GPT one? (see the sketch after this list)
- How many transformer layers and parameters does GPT-3 have? p13
- When was GPT-3 introduced?
- Which task was the original transformer model explicitly designed for? p14
- What is emergent behavior?
- What are the 3 main stages of coding an LLM in this book?
- What is the key idea of the transformer architecture? p15
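Since a couple of the questions above touch on autoregressive next-word prediction, here is a minimal sketch of what that loop looks like in PyTorch. The greedy argmax decoding and the random stand-in model are my own illustration, not code from the book.

```python
import torch

def generate(model, token_ids, max_new_tokens):
    # Autoregressive loop: each predicted token is appended to the
    # input and fed back in, so earlier outputs become later inputs.
    for _ in range(max_new_tokens):
        logits = model(token_ids)                 # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)
    return token_ids

# Stand-in "model" returning random logits, just so the loop runs.
torch.manual_seed(123)
dummy_model = lambda ids: torch.rand(1, ids.shape[1], 50257)
print(generate(dummy_model, torch.tensor([[1, 2, 3]]), max_new_tokens=5))
```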
GPT References
- “Improving Language Understanding by Generative Pre-Training” p12
- “Training language models to follow instructions with human feedback”
Chapter 2
- What is embedding? p18
- What is retrieval-augmented generation? p19
- Which embeddings are popular for RAG?
- What is Word2Vec? What is the main idea behind it?
- What is an advantage of high dimensionality in word embeddings? A disadvantage?
- What is an advantage of optimizing embeddings as part of LLM training instead of using Word2Vec?
- What is the embedding size of the smaller GPT-2 models? The largest GPT-3 models? (see the sketch after this list)
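As a reminder of what an embedding layer actually does, here is a minimal sketch using PyTorch’s nn.Embedding. The 768 dimensions match the smaller GPT-2 models mentioned above (the largest GPT-3 model uses 12,288); the token IDs are made up.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
vocab_size = 50257  # GPT-2's BPE vocabulary size
embed_dim = 768     # embedding size of the smaller GPT-2 models

# Each row of the layer's weight matrix is the trainable vector
# for one token ID; a lookup is just row selection.
token_embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([1, 5, 42])     # made-up token IDs
print(token_embedding(token_ids).shape)  # torch.Size([3, 768])
```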
Chapter 3
- Why can’t we simply translate a text from one language to another word by word? p52
- How can this challenge be addressed using a deep neural network?
- What is a recurrent neural network?
- What was the most popular encoder-decoder architecture before the advent of transformers?
- Explain how an encoder-decoder RNN works. p53
- What is the big limitation of encoder-decoder RNNs?
- What is the Bahdanau attention mechanism? p54
- What is self-attention? p55
- What serves as the cornerstone of every LLM based on the transformer architecture?
- What does the self in self-attention refer to? p56
- What is a context vector? p57
- Why are context vectors essential in LLMs?
- Why is the dot product a measure of similarity? p59
- Give 2 reasons why the attention scores are normalized.
- Why is it advisable to use the softmax function for normalization in practice? p60
- Why is it advisable to use the PyTorch implementation of softmax in particular (instead of your own)?
- What is the difference between attention scores and attention weights? p62
- How are context vectors computed from attention weights? p63 (see the simplified self-attention sketch after this list)
- Which are the 3 weight matrices in self-attention with trainable weights? p65
- How are these matrices initialized? How are they used?
- What is the difference between weight parameters (matrices) and attention weights?
- How are the attention scores computed in the self-attention with trainable weights technique?
- What about the attention weights? p68
- What is scaled dot-product attention? p69
- Why do we scale by the square root of the embedding dimension?
- How does the softmax function behave as the dot products increase?
- How is the context vector computed? (see the scaled dot-product sketch after this list)
- What is nn.Module? p71
- What is a significant advantage of using nn.Linear instead of nn.Parameter(torch.rand(…))?
- What is causal attention?
- How can the tril function be used to create a mask where the values above the diagonal are 0?
- Explain a more efficient masking trick for computing the masked attention weights (see the causal-mask sketch after this list).
- What is dropout in deep learning?
- Which are the two specific times when dropout is typically applied in the transformer architecture?
- Why does nn.Dropout scale the remaining values? p79-80 (see the dropout sketch after this list)
- What are some advantages of using register_buffer? p81
- What is multi-head attention? p82
- How can multiple heads be processed in parallel? p85 (see the multi-head sketch after this list)
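Some sketches for the attention questions above. First, a minimal version of simplified self-attention (no trainable weights), with toy dimensions of my choosing: dot products give attention scores, softmax turns each row into weights, and each context vector is a weighted sum of all inputs.

```python
import torch

torch.manual_seed(123)
inputs = torch.rand(6, 3)  # 6 tokens with 3-dimensional embeddings (toy sizes)

# Attention scores: pairwise dot products, a measure of similarity
# (aligned vectors produce large positive products).
attn_scores = inputs @ inputs.T  # shape (6, 6)

# Attention weights: softmax makes each row positive and sum to 1,
# so the weights are interpretable and gradients stay well behaved.
# torch.softmax is also numerically stable for extreme scores, one
# reason to prefer it over a hand-rolled exp/sum implementation.
attn_weights = torch.softmax(attn_scores, dim=-1)

# Context vectors: each row is a weighted sum of all input vectors.
context_vecs = attn_weights @ inputs  # shape (6, 3)
print(context_vecs)
```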
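Next, a sketch of self-attention with the three trainable weight matrices, scaled dot-product style. The class name and dimensions are illustrative, not the book’s exact listing.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention with trainable W_q, W_k, W_v."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # nn.Linear initializes its weights with a better scheme than
        # nn.Parameter(torch.rand(...)) and uses an optimized matmul.
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T  # dot products of queries and keys
        # Dividing by sqrt(d_k) keeps large dot products from pushing
        # softmax toward one-hot outputs with near-zero gradients.
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values  # context vectors

torch.manual_seed(123)
sa = SelfAttention(d_in=3, d_out=2)
print(sa(torch.rand(6, 3)).shape)  # torch.Size([6, 2])
```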
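A sketch of causal masking, with random placeholder scores. The first version uses torch.tril to zero out weights above the diagonal and then renormalizes; the more efficient trick fills future positions with -inf before softmax, so exp(-inf) = 0 and each row comes out masked and normalized in a single pass.

```python
import torch

torch.manual_seed(123)
context_length = 6
attn_scores = torch.rand(context_length, context_length)  # placeholder scores

# torch.tril keeps the lower triangle, so everything above the
# diagonal becomes 0 -- multiplying the softmaxed weights by this
# mask works, but each row must then be renormalized.
mask_simple = torch.tril(torch.ones(context_length, context_length))
weights_simple = torch.softmax(attn_scores, dim=-1) * mask_simple
weights_simple = weights_simple / weights_simple.sum(dim=-1, keepdim=True)

# More efficient trick: set future positions to -inf *before* softmax.
# exp(-inf) = 0, so masking and normalization happen in one pass.
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked_scores = attn_scores.masked_fill(mask.bool(), float("-inf"))
attn_weights = torch.softmax(masked_scores, dim=-1)
print(torch.allclose(weights_simple, attn_weights))  # True
```

Inside an attention module, such a mask is typically stored via self.register_buffer(...), so it is saved with the model’s state and moved to the GPU together with the parameters without being treated as a trainable parameter.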
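A quick demonstration of nn.Dropout’s rescaling: with p = 0.5, surviving values are doubled so the expected magnitude of the activations matches what the layer produces at inference time, when dropout is switched off.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
dropout = nn.Dropout(p=0.5)  # modules are in training mode by default
x = torch.ones(6, 6)

# Surviving entries are scaled to 1 / (1 - p) = 2.0; the rest are 0.
# The rescaling keeps the expected activation magnitude the same as
# at inference time, when dropout.eval() disables the masking.
print(dropout(x))
```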
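Finally, a sketch of processing multiple heads in parallel: one large projection is split into heads with view and transpose, so a single batched matrix multiplication computes every head’s attention at once. Causal masking and dropout are omitted for brevity, and the names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=False)
        self.W_key = nn.Linear(d_in, d_out, bias=False)
        self.W_value = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project once, then split the last dimension into heads:
        # (b, tokens, d_out) -> (b, heads, tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        # One batched matmul computes all heads' scores in parallel.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        # Recombine: (b, heads, tokens, head_dim) -> (b, tokens, d_out)
        return (attn @ v).transpose(1, 2).reshape(b, num_tokens, -1)

torch.manual_seed(123)
mha = MultiHeadAttention(d_in=3, d_out=4, num_heads=2)
print(mha(torch.rand(2, 6, 3)).shape)  # torch.Size([2, 6, 4])
```

Splitting one wide projection across heads this way is equivalent to running num_heads separate single-head attentions side by side, but it avoids a Python loop over head modules.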