Learning about Large Language Models – Part 2

This post lists some key questions and concepts from Chapter 4 of Sebastian Raschka’s Build a Large Language Model from Scratch book. It is a continuation of the Learning about Large Language Models post.

  1. About how many parameters are in a small GPT-2 model? See Language Models are Unsupervised Multitask Learners at https://mng.bz/yoBq. p94
  2. What does the term “parameters” refer to in the context of deep learning/LLMs? p93
  3. What are logits? p99
  4. Describe the challenges of vanishing or exploding gradients in training DNNs with many layers. What do these problems lead to?
  5. Why is layer normalization used?
  6. What is the main idea behind layer normalization (what does it do to the outputs)?
  7. What is a ReLU? p100
  8. What is the torch.var keepdim parameter used for? p101
  9. What about the correction parameter (formerly unbiased)? How does it relate to Bessel’s correction? p103
  10. Explain the difference between layer normalization and batch normalization. p104
  11. When is layer normalization advantageous? p105
  12. What is a GELU? How can it lead to better optimization properties during training (compared to a ReLU)?
  13. What is one advantage of small non-zero outputs on negative inputs to a GELU? p106
  14. Explain the role of a FeedForward module in enhancing the model’s ability to learn from and generalize data. Compare with Feedforward neural network. p108
  15. Why were shortcut connections originally proposed for deep networks in computer vision? p109
  16. Which PyTorch method computes loss gradients? p112
  17. Explain Pre-LayerNorm and Post-LayerNorm. p115
  18. What does preservation of shape throughout the transformer block architecture enable? p116
  19. Explain weight-tying as used in the original GPT-2 architecture and its advantages. p121

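Several of these questions (5, 6, and 10) center on layer normalization. As a reminder of the core idea — shifting and scaling a layer’s outputs to zero mean and unit variance — here is a minimal pure-Python sketch of my own (not the book’s PyTorch implementation):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of activations to zero mean and unit variance."""
    n = len(x)
    mean = sum(x) / n
    # Biased variance (divide by n), as used for layer normalization
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

out = layer_norm([0.2, 0.9, 0.5, 1.4])
print(sum(out) / len(out))  # mean is ~0 after normalization
```

The eps term guards against division by zero when the variance is tiny — the same trick the book’s LayerNorm class uses.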
This is the author’s video corresponding to chapter 4 of the book.

Build an LLM from Scratch 4: Implementing a GPT model from Scratch To Generate Text

Math Concepts

The author mentioned that OpenAI used the biased variance option when training their GPT-2 model. The reasons why Bessel’s correction is usually used in statistics are explained well in this video:

Why We Divide by N-1 in the Sample Variance (The Bessel’s Correction)
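
To see the difference numerically, here is a quick comparison of the biased estimate (divide by n) and the Bessel-corrected estimate (divide by n−1). PyTorch’s torch.var exposes the same choice through its correction (formerly unbiased) parameter; this plain-Python version just mimics that interface:

```python
def variance(x, correction=1):
    """Sample variance; correction=1 applies Bessel's correction (n-1),
    correction=0 gives the biased (population) estimate (n)."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / (n - correction)

data = [2.0, 4.0, 6.0, 8.0]
print(variance(data, correction=0))  # biased: 5.0
print(variance(data, correction=1))  # unbiased: 6.666...
```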

I didn’t think the book expounded on why we need activation functions. This final video provides a great explanation.

Why Do We Need Activation Functions in Neural Networks?
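
The gist of that video: without a nonlinearity, stacking linear layers collapses into a single linear layer, so depth buys you nothing. A tiny one-dimensional illustration of my own (not from the book or the video):

```python
def linear(w, b):
    """A 1-D 'layer': x -> w*x + b."""
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

# Two stacked linear layers...
f = linear(2.0, 1.0)
g = linear(3.0, -1.0)
composed = lambda x: g(f(x))

# ...are equivalent to one linear layer: 3*(2x + 1) - 1 = 6x + 2
equivalent = linear(6.0, 2.0)
assert all(composed(x) == equivalent(x) for x in [-2.0, 0.0, 3.5])

# Inserting a ReLU between them breaks that equivalence (nonlinearity)
with_relu = lambda x: g(relu(f(x)))
print(with_relu(-2.0), equivalent(-2.0))  # -1.0 vs -10.0
```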

Cropping an Image in Python

The AI advances of the past couple of years have streamlined many mundane tasks. A recent example for me was having Copilot generate scripts to crop images. I wanted to crop the top portion of my screenshots of the Anycubic app (which I used in my post on Setting up my First 3D Printer). I figured I might as well have one post showing my evolution of prompts for this task:

Prompt 1 – Crop my PNGs

Write a python script that takes a number p and crops the top p pixels from a PNG

copilot:
Sure thing! Here's a Python script that uses the Pillow library to crop the top p pixels from a PNG image:

Setup Tip: You'll need to install Pillow if you haven’t already:

pip install pillow

Let me know if you'd like to expand it—for example, batch processing multiple images or adding a GUI.
from PIL import Image

def crop_top_pixels(image_path, p, output_path):
    # Open the image
    img = Image.open(image_path)

    # Ensure p isn't larger than the image height
    width, height = img.size
    p = min(p, height)

    # Define the crop box (left, upper, right, lower)
    crop_box = (0, p, width, height)

    # Crop and save
    cropped_img = img.crop(crop_box)
    cropped_img.save(output_path)
    print(f"Cropped image saved to {output_path}")

# Example usage
crop_top_pixels("input.png", 50, "output.png")

Pillow Installation

The only notable thing about installing pillow is that I still need this --break-system-packages flag. I will save that investigation for another day. I will upgrade pip though!

saint@MacBookPro Kobra3-app % pip3 install pillow

[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: python3.13 -m pip install --upgrade pip
error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try brew install
    xyz, where xyz is the package you are trying to
    install.
...
note: If you believe this is a mistake, please contact your Python installation or OS distribution provider. You can override this, at the risk of breaking your Python installation or OS, by passing --break-system-packages.
hint: See PEP 668 for the detailed specification.
saint@MacBookPro Kobra3-app % pip3 install pillow --break-system-packages
Collecting pillow
  Downloading pillow-11.2.1-cp313-cp313-macosx_11_0_arm64.whl.metadata (8.9 kB)
Downloading pillow-11.2.1-cp313-cp313-macosx_11_0_arm64.whl (3.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 27.0 MB/s eta 0:00:00
Installing collected packages: pillow
Successfully installed pillow-11.2.1

[notice] A new release of pip is available: 24.3.1 -> 25.1.1
[notice] To update, run: python3.13 -m pip install --upgrade pip

Prompt 2 – Invoking via Bash

write a bash script to pass all PNG images to the Python script

copilot:
You got it! Here's a simple Bash script that loops through all .png files in a directory and passes each one to the Python script you created:
#!/bin/bash

# Number of pixels to crop from the top
CROP_HEIGHT=50

# Directory containing PNG images (current directory)
IMAGE_DIR="."

# Python script name
PYTHON_SCRIPT="crop.py"

# Output directory
OUTPUT_DIR="./cropped"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Loop through all PNG files
for img in "$IMAGE_DIR"/*.png; do
    filename=$(basename "$img")
    output_path="$OUTPUT_DIR/$filename"
    echo "Cropping $img..."
    python3 "$PYTHON_SCRIPT" "$img" "$CROP_HEIGHT" "$output_path"
done
copilot (continued):
To use this setup:
- Make sure your Python script (crop.py) accepts arguments like this:
import sys
from PIL import Image

def crop_top_pixels(image_path, p, output_path):
    img = Image.open(image_path)
    width, height = img.size
    p = min(p, height)
    crop_box = (0, p, width, height)
    cropped_img = img.crop(crop_box)
    cropped_img.save(output_path)

if __name__ == "__main__":
    crop_top_pixels(sys.argv[1], int(sys.argv[2]), sys.argv[3])
- Save the Bash script to a file like batch_crop.sh, make it executable with chmod +x batch_crop.sh, and run it from the terminal:
./batch_crop.sh

Prompt 3 – User Configurable Crop Dimensions

make bash script take number of pixels as a command line argument

copilot:
You got it, Saint. Here's the updated Bash script that takes the number of pixels to crop (CROP_HEIGHT) as a command-line argument:
#!/bin/bash

# Check for a crop height argument
if [ $# -lt 1 ]; then
    echo "Usage: $0 <crop_height>"
    exit 1
fi

CROP_HEIGHT=$1
IMAGE_DIR="."
PYTHON_SCRIPT="crop.py"
OUTPUT_DIR="./cropped"

mkdir -p "$OUTPUT_DIR"

for img in "$IMAGE_DIR"/*.png; do
    filename=$(basename "$img")
    output_path="$OUTPUT_DIR/$filename"
    echo "Cropping $img by $CROP_HEIGHT pixels..."
    python3 "$PYTHON_SCRIPT" "$img" "$CROP_HEIGHT" "$output_path"
done
copilot (continued):
How to run it:

./batch_crop.sh 50

Let me know if you want it to support additional flags (like input/output dirs or image formats) to make it more flexible!

Prompt 4 – Scale Down the Images

modify python script to support scaling each image down to a configurable width

copilot:
Absolutely! Here's how you can update your Python script to not only crop the top p pixels but also scale each image down to a configurable target width while preserving the aspect ratio:
import sys
from PIL import Image

def crop_and_scale(image_path, crop_pixels, output_path, target_width=None):
    # Open the image
    img = Image.open(image_path)
    width, height = img.size

    # Crop the top p pixels
    crop_pixels = min(crop_pixels, height)
    cropped_img = img.crop((0, crop_pixels, width, height))

    # Resize while maintaining aspect ratio
    if target_width is not None and target_width < cropped_img.width:
        aspect_ratio = cropped_img.height / cropped_img.width
        target_height = int(target_width * aspect_ratio)
        cropped_img = cropped_img.resize((target_width, target_height), Image.LANCZOS)

    # Save the result
    cropped_img.save(output_path)

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python crop.py <input_path> <crop_pixels> <output_path> [target_width]")
        sys.exit(1)

    img_path = sys.argv[1]
    crop_amount = int(sys.argv[2])
    output_path = sys.argv[3]
    width = int(sys.argv[4]) if len(sys.argv) > 4 else None

    crop_and_scale(img_path, crop_amount, output_path, width)
copilot (continued):
Example command to run it:

python3 crop.py input.png 50 output.png 300

This will crop the top 50 pixels, then scale the image to 300px wide while maintaining the aspect ratio.
Let me know if you want to add height constraints or enforce padding for uniform size outputs!

Prompt 5 – No Bash Required

merge the functionality of the bash script into the python script

copilot:
You got it, Saint. Here's a standalone Python script that handles batch processing of all PNGs in the current directory, crops the top p pixels, and optionally scales them to a target width—all from the command line:
import os
import sys
from PIL import Image

def crop_and_scale(image_path, crop_pixels, output_path, target_width=None):
    img = Image.open(image_path)
    width, height = img.size

    # Crop top p pixels
    crop_pixels = min(crop_pixels, height)
    cropped = img.crop((0, crop_pixels, width, height))

    # Resize if target width is specified
    if target_width is not None and target_width < cropped.width:
        aspect_ratio = cropped.height / cropped.width
        new_height = int(target_width * aspect_ratio)
        cropped = cropped.resize((target_width, new_height), Image.LANCZOS)

    cropped.save(output_path)

def process_directory(crop_pixels, target_width=None):
    input_dir = "."
    output_dir = "./cropped"
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if filename.lower().endswith(".png"):
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, filename)
            print(f"Processing {filename}...")
            crop_and_scale(input_path, crop_pixels, output_path, target_width)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python batch_crop.py <crop_pixels> [target_width]")
        sys.exit(1)

    crop_pixels = int(sys.argv[1])
    target_width = int(sys.argv[2]) if len(sys.argv) > 2 else None

    process_directory(crop_pixels, target_width)
copilot (continued):
Example usage:

python batch_crop.py 50              # Crop top 50 pixels only
python batch_crop.py 50 300          # Crop top 50 pixels and scale to 300px wide

Want to add input/output directory flags, process JPGs too, or make it a little GUI app for drag-and-drop convenience? I’d be glad to help with any of that.

Prompt 6 – Renaming Files

modify the script to replace user specified filename prefix with a custom user specified prefix

copilot:
Sure thing! Here's the updated Python script that allows the user to specify:
- The number of pixels to crop (crop_pixels)
- An optional scaling width (target_width)
- A filename prefix to replace
- A new prefix to substitute in its place
import os
import sys
from PIL import Image

def crop_and_scale(image_path, crop_pixels, output_path, target_width=None):
    img = Image.open(image_path)
    width, height = img.size

    # Crop
    crop_pixels = min(crop_pixels, height)
    cropped = img.crop((0, crop_pixels, width, height))

    # Scale
    if target_width is not None and target_width < cropped.width:
        aspect_ratio = cropped.height / cropped.width
        new_height = int(target_width * aspect_ratio)
        cropped = cropped.resize((target_width, new_height), Image.LANCZOS)

    cropped.save(output_path)

def process_directory(crop_pixels, target_width, old_prefix, new_prefix):
    input_dir = "."
    output_dir = "./cropped"
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if filename.lower().endswith(".png"):
            if not filename.startswith(old_prefix):
                continue  # Skip files that don't match the prefix

            # Replace old prefix with new
            new_filename = new_prefix + filename[len(old_prefix):]
            input_path = os.path.join(input_dir, filename)
            output_path = os.path.join(output_dir, new_filename)

            print(f"Processing {filename} → {new_filename}")
            crop_and_scale(input_path, crop_pixels, output_path, target_width)

if __name__ == "__main__":
    if len(sys.argv) < 4:
        print("Usage: python batch_crop.py <crop_pixels> <old_prefix> <new_prefix> [target_width]")
        sys.exit(1)

    crop_pixels = int(sys.argv[1])
    old_prefix = sys.argv[2]
    new_prefix = sys.argv[3]
    target_width = int(sys.argv[4]) if len(sys.argv) > 4 else None

    process_directory(crop_pixels, target_width, old_prefix, new_prefix)
Example usage:

python batch_crop.py 40 IMG_ thumb_ 300

If you have files like IMG_001.png, the output will be named thumb_001.png.
Let me know if you’d like suffix support, regex matching, or nested folder traversal next!

This final script enabled me to crop, scale down, and rename all the PNGs from my phone with minimal effort. Here is an example of how I used it:

saint@MacBookPro Kobra3-5 % python3 crop.py 150 IMG_ 04-firmware-update_ 480
Processing IMG_4448.PNG → 04-firmware-update_4448.PNG
Processing IMG_4449.PNG → 04-firmware-update_4449.PNG
Processing IMG_4455.PNG → 04-firmware-update_4455.PNG
Processing IMG_4454.PNG → 04-firmware-update_4454.PNG
Processing IMG_4456.PNG → 04-firmware-update_4456.PNG
Processing IMG_4447.PNG → 04-firmware-update_4447.PNG
Processing IMG_4453.PNG → 04-firmware-update_4453.PNG
Processing IMG_4452.PNG → 04-firmware-update_4452.PNG
Processing IMG_4451.PNG → 04-firmware-update_4451.PNG

Syntax Highlighting on my Blog

This is the first post with enough inline code to get me to install a syntax highlighter. A Google search for “wordpress code formatting plugin” led me to this post on How To Display Code in WordPress (and Make It Look Pretty). The first plugin it recommended was Enlighter – Customizable Syntax Highlighter – WordPress plugin | WordPress.org. I compared it to Code Block Pro – Beautiful Syntax Highlighting – WordPress plugin | WordPress.org, which was the first result in my initial Bing search. I selected Enlighter because it appears to be developed as a community project, which makes it much more likely to still be around and supported in years to come.


Learning about Large Language Models

I am reading Sebastian Raschka’s Build a Large Language Model from Scratch book with a few colleagues at work. This video is mentioned in chapter 1 of the book.

Developing an LLM: Building, Training, Finetuning

We have quizzes to see how well we understand the material. These are the questions & notes I have jotted down from my reading so far.

Chapter 1

  1. What is an LLM? p2
  2. What are 2 dimensions that “large” refers to?
  3. Which architecture do LLMs utilize? p3
  4. Why are LLMs often referred to as generative AI/genAI?
  5. What is the relationship between AI, ML, deep learning, LLMs, and genAI?
  6. Give a difference between traditional ML and deep learning.
  7. What are other approaches to AI apart from ML and deep learning? p4
  8. List 5 applications of LLMs.
  9. What are 3 advantages of custom built LLMs? p5
  10. What are the 2 general steps in creating an LLM? p6
  11. What is a base/foundation model? Give an example. p7
  12. What are the few-shot capabilities of a base model?
  13. What are 2 categories of fine-tuning LLMs?
  14. Which architecture did Attention Is All You Need introduce?
  15. Describe the transformer architecture.

Part II

  1. What are the 2 submodules of a transformer? p7
  2. What is the purpose of the self-attention mechanism?
  3. What is BERT? What do the initials stand for? p8
  4. What does GPT stand for?
  5. What is the difference between BERT and GPT? Which submodule of the original transformer does each focus on?
  6. List a real-world application of BERT. p9
  7. What is the difference between zero-shot and few-shot capabilities?
  8. What are applications of transformers (other than LLMs)? p10
  9. Give 2 examples of architectures (other than transformers) that LLMs can be based on.
  10. See Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research for a publicly available training dataset (may contain copyrighted works) p11
  11. Why are models like GPT3 called based or foundation models?
  12. What is an estimate of the cost of training GPT3? See https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/
  13. What type of learning is next-word prediction? p12
  14. What is an autoregressive model? Why is GPT one?
  15. How many transformer layers and parameters does GPT3 have? p13
  16. When was GPT-3 introduced?
  17. Which task was the original transformer model explicitly designed for? p14
  18. What is emergent behavior?
  19. What are the 3 main stages of coding an LLM in this book?
  20. What is the key idea of the transformer architecture? p15

GPT References

  1. Improving Language Understanding by Generative Pre-Training p12
  2. Training language models to follow instructions with human feedback

Chapter 2

  1. What is embedding? p18
  2. What is retrieval-augmented generation? p19
  3. Which embeddings are popular for RAG?
  4. What is Word2Vec? What is the main idea behind it?
  5. What is an advantage of high dimensionality in word embeddings? A disadvantage?
  6. What is an advantage of optimizing embeddings as part of LLM training instead of using Word2Vec?
  7. What is the embedding size of the smaller GPT-2 models? The largest GPT-3 models?
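
At its core, an embedding layer is just a lookup table mapping token IDs to vectors. A minimal sketch with a hypothetical 4-token vocabulary and 3-dimensional embeddings (toy values, not from the book):

```python
# Hypothetical 4-token vocabulary, embedding dimension 3
embedding_matrix = [
    [0.1, 0.2, 0.3],  # token id 0
    [0.4, 0.5, 0.6],  # token id 1
    [0.7, 0.8, 0.9],  # token id 2
    [1.0, 1.1, 1.2],  # token id 3
]

def embed(token_ids):
    """An embedding layer is a lookup: token id -> matrix row."""
    return [embedding_matrix[i] for i in token_ids]

print(embed([2, 0]))  # [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3]]
```

In an LLM these rows are trainable parameters, optimized along with the rest of the model rather than taken from a fixed scheme like Word2Vec (question 6).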

Chapter 3

  1. Why can’t we simply translate a text from one language to another word by word? p52
  2. How can this challenge be addressed using a deep neural network?
  3. What is a recurrent neural network?
  4. What was the most popular encoder-decoder architecture before the advent of transformers?
  5. Explain how an encoder-decoder RNN works. p53
  6. What is the big limitation of encoder-decoder RNNs?
  7. What is the Bahdanau attention mechanism? p54
  8. What is self-attention? p55
  9. What serves as the cornerstone of every LLM based on the transformer architecture?
  10. What does the self in self-attention refer to? p56
  11. What is a context vector? p57
  12. Why are context vectors essential in LLMs?
  13. Why is the dot product a measure of similarity? p59
  14. Give 2 reasons why the attention scores are normalized.
  15. Why is it advisable to use the softmax function for normalization in practice? p60
  16. Why is it advisable to use the PyTorch implementation of softmax in particular (instead of your own)?
  17. What is the difference between attention scores and attention weights? p62
  18. How are context vectors computed from attention weights? p63
  19. Which are the 3 weight matrices in self-attention with trainable weights? p65
  20. How are these matrices initialized? How are they used?
  21. What is the difference between weight parameters (matrices) and attention weights?
  22. How are the attention scores computed in the self-attention with trainable weights technique?
  23. What about the attention weights? p68
  24. What is scaled dot-product attention? p69
  25. Why do we scale by the square root of the embedding dimension?
  26. How does the softmax function behave as the dot products increase?
  27. How is the context vector computed?
  28. What is nn.Module? p71
  29. What is a significant advantage of using nn.Linear instead of nn.Parameter(torch.rand(…))?
  30. What is causal attention?
  31. How can the tril function be used to create a mask where the values above the diagonal are 0?
  32. Explain a more effective masking trick for more efficiently computing the masked attention weights.
  33. What is dropout in deep learning?
  34. Which are the two specific times when dropout is typically applied in the transformer architecture?
  35. Why does nn.Dropout scale the remaining values? p79-80
  36. What are some advantages of using register_buffer? p81
  37. What is multi-head attention? p82
  38. How can multiple heads be processed in parallel? p85

Update 2025-03-26: here is the video discussing Chapter 3.

Build an LLM from Scratch 3: Coding attention mechanisms

This video on attention in transformers was shared by a colleague to help shed light on this concept:

Attention in transformers, step-by-step | Deep Learning Chapter 6