Intro to Deep Learning — Exam Study Guide

R1: Mock Exam (highest priority)
R2: In-Class Quiz
R3: Exercise Sheet
R4: Memory Protocol

Exam Hints from Prof. Vu

  • "I will only ask you the content that is on the slide." No transfer questions.
  • Coding questions: explain what a line does, find bugs — NOT write from scratch.
  • T/F: Each statement = 1 point. If unsure, write brief reasoning keywords for partial credit.
  • "In the exam, I showed you an example — if I ask you the same example, you should be able to do it."
  • Keep responses short. It is not necessary to be too wordy.
  • Only pen and paper allowed. 90 minutes. 77 points total.

Study Principles

  • Verbal First: Say the answer out loud before writing it down. The verbal step is where learning happens.
  • Active Recall: Click a question, try to answer it from memory, THEN check.
  • Breadth First: Cover every topic at surface level before going deep on any one.
  • Calculation Muscle Memory: Knowing the formula conceptually ≠ executing it under time pressure. Drill the computations.
  • Priority Order: R1 Mock → R2 Quizzes → R3 Exercises → R4 Memory Protocols.

Linear Algebra & Calculus (supports Tasks 2–5)

13 questions
R2 QUIZCalculation
Compute Wx+b where W=[[1,2,0],[-3,2,1]], x=[2,0,-2]ᵀ, b=[1,0]ᵀ
Week 3 Thu

Matrix-vector multiplication practice. Row 1: 1·2 + 2·0 + 0·(-2) + 1 = 3. Row 2: (-3)·2 + 2·0 + 1·(-2) + 0 = -8. Result: [3, -8]ᵀ
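A quick NumPy sketch to check this by machine while drilling by hand (practice snippet, not from the course materials):

```python
import numpy as np

# Matrices from the quiz question
W = np.array([[1, 2, 0],
              [-3, 2, 1]])
x = np.array([2, 0, -2])
b = np.array([1, 0])

y = W @ x + b  # matrix-vector product plus bias
print(y)  # [ 3 -8]
```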

R2 QUIZConceptual
What is the dimension of ∂g/∂x for g: ℝ³→ℝ³? (Jacobian)
Week 3 Thu

3×3 Jacobian matrix. The Jacobian of g: ℝⁿ→ℝᵐ has dimensions m×n. Here m=3, n=3 → 3×3.

R2 QUIZCalculation
Compute Jacobian ∂g/∂x for g: ℝ²→ℝ³
Week 3 Thu

The Jacobian is a 3×2 matrix where entry (i,j) = ∂gᵢ/∂xⱼ. Compute each partial derivative and arrange in the matrix. Practice with specific functions like g(x₁,x₂) = [x₁², x₁x₂, x₂³].

R2 QUIZCalculation
Probability: what is P(sum = 7) with 2 dice?
Week 3 Thu

Favorable outcomes for sum=7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6 outcomes. Total outcomes: 6×6 = 36. P = 6/36 = 1/6.

R2 QUIZCalculation
Probability: P(doubles) with 2 dice?
Week 3 Thu

Doubles: (1,1), (2,2), (3,3), (4,4), (5,5), (6,6) = 6 outcomes. Total: 36. P = 6/36 = 1/6.
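Both dice answers can be verified by brute-force enumeration of the 36 outcomes (practice snippet, not from the course materials):

```python
from itertools import product

# All 36 ordered outcomes of two dice
outcomes = list(product(range(1, 7), repeat=2))

p_sum7 = sum(a + b == 7 for a, b in outcomes) / len(outcomes)
p_doubles = sum(a == b for a, b in outcomes) / len(outcomes)
print(p_sum7, p_doubles)  # both 1/6
```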

R2 QUIZConceptual
Is a one-hot vector space closed under addition? (Trick question)
Week 2 transcript

No. Adding two one-hot vectors gives [2,0,0...] or similar — violates the 0/1 constraint. One-hot vectors don't form a vector space because the space isn't closed under addition.

R2 QUIZCalculation
Calculate cosine similarity — determine orthogonality.
Week 2 transcript

Cosine similarity = (a·b)/(||a|| · ||b||). If result = 0, vectors are orthogonal (perpendicular). If = 1, parallel and same direction. If = -1, parallel and opposite direction. Practice computing dot products and norms for specific vectors.
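The formula as a small NumPy helper, with the orthogonal and parallel cases (hypothetical example vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity = (a·b) / (||a|| · ||b||)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Orthogonal pair: dot product is 0, so similarity is 0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 2.0])))  # 0.0
# Same direction: similarity is 1
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0
```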

R2 QUIZConceptual
Derivative vs Jacobian — what's the distinction?
Week 2 Fri transcript

Derivative: For f: ℝ→ℝ, a single number f'(x). Gradient: For f: ℝⁿ→ℝ, a vector of partial derivatives ∇f. Jacobian: For f: ℝⁿ→ℝᵐ, an m×n matrix of all partial derivatives. The Jacobian generalizes the derivative to vector-valued functions.

R3 EXERCISECalculation
Various vector/matrix operations (10 sub-questions).
Exercise Sheet 1, Q1.1–1.10

Practice: dot products, matrix multiplication, transpose, determinant, inverse, norm, cross product, eigenvalues. These are foundational operations used in every forward/backward pass calculation.

R3 EXERCISECalculation
Compute derivatives of various functions.
Exercise Sheet 2, Q2.1

Practice: power rule, chain rule, product rule. Key for backprop: d/dx(eˣ) = eˣ, d/dx(ln x) = 1/x, d/dx(σ(x)) = σ(x)(1-σ(x)), d/dx(tanh x) = 1-tanh²(x), d/dx(ReLU) = 1 if x>0 else 0.

R3 EXERCISECalculation
Eigenvalue computation.
Exercise Sheet 2, Q2.2

Find eigenvalues by solving det(A - λI) = 0. For 2×2 matrix [[a,b],[c,d]]: λ² - (a+d)λ + (ad-bc) = 0. Use quadratic formula. Then find eigenvectors by solving (A - λI)v = 0 for each eigenvalue.
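Worth checking hand computations against NumPy. For the hypothetical matrix A = [[2,1],[1,2]], the characteristic polynomial is λ² − 4λ + 3 = 0, giving λ = 1 and λ = 3:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# By hand: det(A - λI) = λ² - 4λ + 3 = 0  →  λ = 1 or λ = 3
eigvals, eigvecs = np.linalg.eig(A)
print(sorted(eigvals))  # [1.0, 3.0]
```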

R3 EXERCISECalculation
Probability, expected value, and variance calculations.
Exercise Sheet 3, Q3.1

E[X] = Σ xᵢ · P(xᵢ). Var(X) = E[X²] - (E[X])². For continuous: use integrals. Key distributions: Bernoulli, Gaussian (normal), Uniform. Practice computing these for discrete probability tables.

R3 EXERCISECalculation
Gradient descent steps — compute parameter updates.
Exercise Sheet 3, Q3.2

Update rule: w_{t+1} = w_t - η · ∂L/∂w. Practice: (1) Compute the gradient of a loss function, (2) Apply the update rule for 2-3 steps with a given learning rate. Watch how the parameter moves toward the minimum.
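The update rule on a toy loss L(w) = (w − 3)², so dL/dw = 2(w − 3) (hypothetical example, not from the sheet):

```python
# Toy loss L(w) = (w - 3)^2 with minimum at w = 3
def grad(w):
    return 2 * (w - 3)

w, eta = 0.0, 0.1
for _ in range(3):
    w = w - eta * grad(w)  # update rule: w ← w − η · dL/dw
    print(w)               # 0.6, 1.08, 1.464 — moving toward w = 3
```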

Conceptual

12 questions
R1 MOCK · 2 pts · Conceptual
1a. What is the main difference between a regression and a classification task? Name one regression task and one classification task in speech or natural language processing.
Mock Exam 1a

Verbatim from mock exam.

Key distinction: Regression predicts a continuous value. Classification predicts a discrete category/label.

NLP examples — Regression: predicting a sentiment score (1.0–5.0). Classification: spam detection, language identification, next-word prediction.

R2 QUIZConceptual
"Is predicting the next word classification or regression?"
Week 13 Thu transcript

Classification. Words are discrete categories, not continuous numbers. The output is a probability distribution over the vocabulary.

R2 QUIZConceptual
"High training error indicates underfitting or overfitting?"
Week 4 Thu

Underfitting. The model can't even fit the training data well. Overfitting would show LOW training error but HIGH test error.

R2 QUIZConceptual
"What does empirical risk mean?" (trick: avg loss on TRAINING data, not dev)
Week 4 Thu

Average loss on the TRAINING data. Not the dev set. Empirical risk = (1/N) Σ L(f(xᵢ), yᵢ) over training samples. The "empirical" part means we use the actual observed data, not the true distribution.

R2 QUIZConceptual
"More features = better performance?" (No — curse of dimensionality)
Week 4 Fri

No. The curse of dimensionality: more features means the data becomes sparser in high-dimensional space. Need exponentially more data to maintain the same density. Can lead to overfitting. Feature selection/reduction (PCA, etc.) can help.

R2 QUIZConceptual
"What are the stop-criteria for k-means?"
Week 4 Thu

K-means stops when: (1) No data points change cluster assignment, (2) Centroids don't move significantly, (3) Maximum number of iterations reached, or (4) Average distance to centroids falls below a threshold.

R3 EXERCISEConceptual
Feature engineering — what is it and why does it matter?
Exercise Sheet 4, Q4.2

Feature engineering is the process of creating/selecting/transforming input features to improve model performance. Includes normalization, encoding categorical variables, creating interaction terms. Good features can make simple models perform well; bad features make even complex models fail.

R3 EXERCISEConceptual
Training curves interpretation — what do different curve shapes indicate?
Exercise Sheet 4, Q4.5

Training curves plot loss/accuracy vs. epochs. Key patterns: (1) Both curves converging = good fit, (2) Training low but val high = overfitting, (3) Both high = underfitting, (4) Gap between curves = generalization gap. Use to decide: more data, more complexity, or regularization.

R4 MEMORYConceptual
Define kernel, support vector, and maximal margin (SVM)
Memory Protocol 2021

Kernel: A function that computes dot products in a higher-dimensional space without explicitly mapping to it (kernel trick). Support vectors: The data points closest to the decision boundary — they "support" it. Maximal margin: The largest possible gap between the decision boundary and the nearest data points of each class.

R4 MEMORYConceptual
SVM for multiclass: how?
Memory Protocol 2022

SVMs are natively binary classifiers. For multiclass: (1) One-vs-All (OvA): Train K classifiers, each separating one class from all others. (2) One-vs-One (OvO): Train K(K-1)/2 classifiers for each pair. Use voting for final decision.

R4 MEMORYConceptual
Decision trees more transparent than NNs: why?
Memory Protocol 2022

Decision trees produce human-readable if/then rules. You can trace exactly why a prediction was made by following the path from root to leaf. Neural networks are "black boxes" — millions of parameters interact in non-linear ways, making it nearly impossible to explain individual predictions.

R4 MEMORYConceptual
Which NN model for sentence classification? Why different random initializations?
exam_example.pdf 1c

Feedforward NN with bag-of-words input, or CNN/RNN for sequence. Different random initializations lead to different local minima during training, so models may learn different features. Common practice: train multiple and ensemble, or pick best on validation.

Calculation

5 questions
R1 MOCK · 4 pts · Calculation
1b. Given a linear regression model \(f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}\) with \(\mathbf{w} = [1, 0, 1, 0]^T\), compute the MSE for these 3 data points: \(\mathbf{x}_1=[1,0,0,0]^T, y_1=2\); \(\mathbf{x}_2=[0,1,0,1]^T, y_2=1\); \(\mathbf{x}_3=[0,0,1,1]^T, y_3=3\).
Mock Exam 1b

Verbatim from mock exam.

Step 1 — predictions: \(f(x_1)=1\cdot1+0+1\cdot0+0=1\), \(f(x_2)=0+0+0+0=0\), \(f(x_3)=0+0+1+0=1\)

Step 2 — squared errors: \((2-1)^2=1\), \((1-0)^2=1\), \((3-1)^2=4\)

Step 3 — MSE: \(\frac{1}{3}(1+1+4) = \frac{6}{3} = 2\)
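The three steps above in NumPy, for self-checking under time pressure (practice snippet):

```python
import numpy as np

w = np.array([1, 0, 1, 0])
X = np.array([[1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])   # the three data points as rows
y = np.array([2, 1, 3])

preds = X @ w                   # [1, 0, 1]
mse = np.mean((y - preds) ** 2) # (1 + 1 + 4) / 3 = 2
print(mse)  # 2.0
```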

R3 EXERCISECalculation
K-means clustering computation — assign points to clusters, update centroids, iterate.
Exercise Sheet 4, Q4.1

K-means algorithm: (1) Initialize K centroids randomly, (2) Assign each point to nearest centroid (Euclidean distance), (3) Recompute centroids as mean of assigned points, (4) Repeat until convergence. Practice the distance calculations and centroid updates by hand.

R3 EXERCISECalculation
Logistic regression computation — apply sigmoid to linear output.
Exercise Sheet 4, Q4.3

Logistic regression: \(\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)\) where \(\sigma(z) = \frac{1}{1+e^{-z}}\). Output is a probability ∈ (0,1). Decision boundary at 0.5. Practice computing the sigmoid for specific values.
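A few sigmoid anchor values worth memorizing, computed directly (practice snippet):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5 — the decision boundary
print(sigmoid(2))   # ≈ 0.88
print(sigmoid(-2))  # ≈ 0.12, i.e. 1 - sigmoid(2)
```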

R3 EXERCISECalculation
Linear regression / MSE calculation from exercise sheet.
Exercise Sheet 4, Q4.4

Same MSE formula as mock exam: \(\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\). Practice computing predictions from weight vectors, then computing the loss.

R4 MEMORYCalculation
Logistic regression output calculation — compute ŷ given weights and input.
exam_example.pdf 1b

Given weights w, bias b, and input x: compute z = wᵀx + b, then ŷ = σ(z) = 1/(1+e^(-z)). Remember: σ(0) = 0.5, σ is monotonic, σ(-z) = 1 - σ(z).

True/False

14 questions
R1 MOCK · 5 pts · True/False
1c. True or False? (5 statements, 1 pt each)
Mock Exam 1c
Statement | T | F
The goal of machine learning is to overfit the training data.
Regularization is used to prevent the model from overfitting.
Support Vector Machines (SVM) can only be used for binary classification.
K-means clustering is a partitional clustering algorithm.
Hyperparameters should be tuned on the test set.

Verbatim from mock exam. Key: goal is generalization not overfitting; SVMs can do multiclass via OvA/OvO; hyperparams tuned on validation set, NOT test set.

R2 QUIZTrue/False
T/F: "Hyperparameters and parameters are model-dependent"
Week 4 Thu

TRUE — but nuanced. Parameters (weights) are learned during training and are model-dependent. Hyperparameters (learning rate, etc.) are set before training. Most hyperparameters ARE model-dependent (e.g., number of layers), but learning rate is a general hyperparameter not tied to a specific model architecture.

R2 QUIZTrue/False
T/F: "In k-means, k is a trainable parameter"
Week 4 Thu

FALSE. k is a hyperparameter — it's set before training and not learned during the algorithm. The trainable "parameters" are the centroid positions.

R2 QUIZTrue/False
T/F: "The more training data, the better the SVM performs"
Week 4 Fri

FALSE. More data doesn't always help. Noisy data can hurt. Also, SVM performance depends on finding good support vectors — beyond a point, additional data points far from the decision boundary don't change the model.

R2 QUIZTrue/False
T/F: "The goal of preventing overfitting is to better generalize on unseen data"
Week 13 Thu transcript

TRUE. The entire point of regularization, early stopping, dropout, etc. is to ensure the model performs well on data it hasn't seen during training.

R2 QUIZTrue/False
T/F: "SVM can only be used for binary classification"
Week 13 Thu transcript

FALSE. SVMs can be extended to multiclass via One-vs-All or One-vs-One strategies. But natively, a single SVM is binary.

R2 QUIZTrue/False
T/F: "Is KNN parametric where k is a parameter?"
Week 6 Thu

FALSE. KNN is non-parametric — it doesn't learn fixed parameters. k is a hyperparameter (set before "training"). KNN stores all training data and computes distances at prediction time.

R2 QUIZTrue/False
T/F: "Can you stop k-means if avg max distance to centroids < fixed value?"
Week 6 Thu

TRUE. This is a valid convergence criterion. When no point is farther than a threshold from its centroid, the clustering is stable enough.

R2 QUIZTrue/False
T/F: "Does SVM decision boundary depend only on data near boundary?"
Week 6 Thu

TRUE. The decision boundary depends only on the support vectors — the points closest to it. All other points could be removed without changing the boundary.

R2 QUIZTrue/False
T/F: "Can gradient descent find global minimum?"
Week 6 Thu

FALSE (in general). Gradient descent can get stuck in local minima for non-convex loss functions (like neural networks). For convex functions (like linear regression), it does find the global minimum.

R2 QUIZTrue/False
T/F: "Is LDA a supervised method learning linear transformation?"
Week 6 Thu

TRUE. Linear Discriminant Analysis is supervised — it uses class labels to find a linear transformation that maximizes class separation.

R1 MOCK EXAM · True/False · 10 pts
[2024 Mock 1a] True or False? (10 statements, MINUS points for wrong answers, min 0)
Source: 2024 Mock Exam 1a
Statement | T | F
K-nearest neighbors algorithms is a parametric method.
Hyperparameters are trained to optimize the results on the training set.
Generalisation is a central problem of machine learning.
An overfitted model perfectly matches the evaluation data.
Regularization typically increases the error on the training set.
Logistic regression is typically used to predict real values.
Empirical risk is an average of a loss function on a finite development set.
Support vector machine can be used only for binary classification tasks.
Each data item is assigned to exactly one cluster with k-means clustering.
In active learning, systems actively select queries and request feedback from human.

From 2024 Mock Exam. Key traps:

Overfitted model ≠ matches evaluation data. Overfitting means matching TRAINING data too well, performing POORLY on evaluation/test data.

Empirical risk uses TRAINING set, not development set — this is the same trick from the in-class quiz.

10 statements with MINUS scoring — our existing mock had only 5. Budget more time for this.

R3 EXERCISETrue/False
ML fundamentals T/F block from exercise sheet.
Exercise Sheet 4, Q4.6

Practice T/F statements from exercise sheets. These cover similar ground to the mock exam: overfitting, regularization, model selection, bias-variance tradeoff.

R4 MEMORYTrue/False
T/F: k-means, CNN, RNN vanishing gradients, parameter tuning, batch training.
exam_example.pdf 1d

Mixed-topic T/F from older exam. Test yourself on each claim individually. Remember: brief reasoning keywords can earn partial credit even if your T/F answer is wrong.

Conceptual

7 questions
R3 EXERCISEConceptual
Loss function selection — when to use which loss?
Exercise Sheet 6, Q6.4

MSE: regression tasks. Cross-entropy (CE): classification tasks. Binary CE: two-class problems. Categorical CE: multi-class (with softmax). CE is preferred for classification because it penalizes confident wrong predictions more heavily and has nicer gradients with softmax.

R3 EXERCISEConceptual
Activation functions — compare sigmoid, tanh, ReLU.
Exercise Sheet 6, Q6.5

Sigmoid: σ(z) = 1/(1+e^(-z)), output (0,1), vanishing gradient for large |z|. Tanh: output (-1,1), zero-centered, still vanishing gradient. ReLU: max(0,z), no vanishing gradient for z>0, but "dying ReLU" for z<0. ReLU is most common in hidden layers; sigmoid/softmax for output layers.

R4 MEMORYConceptual
Explain how backward pass works (in words).
Memory Protocol 2021

The backward pass computes gradients of the loss w.r.t. each parameter using the chain rule. Starting from the output layer: (1) Compute error signal δ at output, (2) Propagate δ backward through each layer, (3) At each layer, compute ∂L/∂W = δ · aᵀ (activation from previous layer), (4) Update δ for next layer back using weights and activation derivative.

R4 MEMORYConceptual
Why non-linear activation functions?
Memory Protocol 2021 + 2022

Without non-linear activations, stacking multiple linear layers collapses to a single linear transformation: W₂(W₁x + b₁) + b₂ = W'x + b'. The network couldn't learn any non-linear patterns regardless of depth. Non-linear activations let the network approximate arbitrary functions.
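The collapse can be demonstrated numerically with random weights (hypothetical sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

# Two stacked linear layers with no activation in between...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with W' = W2·W1 and b' = W2·b1 + b2
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))  # True
```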

R4 MEMORYConceptual
Why deep networks (multiple layers)?
Memory Protocol 2021

Deeper networks can represent increasingly abstract features hierarchically. Early layers learn simple patterns (edges), later layers combine them into complex concepts (faces). A single wide layer would need exponentially more neurons to represent the same functions. Depth = compositional power.

R4 MEMORYConceptual
Name 3 hyperparameters of a feedforward neural network.
Memory Protocol 2021

(1) Number of layers (depth), (2) Number of neurons per layer (width), (3) Learning rate. Others: activation function, batch size, number of epochs, optimizer choice, regularization strength.

Calculation

11 questions
R1 MOCK · 4 pts · Calculation
2a. Given sentence "this exam is fair" with vocabulary V={exam, fair, is, this}, compute bag-of-words → forward pass → cross-entropy loss.

Network: input (4) → hidden (3, ReLU) → output (2, softmax). True label: positive (class 1).

\(W^1 = \begin{bmatrix} 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0 & 0 \end{bmatrix}\), \(W^2 = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}\), \(b^2 = \begin{bmatrix} 0 \\ 0.5 \end{bmatrix}\)
Mock Exam 2a

Verbatim from mock exam.

Step 1 — BoW: x = [1,1,1,1]ᵀ (each word appears once)

Step 2 — Hidden: h = ReLU(W¹x) = ReLU([0.5, 0.5, 0.5]ᵀ) = [0.5, 0.5, 0.5]ᵀ

Step 3 — Output: z = W²h + b² = [0.25, 0.75]ᵀ

Step 4 — Softmax: \(\hat{y} = [\frac{e^{0.25}}{e^{0.25}+e^{0.75}}, \frac{e^{0.75}}{e^{0.25}+e^{0.75}}]\)

Step 5 — CE Loss: \(L = -\log(\hat{y}_{\text{class 1}})\) where class 1 = positive
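The five steps in NumPy, to verify the hand computation (practice snippet; class 1 is taken as the first component, matching 2b's convention where class 2 corresponds to y = [0, 1]):

```python
import numpy as np

W1 = np.array([[0, 0.5, 0, 0],
               [0, 0, 0.5, 0],
               [0.5, 0, 0, 0]])
W2 = np.array([[0.5, 0, 0],
               [0, 0, 0.5]])
b2 = np.array([0, 0.5])

x = np.array([1, 1, 1, 1])           # BoW for "this exam is fair"
h = np.maximum(0, W1 @ x)            # ReLU hidden layer → [0.5, 0.5, 0.5]
z = W2 @ h + b2                      # → [0.25, 0.75]
y_hat = np.exp(z) / np.exp(z).sum()  # softmax
loss = -np.log(y_hat[0])             # CE for the true class (class 1 = first component)
print(z, y_hat)
```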

R1 MOCK · 5 pts · Calculation
2b. Backward pass: Given ŷ = [0.88, 0.12] and true label = class 2 (y = [0, 1]), compute δ² and gradients ∇W²CE, ∇b²CE.
Mock Exam 2b

Verbatim from mock exam.

Step 1 — Error signal: δ² = ŷ - y = [0.88, 0.12] - [0, 1] = [0.88, -0.88]

Step 2 — Weight gradient: \(\nabla_{W^2} = \delta^2 \cdot (a^1)^T\) where a¹ is the hidden layer output from forward pass

Step 3 — Bias gradient: \(\nabla_{b^2} = \delta^2 = [0.88, -0.88]\)

Key insight: For softmax + CE, the error signal simplifies to ŷ - y (prediction minus target).
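The three gradient steps in NumPy (practice snippet; a¹ is taken from the 2a forward pass):

```python
import numpy as np

y_hat = np.array([0.88, 0.12])
y = np.array([0, 1])             # true label: class 2
a1 = np.array([0.5, 0.5, 0.5])   # hidden activations from the forward pass

delta2 = y_hat - y               # [0.88, -0.88] — softmax+CE error signal
grad_W2 = np.outer(delta2, a1)   # δ² · (a¹)ᵀ, shape 2×3
grad_b2 = delta2
print(grad_W2)
```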

R2 QUIZCalculation
Compute output y of 2-layer network: x=[1,0]ᵀ, given W¹, b¹, W², b², with ReLU→softmax.
Week 6 Thu

Same process as mock 2a but with different dimensions. (1) Compute z¹ = W¹x + b¹, (2) Apply ReLU: a¹ = max(0, z¹), (3) Compute z² = W²a¹ + b², (4) Apply softmax: ŷ = softmax(z²). Practice this until automatic.

R3 EXERCISECalculation
Parameter counting — how many weights and biases in each layer?
Exercise Sheet 6, Q6.1

For a layer mapping from n inputs to m outputs: Weights: m × n parameters, Biases: m parameters. Total per layer: m(n+1). For the whole network, sum across all layers. Example: 4→3→2 network = 3×4 + 3 + 2×3 + 2 = 12+3+6+2 = 23 parameters.
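The counting rule as a small helper function (practice snippet; `count_params` is a name chosen here, not from the sheet):

```python
def count_params(layer_sizes):
    """Weights + biases of a fully-connected net, e.g. [4, 3, 2]."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_out * n_in + n_out  # m*n weights + m biases per layer
    return total

print(count_params([4, 3, 2]))  # 23
```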

R3 EXERCISECalculation
Forward pass computation from exercise sheet.
Exercise Sheet 6, Q6.2

Same procedure as mock exam. Practice with different weight matrices and activation functions. Key steps: (1) Linear: z = Wx + b, (2) Activation: a = f(z), (3) Repeat for each layer, (4) Final output with appropriate activation (softmax for classification).

R3 EXERCISECalculation
Backpropagation computation from exercise sheet.
Exercise Sheet 6, Q6.3

Full backprop: (1) Compute δ at output layer (depends on loss function), (2) For each layer going backward: ∂L/∂W = δ · aᵀ (previous activation), ∂L/∂b = δ, then propagate: δ_prev = (Wᵀδ) ⊙ f'(z). The ⊙ is element-wise multiplication with the activation derivative.

R4 MEMORYCalculation
One-hot encode sentence → forward pass with f=x² → CE loss.
DL_mock_exam.pdf Ex2 + Gedächtnisprotokoll

Variant of mock exam 2a but with f(x) = x² as activation instead of ReLU. Same steps: encode input → matrix multiply → apply activation → matrix multiply → softmax → CE loss. Be careful: x² activation means f'(x) = 2x for the backward pass.

R4 MEMORYCalculation
Fill in backpropagation formulas from slide.
Gedächtnisprotokoll

The key formulas: (1) Output error: δᴸ = ŷ - y (for softmax+CE), (2) Hidden error: δˡ = (Wˡ⁺¹)ᵀδˡ⁺¹ ⊙ f'(zˡ), (3) Weight gradient: ∂L/∂Wˡ = δˡ(aˡ⁻¹)ᵀ, (4) Bias gradient: ∂L/∂bˡ = δˡ. Make sure you can fill these in from memory.

R1 MOCK EXAM · Calculation · 5 pts
[2024 Mock 2a] Tweet sentiment classification: Given a 2-hidden-layer FFNN with activation \(f = x^2\) (element-wise), softmax output, no biases. Weights: \(W^1 \in \mathbb{R}^{2 \times 4}\), \(W^2 \in \mathbb{R}^{2 \times 2}\), \(W^3 \in \mathbb{R}^{3 \times 2}\). Compute output \(y\) for tweet 'this exam is fair' using bag-of-words with V = {exam, good, bad, fair}. Then compute cross-entropy loss (using \(\log_{10}\)) given correct label \(\hat{y} = (1,0,0)^T\).
Source: 2024 Mock Exam 2a

From 2024 Mock Exam. Key differences from our existing mock:

• Same f=x² activation trick, but different weight matrices:

\(W^1 = \begin{bmatrix} 0 & 1 & -1 & 0 \\ 1 & -1 & 0 & 1 \end{bmatrix}\), \(W^2 = \begin{bmatrix} -1 & 1 \\ 1 & 0 \end{bmatrix}\), \(W^3 = \begin{bmatrix} 0 & 1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix}\)

  • 3-class output (positive/negative/neutral) with softmax

• Uses log₁₀ for cross-entropy (not ln) — watch the base!

• Round to 1 decimal point

Procedure: (1) Encode tweet as BoW → x = [1,0,0,1]ᵀ, (2) z¹ = W¹x, a¹ = (z¹)², (3) z² = W²a¹, a² = (z²)², (4) z³ = W³a², y = softmax(z³), (5) CE = -log₁₀(ŷ_true_class)
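The procedure in NumPy, for checking your hand computation (practice snippet; BoW ordering follows V = {exam, good, bad, fair}):

```python
import numpy as np

W1 = np.array([[0, 1, -1, 0],
               [1, -1, 0, 1]])
W2 = np.array([[-1, 1],
               [1, 0]])
W3 = np.array([[0, 1],
               [1, -1],
               [-1, 1]])

x = np.array([1, 0, 0, 1])         # BoW: only "exam" and "fair" are in V
a1 = (W1 @ x) ** 2                 # f = x² activation → [0, 4]
a2 = (W2 @ a1) ** 2                # → [16, 0]
z3 = W3 @ a2                       # → [0, 16, -16]
y = np.exp(z3) / np.exp(z3).sum()  # softmax ≈ [0.0, 1.0, 0.0]
loss = -np.log10(y[0])             # CE in log base 10 for label (1,0,0)ᵀ
print(np.round(y, 1), round(loss, 1))
```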

R1 MOCK EXAM · Conceptual · 4 pts
[2024 Mock 2b] Given a backpropagation diagram showing \(\frac{\partial C}{\partial w_{ij}^l} = \frac{\partial z_i^l}{\partial w_{ij}^l} \cdot \frac{\partial C}{\partial z_i^l}\), fill in the two missing equations: (1) the forward pass formula and (2) the backward pass (δ) formula.
Source: 2024 Mock Exam 2b

From 2024 Mock Exam. Fill-in-the-blank format.

Box 1 (Forward pass):

\(z_i^l = \sum_j w_{ij}^l \cdot a_j^{l-1}\) (for l > 1), or \(z_i^l = \sum_j w_{ij}^l \cdot x_j\) (for l = 1)

Box 2 (Backward pass / δ):

\(\delta_i^l = \frac{\partial C}{\partial z_i^l}\) — the error signal at neuron i in layer l

This is a diagram-based question testing whether you understand how the chain rule decomposes into a forward pass component and a backward pass component.

R1 MOCK EXAM · Conceptual · 3 pts
[2024 Mock 2c] Why does parameter initialization have a strong impact on the final results? Describe one possibility how to initialize the parameters.
Source: 2024 Mock Exam 2c

From 2024 Mock Exam.

Why initialization matters: Neural networks use gradient descent to find local minima. Different initializations start the optimization at different points in the loss landscape, leading to different local minima → different final results. Bad initialization can cause vanishing/exploding gradients from the very start.

Xavier/Glorot initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in} + n_{out}})\) — keeps variance of activations stable across layers. Good for sigmoid/tanh.

He initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — designed for ReLU activations.

Key point: we NEVER initialize all weights to the same value (e.g., all zeros) because then all neurons compute the same thing → symmetry problem → network can't learn different features.


Code / Bug-Finding

1 question
R1 MOCK · 5 pts · Code
2c. PyTorch training loop — explain what loss.backward() does, and identify 2 missing optimizer operations.
Mock Exam 2c

Verbatim from mock exam.

for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()

    print(f'Epoch {epoch}, Loss: {loss.item()}')

loss.backward() computes gradients of the loss w.r.t. all parameters via backpropagation (fills .grad attributes).

Missing operations:

1. optimizer.zero_grad() — must clear old gradients before backward() (otherwise they accumulate)

2. optimizer.step() — must update parameters using the computed gradients (after backward())
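A corrected loop, sketched end-to-end with toy stand-ins for the exam's model, criterion, and data loader (hypothetical setup; variable names match the exam snippet):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins so the loop runs: a linear classifier on random data
model = nn.Linear(2, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_loader = [(torch.randn(8, 2), torch.randint(0, 2, (8,)))]
num_epochs = 20

losses = []
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()              # missing op 1: clear accumulated gradients
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()                    # compute gradients (fills .grad)
        optimizer.step()                   # missing op 2: apply the parameter update
    losses.append(loss.item())
```

With both missing calls restored, the loss actually goes down over the epochs; without `step()` the parameters never change, and without `zero_grad()` stale gradients accumulate across batches.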

True/False

4 questions
R1 MOCK · 5 pts · True/False
2d. True or False? (5 statements, 1 pt each)
Mock Exam 2d
Statement | T | F
A neural network layer can be described as a linear transformation followed by a nonlinear activation.
Cross-entropy is often used as loss function for multi-class, multi-label classification.
In the backward pass, we start computing from δ¹.
A neural network typically consists of multiple layers.
ReLU activation is defined as min(0, z).

Verbatim from mock exam. Key: backward starts from OUTPUT (δᴸ), not δ¹. ReLU = max(0,z) not min.

R2 QUIZTrue/False
T/F: "Cross-entropy loss is often used for multi-label classification"
Week 13 Thu transcript

TRUE — this is tricky. Binary cross-entropy can be applied per-label independently for multi-label problems. Each label gets its own sigmoid output and BCE loss. The losses are then summed.

R2 QUIZTrue/False
T/F: "Neural network typically consists of multiple layers"
Week 13 Thu transcript

TRUE. By definition, a neural network has at least an input layer, one or more hidden layers, and an output layer. Even the simplest useful NN has multiple layers.


Conceptual

7 questions
R1 MOCK · 2 pts · Conceptual
3a. What is the main difference between an RNN and a feed-forward neural network? What kind of input data can RNNs handle that FFNNs cannot?
Mock Exam 3a

Verbatim from mock exam.

RNNs have recurrent connections — they maintain a hidden state that is updated at each timestep, creating a "memory" of previous inputs. FFNNs process each input independently with no memory.

RNNs handle sequential/variable-length input data (text, speech, time series) where order matters. FFNNs require fixed-size input.

R1 MOCK · 2 pts · Conceptual
3b. Write the Elman RNN formula for the hidden state. Define all variables.
Mock Exam 3b

Verbatim from mock exam.

\(a_t = f(W_i x_t + W_h a_{t-1} + b)\)

Where: \(a_t\) = hidden state at time t, \(x_t\) = input at time t, \(a_{t-1}\) = previous hidden state, \(W_i\) = input-to-hidden weights, \(W_h\) = hidden-to-hidden (recurrent) weights, \(b\) = bias, \(f\) = activation function (e.g., tanh, ReLU).

R4 MEMORYConceptual
LSTM: how it works (intuitive explanation).
Memory Protocol 2021

LSTM uses gates to control information flow: (1) Forget gate: decides what to discard from cell state (sigmoid → 0=forget, 1=keep), (2) Input gate: decides what new info to store (sigmoid × tanh candidate), (3) Output gate: decides what to output based on cell state. The cell state acts as a "conveyor belt" — information can flow through unchanged, solving vanishing gradients.

R4 MEMORYConceptual
Why does LSTM have 4x parameters of a simple RNN?
Memory Protocol 2021 + 2022

A simple RNN has one set of weight matrices (Wᵢ, Wₕ, b). An LSTM has 4 sets — one for each gate: forget gate (Wf), input gate (Wi), candidate cell state (Wc), and output gate (Wo). Each gate has its own input weights, recurrent weights, and bias. So ~4× the parameters.

R4 MEMORYConceptual
Elman vs Jordan RNN — what's the difference?
Memory Protocol 2021 + 2022

Elman RNN: Hidden state feeds back to itself. \(a_t = f(W_i x_t + W_h a_{t-1})\). Jordan RNN: Output feeds back to hidden layer. \(a_t = f(W_i x_t + W_h y_{t-1})\). Elman is more common in practice. Key distinction: what gets fed back — hidden state (Elman) vs. output (Jordan).

R4 MEMORYConceptual
Gates in LSTM: what is the functionality of each?
Memory Protocol 2022

Forget gate: fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — controls what to forget from cell state. Input gate: iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — controls what new info to add. Output gate: oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — controls what to output. All gates use sigmoid (0-1 range) to act as "valves".

R4 MEMORYConceptual
Is "backpropagation through space" a thing?
Gedächtnisprotokoll

No. The correct term is Backpropagation Through Time (BPTT). We "unroll" the RNN across timesteps and backpropagate through the unrolled graph. "Through space" is not a real concept — it's a trick question.

Calculation

8 questions
R1 MOCK · 3 pts · Calculation
3c. Given an Elman RNN with no biases and ReLU activation, compute the output at timestep 2.

\(W_i = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\), \(W_h = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\), \(W_o = \begin{bmatrix} 1 & 1 \end{bmatrix}\), \(a_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\), \(x_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), \(x_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\)
Mock Exam 3c

Verbatim from mock exam.

Step 1 — t=1: \(a_1 = \text{ReLU}(W_i x_1 + W_h a_0) = \text{ReLU}([1,0]^T + [0,0]^T) = [1,0]^T\)

Step 2 — t=2: \(a_2 = \text{ReLU}(W_i x_2 + W_h a_1) = \text{ReLU}([0,1]^T + [0,1]^T) = [0,2]^T\)

Step 3 — output: \(y_2 = W_o a_2 = [1,1] \cdot [0,2]^T = 2\)
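The recurrence in NumPy, to verify the hand computation (practice snippet):

```python
import numpy as np

Wi = np.eye(2)
Wh = np.array([[0, 1],
               [1, 0]])
Wo = np.array([1, 1])

a = np.zeros(2)                         # a_0
for x in (np.array([1, 0]), np.array([0, 1])):
    a = np.maximum(0, Wi @ x + Wh @ a)  # ReLU(Wi·x_t + Wh·a_{t-1})

y2 = Wo @ a                             # [1,1]·[0,2]ᵀ = 2
print(y2)  # 2.0
```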

R3 EXERCISECalculation
RNN/LSTM parameter counting.
Exercise Sheet 9, Q9.1

For an RNN with input size n and hidden size h: Wᵢ has h×n params, Wₕ has h×h params, bias has h params. Total: h(n+h+1). For LSTM: multiply by 4 (four gates). Plus output layer Wₒ with o×h params. Practice counting for specific dimensions.

R3 EXERCISECalculation
Elman RNN forward pass computation (exercise sheet version).
Exercise Sheet 9, Q9.2

Same process as mock exam 3c. Given specific weight matrices and inputs, compute hidden states step by step: aₜ = f(Wᵢxₜ + Wₕaₜ₋₁ + b). Then compute outputs yₜ = Wₒaₜ. Practice with different sizes and activations.

R3 EXERCISECalculation
LSTM cell computation — step through one timestep.
Exercise Sheet 9, Q9.3

For one LSTM timestep: (1) fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf), (2) iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi), (3) c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc), (4) cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ, (5) oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo), (6) hₜ = oₜ ⊙ tanh(cₜ). Practice computing each gate value for given inputs.
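The six steps as one NumPy function (practice sketch; the tiny all-zero-weight example is hypothetical, chosen so every gate evaluates to 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    v = np.concatenate([h_prev, x])  # [h_{t-1}, x_t]
    f = sigmoid(Wf @ v + bf)         # forget gate
    i = sigmoid(Wi @ v + bi)         # input gate
    c_tilde = np.tanh(Wc @ v + bc)   # candidate cell state
    c = f * c_prev + i * c_tilde     # new cell state
    o = sigmoid(Wo @ v + bo)         # output gate
    h = o * np.tanh(c)               # new hidden state
    return h, c

# Hidden size 1, input size 1, all weights/biases zero → every gate = σ(0) = 0.5
h, c = lstm_step(np.array([1.0]), np.zeros(1), np.array([2.0]),
                 *(np.zeros((1, 2)) for _ in range(4)),
                 *(np.zeros(1) for _ in range(4)))
print(h, c)  # c = 0.5·2 + 0.5·0 = 1, h = 0.5·tanh(1)
```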

R4 MEMORYCalculation
RNN with one-hot encoding of "deep learning", f=x², softmax output.
DL_mock_exam.pdf Ex4

Similar to mock exam but with RNN architecture and x² activation. One-hot encode each word → feed through RNN timesteps → apply f(z)=z² → output with softmax. Remember: f'(z) = 2z for the backward pass variant.

R4 MEMORYCode
LSTM equations — find 3 errors in the given equations.
DL_mock_exam.pdf Ex4

Common errors in LSTM equations: (1) Wrong activation (using ReLU instead of sigmoid for gates), (2) Missing element-wise multiplication ⊙, (3) Wrong concatenation in gate inputs, (4) Forget gate applied to wrong thing, (5) Output gate formula errors. Check each gate formula carefully against the standard LSTM.

R1 MOCK EXAMCalculation5 pts
[2024 Mock 4a] RNN forward pass: Compute softmax output of last hidden state for input 'deep learning'. V = {machine, deep, learning}, \(a_0 = [-1, 1]^T\), \(f = x^2\) (not sigmoid). Given \(W_i\), \(W_h\), \(W_{out}\). No biases. Round to 1 decimal.
Source: 2024 Mock Exam 4a

From 2024 Mock Exam.

Weights: \(W_i = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 0 & 1 \end{bmatrix}\), \(W_h = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\), \(W_{out} = \begin{bmatrix} -1 & 1 \\ 0 & -1 \\ 1 & 0 \end{bmatrix}\)

Important: The activation is f=x² (element-wise square), NOT sigmoid! This simplification is repeated from the other mock exam — expect this on the real exam.

Procedure: For each word in sequence: (1) encode as one-hot, (2) compute \(z_t = W_i x_t + W_h a_{t-1}\), (3) apply \(a_t = z_t^2\), (4) after last word: \(y = \text{softmax}(W_{out} \cdot a_{last})\)
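A numpy sketch of this procedure, assuming the one-hot order follows V = {machine, deep, learning} (verify the rounding by hand before trusting it):

```python
import numpy as np

W_i = np.array([[1, -1, 0], [-1, 0, 1]])
W_h = np.array([[0, 1], [1, 0]])
W_out = np.array([[-1, 1], [0, -1], [1, 0]])
a = np.array([-1, 1])                       # a_0

# one-hot for V = {machine, deep, learning}; input sentence "deep learning"
deep, learning = np.array([0, 1, 0]), np.array([0, 0, 1])

for x in (deep, learning):
    a = (W_i @ x + W_h @ a) ** 2            # f = x^2, element-wise

logits = W_out @ a
y = np.exp(logits) / np.exp(logits).sum()   # softmax
print(np.round(y, 1))                       # [0.2 0.1 0.7]
```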

R1 MOCK EXAMConceptual4 pts
[2024 Mock 4b] LSTM equations error identification: Given 6 LSTM equations with 3 deliberate mistakes. Identify and correct them.
Source: 2024 Mock Exam 4b

From 2024 Mock Exam. Given equations:

\(f_t = \text{sigmoid}(W_f[h_{t-1}, x_{t-1}] + b_f)\) — Eq.2

\(i_t = \text{sigmoid}(W_i[h_{t-1}, x_{t-1}] + b_i)\) — Eq.3

\(\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) — Eq.4

\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) — Eq.5

\(o_t = \tanh(W_o[h_{t-1}, x_t] + b_o)\) — Eq.6

\(h_t = o_t \odot \tanh(C_t)\) — Eq.7

Three errors:

1. Eq.2: Should be \(x_t\) not \(x_{t-1}\) — the forget gate uses the CURRENT input

2. Eq.3: Same fix — \(x_t\) not \(x_{t-1}\) in the input gate

3. Eq.6: Output gate uses sigmoid, not tanh — gates are always sigmoid (values 0-1)

This is a tricky "spot the bug" question. Know the correct LSTM equations cold.


True/False

3 questions
R1 MOCK4 ptsTrue/False
3d. True or False? (4 statements, 1 pt each)
Mock Exam 3d
Statement — T / F
Recurrent neural networks are often trained using Backpropagation Through Time (BPTT).
LSTMs completely solve the vanishing gradient problem.
The last hidden state of an RNN always captures the information of the whole input sequence.
In practice, we always need to pad the input for an RNN to work.

Verbatim from mock exam. Key: LSTMs mitigate but don't completely solve vanishing gradients. Last hidden state may lose early info. Padding is practical necessity for batching, not theoretical requirement.

R2 QUIZConceptual
"In theory, does an RNN need padding?"
Week 13 Thu transcript

No. In theory, RNNs can process variable-length sequences one at a time. Padding is a practical requirement for batched processing — you need uniform tensor dimensions within a batch. A single sequence needs no padding.


Conceptual

10 questions
R1 MOCK2 ptsConceptual
4a. What is the motivation for using CNNs? Name two reasons.
Mock Exam 4a

Verbatim from mock exam.

(1) Parameter sharing / reduction: Same filter applied across entire input → far fewer parameters than fully connected. (2) Translation invariance: Features detected regardless of position (a cat is a cat whether left or right in the image). Also: local connectivity captures spatial patterns.

R1 MOCK2 ptsConceptual
4b. Name one speech task and one text task. How is the input to a CNN represented in each case?
Mock Exam 4b

Verbatim from mock exam.

Speech: e.g., speech recognition or speaker identification. Input = spectrogram (2D: time × frequency). Text: e.g., sentiment analysis or text classification. Input = word embeddings stacked as a matrix (2D: sequence length × embedding dimension).

R1 MOCK1 ptConceptual
4c. CNNs applied to word embeddings — explain the idea.
Mock Exam 4c

Verbatim from mock exam.

Stack word embeddings into a matrix (rows = words, columns = embedding dimensions). Apply 1D filters that span the full embedding width but vary in height (n-gram size). A filter of height 2 captures bigram patterns, height 3 captures trigrams, etc. This lets CNNs capture local n-gram features without explicit n-gram engineering.

R1 MOCK1 ptConceptual
4e. How do you compute the derivative of a max pooling layer?
Mock Exam 4e

Verbatim from mock exam.

The gradient only flows through the position that was selected as maximum. For the max element: gradient passes through unchanged (derivative = 1). For all non-max elements: gradient = 0. In practice, you store the indices of max elements during forward pass ("switches") and route gradients back through those positions.

R2 QUIZConceptual
"What if the input is a matrix for a feedforward NN?" (parameter explosion)
Week 8 Thu

Flattening an m×n matrix gives m·n inputs, and a fully connected layer needs a weight from every input to every neuron. For a 100×100 image → 10,000 inputs → a hidden layer of 1000 neurons needs 10 million weights! This motivates CNNs: parameter sharing through filters dramatically reduces parameters while capturing spatial structure.
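The arithmetic, plus a contrast with shared filters (the 3×3 size and 32-filter count are illustrative numbers, not from the quiz):

```python
# Fully connected: one weight per (input pixel, hidden neuron) pair
inputs = 100 * 100          # 100x100 image flattened
hidden = 1000
fc_weights = inputs * hidden
print(fc_weights)           # 10_000_000

# Convolutional: one 3x3 filter shared across the whole image
conv_weights = 3 * 3 * 32   # 32 filters -> 288 weights, tens of
print(conv_weights)         # thousands of times fewer than the FC layer
```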

R2 QUIZConceptual
"What if the input is shifted?" (lose spatial relationships)
Week 8 Thu

If an object shifts position in the input, a fully connected network treats it as completely different input (different neurons activate). CNNs are translation-invariant — the same filter detects the same feature regardless of position. This is why CNNs are essential for vision and signal processing.

R2 QUIZConceptual
What are the CNN hyperparameters?
Week 8 Thu

Hyperparameters: (1) Number of filters (channels), (2) Filter/kernel size, (3) Stride, (4) Padding (same/valid), (5) Pooling size and type (max/average). Note: filter size is a hyperparameter, but filter weights are learned parameters.

R3 EXERCISEConceptual
Output dimension formula for convolution and pooling layers.
Exercise Sheet 8, Q8.1

Output size = \(\lfloor\frac{n - k + 2p}{s}\rfloor + 1\) where n = input size, k = kernel size, p = padding, s = stride. Same formula for both conv and pooling layers. For 2D: apply independently to height and width.
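The formula as a small helper, checked against the mock-exam dimensions (function name is mine):

```python
from math import floor

def out_size(n, k, p=0, s=1):
    """Output length along one spatial dimension: floor((n - k + 2p)/s) + 1."""
    return floor((n - k + 2 * p) / s) + 1

print(out_size(4, 2, p=0, s=2))                  # 2  (conv in 2024 Mock 3b)
print(out_size(2, 2, p=0, s=1))                  # 1  (pooling in 2024 Mock 3b)
print(out_size(3, 2, s=1), out_size(4, 2, s=2))  # 2 2  (Mock 4d height, width)
```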

R4 MEMORYConceptual
CNN input for different sequence lengths?
Memory Protocol 2022

For variable-length inputs: (1) Padding: pad shorter sequences to max length, (2) Global pooling: apply global max/average pooling over the sequence dimension to get fixed-size output regardless of input length, (3) 1-max pooling: take maximum value from each filter's output across all positions.

R1 MOCK EXAMConceptual2 pts
[2024 Mock 3a] CNNs have been proposed for image processing. What is the intuition of using them for language and how does the input look like in the case of language?
Source: 2024 Mock Exam 3a

From 2024 Mock Exam.

Intuition: In images, CNNs detect local spatial patterns (edges, textures). In language, local patterns are n-grams — groups of adjacent words that carry meaning. A CNN filter sliding over a sentence detects local word patterns just like it detects visual patterns in images.

Input representation: Each word is represented as a dense vector (word embedding). The sentence becomes a 2D matrix: rows = words in sequence, columns = embedding dimensions. So a sentence of length L with embedding dimension D gives an L × D input matrix.

Filters then slide vertically (across words) with full width (across all embedding dimensions), detecting n-gram patterns of different sizes.

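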

Calculation

3 questions
R1 MOCK4 ptsCalculation
4d. Compute convolution + max-pooling output.

Input (3×4): \(\begin{bmatrix} 0 & 0 & 1 & 2 \\ 1 & 1 & 2 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}\), Filter (2×2): \(\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\), stride=(1,2), then max-pool (1×2).
Mock Exam 4d

Verbatim from mock exam.

Step 1 — Conv output size: height = (3-2)/1 + 1 = 2, width = (4-2)/2 + 1 = 2 → 2×2 output

Step 2 — Conv computation (stride 1 vertically, 2 horizontally):

Position (0,0): 0·1 + 0·0 + 1·0 + 1·1 = 1

Position (0,1): 1·1 + 2·0 + 2·0 + 1·1 = 2

Position (1,0): 1·1 + 1·0 + 1·0 + 0·1 = 1

Position (1,1): 2·1 + 1·0 + 1·0 + 0·1 = 2

Conv output: \(\begin{bmatrix} 1 & 2 \\ 1 & 2 \end{bmatrix}\)

Step 3 — Max pool (1×2): pool each row → [2, 2] → Result: \(\begin{bmatrix} 2 \\ 2 \end{bmatrix}\)
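The same computation as a numpy self-check (the conv2d here is a naive loop written for this card, not a library call):

```python
import numpy as np

X = np.array([[0, 0, 1, 2],
              [1, 1, 2, 1],
              [1, 0, 1, 0]])
F = np.array([[1, 0],
              [0, 1]])

def conv2d(X, F, stride=(1, 1)):
    """Valid cross-correlation with per-axis strides."""
    sh, sw = stride
    kh, kw = F.shape
    oh = (X.shape[0] - kh) // sh + 1
    ow = (X.shape[1] - kw) // sw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (X[i*sh:i*sh+kh, j*sw:j*sw+kw] * F).sum()
    return out

C = conv2d(X, F, stride=(1, 2))
P = C.max(axis=1, keepdims=True)    # 1x2 max pooling over each row
print(C)                            # [[1. 2.] [1. 2.]]
print(P)                            # [[2.] [2.]]
```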

R3 EXERCISECalculation
2D convolution forward pass computation (exercise sheet).
Exercise Sheet 8, Q8.2

Practice the same convolution process with different inputs and filters. For each position: element-wise multiply filter with input patch, sum all products. Move filter by stride amount. Remember: output dimensions use the formula ⌊(n-k+2p)/s⌋ + 1.

R1 MOCK EXAMCalculation3 pts
[2024 Mock 3b] Compute convolution output (stride=2, no padding, \(f = x^2\)) then max pooling (2×2 window, stride=1). Input: 4×4 matrix, Filter: 2×2. Given specific matrices.
Source: 2024 Mock Exam 3b

From 2024 Mock Exam.

Input: \(\begin{pmatrix} 2 & 2 & 1 & 0 \\ -1 & 2 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 0 & 1 & 2 & -2 \end{pmatrix}\), Filter: \(\begin{pmatrix} 1 & -1 \\ 0 & 2 \end{pmatrix}\), Pooling: 2×2 max

Key differences from our existing mock: stride=2 for conv (not 1), activation f=x², pooling stride=1

Procedure:

1. Slide 2×2 filter with stride 2 → output size = ⌊(4-2)/2⌋+1 = 2 → 2×2 conv output

2. Apply activation f=x² element-wise

3. Apply 2×2 max pooling with stride 1 → output size = ⌊(2-2)/1⌋+1 = 1 → 1×1 output
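Carrying the procedure through with the given matrices (a numpy self-check; the intermediate values are my own computation, so verify them by hand):

```python
import numpy as np

X = np.array([[ 2,  2,  1,  0],
              [-1,  2,  1,  1],
              [ 1, -1,  1, -1],
              [ 0,  1,  2, -2]])
F = np.array([[1, -1],
              [0,  2]])

# Conv, stride 2, no padding -> 2x2 output
C = np.array([[(X[i:i+2, j:j+2] * F).sum() for j in (0, 2)] for i in (0, 2)])
A = C ** 2                      # activation f = x^2, element-wise
out = A.max()                   # 2x2 max pool, stride 1, on a 2x2 input -> 1x1
print(C)                        # [[ 4  3] [ 4 -2]]
print(A)                        # [[16  9] [16  4]]
print(out)                      # 16
```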


True/False

3 questions
R1 MOCK4 ptsTrue/False
4f. True or False? (4 statements, 1 pt each)
Mock Exam 4f
Statement — T / F
The number of filters is a hyperparameter.
The filter size determines the parameters of the model.
A CNN is a special case of a feedforward neural network.
The weights of the filter are the parameters of the model.

Verbatim from mock exam. Key: filter SIZE is a hyperparameter (you choose it), but filter WEIGHTS are the learned parameters. The tricky one is "filter size determines the parameters" — FALSE: the size only fixes how many parameters there are; the parameters themselves are the filter weights, which are learned from data.

R2 QUIZTrue/False
Quiz. "The number of filters is a hyperparameter" (TRUE)
Week 13 Thu transcript (Q34)

TRUE. You (the designer) choose how many filters to use — it's not learned from data. This was an in-class quiz question testing the same concept as Mock 4f statement 1.

R1 MOCK EXAMTrue/False5 pts
[2024 Mock 3c] True or False? (5 statements on CNNs)
Source: 2024 Mock Exam 3c
Statement — T / F
The filter weights are trainable parameters of a CNN.
A convolutional layer with 10 filters (with bias) of size 3×3 has 100 trainable parameters.
Zero padding is only necessary when processing several input matrices at a time (in a batch or mini-batch).
A typical convolutional layer for language spans the whole sentence.
The average pooling layer can be used to downsample a matrix.

From 2024 Mock Exam. Key traps:

10 filters of 3×3 with bias ≠ 100 params. It's 10 × (3×3×C + 1) where C = number of input channels. With C=1: 10×(9+1) = 100 would be TRUE, but the question doesn't specify single-channel. In general CNN context (e.g., RGB), C>1 so it's FALSE.

Zero padding is used to control output spatial dimensions, NOT just for batching.

Conv for language spans the full embedding width but NOT the whole sentence — filters slide across words.


Conceptual

18 questions
R1 MOCK2 ptsConceptual
5a. Seq2seq models are used for tasks where the input and output are both sequences. Name a property that the task should have and give an example.
Mock Exam 5a

Verbatim from mock exam.

Property: Variable-length input maps to variable-length output (lengths can differ). Example: Machine translation ("Ich bin müde" → "I am tired"), text summarization, speech recognition (audio → text).

R1 MOCK2 ptsConceptual
5b. Name two scoring methods for attention. How are the encoder hidden states used by the decoder?
Mock Exam 5b

Verbatim from mock exam.

Scoring methods: (1) Dot product: score = hₑᵀ h_d, (2) Additive (Bahdanau): score = vᵀ tanh(W₁hₑ + W₂h_d). Other option: (3) Scaled dot product: score = (hₑᵀ h_d) / √d.

How encoder states are used: Compute attention scores between decoder hidden state and ALL encoder hidden states → softmax → weighted sum of encoder states → context vector fed to decoder.

R1 MOCK2 ptsConceptual
5c. Self-attention: how are Q, K, V computed? What are the learnable weights?
Mock Exam 5c

Verbatim from mock exam.

Q = XW_Q, K = XW_K, V = XW_V where X is the input matrix and W_Q, W_K, W_V are separate learnable weight matrices. Each input token gets its own query, key, and value by linear projection. The three weight matrices W_Q, W_K, W_V are the learnable parameters.

R1 MOCK4 ptsConceptual
5d. Self-attention diagram: describe steps 1, 3, 4. What is the purpose of step 2?
Mock Exam 5d

Verbatim from mock exam (4 pts). Based on the jalammar.github.io illustration:

Step 1: Compute Q, K, V by multiplying input embeddings with weight matrices W_Q, W_K, W_V.

Step 2 (purpose): Compute attention scores by taking dot product of query with all keys: score = Q · Kᵀ. Then scale by √d_k to prevent softmax saturation. This determines how much each token should "attend to" every other token.

Step 3: Apply softmax to the scaled scores to get attention weights (probabilities summing to 1).

Step 4: Multiply attention weights by V to get weighted sum — the output representation for each token.

Full formula: \(\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
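Steps 1–4 as a numpy sketch (toy sizes; the random matrices are placeholders, not lecture values):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # step 2: scaled scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # step 3: softmax
    return weights @ V                                 # step 4: weighted sum

# Toy self-attention: 3 tokens, d_model = 4
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
W_Q, W_K, W_V = (rng.standard_normal((4, 4)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)             # step 1: Q, K, V
print(out.shape)                                       # (3, 4)
```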

R2 QUIZConceptual
"How many weight tensors are there in nn.Linear?"
Week 10 Fri transcript

2: the weight matrix W and the bias vector b. nn.Linear(in, out) stores W of shape (out, in) and b of shape (out). If bias=False, then just 1 tensor (W only).

R4 MEMORYConceptual
Explain seq2seq in your own words.
Memory Protocol (Gedächtnisprotokoll)

Seq2seq uses two RNNs: an encoder reads the input sequence and compresses it into a context vector (final hidden state), and a decoder generates the output sequence one token at a time, conditioned on the context vector. Problem: fixed-size context vector is a bottleneck for long sequences → solved by attention.

R4 MEMORYConceptual
Self-attention: describe the mechanism.
Memory Protocol 2021

Self-attention lets each token attend to all other tokens in the same sequence. For each token: (1) Create Q, K, V vectors via learned projections, (2) Score = dot product of Q with all K's, (3) Softmax to get weights, (4) Weighted sum of V's = output. Captures long-range dependencies without recurrence. O(n²) complexity in sequence length.

R4 MEMORYConceptual
Self-attention: what are parameters vs hyperparameters?
Memory Protocol 2022

Parameters (learned): W_Q, W_K, W_V weight matrices, output projection weights. Hyperparameters (chosen): d_model (model dimension), d_k (key/query dimension), d_v (value dimension), number of attention heads, number of layers.

Transformers & BERT

R2 SLIDEConceptual
Why does the transformer architecture eliminate recurrence and convolution entirely? What does it use instead?
Transformers slide 15

The transformer is based solely on attention mechanisms — no recurrence, no convolution. Advantages over RNNs: (1) fully parallelizable (no sequential dependency), (2) O(1) maximum path length for long-range dependencies vs O(n). Advantage over CNNs: O(1) path vs O(log_k(n)). The trade-off: self-attention has O(n²·d) complexity per layer, which is expensive for very long sequences.

R2 SLIDEConceptual
Name and describe the three types of attention used in the full transformer architecture.
Transformers slide 28

1. Encoder self-attention: Each encoder position attends to all positions in the encoder input. Captures relationships within the source sequence.

2. Masked decoder self-attention: Each decoder position attends only to previous decoder positions (future tokens are masked with −∞ before softmax). Ensures autoregressive generation.

3. Encoder-decoder (cross) attention: Queries come from the decoder, keys and values come from the encoder output. This is how the decoder "reads" the source — analogous to attention in seq2seq.

R2 SLIDEConceptual
Why does the transformer need positional encoding? What information would be lost without it?
Transformers slides 29–30

Self-attention treats the input as a set, not a sequence — it has no built-in notion of word order. Without positional encoding, "the cat sat on the mat" and "the mat sat on the cat" produce identical representations. Positional encodings are added to the input embeddings to inject position information. The original paper uses sinusoidal functions so the model can generalize to unseen sequence lengths.

R2 SLIDEConceptual
Write the sinusoidal positional encoding formulas. Why use sin and cos at different frequencies?
Transformers slide 31

\(PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)

\(PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)

Where pos = position in the sequence, i = dimension index, d_model = model dimension. Each dimension uses a different frequency. For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos), letting the model learn to attend to relative positions.
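The two formulas as a numpy sketch (a minimal implementation for even d_model; the function name is mine):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)     # even dimensions
    pe[:, 1::2] = np.cos(angle)     # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```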

R2 SLIDEConceptual
What is multi-head attention? Why use multiple heads instead of one? How are the outputs combined?
Transformers slides 32–34

Instead of one attention function with d_model-dimensional keys/values/queries, multi-head attention runs h parallel attention heads, each with reduced dimension d_k = d_v = d_model / h.

Formula: MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W^O, where each headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V).

Why multiple heads: Different heads can learn to attend to different types of relationships (e.g., syntactic vs. semantic). The output projection W^O (d_model × d_model) combines information from all heads.

R2 SLIDEConceptual
Compare self-attention, recurrent, and convolutional layers: complexity per layer, sequential operations, maximum path length.
Transformers slide 37 (Table 1)

From "Attention Is All You Need" (Table 1 on the slides):

Layer Type     | Complexity/Layer | Sequential Ops | Max Path Length
Self-Attention | O(n²·d)          | O(1)           | O(1)
Recurrent      | O(n·d²)          | O(n)           | O(n)
Convolutional  | O(k·n·d²)        | O(1)           | O(log_k(n))

n = sequence length, d = representation dimension, k = kernel size. Self-attention wins on parallelization and long-range paths, but costs more per layer for long sequences (n > d).

R2 SLIDEConceptual
What is the problem with static word embeddings like Word2vec? How do contextual embeddings solve it?
Transformers slides 43–45

Problem: Static embeddings assign one fixed vector per word regardless of context. "Bank" gets the same vector in "river bank" and "bank account" — polysemy is lost.

Solution: Contextual embeddings generate a different vector for each word depending on its context. Instead of a lookup table, a model (ELMo, BERT) dynamically computes the embedding from the full sentence.

R2 SLIDEConceptual
What is ELMo? How does it produce contextual word representations?
Transformers slides 45–48

ELMo (Embeddings from Language Models, Peters et al. 2018) uses a bidirectional LSTM (BiLSTM): a forward LSTM reads left-to-right, a backward LSTM reads right-to-left. The contextualized embedding for each word is the concatenation of the forward and backward hidden states. Because each direction sees different context, the combined representation captures the full surrounding context.

R2 SLIDEConceptual
What are BERT's two pre-training objectives? How does BERT differ from ELMo architecturally?
Transformers slide 50

BERT (Devlin et al. 2019) replaces ELMo's BiLSTM with a Transformer encoder. Two pre-training objectives:

1. Masked Language Model (MLM): Randomly mask 15% of input tokens; train the model to predict them. Unlike left-to-right LMs, this allows true bidirectional context.

2. Next Sentence Prediction (NSP): Given sentences A and B, predict whether B follows A. Trains inter-sentence understanding.

Pre-trained BERT is a powerful feature extractor that can be fine-tuned with relatively little task-specific data.

R2 SLIDEConceptual
How is BERT fine-tuned for downstream tasks? Name the four task types shown in the lecture.
Transformers slide 51

Add a small task-specific output layer on top of pre-trained BERT and train end-to-end. Four task types:

1. Sentence pair classification (MNLI, QQP): [CLS] + Sentence A + [SEP] + Sentence B → class label from [CLS].

2. Single sentence classification (SST-2, CoLA): [CLS] + Sentence → class label from [CLS].

3. Question answering (SQuAD): Question + [SEP] + Paragraph → predict start/end span.

4. Token-level tagging (CoNLL NER): [CLS] + Tokens → per-token labels (B-PER, O, etc.).

Calculation

3 questions
R3 EXERCISECalculation
Seq2seq parameter counting — encoder + decoder + attention.
Exercise Sheet 10, Q10.2

Encoder RNN: h(n+h+1) params. Decoder RNN: h(m+h+1) params (m = target vocab embedding size). Attention: depends on type — additive has W₁(h), W₂(h), v(h) params. Output projection: vocab_size × h. Don't forget embeddings: vocab × d_embed for both source and target.

R3 EXERCISECalculation
Machine translation with attention — compute context vector.
Exercise Sheet 10, Q10.3

Given encoder hidden states h₁, h₂, h₃ and decoder state sₜ: (1) Compute scores: eᵢ = score(sₜ, hᵢ) using dot product or additive method, (2) Softmax: αᵢ = exp(eᵢ)/Σexp(eⱼ), (3) Context: cₜ = Σ αᵢhᵢ. Practice the full computation with specific numbers.
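The three steps with made-up numbers (the encoder/decoder states are hypothetical practice values, not from the exercise sheet):

```python
import numpy as np

h = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # encoder states h_1, h_2, h_3 as rows
s = np.array([1.0, 0.0])              # decoder state s_t

e = h @ s                             # (1) dot-product scores e_i = s_t . h_i
alpha = np.exp(e) / np.exp(e).sum()   # (2) softmax weights
c = alpha @ h                         # (3) context c_t = sum_i alpha_i h_i
print(e)                              # [1. 0. 1.]
print(np.round(alpha, 3))             # weights sum to 1
print(c)
```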

R2 SLIDECalculation
Multi-head attention dimensions: given d_model = 512 and h = 8 heads, compute d_k per head and count total parameters in W_Q, W_K, W_V, and W^O.
Transformers slides 32–34

d_k = d_model / h = 512 / 8 = 64 per head.

Parameters: Each of W_Q, W_K, W_V projects from d_model → d_model (all heads combined): 512 × 512 = 262,144 params each. W^O projects concatenated output back: 512 × 512 = 262,144 params. Total: 4 × 262,144 = 1,048,576 parameters for one multi-head attention sub-layer (excluding biases).

Code / Bug-Finding

5 questions
R2 QUIZCode
Self-attention bug: using single weight matrix for Q, K, V (need SEPARATE projections).
Week 14 Fri review

Bug: Using one matrix W for all three: Q=XW, K=XW, V=XW. Fix: Need separate matrices: Q=XW_Q, K=XW_K, V=XW_V. With a single matrix, Q and K are identical, so the score matrix QKᵀ = XW(XW)ᵀ is symmetric and each token's strongest match tends to be itself; separate projections let the model learn distinct "query" and "key" roles.

R2 QUIZCode
Self-attention bug: using sequence length instead of hidden dim for scaling.
Week 14 Fri review

Bug: Dividing by √(seq_length) instead of √(d_k). Fix: Scale by √d_k (dimension of the key vectors). The scaling prevents dot products from growing too large with higher dimensions, which would push softmax into regions with tiny gradients.

R2 QUIZCode
Self-attention bug: computing score between Q and V instead of Q and K.
Week 14 Fri review

Bug: score = QVᵀ instead of QKᵀ. Fix: Attention scores are computed between queries and keys: score = QKᵀ/√d_k. V is only used after softmax to compute the weighted output. Q-K matching determines "where to look"; V provides "what to read".

R2 QUIZCode
Self-attention bug: missing softmax before weighted sum.
Week 14 Fri review

Bug: Using raw scores to weight V: output = scores · V. Fix: Must apply softmax first: output = softmax(scores) · V. Without softmax, the weights aren't normalized to sum to 1, and the attention mechanism doesn't produce proper probability-weighted combinations.

R2 QUIZCode
Seq2seq bug: decoder not taking encoder output.
Week 14 Fri review

Bug: Decoder runs independently without receiving encoder information. Fix: Decoder must be initialized with the encoder's final hidden state (basic seq2seq) or receive attention-weighted encoder states at each timestep (attention mechanism). Without this connection, the decoder has no knowledge of the input.

True/False

3 questions
R2 SLIDETrue/False
"Self-attention layers are always computationally cheaper than recurrent layers."
Transformers slide 37

False. Self-attention has O(n²·d) complexity; recurrent has O(n·d²). When sequence length n exceeds representation dimension d, self-attention is more expensive. Self-attention wins on parallelization (O(1) sequential ops) and path length, but not always on raw computation cost.

R2 SLIDETrue/False
"In the transformer decoder, self-attention is masked so that each position can only attend to earlier positions in the output sequence."
Transformers slide 28

True. The decoder uses masked self-attention: future positions are set to −∞ before softmax, zeroing their attention weights. This preserves the autoregressive property — the prediction for position t depends only on positions < t (no peeking at future tokens during generation).

R2 SLIDETrue/False
"BERT uses the full encoder-decoder transformer architecture."
Transformers slide 50

False. BERT uses only the transformer encoder (no decoder). It is designed for understanding tasks (classification, NER, QA), not autoregressive generation. The bidirectional context in BERT — seeing both left and right via masked language modeling — is possible precisely because it doesn't generate tokens left-to-right like a decoder.

Conceptual

20 questions
R1 MOCK1 ptConceptual
6a. What is the difference between SGD and mini-batch gradient descent?
Mock Exam 6a

Verbatim from mock exam.

SGD: Updates parameters using gradient from ONE sample at a time. Noisy but fast updates. Mini-batch GD: Updates using gradient averaged over a small batch (e.g., 32 samples). Less noisy than SGD, more efficient than full batch. Full batch GD: uses ALL training data per update — stable but slow.

R1 MOCK3 ptsConceptual
6b. Why does parameter initialization matter? Name and describe two methods.
Mock Exam 6b

Verbatim from mock exam.

Why: Bad initialization → vanishing/exploding gradients, slow convergence, or getting stuck. All-zeros = all neurons learn the same thing (symmetry problem).

Xavier/Glorot: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}+n_{out}})\) — designed for sigmoid/tanh. Keeps variance stable across layers.

He/Kaiming: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — designed for ReLU. Accounts for ReLU killing half the values.

R1 MOCK2 ptsConceptual
6c. What is weight decay? What is its purpose?
Mock Exam 6c

Verbatim from mock exam.

Weight decay adds a penalty proportional to the squared magnitude of weights to the loss function: \(L_{total} = L_{original} + \frac{\lambda}{2}||w||^2\). This encourages smaller weights, preventing any single weight from dominating. Purpose: Regularization — prevents overfitting by penalizing model complexity.
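The resulting gradient step, since \(\partial/\partial w \, (\frac{\lambda}{2}||w||^2) = \lambda w\) (the numbers are illustrative):

```python
import numpy as np

# One SGD step with L2 weight decay: w <- w - eta * (grad + lambda * w)
eta, lam = 0.1, 0.01
w = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])         # hypothetical dL_original/dw

w_new = w - eta * (grad + lam * w)
print(w_new)                        # [ 0.949 -2.048] — decay pulls w toward 0
```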

R1 MOCK2 ptsConceptual
6d. Dropout: if we don't scale during training, how do we scale at test time?
Mock Exam 6d

Verbatim from mock exam.

If dropout rate = p (e.g., 0.5) and we DON'T scale during training, then at test time we must multiply all weights by (1-p). Why? During training, on average only (1-p) fraction of neurons are active. At test time all neurons are active, so outputs would be too large by factor 1/(1-p). Multiplying by (1-p) compensates.

Alternative (inverted dropout): scale by 1/(1-p) during TRAINING, then no scaling needed at test time.
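Both variants in numpy (a sketch of inverted dropout; the 0.5 rate and array size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                  # dropout rate

def dropout_train(a, p):
    """Inverted dropout: drop with prob p, scale survivors by 1/(1-p)."""
    mask = (rng.random(a.shape) >= p) / (1 - p)
    return a * mask

def dropout_test(a, p):
    """With inverted dropout, test time is a no-op."""
    return a

a = np.ones(10_000)
print(dropout_train(a, p).mean())        # ~1.0 in expectation
print(dropout_test(a, p).mean())         # exactly 1.0
```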

R2 QUIZConceptual
Why do we use ReLU? Why are there variations?
Week 7 Thu

Why ReLU: (1) Simple to compute: max(0,z), (2) No vanishing gradient for z>0 (gradient = 1), (3) Sparse activation (many zeros = efficient). Why variations: ReLU has the "dying ReLU" problem — neurons with z<0 always output 0 and stop learning. Variations like Leaky ReLU (small slope for z<0) and ELU address this.

R2 QUIZConceptual
What variations of ReLU exist?
Week 7 Thu

Leaky ReLU: f(z) = max(αz, z) with small α like 0.01. Parametric ReLU (PReLU): same but α is learned. ELU: f(z) = z for z>0, α(eᶻ-1) for z≤0 — smooth and can output negative values. SELU: self-normalizing, maintains mean/variance across layers. GELU: used in transformers, smooth approximation.

R2 QUIZConceptual
"How many layers? How many nodes? Which activation?" — architecture design questions.
Week 7 Thu

Architecture choices are hyperparameters selected via experimentation: (1) Depth: deeper = more abstract features but harder to train, (2) Width: wider layers = more capacity per layer, (3) Activation: ReLU for hidden layers (default), softmax for multi-class output, sigmoid for binary output. Use validation performance to guide choices.

R2 QUIZConceptual
"What does patience mean?" (early stopping)
Week 10 Fri transcript

Patience in early stopping = the number of epochs to wait after the last improvement in validation loss before stopping training. E.g., patience=5 means: if validation loss doesn't improve for 5 consecutive epochs, stop. Prevents overfitting by stopping before the model starts memorizing training data.
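A sketch of the patience counter (the loss history is invented for illustration):

```python
# Early stopping: stop after `patience` epochs without improvement.
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]
patience = 3

best, wait, stop_epoch = float("inf"), 0, None
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0          # improvement: reset the counter
    else:
        wait += 1                     # no improvement this epoch
        if wait >= patience:
            stop_epoch = epoch
            break

print(best, stop_epoch)               # 0.7 5 — stopped 3 epochs after the best
```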

R2 QUIZConceptual
"What is the ResNet formula?"
Week 10 Fri transcript

\(y = f(x) + x\) — the skip/residual connection. The network learns the residual f(x) = y - x, which is easier to optimize. Helps with vanishing gradients in deep networks because gradients can flow directly through the skip connection.

R3 EXERCISEConceptual
Dropout theory — how and why it works.
Exercise Sheet 10, Q10.1

During training: randomly set neurons to 0 with probability p. Each forward pass uses a different "thinned" network. Effect: (1) Prevents co-adaptation — neurons can't rely on specific other neurons, (2) Implicit ensemble — averaging over 2ⁿ possible sub-networks, (3) Adds noise → regularization. At test time: use all neurons (with appropriate scaling).

R4 MEMORYConceptual
Kaiming initialization — explain the method.
Memory Protocol 2021

Kaiming/He initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) (variance \(2/n_{in}\), i.e., std \(\sqrt{2/n_{in}}\)) where n_in is the number of input connections. Designed specifically for ReLU activations. The factor of 2 compensates for ReLU zeroing out roughly half the values, which would otherwise halve the variance at each layer.

R4 MEMORYConceptual
AdaGrad: how it works.
Memory Protocol 2021

AdaGrad adapts the learning rate per parameter. It divides the learning rate by the square root of the sum of all past squared gradients: \(w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t\). Parameters with large past gradients get smaller updates. Problem: learning rate monotonically decreases → can stop learning too early. RMSProp and Adam fix this.
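The update rule in numpy (toy gradients; note how dividing by √Gₜ equalizes effective step sizes across parameters):

```python
import numpy as np

eta, eps = 0.1, 1e-8
w = np.zeros(2)
G = np.zeros(2)                              # running sum of squared gradients

for g in [np.array([1.0, 0.1])] * 5:         # same gradient 5 times
    G += g ** 2
    w -= eta / np.sqrt(G + eps) * g          # per-parameter adapted step

# Despite gradients differing by 10x, both coordinates moved by
# roughly the same amount: AdaGrad normalizes per-parameter steps.
print(w)
```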

R4 MEMORYConceptual
Gradient clipping — what is it and why?
Memory Protocol 2021 + 2022

When gradients become very large (exploding gradients), clip them to a maximum norm. If ||g|| > threshold, scale g down to g · (threshold / ||g||). Prevents unstable training, especially in RNNs where BPTT can cause gradient magnitudes to grow exponentially across timesteps.
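The clipping rule as a tiny helper (function name is mine):

```python
import numpy as np

def clip_by_norm(g, threshold):
    """If ||g|| > threshold, rescale g to have norm exactly `threshold`."""
    norm = np.linalg.norm(g)
    return g * (threshold / norm) if norm > threshold else g

g = np.array([3.0, 4.0])             # ||g|| = 5
print(clip_by_norm(g, 1.0))          # [0.6 0.8] -> norm 1
print(clip_by_norm(g, 10.0))         # unchanged: [3. 4.]
```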

R4 MEMORYConceptual
Data augmentation — why use it?
Memory Protocol 2021

Artificially increase training data size by applying transformations (rotation, flipping, noise, cropping for images; synonym replacement, back-translation for text). Helps prevent overfitting by exposing the model to more variation. Particularly important when training data is limited.

R4 MEMORYConceptual
Residual networks (ResNet) — explain the concept.
Memory Protocol 2021 + 2022

Add skip connections that bypass one or more layers: y = F(x) + x. Benefits: (1) Gradients flow directly through skip connections → no vanishing gradient even in very deep networks (100+ layers), (2) Network only needs to learn the residual F(x) = y - x, which is often easier, (3) Worst case: if F(x) ≈ 0, the layer becomes identity → doesn't hurt performance.
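A toy residual layer in pure Python (illustrative helpers, not course code), matching the form relu(Wx + b) + x from the T/F section below; with zero weights it reduces to the identity, illustrating benefit (3):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x, b):
    # Wx + b for a list-of-lists matrix W
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def residual_layer(W, b, x):
    """Fully-connected layer with ReLU and a skip connection:
    y = relu(Wx + b) + x. If F(x) = relu(Wx + b) is ~0, the layer
    passes x through unchanged (identity)."""
    return [fi + xi for fi, xi in zip(relu(matvec(W, x, b)), x)]

# Zero weights: F(x) = 0, so the layer is the identity.
y = residual_layer([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [1.5, -2.0])
```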

R4 MEMORYConceptual
Batch normalization: motivation and how it works.
exam_example.pdf 1a

Motivation: Internal covariate shift — each layer's input distribution changes as previous layers update, slowing training. How: For each mini-batch, normalize activations: \(\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\), then scale and shift: y = γx̂ + β (γ, β are learned). At test time: use running averages of μ and σ² from training.
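The training-time normalization can be sketched for a single feature across one mini-batch (γ and β fixed at their initial values 1 and 0 here; in a real layer they are learned):

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch norm for one feature over a mini-batch: normalize to
    zero mean / unit variance, then apply the learned scale and shift."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([1.0, 2.0, 3.0, 4.0])  # ~zero mean, ~unit variance
```

At test time the batch statistics μ_B, σ²_B would be replaced by running averages collected during training, as the answer states.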

R4 MEMORYConceptual
Overfitting: draw the curves and name strategies to combat it.
DL_mock_exam.pdf Ex5

Curves: Training loss decreases, validation loss decreases then increases — the gap is overfitting. Strategies: (1) More training data, (2) Regularization (L1/L2/weight decay), (3) Dropout, (4) Early stopping, (5) Data augmentation, (6) Reduce model capacity, (7) Batch normalization.

R4 MEMORYConceptual
Vanishing gradient diagnosis: deep network with sigmoid activations.
DL_mock_exam.pdf Ex5

Sigmoid saturates at 0 and 1, where gradient ≈ 0. In a deep network, gradients multiply through layers: ∂L/∂w₁ = ∂L/∂yₙ · σ'(zₙ) · ... · σ'(z₁). Since σ'(z) ≤ 0.25, the gradient shrinks exponentially with depth. Solutions: Use ReLU activation, residual connections, proper initialization (Xavier/He), or LSTM for recurrent networks.
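The exponential shrinkage is easy to check numerically: even in the best case (all pre-activations at 0, where σ' peaks at 0.25), the product over 20 layers is 0.25²⁰ ≈ 9·10⁻¹³:

```python
import math

def sigmoid_grad(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)), maximal at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Upper bound on the gradient factor through 20 sigmoid layers:
bound = 1.0
for _ in range(20):
    bound *= sigmoid_grad(0.0)  # each factor is at most 0.25
```

In practice activations are rarely exactly at 0, so the true factors are smaller and the gradient vanishes even faster.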

R1 MOCK EXAMConceptual2 pts
[2024 Mock 5a] Training curves show: training cost decreasing, development cost decreasing then increasing (classic overfitting curve). Do the curves behave as expected? Explain two strategies to get a good model.
Source: 2024 Mock Exam 5a

From 2024 Mock Exam.

Behavior: This shows classic overfitting. Training loss keeps decreasing but dev loss starts increasing after a point — the model memorizes training data instead of learning generalizable patterns.

Two strategies:

1. Early stopping: Stop training at the point where dev loss is minimal (before it starts rising). Monitor validation loss and save the best checkpoint.

2. Regularization: Add L2 weight decay, dropout, or data augmentation to constrain the model and reduce overfitting.

Other valid answers: reduce model complexity, get more training data, batch normalization.

R1 MOCK EXAMConceptual3 pts
[2024 Mock 5b] Given a multi-layer perceptron with 20 hidden layers with 200 neurons each using sigmoid activation. Which problem might occur during training, how can you detect it, and how could you solve it?
Source: 2024 Mock Exam 5b

From 2024 Mock Exam.

Problem: Vanishing gradients. Sigmoid's derivative has max value 0.25. With 20 layers, gradients get multiplied ~20 times by values ≤ 0.25 → gradients shrink to near zero → early layers stop learning.

Detection: (1) Monitor gradient magnitudes per layer — if early layers have near-zero gradients, that's vanishing. (2) Training loss plateaus very early. (3) Early layer weights barely change across epochs.

Solutions: (1) Replace sigmoid with ReLU (gradient = 1 for positive inputs). (2) Use residual connections (skip connections bypass the gradient bottleneck). (3) Use He initialization designed for ReLU. (4) Use batch normalization.


Code / Bug-Finding

1 question
R1 MOCK4 ptsCode
6e. Early stopping code — find 4 conceptual mistakes.
Mock Exam 6e

Verbatim from mock exam.

 1  no_improvement = 0
 2  patience = 10
 3  best_loss = 0
 4
 5  for epoch in range(1000):
 6      model.train()
 7      train(model, train_loader)
 8
 9      model.eval()
10      val_loss = evaluate(model, val_loader)
11
12      if val_loss < best_loss:
13          best_loss = val_loss
14          no_improvement = 0
15          save_model(model, 'best.pt')
16      else:
17          no_improvement += 1
18
19      if no_improvement >= patience:
20          break
21
22  save_model(model, 'final.pt')

Find 4 conceptual mistakes. NOTE: Conceptual errors are logical errors of the algorithm, NOT syntax mistakes, typos, or runtime bugs.

Bugs:

1. Line 3: best_loss = 0 should be best_loss = float('inf') — loss is non-negative, so val_loss < best_loss never holds: the best model is never saved and no_improvement increments every epoch.

2. Line 22: Saves the LAST model (possibly overfit) instead of loading the BEST model. Should load 'best.pt' at the end.

3. Missing: No torch.no_grad() context during evaluation — wastes memory computing gradients during validation.

4. Missing: After break, should load best model before any final evaluation or saving.
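For contrast, a corrected, self-contained sketch of the early-stopping logic (simulated with a list of validation losses instead of a real training loop; the function name is illustrative):

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation
    losses: stop after `patience` epochs without improvement.
    Fixes: best_loss starts at infinity, and the best epoch (the
    checkpoint to restore at the end) is remembered."""
    best_loss, best_epoch, no_improvement = float('inf'), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, no_improvement = loss, epoch, 0
        else:
            no_improvement += 1
        if no_improvement >= patience:
            break
    return best_epoch, epoch

# Loss improves until epoch 3, then rises: with patience=2 training
# stops at epoch 5, but the epoch-3 checkpoint is the one to keep.
best, stopped = early_stopping([1.0, 0.8, 0.6, 0.5, 0.7, 0.9], patience=2)
```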

True/False

1 question
R1 MOCK EXAMTrue/False5 pts
[2024 Mock 5c] True or False? (5 statements on training tricks)
Source: 2024 Mock Exam 5c
Statement (mark T or F):
1. Gradient clipping helps in the case of exploding gradients.
2. Given a training set with 100 examples, stochastic gradient descent performs one update step per epoch.
3. Dropout is a regularization technique.
4. A fully-connected layer with ReLU activation and residual connection has the form relu(Wx + b) + x.
5. Highway Network automatically removes unnecessary layers.

From 2024 Mock Exam. Answers in order: T, F, T, T, F. Key traps:

SGD with 100 examples: SGD processes ONE example at a time → 100 updates per epoch, not 1. (Batch GD would do 1 update per epoch.)

ResNet formula: relu(Wx + b) + x is the correct residual form — TRUE.

Highway Network: does NOT remove layers. It learns gating functions that control how much information flows through vs. bypasses each layer. The layers are still there; the network learns to route around them.
