Written Questions
Bias-Variance Tradeoff (15 pts)
Suppose we have an L2-regularized linear regression model trained until convergence, with loss L(β) = (1/n) Σᵢ₌₁ⁿ (f_β(xᵢ) − yᵢ)² + λ‖β‖²₂. For each of the following, assuming it is the only variable changing, indicate whether it tends to increase bias, decrease bias, or keep bias the same, and similarly for variance. Consider bias and variance as the quality of model fit and sensitivity to data changes, rather than the magnitude of error. You will get full points for a correct answer even without any explanation; if your answer is wrong, your explanation may earn partial credit if it demonstrates partially correct reasoning.
- (a) Increase the number of training examples n
- (b) Decrease the regularization parameter λ
- (c) Decrease the dimension d of the features ϕ(x) ∈ ℝᵈ
- (d) Increase c, where we replace the features ϕ(x) with c · ϕ(x), for some c > 0 (for this part, assume no regularization, i.e., λ = 0)
- (e) Increase the gradient descent step size α (but not so much that gradient descent diverges)
- (f) Suppose you fit a model and find that it has low loss on the training data but high loss on the test data; for each of the five values above (n, λ, d, c, and α), indicate whether you should increase it or decrease it to reduce the test loss, or whether it has no impact on the test loss.
| Change | Bias | Variance | Explanation |
|---|---|---|---|
| (a) Increase n (training examples) | Same | Decrease | Variance decreases: as n → ∞, the fitted model depends less on the noise in any particular sample, so variance shrinks. Bias stays the same: bias is a property of the model class and regularization, not of how much data is available (with very few points the model may fit the training set perfectly, but that reflects high variance, not low bias). |
| (b) Decrease λ (regularization) | Decrease | Increase | Bias decreases: a lower λ allows parameters to grow large to fit complex patterns. Variance increases: the model becomes more sensitive to noise (overfitting). |
| (c) Decrease d (features) | Increase | Decrease | Bias increases: removing features restricts the hypothesis space, making the model simpler and potentially unable to capture the true relationship. Variance decreases: fewer parameters mean the model is less prone to overfitting noise. |
| (d) Increase c (scale features by c > 0, λ = 0) | Same | Same | With no regularization (λ = 0), linear regression is scale-invariant in its predictions: if the inputs are scaled by c, the optimal weights scale by 1/c, yielding identical predictions ŷ. Model fit and stability remain unchanged. |
| (e) Increase α (step size) | Same | Same | α is an optimization hyperparameter. Provided gradient descent still converges (as the prompt assumes), it reaches the same optimum β*. The model's capacity and final fit do not change. |
(f) Reducing Overfitting (High Test Loss, Low Train Loss)
Overfitting is characterized by High Variance. To fix this:
- n (data size): Increase. More data helps the model generalize better.
- λ (regularization): Increase. A higher λ penalizes complexity, reducing variance.
- d (dimensions): Decrease. Reducing features simplifies the model.
- c (scaling): No impact. Scaling does not change model complexity (assuming λ = 0 or λ is adjusted accordingly).
- α (step size): No impact. It affects convergence speed, not the final generalization capability.
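The effect of regularization on the train/test gap can be sketched numerically. This is a hypothetical illustration, not part of the assignment: synthetic data, seed, and dimensions are all assumptions, chosen with few samples relative to features so the unregularized fit overfits.

```python
# Hypothetical illustration: fit ridge regression with a tiny and a large
# lambda on noisy synthetic data. Larger lambda raises training error (more
# bias) while taming sensitivity to the sample (less variance).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 15                      # few samples, many features: prone to overfit
X_train = rng.normal(size=(n, d))
X_test = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y_train = X_train @ true_w + rng.normal(scale=2.0, size=n)
y_test = X_test @ true_w + rng.normal(scale=2.0, size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

results = {}
for lam in (1e-6, 10.0):
    w = ridge_fit(X_train, y_train, lam)
    results[lam] = (float(np.mean((X_train @ w - y_train) ** 2)),
                    float(np.mean((X_test @ w - y_test) ** 2)))
    print(f"lambda={lam:g}: train MSE={results[lam][0]:.2f}, "
          f"test MSE={results[lam][1]:.2f}")
```

Training MSE is provably non-decreasing in λ for ridge regression; the test MSE depends on the draw of the data.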
Regularization/Sparsity (7 pts)
In class, we demonstrated the intuition behind ℓ1 and ℓ2 regularization. In this problem we will try to see why ℓ1 regularization creates sparsity (i.e. reduces some elements of β to zero) from the perspective of gradient descent. As a reminder, here’s the ℓ1 regularized linear regression objective.
Objective:
L(β) = (1/n) Σᵢ₌₁ⁿ (f_β(xᵢ) − yᵢ)² + λ Σⱼ₌₁ᵈ |βⱼ|    (1)
- (a) (3 pts) Write down the partial derivative of the ℓ1 and ℓ2 regularization term (i.e., the second term) with respect to an individual weight βⱼ. You can ignore the case βⱼ = 0, where the gradient may be undefined. [Hint: Recall that the ℓ1 regularization term is Lℓ1(β) = λ Σⱼ₌₁ᵈ |βⱼ|, and the ℓ2 regularization term is Lℓ2(β) = λ Σⱼ₌₁ᵈ βⱼ².]
- (b) (2 pts) Does ℓ2 regularization encourage sparsity (i.e., push some coefficients βj to exactly zero)? Briefly justify your answer using the gradient in the previous question.
- (c) (2 pts) Does ℓ1 regularization encourage sparsity? (Analyze how the gradient of the second term in Equation 1 affects the gradient of the first term)
(a) Partial Derivatives
For the ℓ1 term (Lℓ1(β) = λ Σⱼ₌₁ᵈ |βⱼ|):
∂Lℓ1/∂βⱼ = λ · sign(βⱼ)
For the ℓ2 term (Lℓ2(β) = λ Σⱼ₌₁ᵈ βⱼ²):
∂Lℓ2/∂βⱼ = 2λβⱼ
(Note: the ℓ1 derivative is undefined at βⱼ = 0, but subgradients are used in practice.)
(b) Does ℓ2 encourage sparsity?
No.
- Justification: The gradient 2λβⱼ approaches 0 as βⱼ → 0. As the weight gets smaller, the "penalty force" pushing it toward zero also becomes infinitesimally small. It never forces the weight to exactly zero, only to very small values.
(c) Does ℓ1 encourage sparsity?
Yes.
- Justification: The gradient λ · sign(βⱼ) has constant magnitude λ regardless of how small βⱼ is. This provides a constant force pushing βⱼ toward zero. If the gradient of the data term is smaller than λ in magnitude at the origin, the weight gets stuck at exactly zero.
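This difference can be seen in a one-dimensional sketch. The example below is hypothetical (α and λ values are assumptions) and applies gradient steps to the penalty terms alone, starting from β = 1; the ℓ1 step uses the standard soft-thresholding treatment of the subgradient at zero.

```python
# Hypothetical 1-D illustration: gradient steps on the penalty terms alone.
# The l2 step shrinks beta multiplicatively and never reaches zero; the l1
# step subtracts a constant, so beta hits (and stays at) exactly zero.
alpha, lam = 0.1, 0.5
beta_l2 = 1.0
beta_l1 = 1.0
for _ in range(100):
    beta_l2 -= alpha * 2 * lam * beta_l2   # gradient of lam * beta^2 is 2*lam*beta
    # l1 step with soft-thresholding: if the constant step of size alpha*lam
    # would cross zero, snap to exactly zero (the standard proximal treatment).
    step = alpha * lam
    if abs(beta_l1) <= step:
        beta_l1 = 0.0
    else:
        beta_l1 -= step * (1 if beta_l1 > 0 else -1)

print(beta_l2)   # small but strictly positive (0.9**100)
print(beta_l1)   # exactly 0.0
```

After 100 steps the ℓ2 weight is ≈ 2.7e-5 but never zero, while the ℓ1 weight reached exactly zero after 20 steps and stayed there.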
Linear Regression (17 pts)
We are interested here in a linear regression problem with n samples and d features. The dataset has its features stored in a matrix X ∈ R n×d with corresponding labels in the vector y ∈ R n . The objective function for this problem can be written as:
J(w) = (1/n) ‖Xw − y‖²₂
- (a) (5 pts) Derive the closed-form solution for w∗ that minimizes J(w).
- (b) (5 pts) Is there a closed-form solution for linear regression with L2 regularization? If yes, derive the formula for the closed-form solution; if not, explain your answer.
- (c) (2 pts) Is linear regression with no regularization guaranteed to have a unique solution for any dataset?
- (d) [5190 ONLY] (5 pts) Show that in the special case of simple linear regression with an intercept, where ŷᵢ = w₀ + w₁xᵢ, the optimal parameters (w₀*, w₁*) satisfy Σᵢ₌₁ⁿ ŷᵢ(yᵢ − ŷᵢ) = 0.
Objective: J(w₀, w₁) = (1/n) Σᵢ₌₁ⁿ (yᵢ − w₀ − w₁xᵢ)².
(a) Closed-form solution for w*
Compute the gradient ∇J(w), using the matrix-calculus identity ∇w ‖Xw − y‖²₂ = 2Xᵀ(Xw − y):
∇J(w) = (2/n) Xᵀ(Xw − y)
Set the gradient to 0 for optimality:
Xᵀ(Xw − y) = 0  ⟹  XᵀX w = Xᵀy
Solve for w* (assuming XᵀX is invertible):
w* = (XᵀX)⁻¹ Xᵀy
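As a numeric sanity check of the normal-equations solution, we can compare it against numpy's least-squares solver on assumed random data (not from the assignment):

```python
# Verify the closed-form solution w* = (X^T X)^{-1} X^T y against
# numpy's reference least-squares solver on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # reference solver

print(np.allclose(w_closed, w_lstsq))              # both minimize J(w)
```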
(b) Closed-form for ℓ2 Regularization?
Yes.
Derivation: The objective is J(w) = (1/n) ‖Xw − y‖²₂ + λ‖w‖²₂. Setting the gradient to zero:
(2/n) Xᵀ(Xw − y) + 2λw = 0
Multiply by n/2:
XᵀX w − Xᵀy + nλw = 0  ⟹  w* = (XᵀX + nλI)⁻¹ Xᵀy
This solution always exists, since XᵀX + nλI is positive definite (hence invertible) for any λ > 0.
(Note: The factor n appears because the prompt divides the data term by n but not the regularization term. Standard derivations often omit it or absorb it into λ.)
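A quick check of this formula, on assumed synthetic data: compute w* = (XᵀX + nλI)⁻¹Xᵀy and confirm that the gradient of J(w) = (1/n)‖Xw − y‖²₂ + λ‖w‖²₂ vanishes there.

```python
# Ridge closed form: verify the gradient of the regularized objective
# is (numerically) zero at w* = (X^T X + n*lam*I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 0.3

w_star = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
grad = (2 / n) * X.T @ (X @ w_star - y) + 2 * lam * w_star
print(np.allclose(grad, 0))
```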
(c) Unique solution without regularization?
No.
A unique solution requires XᵀX to be invertible. This is not guaranteed if:
- d > n (fewer samples than features), or
- the features are linearly dependent (collinear).
In these cases XᵀX is singular (its determinant is 0), and infinitely many weight vectors achieve the minimum loss.
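The collinear case is easy to exhibit with a small hypothetical example: duplicating a feature column makes XᵀX rank-deficient.

```python
# A duplicated (collinear) feature column makes X^T X singular,
# so the normal equations have no unique solution.
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [3.0, 1.0, 1.0],
              [0.0, 4.0, 4.0]])   # third column duplicates the second

print(np.linalg.matrix_rank(X.T @ X))   # 2, not 3
print(np.linalg.det(X.T @ X))           # effectively 0
```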
(d) [5190 Only] Simple Linear Regression Property
Goal: Show Σᵢ ŷᵢ(yᵢ − ŷᵢ) = 0 at the optimum.
Let the residual be rᵢ = yᵢ − ŷᵢ. We must show Σᵢ ŷᵢ rᵢ = 0.
Expand the term using ŷᵢ = w₀* + w₁* xᵢ:
Σᵢ ŷᵢ rᵢ = w₀* Σᵢ rᵢ + w₁* Σᵢ xᵢ rᵢ
From the Normal Equations (setting partial derivatives to 0 during optimization):
- ∂J/∂w₀ = 0 ⟹ Σᵢ rᵢ = 0.
- ∂J/∂w₁ = 0 ⟹ Σᵢ xᵢ rᵢ = 0.
Substitute these zero sums back:
Σᵢ ŷᵢ rᵢ = w₀* · 0 + w₁* · 0 = 0.
Q.E.D.
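The two zero sums from the normal equations, and hence the property itself, can be checked numerically on assumed synthetic data:

```python
# Numeric sanity check: at the least-squares optimum of a simple linear
# regression, sum(r) = 0 and sum(x*r) = 0, hence sum(yhat*r) = 0.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 2.0 * x + 1.0 + rng.normal(size=30)

A = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
(w0, w1), *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = w0 + w1 * x
r = y - yhat

print(r.sum())             # ~0 (up to floating-point error)
print((x * r).sum())       # ~0
print((yhat * r).sum())    # ~0
```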
Linear Regression (7 pts)
Suppose we have a weight vector w = [w₁, w₂]ᵀ, input vectors xₙ ∈ ℝ², and labels yₙ ∈ {0, 1}, with prediction ŷ = wᵀx = w₁x₁ + w₂x₂. Initialize all weights to 0. Suppose we have N = 2 examples in our dataset: (x₁ = [1, −1]ᵀ, y₁ = 0), (x₂ = [−1, −1]ᵀ, y₂ = 1). Show what happens in each iteration of training an ℓ2-regularized (λ = 1) linear regression model with batch gradient descent (learning rate = 1) on the above dataset for two epochs (steps).
(a) (2 pts) What is the value of the loss function at the beginning?
(b) (5 pts) What is the final state of the trained weight vector after 2 steps, and the corresponding value of the loss function?
(Hint: derive the partial derivative of the loss function with respect to weights, and calculate their values after each step)
Setup:
- Loss: L(w) = (1/N) Σₙ₌₁ᴺ (wᵀxₙ − yₙ)² + λ‖w‖²₂, with N = 2, λ = 1.
- Gradient: ∇L(w) = (2/N) Σₙ₌₁ᴺ xₙ(wᵀxₙ − yₙ) + 2λw = Σₙ xₙ(wᵀxₙ − yₙ) + 2w for N = 2, λ = 1.
(a) Loss at Step 0
- Predictions (w = [0, 0]ᵀ): ŷ₁ = 0, ŷ₂ = 0.
- Data loss: (1/2)[(0 − 0)² + (0 − 1)²] = 0.5.
- Reg loss: ‖w‖²₂ = 0.
- Total Loss: 0.5
(b) Iterations
Step 1 (w = [0, 0]ᵀ):
Gradient calculation:
∇L = x₁(0 − 0) + x₂(0 − 1) + 2w = [1, 1]ᵀ + [0, 0]ᵀ = [1, 1]ᵀ
Update:
w ← w − 1 · [1, 1]ᵀ = [−1, −1]ᵀ
Step 2 (w = [−1, −1]ᵀ):
Predictions:
ŷ₁ = wᵀx₁ = −1 + 1 = 0, ŷ₂ = wᵀx₂ = 1 + 1 = 2
Gradient calculation:
∇L = x₁(0 − 0) + x₂(2 − 1) + 2w = [−1, −1]ᵀ + [−2, −2]ᵀ = [−3, −3]ᵀ
Update:
w ← w − 1 · [−3, −3]ᵀ = [2, 2]ᵀ
Final State:
- Weight vector: w = [2, 2]ᵀ
- Final loss:
- ŷ₁ = wᵀx₁ = 2 − 2 = 0.
- ŷ₂ = wᵀx₂ = −2 − 2 = −4.
- Data loss: (1/2)[(0 − 0)² + (−4 − 1)²] = 12.5.
- Reg loss: ‖w‖²₂ = 2² + 2² = 8.
- Total: 20.5
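The two hand-computed steps can be reproduced in a few lines, assuming the same (1/N)-averaged data loss used in the setup above:

```python
# Reproduce the two batch gradient-descent steps (lambda = 1, step size 1)
# for L(w) = (1/N) * sum_n (w.x_n - y_n)^2 + ||w||^2 on the two-point dataset.
import numpy as np

X = np.array([[1.0, -1.0], [-1.0, -1.0]])
y = np.array([0.0, 1.0])
lam, lr = 1.0, 1.0
w = np.zeros(2)

def loss(w):
    return np.mean((X @ w - y) ** 2) + lam * w @ w

print(loss(w))                               # initial loss: 0.5
for _ in range(2):
    grad = X.T @ (X @ w - y) + 2 * lam * w   # (2/N)*sum = sum, since N = 2
    w = w - lr * grad

print(w)                                     # [2. 2.]
print(loss(w))                               # 20.5
```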
Gradient Descent (4 pts)
Suppose you are training a linear regression model on a small dataset using gradient descent. Consider the following two separate scenarios:
- (a) (2 pts) The training loss is oscillating and even increasing. What is the likely cause, and how would you address it?
- (b) (2 pts) After 1,000 iterations, the training loss is still decreasing at a steady rate. What does this suggest, and how would you address it?
(a) Loss oscillating/increasing
- Cause: The learning rate (step size α) is too large. The updates overshoot the minimum, causing oscillation or divergence.
- Fix: Decrease the learning rate.
(b) Slow decrease after 1000 iterations
- Cause: The learning rate is too small. The model is converging, but steps are tiny, wasting computational resources.
- Fix: Increase the learning rate (cautiously).
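Both failure modes show up even on the simplest possible objective. This hypothetical 1-D demo (objective, rates, and step count are all assumptions) runs gradient descent on f(w) = w², whose gradient is 2w:

```python
# 1-D gradient descent on f(w) = w^2: each step does w <- w * (1 - 2*lr).
# lr > 1.0 makes |w| grow (loss oscillates and increases); a tiny lr
# converges but very slowly; a moderate lr converges fast.
def run_gd(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(abs(run_gd(1.1)))     # diverges: |w| = 1.2**50, huge
print(abs(run_gd(0.001)))   # too slow: still far from 0 after 50 steps
print(abs(run_gd(0.4)))     # well-chosen rate: essentially 0
```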
