Neural Network
Feed-Forward Neural Network
- Signals move in one direction - forward - with no cycles or loops.
- Also called Multi-Layer Perceptrons (MLP).
- Input Layers
- Hidden layers
- Output Layers
Matrix Notation
- 1-layer Neural Net: y = W1 x
- 2-layer Neural Net: y = W2 g(W1 x)
- 3-layer Neural Net: y = W3 g(W2 g(W1 x))
g is a non-linear activation function for hidden layers
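The layered matrix notation above can be sketched directly in NumPy. The shapes below (4 inputs, 8 hidden units, 2 outputs) and the choice of ReLU for g are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes for illustration: 4 inputs, 8 hidden units, 2 outputs.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))

def g(z):
    # Non-linear activation for hidden layers (ReLU chosen as an example).
    return np.maximum(0.0, z)

x = rng.normal(size=(4,))
y = W2 @ g(W1 @ x)  # 2-layer network: y = W2 g(W1 x)
```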
Non-Linearity

- Sigmoid activation function:
  - Outputs values between 0 and 1
  - Interpreted as the probability of the neuron firing/being activated
- ReLU (Rectified Linear Unit):
  - Efficient computation
  - Doesn't saturate
  - Most commonly used today
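Both activations above are one-liners; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Squashes inputs to (0, 1); interpretable as a firing probability.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified Linear Unit: cheap to compute, does not saturate for z > 0.
    return np.maximum(0.0, z)
```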
Why Non-Linearity?
Q: What if we try to build a neural network without one?
- 2-layer Neural Net: y = W2 g(W1 x) becomes y = W2 W1 x
- 3-layer Neural Net: y = W3 g(W2 g(W1 x)) becomes y = W3 W2 W1 x
A: We would end up with a linear classifier!
Non-Linearities are important for learning features/representations with increasing levels of complexity
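The collapse to a linear model is easy to verify numerically: with no activation, two weight matrices compose into a single one. Shapes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=(3,))

# Without an activation between layers, the two-layer network
# is exactly the single linear map W = W2 @ W1.
y_two_layer = W2 @ (W1 @ x)
y_one_layer = (W2 @ W1) @ x
```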
Model Capacity
Capacity of a feed-forward neural network is affected by both
- Depth: number of hidden layers
- Width: number of neurons in each hidden layer
More neurons = more capacity
Loss functions
Same as single-layer models (i.e., linear and logistic regression)
Regression:
- MSE loss:
Classification:
Binary cross entropy for binary classification:
Cross entropy for multi-class classification:
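The three loss functions named above can be sketched as follows (the `eps` clipping is an assumed numerical-stability detail, not part of the definitions):

```python
import numpy as np

def mse(y_hat, y):
    # Mean squared error for regression.
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(p, y, eps=1e-12):
    # p: predicted probabilities of class 1; y: 0/1 labels.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(probs, labels, eps=1e-12):
    # probs: (n, k) predicted class probabilities; labels: integer class ids.
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```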
Optimization
Solve for θ* = argmin_θ L(ŷ, y)
Q: Don't I have to optimize differently for different L(·)?
A: No, just use gradient descent. It is the most general optimization
approach we know.
Q: But what if L(·) is non-convex in θ?
A: It almost surely is. Do gradient descent anyway. Just make sure
everything is differentiable.
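As a minimal sketch of "just use gradient descent", the loop below fits a linear model under MSE loss with a hand-coded gradient. The data, learning rate, and iteration count are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
alpha = 0.1  # assumed learning rate
for _ in range(200):
    # Gradient of the MSE loss (1/n) * sum((X theta - y)^2) w.r.t. theta.
    grad = 2 * X.T @ (X @ theta - y) / len(y)
    theta -= alpha * grad
```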
Stochastic Gradient Descent
Back-Propagation for Computing Gradients
Neural network tips and tricks
- Optimization
- Activation Functions
- Managing Weights
- Dropout
- Managing Training
Optimization
Challenges
- Narrow Valleys
- Saddle Points
Accelerated Gradient Descent
Vanilla gradient descent:
θ ← θ − α ∇θ L(fθ(x), y)
Accelerated gradient descent (momentum):
v ← ρ v − α ∇θ L(fθ(x), y)
θ ← θ + v
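The momentum update can be sketched as a single step function (default α and ρ are assumed values):

```python
def momentum_step(theta, v, grad, alpha=0.1, rho=0.9):
    # Velocity accumulates a decaying sum of past gradients,
    # which damps oscillation in narrow valleys.
    v = rho * v - alpha * grad
    theta = theta + v
    return theta, v
```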
Nesterov Momentum
Adaptive Learning Rates
Activation Functions
Historical Activation Functions
- sigmoid
- tanh
Vanishing Gradient Problem
- The gradient of the sigmoid function is often nearly zero
  - Recall: In backpropagation, gradients are products of local gradients
  - These products quickly multiply to zero!
  - Early layers update very slowly
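The effect is easy to quantify: the sigmoid's derivative peaks at 0.25 (at z = 0), so even in the best case a chain of sigmoid layers shrinks the gradient geometrically:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)  # maximized at z = 0, where it equals 0.25

# Best-case product of local gradients through 10 sigmoid layers:
print(0.25 ** 10)  # on the order of 1e-6: early layers barely update
```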
ReLU Activation
Activation function
g(z) = max(0, z)
- Gradient is nonzero (equal to 1) on the entire region z > 0
- Significant performance gains for deep neural networks
Leaky ReLU Activation
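Leaky ReLU keeps a small gradient alive for negative inputs; a minimal sketch (the 0.01 slope is a common but assumed default):

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    # Unlike ReLU, negative inputs get a small nonzero slope,
    # so the gradient never dies entirely for z < 0.
    return np.where(z > 0, z, slope * z)
```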
Managing Weights
Weight Initialization
Zero initialization: Very bad choice!
- All neurons z_j = g(w_j^T x) in a given layer remain identical
- Intuition: They start out equal, so their gradients are equal!
Long history of initialization tricks for W_l based on "fan in" n_in
- Here, n_in is the dimension of the input of layer W_l
- Intuition: Keep initial layer inputs z_l in the "linear" part of sigmoid
- Note: Initialize intercept terms to 0
Kaiming initialization (also called "He initialization")
- For ReLU activations, use W_l ∼ N(0, 2/n_in)
Xavier initialization
- For tanh activations, use W_l ∼ N(0, 1/(n_in + n_out)) (n_out is the output dimension)
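Both schemes above are Gaussian draws with a fan-in-dependent variance; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_init(n_in, n_out):
    # He initialization for ReLU layers: variance 2 / n_in.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def xavier_init(n_in, n_out):
    # Xavier initialization for tanh layers: variance 1 / (n_in + n_out).
    return rng.normal(0.0, np.sqrt(1.0 / (n_in + n_out)), size=(n_out, n_in))
```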
Batch Normalization
Problem
- During learning, the distribution of inputs to each layer shifts (since the layers below are also updating)
- This causes the objective to be highly irregular, making it hard to take large steps in the loss landscape
Solution
- As with feature standardization, standardize the inputs to each layer to N(0, I)
- Batch norm: compute the mean and standard deviation of the current mini-batch and use them to normalize the current layer (this is differentiable!)
- Note: Needs nontrivial mini-batches or it will divide by zero
- Apply after every layer (typically before the activation)
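The normalization step can be sketched as below; the learnable scale/shift parameters that full batch norm adds are omitted here, and `eps` is the assumed guard against the divide-by-zero case noted above:

```python
import numpy as np

def batch_norm(Z, eps=1e-5):
    # Standardize each feature across the mini-batch (rows = examples).
    mu = Z.mean(axis=0)
    sigma = Z.std(axis=0)
    return (Z - mu) / (sigma + eps)  # eps guards against division by zero
```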

Regularization
Can use L1 and L2 regularization as before
- As before, do not regularize any of the intercept terms!
- L2 regularization is more common
Applied to "unrolled" weight matrices
- Equivalently, the Frobenius norm
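The equivalence is direct: the squared L2 norm of the unrolled weight vector is the squared Frobenius norm of the matrix. A quick numerical check with an arbitrary matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

# L2 penalty on the flattened weights equals the squared Frobenius norm.
l2_unrolled = np.sum(W.ravel() ** 2)
frobenius_sq = np.linalg.norm(W, 'fro') ** 2
```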
