
8. Layer Normalisation

Hard · Normalisation · Backpropagation · NumPy · Transformers

Implement the forward and backward passes of Layer Normalisation.

Layer Norm normalises each token (the last dimension) independently, then applies a learned affine transform:

$$\text{LayerNorm}(x) = \gamma \cdot \hat{x} + \beta, \qquad \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

where $\mu$ and $\sigma^2$ are computed over the last dimension only, and $\gamma, \beta \in \mathbb{R}^D$ are learnable per-dimension parameters.

You must implement:

  • forward(x) — normalise x and return the result. Cache whatever you need for backward.
  • backward(dy) — given the upstream gradient dy, return (dx, dgamma, dbeta).

A gradient checker (_grad_check) is provided. It compares your analytic gradients against finite differences and returns True if all three agree within tolerance. Your solution passes when it returns True for all test cases.
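For intuition about what `_grad_check` is doing (its exact implementation is not shown here), a central-finite-difference gradient of a scalar function looks roughly like this; the helper name `fd_grad` is invented for illustration:

```python
import numpy as np

def fd_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar f w.r.t. array x."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + eps
        fp = f(x)
        x[idx] = old - eps
        fm = f(x)
        x[idx] = old  # restore the perturbed entry
        g[idx] = (fp - fm) / (2 * eps)
    return g
```

A checker of this kind perturbs each entry of `x`, `gamma`, and `beta` in turn and compares the resulting slope against your analytic `dx`, `dgamma`, and `dbeta`.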

Shapes

| Variable | Shape |
|----------|-------|
| `x`, `dy`, `dx` | `(..., D)` — any number of batch dims |
| `gamma`, `beta`, `dgamma`, `dbeta` | `(D,)` |

Example

```python
ln = LayerNorm(4)
x = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (1, 4)
y = ln.forward(x)
# y ≈ [[-1.342, -0.447, 0.447, 1.342]]  (unit variance, then scaled by gamma=1, beta=0)
```

Constraints

  • Use only NumPy (no PyTorch, no autograd).
  • Your backward pass must be fully analytic — do not call forward inside backward.
  • Must work for inputs with 2 or more dimensions, e.g. (B, D) and (B, T, D).
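As a starting point for the analytic backward pass, one standard formulation expresses `dx` in terms of the cached normalised input and inverse standard deviation. This is a sketch under the assumption that forward cached `x_hat` and `inv_std = 1/sqrt(var + eps)`; the function signature is illustrative, not part of the problem API:

```python
import numpy as np

def layernorm_backward(dy, x_hat, inv_std, gamma):
    """Analytic LayerNorm backward: returns (dx, dgamma, dbeta)."""
    # Affine-parameter gradients: sum over every batch dimension
    batch_axes = tuple(range(dy.ndim - 1))
    dgamma = (dy * x_hat).sum(axis=batch_axes)
    dbeta = dy.sum(axis=batch_axes)
    # Gradient through the normalisation itself
    dx_hat = dy * gamma
    dx = inv_std * (dx_hat
                    - dx_hat.mean(axis=-1, keepdims=True)
                    - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```

The two subtracted means come from differentiating through $\mu$ and $\sigma^2$ respectively; a useful sanity check is that `dx` sums to zero along the last axis, since adding a constant to a token does not change its normalised output.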