Implement the forward and backward passes of Layer Normalisation.
Layer Norm normalises each token (the last dimension) independently, then applies a learned affine transform:
LayerNorm(x) = γ ⊙ x̂ + β,  where  x̂ = (x − μ) / √(σ² + ε)

where μ and σ² are computed over the last dimension only, and γ, β ∈ ℝᴰ are learnable per-dimension parameters.
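For reference (this derivation is not part of the assignment text, but follows from the chain rule applied to the expression above): let g = dy ⊙ γ be the gradient with respect to x̂. With means taken over the last dimension, the gradients are

dx = (g − mean(g) − x̂ ⊙ mean(g ⊙ x̂)) / √(σ² + ε)
dγ = Σ over all batch dims of dy ⊙ x̂
dβ = Σ over all batch dims of dy

The two subtracted terms in dx account for the dependence of μ and σ² on every element of the row.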
You must implement:
- `forward(x)` — normalise x and return the result. Cache whatever you need for backward.
- `backward(dy)` — given the upstream gradient dy, return `(dx, dgamma, dbeta)`.

A gradient checker (`_grad_check`) is provided. It compares your analytic gradients against finite differences and returns `True` if all three agree within tolerance. Your solution passes when it returns `True` for all test cases.
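The two methods above can be sketched as follows. This is a minimal NumPy sketch, assuming the stub's `LayerNorm(dim)` constructor initialises `gamma` to ones and `beta` to zeros; adapt the names to the provided skeleton.

```python
import numpy as np

class LayerNorm:
    """Minimal sketch; class/attribute names assumed, not taken from the stub."""

    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.eps = eps

    def forward(self, x):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        self.inv_std = 1.0 / np.sqrt(var + self.eps)
        self.xhat = (x - mu) * self.inv_std   # cached for backward
        return self.gamma * self.xhat + self.beta

    def backward(self, dy):
        # Parameter grads: sum over every batch dim so shapes come out (D,)
        batch_axes = tuple(range(dy.ndim - 1))
        dbeta = dy.sum(axis=batch_axes)
        dgamma = (dy * self.xhat).sum(axis=batch_axes)
        # Input grad: g = dL/dxhat, then subtract the mean component and the
        # component along xhat (both introduced by mu and sigma^2)
        g = dy * self.gamma
        m1 = g.mean(axis=-1, keepdims=True)
        m2 = (g * self.xhat).mean(axis=-1, keepdims=True)
        dx = (g - m1 - self.xhat * m2) * self.inv_std
        return dx, dgamma, dbeta
```

Because all normalisation statistics use `axis=-1` and the parameter grads reduce over every leading axis, the same code handles any number of batch dimensions.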
| Variable | Shape |
|----------|-------|
| x, dy, dx | (..., D) — any number of batch dims |
| gamma, beta, dgamma, dbeta | (D,) |
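The provided `_grad_check` is not shown here, but such checkers are typically built on a central-difference gradient of a scalar loss. A generic sketch (the function name `fd_grad` is illustrative, not from the assignment):

```python
import numpy as np

def fd_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of scalar-valued f at array x."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the perturbed entry
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g
```

Checking your `dx` then amounts to comparing `fd_grad(lambda x: loss(forward(x)), x)` against the analytic gradient for some scalar loss, e.g. a weighted sum of the outputs.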
```python
ln = LayerNorm(4)
x = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (1, 4)
y = ln.forward(x)
# y ≈ [[-1.342, -0.447, 0.447, 1.342]]
# (unit variance, then scaled by gamma=1 and shifted by beta=0)
```
Notes:
- Do not re-run `forward` inside `backward`; use the values you cached during the forward pass.
- The test cases cover inputs of shape `(B, D)` and `(B, T, D)`.
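Handling both shapes requires no extra code if every statistic is computed over the last axis. A quick forward-only illustration (the free function `ln_forward` is illustrative, not part of the stub):

```python
import numpy as np

def ln_forward(x, gamma, beta, eps=1e-5):
    # All reductions use axis=-1, so any leading batch dims broadcast through
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))                    # (B, T, D)
y = ln_forward(x, np.ones(4), np.zeros(4))
# each (b, t) row of y is normalised independently
```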