Implement the forward and backward passes of Layer Normalisation.
Layer Norm normalises each token (the last dimension) independently, then applies a learned affine transform:
LayerNorm(x) = γ ⊙ x̂ + β,  where  x̂ = (x − μ) / √(σ² + ε)

where μ and σ² are computed over the last dimension only, and γ, β ∈ ℝᴰ are learnable per-dimension parameters.
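For reference (this derivation is not part of the assignment text, but follows from the chain rule applied to the expression above): let g = dy ⊙ γ be the gradient with respect to x̂. With means taken over the last dimension, the gradients are

dx = (g − mean(g) − x̂ ⊙ mean(g ⊙ x̂)) / √(σ² + ε)
dγ = Σ over all batch dims of dy ⊙ x̂
dβ = Σ over all batch dims of dy

The two subtracted terms in dx account for the dependence of μ and σ² on every element of the row.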
You must implement:
- `forward(x)` — normalise x and return the result. Cache whatever you need for backward.
- `backward(dy)` — given the upstream gradient dy, return `(dx, dgamma, dbeta)`.

A gradient checker (`_grad_check`) is provided. It compares your analytic gradients against finite differences and returns `True` if all three agree within tolerance. Your solution passes when it returns `True` for all test cases.
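The two methods above can be sketched as follows. This is a minimal NumPy sketch, assuming the stub's `LayerNorm(dim)` constructor initialises `gamma` to ones and `beta` to zeros; adapt the names to the provided skeleton.

```python
import numpy as np

class LayerNorm:
    """Minimal sketch; class/attribute names assumed, not taken from the stub."""

    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.eps = eps

    def forward(self, x):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        self.inv_std = 1.0 / np.sqrt(var + self.eps)
        self.xhat = (x - mu) * self.inv_std   # cached for backward
        return self.gamma * self.xhat + self.beta

    def backward(self, dy):
        # Parameter grads: sum over every batch dim so shapes come out (D,)
        batch_axes = tuple(range(dy.ndim - 1))
        dbeta = dy.sum(axis=batch_axes)
        dgamma = (dy * self.xhat).sum(axis=batch_axes)
        # Input grad: g = dL/dxhat, then subtract the mean component and the
        # component along xhat (both introduced by mu and sigma^2)
        g = dy * self.gamma
        m1 = g.mean(axis=-1, keepdims=True)
        m2 = (g * self.xhat).mean(axis=-1, keepdims=True)
        dx = (g - m1 - self.xhat * m2) * self.inv_std
        return dx, dgamma, dbeta
```

Because all normalisation statistics use `axis=-1` and the parameter grads reduce over every leading axis, the same code handles any number of batch dimensions.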
| Variable | Shape |
|----------|-------|
| x, dy, dx | (..., D) — any number of batch dims |
| gamma, beta, dgamma, dbeta | (D,) |
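The provided `_grad_check` is not shown here, but such checkers are typically built on a central-difference gradient of a scalar loss. A generic sketch (the function name `fd_grad` is illustrative, not from the assignment):

```python
import numpy as np

def fd_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of scalar-valued f at array x."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                      # restore the perturbed entry
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g
```

Checking your `dx` then amounts to comparing `fd_grad(lambda x: loss(forward(x)), x)` against the analytic gradient for some scalar loss, e.g. a weighted sum of the outputs.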
```python
ln = LayerNorm(4)
x = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (1, 4)
y = ln.forward(x)
# y ≈ [[-1.342, -0.447, 0.447, 1.342]]
# (unit variance, then scaled by gamma=1 and shifted by beta=0)
```
Notes:
- Do not re-run `forward` inside `backward`; use the values you cached during the forward pass.
- The test cases cover inputs of shape `(B, D)` and `(B, T, D)`.
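Handling both shapes requires no extra code if every statistic is computed over the last axis. A quick forward-only illustration (the free function `ln_forward` is illustrative, not part of the stub):

```python
import numpy as np

def ln_forward(x, gamma, beta, eps=1e-5):
    # All reductions use axis=-1, so any leading batch dims broadcast through
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4))                    # (B, T, D)
y = ln_forward(x, np.ones(4), np.zeros(4))
# each (b, t) row of y is normalised independently
```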