post by [deleted]

This is a link post for


comment by gjm · 2021-01-08T09:54:53.152Z

You've dropped a factor of 2 about half-way through your calculation. And then you've multiplied by $X^{-1}$ between two lines separated by "="; the idea is that both sides are zero so it kinda-sorta makes sense, but it's super-misleading. If you restore the factor of 2, it carries through to your last equation as well.

But even this is wrong, I'm afraid. You can't multiply by $X^{-1}$ there at all. There is no $X^{-1}$: $X$ is not (except by coincidence, and in an ML application if this coincidence happens then you don't have anything like enough data) a square matrix, and in general it has no inverse.
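To see this concretely, here's a quick NumPy check (the shapes and seed are arbitrary, just for illustration): a tall data matrix has no inverse at all, though it does have a Moore–Penrose pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 data points, 3 features: not square

try:
    np.linalg.inv(X)             # inv is only defined for square matrices
    inverted = True
except np.linalg.LinAlgError:
    inverted = False
print(inverted)                  # False

# What does always exist is the Moore–Penrose pseudoinverse:
X_pinv = np.linalg.pinv(X)       # shape (3, 100)
# Here it acts as a left inverse, since X has full column rank:
print(np.allclose(X_pinv @ X, np.eye(3)))  # True
```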

There are problems earlier in the derivation, too, which I think are encouraged by some of your nonstandard notation. E.g., you write products like $wX$ rather than $Xw$ or $w^TX^T$, and this has fooled you into writing down something wrong for what you write as $\frac{dL}{dw}$. That's also nonstandard notation; it's defensible, but again it makes it easy to get things wrong by mixing up left and right multiplications. Let's do it with more standard and explicit notation, which will make it harder to make mistakes:

Expanding $L(w)=\|Xw-y\|^2+\lambda\|w\|^2$ gives $w^TX^TXw - y^TXw - w^TX^Ty + y^Ty + \lambda w^Tw$. The $y^Ty$ is constant and its derivative is zero. The terms linear in $w$ are one another's transposes and readily yield $-2w^TX^Ty$, whose $\partial/\partial w_i$ is $-2(X^Ty)_i$. The second quadratic term is just $\lambda\sum_j w_j^2$, whose $\partial/\partial w_i$ is $2\lambda w_i$. The first quadratic term is similarly $w^TX^TXw$, which equals $\sum_{j,k} w_j (X^TX)_{jk} w_k$, whose $\partial/\partial w_i$ is $2(X^TXw)_i$.

So what ends up being zero is the $i$th component of $2X^TXw - 2X^Ty + 2\lambda w$, and if you like you can write $\frac{dL}{dw} = 2X^TXw - 2X^Ty + 2\lambda w$. But again you need to be very clear about what you mean by that; $\frac{dL}{dw}$ means "the row vector $v$ such that to first order $L(w+h) \approx L(w) + vh$", and so actually the Right Thing to use for the "derivative" is the transpose of what I wrote down above.

Finishing off the correct derivation, we have

$2X^TXw - 2X^Ty + 2\lambda w = 0$, so $(X^TX + \lambda I)w = X^Ty$, so $w = (X^TX + \lambda I)^{-1}X^Ty$.
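If you want to double-check that algebra numerically, here's a small NumPy sketch (the data and the value of $\lambda$ are arbitrary): the closed-form $w$ makes the gradient of $\|Xw-y\|^2 + \lambda\|w\|^2$ vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))     # arbitrary synthetic data
y = rng.normal(size=50)
lam = 0.1                        # arbitrary regularization strength

# closed-form solution w = (X^T X + λI)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# gradient 2X^T(Xw - y) + 2λw of the regularized loss should vanish at w
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
print(np.allclose(grad, 0))      # True
```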

comment by gjm · 2021-01-08T09:58:15.072Z

I think it genuinely doesn't make sense to say that $\lambda$ reflects our prior expectation about the data; the acolyte is correct. What $\lambda$ reflects is our prior on $w$; that regularization term corresponds exactly (taking $e^{-L(w)}$ as the unnormalized posterior density) to a prior that makes $w$ (multivariate) normally distributed with mean zero and covariance $\frac{1}{2\lambda}$ times the identity (i.e., components independent and each component having variance $\frac{1}{2\lambda}$).
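Here's a quick NumPy illustration of that correspondence (the data are arbitrary): the ridge solution coincides with the posterior mean/MAP of a Bayesian linear regression with noise variance $\frac{1}{2}$ (matching $e^{-\|Xw-y\|^2}$) and prior variance $\frac{1}{2\lambda}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))     # arbitrary synthetic data
y = rng.normal(size=30)
lam = 0.5

# ridge estimate w = (X^T X + λI)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian linear regression: likelihood y|w ~ N(Xw, σ²I) with σ² = 1/2,
# prior w ~ N(0, τ²I) with τ² = 1/(2λ); the posterior is Gaussian, so
# its mean is also its mode (the MAP estimate).
sigma2, tau2 = 0.5, 1.0 / (2 * lam)
posterior_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(3) / tau2)
w_map = posterior_cov @ (X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))  # True
```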