By the law of large numbers, $\frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i) \to \mathbb{E}_{x \sim p}[\log q_\theta(x)]$ almost surely. This is the cross entropy of $p$ and $q_\theta$. Also note that if we subtract this from the entropy of $p$, we get $D_{\mathrm{KL}}(p \,\|\, q_\theta)$. So minimising the cross entropy over $\theta$ is equivalent to maximising $D_{\mathrm{KL}}(p \,\|\, q_\theta)$.
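As a quick numerical illustration of the law-of-large-numbers step (the discrete $p$, the fixed $q_\theta$, and the sample size below are arbitrary choices for illustration, not anything from the post):

```python
import numpy as np

# Illustrative discrete example: a true distribution p and a model q_theta
# over the outcomes {0, 1, 2}.  The numbers are made up.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])        # true distribution
q_theta = np.array([0.4, 0.4, 0.2])  # model distribution for one fixed theta

# Draw n i.i.d. samples from p and average log q_theta(x_i).
n = 200_000
samples = rng.choice(len(p), size=n, p=p)
empirical_avg = np.log(q_theta[samples]).mean()

# By the law of large numbers this should be close to E_{x~p}[log q_theta(x)].
expected = np.sum(p * np.log(q_theta))
print(empirical_avg, expected)  # the two numbers should nearly agree
```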
I think the cross entropy of $p$ and $q_\theta$ is actually $H(p, q_\theta) = -\mathbb{E}_{x \sim p}[\log q_\theta(x)]$ (note the negative sign). The entropy of $p$ is $H(p) = -\mathbb{E}_{x \sim p}[\log p(x)]$. Since $D_{\mathrm{KL}}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p)$, the KL divergence is actually the cross entropy minus the entropy, not the other way around. So minimising the cross entropy over $\theta$ will minimise (not maximise) the KL divergence.
I believe the next paragraph is still correct: the maximum likelihood estimator is the parameter $\theta$ which maximises $\frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i)$, which minimises the cross entropy, which minimises the KL divergence.
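To make this concrete, here is a small numerical sketch (the discrete $p$ and the Bernoulli model family below are just made-up choices for illustration) checking that $D_{\mathrm{KL}}(p \,\|\, q_\theta) = H(p, q_\theta) - H(p)$ holds and that the cross entropy and the KL divergence are minimised by the same $\theta$:

```python
import numpy as np

# Check the identity D_KL(p || q_theta) = H(p, q_theta) - H(p) and the fact
# that cross entropy and KL divergence share the same argmin over theta.
p = np.array([0.7, 0.3])              # true distribution: Bernoulli(0.3)
thetas = np.linspace(0.01, 0.99, 99)  # candidate model parameters

entropy_p = -np.sum(p * np.log(p))    # H(p), constant in theta

cross_entropies = []
kl_divergences = []
for theta in thetas:
    q_theta = np.array([1 - theta, theta])             # model: Bernoulli(theta)
    cross_entropy = -np.sum(p * np.log(q_theta))       # H(p, q_theta)
    kl = np.sum(p * np.log(p / q_theta))               # D_KL(p || q_theta)
    assert np.isclose(kl, cross_entropy - entropy_p)   # the identity
    cross_entropies.append(cross_entropy)
    kl_divergences.append(kl)

# Both objectives are minimised at the same theta (here theta = 0.3,
# matching the true distribution).
print(thetas[np.argmin(cross_entropies)], thetas[np.argmin(kl_divergences)])
```

Since $H(p)$ does not depend on $\theta$, the two objectives differ by a constant, so they necessarily share the same minimiser.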
Apologies if any of what I've said above is incorrect, I'm not an expert on this.
I think there is a mistake in this equation. $p$ and $q_\theta$ are the wrong way round. It should be: