Hello Konstantin,

I wholeheartedly agree about the use of the word “proof” and do not take the term lightly. But respectfully, what is presented in the piece is a valid and complete proof; to state it isn’t is simply incorrect. I would add that the proof presented is not of my own conjuring: it is the widely used, traditional proof you would find in almost any mathematical statistics textbook or academic paper on the subject.

Your comment that you “… cannot simply state that NC_* = B(alpha_*, beta_*)” is not correct. You can in fact make this statement with solid mathematical justification. Below I’ve tried to be as clear as possible and to break the argument into separate steps:

• Toward the end of the Binomial proof, we showed that “p^(alpha_*-1)*(1-p)^(beta_*-1)” is proportional to the PDF of the posterior distribution. It differs only by a multiplicative proportionality constant (1/NC_*).

• We know by definition the PDF of the posterior distribution when integrated over the entire support of “p” must be 1.

• We know that for a Beta distributed random variable “p_prime” ~ Beta(alpha_*, beta_*), “p_prime” has PDF: “(1/B(alpha_*, beta_*))*p_prime^(alpha_*-1)*(1-p_prime)^(beta_*-1)”

• By definition, the PDF of “p_prime” when integrated over the support of “p_prime” must be equal to 1.

• Note that both the PDF of our posterior distribution and the PDF of “p_prime” share the identical kernel (i.e. “p^(alpha_*-1)*(1-p)^(beta_*-1)”)

• Because the PDF of the posterior and the PDF of “p_prime” share the same kernel, and because both integrate to the same value (i.e. 1), their multiplicative normalizing constants must by deduction be the same value (i.e. NC_* = B(alpha_*, beta_*)). Mathematically, there is no value of NC_* other than B(alpha_*, beta_*) that would allow our posterior distribution to meet the integration condition. Remember, NC_* and B(alpha_*, beta_*) are not functions; they are just fixed scalars. The “function” parts of our PDFs are entirely contained in their respective kernels, and again, the kernels in this case are identical. If I give you two functions, you see they have the same kernel, and they both integrate to the same value, then mathematically their normalizing constants must also be equal. If you think about it, I think you’ll get it 😊
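If it helps, the deduction is easy to sanity-check numerically: integrating the kernel over (0, 1) recovers exactly B(alpha_*, beta_*), so the normalizing constant can be nothing else. Here is a minimal stdlib-Python sketch; the values alpha_* = 5, beta_* = 3 are illustrative placeholders, not values taken from the piece:

```python
import math

alpha_star, beta_star = 5.0, 3.0

def kernel(p):
    """The shared Beta kernel: p^(alpha_*-1) * (1-p)^(beta_*-1)."""
    return p ** (alpha_star - 1) * (1 - p) ** (beta_star - 1)

# Midpoint-rule integration of the kernel over (0, 1).
# By definition, this integral IS the posterior's normalizing constant NC_*.
n = 200_000
h = 1.0 / n
nc_star = sum(kernel((i + 0.5) * h) for i in range(n)) * h

# The Beta function via the Gamma function: B(a, b) = Gamma(a)*Gamma(b)/Gamma(a+b)
beta_fn = math.gamma(alpha_star) * math.gamma(beta_star) / math.gamma(alpha_star + beta_star)

print(nc_star)   # ≈ 0.00952381, i.e. B(5, 3) = 1/105
print(beta_fn)   # ≈ 0.00952381, same value
assert abs(nc_star - beta_fn) < 1e-9
```

The same check works for any (alpha_*, beta_*) pair with positive parameters, which is exactly the point: the kernel alone pins down the constant.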

I think what’s rubbing you the wrong way is that we didn’t explicitly prove NC_* = B(alpha_*, beta_*) in a direct manner. Rather, we leveraged the dual property of identical kernels and identical integration conditions to indirectly prove NC_* = B(alpha_*, beta_*). The short answer is that this is a scenario where such an indirect proof is still sound and mathematically rigorous.

Interestingly, if you look at the history in this space, the majority of conjugate priors for common parametric distributions were first discovered by leveraging this “dual” kernel/integration condition approach (as opposed to explicitly trying to directly show normalizing constants are equal to each other).

Phew! That was a lot. Again, I really focused this piece on the numerical methods. I stuck the conjugate prior material in just for fun. I wasn’t expecting to get such questions on it! It’s been a good engagement.

Let me know what you think after working through the numerical methods piece.

- Andrew

Principal Data/ML Scientist @ The Cambridge Group | Harvard trained Statistician and Machine Learning Scientist | Expert in Statistical ML & Causal Inference
