Bayes Rule - The silent superpower that enables Machine Learning


If you read my last post on What is Machine Learning - A Netflix Example, you will recall that I tried to explain a very complicated concept (and what one of the most successful companies of my generation - Netflix - does with it) in a way almost everyone can understand. Today, I'm going to try to go one better and offer another simple explanation of one of the most important but hardest-to-grasp concepts in Machine Learning - Bayes Theorem.

Like my last article, I won't bore you with explanations you can find for yourself on the internet. Instead, I'll offer an explanation that will - hopefully - give you an intuitive grounding in the concept. As always, please let me know how I did, or how I can explain things better, in the comments section.

So what is Bayes Rule - without the maths?

Bayes rule simply states that I can improve my current belief state (or knowledge) about anything - this is extremely crucial for learning, as I'll explain shortly - by appropriately reweighting that belief with new evidence via a learning mechanism. This definition might sound like a mouthful (though it is quite intuitive), so let's unpack it before we continue.
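For readers who do want a peek at the maths, Bayes rule in its standard form reads:

P(belief | evidence) = P(evidence | belief) × P(belief) / P(evidence)

Here P(belief) is the prior (your current belief state), P(evidence | belief) is the likelihood (the learning mechanism we'll meet below), and P(belief | evidence) is the posterior (your improved belief state). Everything that follows is this one line in disguise.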

Bayes rule improves my current belief state

Think about it for a second. Let's consider any learning process. Almost everyone comes into the learning arena with a current belief state that is, more often than not, wrong. For example, consider a newborn baby. Since the first human she sees is the doctor (or nurse) who delivers her, she might assume that person to be her parent. That is a current belief state that is not totally unreasonable - in fact, we call this an unbiased prediction or belief. However, how does she later learn that the doctor is not her parent? Clearly, there has to be a learning mechanism that allows for this, or such learning would not occur even in infinite time. In Bayes rule parlance, this improved estimate of the current belief state is called the posterior probability or belief.

Bayes rule improves my current belief state via an appropriate learning mechanism

The learning mechanism is actually the key to understanding why Bayes rule is so powerful in machine learning (ML). Continuing with the example of the newborn baby, an appropriate learning mechanism comes from evidence such as being breastfed, or simply not seeing the doctor any more (or at least not as regularly as her parents). The baby then updates her prior belief of the doctor being her parent (probably over a short but finite time) to the posterior belief that the woman breastfeeding her is her parent. Remarkable and efficient! The learning mechanism, in Bayesian parlance, is called the likelihood function.
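To make this concrete, here is a minimal sketch in Python of a single Bayesian update for the baby example. The prior and likelihood numbers are purely illustrative assumptions on my part, not measured values:

```python
# A single application of Bayes rule: posterior ~ likelihood x prior.
# Hypothetical numbers for the "who is my parent?" example.

prior = {"doctor": 0.5, "mother": 0.5}  # unbiased starting belief

# Likelihood of the observed evidence (being breastfed by the mother,
# rarely seeing the doctor) under each hypothesis - assumed values.
likelihood = {"doctor": 0.05, "mother": 0.9}

# Unnormalized posterior for each hypothesis.
unnormalized = {h: likelihood[h] * prior[h] for h in prior}

# Normalize so the posterior beliefs sum to 1.
evidence = sum(unnormalized.values())
posterior = {h: p / evidence for h, p in unnormalized.items()}

print(posterior)  # mother is about 0.95, doctor about 0.05
```

One piece of evidence, and the belief swings decisively toward the mother - exactly the update the baby performs.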

So now that we've explained Bayes rule (without the maths), how does it relate to machine learning (ML)? We'll explore this by using the simplest machine learning algorithm out there - single variable linear regression.

Linear Regression is inherently Bayesian

Recall the linear regression you learnt in high school (ok, allow me to use a little maths here): y = mx + c, where y is the predicted or dependent variable, x the independent variable, m the slope of the line, and c a constant, more commonly called the intercept of the straight line.
Did you know that such regression is a subset of machine learning? More interestingly, if you've ever done such regression (fitting a straight line through a set of data points) for work or school, you've practiced machine learning (ML) - how cool is that?
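In fact, fitting that line takes only a couple of lines of code. Here is a minimal sketch using numpy's polyfit, with data points I've invented purely for illustration:

```python
import numpy as np

# A few invented data points that roughly follow a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit y = m*x + c by least squares (a degree-1 polynomial fit).
m, c = np.polyfit(x, y, deg=1)

print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
```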

Okay, back to our discussion. How is fitting a line to a set of data points Bayesian? The answer lies in recasting the regression problem - y = mx + c - into the Bayesian framework described above.
Here, y can be taken as our posterior belief state, our improved estimate of our current (or prior) belief state. That is easy to understand, since y is what we are trying to predict - the outcome of our learning.

Now if y is our posterior belief state, then what is our prior belief state? You guessed it - of course it's x, our independent variable! To understand why x is a prior, imagine I ask you to close your eyes and say to you: I have a scatter plot of data points x, and I want you to tell me the best y that fits that data. The best (and most logical and unbiased) answer you can give me is that y is equal to x - a fantastic prior! Now we've established the Bayesian posterior and prior for linear regression. You may ask: what of the learning mechanism, or likelihood function?

You have all clearly read my mind and are getting the hang of this - the learning mechanism, or likelihood function, has to be m, the slope of the line. Why is this so? Recall, I had your eyes closed. Now imagine I ask you to open your eyes, and you see from the data that the points slope slightly upwards. Aha! You immediately realize that your original prior of y = x must be wrong and that the slope must be a positive number less than 1 (if you don't know why this is so, bug me in the comments section:-). Do you now see how the slope parameter m is a learning function or mechanism?
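Here is a minimal sketch of that update in code, under some assumptions I'm making purely for illustration: a Gaussian prior on the slope m centred at our eyes-closed guess of 1, Gaussian observation noise, and a simplified no-intercept model y = m*x. The data points and variances are invented:

```python
import numpy as np

# Invented data points that slope upwards, but less steeply than y = x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.7, 1.3, 2.2, 2.6, 3.4])

# Prior on the slope m: with eyes closed we guessed y = x, so the mean is 1.
prior_mean, prior_var = 1.0, 1.0
noise_var = 0.25  # assumed observation noise

# Conjugate Gaussian update for m in the model y = m*x + noise:
# posterior precision = prior precision + sum(x^2) / noise variance.
posterior_var = 1.0 / (1.0 / prior_var + np.dot(x, x) / noise_var)
posterior_mean = posterior_var * (prior_mean / prior_var + np.dot(x, y) / noise_var)

print(f"posterior slope = {posterior_mean:.2f}")  # about 0.68, below the prior of 1
```

The posterior slope lands around 0.68 - pulled away from our prior of 1 by exactly the kind of evidence your open eyes provide.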

One final note: the astute reader might ask, what is the mapping of the regression constant c into this Bayesian picture? A complicated question with an even more complicated answer, but I'll try a much simpler explanation. You see, every prediction process has an error embedded in it. In machine learning (ML), we measure these errors with loss functions. The goal of machine learning is to minimize these loss functions, but there always exists a small but finite loss or error, even from the best models. The regression constant can be thought of as a Bayesian loss or error value left over from updating my prior belief state via the learning mechanism of the slope of the line m.
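If loss functions are new to you, here is a minimal sketch of the most common one, the mean squared error, computed for the fitted line from the earlier example (same invented data):

```python
import numpy as np

# Same invented data as before.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Best-fit line by least squares.
m, c = np.polyfit(x, y, deg=1)

# Mean squared error: the average squared miss of the line on the data.
mse = np.mean((y - (m * x + c)) ** 2)
print(f"mean squared error = {mse:.4f}")  # small, but not exactly zero
```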

Are all Machine Learning (ML) models inherently Bayesian then?

Frankly, the jury is still out on a definitive proof of this. Several experts (including my humble self) believe it to be the case, but to the best of my knowledge, no one has produced a concrete proof. Interestingly enough, no one has produced cases showing that machine learning (ML) algorithms are not inherently Bayesian, either. Since we have at least one case (the linear regression example) shown to be Bayesian (in fact, several researchers have shown algorithms such as artificial neural networks (ANNs) to also be Bayesian), I will say we believers are ahead - don't you think so?

Okay, this post was probably more technical than you may be used to, but Bayesian thinking is a fascinating way to reason through life in general - so much so that even the machines know it and are using it. And so should you:-)

Yours in learning,

