I’m reading the book “Mathematics for Machine Learning”, it’s a free book that you can find here. So I’m reading section 8.3 of the book which explains the `maximum likelihood estimation`

(or MLE).

This is my understanding of how MLE works in machine learning:

Say we have a dataset of vectors $(x_1, x_2, …, x_n)$, we also have corresponding labels $(y_1, y_2, …, y_n)$ which are real numbers, finally we have a model with parameters $theta$. Now MLE is a way to find the best parameters $theta$ for the model, so that model would map $x_n$ to $hat{y}_n$ and $hat{y}_n$ is as close to $y_n$ as possible.

For each $x_n$ and $y_n$ we have a probability distribution $p(y_n|x_n,theta)$. Basically it estimates how likely our model with parameters $theta$ will output $y_n$ when we feed it $x_n$ (and the bigger the probability the better).

We then take a logarithm of each of the estimated probabilities and sum up all the logarithms, like this:

$$sum_{n=1}^Nlog{p(y_n|x_n,theta)}$$

The bigger this sum the better our model with parameters $theta$ explains the data, so we have to maximize the sum.

What I don’t understand is how do we chose the probability distribution $p(y_n|x_n,theta)$ and its parameters? In the book there is an Example 8.4, where they chose the probability distribution to be Gaussian distribution with zero mean, $epsilon_n sim mathcal{N}(0,,sigma^{2})$. They then assume that the linear model $x_n^Ttheta$ is used for prediction, so:

$$p(y_n|x_n,theta) = mathcal{N}(y_n|x_n^Ttheta,,sigma^{2})$$

and I don’t understand why they replaced zero mean with $x_n^Ttheta$, also where do we get covariance $sigma^{2}$?

So this is my question, how do we chose the probability distribution and it’s parameters? In the example above the distribution is Gaussian but it could be any other distribution from those that exist and different distributions have different types and numbers of parameters. Also as I understood each $x_n$ and $y_n$ have its own probability distribution $p(y_n|x_n,theta)$ which even more complicates the problem.

I would really appreciate your help. Also note that I’m just learning the math for machine learning and not very skilled. If you need any additional info please ask in the comments.

Thanks!