Possible bug in some probability commands

Following an example from the documentation

Expectation[x^2 + 3 y^2 + 11, {x, y} \[Distributed] DirichletDistribution[{1, 4, 5}]]


, I try

Expectation[x^2 + y^2, {x, y} \[Distributed] UniformDistribution[{-1, 2}]]

, but the input is returned unevaluated. The following works well:

Expectation[x^2 + y^2, {x \[Distributed] UniformDistribution[{-1, 2}], y \[Distributed] UniformDistribution[{-1, 2}]}]


The same issue occurs with some other commands.
Am I using the correct syntax in {x, y} \[Distributed] UniformDistribution[{-1, 2}]?
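For reference, the value the working per-variable form should return can be cross-checked outside Mathematica. Here is a minimal Python sketch, assuming x and y are meant to be i.i.d. Uniform(-1, 2):

```python
# A cross-check in Python (not Mathematica), assuming x and y are meant to be
# i.i.d. Uniform(-1, 2): by linearity E[x^2 + y^2] = 2 E[x^2], and for
# Uniform(a, b) the second moment is (b^3 - a^3) / (3 (b - a)).
a, b = -1.0, 2.0
second_moment = (b ** 3 - a ** 3) / (3 * (b - a))  # E[x^2]
expectation = 2 * second_moment                    # E[x^2 + y^2]
print(second_moment, expectation)
```

The two-clause Mathematica form above should return the same value, 2.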

probability distributions – Can integration be done on this function?

I have recently been working on a project that tries to establish an upper probability bound for a function of $d$, $n$, and $\epsilon$; here is the function:

$e^{2n(-2d^2-\epsilon^2+2\epsilon d)}$.

I am trying to determine a partial derivative with respect to $d$. I know that, just like the PDF, this function cannot be integrated in terms of an elementary antiderivative. Apparently, my knowledge of calculus is not sufficient to make further progress.

There are two things that will help me tremendously:

  1. Is there a way to integrate this function with respect to $d$?
  2. Is there a way to establish an upper bound on the integral with respect to $d$?
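On question 1, the exponent is quadratic in $d$, so the integral over all of $\mathbb{R}$ does have a closed form via completing the square. A rough numeric check of this (with arbitrarily chosen values of $n$ and $\epsilon$, my own illustration):

```python
import math

# Illustrative check with arbitrarily chosen n and eps (my own values):
# completing the square in d gives
#   2n(-2d^2 - eps^2 + 2 eps d) = -4n (d - eps/2)^2 - n eps^2,
# so the integral of f over all of R equals exp(-n eps^2) * sqrt(pi / (4 n)).
n, eps = 3.0, 0.5

def f(d):
    return math.exp(2 * n * (-2 * d * d - eps * eps + 2 * eps * d))

# crude midpoint-rule approximation over a wide interval
h, total, d = 1e-4, 0.0, -10.0
while d < 10.0:
    total += f(d + h / 2) * h
    d += h

closed = math.exp(-n * eps ** 2) * math.sqrt(math.pi / (4 * n))
print(total, closed)
```

Over a finite interval in $d$, the same substitution reduces the integral to the error function.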


probability theory – Expected Value of Nonnegative Identically Distributed Random Variables

Let $X_1, X_2 \geq 0$ be two nonnegative identically distributed random variables. I wonder if the following equation holds:

$$\mathbb{E}[X_1X_2] = \mathbb{E}[Y^2],$$

where $Y$ is a random variable having the common distribution of $\{X_1, X_2\}$.

My attempt: We know that, in general, $\mathbb{E}[X_1X_2] \neq \mathbb{E}[X_1^2]$: one can take $X_1 \in \{-1,1\}$ with probability $1/2$ each and set $X_2 := -X_1$, so that $X_1$ and $X_2$ are identically distributed but $\mathbb{E}[X_1X_2] = -1 \neq 1 = \mathbb{E}[X_1^2]$. However, nonnegativity excludes this counterexample.

Also, if we look at
$$\mathbb{E}[Y^2] = \int y^2 \, dF_{Y} = \int y^2 \, dF_{X_1,X_2},$$
where the last equality holds since $Y$ has the common joint distribution of $\{X_1,X_2\}$, I get stuck trying to see whether this is equal to
$$\mathbb{E}[X_1X_2] = \int x_1 x_2 \, dF_{X_1,X_2}.$$
Any comment is appreciated.
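For what it's worth, here is a small finite check (my own toy example, not from the question) suggesting the identity can fail when nothing is assumed about the joint law:

```python
# A quick finite check (my own toy example, not from the question): let
# X1, X2 be independent, each uniform on {0, 1}, so Y has the same marginal
# law. Then E[X1 X2] = E[X1] E[X2] = 1/4, while E[Y^2] = 1/2, so the
# proposed identity fails without assumptions on the joint distribution.
vals = [0, 1]
e_x1x2 = sum(x1 * x2 * 0.25 for x1 in vals for x2 in vals)  # joint prob 1/4 each
e_y2 = sum(y * y * 0.5 for y in vals)
print(e_x1x2, e_y2)
```

Both variables here are nonnegative and identically distributed, so nonnegativity alone is not enough.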

probability – If $f(y)=(1-\theta)+2\theta y,$ find an estimator of $\theta$ by maximum likelihood estimation.

The random variable $Y$ has a probability density function

$f(y)=(1-\theta)+2\theta y,$ for $0 < y < 1$

and $-1 < \theta < 1$. There are $n$ observations $y_i$, $i = 1, 2, \ldots, n$, drawn independently from this distribution. Derive the maximum likelihood estimator of $\theta$.

I have tried to differentiate the log-likelihood function, but I am stuck at that step. I am not sure how to find an expression for the estimator. Any help would be highly appreciated.
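As a numerical sanity check (my own sketch, not a closed-form answer), the log-likelihood is concave on $(-1,1)$, so even a coarse grid search locates the MLE. The data below are simulated from the density with a known $\theta$ purely for illustration:

```python
import math
import random

# A numerical sanity check (my own sketch, not a closed-form answer): the
# log-likelihood l(theta) = sum_i log((1 - theta) + 2 theta y_i) can be
# maximized by a coarse grid search over (-1, 1). The data are simulated
# from the density with a known theta purely for illustration.
random.seed(0)
true_theta = 0.5

def sample_y(theta):
    # inverse-CDF sampling: F(y) = (1 - theta) y + theta y^2 on (0, 1)
    u = random.random()
    return (-(1 - theta) + math.sqrt((1 - theta) ** 2 + 4 * theta * u)) / (2 * theta)

ys = [sample_y(true_theta) for _ in range(2000)]

def loglik(theta):
    return sum(math.log((1 - theta) + 2 * theta * y) for y in ys)

grid = [i / 100 for i in range(-99, 100)]
mle = max(grid, key=loglik)
print(mle)
```

Comparing the grid maximizer against the known simulation parameter is a useful way to check any candidate closed-form expression.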

probability – Please can anyone help me step by step? I'm so confused about this subject

**a)** A fair coin is tossed 4 times. What is the probability that it lands on heads each time?
**b)** You have just tossed a fair coin 4 times and it landed on heads each time. If you toss that coin again, what is the probability that it will land on heads?
**c)** Give examples of two independent events.
**d)** Dependent events are (sometimes, always, never) (choose one) mutually exclusive.
**e)** If you were studying the effect that eating a healthy breakfast has on a child getting good grades in school, which of these probabilities would be more relevant for you to know: a) P(high grades | healthy breakfast) or b) P(healthy breakfast | high grades)?
**f)** 60% of a firm's employees are female, and 38% of the firm's employees are female office workers. What percentage of the females who work in the firm are office workers?
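A sketch of the arithmetic for parts (a), (b), and (f), using only the multiplication rule for independent tosses and the definition of conditional probability:

```python
# Arithmetic sketch for parts (a), (b), and (f); standard formulas only.
p_all_heads = 0.5 ** 4               # (a): four independent tosses multiply
p_next_head = 0.5                    # (b): past tosses do not affect the next
# (f): P(office worker | female) = P(female office worker) / P(female)
p_office_given_female = 0.38 / 0.60
print(p_all_heads, p_next_head, round(100 * p_office_given_female, 1))
```

Part (b) is the same coin as part (a): the four previous heads change nothing about the next toss.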

probability theory – Donsker and Varadhan inequality proof without absolute continuity assumption

I've been attempting to understand the proof of the Donsker-Varadhan dual form of the Kullback-Leibler divergence, defined by
$$\operatorname{KL}(\mu \| \lambda) = \begin{cases} \int_X \log\left(\frac{d\mu}{d\lambda}\right) \, d\mu, & \text{if $\mu \ll \lambda$ and $\log\left(\frac{d\mu}{d\lambda}\right) \in L^1(\mu)$,} \\ \infty, & \text{otherwise,} \end{cases}$$
with Donsker-Varadhan dual form
$$\operatorname{KL}(\mu \| \lambda) = \sup_{\Phi \in \mathcal{C}} \left(\int_X \Phi \, d\mu - \log \int_X \exp(\Phi) \, d\lambda\right).$$

Many of the steps in the proof are helpfully outlined here: Reconciling Donsker-Varadhan definition of KL divergence with the “usual” definition, and I can follow along readily.

However, a crucial first step is establishing that, for any function $\Phi$,
$$\operatorname{KL}(\mu \| \lambda) \ge \int \Phi \, d\mu - \log \int e^{\Phi} \, d\lambda,$$

said to be an immediate consequence of Jensen's inequality. I can prove this easily in the case when $\mu \ll \lambda$ and $\lambda \ll \mu$:

$$\operatorname{KL}(\mu \| \lambda) - \int \Phi \, d\mu = \int \left( -\log\left(\frac{e^{\Phi}}{d\mu/d\lambda}\right) \right) d\mu \ge -\log \int \frac{e^{\Phi}}{d\mu/d\lambda} \, d\mu = -\log \int \exp(\Phi) \, d\lambda.$$
However, this last step appears to rely crucially on the existence of $d\lambda/d\mu$, and thus on $\lambda \ll \mu$, which isn't assumed by the overall theorem. Where I have been able to find proofs of the above in the machine learning literature, this assumption seems to be made implicitly, but I don't believe it is necessary, and it is very restrictive.

My question is: how can we prove this inequality without assuming $\lambda \ll \mu$?
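As a sanity check (not a proof), the lower bound can be verified numerically on a small discrete example where everything is a finite sum and no absolute-continuity subtleties arise; the vectors below are arbitrary choices of mine:

```python
import math

# A discrete sanity check (my own toy example): for probability vectors
# mu, lam on a finite set and any test function phi, the bound
#   KL(mu || lam) >= sum_i phi_i mu_i - log sum_i exp(phi_i) lam_i
# must hold; here everything is a finite sum.
mu = [0.5, 0.3, 0.2]
lam = [0.2, 0.3, 0.5]
phi = [1.0, -0.5, 0.3]  # an arbitrary test function
kl = sum(m * math.log(m / l) for m, l in zip(mu, lam))
lower = sum(p * m for p, m in zip(phi, mu)) - math.log(
    sum(math.exp(p) * l for p, l in zip(phi, lam)))
print(kl, lower)
```

The gap closes when $\phi_i = \log(\mu_i/\lambda_i)$, which is the maximizer suggested by the dual form.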

probability – Queues and blockers

In a single-file queue of $n$ people with distinct heights, define a blocker to be someone who is either taller than the person standing immediately behind them, or the last person in the queue. For example, suppose that Ashanti has height $a,$ Blaine has height $b,$ Charlie has height $c,$ Dakota has height $d,$ and Elia has height $e,$ and that $a<b<c<d<e.$ If these five people lined up in the order Ashanti, Elia, Charlie, Blaine, Dakota (from front to back), then there would be three blockers: Elia, Charlie, and Dakota. For integers $n ge 1$ and $k ge 0,$ let $Q(n,k)$ be the number of ways that $n$ people can queue up such that there are exactly $k$ blockers.

(a) Show that
$$Q(3,2) = 2 \cdot Q(2,2) + 2 \cdot Q(2,1).$$

(b) Show that for $n \ge 2$ and $k \ge 1,$
$$Q(n,k) = k \cdot Q(n-1,k) + (n-k+1) \cdot Q(n-1,k-1).$$
(You can assume that $Q(1,1)=1,$ and that $Q(n,0)=0$ for all $n.$)

I think that if I somehow prove part (b) it proves part (a), so I'm focusing on part (b), but I don't know what to do.
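Since the cases involved are small, a brute-force count over permutations (my own verification script, not a proof) can confirm both identities before attempting the combinatorial argument:

```python
from itertools import permutations

# Brute-force verification (my own check, not a proof): count blockers
# directly over all orderings of n distinct heights.
def blockers(perm):
    # position i is a blocker if taller than the person immediately behind,
    # or if i is the last position in the queue
    n = len(perm)
    return sum(1 for i in range(n) if i == n - 1 or perm[i] > perm[i + 1])

def Q(n, k):
    return sum(1 for p in permutations(range(n)) if blockers(p) == k)

print(Q(3, 2), 2 * Q(2, 2) + 2 * Q(2, 1))   # part (a)
print(Q(5, 3), 3 * Q(4, 3) + 3 * Q(4, 2))   # one instance of part (b)
```

Seeing which permutations fall in each count for small $n$ often suggests the insertion argument behind the recurrence.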

probability – Learning how summations work

Before I begin, I want to say that the original question asks: "A couple decides to continue to have children until a daughter is born. What is the expected number of children until a daughter is born?"

Below is the solution from the back of the textbook (Statistical Inference, Second Edition, Roger L. Berger, George Casella).
What I do not understand is how they manipulate the summations so fluently.

First of all, how do they get
$$\sum^\infty_{k=1}k(1-p)^{k-1}p = -p\sum^\infty_{k=1}\frac{d}{dp}(1-p)^{k}?$$

Next, how do they simplify this as well: $$-p\frac{d}{dp}\bigg(\sum^\infty_{k=0}(1-p)^k-1\bigg) = -p\frac{d}{dp}\bigg(\frac{1}{p}-1\bigg) = \frac{1}{p}$$

I would like this to be shown and explained intuitively, because the statistics course I am taking requires me to be able to work with summations on my own, and it seems to be a huge part of understanding the subject. Thank you in advance.
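Both steps can be checked numerically (my own sanity checks, not the book's): the first uses the term-by-term identity $k(1-p)^{k-1} = -\frac{d}{dp}(1-p)^{k}$, and the second the geometric-series value $\sum_{k\ge 0}(1-p)^k = 1/p$:

```python
# Numeric sanity checks (my own, not the book's) of the two steps:
# (1) the term identity k (1-p)^(k-1) = -d/dp (1-p)^k, via a central
#     finite difference, and
# (2) the final value: sum_{k>=1} k (1-p)^(k-1) p = 1/p.
p, h, k = 0.3, 1e-6, 5
term = k * (1 - p) ** (k - 1)
deriv = ((1 - (p + h)) ** k - (1 - (p - h)) ** k) / (2 * h)  # d/dp of (1-p)^k
partial = sum(j * (1 - p) ** (j - 1) * p for j in range(1, 2000))
print(term, -deriv, partial, 1 / p)
```

The intuition: differentiating $(1-p)^k$ with respect to $p$ pulls down the factor $k$ and reduces the exponent, which is exactly the shape of each term in the expectation sum.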

machine learning – How to choose the probability distribution and its parameters in maximum likelihood estimation

I'm reading the book "Mathematics for Machine Learning"; it's a free book that you can find here. I'm reading Section 8.3 of the book, which explains maximum likelihood estimation (MLE).
This is my understanding of how MLE works in machine learning:

Say we have a dataset of vectors $(x_1, x_2, \ldots, x_n)$; we also have corresponding labels $(y_1, y_2, \ldots, y_n)$, which are real numbers; finally, we have a model with parameters $\theta$. Now MLE is a way to find the best parameters $\theta$ for the model, so that the model maps $x_n$ to $\hat{y}_n$ with $\hat{y}_n$ as close to $y_n$ as possible.

For each $x_n$ and $y_n$ we have a probability distribution $p(y_n|x_n,\theta)$. Basically, it estimates how likely our model with parameters $\theta$ is to output $y_n$ when we feed it $x_n$ (and the bigger the probability, the better).

We then take the logarithm of each of the estimated probabilities and sum up all the logarithms, like this:

$$\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta).$$

The bigger this sum, the better our model with parameters $\theta$ explains the data, so we have to maximize the sum.

What I don't understand is how we choose the probability distribution $p(y_n|x_n,\theta)$ and its parameters. In the book there is Example 8.4, where they choose the probability distribution to be a Gaussian with zero-mean noise, $\epsilon_n \sim \mathcal{N}(0,\,\sigma^{2})$. They then assume that the linear model $x_n^T\theta$ is used for prediction, so:
$$p(y_n|x_n,\theta) = \mathcal{N}(y_n \mid x_n^T\theta,\,\sigma^{2}),$$
and I don't understand why they replaced the zero mean with $x_n^T\theta$, or where we get the variance $\sigma^{2}$.

So this is my question: how do we choose the probability distribution and its parameters? In the example above the distribution is Gaussian, but it could be any other distribution, and different distributions have different types and numbers of parameters. Also, as I understand it, each pair $(x_n, y_n)$ has its own probability distribution $p(y_n|x_n,\theta)$, which complicates the problem even more.

I would really appreciate your help. Also note that I'm just learning the math for machine learning and am not very skilled. If you need any additional info, please ask in the comments.
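A small sketch of the modeling assumption in Example 8.4 (the numbers below are illustrative choices of mine, not from the book): assuming $y_n = x_n^T\theta + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, the zero-mean noise shifts the Gaussian's center to the prediction, giving $p(y_n|x_n,\theta) = \mathcal{N}(y_n \mid x_n^T\theta, \sigma^2)$:

```python
import math

# Illustrative sketch (toy numbers of my own choosing): under the assumption
# y_n = x_n^T theta + eps_n with eps_n ~ N(0, sigma^2), the likelihood of
# y_n is a Gaussian centered at the model's prediction x_n^T theta.
def gauss_logpdf(y, mean, sigma2):
    return -0.5 * math.log(2 * math.pi * sigma2) - (y - mean) ** 2 / (2 * sigma2)

xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # toy inputs with a bias feature
ys = [0.9, 2.1, 3.0]
theta = [1.0, 1.0]                           # candidate parameters
sigma2 = 0.25                                # assumed noise variance
preds = [sum(t * xi for t, xi in zip(theta, x)) for x in xs]
loglik = sum(gauss_logpdf(y, m, sigma2) for y, m in zip(ys, preds))
print(preds, loglik)
```

Note that maximizing this log-likelihood over $\theta$ is equivalent to minimizing the squared residuals, which is one common answer to "why Gaussian": it recovers least squares.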


probability – What’s more likely: $7$-digit number with no $1$’s or at least one $1$ among its digits?

A $7$-digit number is chosen at random. Which is more likely: the number has no $1$'s among its digits, or the number has at least one $1$ among its digits?

Here's how I did it: The question asks which is larger: $8 \cdot 9^6$ (the number of $7$-digit numbers with no $1$'s among their digits) or $9 \cdot 10^6 - 8 \cdot 9^6$ (the number with at least one $1$ among their digits). Some tedious multiplying shows that $8 \cdot 9^6 = 4251528 < 4500000$, which demonstrates that $9 \cdot 10^6 - 8 \cdot 9^6 > 4500000$, i.e. a number having at least one $1$ among its digits is more likely.

However, I am wondering if there is a slicker way to get the answer without having to do any tedious multiplication.
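The counting above can be verified directly (a quick check of the arithmetic):

```python
# Direct count (a quick check of the arithmetic): 7-digit numbers run from
# 1000000 to 9999999, so there are 9 * 10**6 of them in total.
total = 9 * 10 ** 6
no_ones = 8 * 9 ** 6    # leading digit from {2..9}, the other six from {0,2..9}
at_least_one = total - no_ones
print(no_ones, at_least_one)
```

One multiplication-free route: it suffices to show $8 \cdot 9^6 < \tfrac{1}{2} \cdot 9 \cdot 10^6$, i.e. $(9/10)^6 < 9/16$, which follows since $(9/10)^2 = 81/100 < 9/10$ and so $(9/10)^6 < (9/10)^3 = 729/1000 \cdot \ldots$ requires care; bounding powers of $9/10$ this way avoids the full product.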