Consider an MDP with $n$ states, $k$ actions, and discount factor $\gamma \in (0,1)$. We are uncertain about its reward function $R \in \mathbb{R}^{n \times k}$ and transition function $T \in \mathbb{R}^{n \times k \times n}$. More precisely,

\begin{align}
R_{s,a} &\sim \text{Beta}(\alpha_{s,a}, \beta_{s,a}) \\
T_{s,a} &\sim \text{Dirichlet}(\delta_{s,a})
\end{align}

where $\alpha, \beta \in (0,\infty)^{n \times k}$ and $\delta \in (0,\infty)^{n \times k \times n}$. The Q function $Q \in \mathbb{R}^{n \times k}$ of our MDP is defined to be the solution to the equation

\begin{align}
Q_{s,a} &= R_{s,a} + \gamma \sum_{s'} T_{s,a,s'} \max_{a'} Q_{s',a'}
\end{align}

which can be formulated as a linear program. So the Q function is distributed as

\begin{align}
Q_{s,a} &\sim \text{D}(\alpha, \beta, \delta)_{s,a}
\end{align}

where $\mathrm{D}$ is an unknown distribution. My question is the following: **What do we know about $\text{D}$?** I am particularly interested in the quantile function $f : (0,1) \rightarrow \mathbb{R}$

\begin{align}
f(p) = \inf \{ x \in \mathbb{R} : p \leq \Pr(Q_{s,a} \leq x) \}
\end{align}

because this gives a Bayesian upper confidence bound on the Q function of our MDP. For a discussion of such bounds in the context of multi-armed bandit problems, see *On Bayesian Upper Confidence Bounds for Bandit Problems* by Kaufmann et al.

If the quantiles of this distribution cannot be obtained in closed form, what is the fastest way to approximate them, beyond sampling several MDPs (that is, several $R$s and $T$s) from our belief distribution, solving them, and observing the empirical distribution of $Q_{s,a}$?
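For concreteness, the sampling baseline I would like to improve on can be sketched roughly as follows. This is only a minimal illustration under my own assumptions: the function names (`sample_mdp`, `solve_q`, `empirical_quantile`, `q_quantile`) are made up for this post, each sampled MDP is solved by plain value iteration rather than by the linear program, and only the Python standard library is used (Dirichlet draws are obtained by normalizing independent Gamma draws).

```python
import math
import random


def sample_mdp(alpha, beta, delta, rng):
    """Draw one (R, T) pair from the Beta/Dirichlet belief."""
    n, k = len(alpha), len(alpha[0])
    R = [[rng.betavariate(alpha[s][a], beta[s][a]) for a in range(k)]
         for s in range(n)]
    T = []
    for s in range(n):
        row = []
        for a in range(k):
            # A Dirichlet draw is a vector of independent Gamma draws, normalized.
            g = [rng.gammavariate(d, 1.0) for d in delta[s][a]]
            total = sum(g)
            row.append([x / total for x in g])
        T.append(row)
    return R, T


def solve_q(R, T, gamma, tol=1e-10):
    """Solve the Bellman optimality equation by value iteration (not the LP)."""
    n, k = len(R), len(R[0])
    Q = [[0.0] * k for _ in range(n)]
    while True:
        V = [max(row) for row in Q]  # V[s'] = max_{a'} Q[s'][a']
        new_q = [[R[s][a] + gamma * sum(T[s][a][t] * V[t] for t in range(n))
                  for a in range(k)] for s in range(n)]
        diff = max(abs(new_q[s][a] - Q[s][a])
                   for s in range(n) for a in range(k))
        Q = new_q
        if diff < tol:
            return Q


def empirical_quantile(samples, p):
    """f(p) = inf{x : p <= F_m(x)}, with F_m the empirical CDF of the samples."""
    xs = sorted(samples)
    return xs[max(math.ceil(p * len(xs)) - 1, 0)]


def q_quantile(alpha, beta, delta, gamma, s, a, p, m=500, seed=0):
    """Monte Carlo estimate of the p-quantile of Q[s][a] from m sampled MDPs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(m):
        R, T = sample_mdp(alpha, beta, delta, rng)
        draws.append(solve_q(R, T, gamma)[s][a])
    return empirical_quantile(draws, p)
```

A call such as `q_quantile(alpha, beta, delta, 0.9, s=0, a=0, p=0.95)` then returns the empirical 95% quantile of $Q_{0,0}$. The cost is one full MDP solve per sample, which is exactly what I would like to avoid.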