Chapter 2 Exercises

Some solutions might be off

NOTE: This part requires some basic understanding of calculus.
These are just my solutions to the exercises in the book Reinforcement Learning: An Introduction; all credit for the book goes to its authors and other contributors. Complete notes can be found here. If there are any problems with the solutions, or you have some ideas, ping me at bonde.yash97@gmail.com.

2.6 Mysterious Spikes

The Q-value update is \(Q \leftarrow Q + \alpha(R - Q)\) (the optimistic method in Figure 2.3 uses a constant step size \(\alpha = 0.1\)) and the initial estimates are set to \(Q_1(a) = +5\). Since every reward is drawn from a distribution whose mean is far below +5, each action the agent tries is "disappointing", which forces it to explore all the options early on. Once all the actions have been tried, the greedy choice often lands on the truly best action \(\operatorname{argmax}_a q_*(a)\), which produces the early spike in reward. But that action's estimate is still optimistic, so the updates keep pulling it down and the agent switches away again; this is what causes the oscillations before the estimates settle and the curve eventually starts to increase again.
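A minimal simulation illustrates this. It is only a sketch: it assumes the standard 10-armed Gaussian testbed from the book (arm means drawn from \(\mathcal{N}(0,1)\), rewards from \(\mathcal{N}(q_*(a),1)\)), and the function name and parameters below are my own.

```python
import numpy as np

def optimistic_greedy(steps=200, arms=10, alpha=0.1, q_init=5.0, seed=0):
    """Purely greedy agent with optimistic initial values on one 10-armed testbed."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0, 1, arms)        # true action values
    Q = np.full(arms, q_init)              # optimistic estimates Q_1(a) = +5
    rewards = np.empty(steps)
    for t in range(steps):
        a = int(np.argmax(Q))              # greedy action selection
        r = rng.normal(q_star[a], 1)       # reward ~ N(q*(a), 1)
        Q[a] += alpha * (r - Q[a])         # constant step-size update
        rewards[t] = r
    return rewards

# Averaged over many runs, the early steps show the oscillation: a jump once every
# optimistic estimate has been pulled down, then a dip, then a slow recovery.
avg = np.mean([optimistic_greedy(seed=s) for s in range(500)], axis=0)
print(avg[:20].round(2))
```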

2.7 Unbiased Constant Step-Size Trick

Sample averages are not a completely satisfactory solution because they perform poorly on non-stationary problems. One way to avoid the initial bias of a constant step size while retaining its advantages on non-stationary problems is to use the step size \(\beta_n \doteq \frac{\alpha}{\bar{o}_n}\), where \[\bar{o}_n \doteq \bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1}), \qquad \bar{o}_0 \doteq 0 \tag{1} \] We have to show that \(Q_n\) is an exponential recency-weighted average without initial bias. The solution can be given in the following manner: \[Q_{n+1} = Q_n + \beta_n(R_n - Q_n)\] \[= (1 - \beta_n)Q_n + \beta_n R_n\] \[= \left(1 - \frac{\alpha}{\bar{o}_n}\right)Q_n + \frac{\alpha}{\bar{o}_n}R_n\] Multiplying both sides by \(\bar{o}_n\), \[\bar{o}_n Q_{n+1} = (\bar{o}_n - \alpha)Q_n + \alpha R_n \tag{2} \] Now we substitute the value of \(\bar{o}_n\) from \((1)\) into \((2)\): \[\bar{o}_n Q_{n+1} = \left[\bar{o}_{n-1} + \alpha(1 - \bar{o}_{n-1}) - \alpha \right]Q_n + \alpha R_n\] \[\bar{o}_n Q_{n+1} = (1 - \alpha)\bar{o}_{n-1}Q_n + \alpha R_n\] This is exactly the recursion obeyed by the constant-step-size estimate, whose expansion (equation 2.6 in the book) is \[Q_{n+1} = (1-\alpha)^nQ_1 + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i \tag{3} \] Comparing \((2)\) with \((3)\), with \(\bar{o}_0 Q_1\) playing the role of the initial value, we get \[\bar{o}_n Q_{n+1} = (1-\alpha)^n\bar{o}_0Q_1 + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i\] and since \(\bar{o}_0 = 0\), \[Q_{n+1} = \frac{1}{\bar{o}_n}\sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i\] Thus \(Q_{n+1}\) does not depend on \(Q_1\) at all, so there is no initial bias, and it is an exponential recency-weighted average: the weight \(\alpha(1-\alpha)^{n-i}\) on \(R_i\) decays exponentially, so older rewards are given less importance.
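A quick numerical check of this result, written as a sketch with my own function names: feeding the same reward sequence through the \(\beta_n = \alpha/\bar{o}_n\) update with very different initial values \(Q_1\) yields the same estimate, and that estimate matches the closed form derived above.

```python
import numpy as np

def unbiased_constant_step(rewards, alpha=0.1, q1=5.0):
    """Update Q with beta_n = alpha / o_bar_n, where o_bar follows equation (1)."""
    Q, o_bar = q1, 0.0                        # o_bar_0 = 0
    for r in rewards:
        o_bar = o_bar + alpha * (1 - o_bar)   # equation (1)
        beta = alpha / o_bar
        Q = Q + beta * (r - Q)
    return Q

rng = np.random.default_rng(0)
rewards = rng.normal(0.3, 1.0, size=50)

# Same reward sequence, wildly different initial values -> identical estimate,
# because beta_1 = alpha / o_bar_1 = 1 wipes out Q_1 on the very first update.
print(unbiased_constant_step(rewards, q1=+5.0))
print(unbiased_constant_step(rewards, q1=-100.0))

# It also matches the closed form Q_{n+1} = (1/o_bar_n) * sum_i alpha (1-alpha)^{n-i} R_i.
n, alpha = len(rewards), 0.1
o_bar_n = 1 - (1 - alpha) ** n                # closed form of (1) with o_bar_0 = 0
weights = alpha * (1 - alpha) ** (n - 1 - np.arange(n))
print((weights @ rewards) / o_bar_n)
```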

2.8 Spikes in UCB Algorithm

Upper Confidence Bound (UCB) action selection uses the formula \[A_t \doteq \underset{a}{\operatorname{argmax}} \left(Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right)\] The spike occurs on the 11th step, and the problem is a 10-armed bandit. Initially \(N_t(a) = 0\) for every action, and untried actions are treated as maximizing, so each of the first 10 steps tries a new action. On the 11th step every action has been taken once, and the agent can now take the action that has looked best so far (note that at this point \(N_t(a) = 1\) for every action, so the confidence bonus is equal across actions and the choice is effectively greedy).

From the 12th step onward we still do not have good estimates of \(q_*(a)\), and the action chosen on the spike step now has a smaller confidence bonus, so the algorithm is pushed back towards the other options and the reward subsequently decreases. However, since all the options have already been explored once, UCB gets a head start over the \(\epsilon\)-greedy method, and as \(t \rightarrow \infty\) the rewards start to increase again.
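Under the same 10-armed Gaussian testbed assumption as before, a small sketch of UCB action selection shows the step-11 spike; the helper and its parameters are my own, following the book's convention that an untried action (with \(N_t(a) = 0\)) is treated as maximizing.

```python
import numpy as np

def ucb_bandit(steps=200, arms=10, c=2.0, seed=0):
    """UCB action selection on one 10-armed Gaussian testbed (a sketch)."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0, 1, arms)
    Q = np.zeros(arms)                     # sample-average estimates
    N = np.zeros(arms)                     # pull counts
    rewards = np.empty(steps)
    for t in range(1, steps + 1):
        if np.any(N == 0):
            a = int(np.argmin(N))          # untried actions are maximizing by convention
        else:
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
        r = rng.normal(q_star[a], 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental sample-average update
        rewards[t - 1] = r
    return rewards

# Averaged over runs, reward spikes at step 11 (the first "informed" greedy pick),
# dips as the shrinking bonus pushes the agent back to other arms, then rises.
avg = np.mean([ucb_bandit(seed=s) for s in range(500)], axis=0)
print(avg[8:15].round(2))
```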

2.9 Softmax and Sigmoid

Let there be two actions \(\{A_1, A_2\}\) with corresponding preferences \(\{a_1, a_2\}\). The logistic (sigmoid) function of a score \(x\) is \[\sigma(x) = \frac{1}{1 + e^{-x}}\] The soft-max probability of selecting \(A_1\) is \[softmax(a_1) = \frac{e^{a_1}}{e^{a_1} + e^{a_2}} = \frac{1}{1 + e^{-(a_1 - a_2)}} = \sigma(a_1 - a_2)\] So the probability of selecting \(A_1\) is exactly the sigmoid of the preference difference \(a_1 - a_2\), as in logistic regression, and it exceeds \(1/2\) precisely when \(a_1 > a_2\). Thus the soft-max distribution over two actions is the same as the logistic (sigmoid) distribution, and softmax and logistic behave in the same way for a two-action game.
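A small numerical check (a sketch, helper names are mine) confirms the algebra: the soft-max probability of \(A_1\) equals the sigmoid of the preference difference \(a_1 - a_2\).

```python
import numpy as np

def softmax(prefs):
    """Numerically stable soft-max over a vector of preferences."""
    z = prefs - np.max(prefs)
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a1, a2 = 1.3, -0.4
p_softmax = softmax(np.array([a1, a2]))[0]   # Pr(A_1) under the soft-max
p_sigmoid = sigmoid(a1 - a2)                 # sigmoid of the preference difference
print(p_softmax, p_sigmoid)                  # identical up to floating-point error
```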