Thursday, June 11, 2015

Robot Navigation - Q learning algorithm


The aim of this lab is to understand the reinforcement learning part of the autonomous robots course and to implement a reinforcement learning algorithm that learns a policy for moving a robot to a goal position. The algorithm used is Q-learning, implemented in MATLAB.

1 - Introduction

Reinforcement learning does not require the robot to plan a path with a path planning algorithm. Instead, the robot learns an optimal solution by moving around the map many times, at first largely at random. This approximates a natural learning process, in which an unknown problem is solved by trial and error. The following sections briefly discuss the implementation and the results obtained.

Environment: The environment used for this lab experiment is shown below.

Figure-1: Environment used for the implementation.

States and Actions: The given environment is a 20$\times$14 grid, which gives 280 states. The robot has only 4 actions: ←, ↑, →, ↓. The Q table therefore has 280$\times$4 = 1120 entries.
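As a minimal sketch of this bookkeeping (written in Python for brevity rather than the MATLAB used in the lab; the names `WIDTH`, `HEIGHT`, `ACTIONS`, `state_index` and `new_q_table` are illustrative assumptions), the Q-values can be held in one table with a row per cell and a column per action:

```python
# Illustrative sketch only; names and layout are assumptions, not the lab code.
WIDTH, HEIGHT = 20, 14                       # grid size given in the text
ACTIONS = ["left", "up", "right", "down"]    # the four moves listed above


def state_index(x, y):
    """Map 0-based cell coordinates (x, y) to a single state number."""
    return y * WIDTH + x


def new_q_table():
    """One row per state, one column per action, all initialised to zero."""
    return [[0.0] * len(ACTIONS) for _ in range(WIDTH * HEIGHT)]


q = new_q_table()
n_cells = sum(len(row) for row in q)   # 280 states x 4 actions
```

Storing the same numbers as four separate 14$\times$20 matrices, one per action, is an equivalent layout.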

Dynamics: The dynamics determine how the robot moves in response to an action. The robot moves one cell per iteration in the direction of the selected action, unless an obstacle or a wall is in the way, in which case it stays in the same position.
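A rough Python sketch of such a transition rule is given here. The tiny map `GRID`, the `MOVES` table and the `step` function are made up for illustration and are not the lab map or the lab code; the sketch also assumes the map is fully enclosed by walls, so a move can never leave the grid.

```python
# Illustrative sketch; the tiny map below is invented, not the lab map.
GRID = [
    "#####",
    "#..##",
    "#...#",
    "#####",
]  # '#' = wall or obstacle, '.' = free cell; indexed as GRID[y][x]

MOVES = {"left": (-1, 0), "up": (0, -1), "right": (1, 0), "down": (0, 1)}


def step(x, y, action):
    """Return the cell reached from (x, y); stay put if the move is blocked."""
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if GRID[ny][nx] == "#":      # blocked by a wall or an obstacle
        return x, y
    return nx, ny
```

For example, `step(1, 1, "right")` moves to `(2, 1)`, while `step(2, 1, "right")` is blocked and stays at `(2, 1)`.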

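Reinforcement function: The reinforcement function assigns a reward to each cell: +1 for the goal cell and -1 for every other cell. Because every step that does not reach the goal costs 1, shorter paths collect a higher total reward.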
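In Python this could look like the sketch below. It assumes the reward is given for the cell the robot arrives in; the names `GOAL` and `reward` are illustrative, and the goal coordinates are the ones quoted in the implementation section of this post.

```python
# Illustrative sketch; assumes the reward depends only on the cell arrived in.
GOAL = (18, 3)  # goal position quoted in the implementation section


def reward(next_cell):
    """+1 when the move lands on the goal, -1 for every other move."""
    return 1.0 if next_cell == GOAL else -1.0
```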

2 - The Algorithm

Q-Learning is an Off-Policy algorithm for Temporal Difference learning. It learns the optimal policy even when actions are selected according to a more exploratory or even random policy. The pseudo-code we used for the implementation is shown below:

Figure-2: The pseudo code of Q-learning algorithm.

  • $\alpha$ - the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, hence nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.
  • $\gamma$ - the discount factor, also set between 0 and 1. This models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be set less than 1 for the algorithm to converge.
  • $\max_{a}$ - the maximum Q-value attainable in the state following the current one, i.e. the estimated return for taking the optimal action thereafter.
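For reference, the standard Q-learning update in which these three quantities appear is written out below. It is assumed here to match the pseudo-code of Figure-2, which is not reproduced in the text; $s'$ is the next state and $a'$ ranges over the actions available there.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

With $\alpha = 0$ the bracketed correction is ignored and the old value is kept; with $\alpha = 1$ the old value is replaced entirely by $r + \gamma \max_{a'} Q(s', a')$.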

This procedural approach can be translated into plain English steps as follows:

  • Initialize the Q-values table, Q(s, a).
  • Observe the current state, s.
  • Choose an action, a, for that state based on one of the action selection policies explained in the next section ($\varepsilon$-soft, $\varepsilon$-greedy or softmax).
  • Take the action, and observe the reward, r, as well as the new state, s'.
  • Update the Q-value for the state using the observed reward and the maximum Q-value possible for the next state. The update is done according to the formula and parameters described above.
  • Set the state to the new state, and repeat the process until a terminal state is reached.
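The steps above can be sketched in a few lines of Python. This is not the MATLAB code of the lab: to stay short it uses a made-up 1-D corridor of five cells instead of the lab map, the parameter values `ALPHA`, `GAMMA` and `EPSILON` are assumptions, and the goal is treated as a terminal state whose value is zero.

```python
import random

N_STATES, GOAL = 5, 4          # corridor cells 0..4, goal at the right end
ACTIONS = [-1, +1]             # index 0 = step left, index 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # assumed values, not the lab's


def step(state, action):
    """Move one cell; bumping into either end wall leaves the state unchanged."""
    nxt = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else -1.0)


def greedy(q_row):
    """Index of the largest Q-value in one row (the first one wins ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]        # initialise Q(s, a)
    for _ in range(episodes):
        s = rng.randrange(GOAL)                      # observe the start state
        while s != GOAL:                             # until the terminal state
            if rng.random() < EPSILON:               # explore
                a = rng.randrange(len(ACTIONS))
            else:                                    # exploit
                a = greedy(q[s])
            s2, r = step(s, a)                       # act, observe r and s'
            best_next = 0.0 if s2 == GOAL else max(q[s2])
            q[s][a] += ALPHA * (r + GAMMA * best_next - q[s][a])
            s = s2                                   # move to the new state
    return q


q = train()
policy = [greedy(row) for row in q[:GOAL]]   # greedy action in each non-goal cell
```

After training, `policy` holds the index of the preferred action for every non-goal cell; in this corridor the learned preference is to step right, towards the goal.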

2.1 - Action Selection Policies

As mentioned above, there are three common policies used for action selection. The aim of these policies is to balance the trade-off between exploitation and exploration, by not always exploiting what has been learnt so far.

  • $\varepsilon$-greedy - most of the time the action with the highest estimated reward, called the greediest action, is chosen. Every once in a while, say with a small probability $\varepsilon$, an action is selected at random instead. The random action is selected uniformly, independent of the action-value estimates. This ensures that, given enough trials, every action keeps being tried, so the optimal actions are eventually discovered.
  • $\varepsilon$-soft - very similar to $\varepsilon$-greedy. The best action is selected with probability 1 - $\varepsilon$ and the rest of the time a random action is chosen uniformly.
  • softmax - one drawback of $\varepsilon$-greedy and $\varepsilon$-soft is that they select random actions uniformly. The worst possible action is just as likely to be selected as the second best. Softmax remedies this by assigning a rank or weight to each of the actions, according to their action-value estimate. A random action is then selected according to the weight associated with each action, meaning the worst actions are unlikely to be chosen. This is a good approach to take where the worst actions are very unfavourable.
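A possible Python sketch of two of these policies is shown below. The function names are illustrative, and the Boltzmann form with a `temperature` parameter is one common way to turn action values into softmax weights; the post does not say which weighting the lab used.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick uniformly at random, else pick the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann weights: higher Q-values get exponentially more probability."""
    top = max(q_values)                     # subtracted for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_choice(q_values, temperature=1.0, rng=random):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With `epsilon=0` the first function is purely greedy, and a lower `temperature` makes the softmax choice more greedy as well.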

It is not clear which of these policies produces the best results overall. The nature of the task will have some bearing on how well each policy influences learning. If the problem we are trying to solve is of a game playing nature, against a human opponent, human factors may also be influential.

3 - The Implementation

The algorithm has been implemented in MATLAB. The problem consists of finding the goal in a finite, closed 2D environment that contains some obstacles, as shown in Figure-1. The map given with the lab manual has been used to implement this algorithm. The goal position for this environment is (18,3).

Running the algorithm on the given map produced the following four Q matrices, one per action.

\tiny Q_1 = \begin{smallmatrix}
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 & 0 & -9.2943 & -9.2237 & -9.1390 & -9.0439 & -8.9383 & 0 & 0 & -8.5404 & -8.3806 & -8.2018 & -8.0029 & -7.7821 & -7.5369 & 0 & 0 & -0.3124 &  0.7610 & -0.3110 & 0 \\
 & 0 & -9.2239 & -9.1390 & -9.0441 & -8.9384 & -8.8208 & 0 & 0 & -8.3815 & -8.2028 & -8.0037 & -7.7825 & -7.5369 & -7.2640 & 0 & 0 & -0.1000 & -0.2679 & -0.1000 & 0 \\
 & 0 & -9.1390 & -9.0441 & -8.9383 & -8.8207 & -8.6900 & -8.5445 & -8.3830 & -8.2035 & -8.0042 & -7.7830 & -7.5370 & -7.2638 & -6.9605 & 0 & 0 & -0.4111 & -0.1000 & -0.3182 & 0 \\
 & 0 & -9.0443 & -8.9386 & -8.8210 & -8.6903 & -8.5449 & -8.3835 & -8.2041 & -8.0048 & -7.7833 & -7.5372 & -7.2639 & -6.9603 & -6.6232 & 0 & 0 & -1.4003 & -0.4277 & -1.4244 & 0 \\
 & 0 & -8.9396 & -8.8258 & -8.7018 & -8.5480 & -8.3945 & -8.2068 & -8.0173 & -7.7855 & -7.5375 & -7.2641 & -6.9603 & -6.6230 & -6.2482 & 0 & 0 & -2.2785 & -1.3779 & -2.2364 & 0 \\
 & 0 & -9.0512 & -8.9513 & -8.8418 & 0 & 0 & 0 & 0 & 0 & -7.2641 & -6.9604 & -6.6231 & -6.2482 & -5.8315 & 0 & 0 & -2.9556 & -2.2357 & -3.0598 & 0 \\
 & 0 & -9.1429 & -9.0484 & -8.9420 & 0 & 0 & 0 & 0 & 0 & -7.0100 & -6.6289 & -6.2864 & -5.8579 & -5.3980 & -4.9220 & -4.3757 & -3.6911 & -3.0204 & -3.6700 & 0 \\
 & 0 & -9.2117 & -9.1366 & -9.0429 & -8.9376 & -8.8202 & -8.6898 & 0 & 0 & -7.3026 & -6.9989 & -6.6591 & -6.2971 & -5.9022 & -5.4483 & -4.8861 & -4.3715 & -3.6570 & -4.2932 & 0 \\
 & 0 & -9.1386 & -9.0436 & -8.9381 & -8.8207 & -8.6899 & -8.5446 & 0 & 0 & -7.5762 & -7.2695 & -6.9658 & -6.6430 & -6.2820 & -5.8705 & -5.3831 & -4.9359 & -4.2940 & -4.9371 & 0 \\
 & 0 & -9.0524 & -8.9505 & -8.8226 & -8.7125 & -8.5453 & -8.4023 & -8.2374 & -8.0203 & -7.8171 & -7.5767 & -7.3031 & -6.9796 & -6.6225 & 0 & -5.8293 & -5.4190 & -4.9325 & -5.4252 & 0 \\
 & 0 & -9.1498 & -9.0542 & -8.9570 & -8.8325 & -8.6999 & -8.5490 & -8.4054 & -8.2301 & -8.0035 & -7.8082 & -7.5710 & -7.2651 & 0 & 0 & -6.2765 & -5.9030 & -5.3661 & -5.8961 & 0 \\
 & 0 & -9.2214 & -9.1390 & -9.0480 & -8.9367 & -8.8243 & -8.6869 & -8.5480 & -8.3945 & -8.2068 & -8.0173 & -7.7855 & 0 & 0 & 0 & -6.6225 & -6.2654 & -5.8288 & -6.2654 & 0 \\
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{smallmatrix}

\tiny Q_2 = \begin{smallmatrix}
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 & 0 & -9.3078 & -9.2723 & -9.2089 & -9.1246 & -9.0379 & 0 & 0 & -8.6219 & -8.5167 & -8.3643 & -8.1977 & -7.9979 & -7.7741 & 0 & 0 & -1.2748 & -0.3031 & -1.2584 & 0 \\
 & 0 & -9.3148 & -9.2640 & -9.2199 & -9.1365 & -9.0427 & 0 & 0 & -8.6620 & -8.5348 & -8.3782 & -8.2008 & -8.0022 & -7.7811 & 0 & 0 & -1.2781 & -0.2737 & -1.2751 & 0 \\
 & 0 & -9.2691 & -9.2023 & -9.1372 & -9.0436 & -8.9380 & -8.6892 & -8.5436 & -8.5425 & -8.3813 & -8.2017 & -8.0030 & -7.7816 & -7.5366 & 0 & 0 & -0.3165 &  0.7594 & -0.3164 & 0 \\
 & 0 & -9.2139 & -9.1368 & -9.0433 & -8.9373 & -8.8205 & -8.6892 & -8.5439 & -8.3823 & -8.2027 & -8.0040 & -7.7822 & -7.5358 & -7.2636 & 0 & 0 & -1.2849 & -0.3165 & -1.2845 & 0 \\
 & 0 & -9.1371 & -9.0435 & -8.9381 & -8.8206 & -8.6900 & -8.5444 & -8.3829 & -8.2035 & -8.0043 & -7.7827 & -7.5365 & -7.2631 & -6.9603 & 0 & 0 & -2.1564 & -1.2849 & -2.1555 & 0 \\
 & 0 & -9.0439 & -8.9387 & -8.8213 & 0 & 0 & 0 & 0 & 0 & -7.7818 & -7.5368 & -7.2635 & -6.9599 & -6.6227 & 0 & 0 & -2.9407 & -2.1564 & -2.9394 & 0 \\
 & 0 & -9.1382 & -9.0441 & -8.9387 & 0 & 0 & 0 & 0 & 0 & -7.5357 & -7.2631 & -6.9600 & -6.6228 & -6.2481 & -5.3675 & -4.8535 & -3.6466 & -2.9407 & -3.6447 & 0 \\
 & 0 & -9.2127 & -9.1368 & -9.0431 & -9.0333 & -8.9321 & -8.8181 & 0 & 0 & -7.2632 & -6.9597 & -6.6223 & -6.2472 & -5.8305 & -5.3675 & -4.8531 & -4.2817 & -3.6466 & -4.2790 & 0 \\
 & 0 & -9.2173 & -9.2076 & -9.1328 & -9.0412 & -8.9365 & -8.8201 & 0 & 0 & -7.5363 & -7.2630 & -6.9593 & -6.6217 & -6.2466 & -5.8299 & -5.3668 & -4.8524 & -4.2817 & -4.8496 & 0 \\
 & 0 & -9.2125 & -9.1378 & -9.0435 & -8.9376 & -8.8203 & -8.6896 & -8.3827 & -8.2032 & -7.7824 & -7.5363 & -7.2628 & -6.9590 & -6.6215 & 0 & -5.8256 & -5.3639 & -4.8527 & -5.3608 & 0 \\
 & 0 & -9.1382 & -9.0433 & -8.9376 & -8.8196 & -8.6889 & -8.5436 & -8.3820 & -8.2026 & -8.0033 & -7.7816 & -7.5357 & -7.2625 & 0 & 0 & -6.2373 & -5.8250 & -5.3657 & -5.8206 & 0 \\
 & 0 & -9.2149 & -9.1346 & -9.0409 & -8.9353 & -8.8179 & -8.6868 & -8.5417 & -8.3801 & -8.2008 & -8.0021 & -7.7813 & 0 & 0 & 0 & -6.6048 & -6.2378 & -5.8254 & -6.2298 & 0 \\
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{smallmatrix}

\tiny Q_3 = \begin{smallmatrix}
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 & 0 & -9.3002 & -9.3124 & -9.2598 & -9.2151 & -9.1209 & 0 & 0 & -8.6505 & -8.6573 & -8.5303 & -8.3772 & -8.1943 & -7.9950 & 0 & 0 & -1.2693 & -1.2596 & -0.3107 & 0 \\
 & 0 & -9.2481 & -9.2853 & -9.2197 & -9.1374 & -9.0432 & 0 & 0 & -8.5306 & -8.5373 & -8.3796 & -8.2016 & -8.0011 & -7.7814 & 0 & 0 & -0.3150 & -0.2689 &  0.7594 & 0 \\
 & 0 & -9.2160 & -9.2179 & -9.1371 & -9.0433 & -8.9380 & -8.8203 & -8.6893 & -8.5437 & -8.3821 & -8.2020 & -8.0034 & -7.7824 & -7.5360 & 0 & 0 & -1.2842 & -1.2848 & -0.3164 & 0 \\
 & 0 & -9.1343 & -9.1378 & -9.0436 & -8.9379 & -8.8203 & -8.6896 & -8.5438 & -8.3828 & -8.2035 & -8.0039 & -7.7824 & -7.5360 & -7.2632 & 0 & 0 & -2.1557 & -2.1558 & -1.2844 & 0 \\
 & 0 & -9.0424 & -9.0443 & -8.9386 & -8.8209 & -8.6902 & -8.5447 & -8.3835 & -8.2040 & -8.0048 & -7.7829 & -7.5367 & -7.2634 & -6.9596 & 0 & 0 & -2.9399 & -2.9403 & -2.1557 & 0 \\
 & 0 & -9.1230 & -9.1254 & -9.0429 & 0 & 0 & 0 & 0 & 0 & -7.5360 & -7.5367 & -7.2637 & -6.9598 & -6.6225 & 0 & 0 & -3.6461 & -3.6457 & -2.9394 & 0 \\
 & 0 & -9.2024 & -9.1912 & -9.1148 & 0 & 0 & 0 & 0 & 0 & -7.2627 & -7.2635 & -6.9602 & -6.6227 & -6.2481 & -5.8312 & -5.3682 & -4.8536 & -4.2807 & -3.6447 & 0 \\
 & 0 & -9.2354 & -9.2138 & -9.1890 & -9.1226 & -9.0390 & -8.9321 & 0 & 0 & -7.5343 & -7.5350 & -7.2617 & -6.9586 & -6.6214 & -6.2461 & -5.8289 & -5.3654 & -4.8515 & -4.2788 & 0 \\
 & 0 & -9.1934 & -9.2098 & -9.1368 & -9.0430 & -8.9372 & -8.8200 & 0 & 0 & -7.7816 & -7.7810 & -7.5351 & -7.2622 & -6.9580 & -6.6195 & -6.2434 & -5.8279 & -5.3634 & -4.8496 & 0 \\
 & 0 & -9.1304 & -9.1343 & -9.0433 & -8.9378 & -8.8207 & -8.6897 & -8.5443 & -8.3827 & -8.2031 & -8.0030 & -7.7807 & -7.5350 & -7.2598 & 0 & -6.1654 & -6.2230 & -5.8213 & -5.3612 & 0 \\
 & 0 & -9.1971 & -9.1956 & -9.1300 & -9.0410 & -8.9351 & -8.8179 & -8.6866 & -8.5415 & -8.3807 & -8.1995 & -8.0015 & -7.7783 & 0 & 0 & -6.5028 & -6.4498 & -6.2222 & -5.8212 & 0 \\
 & 0 & -9.2355 & -9.2192 & -9.1863 & -9.0734 & -9.0293 & -8.9198 & -8.8052 & -8.6773 & -8.5275 & -8.3586 & -8.1660 & 0 & 0 & 0 & -6.7050 & -6.7333 & -6.5172 & -6.2306 & 0 \\
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{smallmatrix}

\tiny Q_4 = \begin{smallmatrix}
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 & 0 & -9.2953 & -9.2238 & -9.1390 & -9.0438 & -9.0383 & 0 & 0 & -8.5405 & -8.3804 & -8.2017 & -8.0028 & -7.7820 & -7.7789 & 0 & 0 & -0.3121 & -1.2707 & -1.2546 & 0 \\
 & 0 & -9.2240 & -9.1391 & -9.0441 & -8.9384 & -8.9371 & 0 & 0 & -8.3815 & -8.2028 & -8.0037 & -7.7824 & -7.5369 & -7.5359 & 0 & 0 &  0.7594 & -0.2673 & -0.3122 & 0 \\
 & 0 & -9.1390 & -9.0441 & -8.9383 & -8.8208 & -8.6900 & -8.5445 & -8.3830 & -8.2035 & -8.0043 & -7.7830 & -7.5371 & -7.2639 & -7.2629 & 0 & 0 & -0.3165 & -1.2843 & -1.2825 & 0 \\
 & 0 & -9.0443 & -8.9385 & -8.8210 & -8.6902 & -8.5449 & -8.3834 & -8.2041 & -8.0048 & -7.7833 & -7.5373 & -7.2639 & -6.9603 & -6.9599 & 0 & 0 & -1.2849 & -2.1552 & -2.1530 & 0 \\
 & 0 & -8.9390 & -8.8214 & -8.6907 & -8.5454 & -8.3840 & -8.2046 & -8.0052 & -7.7836 & -7.5375 & -7.2641 & -6.9603 & -6.6230 & -6.6224 & 0 & 0 & -2.1564 & -2.9391 & -2.9376 & 0 \\
 & 0 & -9.0439 & -8.9388 & -8.9360 & 0 & 0 & 0 & 0 & 0 & -7.2641 & -6.9605 & -6.6230 & -6.2481 & -6.2473 & 0 & 0 & -2.9407 & -3.6442 & -3.6420 & 0 \\
 & 0 & -9.1381 & -9.0441 & -9.0322 & 0 & 0 & 0 & 0 & 0 & -6.9606 & -6.6233 & -6.2483 & -5.8315 & -5.3684 & -4.8538 & -4.2820 & -3.6466 & -4.2794 & -4.2763 & 0 \\
 & 0 & -9.2124 & -9.1367 & -9.0429 & -8.9375 & -8.8203 & -8.8191 & 0 & 0 & -7.2631 & -6.9595 & -6.6221 & -6.2471 & -5.8304 & -5.3675 & -4.8531 & -4.2815 & -4.8481 & -4.8453 & 0 \\
 & 0 & -9.1385 & -9.0436 & -8.9381 & -8.8207 & -8.6899 & -8.6892 & 0 & 0 & -7.5363 & -7.2630 & -6.9593 & -6.6217 & -6.2466 & -5.8297 & -5.3667 & -4.8524 & -5.3621 & -5.3403 & 0 \\
 & 0 & -9.0443 & -8.9387 & -8.8210 & -8.6901 & -8.5447 & -8.3831 & -8.2036 & -8.0041 & -7.7824 & -7.5362 & -7.2628 & -6.9590 & -6.9557 & 0 & -5.8252 & -5.3645 & -5.8162 & -5.8043 & 0 \\
 & 0 & -9.1381 & -9.0431 & -8.9374 & -8.8196 & -8.6888 & -8.5434 & -8.3820 & -8.2025 & -8.0031 & -7.7817 & -7.5356 & -7.5316 & 0 & 0 & -6.2374 & -5.8254 & -6.2076 & -6.1673 & 0 \\
 & 0 & -9.2148 & -9.1344 & -9.0405 & -8.9353 & -8.8175 & -8.6866 & -8.5413 & -8.3800 & -8.2009 & -8.0020 & -7.9625 & 0 & 0 & 0 & -6.6043 & -6.2370 & -6.4064 & -6.4853 & 0 \\
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{smallmatrix}

The resultant optimal policy is shown below:

Policy = \begin{matrix}
& o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o \\
& o & \bigtriangledown & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & o & o & \bigtriangledown & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangledown & o & o & \triangleright & \bigtriangledown & \triangleleft & o \\
& o & \bigtriangledown & \bigtriangledown & \triangleright & \bigtriangledown & \bigtriangledown & o & o & \bigtriangledown & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & \bigtriangledown & o & o & \triangleright & G & \triangleleft & o \\
& o & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & \triangleright & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & \bigtriangledown & \bigtriangledown & \bigtriangledown & \bigtriangledown & o & o & \triangleright & \bigtriangleup & \triangleleft & o \\
& o & \bigtriangledown & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangledown & \triangleright & \bigtriangledown & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & o & o & \bigtriangleup & \bigtriangleup & \triangleleft & o \\
& o & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & \bigtriangledown & o & o & \triangleright & \bigtriangleup & \bigtriangleup & o \\
& o & \bigtriangleup & \bigtriangleup & \bigtriangleup & o & o & o & o & o & \triangleright & \bigtriangledown & \triangleright & \triangleright & \bigtriangledown & o & o & \bigtriangleup & \bigtriangleup & \bigtriangleup & o \\
& o & \triangleright & \triangleright & \bigtriangleup & o & o & o & o & o & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangleup & \bigtriangleup & \triangleleft & o \\
& o & \bigtriangledown & \bigtriangledown & \bigtriangledown & \triangleright & \bigtriangledown & \bigtriangledown & o & o & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangleup & \triangleleft & o \\
& o & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangledown & \bigtriangledown & o & o & \bigtriangleup & \bigtriangleup & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangleup & \bigtriangleup & \bigtriangleup & o \\
& o & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangleup & o & \triangleright & \bigtriangleup & \bigtriangleup & \bigtriangleup & o \\
& o & \triangleright & \triangleright & \triangleright & \bigtriangleup & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangleup & \triangleright & \bigtriangleup & o & o & \bigtriangleup & \bigtriangleup & \bigtriangleup & \bigtriangleup & o \\
& o & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \triangleright & \bigtriangleup & \triangleright & \bigtriangleup & o & o & o & \triangleright & \triangleright & \bigtriangleup & \bigtriangleup & o \\
& o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o & o \\
\end{matrix}
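The policy is presumably read off the Q matrices by picking, in every cell, the action whose matrix holds the largest entry. The short Python check below does this for the cell directly above G (row 2, column 18), using the four values printed in Q_1 to Q_4 above; the function name `greedy_action` is illustrative.

```python
# Q_1 .. Q_4 entries copied from the matrices above (row 2, column 18).
q_cell = [0.7610, -0.3031, -1.2596, -1.2707]


def greedy_action(q_values):
    """0-based index of the largest Q-value; index 0 corresponds to Q_1."""
    return max(range(len(q_values)), key=lambda k: q_values[k])


best = greedy_action(q_cell)
```

Here `best` is 0, i.e. Q_1 wins, and the policy matrix shows $\bigtriangledown$ in that cell. This suggests that Q_1 holds the values of the downward move as the matrices are displayed, although that mapping is inferred from the numbers and is not stated explicitly.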

The State Value Function, V, is shown below, first graphically and then as a matrix:

Figure-3: Graphical representation of the State Value Function, V.

\tiny V = \begin{smallmatrix}
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 & 0 & -9.2943 & -9.2237 & -9.1390 & -9.0438 & -8.9383 & 0 & 0 & -8.5404 & -8.3804 & -8.2017 & -8.0028 & -7.7820 & -7.5369 & 0 & 0 & -0.3121 &  0.7610 & -0.3107 & 0 \\
 & 0 & -9.2239 & -9.1390 & -9.0441 & -8.9384 & -8.8208 & 0 & 0 & -8.3815 & -8.2028 & -8.0037 & -7.7824 & -7.5369 & -7.2640 & 0 & 0 &  0.7594 & -0.2673 &  0.7594 & 0 \\
 & 0 & -9.1390 & -9.0441 & -8.9383 & -8.8207 & -8.6900 & -8.5445 & -8.3830 & -8.2035 & -8.0042 & -7.7830 & -7.5370 & -7.2638 & -6.9605 & 0 & 0 & -0.3165 &  0.7594 & -0.3164 & 0 \\
 & 0 & -9.0443 & -8.9385 & -8.8210 & -8.6902 & -8.5449 & -8.3834 & -8.2041 & -8.0048 & -7.7833 & -7.5372 & -7.2639 & -6.9603 & -6.6232 & 0 & 0 & -1.2849 & -0.3165 & -1.2844 & 0 \\
 & 0 & -8.9390 & -8.8214 & -8.6907 & -8.5454 & -8.3840 & -8.2046 & -8.0052 & -7.7836 & -7.5375 & -7.2641 & -6.9603 & -6.6230 & -6.2482 & 0 & 0 & -2.1564 & -1.2849 & -2.1555 & 0 \\
 & 0 & -9.0439 & -8.9387 & -8.8213 & 0 & 0 & 0 & 0 & 0 & -7.2641 & -6.9604 & -6.6230 & -6.2481 & -5.8315 & 0 & 0 & -2.9407 & -2.1564 & -2.9394 & 0 \\
 & 0 & -9.1381 & -9.0441 & -8.9387 & 0 & 0 & 0 & 0 & 0 & -6.9606 & -6.6233 & -6.2483 & -5.8315 & -5.3684 & -4.8538 & -4.2820 & -3.6466 & -2.9407 & -3.6447 & 0 \\
 & 0 & -9.2117 & -9.1366 & -9.0429 & -8.9375 & -8.8202 & -8.6898 & 0 & 0 & -7.2631 & -6.9595 & -6.6221 & -6.2471 & -5.8304 & -5.3675 & -4.8531 & -4.2815 & -3.6466 & -4.2788 & 0 \\
 & 0 & -9.1385 & -9.0436 & -8.9381 & -8.8207 & -8.6899 & -8.5446 & 0 & 0 & -7.5363 & -7.2630 & -6.9593 & -6.6217 & -6.2466 & -5.8297 & -5.3667 & -4.8524 & -4.2817 & -4.8496 & 0 \\
 & 0 & -9.0443 & -8.9387 & -8.8210 & -8.6901 & -8.5447 & -8.3831 & -8.2036 & -8.0041 & -7.7824 & -7.5362 & -7.2628 & -6.9590 & -6.6215 & 0 & -5.8252 & -5.3639 & -4.8527 & -5.3608 & 0 \\
 & 0 & -9.1381 & -9.0431 & -8.9374 & -8.8196 & -8.6888 & -8.5434 & -8.3820 & -8.2025 & -8.0031 & -7.7816 & -7.5356 & -7.2625 & 0 & 0 & -6.2373 & -5.8250 & -5.3657 & -5.8206 & 0 \\
 & 0 & -9.2148 & -9.1344 & -9.0405 & -8.9353 & -8.8175 & -8.6866 & -8.5413 & -8.3800 & -8.2008 & -8.0020 & -7.7813 & 0 & 0 & 0 & -6.6043 & -6.2370 & -5.8254 & -6.2298 & 0 \\
 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{smallmatrix}
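The printed numbers are consistent with V being the cell-wise maximum over the four Q matrices. The Python check below uses values copied from the matrices above. The discount factor of 0.9 is an assumption inferred from those values, since the post does not state which value was used.

```python
# Values copied from the matrices printed above (row 2, column 17 of each).
q_cell = [-0.3124, -1.2748, -1.2693, -0.3121]   # Q_1 .. Q_4
v_cell = max(q_cell)                            # state value = best Q-value

# Consistency check with an assumed discount factor; 0.9 is a guess
# inferred from the numbers, not a value given in the post.
GAMMA = 0.9
v_goal = -0.2673                         # V at the goal cell (row 3, column 18)
q_into_goal = 1.0 + GAMMA * v_goal       # reward +1, then continue from the goal
q_two_steps = -1.0 + GAMMA * 0.7594      # reward -1, then the cell next to the goal
```

`v_cell` equals the -0.3121 printed in V, `q_into_goal` comes out close to the 0.7594 printed next to the goal, and `q_two_steps` close to the -0.3165 printed one cell further away.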

The evolution of the effectiveness has been computed every 250 episodes of the main loop, with the rewards averaged over 100 episodes. The curve is noisy locally, but the global trend is clear: the number of iterations needed to reach the goal decreases as the number of episodes increases. Eventually the curve saturates, which indicates the minimum number of iterations needed for this particular map. The evolution of the effectiveness is shown below:
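One way to organise such a measurement is sketched below in Python. It assumes that "every 250 episodes" means pausing training after each block of 250 episodes and that the average is taken over 100 separate evaluation episodes; the callables `train_one_episode` and `run_episode` are hypothetical placeholders for the learning and evaluation code.

```python
def average_reward(run_episode, n_episodes=100):
    """Mean total reward over n_episodes calls of run_episode()."""
    return sum(run_episode() for _ in range(n_episodes)) / n_episodes


def effectiveness_curve(train_one_episode, run_episode, total=1000, every=250):
    """Train for `total` episodes, evaluating after every `every` of them."""
    curve = []
    for episode in range(1, total + 1):
        train_one_episode()
        if episode % every == 0:
            curve.append((episode, average_reward(run_episode)))
    return curve
```

Each entry of the returned list pairs an episode count with the averaged reward at that point, which is the kind of data a plot like Figure-4 needs.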

Figure-4: Graphical representation of the evolution of the effectiveness.

4 - The results & conclusion

This laboratory work was very helpful for understanding the theoretical background of the Q-learning algorithm. It helped me learn more about the algorithm and how to apply it to robot learning.
