Reinforcement learning:
...is a type of machine learning that involves learning a behavior policy which maximizes long-term reward. The policy is derived from the Value function, V(S,A), which gives the long-term value of taking action A when in state S. When V is known for every combination of S and A, the optimal policy from a given state is simply to take the action A' for which V is largest.
How do we find V for a particular combination of S and A? Imagine the following scenario: There is a state S1 which leads to state S2 when action A1 is taken (there may be many other available actions, but we start by focusing on just one); there is some reward R1 which is received as a result of doing A1 in S1; V(S2,A2') is already known but V(S1,A1) is not known. See the diagram below for an example of this sort of Markov Decision Process.
Now, we don't know if A1 is the best action when in S1, but we do know that there will be some reward R1 from doing A1, and we know that this will lead to S2, where the optimal action has a value of V(S2,A2'). Therefore we can say that V(S1,A1) = R1 + G·V(S2,A2'), where G (the "discount factor") represents the trade-off between the immediate reward R1 and the long-term reward V(S2,A2'). Note that, if G=0, then the Value of A1 in S1 is simply the immediate reward R1, while if G>0 then V(S1,A1) depends on V(S2,A2'), which presumably depends on R2 and V(S3,A3'), which depends on... and so on.
But what if we don't know ANY of the Values initially? Well, it turns out you can just assign V(S,A) a random value initially for all combinations of S and A. Then you can start in a particular state S1, choose an action A1 and get a reward R1, and then update V(S1,A1) based on the above equation in terms of V(S2,A2'). You then repeat this process in S2 and so on. In the case of discrete states, "updating" V means taking a weighted average of the old and new Value: Vnew(S1,A1) = (1-L)·Vold(S1,A1) + L·(R1+G·V(S2,A2')), where L is a constant between 0 and 1 known as the "learning rate" for reasons you can deduce by plugging in L=0. The cool thing about this process is that even though Vold(S1,A1) and V(S2,A2') are just random guesses at first, they converge on the correct Values after many, many updates have been performed and each state has been updated many times.

Since V is initialized with random values, we don't know what the true optimal action is for each state. For this reason, it is customary to start this process choosing actions at random. After V has been partially updated, the Value of doing A in S will be partially accurate. At this point, we could exploit this fact by choosing the optimal A in each state, knowing that we will usually get a reward, but in doing so we would avoid all suboptimal actions, even if some of those suboptimal actions lead to larger rewards which have not been discovered yet. Knowing that there may be undiscovered rewards, we can instead choose suboptimal actions as a means of exploring the state-action space further. It is typical to follow a middle ground between the two extremes of all exploitation and all exploration by choosing a suboptimal action with a chance of epsilon, where epsilon is between 0 and 1, and choosing the optimal action otherwise. This is known as an epsilon-greedy strategy.
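Below is a minimal sketch (not the project code) of the tabular update and the epsilon-greedy action choice described above; the table size and state encoding are placeholders for illustration.

```cpp
#include <cstdlib>

const int NUM_STATES  = 100;         // placeholder sizes for illustration
const int NUM_ACTIONS = 3;
double V[NUM_STATES][NUM_ACTIONS];   // assumed to be filled with random values at startup

// Index of the optimal action A' in state S (the action with the largest V).
int bestAction(int S) {
    int best = 0;
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (V[S][a] > V[S][best]) best = a;
    return best;
}

// Epsilon-greedy: with probability epsilon, take a suboptimal action to explore.
int chooseAction(int S, double epsilon) {
    int best = bestAction(S);
    if ((double)rand() / RAND_MAX < epsilon)
        return (best + 1 + rand() % (NUM_ACTIONS - 1)) % NUM_ACTIONS;  // any action other than A'
    return best;
}

// Weighted-average update: Vnew(S1,A1) = (1-L)*Vold(S1,A1) + L*(R1 + G*V(S2,A2'))
void updateValue(int S1, int A1, double R1, int S2, double L, double G) {
    double target = R1 + G * V[S2][bestAction(S2)];
    V[S1][A1] = (1.0 - L) * V[S1][A1] + L * target;
}
```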
Though there are many other types of reinforcement learning, this particular process is known as Q-learning (the Value function is usually called Q rather than V).
Upgrades:
1. The Value function only converges on the correct Values if each state is visited many times as we choose actions and move through state space. However, if there are an infinite number of states (e.g. if state space is continuous) then it is impossible to visit all possible states, much less visit each one multiple times. The standard workaround for this is to approximate V(S,A) using a neural network, where the state (or the parameters which describe it) is the input to the NN and each output of the NN is the Value of a different action. This only allows for discrete actions, as you may have noticed. It is possible to have inputs corresponding to actions and a single output corresponding to V(S,A), but this has been found to be much less effective when discretizing actions is possible. Now that we have a neural network representing V, we can update V using our method of choice for NN training (RMSProp in my case). Instead of using a weighted average, we simply train the network on the input corresponding to S1, with R1+G·V(S2,A2') as the target output for V(S1,A1).
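Here is a rough sketch of how one training step might look with the NN in place. The functions nnForward and nnTrainRMSProp stand in for whatever network code is used; they are assumed interfaces, not the project's actual routines.

```cpp
const int NUM_ACTIONS = 3;

// Assumed interface: fills out[] with V(S,·) for the state described by s[].
void nnForward(const double s[], double out[NUM_ACTIONS]);

// Assumed interface: one RMSProp step pulling the network's output for 'action'
// toward 'target' when given input s[], leaving the other outputs alone.
void nnTrainRMSProp(const double s[], int action, double target);

void trainStep(const double s1[], int a1, double r1, const double s2[], double G) {
    double v2[NUM_ACTIONS];
    nnForward(s2, v2);                        // V(S2,·) from the current network
    double best = v2[0];
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (v2[a] > best) best = v2[a];       // V(S2,A2')
    double target = r1 + G * best;            // R1 + G·V(S2,A2')
    nnTrainRMSProp(s1, a1, target);           // regress V(S1,A1) toward the target
}
```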
2. Leemon Baird discovered that Q-learning has issues when the time step of the RL simulation is very small, because the Q values for different actions can get so close together that they are indistinguishable. To fix this, he defines an alternative function, Vadv(S,A) = V(S,A') + K·(V(S,A)-V(S,A')), thus scaling the Values for each action away from that of the optimal action. Baird showed that, when the scaling factor K is proportional to 1/(time step), the separation between Values for different actions is invariant with respect to time step, thus solving the problem of Q values becoming infinitesimally close in the continuous-time limit. This is called Advantage Learning.
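As I read the formula above, the transformation applied to one state's action Values is just a rescaling about the optimal action's Value, something like the sketch below (K > 1 widens the gaps).

```cpp
const int NUM_ACTIONS = 3;

// Vadv(S,A) = V(S,A') + K*(V(S,A) - V(S,A')), applied in place to q[] = V(S,·).
void advantageTransform(double q[NUM_ACTIONS], double K) {
    double best = q[0];
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (q[a] > best) best = q[a];          // V(S,A')
    for (int a = 0; a < NUM_ACTIONS; a++)
        q[a] = best + K * (q[a] - best);       // optimal action is unchanged; suboptimal
                                               // ones are pushed K times further below it,
                                               // so they stay distinguishable at small time steps
}
```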
3. Training the NN on only the most recently encountered states can be problematic if said states are correlated. As an example of this, consider a well-trained NN which gives the Value of two actions, turning right and left, with the oncoming road shape as the state input. The well-trained NN uses every neuron to calculate the two output Values. If the next 10 states in a row happen to all require right turns, then the NN will be trained for a higher Value of right turns over and over again. With its previous neural weights being partially overwritten, it is likely that the NN will be unable to properly value left turns after this point. To fix this problem, we have to train on both right and left turns simultaneously, even if we only encounter one or the other. To do this, we use stochastic gradient descent with mini-batch updates, which means we keep a history of the last N (usually on the order of 10000) state-action-reward combinations (SARCs) and, instead of fully training the NN on the most recent SARC, we randomly choose n SARCs from the history and adjust the NN partially based on each one (see the sketch below). "Stochastic gradient descent" corresponds to the partial adjustment of NN weights using different SARCs, and "mini-batch update" corresponds to the fact that we take a random subset of SARCs from the SARC history to perform the SGD.
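A minimal sketch of such a history and a uniform random mini-batch, with illustrative names and sizes (the real code may differ). Note that each stored entry also needs the resulting state S2, since the training target R1+G·V(S2,A2') is computed from it.

```cpp
#include <cstdlib>

const int STATE_DIM    = 3;
const int HISTORY_SIZE = 10000;
const int BATCH_SIZE   = 40;

struct SARC {
    double s1[STATE_DIM];   // state
    int    a1;              // action taken
    double r1;              // reward received
    double s2[STATE_DIM];   // resulting state (needed to compute the target)
};

SARC history[HISTORY_SIZE];
int historyCount = 0;       // how many slots are filled so far
int historyNext  = 0;       // circular write position

void storeSARC(const SARC &sarc) {
    history[historyNext] = sarc;
    historyNext = (historyNext + 1) % HISTORY_SIZE;   // overwrite the oldest entry
    if (historyCount < HISTORY_SIZE) historyCount++;
}

// One mini-batch update: pick BATCH_SIZE random SARCs and train a little on each.
void miniBatchUpdate(double G) {
    if (historyCount == 0) return;
    for (int i = 0; i < BATCH_SIZE; i++) {
        const SARC &sarc = history[rand() % historyCount];
        // trainStep(sarc.s1, sarc.a1, sarc.r1, sarc.s2, G);   // as sketched in upgrade 1
    }
}
```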
4. An issue can arise if SARCs remain correlated for long enough that they dominate the SARC history. If this happens, then the purpose of mini-batch updating is defeated. Presumably, if there is a point where the RL system has had to make 10000 right turns in a row, then the RL system is an expert on right turns. It no longer needs to save right-turn experiences in its history. Rather, it can choose to save only experiences which are unique or, even better, it can save only experiences for which the NN needs training (i.e. for which the NN output differs greatly from the target output). This is called prioritized sweeping. In my case, every SARC is put in the history, but when the mini-batch is being selected, the SARC selection probability is proportional to D, the difference between the actual V (from the NN) and the target V (i.e. R1+G·V(S2,A2')).
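One generic way to make the selection probability scale with D is rejection sampling over the history, as sketched below; priorityD and maxD are illustrative names, and my actual selection scheme (with its 50%-skip threshold, listed with the other parameters further down) differs in the details.

```cpp
#include <cstdlib>

const int HISTORY_SIZE = 10000;
double priorityD[HISTORY_SIZE];   // |target V - network V| for each stored SARC
int    historyCount = 0;          // how many entries are currently stored

// Returns the index of a SARC chosen with probability proportional to its D.
// maxD is assumed to track the largest D currently in the history
// (and at least one D is assumed to be greater than zero).
int samplePrioritized(double maxD) {
    while (true) {
        int i = rand() % historyCount;                     // uniform candidate
        double accept = priorityD[i] / maxD;               // in (0, 1]
        if ((double)rand() / RAND_MAX < accept) return i;  // keep with probability ~ D
    }
}
```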
5. Recall the equation Vnew(S1,A1) = (1-L)·Vold(S1,A1) + L·(R1+G·V(S2,A2')). This updates V based on the assumption that you will do the optimal action A2' when in S2. However, with epsilon-greedy or some other exploration-exploitation trade-off strategy, this is not always the case. The result is that Values tend to be overestimated, especially for states where the available actions have a large Value disparity. For example, if one of the possible actions in S2 has a fatal outcome, then it is clearly not the optimal action, and yet it will occasionally be chosen under an epsilon-greedy strategy; since only the optimal action is considered when updating V, the possibility of that fatal choice is not accounted for. If we want to account for the occasional bad choices resulting from the epsilon-greedy strategy, then we should replace V(S2,A2') above with V(S2,A2), where A2 is now chosen based on the epsilon-greedy strategy instead of always being the optimal action. Because we look ahead at the action we will choose in S2 while still updating the Value function at S1, this process is called SARSA (State, Action, Reward, State, Action).
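The difference between the two targets is small in code. A sketch, where value(), bestAction(), and chooseAction() are placeholders for the table/network lookup and the epsilon-greedy selector described earlier:

```cpp
// Assumed interfaces for illustration only.
double value(const double s[], int a);                   // V(S,A)
int    bestAction(const double s[]);                     // A' = action with the largest V(S,·)
int    chooseAction(const double s[], double epsilon);   // epsilon-greedy choice

// Q-learning target: assume the optimal action A2' will be taken in S2.
double qLearningTarget(double r1, const double s2[], double G) {
    return r1 + G * value(s2, bestAction(s2));
}

// SARSA target: use the action actually chosen (epsilon-greedily) for S2.
// The same A2 returned here must then be the action executed in S2.
double sarsaTarget(double r1, const double s2[], double G, double epsilon) {
    int a2 = chooseAction(s2, epsilon);
    return r1 + G * value(s2, a2);
}
```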
My Project:
A classic benchmark in RL (and optimal control theory in general) is the pole-cart problem. The problem is to balance a pole upright on top of a cart which can move back and forth. If the pole falls to 90 degrees from vertical or the cart hits the wall of the room, this marks the end of an "episode" and the pole-cart system is reset with a random angle/position. I apply SARSA Advantage Learning with NN approximation, mini-batch updating, and prioritized sweeping. All the bells and whistles certainly were not required to make this work, but they should make it work better (in principle, at least). It's coded in C++ but without classes, on the off-chance that I decide to try putting it on a microcontroller one day.
The NN has 3 inputs (pole angle, cart position, cart velocity) and 3 outputs (the Values of "accelerate left", "don't move", and "accelerate right"). Letting the actions be simple displacements to the left and right allows for much better learning and control, but I switched over to the actions being accelerations because this is more relevant to robotic applications. While I've tried using upwards of 10 hidden layers, the problem can be solved relatively well with as few as 3 hidden layers. If the output is linearly dependent on the inputs, then one hidden layer will suffice; conversely, multiple hidden layers allow for nonlinear dependence. As for hidden layer width, 3 neurons wide worked fine, but greater width allows for the representation of more complex features. For the demonstration below, I use a hidden layer stack 6 neurons wide and 4 neurons deep.
Wanting the pole to stay upright and the cart to stay near the middle of the room, I use the reward function R = -|pole angle| - |cart position|. I've played around with many different reward schemes and, not surprisingly, the reward has a huge effect on learning speed and on the pathologies that appear. I found that the speed at which the cart can move (i.e. its reaction speed) has a large effect on the outcome: too large and it is unable to stabilize, too small and it cannot compensate for the falling pole. The strength of gravity is also important, and in principle there is some combination of reaction speed and gravity under which the RL problem is invariant (assuming continuous time), but I haven't played with the strength of gravity nor the time step in this project (yet).
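For concreteness, here is roughly what the reward and episode-reset logic look like; the units, the room half-width, and the reset ranges are illustrative placeholders rather than the exact values used in the project.

```cpp
#include <cmath>
#include <cstdlib>

const double PI              = 3.14159265358979;
const double ROOM_HALF_WIDTH = 1.0;   // placeholder room size

// R = -|pole angle| - |cart position|
double reward(double poleAngle, double cartPosition) {
    return -std::fabs(poleAngle) - std::fabs(cartPosition);
}

// Episode ends when the pole reaches 90 degrees from vertical or the cart hits a wall.
bool episodeOver(double poleAngle, double cartPosition) {
    return std::fabs(poleAngle) >= PI / 2.0
        || std::fabs(cartPosition) >= ROOM_HALF_WIDTH;
}

// Reset with a random angle/position, as described above.
void resetEpisode(double &poleAngle, double &cartPosition, double &cartVelocity) {
    poleAngle    = ((double)rand() / RAND_MAX - 0.5) * 0.2;               // small random tilt (radians)
    cartPosition = ((double)rand() / RAND_MAX - 0.5) * ROOM_HALF_WIDTH;   // somewhere near the middle
    cartVelocity = 0.0;
}
```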
While I've tried many combinations of parameters with different pros and cons, for the demonstrations below I used: G=0.5, L=0.001 (NN learning rate), K=2, epsilon=0.1, history size=10000, mini-batch size=40, priority threshold=0.05 (the D at which a SARC has a 50% chance of being ignored; the average D is 0.2).
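The same settings written out as constants (the names are just for readability; the values are those quoted above):

```cpp
const double G                  = 0.5;     // discount factor
const double LEARNING_RATE      = 0.001;   // NN learning rate (L)
const double K                  = 2.0;     // Advantage Learning scaling factor
const double EPSILON            = 0.1;     // exploration probability
const int    HISTORY_SIZE       = 10000;   // SARC history length
const int    BATCH_SIZE         = 40;      // mini-batch size
const double PRIORITY_THRESHOLD = 0.05;    // D at which a SARC has a 50% chance of being ignored
                                           // (average D is 0.2)
```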
GUI explanation: [YOUTUBE]