This chapter introduces another major category of reinforcement learning algorithms: policy-based RL. First, we will transition from value-based RL by discussing its issues and how they can be addressed with policy-based RL. Next, we will get familiar with the basic concepts of policy-based RL and outline the major steps of policy-based RL studies: objective function construction, policy definition, training via policy improvement, and algorithm improvement. With these steps in mind, we will first introduce an objective function from which the policy gradient can be derived. Based on it, the deduction of the policy gradient theorem will be presented. Then, we will show a Monte Carlo implementation of the derived policy gradient theorem. Next, we will discuss the issues with this simple Monte Carlo implementation, which can be improved in two different directions: policy and value. As for the policy, we will introduce more forms of objective functions and present a more widely accepted policy gradient theorem. As for the value, we will discuss more policy evaluation methods for reducing the variance. Various classic policy gradient algorithms, including REINFORCE, REINFORCE with baseline, and Actor-Critic (TD(0) and TD($\lambda$)), will be introduced.
16.2. Policy-Based RL vs. Value-Based RL
Value-based algorithms such as Q-learning and deep Q-learning [163] have some issues:
A minor change in the value function can lead to a change in the selection of actions. Such discontinuous changes are an important reason why it is hard for value-based methods to obtain converged results.
The search for the optimal $Q$ value is difficult in a continuous space. This issue is especially obvious when the state or action space is high-dimensional or continuous. For example, in autonomous driving applications, the number of states corresponding to different road and vehicle conditions is huge.
Value-based algorithms cannot learn stochastic policies. For this reason, a value-based learning agent can easily get stuck in a specific state. One example is a two-player game like Rock-Paper-Scissors: if one player acts deterministically, the other player can develop countermeasures to win.
Policy-based algorithms can be used to address these issues by adjusting the policy instead of the value functions ($V$ and $Q$) as in value-based RL algorithms. In value-based RL, both of these value functions can determine the selection of actions in different states. As shown in Fig. 16.1, the learned values determine the selection of actions via a greedy policy: select the action with the highest action-state value (i.e., the highest expected cumulative reward). To enable exploration, stochastic factors are added in the training process by modifying the greedy policy with $\epsilon$ (in testing, only the greedy policy is used); $\epsilon$ represents the probability of exploring actions that are not suggested by the greedy policy but may achieve higher long-term rewards.
By contrast, policy-based algorithms directly determine the selection of actions in different states via the policy (Fig. 16.2). Such a policy can be viewed as a probability distribution over actions conditioned on the state variables. This distribution tells the probability of selecting each action in the given state. Thus, when a state with specific state variables is determined, we know the probabilities of selecting the different actions, though only one action will be selected based on these probabilities. To model the action probabilities, the Softmax function is usually used for discrete actions, while a Gaussian distribution is commonly used for continuous actions. The greedy policy (not $\epsilon$-greedy) can be viewed as a special policy
Figure 16.1: Value-based RL vs. policy-based RL
that corresponds to a Dirac delta distribution located at the maximum of the action-state value function: only the action corresponding to the maximum action-state value will be selected, with a probability of $100\%$. In the training process of a policy-based learning algorithm, this distribution representing the policy is adjusted to improve the policy.
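To make these two common parameterizations concrete, the sketch below samples actions from a Softmax policy (discrete actions) and a Gaussian policy (continuous actions), both with linear score/mean functions. It is a minimal illustration; the linear parameterization and the function names are assumptions, not something specified in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy_sample(theta_W, theta_b, state):
    """Discrete policy: pi_theta(a|s) as a Softmax over linear scores."""
    logits = theta_W @ state + theta_b            # one score per discrete action
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(len(probs), p=probs)      # sample one action by its probability
    return action, probs

def gaussian_policy_sample(theta_W, theta_b, state, std=0.1):
    """Continuous policy: pi_theta(a|s) as a Gaussian with a state-dependent mean."""
    mean = theta_W @ state + theta_b
    action = rng.normal(mean, std)                # sample a continuous action around the mean
    return action, mean

# Example usage: a 4-dimensional state, 3 discrete actions / 2 continuous action dimensions.
s = np.array([0.1, -0.3, 0.5, 0.0])
a_disc, p = softmax_policy_sample(rng.normal(size=(3, 4)), np.zeros(3), s)
a_cont, m = gaussian_policy_sample(rng.normal(size=(2, 4)), np.zeros(2), s)
```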
Figure 16.2: Illustration of policy update in policy-based RL
16.3. Basic Concepts
The probability distribution used to represent a policy can be written as $\pi_{\theta}(a \mid s)$, which describes the probability of selecting an action $a$ in the state $s$. This function is parameterized by $\theta$, which could be one number or an array of numbers.
Fig. 16.2 illustrates the updates of $\pi_{\theta}(a \mid s)$, in which the policy is updated at every step with episodic data. As shown, in Step 1 of Episode 1, represented by a state $\vec{s}_{1,1}$, we selected action $a_{2}$, which has a probability of 0.23. The generated rewards were used to update $\pi_{\theta}(a \mid s)$. This led to $\vec{s}_{1,2}$, in which the updated $\pi_{\theta}(a \mid s)$ can simply be viewed as the new probabilities of the different actions. This process continues as more episodes of data are used.
When an agent acts in an environment using a policy $\pi_{\theta}(a \mid s)$, a series of states and actions is generated. This series is called a trajectory, which is illustrated in Fig. 16.3. Such a trajectory can be formulated as follows:
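With $s_{t}$ and $a_{t}$ denoting the state and action at step $t$ of an episode with $T$ steps,
\begin{equation*}
\tau=\left(s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_{T}, a_{T}\right) .
\end{equation*}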
The stochastic nature of policy-based algorithms makes their theories appear different from those of value-based algorithms [164]. One difference is the widespread use of probability functions and expectations in the theories of policy-based algorithms, which is not necessary for value-based algorithms. One policy corresponds to one probability distribution; as a result, different trajectories can be generated even if an agent starts from the same state and follows the same policy. Therefore, in order to evaluate a policy, expectations of rewards and value functions are commonly used in policy-based theories. It should be noted that many introductions to RL are written using frameworks that consider probabilities
Figure 16.3: Trajectory of an episode in RL learning
and expectations, which provide a comprehensive description suitable for both value-based and policy-based algorithms, whereas other introductions omit the probabilities for simplicity and are oriented more toward value-based algorithms. This can be confusing to people who are new to RL.
Reinforcement learning is similar to supervised learning in a few aspects. That is, we are looking for a mapping from states to actions. But here, it is no longer a simple one-to-one mapping. The rewards and their functions, such as the state value function and the action-state value function, serve as the "objective/loss function" in training and the "evaluation metrics" in testing for RL. The training process of RL is a process of maximizing the objective function. Thus, such an objective function is usually constructed with the reward and the policy. In that way, the probabilities of actions can be adjusted to obtain more rewards (i.e., a higher objective function value). Because the policy is parameterized by $\theta$, the objective function is also parameterized by $\theta$. The training process aims to obtain the optimal parameter $\theta^{*}$, which maximizes the objective function:
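In other words,
\begin{equation*}
\theta^{*}=\underset{\theta}{\arg \max }\, J(\theta),
\end{equation*}
where $J(\theta)$ denotes the objective function.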
In the following, we will see that the development, implementation, and improvement of policy gradient methods can be divided into four steps or components:
Propose an objective function to assess the effectiveness of the policy. We will see that many different objective functions can be used.
Define a policy mathematically. For training, we will need to find out how to improve the policy so as to maximize the objective function; this leads to the policy gradient theorem.
Construct and implement an algorithm for training. That is, we will develop a procedure to improve the policy in the process of maximizing the objective function.
Further improve the algorithm. Many issues cause difficulties in obtaining good training results with policy gradient methods. Different improvements have been proposed, leading to the different policy gradient methods/algorithms.
The rest of this chapter is organized to introduce these four components sequentially.
16.4. Objective Function and Policy Gradient Theorem
16.4.1. Objective Function
The objective function can be constructed in different ways. In the following, we will first show a very simple way to construct the objective function and derive the policy gradient theorem with it. Three more objective functions will be presented after that, based on which various policy gradient algorithms will be described.
In a simple way, we can use the expectation of the sum of rewards as the objective function. This expectation can be obtained by letting the learning agent generate a large number of trajectories. When this number is large enough, we can use the average of the total rewards to approximate this expectation.
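With $N$ sampled trajectories of $T$ steps each, this can be written as
\begin{equation*}
J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} r\left(s_{i, t}, a_{i, t}\right),
\end{equation*}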
where $i$ is the index of $\tau$, the trajectory obtained in episode $i$, and $t$ is the index of the step in this episode.
The above equation can be written in a continuous form:
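In the common notation, where $\pi_{\theta}(\tau)$ is the probability of generating trajectory $\tau$ under the policy and $r(\tau)$ is the total reward of that trajectory, this continuous form is
\begin{equation*}
J(\theta)=\int \pi_{\theta}(\tau) r(\tau)\, d \tau=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[r(\tau)] .
\end{equation*}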
The RL implementation with the above equation is similar to the search for the minimum loss or maximum objective function in supervised learning. We can still use gradient-based optimization methods. There is, however, one major difference: in reinforcement learning we look for the maximum of an objective function determined by the rewards, instead of the minimum of a loss function determined by the difference between the predicted and true labels as in supervised learning. In gradient ascent, the parameters that maximize the objective function are approached via the following update equation:
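Assuming a learning rate $\alpha$,
\begin{equation*}
\theta \leftarrow \theta+\alpha \nabla_{\theta} J(\theta) .
\end{equation*}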
Now, the key is to derive an equation that can be easily used to calculate $\nabla_{\theta} J(\theta)$. This equation can be derived as follows.
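Starting from the continuous form of $J(\theta)$ and using the identity $\nabla_{\theta} \pi_{\theta}(\tau)=\pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau)$, we have
\begin{align*}
\nabla_{\theta} J(\theta) &=\nabla_{\theta} \int \pi_{\theta}(\tau) r(\tau)\, d \tau=\int \nabla_{\theta} \pi_{\theta}(\tau) \cdot r(\tau)\, d \tau \\
&=\int \pi_{\theta}(\tau) \nabla_{\theta} \log \pi_{\theta}(\tau) \cdot r(\tau)\, d \tau=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(\tau) \cdot r(\tau)\right],
\end{align*}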
where $\log$ is the natural logarithm, i.e., $\ln$.
In the above equation, the gradient operator is placed on $\pi_{\theta}(\tau) \cdot r(\tau)$, but it only takes effect on $\pi_{\theta}(\tau)$, because $r(\tau)$ is not a function of $\theta$. The key step in the above deduction is to take a $\pi_{\theta}(\tau)$ out of the gradient operation, which enables the construction of an expectation.
Next, we need to recall the essential characteristic of an MDP, i.e., the Markov property [165], so that $\tau$ in the above equation can be related to individual state-action pairs for implementation. Under this property, the current state depends only on the state and action at the previous time step and is independent of the states and actions at earlier times. Then, the probability of a trajectory $\tau$ can be formulated as
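In the common MDP notation, with $p\left(s_{0}\right)$ the initial state distribution and $p\left(s_{t+1} \mid s_{t}, a_{t}\right)$ the transition probability,
\begin{equation*}
\pi_{\theta}(\tau)=p\left(s_{0}\right) \prod_{t=0}^{T} \pi_{\theta}\left(a_{t} \mid s_{t}\right) p\left(s_{t+1} \mid s_{t}, a_{t}\right) .
\end{equation*}
Because neither $p\left(s_{0}\right)$ nor $p\left(s_{t+1} \mid s_{t}, a_{t}\right)$ depends on $\theta$, taking the logarithm and then the gradient leaves only the policy terms: $\nabla_{\theta} \log \pi_{\theta}(\tau)=\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)$.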
Please be aware that, at this point, we simply sum up the rewards from one trajectory to get $r(\tau)$; the discount factor has not been considered yet. The discount factor will be discussed in a later section on advanced policy gradient algorithms.
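That is,
\begin{equation*}
r(\tau)=\sum_{t=0}^{T} r\left(s_{t}, a_{t}\right) .
\end{equation*}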
Substituting the above two equations into the $\nabla_{\theta} J(\theta)$ equation, we obtain
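\begin{equation*}
\nabla_{\theta} J(\theta)=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right) \cdot\left(\sum_{t=0}^{T} r\left(s_{t}, a_{t}\right)\right)\right] .
\end{equation*}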
The above equation is called the policy gradient theorem. This equation for the calculation of the gradient of the objective function contains an expectation, which cannot be calculated directly. Therefore, in implementations, we can only approximate it with an average of enough samplings of $\tau$:
\begin{equation*}
\nabla_{\theta} J(\theta) \approx \underbrace{\frac{1}{N} \sum_{i=1}^{N}[\underbrace{\left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} \mid s_{i, t}\right)\right)}_{\text {Policy: probability of } \tau} \cdot \underbrace{\left(\sum_{t=0}^{T} r\left(s_{i, t}, a_{i, t}\right)\right)}_{\text {Value: rewards of } \tau}]}_{\text {Use mean to approximate expectation }} \tag{16.10}
\end{equation*}
In addition to the use of the mean to approximate the expectation, there are two key parts: the policy and the reward (value). The former determines which action will be selected, and the latter tells what the value of the action (or the reward generated by the action) is. For the policy, we will need to find a way to mathematically formulate $\pi_{\theta}$ and, based on it, obtain a simple way to compute the gradient of (the $\log$ of) the policy. For the reward (value), in a simple way, we can directly implement the above equation by adding up the rewards for all the steps of a trajectory $\tau$. The part evaluating the value (reward) in the above equation is $\sum_{t=0}^{T} r\left(s_{i, t}, a_{i, t}\right)$. This part can be replaced by other equations for value evaluation, as listed in Table 16.1. In a later subsection, we will discuss more ways of evaluating the value, which lead to more advanced policy gradient algorithms.
Table 16.1: Different value evaluation functions in common policy gradient algorithms
| Equation for Value Evaluation | Algorithm |
| :--- | :--- |
| $G_{t}=\sum_{i=t}^{T} \gamma^{i-t} r_{i+1}$ | REINFORCE version 2 |
| $G_{t}-V_{w}\left(s_{t}\right)^{1}$ | REINFORCE with baseline |
| $Q_{w}\left(s_{t}, a_{t}\right)$ | Q Actor-Critic |
| $A_{w}\left(s_{t}, a_{t}\right)=Q_{w}\left(s_{t}, a_{t}\right)-V_{w}\left(s_{t}\right)$ | Advantage Actor-Critic |
| $\delta=r_{t}+V_{w}\left(s_{t+1}\right)-V_{w}\left(s_{t}\right)$ | Temporal-Difference Actor-Critic |
| $\delta(\lambda)^{2}$ | Eligibility Actor-Critic |
${ }^{1} V_{w}$ and $Q_{w}$ are approximations to $V^{\pi}$ and $Q^{\pi}$, respectively, parameterized by $w$; they can take forms such as a linear function, e.g., $Q(s, a)=\phi(s, a) \cdot w$, or a deep neural network, and are improved in the training process. ${ }^{2} \delta(\lambda)$ is the $\delta$ obtained using a $\operatorname{TD}(\lambda)$ scheme, which will be introduced later.
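As a sketch of how Eq. (16.10) can be turned into code, the function below forms the Monte Carlo gradient estimate for a Softmax policy with linear scores. The environment interface (`env.reset()` returning a state vector and `env.step(action)` returning `(next_state, reward, done)`) and all function names are illustrative assumptions, not an implementation given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, state, action):
    """Gradient of log pi_theta(a|s) for a Softmax policy with linear scores theta @ state."""
    probs = softmax(theta @ state)
    grad = -np.outer(probs, state)   # -pi(k|s) * s for every action k
    grad[action] += state            # +s for the action actually taken
    return grad

def policy_gradient_estimate(theta, env, num_trajectories=10, max_steps=200):
    """Monte Carlo estimate of Eq. (16.10): mean over trajectories of
    (sum of grad log pi) * (sum of rewards)."""
    grad_estimate = np.zeros_like(theta)
    for _ in range(num_trajectories):
        state = env.reset()
        sum_grad_log = np.zeros_like(theta)
        sum_reward = 0.0
        for _ in range(max_steps):
            probs = softmax(theta @ state)
            action = rng.choice(len(probs), p=probs)
            sum_grad_log += grad_log_pi(theta, state, action)
            state, reward, done = env.step(action)
            sum_reward += reward
            if done:
                break
        grad_estimate += sum_grad_log * sum_reward   # policy term times value term
    return grad_estimate / num_trajectories

# Gradient ascent on the objective:
# theta = theta + alpha * policy_gradient_estimate(theta, env)
```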