
Chapter 16

Policy-Based Reinforcement Learning

16.1. Overview

This chapter introduces another major category of reinforcement learning algorithms: policy-based RL. First, we will transition smoothly from value-based RL by discussing its issues and how such issues can be addressed with policy-based RL. Next, we will get familiar with the basic concepts in policy-based RL and outline the major steps of policy-based RL studies: objective function construction, policy definition, training via policy improvements, and algorithm improvements. With these steps in mind, we will first introduce an objective function and use it to derive the policy gradient; the deduction of the policy gradient theorem will then be presented. Then, we will show a Monte Carlo implementation of the derived policy gradient theorem. Next, we will discuss the issues with this simple Monte Carlo implementation, which can be improved in two different directions: policy and value. As for policy, we will introduce more forms of objective functions and show a more widely accepted policy gradient theorem. For value, we will discuss more policy evaluation methods for reducing the variance. Various classic policy gradient algorithms, including REINFORCE, REINFORCE with baseline, and Actor-Critic (TD(0) and $\mathrm{TD}(\lambda)$), will be introduced.

16.2. Policy-Based RL vs. Value-Based RL

Value-based algorithms such as Q-learning and deep Q-learning [163] have some issues:
  1. A minor change in the value function can lead to a change in the selection of actions. Such discontinuous changes are an important reason why it is hard for value-based methods to converge.
  2. The search for the optimal $Q$ value is difficult in a continuous space. This issue is especially obvious with high-dimensional or continuous action spaces. For example, in autonomous driving applications, the number of states corresponding to different road and vehicle conditions is huge.
  3. Value-based algorithms cannot learn stochastic policies. Due to this, it is easy for a value-based learning agent to get stuck in a specific state. One example is a two-player game like Rock-Paper-Scissors: if one player acts deterministically, the other player can develop countermeasures to win.
Policy-based algorithms address these issues by adjusting the policy instead of the value functions ($V$ and $Q$) as in value-based RL algorithms. In value-based RL, these value functions determine the selection of actions in different states. As shown in Fig. 16.1, the action-state values determine the selection of actions via a greedy policy: select the action with the highest action-state value (i.e., the highest expected cumulative reward). To enable exploration, stochastic factors are added by modifying the greedy policy with $\epsilon$ in the training process (in testing, only the greedy policy is used); $\epsilon$ represents the probability of exploring actions that are not suggested by the greedy policy but may achieve higher long-term rewards.
By contrast, policy-based algorithms directly determine the selection of actions in different states via the policy (Fig. 16.2). Such a policy can be viewed as a probability distribution over actions given the state variables. This distribution gives the probability of selecting each action in a given state. Thus, once a state is determined, we know the probabilities of selecting the different actions, though only one action will be selected according to these probabilities. To select an action, the Softmax function is usually used for discrete actions, while the Gaussian distribution is commonly used for continuous actions. The greedy policy (not $\epsilon$-greedy) can be viewed as a special policy that corresponds to a Dirac delta distribution located at the maximum action-state value: only the action corresponding to the maximum action-state value will be selected, with a probability of $100\%$. In the training process of a policy-based learning algorithm, this distribution representing the policy will be adjusted to improve the policy.
Figure 16.1: Value-based RL vs. policy-based RL
Figure 16.2: Illustration of policy update in policy-based RL
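To make the two selection mechanisms concrete, the following is a minimal sketch contrasting $\epsilon$-greedy selection over $Q$ values (value-based) with sampling from a Softmax distribution for discrete actions and from a Gaussian distribution for continuous actions (policy-based). The function names, the toy $Q$ values, and the logits are illustrative assumptions rather than code from any specific library.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Value-based selection: exploit the highest Q value, explore with probability epsilon.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy_action(logits):
    # Policy-based selection for discrete actions: sample from a Softmax distribution.
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs)), probs

def gaussian_policy_action(mean, std):
    # Policy-based selection for continuous actions: sample from a Gaussian distribution.
    return rng.normal(mean, std)

q_values = np.array([1.0, 2.5, 0.3])
print(epsilon_greedy(q_values))                           # usually the greedy action (index 1)
print(softmax_policy_action(np.array([0.2, 1.0, -0.5])))  # sampled action and its probabilities
print(gaussian_policy_action(mean=0.0, std=0.5))          # a sampled continuous action
```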

16.3. Basic Concepts

The probability distribution used to represent a policy can be written as $\pi_{\theta}(a \mid s)$, which describes the probability of selecting an action $a$ in the state $s$. This function is parameterized by $\theta$, which could be one number or an array of numbers.
Fig. 16.2 illustrates the updates of $\pi_{\theta}(a \mid s)$, in which the policy is updated at every step with episodic data. As shown, in Step 1 of Episode 1, represented by a state $\vec{s}_{1,1}$, we selected action $a_{2}$, which has a probability of 0.23. The generated rewards were used to update $\pi_{\theta}(a \mid s)$. This led to $\vec{s}_{1,2}$, in which the updated $\pi_{\theta}(a \mid s)$ can simply be viewed as the new probabilities of the different actions. This process continued as more episodes of data were used.
When an agent acts in an environment using a policy $\pi_{\theta}(a \mid s)$, a series of states and actions can be generated. This series is called a trajectory, which is illustrated in Fig. 16.3. Such a trajectory can be formulated as follows:
$$\tau = \left(s_{1}, a_{1}, s_{2}, a_{2}, \cdots, s_{t}, a_{t}\right)$$
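As a minimal sketch of how such a trajectory can be generated in practice, the helper below repeatedly samples actions from a stochastic policy and records the visited states, actions, and rewards. The Gym-style `reset`/`step` environment interface and the `sample_action` callable are assumptions made for illustration.

```python
def collect_trajectory(env, sample_action, max_steps=200):
    """Generate one trajectory tau = (s_1, a_1, s_2, a_2, ...) under a stochastic policy."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(max_steps):
        a = sample_action(s)              # a ~ pi_theta(. | s)
        s_next, r, done = env.step(a)     # assumed environment interface
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if done:                          # episode terminates
            break
    return states, actions, rewards
```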
The stochastic nature of policy-based algorithms makes their theories appear different from those of value-based algorithms [164]. One difference is the widespread use of probability functions and expectations in the theories of policy-based algorithms, which is not necessary in value-based algorithms. One policy corresponds to one probability distribution. As a result, different trajectories can be generated even if an agent starts from the same state following one policy. Therefore, in order to evaluate a policy, expectations of rewards and state functions are commonly used in policy-based theories. It is noted that many introductions to RL are written using frameworks based on probabilities and expectations, which give a comprehensive description suitable for both value-based and policy-based algorithms, whereas some other introductions are laid out without probabilities for simplicity, which is more intended for value-based algorithms. This can be confusing to people who are new to RL.
Figure 16.3: Trajectory of an episode in RL learning
Reinforcement learning is similar to supervised learning in a few aspects. That is, we are looking for a mapping from states to actions, but here it is no longer a simple one-to-one mapping. The rewards and their functions, such as the state value function and action-state value function, serve as the "objective/loss function" in training and as "evaluation metrics" in testing for RL. The training process of RL is a process of maximizing the objective function. Thus, such an objective function is usually constructed with the reward and the policy. In that way, the probabilities of actions can be adjusted to obtain more rewards (i.e., a higher objective function value). Because the policy is parameterized by $\theta$, the objective function is also parameterized by $\theta$. The training process aims to obtain the optimal parameter $\theta^{*}$, which maximizes the objective function:
\begin{equation*} \theta^{*}=\underset{\theta}{\arg \max }\, J(\theta) \tag{16.1} \end{equation*}
In the following, we will see that the development, implementation, and improvement of policy gradient methods can be divided into four steps or components:
  1. Propose an objective function to assess the effectiveness of the policy. We will see that many different objective functions can be used.
  2. Define a policy mathematically. For training, we will need to find out how to improve the policy by maximizing the objective function: the policy gradient theorem.
  3. Construct and implement an algorithm for training. That is, we will develop a procedure to improve the policy in the process of maximizing the objective function.
  4. Further improve the algorithm. Many issues cause difficulties in getting good training results with basic policy gradient methods. Different improvements have been proposed, leading to the different policy gradient methods/algorithms.
The rest of this chapter is organized to introduce these four components sequentially.

16.4. Objective Function and Policy Gradient Theorem

16.4.1. Objective Function

The objective function can be constructed in different ways. In the following, we will first show a very simple way to construct the objective function and derive the policy gradient theorem with it. Three more objective functions will be presented after that, based on which various policy gradient algorithms will be described.
In a simple way, we can use the expectation of the sum of rewards as the objective function. This expectation can be obtained by letting the learning agent generate a large number of trajectories. When this number is large enough, we can use the average of the total rewards to approximate this expectation.
\begin{equation*} J(\theta)=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r_{0}+r_{1}+\cdots+r_{t}\right]=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\sum_{t} r_{t}\right] \approx \frac{1}{N} \sum_{i=1}^{N}\left[\sum_{t} r\left(s_{i, t}, a_{i, t}\right)\right] \tag{16.2} \end{equation*}
where $i$ is the index of the trajectory $\tau$ obtained in episode $i$, and $t$ is the index of the step within this episode.
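As a minimal sketch of this approximation, and assuming the hypothetical `collect_trajectory` helper sketched earlier, $J(\theta)$ can be estimated by averaging the total reward over $N$ sampled trajectories:

```python
def estimate_objective(env, sample_action, n_trajectories=100):
    """Approximate J(theta) as the mean of the total rewards of N sampled trajectories."""
    total = 0.0
    for _ in range(n_trajectories):
        _, _, rewards = collect_trajectory(env, sample_action)
        total += sum(rewards)             # sum_t r(s_{i,t}, a_{i,t})
    return total / n_trajectories         # (1/N) * sum_i [ sum_t r(s_{i,t}, a_{i,t}) ]
```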
The above equation can be written in a continuous form:
\begin{equation*} J(\theta)=\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r(\tau)\right]=\int_{\tau} r(\tau)\, \pi_{\theta}(\tau)\, d\tau \tag{16.3} \end{equation*}
With the above definition, the training process in policy-based reinforcement learning tasks can be formulated as
\begin{equation*} \underset{\theta}{\arg \max }\, J(\theta)=\underset{\theta}{\arg \max }\, \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[r(\tau)\right]=\underset{\theta}{\arg \max } \int_{\tau} r(\tau)\, \pi_{\theta}(\tau)\, d\tau \tag{16.4} \end{equation*}
The RL implementation of the above equation is similar to the search for the minimum loss or maximum objective function in supervised learning. We can still use gradient-based optimization methods. However, there is one major difference: in reinforcement learning we seek the maximum of an objective function determined by the rewards, instead of the minimum of a loss determined by the difference between the predicted and true labels as in supervised learning. In gradient ascent, the parameters that maximize the objective function are approached via the following update:
\begin{equation*} \theta=\theta+\alpha \nabla_{\theta} J(\theta) \tag{16.5} \end{equation*}
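A one-line sketch of this update (with the parameters and gradient assumed to be NumPy arrays) makes the sign difference from gradient descent explicit:

```python
def gradient_ascent_step(theta, grad_J, alpha=0.01):
    # Move the parameters in the direction of the gradient of J (ascent, hence "+"),
    # whereas gradient descent on a loss would use "-".
    return theta + alpha * grad_J
```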

16.4.2. Policy Gradient Theorem

Now, the key is to derive an equation that can be easily used to calculate $\nabla_{\theta} J(\theta)$. This equation can be derived as follows:
\begin{align*} \nabla_{\theta} J(\theta) & =\int \nabla_{\theta}\left[\pi_{\theta}(\tau) \cdot r(\tau)\right] d\tau \\ & =\int \pi_{\theta}(\tau) \cdot \frac{\nabla_{\theta} \pi_{\theta}(\tau)}{\pi_{\theta}(\tau)} \cdot r(\tau)\, d\tau \tag{16.6}\\ & =\int \pi_{\theta}(\tau) \cdot \nabla_{\theta} \log \pi_{\theta}(\tau) \cdot r(\tau)\, d\tau \\ & =\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta} \log \pi_{\theta}(\tau) \cdot r(\tau)\right] \end{align*}
where $\log$ denotes the natural logarithm, i.e., $\ln$.
In the above equation, the gradient operator $\nabla_{\theta}$ is placed on $\pi_{\theta}(\tau) \cdot r(\tau)$, but it only takes effect on $\pi_{\theta}(\tau)$, because $r(\tau)$ is not a function of $\theta$. The key step in the above derivation is to factor a $\pi_{\theta}(\tau)$ out of the gradient operation, which enables the construction of an expectation.
Next, we need to recall the essential characteristic of an MDP, i.e., the Markov property [165], so that $\tau$ in the above equation can be related to individual state-action pairs for implementation. Under this property, the current state is only related to the state and action at the previous time step and is independent of the states and actions at other times. Then, the probability of a trajectory $\tau$ can be formulated as
\begin{equation*} \pi_{\theta}(\tau)=P_{\theta}\left(s_{0}, a_{0}, s_{1}, a_{1}, \cdots, s_{T}, a_{T}, s_{T+1}\right)=P\left(s_{0}\right) \prod_{t=0}^{T}\left[\pi_{\theta}\left(a_{t} \mid s_{t}\right) \cdot P\left(s_{t+1} \mid s_{t}, a_{t}\right)\right] \tag{16.7} \end{equation*}
Besides, the sum of the rewards obtained in the trajectory $\tau$ can be formulated as
\begin{equation*} r(\tau)=\sum_{t=0}^{T} r\left(s_{t}, a_{t}\right) \tag{16.8} \end{equation*}
Please be aware that, at this moment, we just sum up the rewards from one trajectory to get $r(\tau)$, and the discount factor has not been considered yet. The discount factor will be discussed in a later section on advanced policy gradient algorithms.
Substituting the above two equations into the $\nabla_{\theta} J(\theta)$ equation, we obtain
\begin{align*} \nabla_{\theta} J(\theta) & =\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta}\left(\log P\left(s_{0}\right)+\sum_{t=0}^{T}\left[\log \pi_{\theta}\left(a_{t} \mid s_{t}\right)+\log P\left(s_{t+1} \mid s_{t}, a_{t}\right)\right]\right) \cdot\left(\sum_{t=0}^{T} r\left(s_{t}, a_{t}\right)\right)\right] \\ & =\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\nabla_{\theta}\left(\sum_{t=0}^{T} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right) \cdot\left(\sum_{t=0}^{T} r\left(s_{t}, a_{t}\right)\right)\right] \tag{16.9}\\ & =\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}\left[\left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{t} \mid s_{t}\right)\right) \cdot\left(\sum_{t=0}^{T} r\left(s_{t}, a_{t}\right)\right)\right] \end{align*}
The above equation is called the policy gradient theorem. This equation for calculating the gradient of the objective function contains an expectation, which cannot be computed directly. Therefore, in implementations, we approximate it with the average over a sufficient number of sampled trajectories $\tau$:
\begin{equation*} \nabla_{\theta} J(\theta) \approx \underbrace{\frac{1}{N} \sum_{i=1}^{N}\Bigg[\underbrace{\left(\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}\left(a_{i, t} \mid s_{i, t}\right)\right)}_{\text {Policy: probability of } \tau} \cdot \underbrace{\left(\sum_{t=0}^{T} r\left(s_{i, t}, a_{i, t}\right)\right)}_{\text {Value: rewards of } \tau}\Bigg]}_{\text {Use mean to approximate expectation }} \tag{16.10} \end{equation*}
In addition to the use of the mean to approximate the expectation, there are two key parts: policy and reward (value). The former determines which action will be selected, and the latter tells what the value of the action (or the reward generated by the action) is. For the policy, we will need to find a way to mathematically formulate $\pi_{\theta}$ and, based on it, obtain a simple way to compute the gradient of (the $\log$ of) the policy. For the reward (value), in a simple way, we can directly implement the above equation by adding up the rewards for all the steps of a trajectory $\tau$. The part evaluating the value (reward) in the above equation is $\sum_{t=0}^{T} r\left(s_{i, t}, a_{i, t}\right)$. This part can be replaced by other equations for value evaluation, as listed in Table 16.1. In a later subsection, we will discuss more ways of evaluating the value, which lead to more advanced policy gradient algorithms.
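Before turning to those alternatives, the sketch below implements Eq. (16.10) directly for a hypothetical linear-Softmax policy with logits $\Theta s$, for which $\nabla_{\theta} \log \pi_{\theta}(a \mid s)$ has the closed form $\left(\mathbf{1}_{a}-\pi_{\theta}(\cdot \mid s)\right) s^{\top}$. The trajectories are assumed to be (states, actions, rewards) tuples such as those produced by the earlier `collect_trajectory` sketch; the resulting gradient estimate can then be fed into the gradient-ascent update of Eq. (16.5).

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for a linear-Softmax policy; theta: (n_actions, n_features)."""
    probs = softmax(theta @ s)
    one_hot = np.zeros(len(probs))
    one_hot[a] = 1.0
    return np.outer(one_hot - probs, s)

def policy_gradient_estimate(theta, trajectories):
    """Eq. (16.10): mean over trajectories of (sum_t grad log pi) * (sum_t r)."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        g_log = sum(grad_log_pi(theta, np.asarray(s), a) for s, a in zip(states, actions))
        grad += g_log * sum(rewards)      # policy term weighted by the trajectory's total reward
    return grad / len(trajectories)       # mean approximates the expectation
```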
Table 16.1: Different value evaluation functions in common policy gradient algorithms
| Equation for Value Evaluation | Algorithm |
| :--- | :--- |
| $G_{t}=\sum_{i=t}^{T} \gamma^{i-t} r_{i+1}$ | REINFORCE version 2 |
| $G_{t}-V_{w}\left(s_{t}\right)^{1}$ | REINFORCE with baseline |
| $Q_{w}\left(s_{t}, a_{t}\right)$ | Q Actor-Critic |
| $A_{w}\left(s_{t}, a_{t}\right)=Q_{w}\left(s_{t}, a_{t}\right)-V_{w}\left(s_{t}\right)$ | Advantage Actor-Critic |
| $\delta=r_{t}+V_{w}\left(s_{t+1}\right)-V_{w}\left(s_{t}\right)$ | Temporal-Difference Actor-Critic |
| $\delta(\lambda)^{2}$ | Eligibility Actor-Critic |

  1. $^{1}$ $V_{w}$ and $Q_{w}$ are approximations to $V^{\pi}$ and $Q^{\pi}$, respectively. They are parameterized by $w$, e.g., as linear functions such as $Q(s, a)=\phi(s, a) \cdot w$ or as deep neural networks, and are improved in the training process.
     $^{2}$ $\delta(\lambda)$ is the $\delta$ obtained using a $\mathrm{TD}(\lambda)$ scheme, which will be introduced later.
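As a minimal sketch of the linear approximation mentioned in footnote 1, with a toy feature map $\phi(s, a)$ (the concatenation of state features and a one-hot action encoding used here is purely an illustrative assumption), $Q_{w}(s, a)=\phi(s, a) \cdot w$ can be written as:

```python
import numpy as np

def phi(s, a, n_actions=3):
    # Toy feature map: concatenate the state features with a one-hot encoding of the action.
    return np.concatenate([np.asarray(s, dtype=float), np.eye(n_actions)[a]])

def q_linear(w, s, a):
    # Linear value approximation Q_w(s, a) = phi(s, a) . w, with learnable weights w.
    return float(np.dot(phi(s, a), w))

w = np.zeros(5)                    # e.g., 2 state features + 3 actions
print(q_linear(w, [0.5, -1.0], a=2))
```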