Notes on Reinforcement Learning
Preface
Reinforcement learning is a topic that cannot be avoided in the field of artificial intelligence. Today, as the 《人工智能与机器学习基础》 course got to reinforcement learning, I started this article on my blog, intending to expand it later.
This year's final project for 《人工智能与机器学习基础》 also happens to be about reinforcement learning, and I think it would be great to use the project as an opportunity to properly understand the principles of reinforcement learning.
On the Completeness and Factual Errors of This Article
The "SARSA" section of this article in fact describes the TD(0) algorithm, and other factual errors may exist as well; please read with caution.
Since the reinforcement learning assignment has ended, updates to this document may be paused for now.
Further improvement
I have discussed this blog post with ChatGPT, and it listed several areas for improvement; that list serves as the guideline for improving this article.
About the use of AI in this article
The contents, although not copied directly, are largely based on my conversations with ChatGPT, with the conversation links listed below:
Some resources:
Genres of RL
What is Reinforcement Learning?
- Inspired by the Law of Effect, a psychological principle in behavioral conditioning which states that "responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation"1—the idea of reinforcement
- Core components: Agent, Actions, Rewards, Environment, Feedback

Different axes along which RL methods can be classified:
- Value-Based \(\leftrightarrow\) Policy-Based (\(\leftrightarrow\) Actor-Critic)
- Model-Based \(\leftrightarrow\) Model-Free
- On-Policy \(\leftrightarrow\) Off-Policy
These families are characterized by their different parametrizations and optimization techniques.
In mainstream RL, the environment is usually modeled as an MDP (rather than a POMDP, due to the latter's complexity); in these methods, the problem can therefore be stated as:
Given an MDP, find a policy that maximizes cumulative reward.
RL Basics
Markov Decision Process
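For reference, a minimal sketch of the standard MDP notation used below (my own labeling, not fixed by the original text):
\[
\mathcal M = (\mathcal S, \mathcal A, P, r, \gamma), \qquad P(s'\mid s,a), \quad r(s,a,s'), \quad \gamma \in (0,1).
\]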
Monte-Carlo learning estimates the value function from sampled returns; its running-average update can be viewed as a direct Robbins-Monro approximation of the value function.
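For example, with sampled return \(G_t = \sum_{k\ge 0}\gamma^k r_{t+k}\) and step size \(\alpha\), the every-visit update can be written as:
\[
V(s_t) \leftarrow V(s_t) + \alpha\,\big[G_t - V(s_t)\big].
\]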
Bellman Operators
Bellman Expectation Operators:
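In standard notation, for the state-value and action-value functions:
\[
(\mathcal T^{\pi} V)(s) = \sum_{a}\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V(s')\big]
\]
\[
(\mathcal T^{\pi} Q)(s,a) = \sum_{s'}P(s'\mid s,a)\,\Big[r(s,a,s') + \gamma\sum_{a'}\pi(a'\mid s')\,Q(s',a')\Big]
\]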
Bellman Optimality Operators:
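Similarly:
\[
(\mathcal T^{\ast} V)(s) = \max_{a}\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V(s')\big]
\]
\[
(\mathcal T^{\ast} Q)(s,a) = \sum_{s'}P(s'\mid s,a)\,\Big[r(s,a,s') + \gamma\max_{a'}Q(s',a')\Big]
\]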
For \(\gamma \in (0, 1)\) these properties hold:
- \(\mathcal T^{\pi}\) and \(\mathcal T^{\ast}\) are \(\gamma\)-contractions2 in \(\|\cdot\|_\infty\)
- Each has a unique fixed point: \(V^{\pi}=\mathcal T^{\pi} V^{\pi}, \quad V^{\ast}=\mathcal T^{\ast} V^{\ast}\), which follows from the Banach fixed-point theorem
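Concretely, the contraction property reads (and the same bound holds for \(\mathcal T^{\ast}\)):
\[
\|\mathcal T^{\pi}U - \mathcal T^{\pi}V\|_\infty \le \gamma\,\|U - V\|_\infty \quad \text{for all } U, V.
\]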
Intro to Value-Based Methods
The optimal value satisfies:
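That is, \(V^{\ast}\) is the fixed point of the Bellman optimality operator:
\[
V^{\ast} = \mathcal T^{\ast}V^{\ast}, \qquad \text{i.e.}\quad V^{\ast}(s) = \max_{a}\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V^{\ast}(s')\big].
\]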
Since \(\mathcal T^\ast\) is a contraction, the iteration \(V_{k+1}=\mathcal T^{\ast}V_k\) converges; this is value iteration. In model-free settings, however, the expectations under \(P\) are unavailable, which motivates stochastic approximation.
Q-learning
Robbins-Monro stochastic approximation provides a methodology for solving a root-finding problem in which the function is given as an expectation. Assume we have a function \(M(\theta)\) and a constant \(\alpha\) such that the equation \(M(\theta) = \alpha\) has a unique root \(\theta^\ast\). We cannot observe \(M(\theta)\) directly, but we can obtain measurements of a random variable \(N(\theta)\) with \(\mathbb E[N(\theta)] = M(\theta)\). The structure of the algorithm is as follows:5
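In its basic form (for a nondecreasing \(M\)), the iteration is:
\[
\theta_{n+1} = \theta_n + a_n\,\big(\alpha - N(\theta_n)\big).
\]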
When \(\sum_{n=0}^\infty a_n=\infty\) and \(\sum_{n=0}^\infty a_n^2 < \infty\), and \(N(\theta)\) and \(M(\theta)\) satisfy suitable regularity conditions, \(\theta_n\) is guaranteed to converge to \(\theta^\ast\).
When we apply Robbins-Monro stochastic approximation to the fixed-point equation \(Q^\ast = \mathcal T^\ast Q^\ast\), we obtain:
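With a sampled transition \((s_t, a_t, r_t, s_{t+1})\) and step size \(\alpha_t\), the update reads:
\[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_t\Big[r_t + \gamma\max_{a'}Q(s_{t+1},a') - Q(s_t,a_t)\Big].
\]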
This is the Q-learning algorithm.
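As a minimal illustrative sketch (not taken from the original project), here is tabular Q-learning against a hypothetical toy environment `env` whose `reset()` returns an integer state and whose `step(a)` returns a `(next_state, reward, done)` tuple:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Robbins-Monro step toward the sampled optimality target.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that the target uses \(\max_{a'} Q(s_{t+1}, a')\) regardless of which action the \(\varepsilon\)-greedy behavior policy takes next, which is what makes Q-learning off-policy.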
SARSA
Similarly, a Robbins-Monro approximation of \(V^\pi = \mathcal T^\pi V^\pi\) yields the TD(0) update, and applying the same idea to the action-value fixed point \(Q^\pi = \mathcal T^\pi Q^\pi\), with the next action sampled from the current policy, yields SARSA:
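In sampled form, with step size \(\alpha_t\):
\[
V(s_t) \leftarrow V(s_t) + \alpha_t\big[r_t + \gamma V(s_{t+1}) - V(s_t)\big] \quad \text{(TD(0))}
\]
\[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_t\big[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big], \quad a_{t+1}\sim\pi(\cdot\mid s_{t+1}) \quad \text{(SARSA)}
\]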
Intro to Policy-Based Methods
Policy iteration is a bit subtler than value iteration: it alternates between two Bellman-based steps.
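In standard notation, the two steps are:
\[
\text{(evaluation)}\quad V^{\pi_k} = \mathcal T^{\pi_k}V^{\pi_k}
\]
\[
\text{(improvement)}\quad \pi_{k+1}(s) = \arg\max_{a}\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V^{\pi_k}(s')\big]
\]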
Policy-based methods can be divided into:
- Methods that mimic a softened or approximate policy iteration, such as Actor-Critic;
- Methods that directly optimize \(J(\theta)\) via the policy gradient.
They are often used when the action space is large or continuous.
When the policy is not derived from an explicitly stored \(Q(s,a)\) but is instead parametrized by \(\theta\), the policy improvement step can be rewritten as maximizing an objective \(J(\theta)\) for a fixed critic \(Q^{\pi_k}\):
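One common form of such a surrogate objective (the state distribution \(d^{\pi}\) is my notation, not fixed by the original text):
\[
J(\theta) = \mathbb E_{s\sim d^{\pi}}\Big[\mathbb E_{a\sim\pi_\theta(\cdot\mid s)}\big[Q^{\pi_k}(s,a)\big]\Big], \qquad \theta_{k+1} = \arg\max_\theta J(\theta).
\]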
The gradient of \(J\) depends on the (policy-dependent) state distribution and is therefore intractable to compute directly; however, the Policy Gradient Theorem4 avoids differentiating the state distribution by exploiting the Markov structure of the trajectory distribution:
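In its standard form (up to a constant that depends on the discounting convention):
\[
\nabla_\theta J(\theta) \;\propto\; \mathbb E_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta(\cdot\mid s)}\big[Q^{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a\mid s)\big].
\]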
This provides a tractable approximation of the policy improvement step.
Actor-Critic (AC)
Actor-Critic methods can be interpreted as approximate policy iteration:
- Critic: Robbins-Monro on Bellman expectation operator
- Actor: Gradient ascent on \(J(\theta)\)
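As a rough sketch of this structure (using the same hypothetical `reset()`/`step()` toy interface as the Q-learning example above, with tabular parameters for clarity):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_actor_critic(env, n_states, n_actions, episodes=500,
                          lr_actor=0.05, lr_critic=0.1, gamma=0.99):
    """Tabular one-step actor-critic: softmax actor, TD(0) critic."""
    theta = np.zeros((n_states, n_actions))  # actor parameters (logits)
    V = np.zeros(n_states)                   # critic (state values)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            probs = softmax(theta[s])
            a = int(np.random.choice(n_actions, p=probs))
            s_next, r, done = env.step(a)
            # Critic: Robbins-Monro step on the Bellman expectation target.
            td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += lr_critic * td_error
            # Actor: gradient ascent on J(theta); for a softmax policy,
            # grad of log pi(a|s) w.r.t. theta[s] is (one_hot(a) - probs).
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            theta[s] += lr_actor * td_error * grad_log_pi
            s = s_next
    return theta, V
```

Here the TD error plays the role of a (noisy) advantage estimate, which is the same idea that A2C builds on.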
Advantage Actor-Critic (A2C)
A3C
Constrained RL
Maximum-Entropy RL
Entropy Regularization
Recall the objective in standard RL (the discounted cumulative reward):
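With trajectories \(\tau = (s_0, a_0, s_1, \dots)\) generated by \(\pi\):
\[
J(\pi) = \mathbb E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\Big].
\]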
Introduce the entropy-regularized objective3:
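In the form used in maximum-entropy RL (e.g. soft actor-critic), with temperature \(\alpha\) and \(\mathcal H\big(\pi(\cdot\mid s)\big) = -\sum_a\pi(a\mid s)\log\pi(a\mid s)\):
\[
J_{\text{MaxEnt}}(\pi) = \mathbb E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r(s_t,a_t) + \alpha\,\mathcal H\big(\pi(\cdot\mid s_t)\big)\big)\Big].
\]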
With the entropy-augmentation term, we can define entropy-augmented value functions and the corresponding Bellman expectation operator:
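One standard way to write them (the \(\text{soft}\) subscript is my labeling):
\[
V^{\pi}_{\text{soft}}(s) = \mathbb E_{a\sim\pi}\big[Q^{\pi}_{\text{soft}}(s,a) - \alpha\log\pi(a\mid s)\big]
\]
\[
(\mathcal T^{\pi}_{\text{soft}}Q)(s,a) = r(s,a) + \gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}\big[V^{\pi}_{\text{soft}}(s')\big].
\]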
Legendre-Fenchel Duality
The key step that clarifies policy-based methods is entropy-regularized policy improvement: with the entropy term, the optimum is unique and explicitly solvable via Lagrange multipliers, taking the form of a softmax. (For simplicity, these steps assume a discrete action space.)
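The resulting policy, in the discrete case:
\[
\pi^{\ast}(a\mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{\sum_{a'}\exp\big(Q(s,a')/\alpha\big)}.
\]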
This defines the soft Bellman optimality operator:
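In the same (assumed) notation as above:
\[
(\mathcal T^{\ast}_{\text{soft}}Q)(s,a) = r(s,a) + \gamma\,\mathbb E_{s'}\Big[\alpha\log\sum_{a'}\exp\big(Q(s',a')/\alpha\big)\Big].
\]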
From a variational perspective, the entire derivation can be summarized as:
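Spelled out, for each state \(s\):
\[
\alpha\log\sum_{a}\exp\big(Q(s,a)/\alpha\big) = \max_{\pi(\cdot\mid s)}\Big(\mathbb E_{a\sim\pi}\big[Q(s,a)\big] + \alpha\,\mathcal H\big(\pi(\cdot\mid s)\big)\Big),
\]
with the maximum attained by the softmax policy above,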
which is a Legendre-Fenchel duality between entropy and the log-sum-exp function.
Soft Value-Based Methods
When we solve the entropy-augmented Bellman optimality equation by stochastic approximation, exactly as in the value-based methods above, we obtain the soft Q-learning algorithm.
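A sketch of the corresponding one-step update, with step size \(\eta_t\) (written so as not to clash with the temperature \(\alpha\)):
\[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \eta_t\Big[r_t + \gamma\,\alpha\log\sum_{a'}\exp\big(Q(s_{t+1},a')/\alpha\big) - Q(s_t,a_t)\Big].
\]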
The derived policy is called the Boltzmann (a.k.a. Gibbs) policy. Recall the Gibbs distribution from statistical mechanics:
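In a form sufficient for the analogy (units where the Boltzmann constant is 1):
\[
p(x) = \frac{e^{-E(x)/T}}{Z}, \qquad Z = \sum_{x}e^{-E(x)/T}.
\]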
The following correspondence exists:
| Statistical mechanics | RL |
|---|---|
| Energy \(E(x)\) | \(-Q(s,a)\) |
| Temperature \(T\) | \(\alpha\) |
| Partition function \(Z\) | \(\sum_a \exp\big(Q(s,a)/\alpha\big)\) |
RL with KL-constraints
- Wikipedia article: Law of Effect
- Berkeley handout: EE290 Lecture 17
- Wikipedia article: Robbins-Monro Algorithm