Notes on Reinforcement Learning
Preface
Reinforcement learning is a topic that cannot be avoided in the field of artificial intelligence. Today, as the 《人工智能与机器学习基础》 course got to reinforcement learning, I started this article on my blog, intending to expand it later.
This year's final project for 《人工智能与机器学习基础》 also happens to be about reinforcement learning, and I think it would be great to use the project as an opportunity to properly understand the principles of reinforcement learning.
On the Completeness and Factual Errors of This Article
The "SARSA" section of this article in fact describes the TD(0) algorithm, and other factual errors may exist as well; please read with caution.
Since the reinforcement learning assignment has ended, updates to this document may be paused for now.
Further improvement
I have discussed this blog post with ChatGPT, and it listed several areas for improvement; that list serves as the guideline for improving this article.
About the use of AI in this article
The contents, although not copied directly, are largely based on my conversations with ChatGPT, with the conversation links listed below:
Some resources:
Genres of RL
What is Reinforcement Learning?
- Inspired by the Law of Effect, a psychological principle in behavioral conditioning which states that "responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation"1—the idea of reinforcement
- Core components: Agent, Actions, Rewards, Environment, Feedback

Different axes along which RL methods can be classified:
- Value-Based \(\leftrightarrow\) Policy-Based (\(\leftrightarrow\) Actor-Critic)
- Model-Based \(\leftrightarrow\) Model-Free
- On-Policy \(\leftrightarrow\) Off-Policy
These families are characterized by their different parametrizations and optimization techniques.
In mainstream RL, the environment is usually modeled as an MDP (rather than a POMDP, due to the latter's complexity); in these methods, the problem can therefore be stated as:
Given an MDP, find a policy that maximizes cumulative reward.
RL Basics
Markov Decision Process
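For reference, a minimal sketch of the standard MDP notation used below (my own labeling, not fixed by the original text):
\[
\mathcal M = (\mathcal S, \mathcal A, P, r, \gamma), \qquad P(s'\mid s,a), \quad r(s,a,s'), \quad \gamma \in (0,1).
\]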
Monte-Carlo learning estimates the value function from sampled returns; its running-average update can be viewed as a direct Robbins-Monro approximation of the value function.
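For example, with sampled return \(G_t = \sum_{k\ge 0}\gamma^k r_{t+k}\) and step size \(\alpha\), the every-visit update can be written as:
\[
V(s_t) \leftarrow V(s_t) + \alpha\,\big[G_t - V(s_t)\big].
\]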
Bellman Operators
Bellman Expectation Operators:
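In standard notation, for the state-value and action-value functions:
\[
(\mathcal T^{\pi} V)(s) = \sum_{a}\pi(a\mid s)\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V(s')\big]
\]
\[
(\mathcal T^{\pi} Q)(s,a) = \sum_{s'}P(s'\mid s,a)\,\Big[r(s,a,s') + \gamma\sum_{a'}\pi(a'\mid s')\,Q(s',a')\Big]
\]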
Bellman Optimality Operators:
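Similarly:
\[
(\mathcal T^{\ast} V)(s) = \max_{a}\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V(s')\big]
\]
\[
(\mathcal T^{\ast} Q)(s,a) = \sum_{s'}P(s'\mid s,a)\,\Big[r(s,a,s') + \gamma\max_{a'}Q(s',a')\Big]
\]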
For \(\gamma \in (0, 1)\) these properties hold:
- \(\mathcal T^{\pi}\) and \(\mathcal T^{\ast}\) are \(\gamma\)-contractions2 in \(\|\cdot\|_\infty\)
- Each has a unique fixed point: \(V^{\pi}=\mathcal T^{\pi} V^{\pi}, \quad V^{\ast}=\mathcal T^{\ast} V^{\ast}\), which follows from the Banach fixed-point theorem
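Concretely, the contraction property reads (and the same bound holds for \(\mathcal T^{\ast}\)):
\[
\|\mathcal T^{\pi}U - \mathcal T^{\pi}V\|_\infty \le \gamma\,\|U - V\|_\infty \quad \text{for all } U, V.
\]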
Intro to Value-Based Methods
The optimal value satisfies:
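That is, \(V^{\ast}\) is the fixed point of the Bellman optimality operator:
\[
V^{\ast} = \mathcal T^{\ast}V^{\ast}, \qquad \text{i.e.}\quad V^{\ast}(s) = \max_{a}\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V^{\ast}(s')\big].
\]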
Since \(\mathcal T^\ast\) is a contraction, the iteration \(V_{k+1}=\mathcal T^{\ast}V_k\) converges; this is value iteration. In model-free settings, however, the expectations under \(P\) are unavailable, which motivates stochastic approximation.
Q-learning
Robbins-Monro stochastic approximation provides a methodology for solving a root-finding problem in which the function is given as an expectation. Assume we have a function \(M(\theta)\) and a constant \(\alpha\) such that the equation \(M(\theta) = \alpha\) has a unique root \(\theta^\ast\). We cannot observe \(M(\theta)\) directly, but we can obtain measurements of a random variable \(N(\theta)\) with \(\mathbb E[N(\theta)] = M(\theta)\). The structure of the algorithm is as follows:5
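In its basic form (for a nondecreasing \(M\)), the iteration is:
\[
\theta_{n+1} = \theta_n + a_n\,\big(\alpha - N(\theta_n)\big).
\]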
When \(\sum_{n=0}^\infty a_n=\infty\) and \(\sum_{n=0}^\infty a_n^2 < \infty\), and \(N(\theta)\) and \(M(\theta)\) satisfy suitable regularity conditions, \(\theta_n\) is guaranteed to converge to \(\theta^\ast\).
When we apply Robbins-Monro stochastic approximation to the fixed-point equation \(Q^\ast = \mathcal T^\ast Q^\ast\), we obtain:
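With a sampled transition \((s_t, a_t, r_t, s_{t+1})\) and step size \(\alpha_t\), the update reads:
\[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_t\Big[r_t + \gamma\max_{a'}Q(s_{t+1},a') - Q(s_t,a_t)\Big].
\]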
This is the Q-learning algorithm.
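As a minimal illustrative sketch (not taken from the original project), here is tabular Q-learning against a hypothetical toy environment `env` whose `reset()` returns an integer state and whose `step(a)` returns a `(next_state, reward, done)` tuple:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Robbins-Monro step toward the sampled optimality target.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that the target uses \(\max_{a'} Q(s_{t+1}, a')\) regardless of which action the \(\varepsilon\)-greedy behavior policy takes next, which is what makes Q-learning off-policy.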
SARSA
Similarly, a Robbins-Monro approximation of \(V^\pi = \mathcal T^\pi V^\pi\) yields the TD(0) update, and applying the same idea to the action-value fixed point \(Q^\pi = \mathcal T^\pi Q^\pi\), with the next action sampled from the current policy, yields SARSA:
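In sampled form, with step size \(\alpha_t\):
\[
V(s_t) \leftarrow V(s_t) + \alpha_t\big[r_t + \gamma V(s_{t+1}) - V(s_t)\big] \quad \text{(TD(0))}
\]
\[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_t\big[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\big], \quad a_{t+1}\sim\pi(\cdot\mid s_{t+1}) \quad \text{(SARSA)}
\]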
Intro to Policy-Based Methods
Policy iteration is a bit subtler than value iteration: it alternates between two Bellman-based steps.
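In standard notation, the two steps are:
\[
\text{(evaluation)}\quad V^{\pi_k} = \mathcal T^{\pi_k}V^{\pi_k}
\]
\[
\text{(improvement)}\quad \pi_{k+1}(s) = \arg\max_{a}\sum_{s'}P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V^{\pi_k}(s')\big]
\]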
Policy-based methods can be divided into:
- Methods that mimic a softened or approximate policy iteration, such as Actor-Critic;
- Methods that directly optimize \(J(\theta)\) via the policy gradient.
They are often used when the action space is large or continuous.
When the policy is not derived from an explicitly stored \(Q(s,a)\) but is instead parametrized by \(\theta\), the policy improvement step can be rewritten as maximizing an objective \(J(\theta)\) for a fixed critic \(Q^{\pi_k}\):
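One common form of such a surrogate objective (the state distribution \(d^{\pi}\) is my notation, not fixed by the original text):
\[
J(\theta) = \mathbb E_{s\sim d^{\pi}}\Big[\mathbb E_{a\sim\pi_\theta(\cdot\mid s)}\big[Q^{\pi_k}(s,a)\big]\Big], \qquad \theta_{k+1} = \arg\max_\theta J(\theta).
\]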
The gradient of \(J\) depends on the (policy-dependent) state distribution and is therefore intractable to compute directly; however, the Policy Gradient Theorem4 avoids differentiating the state distribution by exploiting the Markov structure of the trajectory distribution:
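In its standard form (up to a constant that depends on the discounting convention):
\[
\nabla_\theta J(\theta) \;\propto\; \mathbb E_{s\sim d^{\pi_\theta},\,a\sim\pi_\theta(\cdot\mid s)}\big[Q^{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a\mid s)\big].
\]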
This provides a tractable approximation of the policy improvement step.
Actor-Critic (AC)
Actor-Critic methods can be interpreted as approximate policy iteration:
- Critic: Robbins-Monro on Bellman expectation operator
- Actor: Gradient ascent on \(J(\theta)\)
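As a rough sketch of this structure (using the same hypothetical `reset()`/`step()` toy interface as the Q-learning example above, with tabular parameters for clarity):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def one_step_actor_critic(env, n_states, n_actions, episodes=500,
                          lr_actor=0.05, lr_critic=0.1, gamma=0.99):
    """Tabular one-step actor-critic: softmax actor, TD(0) critic."""
    theta = np.zeros((n_states, n_actions))  # actor parameters (logits)
    V = np.zeros(n_states)                   # critic (state values)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            probs = softmax(theta[s])
            a = int(np.random.choice(n_actions, p=probs))
            s_next, r, done = env.step(a)
            # Critic: Robbins-Monro step on the Bellman expectation target.
            td_error = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += lr_critic * td_error
            # Actor: gradient ascent on J(theta); for a softmax policy,
            # grad of log pi(a|s) w.r.t. theta[s] is (one_hot(a) - probs).
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            theta[s] += lr_actor * td_error * grad_log_pi
            s = s_next
    return theta, V
```

Here the TD error plays the role of a (noisy) advantage estimate, which is the same idea that A2C builds on.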
Advantage Actor-Critic (A2C)
A3C
Constrained RL
Maximum-Entropy RL
Entropy Regularization
Recall the objective in standard RL (the discounted cumulative reward):
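With trajectories \(\tau = (s_0, a_0, s_1, \dots)\) generated by \(\pi\):
\[
J(\pi) = \mathbb E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\,r(s_t,a_t)\Big].
\]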
Introduce the entropy-regularized objective3:
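In the form used in maximum-entropy RL (e.g. soft actor-critic), with temperature \(\alpha\) and \(\mathcal H\big(\pi(\cdot\mid s)\big) = -\sum_a\pi(a\mid s)\log\pi(a\mid s)\):
\[
J_{\text{MaxEnt}}(\pi) = \mathbb E_{\tau\sim\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r(s_t,a_t) + \alpha\,\mathcal H\big(\pi(\cdot\mid s_t)\big)\big)\Big].
\]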
With the entropy-augmentation term, we can define entropy-augmented value functions and the corresponding Bellman expectation operator:
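One standard way to write them (the \(\text{soft}\) subscript is my labeling):
\[
V^{\pi}_{\text{soft}}(s) = \mathbb E_{a\sim\pi}\big[Q^{\pi}_{\text{soft}}(s,a) - \alpha\log\pi(a\mid s)\big]
\]
\[
(\mathcal T^{\pi}_{\text{soft}}Q)(s,a) = r(s,a) + \gamma\,\mathbb E_{s'\sim P(\cdot\mid s,a)}\big[V^{\pi}_{\text{soft}}(s')\big].
\]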
Legendre-Fenchel Duality
The key step that clarifies policy-based methods is entropy-regularized policy improvement: with the entropy term, the optimum is unique and explicitly solvable via Lagrange multipliers, taking the form of a softmax. (For simplicity, these steps assume a discrete action space.)
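The resulting policy, in the discrete case:
\[
\pi^{\ast}(a\mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{\sum_{a'}\exp\big(Q(s,a')/\alpha\big)}.
\]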
This defines the soft Bellman optimality operator:
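In the same (assumed) notation as above:
\[
(\mathcal T^{\ast}_{\text{soft}}Q)(s,a) = r(s,a) + \gamma\,\mathbb E_{s'}\Big[\alpha\log\sum_{a'}\exp\big(Q(s',a')/\alpha\big)\Big].
\]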
From a variational perspective, the entire derivation can be summarized as:
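Spelled out, for each state \(s\):
\[
\alpha\log\sum_{a}\exp\big(Q(s,a)/\alpha\big) = \max_{\pi(\cdot\mid s)}\Big(\mathbb E_{a\sim\pi}\big[Q(s,a)\big] + \alpha\,\mathcal H\big(\pi(\cdot\mid s)\big)\Big),
\]
with the maximum attained by the softmax policy above,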
which is a Legendre-Fenchel duality between entropy and the log-sum-exp function.
Soft Value-Based Methods
When we solve the entropy-augmented Bellman optimality equation by stochastic approximation, exactly as in the value-based methods above, we obtain the soft Q-learning algorithm.
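A sketch of the corresponding one-step update, with step size \(\eta_t\) (written so as not to clash with the temperature \(\alpha\)):
\[
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \eta_t\Big[r_t + \gamma\,\alpha\log\sum_{a'}\exp\big(Q(s_{t+1},a')/\alpha\big) - Q(s_t,a_t)\Big].
\]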
The derived policy is called the Boltzmann (a.k.a. Gibbs) policy. Recall the Gibbs distribution from statistical mechanics:
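In a form sufficient for the analogy (units where the Boltzmann constant is 1):
\[
p(x) = \frac{e^{-E(x)/T}}{Z}, \qquad Z = \sum_{x}e^{-E(x)/T}.
\]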
The following correspondence exists:
| Statistical mechanics | RL |
|---|---|
| Energy \(E(x)\) | \(-Q(s,a)\) |
| Temperature \(T\) | \(\alpha\) |
| Partition function \(Z\) | \(\sum_a \exp\big(Q(s,a)/\alpha\big)\) |
RL with KL-constraints
- Wikipedia article: Law of Effect
- Berkeley handout: EE290 Lecture 17
- Wikipedia article: Robbins-Monro Algorithm