Reinforcement Learning Primer
Learning Objectives: Master the core concepts of reinforcement learning and its main algorithm categories, and lay a solid foundation for in-depth study of reinforcement learning.
This document aims to provide beginners with a systematic introduction to reinforcement learning. As an important branch of artificial intelligence, reinforcement learning has wide applications in game AI, robot control, recommendation systems, autonomous driving, and other fields.
Table of Contents
- 1. What is Reinforcement Learning?
- 2. Core Elements of Reinforcement Learning
- 3. Classification of Reinforcement Learning
- 4. Balancing Exploration and Exploitation
- 5. Bellman Equation: Theoretical Foundation
1. What is Reinforcement Learning?
Reinforcement learning is a branch of machine learning that focuses on how an Agent takes actions in an Environment to maximize cumulative Reward. Unlike supervised learning and unsupervised learning, reinforcement learning learns optimal decision-making strategies through continuous interaction with the environment.
Core Interaction Flow
graph TD
A["Agent"] --> B["Select Action"]
B --> C["Environment"]
C --> D["New State"]
C --> E["Reward"]
D --> A
E --> A
style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style C fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style D fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style E fill:#fff3e0,stroke:#e65100,stroke-width:2px
Real-Life Analogy: Training a Pet Dog
To better understand reinforcement learning, let's use the example of training a pet dog:
- Agent: Your pet dog, the learner who needs to acquire new skills
- Environment: The training ground, including room layout, furniture placement, and other external conditions
- State: The dog's current posture and position, e.g., "standing in the center of the living room"
- Action: The choices the dog can make, such as "sit," "shake," "spin"
- Reward: The feedback you give; a treat for doing it right (+1), no reward (0) or a gentle scolding (-1) for doing it wrong
- Policy: The "rules of behavior" the dog learns, i.e., what to do in specific situations
Core Definition: Learning from Interaction to Achieve Goals
The core idea of reinforcement learning is "Trial-and-Error". Unlike traditional supervised learning:
| Feature | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Learning Method | Learns from labeled data | Learns from environmental interaction |
| Feedback Type | Correct answers | Reward signals |
| Data Acquisition | Static datasets | Dynamic interaction process |
| Goal | Prediction accuracy | Maximizing long-term cumulative reward |
The agent is not directly told what to do but must discover which sequences of actions lead to the highest long-term returns through interaction with the environment. This learning method is closer to the natural learning process of humans and animals.
Unique Advantages of Reinforcement Learning
- Strong Adaptability: Can adapt to dynamically changing environments
- No Need for Large Labeled Datasets: Learns autonomously through interaction
- Goal-Oriented: Directly optimizes the ultimate goal, not intermediate metrics
- Sequential Decision-Making: Considers the long-term impact of actions, not just single-step optimality
2. Core Elements of Reinforcement Learning
A reinforcement learning system is a complex interactive system composed of the following six key components. Understanding these elements and their interrelationships is fundamental to mastering reinforcement learning.
Six Core Components Explained
1. Agent
Definition: The learner and decision-maker, the core of the entire system.
Characteristics:
- Possesses the ability to perceive environmental states
- Can perform actions to influence the environment
- Equipped with mechanisms to learn and improve policies
- Goal is to maximize long-term cumulative reward
Examples:
- Game AI (e.g., AlphaGo)
- Autonomous robots (e.g., robotic vacuum cleaners)
- Trading systems (e.g., stock trading algorithms)
- Decision-making systems for autonomous vehicles
2. Environment
Definition: The external world with which the agent interacts; the agent cannot fully control it but can influence it.
Properties:
- State Space: The set of all possible states
- Dynamism: States change according to the agent's actions and environmental laws
- Stochasticity: State transitions may be uncertain
- Partial Observability: The agent may not be able to observe the complete state of the environment
Classification:
- Deterministic Environments vs Stochastic Environments
- Fully Observable vs Partially Observable
- Single-Agent vs Multi-Agent
- Static vs Dynamic
3. State
Definition: A collection of information describing the current situation of the environment, denoted as $s_t \in \mathcal{S}$.
Important Concepts:
- Markov Property: Future states depend only on the current state, not on history
- State Representation: How to effectively represent and encode state information
- State Space Size: Discrete finite, discrete infinite, continuous
4. Action
Definition: An operation that the agent can execute, denoted as $a_t \in \mathcal{A}$.
Classification:
- Discrete Action Space: A finite number of optional actions (e.g., up, down, left, right in a game)
- Continuous Action Space: Actions with continuous values (e.g., robot joint angles)
- Mixed Action Space: Contains both discrete and continuous actions
5. Reward
Definition: The immediate feedback from the environment to the agent's action, denoted as $R_{t+1}$.
Design Principles:
- Sparse Rewards vs Dense Rewards
- Reward Shaping: Guiding learning through intermediate rewards
- Reward Function Design: Avoiding reward hacking
Mathematical Representation:
The reward function can be expressed as:
$$R(s, a) = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]$$
6. Policy
Definition: The agent's rule of behavior, defining what action should be taken in each state.
Mathematical Representation:
- Deterministic Policy: $a = \pi(s)$
- Stochastic Policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
Policy Types:
- Greedy Policy: Always selects the currently optimal action
- ε-Greedy Policy: Explores randomly with probability $\varepsilon$, otherwise acts greedily
- Softmax Policy: Selects actions according to a probability distribution based on action values
Standard Interaction Flow (MDP Framework)
The interaction process of reinforcement learning can be formalized as a Markov Decision Process (MDP):
sequenceDiagram
participant A as Agent
participant E as Environment
Note over A,E: Time t
A->>A: Observe state St
A->>A: Select action based on policy π
A->>E: Execute action At
E->>E: State transition
E->>A: Return new state St+1
E->>A: Return reward Rt+1
A->>A: Update policy/value function
Note over A,E: Loop continues...
Detailed Steps:
- State Perception: The agent observes the current state $S_t$
- Decision Making: Selects action $A_t$ according to policy $\pi$
- Action Execution: Executes the selected action in the environment
- Environment Response: The environment transitions to a new state $S_{t+1}$ and gives reward $R_{t+1}$
- Learning Update: The agent uses the new information to update its policy or value function
- Iterative Loop: Repeats the above process until a termination condition is met
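To make the loop concrete, here is a minimal sketch in Python. It assumes a hypothetical environment with `reset()` and `step(action)` methods (Gym-style) and an agent with `choose_action()` and `update()`; all of these names are illustrative, not part of any specific library.

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the standard agent-environment interaction loop."""
    state = env.reset()                      # 1. observe the initial state S_t
    for t in range(max_steps):
        action = agent.choose_action(state)  # 2. select A_t according to the current policy
        next_state, reward, done = env.step(action)  # 3-4. environment returns S_{t+1}, R_{t+1}
        agent.update(state, action, reward, next_state, done)  # 5. learning update
        state = next_state                   # 6. continue the loop from the new state
        if done:
            break
```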
Key Concepts Supplement
Value Functions
Although not direct interaction elements, value functions are important tools for understanding and optimizing policies:
- State-Value Function $V^\pi(s)$: The expected return obtained by starting from state $s$ and following policy $\pi$
- Action-Value Function $Q^\pi(s, a)$: The expected return obtained by starting from state $s$, taking action $a$, and then following policy $\pi$
Model (Optional Component)
Definition: A mathematical description of the environment's dynamics, including:
- Transition Probabilities: $P(s' \mid s, a)$, the probability of transitioning to state $s'$ after taking action $a$ in state $s$
- Reward Function: $R(s, a)$, the corresponding expected reward
Applications:
- Model-Based Methods: Use models for planning and searching
- Model-Free Methods: Learn directly from experience, no explicit model needed
3. Classification of Reinforcement Learning
There are many types of reinforcement learning algorithms. Understanding the classification and characteristics of different algorithms helps us choose appropriate methods for specific problems. Below are the main classification dimensions and representative algorithms.
Algorithm Classification Panorama
graph TD
A["Reinforcement Learning Algorithms"] --> B["Model-Based"]
A --> C["Model-Free"]
B --> B1["Dynamic Programming"]
B --> B2["Search Algorithms"]
B --> B3["Sample-Based Planning"]
C --> C1["Value-Based"]
C --> C2["Policy-Based"]
C --> C3["Actor-Critic"]
B1 --> B11["Policy Iteration
Value Iteration"]
B2 --> B21["MCTS
AlphaGo"]
B3 --> B31["Dyna-Q
PETS"]
C1 --> C11["Tabular Methods
Q-Learning
SARSA"]
C1 --> C12["Deep Methods
DQN
Double DQN
Dueling DQN"]
C2 --> C21["Policy Gradient
REINFORCE
TRPO
PPO"]
C2 --> C22["Evolutionary Strategies
ES
CMA-ES"]
C3 --> C31["Traditional AC
A2C
A3C"]
C3 --> C32["Advanced AC
SAC
TD3
DDPG"]
style A fill:#f9f,stroke:#333,stroke-width:3px
style B fill:#bbf,stroke:#333,stroke-width:2px
style C fill:#fbf,stroke:#333,stroke-width:2px
3.1 Model-Based vs Model-Free Learning
This is the most fundamental and important classification dimension in reinforcement learning.
Model-Based RL
Core Idea: The agent first learns a model of the environment's dynamics, and then uses this model for planning and decision-making.
Advantages:
- High Data Efficiency: Can generate virtual experiences through the model
- Strong Planning Capability: Can prospectively evaluate the consequences of actions
- Good Generalization: The model can generalize to unseen states
Disadvantages:
- Model Bias: Inaccurate models can lead to suboptimal policies
- High Computational Complexity: Planning processes are usually computationally intensive
- Hard to Model: Complex environments are difficult to model accurately
Representative Algorithms:
- Dynamic Programming: Policy Iteration, Value Iteration
- Monte Carlo Tree Search (MCTS): AlphaGo, AlphaZero
- Model-Based Deep RL: Dyna-Q, PETS, MuZero
Application Scenarios:
- Board games (clear rules, easy to model)
- Robot control (physical models are relatively well understood)
- Financial trading (rich historical data)
Model-Free RL
Core Idea: The agent does not learn an environment model, but directly learns value functions or policies from experience.
Advantages:
- Simple to Implement: No need to model environment dynamics
- Wide Applicability: Suitable for complex, hard-to-model environments
- Strong Robustness: Not affected by model errors
Disadvantages:
- High Data Requirements: Requires a large amount of environmental interaction
- Low Sample Efficiency: Relatively slow learning speed
- Lack of Planning: Cannot make prospective decisions
3.2 Three Major Schools of Model-Free Learning
Value-Based Learning
Core Idea: Learn the value function of states or state-action pairs, and then select the optimal action based on the value.
Mathematical Foundation:
- State-value function: $V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$
- Action-value function: $Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$
Decision Rule: $a^* = \arg\max_a Q(s, a)$
Representative Algorithms:
- Tabular Methods: Q-Learning, SARSA, Expected SARSA
- Function Approximation: DQN, Double DQN, Dueling DQN, Rainbow DQN
Application Scenarios:
- Discrete action spaces
- Scenarios requiring deterministic policies
- Situations where sample efficiency is not extremely high
Policy-Based Learning
Core Idea: Directly learn a parameterized policy function and optimize policy parameters using policy gradient methods.
Mathematical Foundation:
Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right]$$
Advantages:
- Continuous Action Spaces: Naturally applicable to continuous control
- Stochastic Policies: Can learn stochastic policies
- Expressive Policies: Can represent complex policies
Representative Algorithms:
- Basic Methods: REINFORCE, Actor-Only
- Advanced Methods: TRPO, PPO, A3C
- Evolutionary Methods: Evolution Strategies (ES)
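To make the policy gradient theorem concrete, below is a minimal REINFORCE sketch for the simplest possible case: a stateless softmax policy over discrete actions (a bandit). The function `pull_arm(a)` is a hypothetical reward source supplied by the caller; this illustrates the update direction $\nabla_\theta \log \pi_\theta(a) \cdot G$, not a full implementation of the algorithms listed above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_bandit(pull_arm, n_actions, episodes=5000, alpha=0.1):
    """Minimal REINFORCE for a stateless softmax policy (bandit setting)."""
    theta = np.zeros(n_actions)        # policy parameters (one logit per action)
    for _ in range(episodes):
        probs = softmax(theta)
        a = np.random.choice(n_actions, p=probs)
        G = pull_arm(a)                # return of this one-step episode
        grad_log_pi = -probs           # d log pi(a) / d theta = one_hot(a) - pi
        grad_log_pi[a] += 1.0
        theta += alpha * G * grad_log_pi   # gradient ascent on expected return
    return softmax(theta)
```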
Actor-Critic
Core Idea: Combines the advantages of value learning and policy learning, using two networks to learn the policy and value function separately.
Architecture Design:
- Actor: Learns the policy $\pi_\theta(a \mid s)$ and is responsible for action selection
- Critic: Learns the value function $V(s)$ or $Q(s, a)$ and is responsible for value evaluation
Advantages:
- Lower Variance: The Critic provides a baseline, reducing the variance of the policy gradient
- Lower Bias: The Actor directly optimizes the policy, avoiding bias from the value function
- Strong Adaptability: Applicable to both discrete and continuous action spaces
Representative Algorithms:
- Synchronous Methods: A2C, PPO, TRPO
- Asynchronous Methods: A3C, IMPALA
- Deterministic Methods: DDPG, TD3, SAC
3.3 Classification by Learning Update Method
Monte Carlo Method (MC)
Characteristics:
- Requires complete episodes for updates
- Uses the actual return $G_t$ for learning
- Unbiased estimate, but higher variance
Update Formula:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$
Application Scenarios:
- Environments with short episodes
- Scenarios requiring unbiased estimates
Temporal Difference Method (TD)
Characteristics:
- Can update at each step, no need to wait for the episode to end
- Uses bootstrapping, i.e., updates estimates with other estimates
- Biased but with smaller variance
TD(0) Update Formula:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Variants:
- TD(λ): Method combining multi-step returns
- Q-Learning: Off-policy TD method
- SARSA: On-policy TD method
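The two update rules above differ only in the target that $V$ is moved toward. The sketch below shows both side by side; it assumes tabular value estimates stored in a `defaultdict(float)` and an episode recorded as a list of `(state, reward)` pairs (illustrative names, not a fixed API).

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait for the episode to end, then use the actual return G_t."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(S_t, R_{t+1}), ...]
        G = reward + gamma * G                # accumulate the discounted return backwards
        V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): bootstrap from the estimate of the next state after every single step."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```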
3.4 Classification by Policy Update Method
On-Policy vs Off-Policy
On-Policy:
- The learning policy is the same as the data-generating policy
- Representative algorithms: SARSA, A2C, PPO
- Advantages: Good stability, guaranteed convergence
- Disadvantages: Relatively low data utilization efficiency
Off-Policy:
- The learning policy is different from the data-generating policy
- Representative algorithms: Q-Learning, DQN, DDPG
- Advantages: High data utilization efficiency, can reuse historical data
- Disadvantages: May have distribution shift problems
Algorithm Selection Guide
| Environment Feature | Recommended Algorithm Type | Representative Algorithms |
|---|---|---|
| Discrete action space | Value-Based | DQN, Rainbow |
| Continuous action space | Actor-Critic | PPO, SAC, TD3 |
| High-dimensional state space | Deep RL | DQN, A3C, PPO |
| High sample efficiency required | Model-Based | MCTS, MuZero |
| Stochastic policies required | Policy-Based | PPO, SAC |
| Multi-agent environments | Specialized Algorithms | MADDPG, QMIX |
4. Balancing Exploration and Exploitation
This is one of the most central dilemmas in reinforcement learning, also known as the Exploration-Exploitation Trade-off. The essence of this problem is: how should an agent balance acquiring new information (exploration) and utilizing existing knowledge (exploitation)?
Core Concept Comparison
graph LR
A["Agent faces a choice"] --> B["Exploitation"]
A --> C["Exploration"]
B --> B1["Choose known optimal action"]
B --> B2["Get predictable rewards"]
B --> B3["Risk: May miss better choices"]
C --> C1["Try unknown actions"]
C --> C2["May discover better policies"]
C --> C3["Risk: May receive lower rewards"]
style B fill:#e8f5e8,stroke:#2e7d32
style C fill:#fff3e0,stroke:#f57c00
Real-Life Analogy: Restaurant Selection Problem
You are in a new city with many restaurants:
Exploitation Strategy
- Behavior: Always go to the best restaurant you've already tried
- Advantages: Guaranteed satisfactory dining experience
- Disadvantages: May never discover better restaurants
- Mindset: Seek stability, avoid risk
Exploration Strategy
- Behavior: Try new, untried restaurants
- Advantages: Opportunity to discover unexpected surprises
- Disadvantages: May encounter bad food, wasting time and money
- Mindset: Take risks, pursue better solutions
Mathematical Description of the Exploration-Exploitation Dilemma
In the Multi-Armed Bandit problem, this dilemma can be formalized as:
Given the (unknown) true action values $Q(a) = \mathbb{E}[r \mid a]$ and the optimal value $V^* = \max_a Q(a)$:
- Exploitation: Choose the action that currently looks best, $A_t = \arg\max_a \hat{Q}_t(a)$
- Exploration: Choose a non-greedy action to improve the estimates
Regret is defined as:
$$l_t = \mathbb{E}\left[ V^* - Q(A_t) \right]$$
Total regret is:
$$L_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( V^* - Q(A_t) \right) \right]$$
A good exploration strategy keeps the total regret growing as slowly as possible.
Main Exploration Strategies
1. ε-Greedy Policy
Core Idea: Explores randomly with probability $\varepsilon$ and exploits the current best action with probability $1 - \varepsilon$.
Algorithm Description:
if random() < ε: action = random_action()   # explore
else: action = argmax_a Q(s, a)             # exploit
Mathematical Representation:
$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
Variants:
- Decaying ε-Greedy: $\varepsilon_t = \varepsilon_0 / t$ or $\varepsilon_t = \varepsilon_0 \cdot \lambda^t$ with decay rate $\lambda < 1$
- Adaptive ε-Greedy: Adjusts $\varepsilon$ based on uncertainty
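A minimal runnable sketch of ε-greedy action selection with an exponential decay schedule; `Q_s` (the array of action-value estimates for the current state) and the schedule constants are illustrative.

```python
import numpy as np

def epsilon_greedy(Q_s, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.
    Q_s: 1-D array of action-value estimates for the current state."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q_s))        # explore
    return int(np.argmax(Q_s))                    # exploit

# Decaying schedule: start exploratory, become increasingly greedy over time
epsilon, decay, min_eps = 1.0, 0.995, 0.05
for step in range(1000):
    # action = epsilon_greedy(Q[state], epsilon)  # Q and state come from the learning loop
    epsilon = max(min_eps, epsilon * decay)
```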
2. Softmax/Boltzmann Policy
Core Idea: Selects actions based on a probability distribution of action values; actions with higher values have a higher probability of being chosen.
Mathematical Representation:
$$\pi(a) = \dfrac{\exp\left( Q(a) / \tau \right)}{\sum_{b} \exp\left( Q(b) / \tau \right)}$$
Where $\tau$ is the temperature parameter:
- $\tau \to 0$: Approaches a greedy policy
- $\tau \to \infty$: Approaches a uniform random policy
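A small sketch of Boltzmann (softmax) action selection; the `q_values` and temperature used in the example call are illustrative.

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q(a)/tau)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

# Higher-valued actions are chosen more often, but never exclusively
print(softmax_action([1.0, 2.0, 0.5], tau=0.5))
```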
3. Upper Confidence Bound (UCB)
Core Idea: Selects the action with the highest "optimistic estimate," which adds an upper bound on the uncertainty to the value estimate.
UCB1 Formula:
$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\dfrac{\ln t}{N_t(a)}} \right]$$
Where:
- $Q_t(a)$: Average reward received so far for action $a$
- $N_t(a)$: Number of times action $a$ has been selected
- $c$: Confidence parameter controlling the amount of exploration
Intuitive Understanding:
- First term: Utilizes the current best estimate
- Second term: Explores actions with high uncertainty
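A compact UCB1 sketch for a multi-armed bandit; `pull_arm(a)` is a hypothetical function returning the reward of arm `a`.

```python
import numpy as np

def ucb1(pull_arm, n_arms, total_steps=1000, c=2.0):
    """UCB1: play each arm once, then always pick the arm with the highest
    optimistic estimate Q(a) + c * sqrt(ln t / N(a))."""
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)                 # running average reward per arm
    for t in range(1, total_steps + 1):
        if t <= n_arms:
            a = t - 1                         # initialization: try every arm once
        else:
            ucb = values + c * np.sqrt(np.log(t) / counts)
            a = int(np.argmax(ucb))
        reward = pull_arm(a)
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]   # incremental mean update
    return values, counts
```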
4. Thompson Sampling
Core Idea: Maintains a probability distribution for the value of each action and selects actions based on sampling results.
Algorithm Flow:
- Maintain a probability distribution over the value of each action (e.g., a Beta posterior for Bernoulli rewards)
- Sample a value $\tilde{\theta}_a$ from each action's distribution
- Select $A_t = \arg\max_a \tilde{\theta}_a$
- Update the distribution of the selected action based on the received reward
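A minimal Thompson sampling sketch for Bernoulli rewards using Beta posteriors; `pull_arm(a)` is again a hypothetical function, here assumed to return 0 or 1.

```python
import numpy as np

def thompson_sampling(pull_arm, n_arms, total_steps=1000):
    """Beta-Bernoulli Thompson sampling: keep a Beta(successes+1, failures+1)
    posterior per arm, sample from each, and play the arm with the largest sample."""
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    for _ in range(total_steps):
        samples = np.random.beta(successes + 1, failures + 1)   # one draw per arm
        a = int(np.argmax(samples))
        reward = pull_arm(a)                  # assumed to be 0 or 1
        successes[a] += reward
        failures[a] += 1 - reward
    return successes / np.maximum(successes + failures, 1)       # empirical means
```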
5. Count-Based Exploration
Core Idea: Encourages visiting less-visited state-action pairs.
Reward Shaping (a common form of count-based bonus):
$$r^{+}(s, a) = r(s, a) + \dfrac{\beta}{\sqrt{N(s, a)}}$$
Where $N(s, a)$ is the number of times the state-action pair has been visited and $\beta$ controls the strength of the exploration bonus.
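As a sketch, the bonus can simply be added on top of the environment reward before the usual update; the counter and coefficient names below are illustrative.

```python
from collections import defaultdict
import math

visit_counts = defaultdict(int)
beta = 0.1

def shaped_reward(reward, state, action):
    """Environment reward plus a count-based exploration bonus."""
    visit_counts[(state, action)] += 1
    return reward + beta / math.sqrt(visit_counts[(state, action)])
```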
Exploration Strategy Comparison
graph TD
A["Exploration Strategy Selection"] --> B["Environment Type"]
A --> C["Performance Requirements"]
A --> D["Implementation Complexity"]
B --> B1["Stochastic Environments
→ UCB, Thompson"]
B --> B2["Deterministic Environments
→ ε-greedy"]
B --> B3["Non-stationary Environments
→ Decaying ε-greedy"]
C --> C1["Fast Convergence
→ UCB"]
C --> C2["Balanced Performance
→ Softmax"]
C --> C3["Simple and Effective
→ ε-greedy"]
D --> D1["Simple Implementation
→ ε-greedy"]
D --> D2["Medium Complexity
→ Softmax"]
D --> D3["Advanced Methods
→ Thompson"]
Exploration Strategy Performance Comparison
| Strategy | Theoretical Guarantee | Implementation Difficulty | Application Scenario | Convergence Speed |
|---|---|---|---|---|
| ε-Greedy | General | Simple | General | Medium |
| Decaying ε-Greedy | Better | Simple | Non-stationary | Faster |
| Softmax | General | Medium | Continuous Values | Medium |
| UCB | Excellent | Medium | Stochastic | Fast |
| Thompson Sampling | Excellent | Complex | Bayesian Setting | Fast |
Exploration in Deep Reinforcement Learning
In Deep RL, exploration becomes more complex because the state space is huge, and traditional counting methods are no longer applicable.
Intrinsic Motivation
- Curiosity-Driven: ICM (Intrinsic Curiosity Module)
- Random Network Distillation: RND (Random Network Distillation)
- NGU: Never Give Up
Parameter Space Exploration
- Parameter Noise: Adding noise to network parameters
- NoisyNet: Learnable noisy networks
Practical Advice
- More Exploration Initially: Increase exploration probability at the beginning of learning
- More Exploitation Later: Decrease exploration during convergence
- Environment Adaptation: Choose appropriate strategy based on environment characteristics
- Monitor Metrics: Track exploration rate and performance metrics
- Hyperparameter Tuning: Carefully adjust exploration-related hyperparameters
Summary
Balancing exploration and exploitation is key to successful reinforcement learning. Without exploration, the agent may fall into local optima; without exploitation, the agent cannot effectively use learned knowledge. Choosing the right exploration strategy requires considering environment characteristics, performance requirements, and implementation complexity.
5. The Bellman Equation: Theoretical Foundation of Reinforcement Learning
The Bellman equation is the theoretical cornerstone of reinforcement learning. Almost all reinforcement learning algorithms are built upon this elegant mathematical framework. It reveals the recursive structure of value functions, providing fundamental insights for understanding and solving reinforcement learning problems.
Mathematical Basis: Return and Value Function
Definition of Return
The return $G_t$ is the total discounted reward from time step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Discount Factor
The discount factor $\gamma \in [0, 1]$ determines how much the agent values future rewards relative to immediate ones:
graph LR
A["γ = 0
Completely Short-sighted"] --> B["γ = 0.5
Balanced Consideration"]
B --> C["γ = 0.9
Quite Far-sighted"]
C --> D["γ = 1
Completely Far-sighted"]
style A fill:#ffcdd2
style B fill:#fff3e0
style C fill:#e8f5e8
style D fill:#e3f2fd
Role of the Discount Factor:
- Mathematical Convergence: Ensures convergence of infinite sequences
- Uncertainty Modeling: The further into the future, the more uncertain
- Practical Significance: Reflects the time value (e.g., interest rate)
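A tiny sketch of computing the discounted return from a list of rewards, which makes the role of γ concrete (names and values are illustrative).

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed by scanning the rewards backwards."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```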
Definition of Value Functions
State-Value Function
Definition: The expected return obtained by starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$
Action-Value Function
Definition: The expected return obtained by starting from state $s$, taking action $a$, and then following policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$
Relationship Between the Two
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \, Q^\pi(s, a)$$
Bellman Equation Derivation and Intuition
Core Insight: Recursive Decomposition
Bellman's ingenious insight is that value can be recursively decomposed into the immediate reward plus the discounted value of subsequent states.
graph TD
A["Value of current state s"] --> B["Expected immediate reward"]
A --> C["+ γ × Expected value of next state"]
B --> B1["Select action based on policy π"]
B --> B2["Get immediate reward R(s,a)"]
C --> C1["Transition to next state s'"]
C --> C2["Value of that state V(s')"]
style A fill:#e1f5fe
style B fill:#e8f5e8
style C fill:#fff3e0
Derivation of the Bellman Expectation Equation
Starting from the definition of the value function:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$
Expanding the definition of the return:
$$V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$$
Using the linearity of expectation:
$$V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) \mid S_t = s \right] = \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]$$
Noting that $\mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} \right] = V^\pi(S_{t+1})$, we obtain:
$$V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s \right]$$
Full Form of the Bellman Equation
Bellman Expectation Equation for the State-Value Function
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$
Intuitive Understanding:
- Outer summation: Iterates over all possible actions, weighted by $\pi(a \mid s)$, the probability of selecting action $a$
- Inner summation: Iterates over all possible next states, weighted by $P(s' \mid s, a)$, the state transition probability
- Bracketed term: Immediate reward plus discounted future value
Bellman Expectation Equation for the Action-Value Function
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') \, Q^\pi(s', a') \right]$$
Or, equivalently:
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$
Bellman Optimality Equations
For the optimal policy $\pi^*$, the corresponding value functions satisfy:
Optimal State-Value Function
$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$
Optimal Action-Value Function
$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
Applications of the Bellman Equation
1. Dynamic Programming Algorithms
Policy Evaluation: repeat until convergence, for every state $s$:
$$V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$
Value Iteration: repeat until convergence, for every state $s$:
$$V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$
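A runnable value iteration sketch for a small tabular MDP; the transition structure `P[s][a] = [(prob, next_state, reward), ...]` is an illustrative layout, not a required format.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Iterate the Bellman optimality backup until the value function stops changing.
    P[s][a] is a list of (prob, next_state, reward) triples."""
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            V_new[s] = max(q)                 # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```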
2. Temporal Difference Learning
TD(0) Update Rule:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
3. Q-Learning Algorithm
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
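A minimal tabular Q-learning sketch built on the Bellman optimality backup; the environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`) is an assumed Gym-style convention, and all names are illustrative.

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)          # explore
            else:
                action = int(np.argmax(Q[state]))              # exploit
            next_state, reward, done = env.step(action)
            # Bellman optimality target: r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```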
Geometric Interpretation of the Bellman Equation
The Bellman equation can be viewed as a Contraction Mapping in the value function space:
graph TD
A["Value Function Space"] --> B["Bellman Operator T"]
B --> C["New Value Function"]
C --> D["Fixed Point = True Value Function"]
B --> B1["T[V](s) = max_a Σ_s' P(s'|s,a)[R + γV(s')]"]
style D fill:#e8f5e8
Contraction Property:
$$\left\| T V_1 - T V_2 \right\|_\infty \le \gamma \left\| V_1 - V_2 \right\|_\infty$$
For $\gamma < 1$, this guarantees the convergence of the value iteration algorithm to a unique fixed point.
Practical Significance and Intuition
- Recursive Structure: Complex problems are decomposed into subproblems
- Dynamic Programming Principle: Optimal substructure property of optimal decisions
- Temporal Consistency: Current optimal decision considers future optimality
- Algorithmic Foundation: Theoretical basis for almost all RL algorithms
Summary
The Bellman equation is not just an elegant mathematical expression but also the key to understanding the essence of reinforcement learning. It tells us:
- Value functions have a recursive structure
- Current decisions should consider long-term impacts
- Optimal policies satisfy the optimal substructure property of dynamic programming
- The true value function can be approximated through iterative solutions
Mastering the Bellman equation is a necessary path to deeply understanding reinforcement learning algorithms.
Final Words: The charm of reinforcement learning lies in how it mirrors the essential biological learning process: constantly improving through trial, error, and feedback. Just as humans learn to ride a bicycle, reinforcement learning allows machines to gain intelligence through interaction with their environment. May this guide provide a solid starting point for your reinforcement learning journey!