Reinforcement Learning Primer
Learning Objectives: Master the core concepts of reinforcement learning and its main algorithm categories, and lay a solid foundation for in-depth study of reinforcement learning.
This document aims to provide beginners with a systematic introduction to reinforcement learning. As an important branch of artificial intelligence, reinforcement learning has wide applications in game AI, robot control, recommendation systems, autonomous driving, and other fields.
Table of Contents
- 1. What is Reinforcement Learning?
- 2. Core Elements of Reinforcement Learning
- 3. Classification of Reinforcement Learning
- 4. Balancing Exploration and Exploitation
- 5. Bellman Equation: Theoretical Foundation
1. What is Reinforcement Learning?
Reinforcement learning is a branch of machine learning that focuses on how an Agent takes actions in an Environment to maximize cumulative Reward. Unlike supervised learning and unsupervised learning, reinforcement learning learns optimal decision-making strategies through continuous interaction with the environment.
Core Interaction Flow
graph TD
A["Agent"] --> B["Select Action"]
B --> C["Environment"]
C --> D["New State"]
C --> E["Reward"]
D --> A
E --> A
style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px
style C fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
style D fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
style E fill:#fff3e0,stroke:#e65100,stroke-width:2px
Real-Life Analogy: Training a Pet Dog
To better understand reinforcement learning, let's use the example of training a pet dog:
- Agent: Your pet dog, the learner who needs to acquire new skills
- Environment: The training ground, including room layout, furniture placement, and other external conditions
- State: The dog's current posture and position, e.g., "standing in the center of the living room"
- Action: The choices the dog can make, such as "sit," "shake," "spin"
- Reward: The feedback you give; a treat for doing it right (+1), no reward (0) or a gentle scolding (-1) for doing it wrong
- Policy: The "rules of behavior" the dog learns, i.e., what to do in specific situations
Core Definition: Learning from Interaction to Achieve Goals
The core idea of reinforcement learning is "Trial-and-Error". Unlike traditional supervised learning:
| Feature | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Learning Method | Learns from labeled data | Learns from environmental interaction |
| Feedback Type | Correct answers | Reward signals |
| Data Acquisition | Static datasets | Dynamic interaction process |
| Goal | Prediction accuracy | Maximizing long-term cumulative reward |
The agent is not directly told what to do but must discover which sequences of actions lead to the highest long-term returns through interaction with the environment. This learning method is closer to the natural learning process of humans and animals.
Unique Advantages of Reinforcement Learning
- Strong Adaptability: Can adapt to dynamically changing environments
- No Need for Large Labeled Datasets: Learns autonomously through interaction
- Goal-Oriented: Directly optimizes the ultimate goal, not intermediate metrics
- Sequential Decision-Making: Considers the long-term impact of actions, not just single-step optimality
2. Core Elements of Reinforcement Learning
A reinforcement learning system is a complex interactive system composed of the following six key components. Understanding these elements and their interrelationships is fundamental to mastering reinforcement learning.
Six Core Components Explained
1. Agent
Definition: The learner and decision-maker, the core of the entire system.
Characteristics:
- Possesses the ability to perceive environmental states
- Can perform actions to influence the environment
- Equipped with mechanisms to learn and improve policies
- Goal is to maximize long-term cumulative reward
Examples:
- Game AI (e.g., AlphaGo)
- Autonomous robots (e.g., robotic vacuum cleaners)
- Trading systems (e.g., stock trading algorithms)
- Decision-making systems for autonomous vehicles
2. Environment
Definition: The external world with which the agent interacts; the agent cannot fully control it but can influence it.
Properties:
- State Space: The set of all possible states
- Dynamism: States change according to the agent's actions and environmental laws
- Stochasticity: State transitions may be uncertain
- Partial Observability: The agent may not be able to observe the complete state of the environment
Classification:
- Deterministic Environments vs Stochastic Environments
- Fully Observable vs Partially Observable
- Single-Agent vs Multi-Agent
- Static vs Dynamic
3. State
Definition: A collection of information describing the current situation of the environment, denoted as $s_t \in \mathcal{S}$.
Important Concepts:
- Markov Property: Future states depend only on the current state, not on history
- State Representation: How to effectively represent and encode state information
- State Space Size: Discrete finite, discrete infinite, continuous
4. Action
Definition: An operation that the agent can execute, denoted as $a_t \in \mathcal{A}$.
Classification:
- Discrete Action Space: A finite number of optional actions (e.g., up, down, left, right in a game)
- Continuous Action Space: Actions with continuous values (e.g., robot joint angles)
- Mixed Action Space: Contains both discrete and continuous actions
5. Reward
Definition: The immediate feedback from the environment to the agent's action, denoted as $R_{t+1}$.
Design Principles:
- Sparse Rewards vs Dense Rewards
- Reward Shaping: Guiding learning through intermediate rewards
- Reward Function Design: Avoiding reward hacking
Mathematical Representation:
The reward function can be expressed as:
$$R(s, a) = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]$$
6. Policy
Definition: The agent's rule of behavior, defining what action should be taken in each state.
Mathematical Representation:
- Deterministic Policy: $a = \pi(s)$
- Stochastic Policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
Policy Types:
- Greedy Policy: Always selects the currently optimal action
- ε-Greedy Policy: Explores randomly with probability $\varepsilon$, otherwise acts greedily
- Softmax Policy: Selects actions according to a probability distribution based on action values
Standard Interaction Flow (MDP Framework)
The interaction process of reinforcement learning can be formalized as a Markov Decision Process (MDP):
sequenceDiagram
participant A as Agent
participant E as Environment
Note over A,E: Time t
A->>A: Observe state St
A->>A: Select action based on policy π
A->>E: Execute action At
E->>E: State transition
E->>A: Return new state St+1
E->>A: Return reward Rt+1
A->>A: Update policy/value function
Note over A,E: Loop continues...
Detailed Steps:
- State Perception: The agent observes the current state $S_t$
- Decision Making: Selects action $A_t$ according to policy $\pi$
- Action Execution: Executes the selected action in the environment
- Environment Response: The environment transitions to a new state $S_{t+1}$ and gives reward $R_{t+1}$
- Learning Update: The agent uses the new information to update its policy or value function
- Iterative Loop: Repeats the above process until a termination condition is met
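To make the loop concrete, here is a minimal sketch in Python. It assumes a hypothetical environment with `reset()` and `step(action)` methods (Gym-style) and an agent with `choose_action()` and `update()`; all of these names are illustrative, not part of any specific library.

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the standard agent-environment interaction loop."""
    state = env.reset()                      # 1. observe the initial state S_t
    for t in range(max_steps):
        action = agent.choose_action(state)  # 2. select A_t according to the current policy
        next_state, reward, done = env.step(action)  # 3-4. environment returns S_{t+1}, R_{t+1}
        agent.update(state, action, reward, next_state, done)  # 5. learning update
        state = next_state                   # 6. continue the loop from the new state
        if done:
            break
```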
Key Concepts Supplement
Value Functions
Although not direct interaction elements, value functions are important tools for understanding and optimizing policies:
- State-Value Function $V^\pi(s)$: The expected return obtained by starting from state $s$ and following policy $\pi$
- Action-Value Function $Q^\pi(s, a)$: The expected return obtained by starting from state $s$, taking action $a$, and then following policy $\pi$
Model (Optional Component)
Definition: A mathematical description of the environment's dynamics, including:
- Transition Probabilities: $P(s' \mid s, a)$, the probability of transitioning to state $s'$ after taking action $a$ in state $s$
- Reward Function: $R(s, a)$, the corresponding expected reward
Applications:
- Model-Based Methods: Use models for planning and searching
- Model-Free Methods: Learn directly from experience, no explicit model needed
3. Classification of Reinforcement Learning
There are many types of reinforcement learning algorithms. Understanding the classification and characteristics of different algorithms helps us choose appropriate methods for specific problems. Below are the main classification dimensions and representative algorithms.
Algorithm Classification Panorama
graph TD
A["Reinforcement Learning Algorithms"] --> B["Model-Based"]
A --> C["Model-Free"]
B --> B1["Dynamic Programming"]
B --> B2["Search Algorithms"]
B --> B3["Sample-Based Planning"]
C --> C1["Value-Based"]
C --> C2["Policy-Based"]
C --> C3["Actor-Critic"]
B1 --> B11["Policy Iteration
Value Iteration"]
B2 --> B21["MCTS
AlphaGo"]
B3 --> B31["Dyna-Q
PETS"]
C1 --> C11["Tabular Methods
Q-Learning
SARSA"]
C1 --> C12["Deep Methods
DQN
Double DQN
Dueling DQN"]
C2 --> C21["Policy Gradient
REINFORCE
TRPO
PPO"]
C2 --> C22["Evolutionary Strategies
ES
CMA-ES"]
C3 --> C31["Traditional AC
A2C
A3C"]
C3 --> C32["Advanced AC
SAC
TD3
DDPG"]
style A fill:#f9f,stroke:#333,stroke-width:3px
style B fill:#bbf,stroke:#333,stroke-width:2px
style C fill:#fbf,stroke:#333,stroke-width:2px
3.1 Model-Based vs Model-Free Learning
This is the most fundamental and important classification dimension in reinforcement learning.
Model-Based RL
Core Idea: The agent first learns a model of the environment's dynamics, and then uses this model for planning and decision-making.
Advantages:
- High Data Efficiency: Can generate virtual experiences through the model
- Strong Planning Capability: Can prospectively evaluate the consequences of actions
- Good Generalization: The model can generalize to unseen states
Disadvantages:
- Model Bias: Inaccurate models can lead to suboptimal policies
- High Computational Complexity: Planning processes are usually computationally intensive
- Hard to Model: Complex environments are difficult to model accurately
Representative Algorithms:
- Dynamic Programming: Policy Iteration, Value Iteration
- Monte Carlo Tree Search (MCTS): AlphaGo, AlphaZero
- Model-Based Deep RL: Dyna-Q, PETS, MuZero
Application Scenarios:
- Board games (clear rules, easy to model)
- Robot control (physical models are relatively well understood)
- Financial trading (rich historical data)
Model-Free RL
Core Idea: The agent does not learn an environment model, but directly learns value functions or policies from experience.
Advantages:
- Simple to Implement: No need to model environment dynamics
- Wide Applicability: Suitable for complex, hard-to-model environments
- Strong Robustness: Not affected by model errors
Disadvantages:
- High Data Requirements: Requires a large amount of environmental interaction
- Low Sample Efficiency: Relatively slow learning speed
- Lack of Planning: Cannot make prospective decisions
3.2 Three Major Schools of Model-Free Learning
Value-Based Learning
Core Idea: Learn the value function of states or state-action pairs, and then select the optimal action based on the value.
Mathematical Foundation:
- State-value function: $V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$
- Action-value function: $Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$
Decision Rule: $a^* = \arg\max_a Q(s, a)$
Representative Algorithms:
- Tabular Methods: Q-Learning, SARSA, Expected SARSA
- Function Approximation: DQN, Double DQN, Dueling DQN, Rainbow DQN
Application Scenarios:
- Discrete action spaces
- Scenarios requiring deterministic policies
- Situations where sample efficiency is not extremely high
Policy-Based Learning
Core Idea: Directly learn a parameterized policy function and optimize policy parameters using policy gradient methods.
Mathematical Foundation:
Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right]$$
Advantages:
- Continuous Action Spaces: Naturally applicable to continuous control
- Stochastic Policies: Can learn stochastic policies
- Expressive Policies: Can represent complex policies
Representative Algorithms:
- Basic Methods: REINFORCE, Actor-Only
- Advanced Methods: TRPO, PPO, A3C
- Evolutionary Methods: Evolution Strategies (ES)
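To make the policy gradient theorem concrete, below is a minimal REINFORCE sketch for the simplest possible case: a stateless softmax policy over discrete actions (a bandit). The function `pull_arm(a)` is a hypothetical reward source supplied by the caller; this illustrates the update direction $\nabla_\theta \log \pi_\theta(a) \cdot G$, not a full implementation of the algorithms listed above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def reinforce_bandit(pull_arm, n_actions, episodes=5000, alpha=0.1):
    """Minimal REINFORCE for a stateless softmax policy (bandit setting)."""
    theta = np.zeros(n_actions)        # policy parameters (one logit per action)
    for _ in range(episodes):
        probs = softmax(theta)
        a = np.random.choice(n_actions, p=probs)
        G = pull_arm(a)                # return of this one-step episode
        grad_log_pi = -probs           # d log pi(a) / d theta = one_hot(a) - pi
        grad_log_pi[a] += 1.0
        theta += alpha * G * grad_log_pi   # gradient ascent on expected return
    return softmax(theta)
```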
Actor-Critic
Core Idea: Combines the advantages of value learning and policy learning, using two networks to learn the policy and value function separately.
Architecture Design:
- Actor: Learns the policy $\pi_\theta(a \mid s)$ and is responsible for action selection
- Critic: Learns the value function $V(s)$ or $Q(s, a)$ and is responsible for value evaluation
Advantages:
- Lower Variance: The Critic provides a baseline, reducing the variance of the policy gradient
- Lower Bias: The Actor directly optimizes the policy, avoiding bias from the value function
- Strong Adaptability: Applicable to both discrete and continuous action spaces
Representative Algorithms:
- Synchronous Methods: A2C, PPO, TRPO
- Asynchronous Methods: A3C, IMPALA
- Deterministic Methods: DDPG, TD3, SAC
3.3 Classification by Learning Update Method
Monte Carlo Method (MC)
Characteristics:
- Requires complete episodes for updates
- Uses the actual return $G_t$ for learning
- Unbiased estimate, but higher variance
Update Formula:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$
Application Scenarios:
- Environments with short episodes
- Scenarios requiring unbiased estimates
Temporal Difference Method (TD)
Characteristics:
- Can update at each step, no need to wait for the episode to end
- Uses bootstrapping, i.e., updates estimates with other estimates
- Biased but with smaller variance
TD(0) Update Formula:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Variants:
- TD(λ): Method combining multi-step returns
- Q-Learning: Off-policy TD method
- SARSA: On-policy TD method
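The two update rules above differ only in the target that $V$ is moved toward. The sketch below shows both side by side; it assumes tabular value estimates stored in a `defaultdict(float)` and an episode recorded as a list of `(state, reward)` pairs (illustrative names, not a fixed API).

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte Carlo: wait for the episode to end, then use the actual return G_t."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(S_t, R_{t+1}), ...]
        G = reward + gamma * G                # accumulate the discounted return backwards
        V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): bootstrap from the estimate of the next state after every single step."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```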
3.4 Classification by Policy Update Method
On-Policy vs Off-Policy
On-Policy:
- The learning policy is the same as the data-generating policy
- Representative algorithms: SARSA, A2C, PPO
- Advantages: Good stability, guaranteed convergence
- Disadvantages: Relatively low data utilization efficiency
Off-Policy:
- The learning policy is different from the data-generating policy
- Representative algorithms: Q-Learning, DQN, DDPG
- Advantages: High data utilization efficiency, can reuse historical data
- Disadvantages: May have distribution shift problems
Algorithm Selection Guide
| Environment Feature | Recommended Algorithm Type | Representative Algorithms |
|---|---|---|
| Discrete action space | Value-Based | DQN, Rainbow |
| Continuous action space | Actor-Critic | PPO, SAC, TD3 |
| High-dimensional state space | Deep RL | DQN, A3C, PPO |
| High sample efficiency required | Model-Based | MCTS, MuZero |
| Stochastic policies required | Policy-Based | PPO, SAC |
| Multi-agent environments | Specialized Algorithms | MADDPG, QMIX |
4. Balancing Exploration and Exploitation
This is one of the most central dilemmas in reinforcement learning, also known as the Exploration-Exploitation Trade-off. The essence of this problem is: how should an agent balance acquiring new information (exploration) and utilizing existing knowledge (exploitation)?
Core Concept Comparison
graph LR
A["Agent faces a choice"] --> B["Exploitation"]
A --> C["Exploration"]
B --> B1["Choose known optimal action"]
B --> B2["Get predictable rewards"]
B --> B3["Risk: May miss better choices"]
C --> C1["Try unknown actions"]
C --> C2["May discover better policies"]
C --> C3["Risk: May receive lower rewards"]
style B fill:#e8f5e8,stroke:#2e7d32
style C fill:#fff3e0,stroke:#f57c00
Real-Life Analogy: Restaurant Selection Problem
You are in a new city with many restaurants:
Exploitation Strategy
- Behavior: Always go to the best restaurant you've already tried
- Advantages: Guaranteed satisfactory dining experience
- Disadvantages: May never discover better restaurants
- Mindset: Seek stability, avoid risk
Exploration Strategy
- Behavior: Try new, untried restaurants
- Advantages: Opportunity to discover unexpected surprises
- Disadvantages: May encounter bad food, wasting time and money
- Mindset: Take risks, pursue better solutions
Mathematical Description of the Exploration-Exploitation Dilemma
In the Multi-Armed Bandit problem, this dilemma can be formalized as:
Given the (unknown) true action values $Q(a) = \mathbb{E}[r \mid a]$ and the optimal value $V^* = \max_a Q(a)$:
- Exploitation: Choose the action that currently looks best, $A_t = \arg\max_a \hat{Q}_t(a)$
- Exploration: Choose a non-greedy action to improve the estimates
Regret is defined as:
$$l_t = \mathbb{E}\left[ V^* - Q(A_t) \right]$$
Total regret is:
$$L_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( V^* - Q(A_t) \right) \right]$$
A good exploration strategy keeps the total regret growing as slowly as possible.
Main Exploration Strategies
1. ε-Greedy Policy
Core Idea: Explores randomly with probability $\varepsilon$ and exploits the current best action with probability $1 - \varepsilon$.
Algorithm Description:
if random() < ε: action = random_action()   # explore
else: action = argmax_a Q(s, a)             # exploit
Mathematical Representation:
$$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \dfrac{\varepsilon}{|\mathcal{A}|} & \text{otherwise} \end{cases}$$
Variants:
- Decaying ε-Greedy: $\varepsilon_t = \varepsilon_0 / t$ or $\varepsilon_t = \varepsilon_0 \cdot \lambda^t$ with decay rate $\lambda < 1$
- Adaptive ε-Greedy: Adjusts $\varepsilon$ based on uncertainty
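A minimal runnable sketch of ε-greedy action selection with an exponential decay schedule; `Q_s` (the array of action-value estimates for the current state) and the schedule constants are illustrative.

```python
import numpy as np

def epsilon_greedy(Q_s, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one.
    Q_s: 1-D array of action-value estimates for the current state."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q_s))        # explore
    return int(np.argmax(Q_s))                    # exploit

# Decaying schedule: start exploratory, become increasingly greedy over time
epsilon, decay, min_eps = 1.0, 0.995, 0.05
for step in range(1000):
    # action = epsilon_greedy(Q[state], epsilon)  # Q and state come from the learning loop
    epsilon = max(min_eps, epsilon * decay)
```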
2. Softmax/Boltzmann Policy
Core Idea: Selects actions based on a probability distribution of action values; actions with higher values have a higher probability of being chosen.
Mathematical Representation:
$$\pi(a) = \dfrac{\exp\left( Q(a) / \tau \right)}{\sum_{b} \exp\left( Q(b) / \tau \right)}$$
Where $\tau$ is the temperature parameter:
- $\tau \to 0$: Approaches a greedy policy
- $\tau \to \infty$: Approaches a uniform random policy
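A small sketch of Boltzmann (softmax) action selection; the `q_values` and temperature used in the example call are illustrative.

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q(a)/tau)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

# Higher-valued actions are chosen more often, but never exclusively
print(softmax_action([1.0, 2.0, 0.5], tau=0.5))
```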
3. Upper Confidence Bound (UCB)
Core Idea: Selects the action with the highest "optimistic estimate," which adds an upper bound on the uncertainty to the value estimate.
UCB1 Formula:
$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\dfrac{\ln t}{N_t(a)}} \right]$$
Where:
- $Q_t(a)$: Average reward received so far for action $a$
- $N_t(a)$: Number of times action $a$ has been selected
- $c$: Confidence parameter controlling the amount of exploration
Intuitive Understanding:
- First term: Utilizes the current best estimate
- Second term: Explores actions with high uncertainty
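A compact UCB1 sketch for a multi-armed bandit; `pull_arm(a)` is a hypothetical function returning the reward of arm `a`.

```python
import numpy as np

def ucb1(pull_arm, n_arms, total_steps=1000, c=2.0):
    """UCB1: play each arm once, then always pick the arm with the highest
    optimistic estimate Q(a) + c * sqrt(ln t / N(a))."""
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)                 # running average reward per arm
    for t in range(1, total_steps + 1):
        if t <= n_arms:
            a = t - 1                         # initialization: try every arm once
        else:
            ucb = values + c * np.sqrt(np.log(t) / counts)
            a = int(np.argmax(ucb))
        reward = pull_arm(a)
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]   # incremental mean update
    return values, counts
```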
4. Thompson Sampling
Core Idea: Maintains a probability distribution for the value of each action and selects actions based on sampling results.
Algorithm Flow:
- Maintain a probability distribution over the value of each action (e.g., a Beta posterior for Bernoulli rewards)
- Sample a value $\tilde{\theta}_a$ from each action's distribution
- Select $A_t = \arg\max_a \tilde{\theta}_a$
- Update the distribution of the selected action based on the received reward
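A minimal Thompson sampling sketch for Bernoulli rewards using Beta posteriors; `pull_arm(a)` is again a hypothetical function, here assumed to return 0 or 1.

```python
import numpy as np

def thompson_sampling(pull_arm, n_arms, total_steps=1000):
    """Beta-Bernoulli Thompson sampling: keep a Beta(successes+1, failures+1)
    posterior per arm, sample from each, and play the arm with the largest sample."""
    successes = np.zeros(n_arms)
    failures = np.zeros(n_arms)
    for _ in range(total_steps):
        samples = np.random.beta(successes + 1, failures + 1)   # one draw per arm
        a = int(np.argmax(samples))
        reward = pull_arm(a)                  # assumed to be 0 or 1
        successes[a] += reward
        failures[a] += 1 - reward
    return successes / np.maximum(successes + failures, 1)       # empirical means
```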
5. Count-Based Exploration
Core Idea: Encourages visiting less-visited state-action pairs.
Reward Shaping (a common form of count-based bonus):
$$r^{+}(s, a) = r(s, a) + \dfrac{\beta}{\sqrt{N(s, a)}}$$
Where $N(s, a)$ is the number of times the state-action pair has been visited and $\beta$ controls the strength of the exploration bonus.
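As a sketch, the bonus can simply be added on top of the environment reward before the usual update; the counter and coefficient names below are illustrative.

```python
from collections import defaultdict
import math

visit_counts = defaultdict(int)
beta = 0.1

def shaped_reward(reward, state, action):
    """Environment reward plus a count-based exploration bonus."""
    visit_counts[(state, action)] += 1
    return reward + beta / math.sqrt(visit_counts[(state, action)])
```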
Exploration Strategy Comparison
graph TD
A["Exploration Strategy Selection"] --> B["Environment Type"]
A --> C["Performance Requirements"]
A --> D["Implementation Complexity"]
B --> B1["Stochastic Environments
→ UCB, Thompson"]
B --> B2["Deterministic Environments
→ ε-greedy"]
B --> B3["Non-stationary Environments
→ Decaying ε-greedy"]
C --> C1["Fast Convergence
→ UCB"]
C --> C2["Balanced Performance
→ Softmax"]
C --> C3["Simple and Effective
→ ε-greedy"]
D --> D1["Simple Implementation
→ ε-greedy"]
D --> D2["Medium Complexity
→ Softmax"]
D --> D3["Advanced Methods
→ Thompson"]
Exploration Strategy Performance Comparison
| Strategy | Theoretical Guarantee | Implementation Difficulty | Application Scenario | Convergence Speed |
|---|---|---|---|---|
| ε-Greedy | General | Simple | General | Medium |
| Decaying ε-Greedy | Better | Simple | Non-stationary | Faster |
| Softmax | General | Medium | Continuous Values | Medium |
| UCB | Excellent | Medium | Stochastic | Fast |
| Thompson Sampling | Excellent | Complex | Bayesian Setting | Fast |
Exploration in Deep Reinforcement Learning
In Deep RL, exploration becomes more complex because the state space is huge, and traditional counting methods are no longer applicable.
Intrinsic Motivation
- Curiosity-Driven: ICM (Intrinsic Curiosity Module)
- Random Network Distillation: RND (Random Network Distillation)
- NGU: Never Give Up
Parameter Space Exploration
- Parameter Noise: Adding noise to network parameters
- NoisyNet: Learnable noisy networks
Practical Advice
- More Exploration Initially: Increase exploration probability at the beginning of learning
- More Exploitation Later: Decrease exploration during convergence
- Environment Adaptation: Choose appropriate strategy based on environment characteristics
- Monitor Metrics: Track exploration rate and performance metrics
- Hyperparameter Tuning: Carefully adjust exploration-related hyperparameters
Summary
Balancing exploration and exploitation is key to successful reinforcement learning. Without exploration, the agent may fall into local optima; without exploitation, the agent cannot effectively use learned knowledge. Choosing the right exploration strategy requires considering environment characteristics, performance requirements, and implementation complexity.
5. The Bellman Equation: Theoretical Foundation of Reinforcement Learning
The Bellman equation is the theoretical cornerstone of reinforcement learning. Almost all reinforcement learning algorithms are built upon this elegant mathematical framework. It reveals the recursive structure of value functions, providing fundamental insights for understanding and solving reinforcement learning problems.
Mathematical Basis: Return and Value Function
Definition of Return
The return $G_t$ is the total discounted reward from time step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
Discount Factor
The discount factor $\gamma \in [0, 1]$ determines how much the agent values future rewards relative to immediate ones:
graph LR
A["γ = 0
Completely Short-sighted"] --> B["γ = 0.5
Balanced Consideration"]
B --> C["γ = 0.9
Quite Far-sighted"]
C --> D["γ = 1
Completely Far-sighted"]
style A fill:#ffcdd2
style B fill:#fff3e0
style C fill:#e8f5e8
style D fill:#e3f2fd
Role of the Discount Factor:
- Mathematical Convergence: Ensures convergence of infinite sequences
- Uncertainty Modeling: The further into the future, the more uncertain
- Practical Significance: Reflects the time value (e.g., interest rate)
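A tiny sketch of computing the discounted return from a list of rewards, which makes the role of γ concrete (names and values are illustrative).

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed by scanning the rewards backwards."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```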
Definition of Value Functions
State-Value Function
Definition: The expected return obtained by starting from state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$
Action-Value Function
Definition: The expected return obtained by starting from state $s$, taking action $a$, and then following policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]$$
Relationship Between the Two
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \, Q^\pi(s, a)$$
Bellman Equation Derivation and Intuition
Core Insight: Recursive Decomposition
Bellman's ingenious insight is that value can be recursively decomposed into the immediate reward plus the discounted value of subsequent states.
graph TD
A["Value of current state s"] --> B["Expected immediate reward"]
A --> C["+ γ × Expected value of next state"]
B --> B1["Select action based on policy π"]
B --> B2["Get immediate reward R(s,a)"]
C --> C1["Transition to next state s'"]
C --> C2["Value of that state V(s')"]
style A fill:#e1f5fe
style B fill:#e8f5e8
style C fill:#fff3e0
Derivation of the Bellman Expectation Equation
Starting from the definition of the value function:
$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]$$
Expanding the definition of the return:
$$V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$$
Using the linearity of expectation:
$$V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \dots \right) \mid S_t = s \right] = \mathbb{E}_\pi\left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]$$
Noting that $\mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} \right] = V^\pi(S_{t+1})$, we obtain:
$$V^\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s \right]$$
Full Form of the Bellman Equation
Bellman Expectation Equation for the State-Value Function
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$
Intuitive Understanding:
- Outer summation: Iterates over all possible actions, weighted by $\pi(a \mid s)$, the probability of selecting action $a$
- Inner summation: Iterates over all possible next states, weighted by $P(s' \mid s, a)$, the state transition probability
- Bracketed term: Immediate reward plus discounted future value
Bellman Expectation Equation for the Action-Value Function
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') \, Q^\pi(s', a') \right]$$
Or, equivalently:
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$
Bellman Optimality Equations
For the optimal policy $\pi^*$, the corresponding value functions satisfy:
Optimal State-Value Function
$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$
Optimal Action-Value Function
$$Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]$$
Applications of the Bellman Equation
1. Dynamic Programming Algorithms
Policy Evaluation: repeat until convergence, for every state $s$:
$$V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$
Value Iteration: repeat until convergence, for every state $s$:
$$V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$
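A runnable value iteration sketch for a small tabular MDP; the transition structure `P[s][a] = [(prob, next_state, reward), ...]` is an illustrative layout, not a required format.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Iterate the Bellman optimality backup until the value function stops changing.
    P[s][a] is a list of (prob, next_state, reward) triples."""
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            V_new[s] = max(q)                 # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```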
2. Temporal Difference Learning
TD(0) Update Rule:
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Where $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error.
3. Q-Learning Algorithm
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
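A minimal tabular Q-learning sketch built on the Bellman optimality backup; the environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`) is an assumed Gym-style convention, and all names are illustrative.

```python
from collections import defaultdict
import numpy as np

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)          # explore
            else:
                action = int(np.argmax(Q[state]))              # exploit
            next_state, reward, done = env.step(action)
            # Bellman optimality target: r + gamma * max_a' Q(s', a')
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```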
Geometric Interpretation of the Bellman Equation
The Bellman equation can be viewed as a Contraction Mapping in the value function space:
graph TD
A["Value Function Space"] --> B["Bellman Operator T"]
B --> C["New Value Function"]
C --> D["Fixed Point = True Value Function"]
B --> B1["T[V](s) = max_a Σ_s' P(s'|s,a)[R + γV(s')]"]
style D fill:#e8f5e8
Contraction Property:
$$\left\| T V_1 - T V_2 \right\|_\infty \le \gamma \left\| V_1 - V_2 \right\|_\infty$$
For $\gamma < 1$, this guarantees the convergence of the value iteration algorithm to a unique fixed point.
Practical Significance and Intuition
- Recursive Structure: Complex problems are decomposed into subproblems
- Dynamic Programming Principle: Optimal substructure property of optimal decisions
- Temporal Consistency: Current optimal decision considers future optimality
- Algorithmic Foundation: Theoretical basis for almost all RL algorithms
Summary
The Bellman equation is not just an elegant mathematical expression but also the key to understanding the essence of reinforcement learning. It tells us:
- Value functions have a recursive structure
- Current decisions should consider long-term impacts
- Optimal policies satisfy the optimal substructure property of dynamic programming
- The true value function can be approximated through iterative solutions
Mastering the Bellman equation is a necessary path to deeply understanding reinforcement learning algorithms.
Final Words: The charm of reinforcement learning lies in how it mirrors the essential biological learning process: constantly improving through trial, error, and feedback. Just as humans learn to ride a bicycle, reinforcement learning allows machines to gain intelligence through interaction with their environment. May this guide provide a solid starting point for your reinforcement learning journey!