Reinforcement Learning Primer

📚 Learning Objectives: Master the core concepts of reinforcement learning and its main algorithm categories, laying a solid foundation for deeper study of reinforcement learning.

This document aims to provide beginners with a systematic introduction to reinforcement learning. As an important branch of artificial intelligence, reinforcement learning has wide applications in game AI, robot control, recommendation systems, autonomous driving, and other fields.

📋 Table of Contents

  1. What is Reinforcement Learning?
  2. Core Elements of Reinforcement Learning
  3. Classification of Reinforcement Learning
  4. Balancing Exploration and Exploitation
  5. The Bellman Equation: Theoretical Foundation of Reinforcement Learning

1. What is Reinforcement Learning?

Reinforcement learning is a branch of machine learning that studies how an Agent takes actions in an Environment to maximize cumulative Reward. Unlike supervised learning and unsupervised learning, reinforcement learning learns optimal decision-making strategies through continuous interaction with the environment.

🎯 Core Interaction Flow

graph TD
    A["๐Ÿค– Agent"] --> B["๐ŸŽฏ Select Action"]
    B --> C["๐ŸŒ Environment"]
    C --> D["๐Ÿ“ New State"]
    C --> E["๐ŸŽ Reward"]
    D --> A
    E --> A
    
    style A fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style C fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style D fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style E fill:#fff3e0,stroke:#e65100,stroke-width:2px

๐Ÿ• Real-Life Analogy: Training a Pet Dog

To better understand reinforcement learning, let's use the example of training a pet dog:

  • Agent 🐕: Your pet dog, the learner who needs to acquire new skills
  • Environment 🏠: The training ground, including room layout, furniture placement, and other external conditions
  • State 📍: The dog's current posture and position, e.g., "standing in the center of the living room"
  • Action 🎯: The choices the dog can make, such as "sit," "shake," "spin"
  • Reward 🦴: The feedback you give; a treat for doing it right (+1), no reward (0) or a gentle scolding (-1) for doing it wrong
  • Policy 🧠: The "rules of behavior" the dog learns, i.e., what to do in specific situations

🎯 Core Definition: Learning from Interaction to Achieve Goals

The core idea of reinforcement learning is "Trial-and-Error". Unlike traditional supervised learning:

| Feature | Supervised Learning | Reinforcement Learning |
|---|---|---|
| Learning Method | Learns from labeled data | Learns from environmental interaction |
| Feedback Type | Correct answers | Reward signals |
| Data Acquisition | Static datasets | Dynamic interaction process |
| Goal | Prediction accuracy | Maximizing long-term cumulative reward |

The agent is not directly told what to do but must discover which sequences of actions lead to the highest long-term returns through interaction with the environment. This learning method is closer to the natural learning process of humans and animals.

🌟 Unique Advantages of Reinforcement Learning

  1. Strong Adaptability: Can adapt to dynamically changing environments
  2. No Need for Large Labeled Datasets: Learns autonomously through interaction
  3. Goal-Oriented: Directly optimizes the ultimate goal, not intermediate metrics
  4. Sequential Decision-Making: Considers the long-term impact of actions, not just single-step optimality

2. Core Elements of Reinforcement Learning

A reinforcement learning system is a complex interactive system composed of the following six key components. Understanding these elements and their interrelationships is fundamental to mastering reinforcement learning.

🧩 Six Core Components Explained

1. 🤖 Agent

Definition: The learner and decision-maker, the core of the entire system.

Characteristics:

  • Possesses the ability to perceive environmental states
  • Can perform actions to influence the environment
  • Equipped with mechanisms to learn and improve policies
  • Goal is to maximize long-term cumulative reward

Examples:

  • 🎮 Game AI (e.g., the Go program in AlphaGo)
  • 🤖 Autonomous robots (e.g., robotic vacuum cleaners)
  • 💰 Trading systems (e.g., stock trading algorithms)
  • 🚗 Decision-making systems for autonomous vehicles

2. 🌍 Environment

Definition: The external world with which the agent interacts; the agent cannot fully control it but can influence it.

Properties:

  • State Space: The set of all possible states S
  • Dynamism: States change according to the agent's actions and environmental laws
  • Stochasticity: State transitions may be uncertain
  • Partial Observability: The agent may not be able to observe the complete state of the environment

Classification:

  • Deterministic Environments vs Stochastic Environments
  • Fully Observable vs Partially Observable
  • Single-Agent vs Multi-Agent
  • Static vs Dynamic

3. 📍 State

Definition: A collection of information describing the current situation of the environment, denoted as $s_t \in S$.

Important Concepts:

  • Markov Property: Future states depend only on the current state, not on history
  • State Representation: How to effectively represent and encode state information
  • State Space Size: Discrete finite, discrete infinite, continuous

4. 🎯 Action

Definition: An operation that the agent can execute, denoted as $a_t \in A$.

Classification:

  • Discrete Action Space: A finite number of optional actions (e.g., up, down, left, right in a game)
  • Continuous Action Space: Actions with continuous values (e.g., robot joint angles)
  • Mixed Action Space: Contains both discrete and continuous actions

5. 🎁 Reward

Definition: The immediate feedback from the environment to the agent's action, denoted as $r_t \in \mathbb{R}$.

Design Principles:

  • Sparse Rewards vs Dense Rewards
  • Reward Shaping: Guiding learning through intermediate rewards
  • Reward Function Design: Avoiding reward hacking

Mathematical Representation:
The reward function can be expressed as $R: S \times A \times S \to \mathbb{R}$.

6. 🧠 Policy

Definition: The agent's rule of behavior, defining what action should be taken in each state.

Mathematical Representation:

  • Deterministic Policy: $\pi: S \to A$
  • Stochastic Policy: $\pi(a|s) = P(A_t = a \mid S_t = s)$

Policy Types:

  • Greedy Policy: Always selects the currently optimal action
  • ε-Greedy Policy: Explores randomly with probability ε
  • Softmax Policy: Based on a probability distribution of action values

🔄 Standard Interaction Flow (MDP Framework)

The interaction process of reinforcement learning can be formalized as a Markov Decision Process (MDP):

sequenceDiagram
    participant A as 🤖 Agent
    participant E as 🌍 Environment
    
    Note over A,E: Time t
    A->>A: Observe state St
    A->>A: Select action based on policy π
    A->>E: Execute action At
    E->>E: State transition
    E->>A: Return new state St+1
    E->>A: Return reward Rt+1
    A->>A: Update policy/value function
    
    Note over A,E: Loop continues...

Detailed Steps:

  1. State Perception: The agent observes the current state $S_t$
  2. Decision Making: Selects action $A_t$ according to policy $\pi(a|s)$
  3. Action Execution: Executes the selected action in the environment
  4. Environment Response: The environment transitions to a new state $S_{t+1}$ and gives reward $R_{t+1}$
  5. Learning Update: The agent uses new information to update its policy or value function
  6. Iterative Loop: Repeats the above process until a termination condition is met

🎯 Key Concepts Supplement

Value Functions

Although not direct interaction elements, value functions are important tools for understanding and optimizing policies:

  • State-Value Function $V^\pi(s)$: The expected return from state $s$ following policy $\pi$
  • Action-Value Function $Q^\pi(s,a)$: The expected return from state $s$ taking action $a$ and then following policy $\pi$

Model (Optional Component)

Definition: A mathematical description of the environment's dynamics, including:

  • Transition Probabilities: $P(s'|s,a)$ - the probability of transitioning to state $s'$ after taking action $a$ in state $s$
  • Reward Function: $R(s,a,s')$ - the corresponding expected reward
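
As a concrete illustration, a model for a tiny, hypothetical two-state MDP can be written down explicitly. The sketch below stores transition probabilities and rewards together and uses them for a one-step lookahead; the names (model, one_step_lookahead, the states and actions) are purely illustrative.

# A minimal sketch of an explicit model for a hypothetical two-state MDP.
# model[(s, a)] lists (probability, next_state, reward) triples, i.e. P(s'|s,a) and R(s,a,s').
model = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}

def one_step_lookahead(s, a, V, gamma=0.9):
    # Expected value of taking action a in state s under the model:
    # sum over s' of P(s'|s,a) * [R(s,a,s') + gamma * V(s')]
    return sum(p * (r + gamma * V[s_next]) for p, s_next, r in model[(s, a)])

Model-based methods plan with such a table (or a learned approximation of it); model-free methods never build it and instead estimate values or policies directly from sampled transitions.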

Applications:

  • Model-Based Methods: Use models for planning and searching
  • Model-Free Methods: Learn directly from experience, no explicit model needed

3. Classification of Reinforcement Learning

There are many types of reinforcement learning algorithms. Understanding the classification and characteristics of different algorithms helps us choose appropriate methods for specific problems. Below are the main classification dimensions and representative algorithms.

๐Ÿ—‚๏ธ Algorithm Classification Panorama

graph TD
    A["🧠 Reinforcement Learning Algorithms"] --> B["🏗️ Model-Based"]
    A --> C["🎯 Model-Free"]

    B --> B1["📊 Dynamic Programming"]
    B --> B2["🔍 Search Algorithms"]
    B --> B3["🎲 Sample-Based Planning"]

    C --> C1["💎 Value-Based"]
    C --> C2["🎭 Policy-Based"]
    C --> C3["🎪 Actor-Critic"]

    B1 --> B11["Policy Iteration<br/>Value Iteration"]
    B2 --> B21["MCTS<br/>AlphaGo"]
    B3 --> B31["Dyna-Q<br/>PETS"]
    C1 --> C11["🔢 Tabular Methods<br/>Q-Learning<br/>SARSA"]
    C1 --> C12["🧠 Deep Methods<br/>DQN<br/>Double DQN<br/>Dueling DQN"]
    C2 --> C21["🎯 Policy Gradient<br/>REINFORCE<br/>TRPO<br/>PPO"]
    C2 --> C22["🎪 Evolutionary Strategies<br/>ES<br/>CMA-ES"]
    C3 --> C31["🎭 Traditional AC<br/>A2C<br/>A3C"]
    C3 --> C32["🚀 Advanced AC<br/>SAC<br/>TD3<br/>DDPG"]

    style A fill:#f9f,stroke:#333,stroke-width:3px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#fbf,stroke:#333,stroke-width:2px

๐Ÿ—๏ธ 3.1 Model-Based vs Model-Free Learning

This is the most fundamental and important classification dimension in reinforcement learning.

Model-Based RL

Core Idea: The agent first learns a model of the environment's dynamics, and then uses this model for planning and decision-making.

Advantages:

  • ✅ High Data Efficiency: Can generate virtual experiences through the model
  • ✅ Strong Planning Capability: Can prospectively evaluate the consequences of actions
  • ✅ Good Generalization: The model can generalize to unseen states

Disadvantages:

  • โŒ Model Bias: Inaccurate models can lead to suboptimal policies
  • โŒ High Computational Complexity: Planning processes are usually computationally intensive
  • โŒ Difficult to Model: Complex environments are difficult to model accurately

Representative Algorithms:

  • Dynamic Programming: Policy Iteration, Value Iteration
  • Monte Carlo Tree Search (MCTS): AlphaGo, AlphaZero
  • Model-Based Deep RL: Dyna-Q, PETS, MuZero

Application Scenarios:

  • 🎲 Board games (clear rules, easy to model)
  • 🤖 Robot control (physical models are relatively certain)
  • 📈 Financial trading (rich historical data)

Model-Free RL

Core Idea: The agent does not learn an environment model, but directly learns value functions or policies from experience.

Advantages:

  • ✅ Simple to Implement: No need to model environment dynamics
  • ✅ Wide Applicability: Suitable for complex, difficult-to-model environments
  • ✅ Strong Robustness: Not affected by model errors

Disadvantages:

  • โŒ High Data Requirement: Requires a large amount of environmental interaction
  • โŒ Low Sample Efficiency: Relatively slow learning speed
  • โŒ Lack of Planning: Cannot make prospective decisions

🎯 3.2 Three Major Schools of Model-Free Learning

💎 Value-Based Learning

Core Idea: Learn the value function of states or state-action pairs, and then select the optimal action based on the value.

Mathematical Foundation:

  • State-value function: $V(s) = \mathbb{E}[G_t \mid S_t = s]$
  • Action-value function: $Q(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$

Decision Rule: $\pi(s) = \arg\max_a Q(s,a)$

Representative Algorithms:

  • Tabular Methods: Q-Learning, SARSA, Expected SARSA
  • Function Approximation: DQN, Double DQN, Dueling DQN, Rainbow DQN

Application Scenarios:

  • Discrete action spaces
  • Scenarios requiring deterministic policies
  • Situations where sample-efficiency requirements are not extreme

🎭 Policy-Based Learning

Core Idea: Directly learn a parameterized policy function and optimize policy parameters using policy gradient methods.

Mathematical Foundation:
Policy Gradient Theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot Q^{\pi_\theta}(s,a)\right]$
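
As a rough illustration of how this gradient is used, the sketch below performs a REINFORCE-style update for a tabular softmax policy, substituting the sampled Monte Carlo return $G_t$ for $Q^{\pi_\theta}(s,a)$. All names are illustrative; this is a minimal sketch, not a reference implementation.

import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """theta: array of shape [n_states, n_actions]; episode: list of (s, a, r)."""
    G = 0.0
    for s, a, r in reversed(episode):        # walk backwards to accumulate returns
        G = r + gamma * G
        probs = softmax(theta[s])
        grad_log = -probs                    # gradient of log pi(a|s) w.r.t. theta[s, :]
        grad_log[a] += 1.0
        theta[s] += lr * G * grad_log        # gradient ascent on J(theta)
    return theta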

Advantages:

  • ✅ Continuous Action Space: Naturally applicable to continuous control
  • ✅ Stochastic Policies: Can learn stochastic policies
  • ✅ Strong Policy Expression Capability: Can represent complex policies

Representative Algorithms:

  • Basic Methods: REINFORCE, Actor-Only
  • Advanced Methods: TRPO, PPO, A3C
  • Evolutionary Methods: Evolution Strategies (ES)

🎪 Actor-Critic

Core Idea: Combines the advantages of value learning and policy learning, using two networks to learn the policy and value function separately.

Architecture Design:

  • Actor: Learns the policy $\pi_\theta(a|s)$, responsible for action selection
  • Critic: Learns the value function $V_\phi(s)$ or $Q_\phi(s,a)$, responsible for value evaluation

Advantages:

  • ✅ Lower Variance: The critic provides a baseline, reducing the variance of policy gradients
  • ✅ Direct Policy Optimization: The actor optimizes the policy directly rather than deriving it from value estimates alone
  • ✅ Strong Adaptability: Applicable to both discrete and continuous action spaces

Representative Algorithms:

  • Synchronous Methods: A2C, PPO, TRPO
  • Asynchronous Methods: A3C, IMPALA
  • Deterministic Methods: DDPG, TD3, SAC
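
To make the actor/critic split concrete, here is a minimal one-step tabular sketch in which the TD error plays the role of the advantage. The setting and names are hypothetical and illustrative, not any specific published algorithm.

import numpy as np

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.99, lr_actor=0.01, lr_critic=0.1):
    # Critic: one-step TD error, used here as an advantage estimate.
    td_target = r if done else r + gamma * V[s_next]
    delta = td_target - V[s]
    V[s] += lr_critic * delta

    # Actor: policy-gradient step on a tabular softmax policy, weighted by delta.
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    grad_log = -probs
    grad_log[a] += 1.0
    theta[s] += lr_actor * delta * grad_log
    return delta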

🔄 3.3 Classification by Learning Update Method

Monte Carlo Method (MC)

Characteristics:

  • Requires complete episodes for updates
  • Uses actual returns $G_t$ for learning
  • Unbiased estimate, but higher variance

Update Formula:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$

Application Scenarios:

  • Environments with short episodes
  • Scenarios requiring unbiased estimates
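
A minimal sketch of the update rule above for policy evaluation, assuming an episode is stored as a list of (state, reward) pairs (a hypothetical setup):

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo prediction: move V(S_t) toward the observed return G_t."""
    G = 0.0
    for s, r in reversed(episode):      # accumulate returns from the end of the episode
        G = r + gamma * G
        V[s] += alpha * (G - V[s])
    return V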

Temporal Difference Method (TD)

Characteristics:

  • Can update at each step, no need to wait for the episode to end
  • Uses bootstrapping, i.e., updates estimates with other estimates
  • Biased but with smaller variance

TD(0) Update Formula:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

Variants:

  • TD(λ): Method combining multi-step returns
  • Q-Learning: Off-policy TD method
  • SARSA: On-policy TD method

🎯 3.4 Classification by Policy Update Method

On-Policy vs Off-Policy

On-Policy:

  • The learning policy is the same as the data-generating policy
  • Representative algorithms: SARSA, A2C, PPO
  • Advantages: Good stability and more reliable convergence behavior
  • Disadvantages: Relatively low data utilization efficiency

Off-Policy:

  • The learning policy is different from the data-generating policy
  • Representative algorithms: Q-Learning, DQN, DDPG
  • Advantages: High data utilization efficiency, can reuse historical data
  • Disadvantages: May have distribution shift problems
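
The difference is easiest to see in the bootstrapped target each method uses; a minimal sketch, assuming Q is a dict keyed by (state, action):

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: bootstrap on the action the behaviour policy actually chose.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.99):
    # Off-policy: bootstrap on the greedy action, regardless of what was executed.
    return r + gamma * max(Q[(s_next, a)] for a in actions)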

📊 Algorithm Selection Guide

| Environment Feature | Recommended Algorithm Type | Representative Algorithms |
|---|---|---|
| Discrete action space | Value-Based | DQN, Rainbow |
| Continuous action space | Actor-Critic | PPO, SAC, TD3 |
| High-dimensional state space | Deep RL | DQN, A3C, PPO |
| High sample efficiency required | Model-Based | MCTS, MuZero |
| Stochastic policies required | Policy-Based | PPO, SAC |
| Multi-agent environments | Specialized Algorithms | MADDPG, QMIX |

4. Balancing Exploration and Exploitation

This is one of the most central dilemmas in reinforcement learning, also known as the Exploration-Exploitation Trade-off. The essence of this problem is: how should an agent balance acquiring new information (exploration) and utilizing existing knowledge (exploitation)?

โš–๏ธ Core Concept Comparison

graph LR
    A["๐Ÿค” Agent faces choice"] --> B["๐ŸŽฏ Exploitation"]
    A --> C["๐Ÿ” Exploration"]
    
    B --> B1["Choose known optimal action"]
    B --> B2["Get predictable rewards"]
    B --> B3["Risk: May miss better choices"]
    
    C --> C1["Try unknown actions"]
    C --> C2["May discover better policies"]
    C --> C3["Risk: May receive lower rewards"]
    
    style B fill:#e8f5e8,stroke:#2e7d32
    style C fill:#fff3e0,stroke:#f57c00

๐Ÿฝ๏ธ Real-Life Analogy: Restaurant Selection Problem

You are in a new city with many restaurants:

Exploitation Strategy

  • Behavior: Always go to the best restaurant you've already tried
  • Advantages: Guaranteed satisfactory dining experience
  • Disadvantages: May never discover better restaurants
  • Mindset: Seek stability, avoid risk

Exploration Strategy

  • Behavior: Try new, untried restaurants
  • Advantages: Opportunity to discover unexpected surprises
  • Disadvantages: May encounter bad food, wasting time and money
  • Mindset: Take risks, pursue better solutions

📊 Mathematical Description of the Exploration-Exploitation Dilemma

In the Multi-Armed Bandit problem, this dilemma can be formalized as:

Given $K$ actions, each action $a$ has a true value $q_*(a)$ and an estimated value $Q_t(a)$.

  • Exploitation: Choose $A_t = \arg\max_a Q_t(a)$
  • Exploration: Choose a non-greedy action

Regret is defined as:

$$R_t = \max_a q_*(a) - q_*(A_t)$$

Total regret is $\sum_{t=1}^{T} R_t$.

๐Ÿ› ๏ธ Main Exploration Strategies

1. ε-Greedy Policy

Core Idea: Explores randomly with probability ε, and exploits the current optimal action with probability $1-\varepsilon$.

Algorithm Description:

import random

if random.random() < epsilon:
    action = random.choice(actions)                      # explore
else:
    action = max(actions, key=lambda a: Q[(state, a)])   # exploit

Mathematical Representation:

$$\pi(a|s) = \begin{cases} 1 - \varepsilon + \dfrac{\varepsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s,a') \\ \dfrac{\varepsilon}{|A|} & \text{otherwise} \end{cases}$$

Variants:

  • Decaying ε-Greedy: $\varepsilon_t = \varepsilon_0 / t$ or $\varepsilon_t = \varepsilon_0 \cdot \gamma^t$ (see the schedule sketch after this list)
  • Adaptive ε-Greedy: Adjusts ε based on uncertainty
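
A minimal sketch of a decaying schedule; the decay constants here are arbitrary examples, not recommended values:

def epsilon_at(t, eps0=1.0, decay=0.995, eps_min=0.05):
    # Multiplicative decay with a floor, so some exploration always remains.
    return max(eps_min, eps0 * decay ** t)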

2. Softmax/Boltzmann Policy

Core Idea: Selects actions based on a probability distribution of action values; actions with higher values have a higher probability of being chosen.

Mathematical Representation:

$$\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}$$

Where τ is the temperature parameter:

  • $\tau \to 0$: Approaches a greedy policy
  • $\tau \to \infty$: Approaches a uniform random policy
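
A minimal sketch of Boltzmann action selection over a vector of action values (illustrative, using NumPy):

import numpy as np

def boltzmann_action(q_values, tau=1.0, rng=None):
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                         # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()  # softmax over action values
    return rng.choice(len(probs), p=probs)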

3. Upper Confidence Bound (UCB)

Core Idea: Selects the action with the highest "optimistic estimate," which considers the upper bound of uncertainty.

UCB1 Formula:

$$A_t = \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

Where:

  • $Q_t(a)$: Average reward for action $a$
  • $N_t(a)$: Number of times action $a$ has been selected
  • $c$: Confidence parameter

Intuitive Understanding:

  • First term: Utilizes the current best estimate
  • Second term: Explores actions with high uncertainty
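
A minimal UCB1 sketch for a bandit setting, where Q and N are dicts of running mean rewards and pull counts (the names are illustrative):

import math

def ucb1_action(Q, N, t, c=2.0):
    for a in Q:                       # try every action once before trusting the bound
        if N[a] == 0:
            return a
    # Optimism in the face of uncertainty: value estimate plus a confidence bonus.
    return max(Q, key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))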

4. Thompson Sampling

Core Idea: Maintains a probability distribution for the value of each action and selects actions based on sampling results.

Algorithm Flow:

  1. Maintain a value distribution $P(Q(a))$ for each action
  2. Sample a value $\tilde{Q}(a)$ from each distribution
  3. Select $\arg\max_a \tilde{Q}(a)$
  4. Update the distribution based on the received reward
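
For Bernoulli rewards this reduces to a few lines with a Beta posterior per action; a minimal sketch, assuming alpha and beta are dicts of posterior parameters initialised to 1:

import random

def thompson_action(alpha, beta):
    # Sample one plausible value per action from its posterior, then act greedily.
    samples = {a: random.betavariate(alpha[a], beta[a]) for a in alpha}
    return max(samples, key=samples.get)

def thompson_update(alpha, beta, a, reward):
    # Bernoulli reward in {0, 1}: success updates alpha, failure updates beta.
    alpha[a] += reward
    beta[a] += 1 - reward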

5. Count-Based Exploration

Core Idea: Encourages visiting less-visited state-action pairs.

Reward Shaping:

$$r'(s,a) = r(s,a) + \beta \cdot f(N(s,a))$$

Where $f(\cdot)$ is a decreasing function, such as $f(n) = 1/n$
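
A minimal sketch of this shaping rule; it uses $f(n) = 1/\sqrt{n}$, another common choice, and the $1/n$ form above works the same way:

import math

def shaped_reward(r, n_visits, beta=0.1):
    # Add a bonus that shrinks as the state-action pair is visited more often.
    return r + beta / math.sqrt(max(n_visits, 1))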

🎯 Exploration Strategy Comparison

graph TD
    A["🎯 Exploration Strategy Selection"] --> B["🎮 Environment Type"]
    A --> C["📊 Performance Requirements"]
    A --> D["🔧 Implementation Complexity"]

    B --> B1["🎲 Stochastic Environments<br/>→ UCB, Thompson"]
    B --> B2["🎯 Deterministic Environments<br/>→ ε-greedy"]
    B --> B3["🌊 Non-stationary Environments<br/>→ Decaying ε-greedy"]

    C --> C1["🚀 Fast Convergence<br/>→ UCB"]
    C --> C2["⚖️ Balanced Performance<br/>→ Softmax"]
    C --> C3["🎯 Simple and Effective<br/>→ ε-greedy"]

    D --> D1["💡 Simple Implementation<br/>→ ε-greedy"]
    D --> D2["🧠 Medium Complexity<br/>→ Softmax"]
    D --> D3["🔬 Advanced Methods<br/>→ Thompson"]

📈 Exploration Strategy Performance Comparison

| Strategy | Theoretical Guarantee | Implementation Difficulty | Application Scenario | Convergence Speed |
|---|---|---|---|---|
| ε-Greedy | General | Simple | General | Medium |
| Decaying ε-Greedy | Better | Simple | Non-stationary | Faster |
| Softmax | General | Medium | Continuous values | Medium |
| UCB | Excellent | Medium | Stochastic | Fast |
| Thompson Sampling | Excellent | Complex | Bayesian setting | Fast |

๐Ÿ” Exploration in Deep Reinforcement Learning

In Deep RL, exploration becomes more complex because the state space is huge, and traditional counting methods are no longer applicable.

Intrinsic Motivation

  • Curiosity-Driven: ICM (Intrinsic Curiosity Module)
  • Random Network Distillation: RND (Random Network Distillation)
  • NGU: Never Give Up

Parameter Space Exploration

  • Parameter Noise: Adding noise to network parameters
  • NoisyNet: Learnable noisy networks

💡 Practical Advice

  1. More Exploration Initially: Increase exploration probability at the beginning of learning
  2. More Exploitation Later: Decrease exploration during convergence
  3. Environment Adaptation: Choose appropriate strategy based on environment characteristics
  4. Monitor Metrics: Track exploration rate and performance metrics
  5. Hyperparameter Tuning: Carefully adjust exploration-related hyperparameters

🎯 Summary

Balancing exploration and exploitation is key to successful reinforcement learning. Without exploration, the agent may fall into local optima; without exploitation, the agent cannot effectively use learned knowledge. Choosing the right exploration strategy requires considering environment characteristics, performance requirements, and implementation complexity.

5. The Bellman Equation: Theoretical Foundation of Reinforcement Learning

The Bellman equation is the theoretical cornerstone of reinforcement learning. Almost all reinforcement learning algorithms are built upon this elegant mathematical framework. It reveals the recursive structure of value functions, providing fundamental insight into how reinforcement learning problems can be understood and solved.

🧮 Mathematical Basis: Return and Value Function

Definition of Return

Return $G_t$ is the discounted sum of all future rewards from time $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Discount Factor γ

The discount factor $\gamma \in [0,1]$ controls the importance given to future rewards:

graph LR
    A["γ = 0<br/>🏃‍♂️ Completely Short-sighted"] --> B["γ = 0.5<br/>⚖️ Balanced Consideration"]
    B --> C["γ = 0.9<br/>🔮 Quite Far-sighted"]
    C --> D["γ = 1<br/>👁️ Completely Far-sighted"]

    style A fill:#ffcdd2
    style B fill:#fff3e0
    style C fill:#e8f5e8
    style D fill:#e3f2fd

Role of the Discount Factor:

  • Mathematical Convergence: Ensures convergence of infinite sequences
  • Uncertainty Modeling: The further into the future, the more uncertain
  • Practical Significance: Reflects the time value of rewards (analogous to an interest rate)
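
A tiny sketch showing how γ changes the value assigned to the same reward sequence (the sequence here is an arbitrary example with a delayed large reward):

def discounted_return(rewards, gamma):
    G = 0.0
    for r in reversed(rewards):       # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
    return G

rewards = [1, 1, 1, 1, 10]            # the large reward only arrives at the end
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))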

📊 Definition of Value Functions

State-Value Function $V^\pi(s)$

Definition: The expected return from state $s$ following policy $\pi$.

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Action-Value Function $Q^\pi(s,a)$

Definition: The expected return from state $s$ taking action $a$, and then following policy $\pi$.

$$Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Relationship Between the Two

$$V^\pi(s) = \sum_{a \in A} \pi(a|s)\, Q^\pi(s,a)$$

🔄 Bellman Equation Derivation and Intuition

Core Insight: Recursive Decomposition

Bellman's ingenious insight is that value can be recursively decomposed into the immediate reward plus the discounted value of subsequent states.

graph TD
    A["Value of current state s"] --> B["Expected immediate reward"]
    A --> C["+ ฮณ ร— Expected value of next state"]
    
    B --> B1["Select action based on policy ฯ€"]
    B --> B2["Get immediate reward R(s,a)"]
    
    C --> C1["Transition to next state s'"]
    C --> C2["Value of that state V(s')"]
    
    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0

Derivation of Bellman Expectation Equation

Starting from the definition of the value function:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Expanding the definition of return:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

Using the linearity of expectation:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

Noting that $\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[V^\pi(S_{t+1}) \mid S_t = s]$:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s]$$

๐Ÿ“ Full Form of the Bellman Equation

Bellman Expectation Equation for State-Value Function

$$V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

Intuitive Understanding:

  • Outer summation: Iterates through all possible actions
  • $\pi(a|s)$: Probability of selecting action $a$
  • Inner summation: Iterates through all possible next states
  • $P(s'|s,a)$: State transition probability
  • $R(s,a,s') + \gamma V^\pi(s')$: Immediate reward + discounted future value

Bellman Expectation Equation for Action-Value Function

$$Q^\pi(s,a) = \sum_{s' \in S} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

Or:

$$Q^\pi(s,a) = \sum_{s' \in S} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a' \in A} \pi(a'|s')\, Q^\pi(s',a')\right]$$

🎯 Bellman Optimality Equations

For the optimal policy $\pi^*$, we have the Bellman Optimality Equations:

Optimal State-Value Function

$$V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$

Optimal Action-Value Function

$$Q^*(s,a) = \sum_{s' \in S} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a' \in A} Q^*(s',a')\right]$$

🔧 Applications of the Bellman Equation

1. Dynamic Programming Algorithms

Policy Evaluation:

Repeat until convergence:
    For all states s:
        V(s) ← Σ_a π(a|s) Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]

Value Iteration:

Repeat until convergence:
    For all states s:
        V(s) ← max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')]

2. Temporal Difference Learning

TD(0) Update Rule:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

Where $R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.

3. Q-Learning Algorithm

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$
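
A compact tabular version of this update, as a sketch (Q is assumed to be a dict keyed by (state, action)):

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(s_next, a2)] for a2 in actions)        # max_a Q(S_{t+1}, a)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # move toward the TD target
    return Q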

🎪 Geometric Interpretation of the Bellman Equation

The Bellman equation can be viewed as a Contraction Mapping in the value function space:

graph TD
    A["Value Function Space"] --> B["Bellman Operator T"]
    B --> C["New Value Function"]
    C --> D["Fixed Point = True Value Function"]
    
    B --> B1["T[V](s) = max_a ฮฃ_s' P(s'|s,a)[R + ฮณV(s')]"]
    
    style D fill:#e8f5e8

Contraction Property: $\|T[V_1] - T[V_2]\|_\infty \le \gamma \|V_1 - V_2\|_\infty$

This ensures the convergence of the value iteration algorithm.

💡 Practical Significance and Intuition

  1. Recursive Structure: Complex problems are decomposed into subproblems
  2. Dynamic Programming Principle: Optimal substructure property of optimal decisions
  3. Temporal Consistency: Current optimal decision considers future optimality
  4. Algorithmic Foundation: Theoretical basis for almost all RL algorithms

🎯 Summary

The Bellman equation is not just an elegant mathematical expression but also the key to understanding the essence of reinforcement learning. It tells us:

  • Value functions have a recursive structure
  • Current decisions should consider long-term impacts
  • Optimal policies satisfy the optimal substructure property of dynamic programming
  • The true value function can be approximated through iterative solutions

Mastering the Bellman equation is an essential step toward a deep understanding of reinforcement learning algorithms.

💡 Final Words: The charm of reinforcement learning lies in the way it mirrors the essence of biological learning: improving continually through trial, error, and feedback. Just as humans learn to ride a bicycle, reinforcement learning allows machines to gain intelligence through interaction with their environment. May this guide provide a solid starting point for your reinforcement learning journey!