
Learning Q-Learning: Solving and Experimenting with CartPole-v1 from OpenAI Gym

Reinforcement learning (RL) has emerged as a powerful paradigm in artificial intelligence, enabling machines to learn complex behaviors through interaction with their environment. Among the various RL algorithms, Q-learning stands out for its simplicity and effectiveness. In this comprehensive guide, we'll delve deep into the implementation of Q-learning to solve the classic CartPole-v1 problem from OpenAI Gym, offering valuable insights for AI practitioners, researchers, and enthusiasts alike.

Introduction to CartPole-v1

CartPole-v1 is a fundamental control theory problem that serves as an excellent benchmark for testing reinforcement learning algorithms. In this environment, a pole is attached by an un-actuated joint to a cart that moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright.

Key Characteristics of CartPole-v1:

  • State Space: Continuous, 4-dimensional
  • Action Space: Discrete, 2 actions
  • Reward: +1 for each step, episode ends if pole angle exceeds ±12°, cart position exceeds ±2.4, or episode length reaches 500 steps
  • Objective: Maximize cumulative reward by keeping the pole balanced for as long as possible

Setting Up the Environment

To begin our journey into Q-learning with CartPole-v1, we first need to set up our environment:

import gym
import numpy as np

# Create the CartPole-v1 environment (this article uses the classic Gym API,
# in which step() returns a 4-tuple of observation, reward, done, info)
env = gym.make("CartPole-v1")
env.reset()

This code snippet imports the necessary libraries and creates an instance of the CartPole-v1 environment. The gym library provides a standardized interface for reinforcement learning environments, making it easy to experiment with different algorithms and problems.

Understanding the Observation Space

The observation space in CartPole-v1 consists of four elements:

def print_observation_space(env):
    print(f"Observation space high: {env.observation_space.high}")
    print(f"Observation space low: {env.observation_space.low}")
    print(f"Number of actions in the action space: {env.action_space.n}")

print_observation_space(env)

This function reveals the boundaries of our observation space and the number of possible actions. The observation space includes:

  1. Cart Position: Range [-4.8, 4.8]
  2. Cart Velocity: Unbounded
  3. Pole Angle: Range [-0.418, 0.418] radians
  4. Pole Angular Velocity: Unbounded

While the position and angle have defined limits, the velocities are theoretically unbounded. This presents a challenge for discretization, which is crucial for our Q-learning implementation.

Handling Unbounded Velocities

To address the issue of unbounded velocities, we can implement a pragmatic approach:

def get_max_velocity(env):
    # Push the cart in one direction for a full episode and record the largest
    # velocities observed -- a rough heuristic, not a true upper bound
    max_velo_cart = 0
    max_velo_pole = 0
    env.reset()
    done = False
    while not done:
        new_state, _, done, _ = env.step(1)  # always push right
        max_velo_cart = max(max_velo_cart, abs(new_state[1]))
        max_velo_pole = max(max_velo_pole, abs(new_state[3]))
    print(f"Max_velo_cart={max_velo_cart}")
    print(f"Max_velo_pole={max_velo_pole}")

get_max_velocity(env)

This function runs an episode of the environment and records the largest velocities it observes. While not a true maximum, it provides usable bounds for our discretization process. In practice, these values settle around:

  • Max Cart Velocity: ~3.5
  • Max Pole Angular Velocity: ~4.0

Discretizing the State Space

For Q-learning, we need to discretize our continuous state space:

DISCRETE_OS_SIZE = [25, 25]  # Bins for pole angle and pole angular velocity
# Bounds for the two dimensions we keep: the pole-angle limit from the environment
# and the empirical pole angular velocity bound found above (~4.0)
real_observation_space = np.array([env.observation_space.high[2], 4.0])
discrete_os_win_size = real_observation_space * 2 / DISCRETE_OS_SIZE

def get_discrete_state(state):
    # Keep only pole angle and pole angular velocity, shift them to be non-negative,
    # and map them onto the 25x25 grid
    trimmed_state = np.array([state[2], state[3]])
    discrete_state = (trimmed_state + real_observation_space) / discrete_os_win_size
    # Clip so values slightly outside the empirical bounds still land in a valid bin
    discrete_state = np.clip(discrete_state, 0, np.array(DISCRETE_OS_SIZE) - 1)
    return tuple(discrete_state.astype(int))

This discretization focuses on the pole's angle and angular velocity, disregarding cart data for simplicity and improved training speed. We use 25 bins for each dimension, resulting in a 25×25 grid of discrete states.
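
As a quick sanity check, we can map a freshly reset observation onto the grid. A small illustrative snippet (not part of the training code) using the helper above:

state = env.reset()
print(f"Continuous state: {state}")
print(f"Discrete state:   {get_discrete_state(state)}")  # indices near the centre of the 25x25 grid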

Discretization Analysis

The choice of 25 bins for each dimension is a balance between precision and computational efficiency. Here's a brief analysis of different discretization levels:

Bins    State Space Size    Precision    Training Speed    Memory Usage
10      100                 Low          Fast              Low
25      625                 Medium       Medium            Medium
50      2500                High         Slow              High
100     10000               Very High    Very Slow         Very High

Our choice of 25 bins provides a good balance, allowing for sufficient precision while maintaining reasonable training speed and memory usage.
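
For context, here is a tiny illustrative snippet (not from the original implementation) that estimates the raw memory footprint of a float64 Q-table at each of these bin counts:

# Rough memory footprint of a float64 Q-table for different discretization levels
for bins in (10, 25, 50, 100):
    entries = bins * bins * 2  # states (bins x bins) times 2 actions
    print(f"{bins:>3} bins -> {entries:>6} Q-values (~{entries * 8 / 1024:.1f} KB as float64)")

Even the finest grid here is tiny by modern standards; the real cost of more bins is the extra experience needed to visit each state often enough to learn accurate values.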

Creating the Q-Table

The Q-table is the heart of our Q-learning algorithm:

q_table = np.random.uniform(low=0, high=1, size=(DISCRETE_OS_SIZE + [env.action_space.n]))

This initializes our Q-table with random values between 0 and 1. The Q-table has dimensions 25×25×2, representing the discretized state space (25×25) and the two possible actions.
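
As a quick illustration of how the table is consulted (an assumed usage example, reusing the get_discrete_state helper defined earlier):

# Look up the Q-values for one discrete state and pick the greedy action
sample_state = get_discrete_state(env.reset())
print(f"Q-values for state {sample_state}: {q_table[sample_state]}")
print(f"Greedy action: {np.argmax(q_table[sample_state])}")  # 0 = push left, 1 = push right

During training, these randomly initialized values are gradually replaced by estimates of the expected discounted return for each state-action pair.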

Training Parameters

Before training, we set our hyperparameters:

LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 12000
LOG_FREQUENCY = 2000
epsilon = 0.1
START_DECAY = 1
END_DECAY = EPISODES // 2
epsilon_decay_by = epsilon / (END_DECAY - START_DECAY)

These parameters control the learning process, balancing exploration and exploitation. Let's break down each parameter:

  • LEARNING_RATE (0.1): Determines how much new information overrides old information. A higher rate makes the agent learn faster but can lead to instability.
  • DISCOUNT (0.95): Balances immediate and future rewards. A value close to 1 prioritizes long-term rewards.
  • EPISODES (12000): Total number of training episodes.
  • LOG_FREQUENCY (2000): How often (in episodes) training progress is logged.
  • epsilon (0.1): Initial exploration rate, determining how often the agent takes random actions.
  • START_DECAY (1) and END_DECAY (EPISODES // 2): Define the range of episodes over which epsilon is reduced linearly, by epsilon_decay_by per episode; the effect is traced in the short sketch below.
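
To see the exploration schedule concretely, here is a minimal sketch (not part of the training code) that traces how epsilon shrinks, using the same linear decay applied in the training loop below:

# Trace the linear epsilon decay without touching the real epsilon variable
eps = epsilon
for ep in range(EPISODES):
    if END_DECAY >= ep >= START_DECAY:
        eps -= epsilon_decay_by
    if ep % LOG_FREQUENCY == 0:
        print(f"Episode {ep:>5}: epsilon ≈ {max(eps, 0):.4f}")

Exploration effectively stops once epsilon reaches zero around the halfway point (END_DECAY = 6000), after which the agent acts greedily with respect to its Q-table.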

The Training Loop

The core of our Q-learning implementation lies in the training loop:

for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    
    while not done:
        # Epsilon-greedy action selection: exploit the best-known action most of
        # the time, otherwise explore with a random action
        if np.random.random() > epsilon:
            action = np.argmax(q_table[discrete_state])
        else:
            action = np.random.randint(0, 2)
        
        new_state, reward, done, _ = env.step(action)
        new_discrete_state = get_discrete_state(new_state)
        
        if not done:
            # Tabular Q-learning update: blend the old estimate with the
            # immediate reward plus the discounted best future value
            max_future_q = np.max(q_table[new_discrete_state])
            current_q = q_table[discrete_state + (action,)]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action,)] = new_q
        # Note: no update is applied on the terminal step in this simple implementation
        
        discrete_state = new_discrete_state
    
    # Linearly reduce exploration between START_DECAY and END_DECAY
    if END_DECAY >= episode >= START_DECAY:
        epsilon -= epsilon_decay_by

This loop iterates through episodes, updating the Q-table based on the agent's experiences and gradually reducing exploration as training progresses.
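
The update applied inside the loop is the standard tabular Q-learning rule, with α corresponding to LEARNING_RATE and γ to DISCOUNT:

Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_a′ Q(s′, a′))

The new_q computed in the code is exactly this blend of the old estimate with the immediate reward plus the discounted value of the best action in the next state.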

Analysis and Insights

The Q-learning approach described here demonstrates several key aspects of reinforcement learning:

  1. State Space Discretization: By converting continuous states into discrete bins, we enable tabular Q-learning in a continuous environment. This discretization is a crucial step in applying Q-learning to real-world problems with continuous state spaces.

  2. Exploration vs. Exploitation: The epsilon-greedy strategy balances exploration of new actions with exploitation of known good actions. The gradual decay of epsilon over time shifts the focus from exploration to exploitation as the agent learns.

  3. Value Function Approximation: The Q-table serves as a simple form of value function approximation, mapping state-action pairs to expected cumulative rewards.

  4. Sample Efficiency: Because Q-learning is off-policy and bootstraps from its own value estimates, it can learn effective policies from a relatively modest number of environment interactions. In the CartPole problem, good performance is often reached within a few thousand episodes.

Performance Analysis

To understand the learning progress, we can track the average reward over time:

reward_history = []
for episode in range(EPISODES):
    discrete_state = get_discrete_state(env.reset())
    done = False
    episode_reward = 0

    while not done:
        # ... (epsilon-greedy action selection and Q-update as in the training loop above) ...
        episode_reward += reward
    reward_history.append(episode_reward)

    if episode % LOG_FREQUENCY == 0:
        # Average over the most recent episodes actually recorded
        recent = reward_history[-LOG_FREQUENCY:]
        avg_reward = sum(recent) / len(recent)
        print(f"Episode {episode}, Average Reward: {avg_reward}")

Typical learning progress might look like this:

Episode    Average Reward
0          22.5
2000       89.3
4000       156.7
6000       234.2
8000       312.8
10000      378.5
12000      421.6

This progression demonstrates how the agent improves over time, eventually learning to balance the pole for extended periods.
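
To verify this qualitatively, the trained agent can be run greedily, with epsilon effectively set to zero and no further learning. A minimal evaluation sketch, assuming the q_table and get_discrete_state defined above:

# Run a few evaluation episodes with the learned greedy policy (no exploration, no updates)
for _ in range(5):
    discrete_state = get_discrete_state(env.reset())
    done = False
    total_reward = 0
    while not done:
        action = np.argmax(q_table[discrete_state])  # always take the best-known action
        new_state, reward, done, _ = env.step(action)
        env.render()  # comment out for headless runs
        discrete_state = get_discrete_state(new_state)
        total_reward += reward
    print(f"Evaluation episode reward: {total_reward}")
env.close()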

Advanced Techniques and Future Directions

While our implementation successfully solves the CartPole-v1 problem, there are several avenues for further research and improvement:

  1. Function Approximation: For more complex environments, replacing the Q-table with a neural network (Deep Q-Network) could provide better generalization. This approach has been successfully applied to a wide range of problems, including Atari games.

  2. Prioritized Experience Replay: Implementing this technique could improve learning efficiency by focusing on the most informative experiences. Research has shown that this can lead to faster convergence and better final performance.

  3. Multi-step Learning: Incorporating multi-step returns could lead to faster learning and improved performance. This technique helps to propagate rewards more quickly through the state space.

  4. Policy Gradient Methods: Comparing the Q-learning approach with policy gradient methods like REINFORCE or Actor-Critic algorithms could provide insights into their relative strengths and weaknesses. These methods have shown promise in continuous action spaces and high-dimensional problems.

  5. Continuous Action Spaces: Extending the approach to handle continuous action spaces using techniques like DDPG (Deep Deterministic Policy Gradient) could broaden its applicability. This is particularly relevant for real-world robotics applications.

Conclusion

This exploration of Q-learning in the CartPole-v1 environment serves as a foundational example of reinforcement learning in practice. By breaking down the process into discrete steps – from environment setup to training loop implementation – we've illuminated the core concepts behind Q-learning and its application to a classic control problem.

The simplicity of the CartPole environment belies the profound insights it can provide into the mechanics of reinforcement learning. As we've seen, even with a relatively straightforward implementation, we can achieve impressive results in balancing the pole.

For AI practitioners and researchers, this case study offers a springboard for deeper exploration into reinforcement learning algorithms, their implementations, and potential optimizations. The principles demonstrated here – of learning through interaction, balancing exploration and exploitation, and iteratively improving performance – remain fundamental to the development of more advanced and capable AI systems.

As the field of AI continues to evolve, the lessons learned from solving CartPole-v1 with Q-learning can be applied to more complex problems. From autonomous vehicles to robotic control systems, the core principles of reinforcement learning continue to drive innovation and push the boundaries of what's possible in artificial intelligence.

In future explorations, we'll delve deeper into the effects of various hyperparameters on the learning process, providing a more nuanced understanding of how to fine-tune reinforcement learning algorithms for optimal performance. We'll also explore more advanced techniques like Deep Q-Networks and policy gradient methods, building on the foundation established in this article.

The journey into reinforcement learning is an exciting one, filled with challenges and opportunities. As we continue to refine our algorithms and tackle increasingly complex problems, the potential for AI to transform our world grows ever greater. The CartPole-v1 problem, while simple, serves as a crucial stepping stone in this ongoing journey of discovery and innovation.