Deep Q-Network

DQN (Deep Q-Network) is a variant of the Q-learning algorithm that uses a deep neural network to approximate the Q-function. In classic Q-learning, the Q-function is represented as a table, where each entry corresponds to a state-action pair and holds the expected cumulative reward of taking that action in that state. In large or continuous state spaces, however, representing the Q-function as a table becomes infeasible, because the number of entries would be far too large.
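
For small, discrete problems the tabular form of Q-learning is easy to write down directly; the sketch below (the state and action counts are assumed purely for illustration) shows the classic update rule that DQN replaces with a neural network once the table becomes too large to store.

import numpy as np

# A minimal sketch of tabular Q-learning; n_states and n_actions are
# assumed sizes chosen only for illustration.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99   # learning rate and discount factor

Q = np.zeros((n_states, n_actions))   # one entry per state-action pair

def q_learning_update(state, action, reward, next_state):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])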

DQN addresses this problem by approximating the Q-function with a deep neural network. The network takes the current state as input and outputs an estimated Q-value for each action. Because the true Q-values are not available during training, the network is trained to minimize the difference between its estimates and bootstrapped target values derived from the Bellman equation.
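
Concretely, the regression target for a transition (s, a, r, s') is the Bellman backup r + gamma * max over a' of Q(s', a'). A minimal sketch of this target computation, assuming q_network is a Keras model that maps a batch of states to per-action Q-value estimates and that the batch inputs are float32 tensors:

import tensorflow as tf

def td_targets(q_network, rewards, next_states, dones, gamma=0.99):
    # Bootstrapped target: r + gamma * max_a' Q(s', a'), with no bootstrap
    # term for terminal transitions (done == True).
    next_q = tf.reduce_max(q_network(next_states), axis=1)
    return rewards + gamma * (1.0 - tf.cast(dones, tf.float32)) * next_q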

One of the key challenges in training a DQN is that consecutive experiences are strongly correlated: each action changes the state of the environment, so the training data arrives as a highly dependent sequence rather than as independent samples, which destabilizes learning. To overcome this problem, DQN uses a technique called experience replay. Experience replay stores the agent's past experiences, including the states, actions, rewards, and next states, in a replay buffer. The agent then samples a random batch of experiences from the replay buffer to train the neural network.
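
A minimal sketch of such a replay buffer (the class name and capacity below are illustrative, not taken from any particular library):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50000):
        # The oldest transitions are discarded once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)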

Another key feature of DQN is the use of a target network. The target network is a separate copy of the main network that is used to estimate the Q-values for the next state, and it is updated much less frequently than the main network. Keeping the targets fixed for many steps stabilizes training and reduces the harmful feedback between the predicted Q-values and the targets they are regressed toward.
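
A minimal sketch of how a target network can be maintained in TensorFlow, assuming the same small two-hidden-layer architecture used in the CartPole example below (the 4-dimensional state and 2 actions match CartPole):

import tensorflow as tf

def build_q_network(state_dim=4, action_dim=2):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, input_dim=state_dim, activation='relu'),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(action_dim, activation='linear')
    ])

model = build_q_network()
target_model = build_q_network()
target_model.set_weights(model.get_weights())   # start as an exact copy

# In the training loop, next-state Q-values come from target_model, and its
# weights are refreshed from the main network only every N gradient steps:
# if step % 1000 == 0:
#     target_model.set_weights(model.get_weights())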

DQN has been applied to a wide range of problems, including game playing, robotics, and control systems. It was first introduced by DeepMind in 2013, and it was one of the first reinforcement learning algorithms to reach human-level performance on a large set of Atari 2600 games by learning directly from raw pixel input.

Here is an example of how to use TensorFlow to train a simple DQN agent on the CartPole environment. CartPole consists of a cart that can move left or right and a pole attached to the cart by an unactuated joint. The goal of the agent is to keep the pole balanced upright for as long as possible.

import random

import gym
import numpy as np
import tensorflow as tf

# Create the CartPole environment
env = gym.make('CartPole-v0')

# Define the state and action dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

# Create the DQN model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_dim=state_dim, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(action_dim, activation='linear')
])

# Define the optimizer and loss function (mean squared error on the TD target)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_fn = tf.keras.losses.MeanSquaredError()

# Hyperparameters
gamma = 0.99      # discount factor
epsilon = 0.1     # exploration rate for the epsilon-greedy policy
batch_size = 32

# Create the experience replay buffer
replay_buffer = []

# Define the training loop
for episode in range(1000):
    # Reset the environment (newer Gym/Gymnasium versions return (state, info))
    state = env.reset()

    while True:
        # Select an action with an epsilon-greedy policy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            q_values = model(tf.convert_to_tensor(state[None, :], dtype=tf.float32))
            action = int(tf.argmax(q_values[0]).numpy())

        # Take the action (newer Gym/Gymnasium versions return a 5-tuple here)
        next_state, reward, done, _ = env.step(action)

        # Add the experience to the replay buffer
        replay_buffer.append((state, action, reward, next_state, done))

        # Train once the buffer holds at least one full batch
        if len(replay_buffer) >= batch_size:
            # Sample a random batch of experiences from the replay buffer
            batch = random.sample(replay_buffer, batch_size)
            states, actions, rewards, next_states, dones = map(np.array, zip(*batch))

            # Compute the maximum Q-value for each next state
            next_q_values = model(tf.convert_to_tensor(next_states, dtype=tf.float32))
            next_q_values = tf.reduce_max(next_q_values, axis=1)

            # Compute the target Q-values (no bootstrapping on terminal states)
            rewards = rewards.astype(np.float32)
            dones = dones.astype(np.float32)
            target_q_values = rewards + gamma * (1.0 - dones) * next_q_values

            with tf.GradientTape() as tape:
                # Compute the current Q-values
                q_values = model(tf.convert_to_tensor(states, dtype=tf.float32))

                # One-hot encode the actions
                actions_one_hot = tf.one_hot(actions, action_dim)

                # Keep only the Q-values of the actions that were taken
                q_values = tf.reduce_sum(q_values * actions_one_hot, axis=1)

                # Compute the loss between targets and predictions
                loss = loss_fn(target_q_values, q_values)

            # Backpropagation and optimization with gradient clipping
            grads = tape.gradient(loss, model.trainable_variables)
            grads, _ = tf.clip_by_global_norm(grads, 5.0)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Update the state
        state = next_state

        # End the episode if done
        if done:
            break

This is a basic example of how to use TensorFlow to implement the DQN algorithm and train an agent to play CartPole-v0. The agent uses a neural network with three dense layers to approximate the Q-function. The state is a four-dimensional vector containing the cart position, cart velocity, pole angle, and pole angular velocity, and the agent can take one of two actions at each step: push the cart left or right. The environment gives a reward of +1 for every time step the pole stays upright, and the episode ends when the pole falls over or the cart moves too far from the center. The agent selects actions with an epsilon-greedy policy, trains with the Adam optimizer and a mean squared error loss on the temporal-difference targets, and samples random batches from the experience replay buffer to update the network. For simplicity, this example computes the targets with the same network rather than a separate target network. Performance can be improved by tuning hyperparameters such as the learning rate, the number of hidden units, the exploration rate, and the size of the replay buffer, and by using more advanced techniques such as Double DQN, Dueling DQN, or prioritized experience replay.
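
As one example of such an extension, Double DQN changes only how the target is computed: the online network selects the greedy action for the next state and the target network evaluates it, which reduces the overestimation bias of taking a plain maximum. A sketch of that target, assuming model and target_model are Keras Q-networks and the batch inputs are float32 tensors:

import tensorflow as tf

def double_dqn_targets(model, target_model, rewards, next_states, dones, gamma=0.99):
    # The online network chooses the next action...
    best_actions = tf.argmax(model(next_states), axis=1)
    # ...and the target network evaluates it
    next_q = tf.gather(target_model(next_states), best_actions, batch_dims=1)
    return rewards + gamma * (1.0 - tf.cast(dones, tf.float32)) * next_q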

In summary, DQN (Deep Q-Network) is a variant of the Q-learning algorithm that approximates the Q-function with a deep neural network, which makes Q-learning practical in large state spaces where a table of Q-values would be infeasible. DQN uses experience replay to store and resample the agent's past experiences, and a target network to estimate the Q-values for the next state, both of which help to stabilize training. DQN has been applied to a wide range of problems and has achieved state-of-the-art performance on many Atari games.