Deep Q-Network for Stock Trading (Part IV): DQN Architecture Deep Dive

14 minute read

This post is part of a series on building a Deep Q-Network (DQN) based trading system for SPY (S&P 500 ETF).


← Previous: Part III: Learning Environment Design

Next Post → Part V: Software Architecture & Results


⚠️ Disclaimer

This blog series is for educational and research purposes only. The content should not be considered financial advice, investment advice, or trading advice. Trading stocks and financial instruments involves substantial risk of loss and is not suitable for every investor. Past performance does not guarantee future results. Always consult with a qualified financial advisor before making investment decisions.


Introduction

In Part III, we designed the trading environment with multi-buy accumulation and FIFO sells. Now it’s time to build the brain of our system: the Deep Q-Network (DQN).

This post dives into:

  1. Double DQN: Preventing Q-value overestimation
  2. Dueling Architecture: Separating value and advantage
  3. Experience Replay: Breaking temporal correlation
  4. Target Networks: Stabilizing training
  5. Network Architecture: Layers, activations, and hyperparameters

The complete code is in src/models/dqn.py on GitHub.

The Q-Learning Foundation

What is Q-Learning?

Q-learning learns a Q-function that estimates the expected cumulative reward:

Q(s, a) = Expected total reward from state s, taking action a

Bellman Equation:

Q(s, a) = r + γ × max_a' Q(s', a')

Where:

  • r: Immediate reward
  • γ (gamma): Discount factor (0.99)
  • s': Next state
  • max_a' Q(s', a'): Best Q-value in next state

Policy: Choose action with highest Q-value:

π(s) = argmax_a Q(s, a)
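
To make the update concrete, here is a toy tabular update in Python; the two-state problem, the reward, and the learning rate are made up purely for illustration:

import numpy as np

# Hypothetical toy problem: 2 states, 2 actions (values are illustrative only)
Q = np.zeros((2, 2))
alpha, gamma = 0.1, 0.99

# One observed transition: state 0, action 1, reward +1.0, next state 1
s, a, r, s_next = 0, 1, 1.0, 1

# Bellman-style update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])

print(Q)  # Q[0, 1] is now 0.1; everything else is still 0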

From Tabular to Deep Q-Learning

Tabular Q-Learning stores Q(s,a) in a table:

State    Action    Q-value
-----    ------    -------
s1       a1        0.5
s1       a2        0.3
s2       a1        0.8
...

Problem: Trading states are continuous (prices, indicators, etc.). Infinite table size!

Solution: Use a neural network to approximate Q(s,a):

Q(s, a) ≈ NN(s)[a]

Input: State → Output: Q-values for all actions
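
As a minimal sketch of this idea (assuming TensorFlow/Keras and the flattened 135-feature state and 8 actions used later in this post), a plain feed-forward network maps a state to one Q-value per action:

import tensorflow as tf

n_features, n_actions = 135, 8   # sizes used later in this post (5 × 27 window, 8 actions)

# A plain (non-dueling) Q-network sketch: flattened state in, one Q-value per action out
q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(n_actions),   # Q(s, a) for every action a
])

state = tf.random.normal((1, n_features))    # dummy state, just for a shape check
q_values = q_net(state)                      # shape: (1, 8)
best_action = tf.argmax(q_values, axis=1)    # greedy policy: argmax_a Q(s, a)
print(q_values.shape, int(best_action[0]))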

Vanilla DQN vs. Our Improvements

Vanilla DQN (2013-2015)

DeepMind’s original DQN (the 2013 workshop paper, refined in the 2015 Nature version) used:

  • Single network predicting Q-values
  • Experience replay
  • Target network

Problems:

  1. Overestimation bias: max operator causes Q-values to be too optimistic
  2. No value/advantage separation: Conflates state value with action advantages

Our Enhanced DQN

We implement two key improvements:

  1. Double DQN (2015): Fixes overestimation
  2. Dueling Architecture (2016): Separates value and advantage

Let’s explore each!

Double DQN: Fixing Overestimation

The Overestimation Problem

Vanilla DQN computes targets as:

Target = r + γ × max_a Q_target(s', a)

Issue: The same network selects AND evaluates the action:

max_a Q(s', a) = Q(s', argmax_a Q(s', a))

If Q-values have noise (they do!), max will pick overestimated values:

Example:

True Q-values:  [1.0, 1.2, 1.1]
Noisy Q-values: [1.5, 0.9, 1.3]  ← Noise added

max(true) = 1.2
max(noisy) = 1.5  ← Overestimate!

Over many updates, this compounds, causing Q-values to diverge.
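
A quick, self-contained simulation makes the bias visible: adding zero-mean noise to the true Q-values from the example above and averaging the max over many trials gives a value consistently above the true max (the noise level here is arbitrary):

import numpy as np

rng = np.random.default_rng(0)

true_q = np.array([1.0, 1.2, 1.1])   # true Q-values for three actions (from the example above)
noise_std = 0.3                       # zero-mean estimation noise

# Average the "max over noisy estimates" across many trials
mean_noisy_max = np.mean([
    np.max(true_q + rng.normal(0.0, noise_std, size=3))
    for _ in range(10_000)
])

print(f"true max:          {true_q.max():.2f}")    # 1.20
print(f"average noisy max: {mean_noisy_max:.2f}")  # consistently above 1.20: upward bias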

Double DQN Solution

Key Idea: Decouple selection from evaluation:

  1. Behavior Network selects action:
    a* = argmax_a Q_behavior(s', a)
    
  2. Target Network evaluates that action:
    Target = r + γ × Q_target(s', a*)
    

Why This Works:

If behavior network overestimates action a*, target network is unlikely to also overestimate it (independent noise). This reduces systematic overestimation.

Implementation

@tf.function
def train_step(self, states, actions, rewards, next_states, dones, gamma=0.99):
    """Double DQN training step."""
    rewards = tf.cast(rewards, tf.float32)
    dones = tf.cast(dones, tf.float32)

    # Double DQN: select the next action with the behavior network...
    next_q_values_behavior = self.q_network(next_states, training=False)
    next_actions = tf.argmax(next_q_values_behavior, axis=1)  # ← Selection

    # ...and evaluate that action with the target network
    next_q_values_target = self.target_network(next_states, training=False)
    next_actions_one_hot = tf.one_hot(next_actions, self.n_actions)
    next_q_values = tf.reduce_sum(next_q_values_target * next_actions_one_hot, axis=1)  # ← Evaluation

    # Compute targets (no bootstrapping from terminal states)
    targets = rewards + gamma * next_q_values * (1.0 - dones)

    with tf.GradientTape() as tape:
        # Current Q-values from the behavior network, for the actions actually taken
        current_q_values = self.q_network(states, training=True)
        actions_one_hot = tf.one_hot(actions, self.n_actions)
        current_q_values = tf.reduce_sum(current_q_values * actions_one_hot, axis=1)

        # MSE loss against the fixed targets
        loss = tf.keras.losses.MSE(targets, current_q_values)

    # Update the behavior network only
    gradients = tape.gradient(loss, self.q_network.trainable_variables)
    self.optimizer.apply_gradients(zip(gradients, self.q_network.trainable_variables))

    return loss

Key Points:

  • Behavior network updated every step
  • Target network copied periodically (e.g., every 1-10 episodes)
  • Decoupling reduces overestimation

Dueling Architecture: Value vs. Advantage

The Motivation

Some states are intrinsically valuable regardless of action:

Example:

State: Bull market, RSI=50, trending up, cash available

All actions have similar Q-values:
Q(s, HOLD) = 2.5
Q(s, BUY_10) = 2.7
Q(s, BUY_50) = 2.6
Q(s, BUY_100) = 2.8

Observation: The state itself is valuable (2.5 baseline). Actions provide small advantages (+0.0 to +0.3).

Dueling Architecture Idea:

Separate the state value from action advantages:

Q(s, a) = V(s) + A(s, a)

Where:

  • V(s): Value of being in state s (independent of action)
  • A(s, a): Advantage of action a in state s (relative to average)

Dueling Formula

To ensure identifiability, we center advantages:

Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))

Why subtract mean?

Without centering, the network could output arbitrary V and A values that sum to the same Q. Centering forces:

  • V(s): Represents baseline state value
  • A(s, a) - mean(A): Relative advantage of action a
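
A tiny numeric check (the V and A values are made up for illustration) shows what the centering buys us: the centered advantages average to zero, so V(s) is exactly the mean of the Q-values:

import numpy as np

V = 2.65                                    # state value (illustrative)
A = np.array([-0.15, 0.05, -0.05, 0.15])    # raw advantages for HOLD, BUY_10, BUY_50, BUY_100

Q = V + (A - A.mean())                      # dueling combination with centering

print(Q)          # [2.5 2.7 2.6 2.8]: the Q-values from the bull-market example above
print(Q.mean())   # 2.65 == V(s), since the centered advantages average to zero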

Architecture Diagram

Input State (window_size, n_features)
         ↓
     [Flatten]
         ↓
  [Shared Layers]
    256 → 128
         ↓
    ┌─────┴─────┐
    ↓           ↓
[Value Stream] [Advantage Stream]
   128           128
    ↓             ↓
[V(s): 1]    [A(s,a): n_actions]
    └──────┬──────┘
           ↓
     Q(s,a) = V + A - mean(A)

Implementation

import tensorflow as tf
from tensorflow.keras import Model, layers


class DQNNetwork(Model):
    """Dueling DQN network (value and advantage streams)."""

    def __init__(self, state_shape, n_actions, config, name=None):
        super().__init__(name=name)

        # Shared layers
        self.flatten = layers.Flatten()
        self.shared_1 = layers.Dense(256, activation='relu', name='shared_1')
        self.shared_2 = layers.Dense(128, activation='relu', name='shared_2')

        # Value stream
        self.value_hidden = layers.Dense(128, activation='relu', name='value_hidden')
        self.value_output = layers.Dense(1, name='value_output')  # Single value

        # Advantage stream
        self.advantage_hidden = layers.Dense(128, activation='relu', name='advantage_hidden')
        self.advantage_output = layers.Dense(n_actions, name='advantage_output')  # One per action

    def call(self, inputs, training=False):
        """Forward pass."""
        # Shared processing
        x = self.flatten(inputs)
        x = self.shared_1(x)
        x = self.shared_2(x)

        # Value stream
        value = self.value_hidden(x)
        value = self.value_output(value)  # Shape: (batch, 1)

        # Advantage stream
        advantage = self.advantage_hidden(x)
        advantage = self.advantage_output(advantage)  # Shape: (batch, n_actions)

        # Combine with centering
        q_values = value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))

        return q_values
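
A quick shape check of the class above (the config argument is unused in this sketch, so an empty dict is enough here):

import tensorflow as tf

# 5-step window × 27 features and 8 actions, matching the sizes used below
net = DQNNetwork(state_shape=(5, 27), n_actions=8, config={})

dummy_states = tf.random.normal((32, 5, 27))   # batch of 32 dummy state windows
q_values = net(dummy_states)

print(q_values.shape)   # (32, 8): one Q-value per action for each state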

Benefits:

  1. Faster Learning: Network learns state values even when actions don’t matter much
  2. Better Generalization: Decoupling helps in states with clear best actions
  3. Robustness: More stable Q-value estimates

Experience Replay: Breaking Correlation

The Problem with Online Learning

If we train on consecutive experiences:

Episode:
Step 1: (s1, a1, r1, s2)  → Train
Step 2: (s2, a2, r2, s3)  → Train
Step 3: (s3, a3, r3, s4)  → Train
...

Issues:

  1. Temporal Correlation: Consecutive states are highly correlated
  2. Inefficient: Each experience used once, then discarded
  3. Catastrophic Forgetting: New experiences overwrite old learnings

Experience Replay Solution

Key Idea: Store experiences in a buffer, sample random mini-batches for training.

import numpy as np


class ReplayBuffer:
    """Fixed-size buffer to store experience tuples."""

    def __init__(self, buffer_size: int = 10000):
        """
        Args:
            buffer_size: Maximum buffer size
        """
        self.buffer = []
        self.buffer_size = buffer_size
        self.position = 0

    def __len__(self):
        """Number of experiences currently stored (used by the training loop)."""
        return len(self.buffer)

    def add(self, state, action, reward, next_state, done, next_valid_actions):
        """Add experience to buffer."""
        experience = (state, action, reward, next_state, done, next_valid_actions)

        if len(self.buffer) < self.buffer_size:
            self.buffer.append(experience)
        else:
            # Circular buffer: overwrite oldest
            self.buffer[self.position] = experience
            self.position = (self.position + 1) % self.buffer_size

    def sample(self, batch_size: int = 64):
        """Sample random mini-batch."""
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)

        states = []
        actions = []
        rewards = []
        next_states = []
        dones = []
        next_valid_actions = []

        for idx in indices:
            s, a, r, s_next, d, nva = self.buffer[idx]
            states.append(s)
            actions.append(a)
            rewards.append(r)
            next_states.append(s_next)
            dones.append(d)
            next_valid_actions.append(nva)

        return (
            np.array(states),
            np.array(actions),
            np.array(rewards),
            np.array(next_states),
            np.array(dones),
            np.array(next_valid_actions)
        )
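
A short usage sketch with dummy transitions; the 5×27 state windows and 8-action boolean masks match the shapes used elsewhere in this series, but the values are placeholders:

import numpy as np

buffer = ReplayBuffer(buffer_size=1000)

# Fill with dummy transitions: 5×27 state windows and 8-action boolean masks
for _ in range(200):
    state = np.random.randn(5, 27).astype(np.float32)
    next_state = np.random.randn(5, 27).astype(np.float32)
    mask = np.ones(8, dtype=bool)
    buffer.add(state, action=0, reward=0.1, next_state=next_state,
               done=False, next_valid_actions=mask)

states, actions, rewards, next_states, dones, masks = buffer.sample(batch_size=64)
print(states.shape, actions.shape, masks.shape)   # (64, 5, 27) (64,) (64, 8)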

Training Loop with Replay

# Initialize
replay_buffer = ReplayBuffer(buffer_size=10000)
agent = DoubleDQN(state_shape, n_actions, config)

for episode in range(num_episodes):
    state = env.reset()

    for step in range(max_steps):
        # Select action
        valid_actions = env.get_action_mask()
        action = agent.get_action(state, valid_actions, epsilon)

        # Execute action
        next_state, reward, done, info = env.step(action)

        # Store experience
        next_valid_actions = env.get_action_mask()
        replay_buffer.add(state, action, reward, next_state, done, next_valid_actions)

        # Train from a random mini-batch (sample() also returns next_valid_actions,
        # which this simple train_step does not use)
        if len(replay_buffer) >= batch_size:
            b_states, b_actions, b_rewards, b_next_states, b_dones, _ = \
                replay_buffer.sample(batch_size)
            loss = agent.train_step(b_states, b_actions, b_rewards,
                                    b_next_states, b_dones, gamma=0.99)

        state = next_state
        if done:
            break

Benefits:

  1. Breaks Correlation: Random sampling decorrelates training samples
  2. Data Efficiency: Each experience used multiple times
  3. Stable Learning: Smooths out variance in updates

Target Network: Stabilizing the Target

The Moving Target Problem

Q-learning updates look like:

Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
                          ↑
                    This is the target

Problem: The target depends on Q itself! As we update Q, the target moves.

Analogy: Imagine learning to shoot basketballs, but the hoop moves every time you shoot. Impossible to converge!

Target Network Solution

Key Idea: Use a frozen copy of the network for computing targets:

  1. Behavior Network (Q): Updated every training step
  2. Target Network (Q’): Frozen for N steps, then copied from Q

class DoubleDQN:
    def __init__(self, state_shape, n_actions, config):
        self.n_actions = n_actions
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # see hyperparameters table below

        # Create two identical networks
        self.q_network = DQNNetwork(state_shape, n_actions, config, name='behavior')
        self.target_network = DQNNetwork(state_shape, n_actions, config, name='target')

        # Initialize target with behavior weights
        self.update_target_network()

    def update_target_network(self):
        """Copy weights from behavior to target."""
        self.target_network.set_weights(self.q_network.get_weights())

    def train(self, states, actions, rewards, next_states, dones, gamma=0.99):
        # Compute targets using the FROZEN target network
        next_q_values = self.target_network(next_states)
        targets = rewards + gamma * tf.reduce_max(next_q_values, axis=1) * (1.0 - dones)

        # Update behavior network toward these fixed targets
        with tf.GradientTape() as tape:
            current_q_values = self.q_network(states)
            actions_one_hot = tf.one_hot(actions, self.n_actions)
            current_q_values = tf.reduce_sum(current_q_values * actions_one_hot, axis=1)
            loss = tf.keras.losses.MSE(targets, current_q_values)

        gradients = tape.gradient(loss, self.q_network.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.q_network.trainable_variables))

# Training loop
for episode in range(num_episodes):
    # ... collect experience and train ...

    # Update target network every N episodes
    if episode % target_update_freq == 0:
        agent.update_target_network()

Analogy: Now the basketball hoop stays still for 10 shots, then moves to a new position. Much easier to learn!

Hyperparameter: target_update_freq

  • Low (e.g., 1): Frequent updates, less stable
  • High (e.g., 10): More stable, but slower to adapt
  • Our default: 1 episode (works well with multiple training steps per episode)

Network Architecture Configuration

Configurable Design

Our system allows flexible network configuration via JSON:

{
  "network": {
    "architecture": "dueling",
    "shared_layers": [256, 128],
    "value_layers": [128],
    "advantage_layers": [128],
    "activation": "relu",
    "dropout_rate": 0.0,
    "batch_norm": false
  }
}
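
A short sketch of how such a config might be loaded and handed to the network; the file path and the way the dict is passed around are assumptions for illustration, not necessarily how the repo does it:

import json

# Hypothetical path; point this at wherever your config actually lives
with open("configs/dqn_config.json") as f:
    config = json.load(f)

network_config = config["network"]
print(network_config["shared_layers"])   # [256, 128]

# DQNNetwork is the class defined earlier in this post
net = DQNNetwork(state_shape=(5, 27), n_actions=8, config=network_config)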

Layer Sizes

Shared Layers: [256, 128]

  • First layer: 256 units (large capacity for complex patterns)
  • Second layer: 128 units (dimensionality reduction)

Value/Advantage Streams: [128]

  • Single hidden layer with 128 units each
  • Keeps streams symmetric and manageable

Total Parameters:

For state_shape=(5, 27) and n_actions=8:

Input: 5 × 27 = 135 features

Shared:
  Layer 1: 135 × 256 + 256 bias = 34,816
  Layer 2: 256 × 128 + 128 bias = 32,896

Value Stream:
  Hidden: 128 × 128 + 128 bias = 16,512
  Output: 128 × 1 + 1 bias = 129

Advantage Stream:
  Hidden: 128 × 128 + 128 bias = 16,512
  Output: 128 × 8 + 8 bias = 1,032

Total: ~102K parameters
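
The tally can be double-checked against the class defined earlier by building it once with a dummy batch and asking Keras for the parameter count:

import tensorflow as tf

net = DQNNetwork(state_shape=(5, 27), n_actions=8, config={})
net(tf.zeros((1, 5, 27)))    # one forward pass so every layer gets built

print(net.count_params())    # 101,897, i.e. ~102K, matching the tally above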

Activation Functions

ReLU (Rectified Linear Unit):

f(x) = max(0, x)

Pros:

  • Fast to compute
  • No vanishing gradient
  • Sparse activation (some neurons off)

Cons:

  • “Dying ReLU” problem (neurons stuck at 0)

Alternatives:

  • ELU (Exponential Linear Unit): Smoother, no dying ReLU
  • tanh: Outputs in [-1, 1], useful for normalized data

Our default: ReLU (standard, reliable)

Regularization

Dropout (optional):

"dropout_rate": 0.2

Randomly drops 20% of neurons during training. Prevents overfitting.

Batch Normalization (optional):

"batch_norm": true

Normalizes layer inputs. Can speed up training but adds complexity.

Our default: no dropout or batch norm (not needed at our data and model scale, and skipping them keeps things simple)

Hyperparameters Summary

Hyperparameter        Value       Purpose
--------------        -----       -------
Learning Rate         0.001       Step size for gradient descent
Gamma (γ)             0.99        Discount factor (long-term focus)
Epsilon Start         1.0         Initial exploration rate
Epsilon End           0.01        Minimum exploration rate
Epsilon Decay         0.995       Decay per episode
Replay Buffer Size    10,000      Experiences stored
Batch Size            64          Mini-batch size for training
Target Update Freq    1 episode   How often to sync target network
Optimizer             Adam        Adaptive learning rate optimizer

Epsilon-Greedy Exploration

Balances exploration vs. exploitation:

if random() < epsilon:
    action = random_valid_action()  # Explore
else:
    action = argmax Q(s, a)          # Exploit
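
Below is a hedged sketch of how this can be implemented with action masking; the agent's actual get_action in the repo may differ, and valid_actions is assumed to be a boolean NumPy mask over the 8 actions:

import numpy as np

def select_action(q_network, state, valid_actions, epsilon):
    """Epsilon-greedy selection restricted to valid actions.

    valid_actions: boolean mask of shape (n_actions,), True where the action is allowed.
    """
    valid_indices = np.flatnonzero(valid_actions)

    if np.random.rand() < epsilon:
        return int(np.random.choice(valid_indices))           # Explore among valid actions

    q_values = q_network(state[np.newaxis, ...])[0].numpy()   # Q(s, ·) from the network
    q_values[~valid_actions] = -np.inf                        # mask out invalid actions
    return int(np.argmax(q_values))                           # Exploit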

Epsilon Schedule:

epsilon = max(epsilon_end, epsilon * epsilon_decay)

Example:

Episode 1:   ε = 1.0   (100% random)
Episode 10:  ε = 0.95  (95% random)
Episode 50:  ε = 0.78  (78% random)
Episode 100: ε = 0.60  (60% random)
Episode 500: ε = 0.08  (8% random)
Episode 1000: ε = 0.01 (1% random)

Early episodes explore broadly; later episodes exploit learned policy.
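
The schedule can be reproduced in a few lines; exact values differ from the table above by a hundredth or so depending on whether the decay is applied before or after an episode:

epsilon, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995

for episode in range(1, 1001):
    if episode in (1, 10, 50, 100, 500, 1000):
        print(f"Episode {episode:4d}: epsilon = {epsilon:.2f}")
    epsilon = max(epsilon_end, epsilon * epsilon_decay)   # decay once per episode, floored at 0.01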

Putting It All Together

Complete Training Algorithm

1. Initialize:
   - Behavior network Q
   - Target network Q' (copy of Q)
   - Replay buffer D
   - Epsilon = 1.0

2. For each episode:
   a. Reset environment → state s
   b. For each step:
      i.   Select action a using epsilon-greedy with action masking
      ii.  Execute a → reward r, next state s', done
      iii. Store (s, a, r, s', done) in D
      iv.  If |D| >= batch_size:
           - Sample mini-batch from D
           - Compute Double DQN targets using Q'
           - Train Q on mini-batch
      v.   s ← s'
      vi.  If done, break

   c. Decay epsilon: ε ← ε × 0.995
   d. Update target: Q' ← Q (every N episodes)

3. Save best model based on validation performance

Key Innovations Recap

  1. Double DQN: Decouples action selection from evaluation → reduces overestimation
  2. Dueling Architecture: Separates V(s) and A(s,a) → faster learning
  3. Experience Replay: Random mini-batches → stable training
  4. Target Network: Frozen targets → convergence
  5. Action Masking: Invalid actions filtered → efficiency

What’s Next?

In Part V (Final), we’ll cover:

  • Software Architecture: Modular design, configuration system
  • Training Pipeline: Dry run, full training, monitoring
  • Results Analysis: Out-of-sample validation across 5 market periods
  • Strategy Comparison: Baseline (20% SL/TP) vs. Aggressive (10%) vs. No Guardrails
  • Performance Metrics: Total return, Sharpe ratio, max drawdown, win rate
  • Key Insights: What worked, what didn’t, lessons learned

The DQN brain is complete—now let’s see how it performs in the real market!


← Previous: Part III: Learning Environment Design

Next Post → Part V: Software Architecture & Results


References