Deep Q-Network for Stock Trading (Part I): Problem Statement & RL Motivation

8 minute read

This post is part of a series on building a Deep Q-Network (DQN) based trading system for SPY (S&P 500 ETF).


Next Post → Part II: Data Engineering Pipeline


⚠️ Disclaimer

This blog series is for educational and research purposes only. The content should not be considered financial advice, investment advice, or trading advice. Trading stocks and financial instruments involves substantial risk of loss and is not suitable for every investor. Past performance does not guarantee future results. Always consult with a qualified financial advisor before making investment decisions.


Introduction

After completing the AI Trading Strategies Nanodegree Reinforcement Learning course, I wanted to apply reinforcement learning (RL) to a real-world trading problem. While many algorithmic trading approaches rely on supervised learning to predict price direction, I was intrigued by the potential of RL to learn trading policies directly—deciding when to buy, sell, or hold based on maximizing long-term profitability.

This blog series documents my journey building a Deep Q-Network (DQN) trading system for SPY, the most liquid ETF tracking the S&P 500 index. The goal is not just to build a profitable model, but to create a production-ready system with proper data engineering, backtesting, and risk management.

The complete code for this project is available on GitHub.

Problem Statement

The Trading Challenge

Objective: Build an intelligent agent that learns to trade SPY shares to maximize total returns while managing risk.

Unlike traditional approaches that predict whether prices will go up or down, we want our agent to learn:

  • When to enter a position (and with how many shares)
  • When to exit partially or completely
  • When to stay in cash and wait

The agent should adapt to different market conditions:

  • Trending markets: Ride the momentum, hold winners
  • Mean-reverting markets: Buy dips, sell rallies
  • High-volatility periods: Manage risk, avoid large drawdowns

Why SPY?

I chose SPY (SPDR S&P 500 ETF) for several reasons:

  1. High Liquidity: SPY is the most traded ETF globally with tight bid-ask spreads, making it ideal for algorithmic trading
  2. Lower Complexity: Unlike individual stocks, SPY represents a diversified basket of 500 companies, reducing company-specific risk
  3. Historical Data: Decades of reliable historical data available
  4. Real-world Relevance: Many retail and institutional traders focus on SPY for both short-term and long-term strategies

Financial Data Primer

For readers new to financial data, let’s clarify some key concepts:

Price Data:

  • Open: Price at market open
  • High/Low: Highest and lowest prices during the trading day
  • Close: Price at market close (used for most analysis)
  • Volume: Number of shares traded

Technical Indicators:

Technical analysis uses historical price patterns to predict future movements. Common indicators include:

  • Moving Averages (SMA, EMA): Smooth out price action to identify trends
  • Bollinger Bands: Measure volatility using standard deviation bands around moving averages
  • RSI (Relative Strength Index): Momentum indicator measuring overbought/oversold conditions (0-100 scale)
  • ADX (Average Directional Index): Measures trend strength
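
To make two of these concrete, here is a minimal pandas sketch of SMA and RSI (function names and the simple-average RSI variant are illustrative, not the exact implementation used later in the series):

```python
import pandas as pd

def sma(close: pd.Series, window: int) -> pd.Series:
    """Simple moving average: the mean close over the last `window` bars."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index on a 0-100 scale (simple-average variant;
    Wilder's original uses exponentially smoothed averages)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)
```

On a steadily rising series the RSI saturates at 100, which is why readings above 70 are conventionally treated as overbought and readings below 30 as oversold.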

VIX (Volatility Index):

The VIX measures market fear and uncertainty by tracking S&P 500 options prices. High VIX means high expected volatility, often associated with market downturns. Including VIX in our features helps the agent understand broader market sentiment.

Why Reinforcement Learning?

Before diving into RL, let’s understand why traditional supervised learning approaches fall short for trading:

Limitations of Supervised Learning

1. Sequential Decision Making

Supervised learning predicts price direction at a single time step:

  • “Will SPY go up or down tomorrow?”
  • But trading involves sequences of decisions over time
  • Each decision affects future states (cash balance, position size, entry price)

2. Reward is Delayed

  • In trading, a buy decision only shows profit/loss after the eventual sell
  • Supervised models don’t naturally handle this temporal credit assignment problem
  • RL explicitly models delayed rewards through discounting

3. No Action Feedback Loop

  • Supervised learning: State → Prediction
  • Trading reality: State → Action → New State → Reward
  • The agent’s actions change the environment (portfolio state)

4. Static Policy vs. Adaptive Strategy

  • Supervised models learn a fixed mapping: features → signal
  • RL learns a policy that adapts based on current portfolio state
  • Example: The optimal action when holding 100 shares at a loss is different from holding 0 shares

Reinforcement Learning Advantages

RL naturally addresses these challenges:

1. Sequential Decision Process

RL models trading as a Markov Decision Process (MDP):

  • State (S): Market features + portfolio state (balance, shares held, entry price)
  • Action (A): Buy X shares, Sell Y shares, or Hold
  • Reward (R): Profit/loss from trading decisions
  • Transition: (State, Action) → New State
The agent learns a **policy π(a|s)** that maximizes the cumulative discounted reward:

Total Return = Σ γ^t * R_t

where γ (gamma) is the discount factor (0.99 in our case), weighting near-term rewards more heavily than distant future ones.
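
In code, the discounted return is a one-liner (a sketch; `rewards` is the per-step reward sequence R_0, R_1, …):

```python
def discounted_return(rewards, gamma=0.99):
    """Total Return = sum over t of gamma^t * R_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))
```

With gamma = 0.99, a reward 100 steps away is worth about 0.99^100 ≈ 0.37 of the same reward received now.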

2. Exploration vs. Exploitation

RL agents balance:

  • Exploration: Try new strategies (e.g., buying during a crash)
  • Exploitation: Use known profitable strategies (e.g., selling at resistance levels)

This is critical in trading where market dynamics evolve over time.
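
The standard way DQN strikes this balance is epsilon-greedy action selection, sketched minimally below (the helper name is mine):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Epsilon typically starts near 1.0 and decays over training, shifting the agent from exploration toward exploitation.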

3. Risk Management Through Rewards

We can shape agent behavior through reward engineering:

  • Positive rewards for profitable trades
  • Negative rewards (penalties) for:
    • Excessive trading (transaction costs)
    • Holding idle cash too long (opportunity cost)
    • Large drawdowns (risk penalty)
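
A shaped reward along these lines might look like the following sketch (the penalty weights are illustrative placeholders, not values tuned for this project):

```python
def shaped_reward(realized_pnl, traded, holding_cash, drawdown,
                  fee=1.0, idle_penalty=0.1, drawdown_weight=10.0):
    """Profit/loss minus penalties for transaction costs, idle cash,
    and drawdowns. All weights here are illustrative placeholders."""
    reward = realized_pnl
    if traded:
        reward -= fee            # transaction cost per trade
    if holding_cash:
        reward -= idle_penalty   # opportunity cost of sitting in cash
    reward -= drawdown_weight * max(drawdown, 0.0)  # risk penalty
    return reward
```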

4. Direct Policy Learning

Instead of predicting prices (which is notoriously difficult), we directly learn:

Optimal Action = argmax_a Q(s, a)

where Q(s, a) estimates the expected future return of taking action a in state s.

Why Deep Q-Network (DQN)?

Among RL algorithms, I chose DQN for several reasons:

1. Discrete Action Space

DQN works well with discrete actions. For example, with share_increments = [10, 50, 100, 200] and enable_buy_max = false:

  • Action 0: HOLD
  • Action 1: BUY 10 shares
  • Action 2: BUY 50 shares
  • Action 3: BUY 100 shares
  • Action 4: BUY 200 shares
  • Action 5: SELL 10 shares
  • Action 6: SELL 50 shares
  • Action 7: SELL 100 shares
  • Action 8: SELL 200 shares
  • Action 9: SELL ALL

With enable_buy_max = true, we add BUY_MAX action (buy as many shares as balance allows), giving 11 total actions.

A discrete action space like this is more intuitive for trading than the continuous action spaces handled by policy gradient methods.
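
For illustration, the mapping from configuration to action list can be built mechanically (the function name and the exact ordering below are my assumptions, not necessarily the project's):

```python
def build_action_space(share_increments, enable_buy_max=False):
    """Enumerate discrete actions: HOLD, one BUY and one SELL per
    increment, SELL_ALL, and optionally BUY_MAX."""
    actions = ["HOLD"]
    actions += [f"BUY_{n}" for n in share_increments]
    if enable_buy_max:
        actions.append("BUY_MAX")
    actions += [f"SELL_{n}" for n in share_increments]
    actions.append("SELL_ALL")
    return actions
```

With `share_increments = [10, 50, 100, 200]` this yields the 10 actions listed above, or 11 with `enable_buy_max = true`.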

2. Off-Policy Learning

DQN uses experience replay:

  • Store past experiences: (state, action, reward, next_state)
  • Sample random mini-batches during training
  • Breaks correlation between consecutive training samples
  • More data-efficient than on-policy methods
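
A minimal replay buffer is just a bounded queue with random sampling (a sketch, not the project's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state)
    transitions; the oldest experiences are evicted first."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```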

3. Proven Track Record

DQN was famously used by DeepMind to:

  • Play Atari games at superhuman level (2015)
  • Defeat Go champion Lee Sedol via AlphaGo (which built on related deep RL techniques)

If DQN can master complex games, there is reason to hope it can learn useful trading patterns, though markets are far noisier and less stationary than Atari.

4. Stable Training with Target Networks

DQN uses two networks:

  • Online Network: Updated every training step
  • Target Network: Periodically synchronized copy

This stabilizes learning by providing consistent Q-value targets during training. We’ll dive deeper into this in Part IV.
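
Conceptually, synchronization is a (possibly partial) copy of the online weights into the target weights. A toy sketch with weights stored in a dict (real implementations operate on network parameters):

```python
def sync_target(online, target, tau=1.0):
    """Blend online weights into target weights.
    tau=1.0 is a hard copy (classic DQN); tau<1 is a Polyak/soft update."""
    return {name: tau * online[name] + (1 - tau) * target[name]
            for name in online}
```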

Project Scope

This project explores:

1. Multi-Buy Position Accumulation

  • Scale into positions with multiple buys (e.g., buy 10 shares, then 50 more)
  • Track weighted average entry price across lots
  • Learn position sizing dynamically

2. Partial Sells with FIFO Tracking

  • Sell portions of position: 10, 50, or all shares
  • First-In-First-Out (FIFO) lot tracking for accurate profit calculation
  • Exit strategies beyond binary buy/sell
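
FIFO realization can be sketched as consuming the oldest lots first (a simplified model; the real environment also updates cash and position state):

```python
from collections import deque

def fifo_sell(lots, shares_to_sell, sell_price):
    """Realize P&L on a partial sell, consuming the oldest lots first.
    `lots` is a deque of (shares, entry_price) in purchase order."""
    realized, remaining = 0.0, shares_to_sell
    while remaining > 0 and lots:
        shares, entry = lots[0]
        take = min(shares, remaining)
        realized += take * (sell_price - entry)
        remaining -= take
        if take == shares:
            lots.popleft()                    # lot fully consumed
        else:
            lots[0] = (shares - take, entry)  # lot partially consumed
    return realized
```

For example, buying 10 shares at $100 and 50 at $110, then selling 30 at $120, realizes 10×$20 + 20×$10 = $400 and leaves a 30-share lot at $110.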

3. Risk Management Guardrails

  • Stop-loss: Automatically exit when position drops X% from weighted average entry
  • Take-profit: Lock in gains at Y% profit
  • Compare strategies with different risk tolerances (10%, 20%, or no guardrails)
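
The guardrail check reduces to comparing the current price against the weighted average entry. A sketch (function names are mine, and the default thresholds are placeholders, not the project's tuned pairing):

```python
def weighted_avg_entry(lots):
    """Weighted average entry price across open lots [(shares, price), ...]."""
    total_shares = sum(s for s, _ in lots)
    return sum(s * p for s, p in lots) / total_shares

def check_guardrails(avg_entry, price, stop_loss_pct=0.10, take_profit_pct=0.20):
    """Force a full exit when the unrealized return breaches either threshold."""
    unrealized = (price - avg_entry) / avg_entry
    if unrealized <= -stop_loss_pct:
        return "STOP_LOSS"
    if unrealized >= take_profit_pct:
        return "TAKE_PROFIT"
    return None
```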

4. Out-of-Sample Validation

  • Train on 2022-2025 data
  • Validate on 5 different market periods:
    • 2005 (pre-crisis)
    • 2008 (financial crisis)
    • 2012 (recovery)
    • 2013 (bull market)
    • 2021 (post-COVID)

5. Production-Grade System

  • JSON-based configuration for experiments
  • Modular architecture: data → features → environment → agent → results
  • Comprehensive testing and dry-run mode
  • GitHub-ready with documentation

What’s Next?

In Part II, we’ll dive into the data engineering pipeline:

  • Collecting SPY and VIX data from Yahoo Finance
  • Feature engineering with 25 technical indicators
  • Data splitting strategy to avoid lookahead bias
  • Continuous rolling normalization for stationarity

Stay tuned!


Next Post → Part II: Data Engineering Pipeline

