Deep Q-Network for Stock Trading (Part III): Learning Environment Design

18 minute read

This post is part of a series on building a Deep Q-Network (DQN) based trading system for SPY (S&P 500 ETF).


← Previous: Part II: Data Engineering Pipeline

Next Post → Part IV: DQN Architecture Deep Dive


⚠️ Disclaimer

This blog series is for educational and research purposes only. The content should not be considered financial advice, investment advice, or trading advice. Trading stocks and financial instruments involves substantial risk of loss and is not suitable for every investor. Past performance does not guarantee future results. Always consult with a qualified financial advisor before making investment decisions.


Introduction

In Part II, we built the data pipeline with 25 technical indicators. Now comes the exciting part: designing the learning environment where our DQN agent will interact with the market.

A well-designed environment is critical for successful RL. We need to formulate trading as a Markov Decision Process (MDP) with:

  • State Space: What does the agent observe?
  • Action Space: What can the agent do?
  • Reward Function: How do we measure success?
  • Transition Dynamics: How does the environment evolve?

This part covers our sophisticated environment featuring:

  • Multi-buy position accumulation
  • Partial sells with FIFO lot tracking
  • Action masking for safety
  • Risk management guardrails

The complete code is in src/trading/environment.py on GitHub.

State Space: What Does the Agent See?

State Composition

The agent observes a window of recent market data plus its current portfolio state:

State = [Market Features (window_size × n_features) + Portfolio State]

Example with window_size=5, 25 features:

Shape: (5, 27)  # 25 market features + 2 portfolio features

Market Features (past 5 days):
- SPY_Close, SPY_Return, SPY_High_Low_Pct, ...
- BB_Position, BB_Width, ...
- EMA_8, EMA_21, SMA_50, SMA_200, ...
- RSI, ADX, DI_Plus, DI_Minus
- Volume_Ratio, Volume_Trend
- VIX_Close

Portfolio State (current):
- Position Ratio: shares_held / max_affordable_shares  (dynamically normalized)
- Cash Ratio: balance / starting_balance   (0.0 to ~2.0+)

Implementation

def _get_state(self) -> np.ndarray:
    """
    Get current state observation.

    Returns:
        np.ndarray: State array of shape (window_size, n_features + 2)
    """
    # Extract window of market features
    start_idx = self.current_step - self.window_size
    end_idx = self.current_step

    market_state = self.data[self.feature_columns].iloc[start_idx:end_idx].values

    # Add portfolio state to each time step
    # Normalize position by max affordable shares (price-adaptive)
    current_price = self._get_current_price()
    max_affordable_shares = self.starting_balance / current_price if current_price > 0 else 1

    position_ratio = self.shares_held / max_affordable_shares
    balance_ratio = self.balance / self.starting_balance

    portfolio_state = np.array([position_ratio, balance_ratio])

    # Broadcast portfolio state across time window
    portfolio_broadcast = np.tile(portfolio_state, (self.window_size, 1))

    # Concatenate: (window_size, n_market_features + 2)
    state = np.concatenate([market_state, portfolio_broadcast], axis=1)

    return state.astype(np.float32)

Why This Design?

1. Temporal Context (window_size)

Using a lookback window (e.g., 5 days) allows the agent to:

  • Detect short-term trends (price rising/falling over past few days)
  • Identify patterns (e.g., Bollinger Band squeeze followed by breakout)
  • Learn temporal dependencies

2. Portfolio Awareness

The agent needs to know its current state:

  • Position Ratio: “Am I fully invested, partially invested, or in cash?”
    • 0.0: No position (can only buy)
    • 0.5: Half-invested (can buy more or start selling)
    • ~1.0: Fully invested (used all starting capital at current price)
  • Cash Ratio: “How much buying power do I have?” (see the worked example below)
    • < 1.0: Cash has been deployed into positions and/or realized losses have been taken
    • = 1.0: Fully in cash at break-even
    • > 1.0: Realized gains have grown the cash balance (profits available for reinvestment)
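
To make the two ratios concrete, here is a small worked example with hypothetical numbers, mirroring the arithmetic in _get_state above:

# Hypothetical snapshot: $100,000 starting balance, SPY trading at $500
starting_balance = 100_000.0
current_price = 500.0
balance = 40_000.0           # cash remaining after deploying $60,000 into shares
shares_held = 120

max_affordable_shares = starting_balance / current_price   # 200 shares
position_ratio = shares_held / max_affordable_shares        # 120 / 200 = 0.6
cash_ratio = balance / starting_balance                     # 40,000 / 100,000 = 0.4

print(position_ratio, cash_ratio)  # 0.6 0.4 -> 60% invested, 40% of starting capital in cash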

3. Normalization

All features are normalized (see Part II), ensuring:

  • Stable neural network training
  • Features contribute equally regardless of scale
  • Generalization across different price regimes

Action Space: Multi-Buy and Partial Sells

The Traditional Approach (Naive)

Simple implementations often use:

Action 0: HOLD
Action 1: BUY (fixed quantity)
Action 2: SELL (sell entire position)

Limitations:

  • Can’t scale into positions gradually
  • Can’t take partial profits
  • All-or-nothing exits (no risk management flexibility)

Our Sophisticated Approach

We use share_increments and enable_buy_max to define flexible position sizing:

# Configuration
share_increments = [10, 50, 100, 200]
enable_buy_max = True  # Enable BUY_MAX action

# Resulting Action Space (11 actions with BUY_MAX enabled):
Action 0: HOLD
Action 1: BUY 10 shares
Action 2: BUY 50 shares
Action 3: BUY 100 shares
Action 4: BUY 200 shares
Action 5: BUY_MAX (buy as many shares as balance allows)
Action 6: SELL 10 shares
Action 7: SELL 50 shares
Action 8: SELL 100 shares
Action 9: SELL 200 shares
Action 10: SELL ALL (regardless of shares held)

Formula (with BUY_MAX enabled):

n_actions = 1 + N + 1 + N + 1
where N = len(share_increments)

= 1 (HOLD) + N (BUY actions) + 1 (BUY_MAX) + N (SELL actions) + 1 (SELL_ALL)

Example: [10, 50, 100, 200] → 1 + 4 + 1 + 4 + 1 = 11 actions

Formula (with BUY_MAX disabled):

n_actions = 1 + N + N + 1
= 1 (HOLD) + N (BUY actions) + N (SELL actions) + 1 (SELL_ALL)

Example: [10, 50, 100, 200] → 1 + 4 + 4 + 1 = 10 actions

Why BUY_MAX?

The BUY_MAX action enables price-adaptive position sizing (a sketch of its execution follows this list):

  • Dynamic allocation: Buys int(balance / current_price) shares
  • No arbitrary limits: Agent can go “all-in” when confident
  • Price-adaptive: At $400 vs $500, BUY_MAX buys different quantities
  • Symmetric to SELL_ALL: Intuitive action pairing
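
As a rough illustration, here is a minimal sketch of how BUY_MAX could be executed inside the environment. The helper name _execute_buy_max is an assumption for illustration (the repository handles execution inside _execute_action), and _add_lot is shown later in this post:

def _execute_buy_max(self, current_price: float) -> int:
    """Sketch: spend as much of the cash balance as possible at the current price."""
    if current_price <= 0:
        return 0

    shares_to_buy = int(self.balance // current_price)  # price-adaptive sizing
    if shares_to_buy == 0:
        return 0  # cannot afford a single share (the action mask should prevent this)

    self.balance -= shares_to_buy * current_price
    self._add_lot(shares_to_buy, current_price)  # lot tracking is covered later in this post
    return shares_to_buy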

ActionMasker Implementation

The ActionMasker class handles action space creation and validation:

from typing import List

import numpy as np


class ActionMasker:
    """Manages action space with share increments and BUY_MAX."""

    def __init__(self, share_increments: List[int], enable_buy_max: bool = True):
        """
        Args:
            share_increments: List of share quantities (e.g., [10, 50, 100, 200])
            enable_buy_max: Whether to enable BUY_MAX action (default: True)
        """
        self.share_increments = sorted(share_increments)
        self.enable_buy_max = enable_buy_max

        n_increments = len(share_increments)

        # Calculate action space size
        if enable_buy_max:
            self.n_actions = 1 + n_increments + 1 + n_increments + 1
        else:
            self.n_actions = 1 + n_increments + n_increments + 1

        # Define action indices
        self.HOLD = 0
        self.BUY_ACTIONS = list(range(1, 1 + n_increments))

        if enable_buy_max:
            self.BUY_MAX = 1 + n_increments
            self.SELL_ACTIONS = list(range(self.BUY_MAX + 1, self.BUY_MAX + 1 + n_increments))
            self.SELL_ALL = self.BUY_MAX + 1 + n_increments
        else:
            self.BUY_MAX = None
            self.SELL_ACTIONS = list(range(1 + n_increments, 1 + 2*n_increments))
            self.SELL_ALL = 1 + 2*n_increments

    def action_to_shares(self, action: int) -> int:
        """Convert action to number of shares."""
        if action == self.HOLD:
            return 0
        elif action in self.BUY_ACTIONS:
            idx = action - 1
            return self.share_increments[idx]
        elif action in self.SELL_ACTIONS:
            # Offset by the first SELL index so this works with or without BUY_MAX
            idx = action - self.SELL_ACTIONS[0]
            return -self.share_increments[idx]  # Negative for sell
        else:  # SELL_ALL
            return 0  # Handled separately

    def get_action_mask(
        self,
        current_position: int,
        current_balance: float,
        current_price: float
    ) -> np.ndarray:
        """
        Get mask of valid actions (1 = valid, 0 = invalid).

        Args:
            current_position: Shares currently held
            current_balance: Cash balance
            current_price: Current stock price

        Returns:
            Boolean array of shape (n_actions,)
        """
        mask = np.zeros(self.n_actions, dtype=bool)

        # HOLD always valid
        mask[self.HOLD] = True

        # BUY actions valid if have sufficient balance
        for action in self.BUY_ACTIONS:
            shares = self.action_to_shares(action)
            cost = shares * current_price
            if current_balance >= cost:
                mask[action] = True

        # BUY_MAX valid if can afford at least 1 share
        if self.enable_buy_max and current_balance >= current_price:
            mask[self.BUY_MAX] = True

        # SELL actions valid if holding enough shares to sell
        for action in self.SELL_ACTIONS:
            shares = abs(self.action_to_shares(action))
            if current_position >= shares:
                mask[action] = True

        # SELL_ALL valid if holding any shares
        if current_position > 0:
            mask[self.SELL_ALL] = True

        return mask
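
A quick usage example of the class above, with a hypothetical scenario of 60 shares held, $5,000 cash, and SPY at $480:

masker = ActionMasker(share_increments=[10, 50, 100, 200], enable_buy_max=True)
print(masker.n_actions)       # 11
print(masker.BUY_ACTIONS)     # [1, 2, 3, 4]
print(masker.BUY_MAX)         # 5
print(masker.SELL_ACTIONS)    # [6, 7, 8, 9]
print(masker.SELL_ALL)        # 10

mask = masker.get_action_mask(current_position=60, current_balance=5_000.0, current_price=480.0)
# BUY 10 costs $4,800 (affordable); BUY 50/100/200 are not; BUY_MAX is valid (>= 1 share affordable).
# SELL 10 and SELL 50 are valid with 60 shares held; SELL 100/200 are not; SELL_ALL is valid.
print(mask.astype(int))       # [1 1 0 0 0 1 1 1 0 0 1]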

Action Masking: Preventing Invalid Moves

Why Action Masking?

Without masking, the agent might:

  • Attempt to sell 50 shares when holding only 10
  • Try to buy when balance is insufficient
  • Execute BUY_MAX with zero balance

Why Necessary During Training?

During training, the agent uses epsilon-greedy exploration, taking random actions with probability ε (e.g., 20%). Without action masking:

  1. Wasted Exploration: Random exploration would frequently attempt invalid actions, providing no learning signal
  2. Environment Errors: Invalid actions could crash the environment or return meaningless rewards
  3. Slower Convergence: The agent would waste valuable episodes learning basic constraints instead of learning the actual trading task

Action masking constrains the search space to valid actions, allowing the agent to focus exploration on the interesting part of the problem.

Implementation During Training:

# Get valid action mask
action_mask = env.get_action_mask()

# Agent outputs Q-values for all actions
q_values = agent.predict(state)

# Mask invalid actions with large negative value
masked_q_values = np.where(action_mask, q_values, -np.inf)

# Select best valid action
action = np.argmax(masked_q_values)

Multi-Buy Position Accumulation

The Trading Scenario

Imagine this sequence:

  1. Day 1: Buy 10 shares @ $450 (entry price = $450)
  2. Day 5: Buy 50 shares @ $460 (market went up)
  3. Day 10: Buy 10 shares @ $455 (small dip)

Question: What’s the agent’s average entry price?

Answer: Weighted average!

Total Cost = (10 × $450) + (50 × $460) + (10 × $455)
           = $4,500 + $23,000 + $4,550
           = $32,050

Total Shares = 10 + 50 + 10 = 70

Weighted Avg Entry = $32,050 / 70 = $457.86

Lot Tracking Implementation

def _add_lot(self, shares: int, price: float):
    """
    Add a new lot to position.

    Args:
        shares: Number of shares bought
        price: Purchase price
    """
    self.lots.append({
        'shares': shares,
        'entry_price': price
    })

    self.shares_held += shares
    self.entry_price = self._calculate_weighted_avg_entry()

def _calculate_weighted_avg_entry(self) -> Optional[float]:
    """Calculate weighted average entry price from all lots."""
    if not self.lots or self.shares_held == 0:
        return None

    total_cost = sum(lot['shares'] * lot['entry_price'] for lot in self.lots)
    return total_cost / self.shares_held

After 3 buys, lots look like:

self.lots = [
    {'shares': 10, 'entry_price': 450.0},
    {'shares': 50, 'entry_price': 460.0},
    {'shares': 10, 'entry_price': 455.0}
]
self.shares_held = 70
self.entry_price = 457.86  # Weighted average

Partial Sells with FIFO Tracking

Why FIFO (First-In-First-Out)?

When selling portions of a position, we need to track which lots are sold for accurate profit calculation. FIFO matches tax accounting rules and provides consistent profit attribution.

FIFO Example

Position:

Lot 1: 10 shares @ $450
Lot 2: 50 shares @ $460
Lot 3: 10 shares @ $455

Agent decides: SELL 55 shares @ $470

FIFO Logic:

  1. Sell entire Lot 1: 10 shares @ $450 → Profit = 10 × ($470 - $450) = $200
  2. Sell 45 shares from Lot 2: 45 shares @ $460 → Profit = 45 × ($470 - $460) = $450
  3. Remaining Lot 2: 5 shares @ $460 (stays in position)
  4. Lot 3 unchanged: 10 shares @ $455

Total Profit: $200 + $450 = $650

Remaining Position:

Lot 2 (partial): 5 shares @ $460
Lot 3: 10 shares @ $455
New weighted avg entry = (5×460 + 10×455) / 15 = $456.67

Implementation

def _remove_shares_fifo(
    self,
    shares_to_sell: int,
    sell_price: float
) -> Tuple[float, float]:
    """
    Remove shares using FIFO and calculate realized profit.

    Args:
        shares_to_sell: Number of shares to sell
        sell_price: Current market price

    Returns:
        Tuple of (realized_profit, avg_entry_sold)
    """
    remaining_to_sell = shares_to_sell
    total_cost_basis = 0
    shares_sold = 0

    while remaining_to_sell > 0 and self.lots:
        lot = self.lots[0]  # Always take from first lot (FIFO)

        if lot['shares'] <= remaining_to_sell:
            # Sell entire lot
            shares_from_lot = lot['shares']
            total_cost_basis += shares_from_lot * lot['entry_price']
            shares_sold += shares_from_lot
            remaining_to_sell -= shares_from_lot
            self.lots.pop(0)  # Remove depleted lot
        else:
            # Partial lot sale
            shares_from_lot = remaining_to_sell
            total_cost_basis += shares_from_lot * lot['entry_price']
            shares_sold += shares_from_lot
            lot['shares'] -= shares_from_lot  # Reduce lot size
            remaining_to_sell = 0

    # Calculate metrics
    avg_entry_sold = total_cost_basis / shares_sold if shares_sold > 0 else 0
    realized_profit = (sell_price - avg_entry_sold) * shares_sold

    # Update state
    self.shares_held -= shares_sold
    self.entry_price = self._calculate_weighted_avg_entry()

    return realized_profit, avg_entry_sold

SELL_ALL Action

The SELL_ALL action is special—it always sells 100% of the position:

if action == self.action_masker.SELL_ALL:
    shares_to_sell = self.shares_held  # Sell everything

This is useful when:

  • Stop-loss triggers (get out completely)
  • Market conditions change drastically
  • Agent wants to reset to cash
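
For completeness, here is a hedged sketch of a full SELL_ALL execution combined with the FIFO logic above; the helper name _execute_sell_all is illustrative, not the repository's exact API:

def _execute_sell_all(self, current_price: float) -> float:
    """Sketch: liquidate the entire position at the current price."""
    shares_to_sell = self.shares_held
    if shares_to_sell == 0:
        return 0.0  # nothing to sell (the action mask normally prevents this)

    # FIFO removal returns realized profit and the average entry of the shares sold
    realized_profit, avg_entry_sold = self._remove_shares_fifo(shares_to_sell, current_price)

    # Sale proceeds go back into the cash balance; the reward is then computed
    # from realized_profit as in the SELL reward section below
    self.balance += shares_to_sell * current_price
    return realized_profit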

Reward Function: Incentivizing Profitability

Design Principles

A good reward function should:

  1. Incentivize profitable trades (positive reward for gains)
  2. Penalize losses (negative reward for losing trades)
  3. Discourage excessive trading (transaction costs)
  4. Encourage action (small penalty for inactivity)

Our Reward Structure

1. HOLD Action

if action == HOLD:
    reward = idle_reward  # Default: -0.001

Small negative reward encourages the agent to trade when opportunities arise, rather than staying idle.

2. BUY Action

if is_buy_action(action):
    shares_bought = action_to_shares(action)

    # Buy reward component (default: 0.0)
    buy_reward = buy_reward_per_share * shares_bought

    # Transaction cost penalty
    transaction_penalty = -buy_transaction_cost_per_share * shares_bought

    reward = buy_reward + transaction_penalty

Buy Transaction Cost: Represents:

  • Broker commissions
  • Bid-ask spread slippage
  • Market impact

Typical value: $0.01 per share

3. SELL Action (The Main Reward)

if is_sell_action(action):
    # Execute FIFO sell
    realized_profit, avg_entry_sold = _remove_shares_fifo(shares_to_sell, sell_price)

    # Net profit after ALL transaction costs
    buy_cost = buy_transaction_cost_per_share * shares_to_sell
    sell_cost = sell_transaction_cost_per_share * shares_to_sell
    total_transaction_cost = buy_cost + sell_cost

    net_profit = realized_profit - total_transaction_cost

    # Reward using log return (time-additive, symmetric, outlier-resistant)
    position_value_sold = avg_entry_sold * shares_to_sell
    simple_return = net_profit / position_value_sold
    reward = np.log(1 + simple_return) * 100

Why Log Returns Instead of Simple Percentage?

We use logarithmic returns instead of simple percentage returns for several critical reasons:

  1. Time-Additivity (Compound Growth):
    • Simple: 10% + 10% = 20% ❌ (incorrect for compounding)
    • Log: log(1.1) + log(1.1) = log(1.21) ≈ 19.06% ✓ (correct!)
  2. Symmetry of Gains/Losses:
    • Simple: -50% loss requires +100% gain to break even (asymmetric)
    • Log: -69.3 vs +69.3 (more balanced)
  3. Outlier Compression:
    • Reduces extreme reward spikes that destabilize RL training
    • 200% gain isn’t treated as 4x more valuable than 50% gain
  4. Better Statistical Properties:
    • Closer to normal distribution → better neural network gradients
    • Standard practice in quantitative finance

Comparison Table:

Trade Return    Simple %    Log Reward    Difference
+10%            +10.00      +9.53         -4.7%
+50%            +50.00      +40.55        -18.9%
+100%           +100.00     +69.31        -30.7%
-50%            -50.00      -69.31        +38.6%
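
The log-reward column can be reproduced with a few lines of Python; this is simply the SELL reward formula applied to each simple return:

import numpy as np

def log_reward(simple_return: float) -> float:
    """Reward used on SELL: 100 x natural log of (1 + simple return)."""
    return float(np.log(1.0 + simple_return) * 100)

for r in (0.10, 0.50, 1.00, -0.50):
    print(f"{r:+.0%} simple return -> {log_reward(r):+.2f} reward")

# +10% simple return -> +9.53 reward
# +50% simple return -> +40.55 reward
# +100% simple return -> +69.31 reward
# -50% simple return -> -69.31 reward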

Reward Example

Scenario:

  • Buy 50 shares @ $450 (weighted avg entry)
  • Sell 50 shares @ $460
  • Transaction costs: $0.01 per share (buy) + $0.01 (sell)

Calculation:

realized_profit = 50 × ($460 - $450) = $500

total_transaction_cost = 50 × ($0.01 + $0.01) = $1.00

net_profit = $500 - $1.00 = $499

position_value_sold = 50 × $450 = $22,500

simple_return = $499 / $22,500 = 0.0222 (2.22%)

reward = log(1 + 0.0222) × 100 = log(1.0222) × 100 ≈ 2.19

Interpretation: 2.19 log return units (≈2.22% simple return) on capital deployed

Risk Management Guardrails

Stop-Loss and Take-Profit

While we want the agent to learn optimal exit strategies, we also implement guardrails to prevent catastrophic losses and stabilize training.

Why Necessary During Training?

Early in training, the agent’s policy is essentially random. Without stop-loss/take-profit:

  1. Catastrophic Losses: Random exploration could hold losing positions indefinitely, accumulating -50%, -80% unrealized losses
  2. Noisy Learning Signals: Extreme losses create unstable, unhelpful gradients that hinder learning
  3. Reward Space Instability: A few disastrous episodes can dominate the replay buffer, biasing the agent away from taking any positions

With guardrails, even during random exploration:

  • Maximum loss per position is bounded (e.g., -20% stop-loss)
  • Maximum gain is capped (e.g., +20% take-profit, preventing greed during lucky streaks)
  • The agent learns in a stable, bounded reward space
  • Learning signals are cleaner and more actionable

Think of it like training wheels on a bicycle—they prevent catastrophic failures during the learning phase, allowing the agent to focus on learning optimal entry/exit timing rather than basic risk management.

from typing import Optional


class TradingGuardrails:
    """Risk management rules."""

    def __init__(
        self,
        stop_loss_pct: float = 20,
        take_profit_pct: float = 20
    ):
        """
        Args:
            stop_loss_pct: Sell all if position down X% from weighted avg entry
            take_profit_pct: Sell all if position up X% from weighted avg entry
        """
        self.stop_loss_pct = stop_loss_pct
        self.take_profit_pct = take_profit_pct

    def check_guardrails(
        self,
        current_price: float,
        entry_price: Optional[float],
        shares_held: int
    ) -> Optional[str]:
        """
        Check if guardrails triggered.

        Returns:
            'stop_loss', 'take_profit', or None
        """
        if shares_held == 0 or entry_price is None:
            return None

        # Calculate position P&L percentage
        pnl_pct = ((current_price - entry_price) / entry_price) * 100

        if pnl_pct <= -self.stop_loss_pct:
            return 'stop_loss'
        elif pnl_pct >= self.take_profit_pct:
            return 'take_profit'

        return None

Guardrail Trigger Example

Position:

  • Holding 70 shares
  • Weighted avg entry: $457.86
  • Stop-loss: 20%
  • Take-profit: 20%

Trigger Thresholds:

stop_loss_price = $457.86 × (1 - 0.20) = $366.29
take_profit_price = $457.86 × (1.20) = $549.43

If current price drops to $365:

PnL% = ($365 - $457.86) / $457.86 × 100 = -20.3%
→ Stop-loss TRIGGERED → Force SELL_ALL
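
Plugging these numbers into the TradingGuardrails class from above:

guardrails = TradingGuardrails(stop_loss_pct=20, take_profit_pct=20)

# Holding 70 shares at a weighted average entry of $457.86
print(guardrails.check_guardrails(current_price=365.0, entry_price=457.86, shares_held=70))
# 'stop_loss'    (PnL = -20.3%, beyond the -20% threshold)

print(guardrails.check_guardrails(current_price=550.0, entry_price=457.86, shares_held=70))
# 'take_profit'  (PnL = +20.1%, beyond the +20% threshold)

print(guardrails.check_guardrails(current_price=470.0, entry_price=457.86, shares_held=70))
# None           (PnL = +2.7%, within both thresholds)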

Disabling Guardrails

For experiments where we want pure RL control:

{
  "trading": {
    "stop_loss_pct": 100,     // Effectively disabled
    "take_profit_pct": 10000  // Effectively disabled
  }
}

This allows comparing:

  • Baseline: 20% stop-loss/take-profit
  • Aggressive: 10% stop-loss/take-profit
  • Pure DQN: No guardrails (100%/10000%)

The Complete MDP

Putting It All Together

Our trading environment implements this MDP:

State Space:

  • Market features: (window_size, 25 features)
  • Portfolio state: (position_ratio, cash_ratio)

Action Space (10 actions with share_increments=[10, 50, 100, 200], enable_buy_max=false):

{HOLD, BUY_10, BUY_50, BUY_100, BUY_200, SELL_10, SELL_50, SELL_100, SELL_200, SELL_ALL}

Reward Function:

  • HOLD: -0.001 (idle penalty)
  • BUY: -transaction_cost
  • SELL: log(1 + net_profit / position_value) × 100 (log return)

Transition Dynamics:

  • Deterministic price progression (historical data)
  • Stochastic agent behavior (epsilon-greedy exploration)

Termination:

  • Episode ends when reaching end of data

Environment Step

def step(self, action: int) -> Tuple[np.ndarray, float, bool, Dict]:
    """
    Execute one timestep.

    Args:
        action: Action index to execute

    Returns:
        Tuple of (next_state, reward, done, info)
    """
    current_price = self._get_current_price()

    # Check guardrails (stop-loss / take-profit)
    guardrail_trigger = self.guardrails.check_guardrails(
        current_price, self.entry_price, self.shares_held
    )

    if guardrail_trigger:
        # Force exit entire position
        action = self.action_masker.SELL_ALL

    # Execute action and get reward
    reward = self._execute_action(action, current_price)

    # Advance time
    self.current_step += 1

    # Get next state
    next_state = self._get_state()
    done = self.current_step >= len(self.data)

    info = self._get_info()

    return next_state, reward, done, info
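
To tie the pieces together, here is a sketch of one greedy evaluation episode. The step() and get_action_mask() calls match what is shown in this post; env.reset() and the agent object are assumptions about the surrounding training code:

import numpy as np

state = env.reset()         # assumed to return the initial (window_size, n_features + 2) state
done = False
total_reward = 0.0

while not done:
    # Q-values for every action, then mask out the currently invalid ones
    q_values = agent.predict(state)
    action_mask = env.get_action_mask()
    masked_q_values = np.where(action_mask, q_values, -np.inf)

    action = int(np.argmax(masked_q_values))       # greedy over valid actions
    state, reward, done, info = env.step(action)
    total_reward += reward

print(f"Episode finished, cumulative reward: {total_reward:.2f}")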

Key Takeaways

  1. State Design: Combines market features (technical indicators) with portfolio state (position, cash)

  2. Action Space: Flexible position sizing with share_increments and BUY_MAX enables gradual accumulation, full allocation, and partial exits

  3. Action Masking: Prevents wasting training on impossible actions (insufficient balance, invalid sells)

  4. Multi-Buy: Weighted average entry price calculated from multiple buy lots

  5. FIFO Sells: First-In-First-Out lot tracking for accurate profit calculation

  6. Reward Engineering: Log-return rewards, net of transaction costs, incentivize risk-adjusted profitability

  7. Guardrails: Stop-loss and take-profit provide safety net while agent learns

What’s Next?

In Part IV, we’ll dive into the DQN Architecture:

  • Double DQN to prevent overestimation
  • Dueling architecture (value & advantage streams)
  • Experience replay buffer
  • Target network stabilization
  • Network layers and hyperparameters

The environment is ready—now let’s build the brain!


← Previous: Part II: Data Engineering Pipeline

Next Post → Part IV: DQN Architecture Deep Dive

