TLDR: This research paper introduces a Reinforcement Learning (RL) agent for market making that operates within a sophisticated simulator designed to replicate the complex, non-stationary dynamics of real-world limit order books. By explicitly modeling stylized market facts like clustered order arrivals, fluctuating spreads, and stochastic volatility, the PPO-based RL agent learns adaptive quoting strategies. In comparisons against an Avellaneda-Stoikov benchmark and a long-only strategy, the RL agent delivers higher returns and better risk-adjusted performance, even under adverse market conditions, highlighting the value of realistic simulation environments for training AI in finance.
Market making, the continuous quoting of bid and ask prices to profit from the spread, is a cornerstone of financial market stability. It ensures liquidity, narrows bid-ask spreads, and reduces volatility, especially during uncertain times. However, with the rise of electronic trading, this task has become increasingly complex, requiring automated systems to navigate challenges like slippage, market impact, and constantly changing market conditions.
Reinforcement Learning (RL) has emerged as a powerful paradigm for developing adaptive and data-driven strategies in this domain. RL agents learn to optimize their decision-making policies by interacting with the market environment, much like a human learns through trial and error, but at an accelerated pace and scale.
A New Approach to Market Making with RL
A recent research paper, titled "Reinforcement Learning-Based Market Making as a Stochastic Control on Non-Stationary Limit Order Book Dynamics", explores the integration of a reinforcement learning agent into a market-making context. Authored by Rafael Zimmer and Oswaldo L. V. Costa from the University of São Paulo, this paper introduces a novel approach that explicitly models the underlying market dynamics to capture the observed ‘stylized facts’ of real markets.
These stylized facts include clustered order arrival times (orders often come in bursts), non-stationary spreads (the difference between bid and ask prices isn’t constant), fluctuating return drifts, stochastic order quantities, and dynamic price volatility. By incorporating these realistic mechanisms, the researchers aim to enhance the stability and adaptability of the RL agent, embedding domain-specific knowledge directly into its learning process.
The Simulator: A Realistic Training Ground
One of the key contributions of this work is the development of a simulator-based environment. Traditional methods often rely on replaying historical data, which can be computationally expensive, requires vast amounts of data, and fails to account for the agent's own market impact or inventory risk. Agent-based simulations, while more realistic, offer limited control over market dynamics and adapt poorly to unseen market regimes.
The proposed simulator, however, leverages parameterizable stochastic processes to model the Limit Order Book (LOB) environment. This includes a Hawkes process for clustered order arrivals, Geometric Brownian Motion for bid and ask prices, an Ornstein-Uhlenbeck process for price drift, a Cox-Ingersoll-Ross process for spread dynamics, and a GARCH(1,1) process for price volatility. Order quantities are modeled as Poisson random variables. This comprehensive model creates a computationally efficient and realistic training ground for RL agents.
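To make these building blocks concrete, here is a minimal Python sketch of how such parameterizable processes could be stepped and composed into a single simulator tick. All function names, update rules, and parameter values are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 1.0  # one simulator tick (time units are illustrative)

def cir_step(x, kappa, theta, sigma):
    # Cox-Ingersoll-Ross: mean-reverting and non-negative (full truncation)
    x = max(x, 0.0)
    return x + kappa * (theta - x) * dt + sigma * np.sqrt(x * dt) * rng.normal()

def ou_step(x, kappa, mu, sigma):
    # Ornstein-Uhlenbeck: mean-reverting drift term
    return x + kappa * (mu - x) * dt + sigma * np.sqrt(dt) * rng.normal()

def garch_step(var, eps, omega=1e-8, alpha=0.05, beta=0.90):
    # GARCH(1,1): next variance from the last shock and the last variance
    return omega + alpha * eps**2 + beta * var

def hawkes_step(lam, n_events, mu=1.0, alpha=0.3, beta=1.2):
    # Exponential-kernel Hawkes intensity: decays toward the baseline mu
    # and jumps by alpha per arrival, producing clustered order flow.
    return mu + (lam - mu) * np.exp(-beta * dt) + alpha * n_events

# One simulator tick under illustrative initial values.
mid, drift, spread, var, lam = 100.0, 0.0, 0.05, 1e-6, 1.0

eps = np.sqrt(var) * rng.normal()              # price shock
mid *= np.exp((drift - 0.5 * var) * dt + eps)  # GBM-style price update
drift = ou_step(drift, kappa=0.5, mu=0.0, sigma=1e-4)
spread = cir_step(spread, kappa=1.0, theta=0.05, sigma=0.02)
var = garch_step(var, eps)
n_orders = rng.poisson(lam * dt)               # arrivals at current intensity
lam = hawkes_step(lam, n_orders)
qty = 1 + rng.poisson(5.0)                     # stochastic order quantity
bid, ask = mid - spread / 2, mid + spread / 2  # quotes around the mid-price
```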
How the RL Agent Learns
The market-making problem is framed as a Markov Decision Process (MDP), where the agent observes the market state, takes an action (setting bid and ask spreads and quantities), and receives a reward based on its profit and inventory risk. The agent’s goal is to maximize cumulative rewards over time.
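In code, a reward of this shape is only a few lines. The quadratic inventory penalty and its coefficient below are assumptions for illustration; the paper's exact reward function may differ.

```python
def step_reward(pnl: float, inventory: float, inv_penalty: float = 0.01) -> float:
    """Per-step reward: trading profit minus a quadratic inventory penalty
    that discourages carrying directional risk. The quadratic form and the
    0.01 coefficient are illustrative assumptions, not the paper's values."""
    return pnl - inv_penalty * inventory ** 2
```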
The state space observed by the agent is rich, including indicators like the Relative Strength Index (RSI), Order Imbalance (OI), Micro Price, the agent’s current inventory, moving averages of price returns, and detailed information about multiple levels of the LOB. The action space allows the agent to dynamically choose bid and ask spreads and the corresponding order quantities.
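The first three of these features have standard definitions that can be computed directly from the book; a short sketch follows (the paper's exact windows and depth levels may differ):

```python
import numpy as np

def micro_price(bid, ask, bid_qty, ask_qty):
    # Size-weighted mid-price: leans toward the side with less resting
    # quantity, a common short-horizon fair-value estimate.
    return (bid * ask_qty + ask * bid_qty) / (bid_qty + ask_qty)

def order_imbalance(bid_qty, ask_qty):
    # In [-1, 1]: positive when buy-side depth dominates.
    return (bid_qty - ask_qty) / (bid_qty + ask_qty)

def rsi(prices, period=14):
    # Relative Strength Index over the trailing window.
    deltas = np.diff(np.asarray(prices)[-(period + 1):])
    gains = deltas[deltas > 0].sum()
    losses = -deltas[deltas < 0].sum()
    if losses == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gains / losses)
```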
For the learning algorithm, the researchers implemented a market-making agent based on Proximal Policy Optimization (PPO). PPO is a state-of-the-art RL algorithm known for its stability and performance. The agent uses an Actor-Critic architecture, where the ‘Actor’ learns the optimal policy (what actions to take) and the ‘Critic’ evaluates the value of those actions. The neural network architecture for the Actor incorporates self-attention layers to effectively capture the spatial dependencies within the LOB data.
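As a rough illustration, an attention-based actor head might look like the PyTorch sketch below; the layer sizes, mean-pooling, and Gaussian action parameterization are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionActor(nn.Module):
    """Illustrative PPO actor: self-attention over LOB levels, then a
    Gaussian policy over (bid spread, ask spread, bid qty, ask qty).
    Dimensions and the action head are assumptions, not the paper's."""
    def __init__(self, n_levels=10, level_dim=4, d_model=32, n_actions=4):
        super().__init__()
        self.embed = nn.Linear(level_dim, d_model)      # per-level embedding
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.head = nn.Linear(d_model, n_actions)       # policy mean
        self.log_std = nn.Parameter(torch.zeros(n_actions))

    def forward(self, lob):                # lob: (batch, n_levels, level_dim)
        x = self.embed(lob)
        x, _ = self.attn(x, x, x)          # attend across price levels
        x = x.mean(dim=1)                  # pool over levels
        return torch.distributions.Normal(self.head(x), self.log_std.exp())
```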
Performance Under Pressure
The RL agent was trained for 10,000 episodes in the simulator, which was configured to mimic adverse market conditions. The results were then compared against a closed-form optimal solution (the Avellaneda-Stoikov market-making strategy, which operates under a simplified market model) and a simple long-only strategy.
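For context, the Avellaneda-Stoikov strategy admits a closed-form quote rule built from an inventory-skewed reservation price and an optimal spread; a minimal sketch, with all parameter values illustrative:

```python
import math

def avellaneda_stoikov_quotes(mid, inventory, t, T,
                              gamma=0.1, sigma=0.02, k=1.5):
    """Closed-form Avellaneda-Stoikov (2008) quotes. gamma (risk aversion),
    sigma (volatility), and k (fill-intensity decay) are illustrative values."""
    tau = T - t
    # Reservation price: mid-price skewed against the current inventory.
    reservation = mid - inventory * gamma * sigma**2 * tau
    # Optimal total spread, split symmetrically around the reservation price.
    spread = gamma * sigma**2 * tau + (2.0 / gamma) * math.log(1.0 + gamma / k)
    return reservation - spread / 2, reservation + spread / 2  # (bid, ask)
```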
The RL agent demonstrated a mean financial return of 5.203 × 10⁻⁵ (approximately +1.31% annualized), outperforming the benchmark agent (3.038 × 10⁻⁵, or +0.76% annualized) and significantly surpassing the long-only strategy (−2.207 × 10⁻⁵, or −0.56% annualized). Crucially, the RL agent also achieved the highest Sortino ratio (0.7497), indicating better risk-adjusted returns than the Avellaneda-Stoikov benchmark (0.4271) and the long-only strategy (−0.0079).
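Both figures are easy to reproduce from a series of per-period returns; the sketch below assumes 252 trading periods per year, which matches the annualized numbers quoted above.

```python
import numpy as np

def sortino(returns, target=0.0):
    # Mean excess return divided by downside deviation: only returns
    # below the target contribute to the risk term.
    excess = np.asarray(returns) - target
    downside = np.sqrt(np.mean(np.minimum(excess, 0.0) ** 2))
    return excess.mean() / downside if downside > 0 else float("inf")

def annualize(mean_period_return, periods_per_year=252):
    # Simple scaling; 252 periods/year is an assumption that matches
    # the reported figures, e.g. 5.203e-5 * 252 ≈ +1.31%.
    return mean_period_return * periods_per_year
```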
These findings suggest that the reinforcement learning agent can effectively operate under non-stationary market conditions and adapt to changing market dynamics. The simulator proved to be a valuable tool for training and pre-training RL agents in complex market-making scenarios, offering a more realistic environment than those based solely on historical data or simplified generative models.
Looking Ahead
This research confirms that stochastic dynamic environments can effectively simulate market conditions with varying regimes, and that RL agents can learn to adapt to these complexities. Future work may involve developing hybrid world models that combine both model-based and model-free approaches, further enhancing the adaptability of RL agents to real-world market observations and dynamic conditions.


