Comprehensive guide to open-source LLM trading benchmarks, platforms & frameworks • December 2025
This dashboard catalogs the current landscape of open-source projects that benchmark trading agents and LLMs, especially those connected to real market data and trading APIs. The ecosystem ranges from pure simulation benchmarks to live trading platforms with real exchange connectivity.
Platforms that evaluate LLM agents with real-time or near-real-time market data
AI-Trader: open-source arena for LLM agents trading NASDAQ-100, SSE-50 & crypto with live leaderboards
LiveTradeBench: full-stack platform for live evaluation on US equities and Polymarket prediction markets
Agent Market Arena: continuous real-time multi-asset benchmark with standardized execution protocols
Agent Trading Arena: multi-agent virtual market focusing on numerical understanding and price formation
Backtesting frameworks designed to evaluate LLM trading strategies
StockBench: contamination-free LLM benchmark on DJIA with post-2024 data and Apache-2.0 license
FinMem: LLM trading agent with layered memory architecture and character design system
Production-ready platforms with live trading and logging capabilities
ValueCell: community-driven multi-agent platform with live routing to major crypto exchanges
TradeTrap: security & robustness evaluation toolkit for LLM trading agents
TradingAgents: multi-agent LLM framework mimicking a trading firm structure
Reinforcement learning ecosystems useful as baselines and infrastructure
FinRL: the premier open-source framework for financial RL, with 13k+ stars
QuantRL: modern PPO + self-attention RL framework with 30+ evaluation metrics
Backtrader, AutoTrader, livealgos - mature execution and backtesting engines
| Tier | System Type | Best Sharpe Ratio | Best Returns | Examples |
|---|---|---|---|---|
| Tier 1 | Specialized Multi-Agent Architectures | 5.60 – 8.21 | 23% – 62% | TradingAgents, FinMem, HedgeFundAgent |
| Tier 2 | Simple Agent + Strong Model | 2.81 – 6.47 | 40% – 53% | InvestorAgent+GPT-4.1, FinRL Ensemble |
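| Tier 3 | Raw LLM Trading (StockBench) | 0.03 – 0.04 (Sortino) | 1.9% – 2.5% | Kimi-K2, Qwen3-235B, GPT-5 |
⚠️ The gap between Tier 1 and Tier 3 is more than two orders of magnitude in risk-adjusted return (0.04 to 8.21, roughly 200×). The Tier 3 figures match the Sortino ratios in the StockBench table rather than Sharpe ratios, and the tiers come from different benchmarks and assets, so treat the comparison as indicative only.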
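The size of that gap is easy to check. The short sketch below uses only the two endpoint values quoted above; the variable names are illustrative.

```python
import math

tier3_best = 0.04  # upper end of the Tier 3 range in the table
tier1_best = 8.21  # upper end of the Tier 1 range in the table

ratio = tier1_best / tier3_best
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.0f}x, orders of magnitude: {orders_of_magnitude:.2f}")
```

The ratio is roughly 205×, or about 2.3 orders of magnitude.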
| Project | Main Focus | Live vs Backtest | Multi-Model / Leaderboard | License | Best For |
|---|---|---|---|---|---|
| AI-Trader | LLM trading arena for NASDAQ, SSE, crypto | Historical replay + near-live | Yes - public leaderboard | MIT | Plug-in strategies; MCP-based agents |
| LiveTradeBench | Real-time LLM evaluation on stocks + Polymarket | Live + backtest | Yes - 21 models in paper | PolyForm NC | Reference live evaluation stack |
| Agent Market Arena | Lifelong real-time multi-asset benchmark | Live | Yes - multiple agents × LLMs | Paper only | Conceptual template & results |
| StockBench | Contamination-free LLM backtest on DJIA | Backtest | Yes - multi-profile runs | Apache-2.0 | Ready-made offline benchmark |
| FinMem | Memory-enhanced LLM agent framework | Backtest / sim | Single agent (extensible) | MIT | Strong agent baseline |
| ValueCell | Multi-agent LLM trading with live exchanges | Real live trading | No official leaderboard | Apache-2.0 | Real-money LLM workflows |
| TradeTrap | Robustness & security eval for LLM traders | Runs against live agents | Baseline vs attacked runs | Apache-2.0 | Hardening against attacks |
| FinRL / FinRL-Meta | DRL trading ecosystem + benchmark envs | Sim + paper trading | Many RL baselines | MIT | RL baselines & environment library |
| QuantRL | PPO + attention RL with rich metrics | Backtest | RL only | MIT | Metric-rich research pipeline |
| livealgos | Live ML trading "use our money to test" | Backtest + planned live | Not LLM-focused | LGPL-3.0 | Collaborative algo eval infra |
AI-Trader • HKU Data Science • Open-source LLM Trading Arena
An open-source "arena" where multiple LLM agents trade NASDAQ-100, SSE-50, and major cryptocurrencies under identical rules. Features a public dashboard at ai4trade.ai showing live leaderboards and equity curves.
LiveTradeBench • UIUC U-Lab • Real-World Alpha with LLMs
A full-stack platform (Python + FastAPI + frontend) for live evaluation of LLM trading agents. Has an associated technical report: "LiveTradeBench: Seeking Real-World Alpha with Large Language Models".
⚠️ PolyForm Noncommercial 1.0 – Free for research/experiments, but you'd need a commercial license for production use.
| Risk Profile | Characteristics | Models |
|---|---|---|
| Conservative | Lower volatility, smaller drawdowns | Claude-Opus-4.1, Grok-4 |
| Risk-Seeking | Higher volatility, larger drawdowns | Kimi-K2-Instruct, GPT-5 |
Across models, Sharpe ratios on stocks and on Polymarket were nearly uncorrelated, suggesting that success reflects market-specific strategies rather than general trading intelligence.
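A cross-market check of this kind can be sketched as a correlation between each model's Sharpe ratio in the two markets. The sketch assumes a Pearson coefficient (the source does not say which coefficient was used), and the numbers are hypothetical placeholders, not values from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-model Sharpe ratios (placeholders, NOT values from the paper).
stock_sharpe = [1.0, -1.0, 1.0, -1.0]
polymarket_sharpe = [1.0, 1.0, -1.0, -1.0]

print(pearson(stock_sharpe, polymarket_sharpe))
```

A value near zero means that a model's ranking in one market says little about its ranking in the other.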
Agent Market Arena • "When Agents Trade" • Real-Time Multi-Asset Benchmark
Benchmark introduced in "When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents" (Oct 2025): a continuous, real-time, multi-asset arena for evaluating LLM trading agents on TSLA, BMRN, BTC, and ETH.
⚠️ The paper references an online evaluation pipeline, but no GitHub repository has clearly been published yet – think of this as a spec + public leaderboard, not a turnkey repo.
| Agent Framework | LLM | Cumulative Return | Sharpe Ratio | Max Drawdown |
|---|---|---|---|---|
| InvestorAgent | GPT-4.1 | +40.83% | 6.47 | 4.38% |
| InvestorAgent | Claude-sonnet-4 | +28.91% | 3.76 | 8.41% |
| TradeAgent | Gemini-2.0-flash | +21.91% | 3.82 | 6.41% |
| DeepFundAgent | Vote ensemble | +8.61% | 1.39 | 10.14% |
| HedgeFundAgent | All models | -29.15% | -4.39 | 29.15% |
| TradeAgent | GPT-4.1 | -38.72% | -5.38 | 38.72% |
HedgeFundAgent produced identical results across all LLM backends because its hierarchical 16-sub-agent design dampened individual model variance.
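For readers less familiar with the three metric columns above, here is a minimal sketch of how they are commonly computed from an equity curve. It assumes daily data, 252 trading periods per year, and a zero risk-free rate; the paper's exact conventions are not given here, and the equity values are hypothetical.

```python
import math

def cumulative_return(equity):
    """Total return from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a positive fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def sharpe_ratio(equity, periods_per_year=252, risk_free_per_period=0.0):
    """Annualized Sharpe ratio of the per-period returns of an equity curve."""
    returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    if len(returns) < 2:
        raise ValueError("need at least three equity points")
    excess = [r - risk_free_per_period for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    if variance == 0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)

# Hypothetical equity curve (placeholder values, not benchmark data).
equity = [100.0, 110.0, 99.0, 121.0]
print(cumulative_return(equity), max_drawdown(equity), sharpe_ratio(equity))
```

Under this definition, a curve that peaks at its starting value and ends at its low has a max drawdown equal in magnitude to its cumulative loss, which is consistent with the last two rows of the table.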
Agent Trading Arena • Numerical Understanding in LLM-Based Agents
From "Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents". A closed-loop virtual stock market where LLM agents trade against each other and affect prices.
StockBench • LLM-Powered Stock Trading Benchmark Platform
Associated paper: "Can LLM Agents Trade Stocks Profitably in Real-World Markets?" A plug-and-play offline benchmark for LLM trading agents.
✓ Apache-2.0 license – very friendly for commercial reuse. Easy to fork and retarget to NSE/BSE by swapping adapters + data source.
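Retargeting by "swapping adapters" usually means hiding the data source behind a small interface. The sketch below shows that pattern in general terms; `Bar`, `MarketDataAdapter`, `InMemoryAdapter`, and the `EXAMPLE.NS` symbol are all hypothetical and are not taken from the project's codebase.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple

@dataclass(frozen=True)
class Bar:
    symbol: str
    date: str    # ISO date, e.g. "2025-01-02"
    close: float

class MarketDataAdapter(Protocol):
    """Anything that can return daily bars for a symbol."""
    def daily_bars(self, symbol: str) -> List[Bar]: ...

class InMemoryAdapter:
    """Toy adapter backed by a dict; a real one would call a data vendor."""
    def __init__(self, closes: Dict[str, List[Tuple[str, float]]]):
        self._closes = closes

    def daily_bars(self, symbol: str) -> List[Bar]:
        return [Bar(symbol, date, close) for date, close in self._closes[symbol]]

def buy_and_hold_return(adapter: MarketDataAdapter, symbol: str) -> float:
    """Benchmark code depends only on the adapter interface, not the source."""
    bars = adapter.daily_bars(symbol)
    return bars[-1].close / bars[0].close - 1.0

# Placeholder prices for a made-up symbol.
adapter = InMemoryAdapter({"EXAMPLE.NS": [("2025-01-01", 100.0), ("2025-01-02", 105.0)]})
print(buy_and_hold_return(adapter, "EXAMPLE.NS"))
```

With this shape, moving to another market means writing one new adapter class while the evaluation code stays unchanged.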
| Rank | Model | Final Return | Max Drawdown | Sortino Ratio |
|---|---|---|---|---|
| 🥇 1 | Kimi-K2 | +1.9% | -11.8% | 0.0420 |
| 🥈 2 | Qwen3-235B-Instruct | +2.4% | -11.2% | 0.0299 |
| 🥉 3 | GLM-4.5 | +2.3% | -13.7% | 0.0295 |
| 4 | Qwen3-235B-Think | +2.5% | -14.9% | 0.0309 |
| 5 | OpenAI-O3 | +1.9% | -13.2% | 0.0267 |
| 7 | Claude-4-Sonnet | +2.2% | -14.2% | 0.0245 |
| 9 | GPT-5 | +0.3% | -13.1% | 0.0132 |
| 12 | Buy-and-Hold Baseline | +0.4% | -15.2% | 0.0155 |
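The Sortino ratio in the last column penalizes only downside volatility. A minimal sketch follows, assuming a target return of zero, no annualization, and downside deviation averaged over all observations; the benchmark's exact convention may differ, and the returns are hypothetical.

```python
import math

def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation below the target."""
    if not returns:
        raise ValueError("returns must not be empty")
    excess = [r - target for r in returns]
    mean_excess = sum(excess) / len(excess)
    # Only shortfalls below the target contribute to the risk term.
    downside_sq = [min(0.0, e) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0:
        raise ValueError("Sortino ratio is undefined with no downside returns")
    return mean_excess / downside_dev

# Hypothetical per-period returns (placeholders, not benchmark data).
returns = [0.02, -0.01, 0.03, -0.02]
print(sortino_ratio(returns))
```

Unlike the Sharpe ratio, large upside moves do not increase the denominator here, so the two ratios are not interchangeable.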
FinMem • LLM Agent with Layered Memory and Character Design
MIT-licensed reference implementation for "FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design".
Less a "multi-model benchmark", more a strong reference agent whose architecture + evaluation harness you can adapt or pit against your own agents.
| Stock | FinMem Return | FinMem Sharpe | Best Alternative | Alternative Return |
|---|---|---|---|---|
| TSLA | +61.78% | 2.68 | DQN | +33.34% |
| NFLX | +36.45% | 2.02 | Buy & Hold | +35.51% |
| MSFT | +23.26% | 1.44 | DQN | +14.74% |
| COIN | +34.98% | 0.72 | Generative Agents | +3.46% |
| AMZN | +4.89% | 0.23 | A2C | -6.36% |
ValueCell • Community-Driven Multi-Agent Trading Platform
Very active Apache-2.0 project (~7k★), positioned as a community-driven multi-agent platform for financial applications.
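A configure-then-run workflow on a platform like this (pick a model, point at an exchange, write a strategy prompt, then monitor PnL) can be sketched as below. This is not the project's API: `TraderConfig`, its fields, and `pnl` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class TraderConfig:
    model_name: str       # which LLM to call
    exchange: str         # which exchange connector to use
    api_key_env: str      # NAME of the env var holding the key, never the key
    strategy_prompt: str  # natural-language strategy instructions

    def validate(self) -> None:
        for field_name in ("model_name", "exchange", "api_key_env", "strategy_prompt"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be empty")

def pnl(fills, mark_price):
    """Mark-to-market PnL from (signed_quantity, price) fills."""
    position = sum(qty for qty, _ in fills)
    cash = -sum(qty * price for qty, price in fills)
    return cash + position * mark_price

# Placeholder configuration values.
config = TraderConfig(
    model_name="example-model",
    exchange="example-exchange",
    api_key_env="EXAMPLE_EXCHANGE_API_KEY",
    strategy_prompt="Trade only when momentum and sentiment agree.",
)
config.validate()
print(pnl([(2, 100.0), (-1, 110.0)], mark_price=105.0))
```

Storing only the name of the environment variable keeps credentials out of config files and logs.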
"Configure AI model + configure exchange APIs → define strategy prompts → start trader and monitor PnL in the web UI."
TradeTrap • Robustness & Security Evaluation Toolkit
A security/eval toolkit for LLM trading agents, built explicitly around AI-Trader and ValueCell. Evaluates reliability/faithfulness, not just profitability.
✓ Great template for building agents that are not just profitable but robust to adversarial news / tool outputs.
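A baseline-versus-attacked comparison can be as simple as measuring how much return each attack scenario costs. In this sketch the scenario names and numbers are hypothetical and do not reflect the toolkit's own taxonomy or results.

```python
from typing import Dict, Tuple

def worst_case_attack(baseline_return: float,
                      attacked_returns: Dict[str, float]) -> Tuple[str, float]:
    """Return the attack that degrades the final return the most, and by how much."""
    if not attacked_returns:
        raise ValueError("need at least one attacked run")
    degradation = {name: baseline_return - r for name, r in attacked_returns.items()}
    worst = max(degradation, key=degradation.get)
    return worst, degradation[worst]

# Hypothetical final returns (placeholders): one clean run, two attacked runs.
baseline = 0.08
attacked = {"fake_news": 0.02, "tampered_tool_output": -0.05}

name, loss = worst_case_attack(baseline, attacked)
print(name, round(loss, 4))
```

Reporting the worst case rather than the average keeps a single damaging attack from being hidden by several harmless ones.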
TradingAgents • Multi-Agent LLM Trading Firm Framework
A multi-agent LLM framework mimicking a trading firm structure: separate agents for fundamentals, sentiment, technicals, bull/bear researchers, risk team, and trader.
| Stock | TradingAgents Return | Buy & Hold Return | Sharpe Ratio | Max Drawdown |
|---|---|---|---|---|
| AAPL | +26.62% | -5.23% | 8.21 | 0.91% |
| GOOGL | +24.36% | +7.78% | 6.39 | 1.69% |
| AMZN | +23.21% | +17.10% | 5.60 | 2.11% |
TradingAgents outperformed MACD, SMA, and RSI-based technical strategies by 16–32 percentage points on cumulative returns. Its max drawdowns of under 2.5% contrast sharply with 10–15% drawdowns for buy-and-hold strategies.
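The trading-firm structure described in this section (analysts, bull/bear researchers, a risk team, and a trader) can be sketched as a small pipeline. Fixed-score stub functions stand in for LLM calls, and every name, threshold, and score below is a hypothetical illustration rather than the framework's actual design.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Report:
    role: str
    score: float  # -1.0 (strongly bearish) to +1.0 (strongly bullish)

@dataclass
class Decision:
    action: str   # "BUY", "SELL", or "HOLD"
    size: float   # fraction of capital, after the risk cap

Analyst = Callable[[str], Report]

def stub_analyst(role: str, score: float) -> Analyst:
    """Stand-in for an LLM-backed analyst; always returns a fixed score."""
    return lambda symbol: Report(role, score)

def decide(symbol: str, analysts: List[Analyst],
           threshold: float = 0.2, max_size: float = 0.1) -> Decision:
    reports = [analyst(symbol) for analyst in analysts]
    # Researcher step: weigh the bullish case against the bearish case.
    bull = sum(r.score for r in reports if r.score > 0)
    bear = -sum(r.score for r in reports if r.score < 0)
    conviction = (bull - bear) / len(reports)
    # Trader step: act only when conviction clears the threshold.
    if abs(conviction) < threshold:
        return Decision("HOLD", 0.0)
    action = "BUY" if conviction > 0 else "SELL"
    # Risk step: cap the position size regardless of conviction.
    return Decision(action, min(abs(conviction), max_size))

analysts = [stub_analyst("fundamentals", 0.8),
            stub_analyst("sentiment", 0.4),
            stub_analyst("technicals", -0.3)]
print(decide("EXAMPLE", analysts))
```

Keeping the risk cap as a separate final step means no amount of analyst conviction can push a position past the limit.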
FinRL • The Premier Open-Source Financial RL Framework
The first major open-source framework for financial RL, with 13k+ stars. Comprehensive layered architecture: environments, agents, and applications (stock trading, crypto, portfolio allocation, HFT, etc.).
| Strategy | Annual Return | Sharpe Ratio | Max Drawdown |
|---|---|---|---|
| Ensemble (PPO+A2C+DDPG) | +52.61% | 2.81 | -7.09% |
| A2C | +46.65% | 2.24 | -7.59% |
| PPO | +42.57% | 2.36 | -9.04% |
| DJIA Index (Baseline) | +32.84% | 2.02 | -8.93% |
Note: Results from a strongly bullish market period. Cryptocurrency testing showed PPO achieving 103% cumulative return on top-10 market cap tokens over 10 days.
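The table does not say how the ensemble combines its three agents. One plausible rule, sketched here as an assumption, is to score each agent on a validation window and trade the next window with whichever had the highest Sharpe ratio. The returns below are hypothetical placeholders.

```python
import math
from typing import Dict, List

def sharpe(returns: List[float]) -> float:
    """Unannualized Sharpe ratio with a zero risk-free rate."""
    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    if variance == 0:
        return float("-inf")  # treat a flat validation record as unrankable
    return mean / math.sqrt(variance)

def pick_agent(validation_returns: Dict[str, List[float]]) -> str:
    """Choose the agent with the best validation-window Sharpe ratio."""
    return max(validation_returns, key=lambda name: sharpe(validation_returns[name]))

# Hypothetical validation-window returns per agent (placeholders).
validation = {
    "PPO": [0.01, 0.02, 0.01],
    "A2C": [0.03, -0.04, 0.02],
    "DDPG": [0.0, 0.001, -0.001],
}
print(pick_agent(validation))
```

Because every agent is scored over the same window, skipping annualization does not change which one is selected.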
QuantRL • PPO + Self-Attention RL Framework
Modern PPO + self-attention RL framework focused on research-grade evaluation with extensive metrics.
Nice template for clean metric logging and plots, even if you replace the RL policy with an LLM policy wrapper.
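Swapping an LLM in for the RL policy mostly means giving it the same observation-in, action-out shape. The sketch below assumes a discrete buy/hold/sell action space and an `act` method. Both are illustrative assumptions, not the framework's interface, and a stub function stands in for a real model call.

```python
from typing import Callable, Sequence

ACTIONS = ("buy", "hold", "sell")

class LLMPolicy:
    """Wraps a text-in/text-out callable so it can stand in for an RL policy."""

    def __init__(self, complete: Callable[[str], str], default: str = "hold"):
        if default not in ACTIONS:
            raise ValueError("default must be one of ACTIONS")
        self._complete = complete
        self._default = default

    def act(self, observation: Sequence[float]) -> int:
        """Map recent prices to an action index, like a discrete RL policy."""
        prompt = (
            "Recent closing prices: "
            + ", ".join(f"{price:.2f}" for price in observation)
            + ". Reply with one word: buy, hold, or sell."
        )
        reply = self._complete(prompt).strip().lower()
        # Fall back to the default when the reply is not a valid action.
        action = reply if reply in ACTIONS else self._default
        return ACTIONS.index(action)

def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call; always answers 'buy'."""
    return " BUY\n"

policy = LLMPolicy(stub_llm)
print(policy.act([101.5, 102.0, 103.25]))
```

The fallback matters in practice: free-form model output will sometimes fail to parse, and the evaluation loop still needs a valid action.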
Mature Backtesting & Execution Engines
Backtrader: classic Python framework for backtesting and live trading with Interactive Brokers, Oanda, etc. Extensive documentation and community.
AutoTrader: Python platform covering backtesting through live trading for multiple brokers/markets. Clean API and well-documented.
livealgos: "World's first live open-source trading algorithm… use our money to test your strategies." Strong ML + feature-engineering pipeline; live trading is "coming soon", but the code is geared for it. LGPL-3.0 licensed.