🤖 LLM Trading Resources Dashboard

Comprehensive guide to open-source LLM trading benchmarks, platforms & frameworks • December 2025

📈 LLM Trading Landscape Overview

This dashboard catalogs the current landscape of open-source projects that benchmark trading agents and LLMs, especially those connected to real market data and trading APIs. The ecosystem spans from pure simulation benchmarks to live trading platforms with real exchange connectivity.

12+ Projects Tracked • 4 Live Trading Platforms • 21+ LLM Backbones Tested • 7 MIT/Apache Licensed

🔑 Key Research Finding: Architecture > Model

Across all benchmarks, agent architecture determines performance far more than which LLM powers it. The same GPT-4.1 achieved +40.83% with InvestorAgent but -38.72% with TradeAgent—a 79 percentage point swing from architecture alone.

🎯 Category 1: Live / Near-Live LLM Trading Benchmarks

Platforms that evaluate LLM agents with real-time or near-real-time market data

AI-Trader

Open-source arena for LLM agents trading NASDAQ-100, SSE-50 & crypto with live leaderboards

LiveTradeBench

Full-stack platform for live evaluation on US equities and Polymarket prediction markets

Agent Market Arena

Continuous real-time multi-asset benchmark with standardized execution protocols

Agent Trading Arena

Multi-agent virtual market focusing on numerical understanding and price formation

📚 Category 2: Historical / Simulation Benchmarks

Backtesting frameworks designed to evaluate LLM trading strategies

StockBench

Contamination-free LLM benchmark on DJIA with post-2024 data and Apache-2.0 license

FinMem

LLM trading agent with layered memory architecture and character design system

🔗 Category 3: Multi-Agent Platforms with Exchange Connectivity

Production-ready platforms with live trading and logging capabilities

ValueCell

Community-driven multi-agent platform with live routing to major crypto exchanges

TradeTrap

Security & robustness evaluation toolkit for LLM trading agents

TradingAgents

Multi-agent LLM framework mimicking a trading firm structure

🤖 Category 4: RL-Based Trading Frameworks

Reinforcement learning ecosystems useful as baselines and infrastructure

FinRL / FinRL-Meta

The premier open-source framework for financial RL with 13k+ stars

QuantRL

Modern PPO + self-attention RL framework with 30+ evaluation metrics

Infrastructure Tools

Backtrader, AutoTrader, livealgos - mature execution and backtesting engines

🏆 Performance Hierarchy Across All Benchmarks

Tier | System Type | Best Sharpe Ratio | Best Returns | Examples
Tier 1 | Specialized Multi-Agent Architectures | 5.60 – 8.21 | 23% – 62% | TradingAgents, FinMem, HedgeFundAgent
Tier 2 | Simple Agent + Strong Model | 2.81 – 6.47 | 40% – 53% | InvestorAgent+GPT-4.1, FinRL Ensemble
Tier 3 | Raw LLM Trading (StockBench) | 0.03 – 0.04 (reported as Sortino) | 1.9% – 2.5% | Kimi-K2, Qwen3-235B, GPT-5

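⚠️ The performance gap between Tier 1 and Tier 3 spans more than two orders of magnitude (0.04 to 8.21, roughly 200×). The StockBench figures are Sortino ratios rather than Sharpe ratios, so the comparison is indicative rather than exact.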
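
As a quick check on the size of that gap, the ratio of the two quoted figures can be converted to orders of magnitude. This is plain arithmetic on the numbers in the tier table; the variable names are illustrative only.

```python
import math

# Figures quoted in the tier table above.
tier3_best = 0.04   # best raw-LLM risk-adjusted ratio (StockBench)
tier1_best = 8.21   # best multi-agent Sharpe ratio (TradingAgents, AAPL)

ratio = tier1_best / tier3_best
orders_of_magnitude = math.log10(ratio)

print(f"ratio = {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```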

📌 Actionable Principles from Research

  • Invest in architecture over models: the 79pp swing between agent frameworks using identical GPT-4.1 dwarfs any model upgrade
  • General benchmark performance doesn't predict trading ability: on StockBench, Kimi-K2 and Qwen3 outperformed GPT-5 despite lower LMArena scores
  • Market-specific strategies are essential: in LiveTradeBench, the Sharpe ratio correlation between stock and Polymarket performance was near zero
  • Most promising direction: memory-enhanced multi-agent systems achieved exceptional results by structuring information flow

📋 Quick Comparison Table

Project | Main Focus | Live vs Backtest | Multi-Model / Leaderboard | License | Best For
AI-Trader | LLM trading arena for NASDAQ, SSE, crypto | Historical replay + near-live | Yes - public leaderboard | MIT | Plug-in strategies; MCP-based agents
LiveTradeBench | Real-time LLM evaluation on stocks + Polymarket | Live + backtest | Yes - 21 models in paper | PolyForm NC | Reference live evaluation stack
Agent Market Arena | Lifelong real-time multi-asset benchmark | Live | Yes - multiple agents × LLMs | Paper only | Conceptual template & results
StockBench | Contamination-free LLM backtest on DJIA | Backtest | Yes - multi-profile runs | Apache-2.0 | Ready-made offline benchmark
FinMem | Memory-enhanced LLM agent framework | Backtest / sim | Single agent (extensible) | MIT | Strong agent baseline
ValueCell | Multi-agent LLM trading with live exchanges | Real live trading | No official leaderboard | Apache-2.0 | Real-money LLM workflows
TradeTrap | Robustness & security eval for LLM traders | Runs against live agents | Baseline vs attacked runs | Apache-2.0 | Hardening against attacks
FinRL / FinRL-Meta | DRL trading ecosystem + benchmark envs | Sim + paper trading | Many RL baselines | MIT | RL baselines & environment library
QuantRL | PPO + attention RL with rich metrics | Backtest | RL only | MIT | Metric-rich research pipeline
livealgos | Live ML trading "use our money to test" | Backtest + planned live | Not LLM-focused | LGPL-3.0 | Collaborative algo eval infra

AI-Trader

HKU Data Science • Open-source LLM Trading Arena

Near-Live + Replay • Public Leaderboard • MIT License

What It Is

An open-source "arena" where multiple LLM agents trade NASDAQ-100, SSE-50, and major crypto under identical rules. Features a public dashboard at ai4trade.ai showing live leaderboards and equity curves.

How It Works

  • Agents receive historical or near-real-time market data via Alpha Vantage, plus news via Jina search
  • Trade execution through MCP toolchain (trade tool, price tool, search tool)
  • Supports US stocks, Chinese A-shares, and crypto with configurable date ranges
  • Anti-lookahead "historical replay" mode for fair backtesting
  • Multi-model competition: GPT, Claude, Qwen, etc. all use same tools, capital, and schedule
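
The anti-lookahead idea behind historical replay can be sketched as a feed that never returns data dated after the simulated day. This is a minimal illustration, not AI-Trader's implementation: `Bar`, `ReplayFeed`, and the sample prices are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Bar:
    day: date
    symbol: str
    close: float


class ReplayFeed:
    """Hypothetical data feed that hides everything after the simulated 'today'."""

    def __init__(self, bars: list[Bar]) -> None:
        self._bars = sorted(bars, key=lambda b: b.day)

    def visible(self, today: date) -> list[Bar]:
        # The agent may only see bars dated on or before the simulated day.
        return [b for b in self._bars if b.day <= today]


# Made-up sample prices, for illustration only.
bars = [
    Bar(date(2025, 1, 2), "AAPL", 100.0),
    Bar(date(2025, 1, 3), "AAPL", 101.5),
    Bar(date(2025, 1, 6), "AAPL", 99.8),
]
feed = ReplayFeed(bars)
seen = feed.visible(date(2025, 1, 3))
print([b.close for b in seen])
```

Routing every data request through one such gate keeps the rule in a single place, so an agent tool cannot accidentally read a future bar.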

Evaluation Approach

  • Unified rules: same starting capital, synchronized trading windows, identical data feeds
  • Logs trades, reasoning traces, portfolio paths
  • Leaderboard updated continuously on the web app

Good For You If

  • You want a ready-made arena to plug your own agent into
  • Add agent/{your_strategy}.py plus a config; if you submit it as a pull request, the maintainers will run it for a week or more
  • You care about both live dashboards and reproducible historical replay

LiveTradeBench

UIUC U-Lab • Real-World Alpha with LLMs

Live Trading • 21 Models Tested • PolyForm NC

What It Is

A full-stack platform (Python + FastAPI + frontend) for live evaluation of LLM trading agents. Has an associated technical report: "LiveTradeBench: Seeking Real-World Alpha with Large Language Models".

Key Design

  • Two Environments: US equities (15 stocks) with live Yahoo Finance prices, and Polymarket prediction markets via CLOB API
  • Fetchers for prices, news, and Reddit sentiment
  • Portfolio & account abstractions for clean agent integration

Quick Start Code

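from live_trade_bench.systems import StockPortfolioSystem
# Get the portfolio system instance and register one LLM-backed agent
system = StockPortfolioSystem.get_instance()
system.add_agent(name="GPT-4o-mini Trader", initial_cash=10000.0, model_name="gpt-4o-mini")
# Prepare for live data, then run a single trading cycle
system.initialize_for_live()
system.run_cycle()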

Evaluation Setup (from paper)

  • 50 trading days (Aug 18 – Oct 24, 2025) across 21 different LLM backbones
  • Metrics: cumulative return, volatility, max drawdown, Sharpe, etc.
  • Results reported per model and per environment
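
The metrics listed above can all be derived from a daily equity curve. The sketch below is one conventional way to compute them, assuming a zero risk-free rate and 252 trading days per year; the paper's exact formulas are not reproduced here, and the function name and sample curve are illustrative.

```python
import math
import statistics


def performance_metrics(equity: list[float], periods_per_year: int = 252) -> dict[str, float]:
    """Compute common metrics from a daily equity curve (assumes zero risk-free rate)."""
    returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    daily_vol = statistics.stdev(returns)
    annualizer = math.sqrt(periods_per_year)

    # Max drawdown: largest peak-to-trough decline, as a positive fraction.
    peak, max_drawdown = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, 1.0 - value / peak)

    return {
        "cumulative_return": equity[-1] / equity[0] - 1.0,
        "volatility": daily_vol * annualizer,
        "sharpe": statistics.mean(returns) / daily_vol * annualizer if daily_vol > 0 else 0.0,
        "max_drawdown": max_drawdown,
    }


# Made-up equity curve, for illustration only.
metrics = performance_metrics([10_000.0, 10_100.0, 9_900.0, 10_200.0, 10_150.0])
print(metrics)
```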

Licensing Note

⚠️ PolyForm Noncommercial 1.0 – Free for research/experiments, but you'd need a commercial license for production use.

📊 Published Results (Aug-Oct 2025, 50 Trading Days)

Key Finding: Spearman correlation between LMArena benchmark scores and cumulative trading returns was 0.054 for stocks, essentially zero. For Polymarket, the correlation was -0.38: models with higher LMArena scores tended to earn lower returns.
Risk Profile | Characteristics | Models
Conservative | Lower volatility, smaller drawdowns | Claude-Opus-4.1, Grok-4
Risk-Seeking | Higher volatility, larger drawdowns | Kimi-K2-Instruct, GPT-5

21 LLMs Tested • 50 Trading Days • 0.054 LMArena↔Return Correlation • -0.38 Polymarket Correlation
GPT-4.1 Note: Achieved the highest cumulative return on stocks but suffered a maximum drawdown of more than 30%, caused by overreactive allocation changes during volatile periods.

Cross-market performance showed near-zero Sharpe ratio correlation between stock and Polymarket success, indicating specialized strategies rather than general trading intelligence.
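
The correlations quoted in this section are rank correlations. As a rough illustration of how such a figure is obtained, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks. The function names and the sample scores are made up and are not taken from the paper.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical benchmark scores and trading returns for four models.
benchmark_scores = [1400.0, 1350.0, 1300.0, 1250.0]
trading_returns = [0.01, 0.03, -0.02, 0.02]
print(spearman(benchmark_scores, trading_returns))
```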

Agent Market Arena (AMA)

"When Agents Trade" • Real-Time Multi-Asset Benchmark

Live Trading • Multi-Agent Leaderboard • Paper Only

What It Is

Benchmark introduced in "When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents" (Oct 2025). A continuous, real-time, multi-asset arena for evaluating LLM trading agents on TSLA, BMRN, BTC, and ETH.

Three Core Components

  • Market Intelligence Stream: Verified real-time data + news with heavy emphasis on data verification and de-duplication
  • Agent Execution Protocol: Standardized interaction & rules for all agents
  • Performance Analytics Interface: Dashboard / leaderboard for continuous tracking

Agents & Models Tested

  • Four agent frameworks: InvestorAgent, TradeAgent, HedgeFundAgent, DeepFundAgent
  • Each run with GPT-4o, GPT-4.1, Claude 3.5 Haiku, Claude Sonnet-4, Gemini-2.0-flash, etc.
  • Key finding: Agent architecture matters more than model backbone for outcome variation

Evaluation Metrics

  • Cumulative return, annualized volatility, max drawdown, Sharpe ratio
  • Live for at least two months, generating a continuously growing dataset
  • All agents start with identical capital, trade once per day under identical execution rules
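
A standardized execution protocol of this kind can be pictured as one loop that gives every agent the same capital, the same prices, and exactly one decision per day. The sketch below is an assumption-laden toy, not the AMA protocol: the agent signature, the one-share order size, and the prices are all hypothetical.

```python
from typing import Callable

# Hypothetical agent signature: (price, cash, shares) -> "BUY", "SELL", or "HOLD".
Agent = Callable[[float, float, int], str]


def run_arena(agents: dict[str, Agent], prices: list[float], capital: float = 10_000.0) -> dict[str, float]:
    """Run every agent over the same daily prices under the same rules; return final equity."""
    results = {}
    for name, agent in agents.items():
        cash, shares = capital, 0
        for price in prices:  # exactly one decision per day
            action = agent(price, cash, shares)
            if action == "BUY" and cash >= price:
                cash -= price
                shares += 1
            elif action == "SELL" and shares > 0:
                cash += price
                shares -= 1
        results[name] = cash + shares * prices[-1]
    return results


def buy_once(price: float, cash: float, shares: int) -> str:
    return "BUY" if shares == 0 else "HOLD"


def always_hold(price: float, cash: float, shares: int) -> str:
    return "HOLD"


prices = [100.0, 102.0, 101.0, 105.0]  # made-up daily prices
final_equity = run_arena({"buy_once": buy_once, "always_hold": always_hold}, prices)
print(final_equity)
```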

Availability Note

⚠️ Paper references online evaluation pipeline, but no clearly published GitHub yet – think of this as a spec + public leaderboard, not a turnkey repo.

📊 Published Results (Aug-Sep 2025, Live Trading)

Architecture > Model: The same GPT-4.1 achieved +40.83% returns with InvestorAgent but -38.72% losses with TradeAgent—a 79 percentage point swing from architecture alone.

TSLA Trading Results

Agent Framework | LLM | Cumulative Return | Sharpe Ratio | Max Drawdown
InvestorAgent | GPT-4.1 | +40.83% | 6.47 | 4.38%
InvestorAgent | Claude-sonnet-4 | +28.91% | 3.76 | 8.41%
TradeAgent | Gemini-2.0-flash | +21.91% | 3.82 | 6.41%
DeepFundAgent | Vote ensemble | +8.61% | 1.39 | 10.14%
HedgeFundAgent | All models | -29.15% | -4.39 | 29.15%
TradeAgent | GPT-4.1 | -38.72% | -5.38 | 38.72%
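
The headline swing follows directly from the two GPT-4.1 rows in the table. The snippet below only restates that subtraction; the dictionary layout is an illustrative choice.

```python
# (framework, LLM) -> cumulative return in percent, copied from the TSLA table above.
tsla_returns = {
    ("InvestorAgent", "GPT-4.1"): 40.83,
    ("TradeAgent", "GPT-4.1"): -38.72,
}

# Same model, different agent framework: the difference is the architecture swing.
swing_pp = tsla_returns[("InvestorAgent", "GPT-4.1")] - tsla_returns[("TradeAgent", "GPT-4.1")]
print(f"{swing_pp:.2f} percentage points")
```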

Cross-Asset Winners (Best Sharpe Ratio)

TSLA: InvestorAgent + GPT-4.1 (SR: 6.47)
BMRN: HedgeFundAgent (SR: 4.83, +23.70%)
BTC: DeepFundAgent + GPT-4.1 (SR: 2.45)
ETH: HedgeFundAgent (SR: 3.21, +39.66%)

HedgeFundAgent produced identical results across all LLM backends because its hierarchical 16-sub-agent design dampened individual model variance.

Agent Trading Arena (ATA)

Numerical Understanding in LLM-Based Agents

Simulation • Multi-Agent Self-Play

What It Is

From "Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents". A closed-loop virtual stock market where LLM agents trade against each other and affect prices.

Key Focus

  • More about multi-agent self-play and numerical reasoning than about real live markets
  • Simulates realistic bid-ask interactions and price formation
  • Agent actions actually move the market, not just react to it
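
To make "agent actions move the market" concrete, here is a toy order book in which the last traded price is set entirely by the orders agents submit. It is a deliberately simplified sketch, not the paper's market mechanism: one share per order, price priority only, and midpoint pricing are all assumptions.

```python
import heapq
from typing import Optional


class ToyOrderBook:
    """Hypothetical single-asset book: one share per order, price-priority matching."""

    def __init__(self) -> None:
        self._bids: list[float] = []  # max-heap, stored as negated prices
        self._asks: list[float] = []  # min-heap
        self.last_price: Optional[float] = None

    def submit(self, side: str, price: float) -> None:
        if side == "buy":
            heapq.heappush(self._bids, -price)
        elif side == "sell":
            heapq.heappush(self._asks, price)
        else:
            raise ValueError(f"unknown side: {side}")
        self._match()

    def _match(self) -> None:
        # Trade while the best bid meets or crosses the best ask.
        while self._bids and self._asks and -self._bids[0] >= self._asks[0]:
            bid = -heapq.heappop(self._bids)
            ask = heapq.heappop(self._asks)
            self.last_price = (bid + ask) / 2  # simplification: trade at the midpoint


book = ToyOrderBook()
book.submit("sell", 101.0)
book.submit("buy", 100.0)   # no cross yet, so no trade
price_before = book.last_price
book.submit("buy", 102.0)   # crosses the 101.0 ask and sets a new price
print(price_before, book.last_price)
```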

Good For

  • Sandbox testing where agent actions affect market dynamics
  • Studying numerical understanding capabilities of LLMs
  • Research on emergent market behaviors from multi-agent interactions

StockBench

LLM-Powered Stock Trading Benchmark Platform

Backtest Only • Multi-Profile Runs • Apache-2.0

What It Is

Associated paper: "Can LLM Agents Trade Stocks Profitably in Real-World Markets?" A plug-and-play offline LLM trading benchmark, easy to fork and retarget.

Design

  • Uses post-2024 DJIA data (top 20 stocks) to avoid training data contamination
  • Fundamentals + news from Polygon, Finnhub
  • Multi-step loop per day: portfolio state → analysis → trade decision → execution
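
The daily loop (portfolio state → analysis → trade decision → execution) might look like the sketch below. It is not StockBench's code: the order format, the affordability rule, and `stub_decide` (standing in for the LLM call) are hypothetical.

```python
from typing import Callable

Orders = dict[str, int]  # symbol -> signed share change (hypothetical format)


def run_day(cash: float, holdings: dict[str, int], prices: dict[str, float],
            decide: Callable[[dict], Orders]) -> tuple[float, dict[str, int]]:
    """One pass of portfolio state -> analysis -> trade decision -> execution."""
    # 1. Portfolio state handed to the decision maker.
    state = {"cash": cash, "holdings": dict(holdings), "prices": prices}
    # 2-3. Analysis and decision: an LLM call in a real benchmark, a stub here.
    orders = decide(state)
    # 4. Execution: skip orders that are unaffordable or would go short.
    holdings = dict(holdings)
    for symbol, qty in orders.items():
        cost = qty * prices[symbol]
        new_qty = holdings.get(symbol, 0) + qty
        if cost <= cash and new_qty >= 0:
            cash -= cost
            holdings[symbol] = new_qty
    return cash, holdings


def stub_decide(state: dict) -> Orders:
    return {"AAPL": 2}  # placeholder for a model-generated decision


cash, holdings = run_day(1_000.0, {}, {"AAPL": 200.0}, stub_decide)
print(cash, holdings)
```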

Evaluation Features

  • Backtest runner writes detailed reports
  • Metrics: Total/cumulative return, Sortino ratio, Maximum drawdown, etc.
  • CLI/shell wrapper for swapping LLM profiles (OpenAI, DeepSeek, etc.) via config
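
For reference, a common definition of the Sortino ratio divides mean excess return by downside deviation, so only losses count as risk. The sketch assumes a zero target return and no annualization; StockBench's exact convention is not stated here, and the sample returns are made up.

```python
import math


def sortino_ratio(returns: list[float], target: float = 0.0) -> float:
    """Mean excess return divided by downside deviation (per period, not annualized)."""
    excess = [r - target for r in returns]
    # Downside deviation only penalizes returns below the target.
    downside_dev = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    mean_excess = sum(excess) / len(excess)
    if downside_dev == 0:
        return math.inf if mean_excess > 0 else 0.0
    return mean_excess / downside_dev


sample_returns = [0.01, -0.02, 0.03, -0.01]  # made-up daily returns
sortino = sortino_ratio(sample_returns)
print(round(sortino, 4))
```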

Why It's Interesting

✓ Apache-2.0 license – very friendly for commercial reuse. Easy to fork and retarget to NSE/BSE by swapping adapters + data source.

📊 Published Results (Mar-Jun 2025, 82 Trading Days, DJIA Top 20)

Surprising Finding: Open-weight models dominated proprietary ones. GPT-5 ranked 9th and trailed the passive buy-and-hold baseline on both final return (+0.3% vs +0.4%) and Sortino ratio (0.0132 vs 0.0155). Only 7 of 13 LLM agents beat the baseline.
Rank | Model | Final Return | Max Drawdown | Sortino Ratio
🥇 1 | Kimi-K2 | +1.9% | -11.8% | 0.0420
🥈 2 | Qwen3-235B-Instruct | +2.4% | -11.2% | 0.0299
🥉 3 | GLM-4.5 | +2.3% | -13.7% | 0.0295
4 | Qwen3-235B-Think | +2.5% | -14.9% | 0.0309
5 | OpenAI-O3 | +1.9% | -13.2% | 0.0267
7 | Claude-4-Sonnet | +2.2% | -14.2% | 0.0245
9 | GPT-5 | +0.3% | -13.1% | 0.0132
12 | Buy-and-Hold Baseline | +0.4% | -15.2% | 0.0155

13 LLMs Tested • 82 Trading Days • $100K Starting Capital • 7/13 Beat Baseline

FinMem

LLM Agent with Layered Memory and Character Design

Backtest / Sim • MIT License

What It Is

MIT-licensed reference implementation for "FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design".

Three Core Modules

  • Profiling: Agent persona and risk profile configuration
  • Memory: Layered/hierarchical memory over financial history
  • Decision-making: Converts memory + current data into trades
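
One way to picture a layered memory is a set of stores that forget at different speeds, with retrieval ranking items by a combined score. The sketch below is an illustration under stated assumptions, not FinMem's implementation: the layer names, decay horizons, scoring rule (recency × importance), and sample events are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    day: int           # day index when the event was stored
    importance: float  # 0..1, assigned when stored


# Hypothetical layers: deeper layers forget more slowly (decay horizon in days).
LAYER_DECAY_DAYS = {"shallow": 3.0, "intermediate": 30.0, "deep": 180.0}


class LayeredMemory:
    def __init__(self) -> None:
        self.layers: dict[str, list[MemoryItem]] = {name: [] for name in LAYER_DECAY_DAYS}

    def add(self, layer: str, item: MemoryItem) -> None:
        self.layers[layer].append(item)

    def top_k(self, today: int, k: int = 2) -> list[MemoryItem]:
        """Return the k highest-scoring items across all layers."""
        scored = []
        for layer, items in self.layers.items():
            for item in items:
                recency = math.exp(-(today - item.day) / LAYER_DECAY_DAYS[layer])
                scored.append((recency * item.importance, item))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:k]]


memory = LayeredMemory()
memory.add("shallow", MemoryItem("daily news headline", day=0, importance=0.9))
memory.add("deep", MemoryItem("annual report summary", day=0, importance=0.6))
print(memory.top_k(today=0, k=1)[0].text, "|", memory.top_k(today=30, k=1)[0].text)
```

With these made-up settings the news item wins on the day it arrives, while a month later the slowly decaying annual report outranks it.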

Evaluation

  • Works over real-world stock & fund datasets (e.g., TSLA 2022)
  • CLI pipeline (run.py sim) with train/test modes, checkpointing
  • Paper compares FinMem vs other algorithmic agents – reports higher cumulative returns

Use Case

Less a "multi-model benchmark", more a strong reference agent whose architecture + evaluation harness you can adapt or pit against your own agents.

📊 Published Results (Oct 2022 - Apr 2023, ICLR 2024)

Memory Architecture Delivers Dramatic Gains: FinMem outperformed every tested approach including deep RL agents (A2C, PPO, DQN) that had 10 years of training data.
Stock | FinMem Return | FinMem Sharpe | Best Alternative | Alternative Return
TSLA | +61.78% | 2.68 | DQN | +33.34%
NFLX | +36.45% | 2.02 | Buy & Hold | +35.51%
MSFT | +23.26% | 1.44 | DQN | +14.74%
COIN | +34.98% | 0.72 | Generative Agents | +3.46%
AMZN | +4.89% | 0.23 | A2C | -6.36%

61.78% Best Return (TSLA) • 10.80% TSLA Max Drawdown • 52.00% DQN Max Drawdown • 5/5 Stocks Outperformed
Self-Adaptive Risk Profile: Dynamically switching between risk-seeking and risk-averse modes based on market conditions proved optimal for maximizing returns while controlling drawdowns.

ValueCell

Community-Driven Multi-Agent Trading Platform

Real Live Trading • Apache-2.0 • ~7k★ GitHub

What It Is

Very active Apache-2.0 project (~7k★), positioned as a community-driven multi-agent platform for financial applications.

Key Capabilities

  • Multi-Agent: DeepResearch Agent, Strategy Agent, News Retrieval Agent, etc.
  • LLM-Agnostic: Supports OpenAI, Google, DeepSeek, OpenRouter, etc.
  • Market Coverage: US, crypto, HK, China markets
  • Exchange Connectivity: Live routing to Binance, OKX, Hyperliquid, Coinbase, Gate, MEXC, etc.

Live Trading Workflow

"Configure AI model + configure exchange APIs → define strategy prompts → start trader and monitor PnL in the web UI."

For Benchmarking

  • Out-of-the-box it's a trading product, not a leaderboard
  • Already handles: LLM orchestration, real trades, data storage (LanceDB + SQLite)
  • You can bolt on your own evaluation jobs that compute Sharpe, drawdown, etc., across agents/LLMs
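
A bolt-on evaluation job could be as small as a query over stored equity snapshots followed by a metric calculation. The sketch below uses an in-memory SQLite database with a hypothetical `equity(agent, day, value)` table; it does not reflect ValueCell's actual storage schema, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one equity snapshot per agent per day.
conn.execute("CREATE TABLE equity (agent TEXT, day INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO equity VALUES (?, ?, ?)",
    [
        ("agent_a", 0, 100.0), ("agent_a", 1, 120.0), ("agent_a", 2, 90.0),
        ("agent_b", 0, 100.0), ("agent_b", 1, 110.0), ("agent_b", 2, 115.0),
    ],
)


def max_drawdown_by_agent(connection: sqlite3.Connection) -> dict[str, float]:
    """Largest peak-to-trough decline per agent, as a positive fraction."""
    peaks: dict[str, float] = {}
    drawdowns: dict[str, float] = {}
    query = "SELECT agent, value FROM equity ORDER BY agent, day"
    for agent, value in connection.execute(query):
        peaks[agent] = max(peaks.get(agent, value), value)
        drawdowns[agent] = max(drawdowns.get(agent, 0.0), 1.0 - value / peaks[agent])
    return drawdowns


drawdowns = max_drawdown_by_agent(conn)
print(drawdowns)
```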

TradeTrap

Robustness & Security Evaluation Toolkit

Live Agent Testing • Apache-2.0

What It Is

A security/eval toolkit for LLM trading agents, built explicitly around AI-Trader and ValueCell. Evaluates reliability/faithfulness, not just profitability.

Attack Vectors Tested

  • Prompt Injection: "Reverse expectation", "fake news shockwave"
  • MCP Tool Hijacking: Fake data feeds
  • State Tampering: Memory poisoning, etc.

How It Works

  • Plugs into AI-Trader and ValueCell pipelines
  • Runs attack modules against live agents
  • Records portfolio divergence vs clean baselines
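
Portfolio divergence against a clean baseline can be measured as the relative gap between the two equity paths at each step. This is a minimal sketch of that idea rather than TradeTrap's own metric; the function name and the sample paths are hypothetical.

```python
def portfolio_divergence(baseline: list[float], attacked: list[float]) -> tuple[list[float], float]:
    """Per-step relative gap of the attacked run versus the clean run, plus the worst gap."""
    if len(baseline) != len(attacked):
        raise ValueError("equity paths must have the same length")
    gaps = [(a - b) / b for b, a in zip(baseline, attacked)]
    return gaps, min(gaps)


# Made-up equity paths: the attacked agent falls behind after the first step.
clean_path = [10_000.0, 10_100.0, 10_250.0]
attacked_path = [10_000.0, 9_900.0, 9_400.0]
gaps, worst_gap = portfolio_divergence(clean_path, attacked_path)
print([round(g, 4) for g in gaps], round(worst_gap, 4))
```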

Use Case

✓ Great template for building agents that are not just profitable but robust to adversarial news / tool outputs.

TradingAgents

Multi-Agent LLM Trading Firm Framework

Research Framework

What It Is

A multi-agent LLM framework mimicking a trading firm structure: separate agents for fundamentals, sentiment, technicals, bull/bear researchers, risk team, and trader.

Agent Roles

  • Fundamentals Agent: Analyzes company financials
  • Sentiment Agent: Processes news and social sentiment
  • Technical Agent: Chart patterns and indicators
  • Bull/Bear Researchers: Debate and advocate positions
  • Risk Team: Portfolio risk management
  • Trader: Final execution decisions
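
The flow between these roles can be sketched as a small pipeline: analysts report, bull and bear cases are weighed, the risk team caps the position, and the trader acts. In the real framework each step is an LLM conversation; the numeric scores, the `decide` function, and the stub analysts below are simplifying assumptions.

```python
from typing import Callable

Analyst = Callable[[str], float]  # ticker -> score in [-1, 1] (hypothetical convention)


def decide(ticker: str, analysts: dict[str, Analyst], risk_limit: float = 0.5) -> tuple[str, float]:
    # 1. Each analyst role reports independently.
    reports = {role: analyst(ticker) for role, analyst in analysts.items()}
    # 2. Bull and bear researchers argue from the same reports.
    bull_case = sum(score for score in reports.values() if score > 0)
    bear_case = -sum(score for score in reports.values() if score < 0)
    conviction = (bull_case - bear_case) / len(reports)
    # 3. The risk team caps the position size.
    size = max(-risk_limit, min(risk_limit, conviction))
    # 4. The trader turns the capped view into an action.
    action = "BUY" if size > 0 else "SELL" if size < 0 else "HOLD"
    return action, size


# Stub analysts with fixed, made-up scores.
stub_analysts = {
    "fundamentals": lambda ticker: 0.8,
    "sentiment": lambda ticker: 0.6,
    "technical": lambda ticker: -0.2,
}
action, size = decide("AAPL", stub_analysts)
print(action, round(size, 2))
```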

Eval Angle

  • Primarily a research framework for collaboration & debates between agents
  • Can be instrumented as a benchmark by logging PnL vs baselines
  • Swap backbones (GPT/Claude/Qwen) for comparison
  • Doesn't ship a public leaderboard like AI-Trader / LiveTradeBench

📊 Published Results (Jun-Nov 2024, o1-preview + GPT-4o)

Exceptional Risk-Adjusted Performance: Sharpe ratios of 5.60–8.21 dramatically exceed typical benchmarks and rival professional quantitative funds.
Stock | TradingAgents Return | Buy & Hold Return | Sharpe Ratio | Max Drawdown
AAPL | +26.62% | -5.23% | 8.21 | 0.91%
GOOGL | +24.36% | +7.78% | 6.39 | 1.69%
AMZN | +23.21% | +17.10% | 5.60 | 2.11%

8.21 Best Sharpe (AAPL) • <2.5% Max Drawdowns • +16-32pp vs Technical Strategies • 6 Specialized Agent Roles

Outperformed MACD, SMA, and RSI-based technical strategies by 16–32 percentage points on cumulative returns. Max drawdowns under 2.5% contrast sharply with 10–15% drawdowns for buy-and-hold strategies.

FinRL / FinRL-Meta

The Premier Open-Source Financial RL Framework

Sim + Paper Trading • MIT License • 13k+ ★ GitHub

What It Is

The first major open-source framework for financial RL with 13k+ stars. Comprehensive layered architecture: environments, agents, applications (stock trading, crypto, portfolio allocation, HFT, etc.).

Key Features

  • Supports multiple RL backends: ElegantRL, RLlib, Stable-Baselines3
  • Many data sources: Alpaca, Binance, CCXT, IEX, etc.
  • Applications: stock trading, crypto, portfolio allocation, high-frequency trading

FinRL-Meta Extension

  • Hundreds of market environments
  • Reproduced papers as benchmarks
  • Training-testing-trading pipeline connecting to real-time APIs
  • Paper trading and real trading capabilities

Why You Care

  • Build RL baselines against which you compare LLM agents
  • Reuse training/testing/trading orchestration (especially for sim↔live handoff)
  • Extensive documentation and community support

📊 Published Results (Jul 2020 - Jun 2021, DJIA 30 Stocks)

The Deep RL Baseline: Ensemble approach achieved 52.61% annual return with Sharpe ratio 2.81—outperforming the DJIA index by nearly 20 percentage points.
Strategy | Annual Return | Sharpe Ratio | Max Drawdown
Ensemble (PPO+A2C+DDPG) | +52.61% | 2.81 | -7.09%
A2C | +46.65% | 2.24 | -7.59%
PPO | +42.57% | 2.36 | -9.04%
DJIA Index (Baseline) | +32.84% | 2.02 | -8.93%

52.61% Best Annual Return • 2.81 Best Sharpe Ratio • 103% PPO Crypto Return (10d) • 30 DJIA Stocks Traded

Note: Results from a strongly bullish market period. Cryptocurrency testing showed PPO achieving 103% cumulative return on top-10 market cap tokens over 10 days.
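
An ensemble of this kind is often built by re-selecting, for each trading window, whichever agent had the best Sharpe ratio on the preceding validation window. The sketch assumes that selection rule and a zero risk-free rate; FinRL's actual implementation may differ, and the validation returns are made up.

```python
import math
import statistics


def sharpe(returns: list[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (an assumption)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def pick_agent(validation_returns: dict[str, list[float]]) -> str:
    """Choose the agent with the highest validation Sharpe for the next window."""
    return max(validation_returns, key=lambda name: sharpe(validation_returns[name]))


# Hypothetical daily returns of each agent over one validation window.
validation = {
    "PPO": [0.010, 0.020, 0.015],
    "A2C": [0.030, -0.020, 0.010],
    "DDPG": [0.005, 0.004, 0.006],
}
chosen = pick_agent(validation)
print(chosen)
```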

QuantRL

PPO + Self-Attention RL Framework

Backtest • MIT License

What It Is

Modern PPO + self-attention RL framework focused on research-grade evaluation with extensive metrics.

Key Features

  • Rich feature engineering pipeline
  • Custom Gym environment
  • Advanced backtesting with 30+ metrics

Metrics Included

  • Sharpe ratio, Sortino ratio, Calmar ratio
  • Win rate, trade counts
  • Maximum drawdown and recovery
  • And 25+ more...

Use Case

Nice template for clean metric logging and plots, even if you replace the RL policy with an LLM policy wrapper.
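
Swapping the RL policy for an LLM mostly means keeping the same "observation in, action out" interface. The sketch below shows one way to wrap any text-completion function behind that interface. `Policy`, `LLMPolicy`, the action mapping, and the prompt are hypothetical and not part of QuantRL.

```python
from typing import Callable, Protocol, Sequence


class Policy(Protocol):
    def act(self, observation: Sequence[float]) -> int: ...


ACTIONS = {"SELL": 0, "HOLD": 1, "BUY": 2}  # hypothetical discrete action space


class LLMPolicy:
    """Adapts any text-in, text-out function to the Policy interface."""

    def __init__(self, complete: Callable[[str], str]) -> None:
        self._complete = complete

    def act(self, observation: Sequence[float]) -> int:
        prompt = f"Features: {list(observation)}. Answer BUY, SELL, or HOLD."
        reply = self._complete(prompt).strip().upper()
        # Fall back to HOLD when the reply cannot be parsed.
        return ACTIONS.get(reply, ACTIONS["HOLD"])


# A stub completion function stands in for a real model call.
policy: Policy = LLMPolicy(lambda prompt: " buy\n")
print(policy.act([0.1, -0.3]))
```

Because the backtester only sees `act`, the same metric logging and plots apply to either kind of policy.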

Infrastructure Tools

Mature Backtesting & Execution Engines

Backtest + Live

Backtrader

Classic Python framework for backtesting and live trading with Interactive Brokers, Oanda, etc. Extensive documentation and community.

AutoTrader

Python platform from backtesting to live trading for multiple brokers/markets. Clean API and well-documented.

livealgos

"World's first live open-source trading algorithm… use our money to test your strategies." Strong ML + feature-engineering pipeline; live trading "coming soon" but code is geared for it. LGPL-3.0 licensed.

Use Cases

  • Plug an LLM-based signal generator into existing execution/backtesting engine
  • Get mature order-routing and risk infrastructure "for free"
  • These are strategy containers – not LLM-specific, but very useful for integration