LLM Trading Resources Dashboard

📈 LLM Trading Landscape Overview

This dashboard catalogs the current landscape of open-source projects that benchmark trading agents and LLMs, especially those connected to real market data and trading APIs. The ecosystem ranges from pure simulation benchmarks to live trading platforms with real exchange connectivity.

12+

Projects Tracked

Live Trading Platforms

21+

LLM Backbones Tested

MIT/Apache Licensed

🎯 Category 1: Live / Near-Live LLM Trading Benchmarks

Platforms that evaluate LLM agents with real-time or near-real-time market data

AI-Trader

Open-source arena for LLM agents trading NASDAQ-100, SSE-50 & crypto with live leaderboards

LiveTradeBench

Full-stack platform for live evaluation on US equities and Polymarket prediction markets

Agent Market Arena

Continuous real-time multi-asset benchmark with standardized execution protocols

Agent Trading Arena

Multi-agent virtual market focusing on numerical understanding and price formation

📚 Category 2: Historical / Simulation Benchmarks

Backtesting frameworks designed to evaluate LLM trading strategies

StockBench

Contamination-free LLM benchmark on DJIA with post-2024 data and Apache-2.0 license

FinMem

LLM trading agent with layered memory architecture and character design system

🔗 Category 3: Multi-Agent Platforms with Exchange Connectivity

Production-ready platforms with live trading and logging capabilities

ValueCell

Community-driven multi-agent platform with live routing to major crypto exchanges

TradeTrap

Security & robustness evaluation toolkit for LLM trading agents

TradingAgents

Multi-agent LLM framework mimicking a trading firm structure

🤖 Category 4: RL-Based Trading Frameworks

Reinforcement learning ecosystems useful as baselines and infrastructure

FinRL / FinRL-Meta

The premier open-source framework for financial RL with 13k+ stars

QuantRL

Modern PPO + self-attention RL framework with 30+ evaluation metrics

Infrastructure Tools

Backtrader, AutoTrader, livealgos - mature execution and backtesting engines

🏆 Performance Hierarchy Across All Benchmarks

Tier	System Type	Best Sharpe Ratio	Best Returns	Examples
Tier 1	Specialized Multi-Agent Architectures	5.60 – 8.21	23% – 62%	TradingAgents, FinMem, HedgeFundAgent
Tier 2	Simple Agent + Strong Model	2.81 – 6.47	40% – 53%	InvestorAgent+GPT-4.1, FinRL Ensemble
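Tier 3	Raw LLM Trading (StockBench)	0.03 – 0.04 (Sortino)	1.9% – 2.5%	Kimi-K2, Qwen3-235B, GPT-5

⚠️ The performance gap between Tier 1 and Tier 3 spans more than two orders of magnitude in risk-adjusted ratio (0.04 to 8.21, roughly 200×). The StockBench figure is a Sortino ratio while the others are Sharpe ratios, and the benchmarks cover different assets and periods, so the comparison is indicative only.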

📌 Actionable Principles from Research

Invest in architecture over models — The 79pp swing between agent frameworks using identical GPT-4.1 dwarfs any model upgrade
General benchmark performance doesn't predict trading ability — Kimi-K2 and Qwen3 outperformed GPT-5 despite lower LMArena scores
Market-specific strategies are essential — Cross-market correlations approach zero
Most promising direction — Memory-enhanced multi-agent systems achieved exceptional results by structuring information flow

📋 Quick Comparison Table

Project	Main Focus	Live vs Backtest	Multi-Model / Leaderboard	License	Best For
AI-Trader	LLM trading arena for NASDAQ, SSE, crypto	Historical replay + near-live	Yes - public leaderboard	MIT	Plug-in strategies; MCP-based agents
LiveTradeBench	Real-time LLM evaluation on stocks + Polymarket	Live + backtest	Yes - 21 models in paper	PolyForm NC	Reference live evaluation stack
Agent Market Arena	Lifelong real-time multi-asset benchmark	Live	Yes - multiple agents × LLMs	Paper only	Conceptual template & results
StockBench	Contamination-free LLM backtest on DJIA	Backtest	Yes - multi-profile runs	Apache-2.0	Ready-made offline benchmark
FinMem	Memory-enhanced LLM agent framework	Backtest / sim	Single agent (extensible)	MIT	Strong agent baseline
ValueCell	Multi-agent LLM trading with live exchanges	Real live trading	No official leaderboard	Apache-2.0	Real-money LLM workflows
TradeTrap	Robustness & security eval for LLM traders	Runs against live agents	Baseline vs attacked runs	Apache-2.0	Hardening against attacks
FinRL / FinRL-Meta	DRL trading ecosystem + benchmark envs	Sim + paper trading	Many RL baselines	MIT	RL baselines & environment library
QuantRL	PPO + attention RL with rich metrics	Backtest	RL only	MIT	Metric-rich research pipeline
livealgos	Live ML trading "use our money to test"	Backtest + planned live	Not LLM-focused	LGPL-3.0	Collaborative algo eval infra

AI-Trader

HKU Data Science • Open-source LLM Trading Arena

Near-Live + Replay • Public Leaderboard • MIT License

GitHub Repository • Live Dashboard

What It Is

An open-source "arena" where multiple LLM agents trade NASDAQ-100, SSE-50, and major crypto under identical rules. Features a public dashboard at ai4trade.ai showing live leaderboards and equity curves.

How It Works

Agents receive historical or near-real-time market data via Alpha Vantage, plus news via Jina search
Trade execution through MCP toolchain (trade tool, price tool, search tool)
Supports US stocks, Chinese A-shares, and crypto with configurable date ranges
Anti-lookahead "historical replay" mode for fair backtesting
Multi-model competition: GPT, Claude, Qwen, etc. all use same tools, capital, and schedule
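
The anti-lookahead replay idea above can be illustrated with a minimal sketch. `ReplayFeed`, `Bar`, and their methods are hypothetical names chosen for this example, not AI-Trader's actual API; the only point is that an agent querying the feed at a simulated date never sees later bars.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Bar:
    day: date
    close: float


class ReplayFeed:
    """Hypothetical replay feed: serves only bars dated on or before `now`."""

    def __init__(self, bars: list[Bar], start: date) -> None:
        # Sort once so each visibility check is a binary search.
        self._bars = sorted(bars, key=lambda b: b.day)
        self._days = [b.day for b in self._bars]
        self.now = start

    def advance_to(self, day: date) -> None:
        # The simulated clock may only move forward.
        if day < self.now:
            raise ValueError("replay clock cannot move backwards")
        self.now = day

    def history(self) -> list[Bar]:
        # bisect_right returns the number of bars with day <= now.
        return self._bars[: bisect_right(self._days, self.now)]
```

An agent loop would call `advance_to` once per simulated step and build its prompt only from `history()`, so future prices cannot leak into a decision.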

Evaluation Approach

Unified rules: same starting capital, synchronized trading windows, identical data feeds
Logs trades, reasoning traces, portfolio paths
Leaderboard updated continuously on the web app

Good For You If

You want a ready-made arena to plug your own agent into
Add agent/{your_strategy}.py plus a config; the maintainers will run it for a week or more if you submit it as a PR
You care about both live dashboards and reproducible historical replay

LiveTradeBench

UIUC U-Lab • Real-World Alpha with LLMs

Live Trading • 21 Models Tested • PolyForm NC

GitHub Repository

What It Is

A full-stack platform (Python + FastAPI + frontend) for live evaluation of LLM trading agents. Has an associated technical report: "LiveTradeBench: Seeking Real-World Alpha with Large Language Models".

Key Design

Two Environments: US equities (15 stocks) with live Yahoo Finance prices, and Polymarket prediction markets via CLOB API
Fetchers for prices, news, and Reddit sentiment
Portfolio & account abstractions for clean agent integration

Quick Start Code

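from live_trade_bench.systems import StockPortfolioSystem

# Obtain the shared stock portfolio system instance.
system = StockPortfolioSystem.get_instance()
# Register one LLM-backed agent with its starting cash and model name.
system.add_agent(name="GPT-4o-mini Trader", initial_cash=10000.0, model_name="gpt-4o-mini")
# Prepare the system for live operation, then run a single trading cycle.
system.initialize_for_live()
system.run_cycle()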

Evaluation Setup (from paper)

50 trading days (Aug 18 – Oct 24, 2025) across 21 different LLM backbones
Metrics: cumulative return, volatility, max drawdown, Sharpe, etc.
Results reported per model and per environment
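
The metrics listed above can be computed from a series of daily returns. The following is a minimal sketch using common textbook definitions; the annualization factor of 252 trading days, the zero default risk-free rate, and the function names are assumptions for illustration, not the paper's stated conventions.

```python
import math


def cumulative_return(daily_returns: list[float]) -> float:
    """Compound simple daily returns into one total return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def annualized_volatility(daily_returns: list[float], periods: int = 252) -> float:
    """Sample standard deviation of daily returns, scaled by sqrt(periods)."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return math.sqrt(var) * math.sqrt(periods)


def max_drawdown(daily_returns: list[float]) -> float:
    """Largest peak-to-trough fall of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def sharpe_ratio(daily_returns: list[float], risk_free_daily: float = 0.0,
                 periods: int = 252) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - risk_free_daily for r in daily_returns]
    n = len(excess)
    mean = sum(excess) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in excess) / (n - 1))
    return mean / std * math.sqrt(periods)
```

All four functions expect at least two observations. With only 50 trading days, annualized figures are sensitive to a handful of large moves, which is worth keeping in mind when comparing models.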

Licensing Note

⚠️ PolyForm Noncommercial 1.0 – Free for research/experiments, but you'd need a commercial license for production use.

📊 Published Results (Aug-Oct 2025, 50 Trading Days)

Key Finding: Spearman correlation between LMArena benchmark scores and cumulative trading returns was 0.054 for stocks—essentially zero. For Polymarket, correlation was -0.38 (higher language ability = worse trading).
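
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks of two series. Below is a minimal pure-Python sketch; the function names are illustrative and the sample numbers are made up for the example, not taken from the paper. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up numbers purely for illustration; not data from the paper.
arena_scores = [1400.0, 1350.0, 1300.0, 1250.0]
trading_returns = [0.01, 0.03, -0.02, 0.02]
rho = spearman(arena_scores, trading_returns)
```

A value near zero, as reported for stocks, means a model's benchmark rank carries almost no information about its return rank.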

Risk Profile	Characteristics	Models
Conservative	Lower volatility, smaller drawdowns	Claude-Opus-4.1, Grok-4
Risk-Seeking	Higher volatility, larger drawdowns	Kimi-K2-Instruct, GPT-5

21

LLMs Tested

50

Trading Days

0.054

LMArena↔Return Correlation

-0.38

Polymarket Correlation

GPT-4.1 Note: Achieved the highest cumulative return on stocks but suffered a maximum drawdown exceeding 30% due to overreactive allocation changes during volatility.

Cross-market performance showed near-zero Sharpe ratio correlation between stock and Polymarket success, indicating specialized strategies rather than general trading intelligence.

Agent Market Arena (AMA)

"When Agents Trade" • Real-Time Multi-Asset Benchmark

Live Trading • Multi-Agent Leaderboard • Paper Only

Public Leaderboard

What It Is

Benchmark introduced in "When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents" (Oct 2025). A continuous, real-time, multi-asset arena for evaluating LLM trading agents on four assets: TSLA, BMRN, BTC, and ETH.

Three Core Components

Market Intelligence Stream: Verified real-time data + news with heavy emphasis on data verification and de-duplication
Agent Execution Protocol: Standardized interaction & rules for all agents
Performance Analytics Interface: Dashboard / leaderboard for continuous tracking

Agents & Models Tested

Four agent frameworks: InvestorAgent, TradeAgent, HedgeFundAgent, DeepFundAgent
Each run with GPT-4o, GPT-4.1, Claude 3.5 Haiku, Claude Sonnet-4, Gemini-2.0-flash, etc.
Key finding: Agent architecture matters more than model backbone for outcome variation

Evaluation Metrics

Cumulative return, annualized volatility, max drawdown, Sharpe ratio
Live for at least two months, generating a continuously growing dataset
All agents start with identical capital, trade once per day under identical execution rules

Availability Note

⚠️ Paper references online evaluation pipeline, but no clearly published GitHub yet – think of this as a spec + public leaderboard, not a turnkey repo.

📊 Published Results (Aug-Sep 2025, Live Trading)

Architecture > Model: The same GPT-4.1 achieved a +40.83% return with InvestorAgent but a -38.72% return with TradeAgent, a swing of more than 79 percentage points from architecture alone.

TSLA Trading Results

Agent Framework	LLM	Cumulative Return	Sharpe Ratio	Max Drawdown
InvestorAgent	GPT-4.1	+40.83%	6.47	4.38%
InvestorAgent	Claude-sonnet-4	+28.91%	3.76	8.41%
TradeAgent	Gemini-2.0-flash	+21.91%	3.82	6.41%
DeepFundAgent	Vote ensemble	+8.61%	1.39	10.14%
HedgeFundAgent	All models	-29.15%	-4.39	29.15%
TradeAgent	GPT-4.1	-38.72%	-5.38	38.72%

Cross-Asset Winners (Best Sharpe Ratio)

Asset	Best Agent (Sharpe Ratio, Return)
TSLA	InvestorAgent + GPT-4.1 (SR: 6.47)
BMRN	HedgeFundAgent (SR: 4.83, +23.70%)
BTC	DeepFundAgent + GPT-4.1 (SR: 2.45)
ETH	HedgeFundAgent (SR: 3.21, +39.66%)

HedgeFundAgent produced identical results across all LLM backends because its hierarchical 16-sub-agent design dampened individual model variance.

Agent Trading Arena (ATA)

Numerical Understanding in LLM-Based Agents

Simulation • Multi-Agent Self-Play

GitHub Repository

What It Is

From "Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents". A closed-loop virtual stock market where LLM agents trade against each other and affect prices.

Key Focus

More about multi-agent self-play and numerical reasoning than about real live markets
Simulates realistic bid-ask interactions and price formation
Agent actions actually move the market, not just react to it
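
To make "agent actions move the market" concrete, here is a toy sketch of price formation by matching limit orders. It is not the paper's mechanism: the `Order` type, the `clear_market` function, and the midpoint pricing rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Order:
    agent: str
    side: str     # "buy" or "sell"
    price: float  # limit price
    qty: int


def clear_market(orders: list[Order], last_price: float):
    """Match the best bid against the best ask until they no longer cross.

    Returns (new_price, fills). new_price is the last trade price, or
    last_price if nothing traded. Each fill is (buyer, seller, price, qty).
    """
    bids = sorted((o for o in orders if o.side == "buy"), key=lambda o: -o.price)
    asks = sorted((o for o in orders if o.side == "sell"), key=lambda o: o.price)
    # Track remaining quantity separately so the input orders are not mutated.
    bid_left = [o.qty for o in bids]
    ask_left = [o.qty for o in asks]
    fills = []
    price = last_price
    b = a = 0
    while b < len(bids) and a < len(asks) and bids[b].price >= asks[a].price:
        qty = min(bid_left[b], ask_left[a])
        price = (bids[b].price + asks[a].price) / 2  # midpoint rule (assumption)
        fills.append((bids[b].agent, asks[a].agent, price, qty))
        bid_left[b] -= qty
        ask_left[a] -= qty
        if bid_left[b] == 0:
            b += 1
        if ask_left[a] == 0:
            a += 1
    return price, fills
```

Because the next round's reference price is whatever the agents just traded at, an aggressive bid from one agent changes the data every other agent sees next, which is the closed-loop property that distinguishes this setup from replaying real market data.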

Good For

Sandbox testing where agent actions affect market dynamics
Studying numerical understanding capabilities of LLMs
Research on emergent market behaviors from multi-agent interactions

StockBench

LLM-Powered Stock Trading Benchmark Platform

Backtest Only • Multi-Profile Runs • Apache-2.0

GitHub Repository

What It Is

Associated paper: "Can LLM Agents Trade Stocks Profitably in Real-World Markets?" A plug-and-play offline LLM trading benchmark, easy to fork and retarget.

Design

Uses post-2024 DJIA data (top 20 stocks) to avoid training data contamination
Fundamentals + news from Polygon, Finnhub
Multi-step loop per day: portfolio state → analysis → trade decision → execution
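
The daily loop described above (portfolio state, analysis, trade decision, execution) can be sketched as follows. This is an illustrative skeleton rather than StockBench's code: `Portfolio`, `run_backtest`, and the `decide` callback are hypothetical names, and in a real run `decide` would wrap the LLM analysis and decision steps.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Portfolio:
    cash: float
    shares: dict[str, int] = field(default_factory=dict)

    def value(self, prices: dict[str, float]) -> float:
        return self.cash + sum(q * prices[s] for s, q in self.shares.items())


# A decision function maps (portfolio, prices) to share deltas per symbol:
# positive to buy, negative to sell.
Decide = Callable[[Portfolio, dict[str, float]], dict[str, int]]


def run_backtest(days: list[dict[str, float]], decide: Decide, cash: float) -> list[float]:
    """Run one decision per day and return the end-of-day equity curve."""
    portfolio = Portfolio(cash=cash)
    equity_curve = []
    for prices in days:
        orders = decide(portfolio, prices)        # analysis + trade decision
        for symbol, delta in orders.items():      # execution
            cost = delta * prices[symbol]
            held = portfolio.shares.get(symbol, 0)
            # Reject orders that would overdraw cash or short the position.
            if cost > portfolio.cash or held + delta < 0:
                continue
            portfolio.cash -= cost
            portfolio.shares[symbol] = held + delta
        equity_curve.append(portfolio.value(prices))
    return equity_curve
```

Keeping the decision step behind a plain callable is one way to swap LLM profiles without touching the execution and accounting code.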

Evaluation Features

Backtest runner writes detailed reports
Metrics: Total/cumulative return, Sortino ratio, Maximum drawdown, etc.
CLI/shell wrapper for swapping LLM profiles (OpenAI, DeepSeek, etc.) via config
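
Since the Sortino ratio is less familiar than the Sharpe ratio, a minimal sketch of one common definition follows. Whether StockBench annualizes the value or uses a non-zero target return is not stated here, so the zero default target and the per-period (non-annualized) output are assumptions.

```python
import math


def sortino_ratio(returns: list[float], target: float = 0.0) -> float:
    """Mean excess return over target, divided by downside deviation."""
    n = len(returns)
    mean_excess = sum(r - target for r in returns) / n
    # Downside deviation: root-mean-square of shortfalls below the target;
    # returns at or above the target contribute zero.
    downside = math.sqrt(sum(min(r - target, 0.0) ** 2 for r in returns) / n)
    if downside == 0.0:
        raise ValueError("no returns below target; Sortino ratio is undefined")
    return mean_excess / downside
```

Unlike the Sharpe ratio, only losses enter the denominator, so the two ratios are not interchangeable when comparing results across benchmarks.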

Why It's Interesting

✓ Apache-2.0 license – very friendly for commercial reuse. Easy to fork and retarget to NSE/BSE by swapping adapters + data source.

📊 Published Results (Mar-Jun 2025, 82 Trading Days, DJIA Top 20)

Surprising Finding: Open-weight models dominated proprietary ones. GPT-5, ranked 9th, finished below the passive buy-and-hold baseline on both final return (+0.3% vs +0.4%) and Sortino ratio (0.0132 vs 0.0155). Only 7 of 13 LLM agents beat the baseline.

Rank	Model	Final Return	Max Drawdown	Sortino Ratio
🥇 1	Kimi-K2	+1.9%	-11.8%	0.0420
🥈 2	Qwen3-235B-Instruct	+2.4%	-11.2%	0.0299
🥉 3	GLM-4.5	+2.3%	-13.7%	0.0295
4	Qwen3-235B-Think	+2.5%	-14.9%	0.0309
5	OpenAI-O3	+1.9%	-13.2%	0.0267
7	Claude-4-Sonnet	+2.2%	-14.2%	0.0245
9	GPT-5	+0.3%	-13.1%	0.0132
12	Buy-and-Hold Baseline	+0.4%	-15.2%	0.0155

13

LLMs Tested

82

Trading Days

$100K

Starting Capital

7/13

Beat Baseline

FinMem

LLM Agent with Layered Memory and Character Design

Backtest / Sim • MIT License

GitHub Repository

What It Is

MIT-licensed reference implementation for "FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design".

Three Core Modules

Profiling: Agent persona and risk profile configuration
Memory: Layered/hierarchical memory over financial history
Decision-making: Converts memory + current data into trades
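
The layered-memory idea can be sketched as a store in which each layer forgets at its own pace. This is a simplified illustration, not FinMem's implementation: the class and method names, the half-life decay rule, and the importance-times-recency score are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    day: int           # day index when the event was recorded
    importance: float  # higher means more worth remembering


class LayeredMemory:
    """Hypothetical layered memory: each layer has its own decay half-life."""

    def __init__(self, half_lives: dict[str, float]) -> None:
        # Maps layer name to the number of days after which a score halves.
        self.half_lives = half_lives
        self.layers: dict[str, list[MemoryItem]] = {name: [] for name in half_lives}

    def add(self, layer: str, item: MemoryItem) -> None:
        self.layers[layer].append(item)

    def score(self, layer: str, item: MemoryItem, today: int) -> float:
        # Importance discounted by age, with a layer-specific half-life.
        age = today - item.day
        return item.importance * 0.5 ** (age / self.half_lives[layer])

    def top_k(self, today: int, k: int) -> list[str]:
        """Return the k highest-scoring memories across all layers."""
        scored = [
            (self.score(name, item, today), item.text)
            for name, items in self.layers.items()
            for item in items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

With a short half-life for daily news and a long one for filings, a fresh headline dominates retrieval at first, while an annual report resurfaces once the headline has decayed.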

Evaluation

Works over real-world stock & fund datasets (e.g., TSLA 2022)
CLI pipeline (run.py sim) with train/test modes, checkpointing
Paper compares FinMem vs other algorithmic agents – reports higher cumulative returns

Use Case

Less a "multi-model benchmark", more a strong reference agent whose architecture + evaluation harness you can adapt or pit against your own agents.

📊 Published Results (Oct 2022 - Apr 2023, ICLR 2024)

Memory Architecture Delivers Dramatic Gains: FinMem outperformed every tested approach including deep RL agents (A2C, PPO, DQN) that had 10 years of training data.

Stock	FinMem Return	FinMem Sharpe	Best Alternative	Alternative Return
TSLA	+61.78%	2.68	DQN	+33.34%
NFLX	+36.45%	2.02	Buy & Hold	+35.51%
MSFT	+23.26%	1.44	DQN	+14.74%
COIN	+34.98%	0.72	Generative Agents	+3.46%
AMZN	+4.89%	0.23	A2C	-6.36%

61.78%

Best Return (TSLA)

10.80%

TSLA Max Drawdown

52.00%

DQN Max Drawdown

5/5

Stocks Outperformed

Self-Adaptive Risk Profile: Dynamically switching between risk-seeking and risk-averse modes based on market conditions proved optimal for maximizing returns while controlling drawdowns.

ValueCell

Community-Driven Multi-Agent Trading Platform

Real Live Trading • Apache-2.0 • ~7k★ GitHub

GitHub Repository

What It Is

Very active Apache-2.0 project (~7k★), positioned as a community-driven multi-agent platform for financial applications.

Key Capabilities

Multi-Agent: DeepResearch Agent, Strategy Agent, News Retrieval Agent, etc.
LLM-Agnostic: Supports OpenAI, Google, DeepSeek, OpenRouter, etc.
Market Coverage: US, crypto, HK, China markets
Exchange Connectivity: Live routing to Binance, OKX, Hyperliquid, Coinbase, Gate, MEXC, etc.

Live Trading Workflow

"Configure AI model + configure exchange APIs → define strategy prompts → start trader and monitor PnL in the web UI."

For Benchmarking

Out-of-the-box it's a trading product, not a leaderboard
Already handles: LLM orchestration, real trades, data storage (LanceDB + SQLite)
You can bolt on your own evaluation jobs that compute Sharpe, drawdown, etc., across agents/LLMs

TradeTrap

Robustness & Security Evaluation Toolkit

Live Agent Testing • Apache-2.0

GitHub Repository

What It Is

A security/eval toolkit for LLM trading agents, built explicitly around AI-Trader and ValueCell. Evaluates reliability/faithfulness, not just profitability.

Attack Vectors Tested

Prompt Injection: "Reverse expectation", "fake news shockwave"
MCP Tool Hijacking: Fake data feeds
State Tampering: Memory poisoning, etc.

How It Works

Plugs into AI-Trader and ValueCell pipelines
Runs attack modules against live agents
Records portfolio divergence vs clean baselines
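
One simple way to quantify "portfolio divergence vs clean baselines" is to compare the equity curve of an attacked run with that of a clean run at the same timestamps. The sketch below is an assumption about how such a comparison could look; the function name and the three summary statistics are not taken from TradeTrap.

```python
def divergence(clean: list[float], attacked: list[float]) -> dict[str, float]:
    """Compare two equity curves sampled at the same timestamps."""
    if len(clean) != len(attacked) or not clean:
        raise ValueError("curves must be non-empty and the same length")
    # Relative gap at each step, measured against the clean run.
    gaps = [(a - c) / c for c, a in zip(clean, attacked)]
    return {
        "final_gap": gaps[-1],        # where the attacked run ends up
        "worst_gap": min(gaps),       # deepest shortfall along the way
        "mean_abs_gap": sum(abs(g) for g in gaps) / len(gaps),
    }
```

A negative `final_gap` means the attack cost the agent money relative to its clean baseline; a large `mean_abs_gap` with a small `final_gap` suggests the attack changed behavior without changing the end result.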

Use Case

✓ Great template for building agents that are not just profitable but robust to adversarial news / tool outputs.

TradingAgents

Multi-Agent LLM Trading Firm Framework

Research Framework

GitHub Repository

What It Is

A multi-agent LLM framework mimicking a trading firm structure: separate agents for fundamentals, sentiment, technicals, bull/bear researchers, risk team, and trader.

Agent Roles

Fundamentals Agent: Analyzes company financials
Sentiment Agent: Processes news and social sentiment
Technical Agent: Chart patterns and indicators
Bull/Bear Researchers: Debate and advocate positions
Risk Team: Portfolio risk management
Trader: Final execution decisions
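
The flow between these roles can be sketched as a small pipeline. In the real framework each role is an LLM agent; here they are reduced to plain functions and arithmetic so the structure is visible. All names, the scoring scale, the threshold, and the position cap are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Each analyst maps a ticker to a score in [-1, 1] (bearish to bullish).
Analyst = Callable[[str], float]


@dataclass
class Decision:
    action: str  # "buy", "sell", or "hold"
    size: float  # fraction of capital, after the risk cap


def run_firm(ticker: str, analysts: dict[str, Analyst],
             max_size: float = 0.2, threshold: float = 0.1) -> Decision:
    """Analysts report, researchers weigh both sides, risk caps, trader decides."""
    reports = {name: fn(ticker) for name, fn in analysts.items()}
    # Bull and bear researchers each argue from the reports on their side.
    bull_case = sum(s for s in reports.values() if s > 0)
    bear_case = -sum(s for s in reports.values() if s < 0)
    conviction = (bull_case - bear_case) / len(reports)
    # Risk team caps the position size regardless of conviction.
    size = min(abs(conviction), max_size)
    # Trader makes the final call.
    if conviction > threshold:
        return Decision("buy", size)
    if conviction < -threshold:
        return Decision("sell", size)
    return Decision("hold", 0.0)
```

The function assumes at least one analyst is supplied. Separating the risk cap from the conviction score mirrors the idea that the risk team can constrain a trade without having to win the bull/bear debate.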

Eval Angle

Primarily a research framework for collaboration & debates between agents
Can be instrumented as a benchmark by logging PnL vs baselines
Swap backbones (GPT/Claude/Qwen) for comparison
Doesn't ship a public leaderboard like AI-Trader / LiveTradeBench

📊 Published Results (Jun-Nov 2024, o1-preview + GPT-4o)

Exceptional Risk-Adjusted Performance: Sharpe ratios of 5.60–8.21 dramatically exceed typical benchmarks and rival professional quantitative funds.

Stock	TradingAgents Return	Buy & Hold Return	Sharpe Ratio	Max Drawdown
AAPL	+26.62%	-5.23%	8.21	0.91%
GOOGL	+24.36%	+7.78%	6.39	1.69%
AMZN	+23.21%	+17.10%	5.60	2.11%

8.21

Best Sharpe (AAPL)

<2.5%

Max Drawdowns

+16-32pp

vs Technical Strategies

Specialized Agent Roles

Outperformed MACD, SMA, and RSI-based technical strategies by 16–32 percentage points on cumulative returns. Max drawdowns under 2.5% contrast sharply with 10–15% drawdowns for buy-and-hold strategies.

FinRL / FinRL-Meta

The Premier Open-Source Financial RL Framework

Sim + Paper Trading • MIT License • 13k+ ★ GitHub

FinRL GitHub • FinRL-Meta GitHub

What It Is

The first major open-source framework for financial RL with 13k+ stars. Comprehensive layered architecture: environments, agents, applications (stock trading, crypto, portfolio allocation, HFT, etc.).

Key Features

Supports multiple RL backends: ElegantRL, RLlib, Stable-Baselines3
Many data sources: Alpaca, Binance, CCXT, IEX, etc.
Applications: stock trading, crypto, portfolio allocation, high-frequency trading

FinRL-Meta Extension

Hundreds of market environments
Reproduced papers as benchmarks
Training-testing-trading pipeline connecting to real-time APIs
Paper trading and real trading capabilities

Why You Care

Build RL baselines against which you compare LLM agents
Reuse training/testing/trading orchestration (especially for sim↔live handoff)
Extensive documentation and community support

📊 Published Results (Jul 2020 - Jun 2021, DJIA 30 Stocks)

The Deep RL Baseline: Ensemble approach achieved 52.61% annual return with Sharpe ratio 2.81—outperforming the DJIA index by nearly 20 percentage points.

Strategy	Annual Return	Sharpe Ratio	Max Drawdown
Ensemble (PPO+A2C+DDPG)	+52.61%	2.81	-7.09%
A2C	+46.65%	2.24	-7.59%
PPO	+42.57%	2.36	-9.04%
DJIA Index (Baseline)	+32.84%	2.02	-8.93%

52.61%

Best Annual Return

2.81

Best Sharpe Ratio

103%

PPO Crypto Return (10d)

30

DJIA Stocks Traded

Note: Results from a strongly bullish market period. Cryptocurrency testing showed PPO achieving 103% cumulative return on top-10 market cap tokens over 10 days.

QuantRL

PPO + Self-Attention RL Framework

Backtest • MIT License

GitHub Repository

What It Is

Modern PPO + self-attention RL framework focused on research-grade evaluation with extensive metrics.

Key Features

Rich feature engineering pipeline
Custom Gym environment
Advanced backtesting with 30+ metrics

Metrics Included

Sharpe ratio, Sortino ratio, Calmar ratio
Win rate, trade counts
Maximum drawdown and recovery
And 25+ more...
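
Two of the metrics named above, the Calmar ratio and the win rate, are simple enough to sketch directly. These follow common definitions and are not taken from QuantRL's code; the function names and the choice to count break-even trades as non-wins are assumptions.

```python
def calmar_ratio(annual_return: float, max_drawdown: float) -> float:
    """Annualized return divided by maximum drawdown (a positive fraction)."""
    if max_drawdown <= 0:
        raise ValueError("max_drawdown must be a positive fraction")
    return annual_return / max_drawdown


def win_rate(trade_pnls: list[float]) -> float:
    """Share of closed trades with strictly positive profit."""
    if not trade_pnls:
        raise ValueError("no trades")
    return sum(1 for p in trade_pnls if p > 0) / len(trade_pnls)
```

Reporting both is useful because a strategy can have a high win rate yet a poor Calmar ratio if its rare losses are large.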

Use Case

Nice template for clean metric logging and plots, even if you replace the RL policy with an LLM policy wrapper.

Infrastructure Tools

Mature Backtesting & Execution Engines

Backtest + Live

Backtrader

GitHub Repository

Classic Python framework for backtesting and live trading with Interactive Brokers, Oanda, etc. Extensive documentation and community.

AutoTrader

GitHub Repository

Python platform from backtesting to live trading for multiple brokers/markets. Clean API and well-documented.

livealgos

GitHub Repository

"World's first live open-source trading algorithm… use our money to test your strategies." Strong ML + feature-engineering pipeline; live trading "coming soon" but code is geared for it. LGPL-3.0 licensed.

Use Cases

Plug an LLM-based signal generator into existing execution/backtesting engine
Get mature order-routing and risk infrastructure "for free"
These are strategy containers – not LLM-specific, but very useful for integration
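
Plugging an LLM-based signal generator into an existing engine usually comes down to a thin adapter. The sketch below uses a hypothetical `Engine` interface and a toy `PaperEngine`, not the API of Backtrader, AutoTrader, or livealgos; the signal function stands in for an LLM call that returns "buy", "sell", or "hold".

```python
from typing import Callable, Protocol


class Engine(Protocol):
    """Hypothetical stand-in for an execution/backtesting engine."""

    def position(self, symbol: str) -> int: ...
    def submit(self, symbol: str, qty: int) -> None: ...


class PaperEngine:
    """Toy in-memory engine used only to exercise the adapter."""

    def __init__(self) -> None:
        self.positions: dict[str, int] = {}
        self.orders: list[tuple[str, int]] = []

    def position(self, symbol: str) -> int:
        return self.positions.get(symbol, 0)

    def submit(self, symbol: str, qty: int) -> None:
        self.positions[symbol] = self.position(symbol) + qty
        self.orders.append((symbol, qty))


# In practice this would wrap an LLM call; here it is any callable.
SignalFn = Callable[[str, list[float]], str]


def on_bar(engine: Engine, symbol: str, closes: list[float],
           signal: SignalFn, lot: int = 10) -> None:
    """Translate a text signal into an order, keeping the position at 0 or lot."""
    decision = signal(symbol, closes).strip().lower()
    held = engine.position(symbol)
    if decision == "buy" and held == 0:
        engine.submit(symbol, lot)
    elif decision == "sell" and held > 0:
        engine.submit(symbol, -held)
    # Anything else, including malformed model output, is treated as "hold".
```

Treating unrecognized output as "hold" is a deliberate design choice: the engine's order routing and risk checks stay in charge, and a confused model response cannot place a trade.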
