Prediction Benchmark

Evaluated on 100 resolved prediction markets from Polymarket and Metaculus. All results on held-out test data.

Brier Score: 0.197 (lower is better; 0 = perfect)
Calibration Error (ECE): 9.4% (perfect = 0%)
Accuracy: 71% (binary classification at a 50% threshold)
Test Markets: 100 (held-out resolved markets)
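
For readers unfamiliar with the headline metrics, here is a minimal sketch of how a Brier score and a 50%-threshold accuracy are commonly computed. The function names, the example forecasts, and the choice to count exactly 50% as a YES prediction are illustrative assumptions, not the benchmark's actual evaluation code.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def accuracy_at_half(probs, outcomes):
    """Share of markets where the side favored at a 50% threshold matched the outcome.

    Assumption: a forecast of exactly 0.5 counts as a YES prediction.
    """
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)


# Hypothetical forecasts, not taken from the benchmark.
example_probs = [0.9, 0.2, 0.6, 0.4]
example_outcomes = [1, 0, 0, 1]
print(brier_score(example_probs, example_outcomes))
print(accuracy_at_half(example_probs, example_outcomes))
```

A forecaster that always answers 50% scores 0.25 on the Brier score regardless of outcomes, which is why values below 0.25 indicate some skill.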

Calibration Curve

Predicted probability vs actual outcome frequency. Points on the diagonal = perfect calibration.

[Figure: calibration curve. x-axis: Predicted Probability (0.0 to 1.0); y-axis: Actual Frequency (0.0 to 1.0).]

Dot size = sample count. ECE = 9.4%
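
The curve and the ECE figure can be related with a short sketch. It assumes equal-width probability bins and 10 bins; the page does not state which binning scheme or bin count was used, so both are assumptions, as is the function name.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Count-weighted mean gap between predicted probability and observed frequency.

    Assumptions: equal-width bins over [0, 1] and n_bins=10.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_pred - observed)
    return ece


# Hypothetical example: four 90% forecasts, only half of which resolved YES.
print(expected_calibration_error([0.9] * 4, [1, 1, 0, 0]))
```

Each non-empty bin corresponds to one dot on the curve: its x-position is the mean predicted probability, its y-position the observed frequency, and its size the bin count.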

Brier Score Comparison

Lower is better. Compared on 58 markets with consensus data.

Market Consensus: 0.088
MarketFlux (calibrated): 0.174
Zero-Shot 70B: 0.287

Improvement over zero-shot baseline: -31.2%
Brier Score on the full 100-market test set: 0.287 → 0.197

Model Evolution

How performance changed with each training iteration on the same 100-market test set.

Version              Brier Score   ECE     Log Loss   Accuracy   Notes
Zero-Shot (70B)      0.287         22.1%   -          49%        Llama 3.3 70B via Ollama, no training
SFT v1               0.228         23.1%   3.303      74%        Hard binary labels → overconfident
SFT v2               0.240         21.6%   0.826      71%        Label smoothing fix
SFT v2 + Temp Cal    0.197         9.4%    0.582      71%        Temperature scaling (T=2.98)
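
To illustrate what the final row does, here is a minimal sketch of temperature scaling for a single binary probability. It assumes the temperature divides the log-odds of the YES probability; the page only reports T=2.98, so the exact formulation and the function name are assumptions.

```python
import math


def apply_temperature(p, temperature, eps=1e-6):
    """Soften a probability by dividing its log-odds by `temperature`.

    Assumption: scaling is applied to the logit of the YES probability.
    """
    p = min(max(p, eps), 1.0 - eps)  # keep the logit finite
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


for raw in (0.05, 0.50, 0.95):
    print(raw, round(apply_temperature(raw, 2.98), 3))
```

With T greater than 1, every forecast moves toward 50% without crossing it, so accuracy at the 50% threshold is unchanged while overconfidence is reduced. Under this formulation a raw 95% forecast shrinks to roughly 73%.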

Performance by Category

Brier Score across market categories (lower is better)

Category         Brier Score   n
Crypto           0.046         3
Token Launch     0.160         31
Politics         0.160         17
Price Movement   0.237         23
Other            0.239         14
Metaculus Meta   0.258         12

Baseline (uninformative 50% forecast): 0.250
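
Because the Brier score is a per-market average, the category scores should combine into the overall score when weighted by market count. The sketch below checks that using only the figures listed above; the variable names are illustrative.

```python
# (Brier score, number of markets) per category, copied from the list above.
category_scores = {
    "Crypto": (0.046, 3),
    "Token Launch": (0.160, 31),
    "Politics": (0.160, 17),
    "Price Movement": (0.237, 23),
    "Other": (0.239, 14),
    "Metaculus Meta": (0.258, 12),
}

total_n = sum(n for _, n in category_scores.values())
overall = sum(score * n for score, n in category_scores.values()) / total_n

print(total_n, round(overall, 3))  # 100 0.197
```

The count-weighted mean reproduces the headline 0.197 across all 100 markets. The Crypto figure rests on only 3 markets, so it carries little weight in that total.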

Methodology

Base Model: Llama 3.3 70B Instruct
Training: QLoRA SFT (4-bit, rank 32, 3 epochs)
Calibration: Temperature scaling (T=2.98, 5-fold CV)
Training Data: 800 resolved markets from Polymarket & Metaculus
Test Data: 100 held-out resolved markets (no data leakage)
Data Sources: Polymarket (price resolution) & Metaculus (community predictions)
Evaluation: Brier score, log loss, ECE, resolution, accuracy
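
As a rough illustration of the calibration step, the sketch below picks a temperature by grid search to minimize log loss on a set of forecasts. It is a simplification: the methodology cites 5-fold cross-validation, which is not reproduced here, and the grid range, function names, and synthetic data are all assumptions.

```python
import math


def scale(p, temperature, eps=1e-6):
    """Divide the log-odds of p by `temperature` (assumed formulation)."""
    p = min(max(p, eps), 1.0 - eps)
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of 0/1 outcomes under the forecasts."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def fit_temperature(probs, outcomes, candidates=None):
    """Return the candidate temperature with the lowest log loss."""
    if candidates is None:
        # Assumed search grid: 0.50 to 10.00 in steps of 0.01.
        candidates = [0.5 + 0.01 * i for i in range(951)]
    return min(
        candidates,
        key=lambda t: log_loss([scale(p, t) for p in probs], outcomes),
    )


# Synthetic, overconfident forecasts: 99% confidence but only 70% resolve YES.
raw_probs = [0.99] * 10
resolved = [1] * 7 + [0] * 3
best_t = fit_temperature(raw_probs, resolved)
print(best_t, round(scale(0.99, best_t), 3))
```

On overconfident inputs like these the fitted temperature comes out above 1 and pulls the forecasts toward the observed 70% frequency. A cross-validated version would repeat the fit on each fold's calibration split and aggregate the results.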

Sample Predictions (Resolved Markets)

Selected predictions from the 100-market test set. All markets have resolved.

Market                                             Category   Model   Consensus   Outcome   Result
USB close price higher on Dec 5 vs Nov 24?         Price      52%     55%         YES       Correct
Will US government shut down before Oct 2, 2025?   Politics   79%     74%         YES       Correct
OpenAI file S-1 with SEC before Dec 15, 2025?      Other      27%     5%          NO        Correct
Will Australia retain the Ashes 2025-26?           Other      73%     99%         YES       Correct
Ukraine extend martial law beyond Nov 5, 2025?     Politics   79%     90%         YES       Correct
Metaculus: Will Nvidia stock close 2025 higher?    Meta       73%     60%         YES       Correct
Bill Ackman beat politician returns in 2025?       Other      73%     4%          NO        Wrong
BLDR close price higher Dec 20 vs Dec 8?           Price      73%     52%         NO        Wrong
UN General Assembly condemn US re Venezuela?       Politics   79%     55%         NO        Wrong
Arsenal vs Man City match end in draw?             Other      27%     28%         YES       Wrong

Showing 10 of 100 test markets. Predictions are post-calibration probabilities.
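
The Result column can be reproduced from the Model and Outcome columns by applying the 50% threshold. The sketch below does that for the ten rows shown; the data is copied from the table, while the helper name and the tie-handling rule (exactly 50% counts as YES) are assumptions.

```python
# (market, model probability, outcome, listed result), copied from the table.
samples = [
    ("USB close price higher on Dec 5 vs Nov 24?", 0.52, "YES", "Correct"),
    ("Will US government shut down before Oct 2, 2025?", 0.79, "YES", "Correct"),
    ("OpenAI file S-1 with SEC before Dec 15, 2025?", 0.27, "NO", "Correct"),
    ("Will Australia retain the Ashes 2025-26?", 0.73, "YES", "Correct"),
    ("Ukraine extend martial law beyond Nov 5, 2025?", 0.79, "YES", "Correct"),
    ("Metaculus: Will Nvidia stock close 2025 higher?", 0.73, "YES", "Correct"),
    ("Bill Ackman beat politician returns in 2025?", 0.73, "NO", "Wrong"),
    ("BLDR close price higher Dec 20 vs Dec 8?", 0.73, "NO", "Wrong"),
    ("UN General Assembly condemn US re Venezuela?", 0.79, "NO", "Wrong"),
    ("Arsenal vs Man City match end in draw?", 0.27, "YES", "Wrong"),
]


def result_at_half(prob, outcome):
    """Label a forecast Correct if the side favored at 50% matches the outcome."""
    predicted_yes = prob >= 0.5  # assumption: exactly 50% counts as YES
    return "Correct" if predicted_yes == (outcome == "YES") else "Wrong"


mismatches = [m for m, p, o, r in samples if result_at_half(p, o) != r]
n_correct = sum(result_at_half(p, o) == "Correct" for _, p, o, _ in samples)
sample_brier = sum((p - (o == "YES")) ** 2 for _, p, o, _ in samples) / len(samples)

print(mismatches, n_correct, round(sample_brier, 3))
```

Every listed Result matches the threshold rule, with 6 of the 10 rows correct. These rows are a selection, so their accuracy and Brier score should not be expected to match the 100-market figures.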

Try MarketFlux on Your Markets

API access for calibrated probability predictions on prediction markets, trading, and quantitative finance.