GT-Bench - verifiable game-theory reasoning benchmark

GT-Bench is a compact benchmark for testing whether language models can solve exact game-theory problems, not just describe them. It generates 2x2 normal-form games and scores pure-strategy Nash equilibria deterministically, which makes the signal easy to verify and hard to hand-wave. The main run fine-tunes Qwen3.6-27B with Tinker/LoRA and lifts held-out accuracy from 87.6% to 99.6%, with stress, robustness, adversarial, repeated-seed, and retention checks in the repo.
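As an illustration of why this kind of scoring is deterministic, here is a minimal sketch of a pure-strategy Nash equilibrium checker for a 2x2 normal-form game. It is not taken from the GT-Bench repo: the function names `pure_nash_equilibria` and `score_answer`, the payoff-matrix layout, the use of weak inequalities for best responses, and the exact-set-match scoring rule are all assumptions made for this example.

```python
from itertools import product


def pure_nash_equilibria(row_payoffs, col_payoffs):
    """Return all pure-strategy Nash equilibria of a 2x2 game.

    row_payoffs[r][c] and col_payoffs[r][c] are the payoffs to the row and
    column player when row plays r and column plays c (hypothetical layout).
    A cell is an equilibrium if neither player gains by deviating alone.
    """
    equilibria = []
    for r, c in product(range(2), range(2)):
        other_r, other_c = 1 - r, 1 - c
        # Row player cannot do better by switching rows, column held fixed.
        row_ok = row_payoffs[r][c] >= row_payoffs[other_r][c]
        # Column player cannot do better by switching columns, row held fixed.
        col_ok = col_payoffs[r][c] >= col_payoffs[r][other_c]
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


def score_answer(predicted, row_payoffs, col_payoffs):
    """Assumed scoring rule: correct only if the predicted set matches exactly."""
    return set(predicted) == set(pure_nash_equilibria(row_payoffs, col_payoffs))


# Prisoner's dilemma (0 = cooperate, 1 = defect): mutual defection is the only equilibrium.
pd_row = [[3, 0], [5, 1]]
pd_col = [[3, 5], [0, 1]]
pd_result = pure_nash_equilibria(pd_row, pd_col)

# Matching pennies: no pure-strategy equilibrium exists.
mp_row = [[1, -1], [-1, 1]]
mp_col = [[-1, 1], [1, -1]]
mp_result = pure_nash_equilibria(mp_row, mp_col)
```

Because the check is a fixed comparison over four cells, any model answer can be graded without a judge model, which is the property the description above relies on.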