GT-Bench - verifiable game-theory reasoning benchmark

GT-Bench is a compact benchmark for testing whether language models can solve exact game-theory problems, not just describe them. It generates 2x2 normal-form games and scores pure-strategy Nash equilibria deterministically, which makes the signal easy to verify and hard to hand-wave. The main run fine-tunes Qwen3.6-27B with Tinker/LoRA and lifts held-out accuracy from 87.6% to 99.6%. The repo also includes stress, robustness, adversarial, repeated-seed, and retention checks.
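To make "deterministic scoring" concrete, here is a minimal sketch of how pure-strategy Nash equilibria of a 2x2 game could be enumerated and an answer checked by exact set match. It is an illustration only, not the repo's actual code: the function names `pure_nash_equilibria` and `score`, the payoff-matrix layout, and the exact-match rule are all assumptions.

```python
from itertools import product


def pure_nash_equilibria(row_payoffs, col_payoffs):
    """Return all pure-strategy Nash equilibria of a 2x2 game.

    row_payoffs[r][c] and col_payoffs[r][c] are the payoffs to the row and
    column player when row plays r and column plays c (hypothetical layout).
    A cell is an equilibrium if neither player gains by deviating alone;
    ties count as "no gain" (weak inequality).
    """
    equilibria = []
    for r, c in product(range(2), range(2)):
        # Row player's only alternative is the other row, same column.
        row_ok = row_payoffs[r][c] >= row_payoffs[1 - r][c]
        # Column player's only alternative is the other column, same row.
        col_ok = col_payoffs[r][c] >= col_payoffs[r][1 - c]
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


def score(predicted, row_payoffs, col_payoffs):
    """Assumed scoring rule: correct only if the predicted set matches exactly."""
    return set(predicted) == set(pure_nash_equilibria(row_payoffs, col_payoffs))


# Prisoner's dilemma (0 = cooperate, 1 = defect): mutual defection only.
pd_row = [[-1, -3], [0, -2]]
pd_col = [[-1, 0], [-3, -2]]
print(pure_nash_equilibria(pd_row, pd_col))  # [(1, 1)]

# Matching pennies: no pure-strategy equilibrium exists.
mp_row = [[1, -1], [-1, 1]]
mp_col = [[-1, 1], [1, -1]]
print(pure_nash_equilibria(mp_row, mp_col))  # []
```

Because a 2x2 game has only four cells, the check is exhaustive and has no randomness, so any model answer can be verified against it mechanically.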
