PGN2FEN: A Benchmark for Evaluating LLM Chess Reasoning

Today, I’m releasing PGN2FEN — a new benchmark that tests the ability of language models to understand and simulate chess games by converting PGN move sequences into FEN board positions.

Why does this matter? Because chess is a well-defined, logic-heavy domain with strict rules and an enormous solution space. This makes it a useful proxy task for evaluating sequential reasoning and state tracking in LLMs.

The Task: PGN → FEN

PGN (Portable Game Notation) describes chess games as a sequence of human-readable moves. FEN (Forsyth–Edwards Notation) is a snapshot of the full game state at any point: piece placement, side to move, castling rights, en passant square, and move counters. Translating PGN to FEN is easy for a chess engine, but is a non-trivial test of internal reasoning, memory, and state tracking for LLMs.
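
To make the task concrete, here is what the engine-side conversion looks like with the python-chess library (the same library used in the PGN2FEN framework, as discussed below). This snippet is purely illustrative and is not part of the benchmark itself:

```python
import io

import chess.pgn

# A short PGN movetext fragment (the Ruy Lopez opening).
pgn = "1. e4 e5 2. Nf3 Nc6 3. Bb5"

# python-chess replays the moves and reports the resulting position as a FEN.
game = chess.pgn.read_game(io.StringIO(pgn))
board = game.board()
for move in game.mainline_moves():
    board.push(move)

print(board.fen())
# r1bqkbnr/pppp1ppp/2n5/1B2p3/4P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3
```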

Some desirable characteristics of the PGN2FEN benchmark task include:

  • PGN inputs can be sourced from vast, real-world chess datasets or generated synthetically for specific experiments. LLMs are exposed to large volumes of PGN data during pre-training, but these PGNs are rarely accompanied by FEN representations, so the conversion itself remains an unfamiliar test.
  • FEN outputs are deterministic, making benchmark curation easy to automate and correctness easy to verify programmatically.
  • The difficulty scales cleanly with the number of moves — longer sequences require more memory, consistency, and reasoning depth.
  • If necessary, task variants can reduce the difficulty by evaluating only certain subcomponents of the final FEN string, such as the piece placement field (see the sketch after this list).
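
Because a FEN splits cleanly into six space-separated fields, this kind of partial evaluation is straightforward. A minimal sketch, not necessarily how the benchmark codebase implements it:

```python
# Illustrative only: a FEN has six space-separated fields, so a
# reduced-difficulty variant can score just a subset of them.
FEN_FIELDS = (
    "piece_placement",
    "active_colour",
    "castling",
    "en_passant",
    "halfmove_clock",
    "fullmove_number",
)

def fen_fields(fen: str) -> dict[str, str]:
    """Split a FEN string into its six named fields."""
    return dict(zip(FEN_FIELDS, fen.split()))

def piece_placement_match(predicted: str, expected: str) -> bool:
    """Score only the board layout, ignoring the other five fields."""
    return (
        fen_fields(predicted)["piece_placement"]
        == fen_fields(expected)["piece_placement"]
    )
```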

Models are instructed not to leverage code when executing the task. Whether they strictly obey this is not entirely clear (see the Discussion section below).

What's in the Benchmark?

The benchmark data comprises 1,000 PGN sequences ranging from 1 to 100 halfmoves (10 sequences per halfmove count). These are derived from Chess World Cup games, yielding realistic positions that should align well with models' internalised knowledge of chess.
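
One straightforward way to construct this kind of dataset (a sketch of the general approach, not necessarily the exact pipeline used for PGN2FEN) is to truncate real games at each target halfmove count and record the ground-truth FEN with python-chess:

```python
import chess.pgn

def truncate_game(game: chess.pgn.Game, halfmoves: int) -> tuple[str, str] | None:
    """Return (truncated PGN movetext, ground-truth FEN) after `halfmoves`
    halfmoves, or None if the game is too short."""
    moves = list(game.mainline_moves())
    if len(moves) < halfmoves:
        return None

    board = game.board()
    tokens = []
    for i, move in enumerate(moves[:halfmoves]):
        if i % 2 == 0:
            tokens.append(f"{i // 2 + 1}.")  # move number before each White move
        tokens.append(board.san(move))       # SAN must be generated before pushing
        board.push(move)
    return " ".join(tokens), board.fen()
```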

Check out the PGN2FEN benchmark on GitHub

The benchmark codebase also includes tools for:

  • API client integrations for popular language model providers
  • Parsing, comparing, and analysing FENs (including partial matches)
  • Handling messy model outputs (e.g., noisy prose around the FEN; see the sketch after this list)
  • Analysis and visualisation of model results
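
For the messy-output case, extraction can be as simple as a permissive regular expression. The sketch below illustrates the general idea and is not the repository's actual implementation:

```python
import re

# Rough shape of a FEN: eight ranks of pieces/digits, side to move,
# castling rights, en passant square, halfmove clock, fullmove number.
FEN_RE = re.compile(
    r"[pnbrqkPNBRQK1-8]+(?:/[pnbrqkPNBRQK1-8]+){7}"
    r"\s+[wb]\s+(?:[KQkq]+|-)\s+(?:[a-h][36]|-)\s+\d+\s+\d+"
)

def extract_fen(model_output: str) -> str | None:
    """Pull the first FEN-shaped substring out of free-form model text."""
    match = FEN_RE.search(model_output)
    return match.group(0) if match else None

print(extract_fen("Sure! The final position is `8/8/8/8/8/8/8/4K2k w - - 0 60`."))
# 8/8/8/8/8/8/8/4K2k w - - 0 60
```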

What Do the Results Show?

PGN2FEN currently reports full correctness accuracy — that is, whether the model produces the full FEN string expected at the end of the move sequence.

Results are presented separately for:

  • Reasoning-capable models (e.g. OpenAI o3, Gemini 2.5 Pro, DeepSeek Reasoner)
  • Non-reasoning chat models (e.g. DeepSeek Chat, GPT-4.1 variants)

Reasoning Model Results

PGN2FEN benchmark results for reasoning language models

Full correctness accuracy (%):

| provider | model | 0-10 moves | 11-20 moves | 21-40 moves | 41-60 moves | 61-80 moves | 81-100 moves |
|----------|-------|------------|-------------|-------------|-------------|-------------|--------------|
| openai | o3-2025-04-16 | 99 | 91.7 | 95 | 90 | 95 | 96.4 |
| openai | o3-mini-2025-01-31 | 82 | 74 | 49 | 39.5 | 27 | 16 |
| deepseek | deepseek-reasoner | 82 | 22 | 7.5 | 9.5 | 3 | 6 |
| google | gemini-2.5-pro-preview-03-25 | 48 | 8 | 2 | 1 | 0 | 0 |
| openai | o4-mini-2025-04-16 | 28 | 19 | 17.5 | 25.8 | 35.8 | 42.5 |
| google | gemini-2.5-flash-preview-04-17 | 19 | 1 | 0 | 0 | 0 | 0 |

Non-Reasoning Model Results

PGN2FEN benchmark results for non-reasoning language models

Full correctness accuracy (%):

| provider | model | 0-10 moves | 11-20 moves | 21-40 moves | 41-60 moves | 61-80 moves | 81-100 moves |
|----------|-------|------------|-------------|-------------|-------------|-------------|--------------|
| google | gemini-2.0-flash-001 | 44 | 10 | 3.5 | 0.5 | 0.5 | 0 |
| google | gemini-2.0-flash-lite-001 | 36 | 7 | 1.5 | 0 | 0 | 0 |
| deepseek | deepseek-chat | 25 | 0 | 0 | 0 | 0 | 0 |
| openai | gpt-4.1-2025-04-14 | 20 | 1 | 0 | 0 | 0 | 0 |
| openai | gpt-4.1-mini-2025-04-14 | 17 | 0 | 0 | 0 | 0 | 0 |
| openai | gpt-4.1-nano-2025-04-14 | 2 | 0 | 0 | 0 | 0 | 0 |

Discussion

Perhaps the most obvious and unsurprising finding is the relative dominance of reasoning models on this task. OpenAI's o3 is the clear standout, maintaining >90% accuracy across all move lengths. No other model comes close. By contrast, the non-reasoning models totally collapse beyond 20 halfmoves, suggesting they lack the capacity to track long, structured state transitions.

What Makes OpenAI's Reasoning Models so Good?

A more surprising finding is the disparity between OpenAI's o3, o3-mini, and o4-mini models versus Google's Gemini 2.5 Pro and Flash.

OpenAI's deliberate obfuscation of their models' reasoning traces makes it difficult to intuit exactly what's going on under the hood. Inspecting the summarised thinking of o3 on a 90-halfmove PGN input, we don't see anything particularly impressive about its thought process. If anything, it seems rambling and disorganised, adding extraneous commentary like "White and Black are moving steadily, adjusting rooks, bishops, and pawns" to what ought to be a highly methodical exercise. Despite these inefficiencies, it manages to land on the correct answer.

o3-mini exhibits gradual performance degradation as the game length increases, which is consistent with the hypothesis that longer games are more challenging to convert from PGN to FEN. However, o4-mini displays almost the exact opposite trend, achieving its highest score on the 81-100 move range. This is suspicious.

One charitable explanation for o4-mini's unusual trend is that the ~20-60 halfmove range presents the most challenging positions. Relative to longer sequences, this midgame region may have more pieces to track and more complex castling and en passant situations. However, no other model shows the same pattern, with the possible exception of o3.

Might o4-mini be cheating? The prompt explicitly instructs the models not to solve the task with code, but none of the providers expose a way to strictly prevent this. Reasoning models are also extensively trained to excel on coding and math problems, so they have a natural proclivity to leverage code. It could be that, when presented with a sufficiently lengthy PGN, o4-mini (quite sensibly, albeit dishonestly) decides that writing some code is the best strategy.

I asked you not to do this, o4-mini

It's surprisingly difficult to verify whether the models are cheating, but there are some signs. In this example, which o4-mini answers correctly, we can certainly see that it attempts to programmatically convert the PGN input to FEN using python-chess (the same library used in the PGN2FEN framework). It's scuppered by the library not being available in its environment, but perhaps it subsequently ran pip install python-chess (is it able to??) and secretly re-executed the script. OpenAI's reasoning models are somewhat of a black box in this sense, and their propensity for lying and cheating has been noted elsewhere.

What's Next?

The current benchmark is already saturating for top-tier reasoning models. Extending the move sequences beyond 100 halfmoves is unlikely to help, because positions tend to simplify as games progress into their later stages.

Some ideas for pushing the difficulty of the PGN2FEN task further include:

  1. Random or adversarial PGN generation. It's well established that human chess grandmasters are less able to recall random chess positions than realistic ones, and LLMs are likely to exhibit the same pattern.
  2. Testing on Chess960 (aka Fischer Random Chess) games. The randomised starting position of the back-rank pieces pushes the models out of their training distribution, amplifies the challenge of tracking piece positions, and complicates the execution of chess mechanics such as castling. Both ideas are sketched below.
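
Both ideas are easy to prototype with python-chess. The sketch below is illustrative only (the function and its parameters are my own, not part of the benchmark): it plays uniformly random legal moves from either the standard start position or a random Chess960 start position.

```python
import random

import chess

def random_game(halfmoves: int, chess960: bool = False, seed: int = 0) -> tuple[str, str]:
    """Play `halfmoves` uniformly random legal moves and return
    (PGN movetext, final FEN)."""
    rng = random.Random(seed)
    board = (
        chess.Board.from_chess960_pos(rng.randrange(960))  # random back-rank setup
        if chess960
        else chess.Board()
    )
    tokens = []
    for i in range(halfmoves):
        legal = list(board.legal_moves)
        if not legal:  # checkmate or stalemate reached early
            break
        move = rng.choice(legal)
        if i % 2 == 0:
            tokens.append(f"{i // 2 + 1}.")  # move number before each White move
        tokens.append(board.san(move))
        board.push(move)
    return " ".join(tokens), board.fen()
```

For Chess960 games, the starting FEN would presumably also need to be supplied to the model alongside the movetext, since the back-rank arrangement cannot be reliably inferred from the moves alone.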

The validity of the benchmark results could be improved if it were possible to ensure that reasoning models do not use code to solve the problem. In the absence of formal mechanisms to prevent this, more aggressive prompting may help.