OpenAI vs Anthropic: A Chess Showdown Exposing LLM Limitations

At first glance, pitting the two best available large language models (LLMs) against each other in a chess tournament might not seem like a good idea. After all, they are not dedicated chess engines like Stockfish or Komodo, and their capabilities in this domain are limited, much like a parrot mimicking human speech. However, such a showcase can demonstrate the extent to which these models understand language, even if that fluency doesn’t necessarily translate into a deep understanding of the physical world or strong reasoning across different domains. As Meta’s AI chief Yann LeCun said, “We’re easily fooled into thinking they are intelligent because of their fluency with language, but really, their understanding of reality is very superficial.” If you blindly trust a language model without understanding its capabilities and limitations, you might find yourself starring in your own courtroom drama. It’s worth noting that those who use chess as an example often lack sufficient knowledge of the game. I am no exception to that rule, and I do not claim to be a subject matter expert on chess either.
Chess engines and LLMs approach the game differently. Chess engines are purpose-built to analyze positions, generate candidate moves, and select the best one using a combination of search algorithms, evaluation functions, and chess-specific heuristics; they rely on brute-force calculation and domain-specific knowledge to excel at the game. LLMs such as GPT-3 or GPT-4, on the other hand, are general-purpose language models trained on vast amounts of text data, including chess-related content. While they can generate human-like responses and exhibit some chess knowledge, they lack the specialized algorithms and optimizations of dedicated chess engines.
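To make the contrast concrete, here is a toy illustration, using the python-chess package, of the kind of evaluation function a real engine builds its search on top of. Actual engines evaluate far more than raw material and search many moves deep; nothing like this happens explicitly inside an LLM.

import chess

# A toy "evaluation function": score a position by raw material,
# from White's point of view. Real engines combine much richer
# evaluations with deep search; an LLM only predicts plausible text.
PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def material_score(board: chess.Board) -> int:
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

board = chess.Board()
for san in ["e4", "d5", "exd5"]:
    board.push_san(san)
print(material_score(board))  # 1: White is a pawn up after exd5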
Although it might be unrealistic to expect a simple language model to replicate the historic match between Garry Kasparov and Deep Blue, it’s crucial to acknowledge the significance of this 1997 event as a turning point in the development of artificial intelligence. The chess-playing supercomputer, Deep Blue, defeated the reigning world champion, Kasparov, in a six-game match, marking the first time a computer had beaten a world champion under standard time controls. This milestone showcased the potential for machines to surpass human intelligence in certain domains and sparked intense debates about the nature of intelligence, creativity, and the future of human-machine interaction. The legacy of the Kasparov-Deep Blue match continues to inspire and influence the fields of AI, chess, and the broader discourse on the relationship between humans and technology.
Game
After several rounds of play, both contestants prove unable to make legal moves within just a few turns. As per my made-up tournament rules, a player who makes three illegal move attempts is disqualified and the game is terminated. In this case, neither player demonstrates superiority over the other, as my unscientific testing yields inconclusive results. The inability of both language models to consistently generate valid chess moves highlights the limitations of their chess-playing capabilities and underscores the fundamental differences between these general-purpose models and specialized chess engines.
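For anyone curious how the disqualification rule can be enforced, the sketch below shows one way to wire it up. It assumes the python-chess library for legality checks and a pair of hypothetical ask_white / ask_black helpers that return a model's raw reply, so treat it as an outline rather than my exact tournament code.

import chess

MAX_ILLEGAL_ATTEMPTS = 3  # the made-up tournament rule

def play_game(ask_white, ask_black):
    """Alternate between two move-suggesting callables until the game ends
    or one side burns through its three illegal-move attempts.

    ask_white / ask_black are hypothetical helpers that take the game so far
    (as a space-separated SAN move list) and return a single move in SAN.
    """
    board = chess.Board()
    history = []
    while not board.is_game_over():
        ask = ask_white if board.turn == chess.WHITE else ask_black
        for attempt in range(MAX_ILLEGAL_ATTEMPTS):
            san = ask(" ".join(history)).strip()
            try:
                board.push_san(san)  # raises ValueError on an illegal or unparsable move
                history.append(san)
                break
            except ValueError:
                print(f"Illegal attempt {attempt + 1}: {san!r}")
        else:
            loser = "White" if board.turn == chess.WHITE else "Black"
            return f"{loser} is disqualified after {MAX_ILLEGAL_ATTEMPTS} illegal attempts."
    return board.result()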
Example 1:
1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Nxe4 6. Re1 d5 7. Nxe5 d4 8. Nxc6 bxc6 9. d3 ?
In this position, the Anthropic Opus model, playing Black, proposes moving the black knight to either c5 or d3. Both moves are illegal: the black knight now sits on e4 and is pinned against the black king on e8 by the white rook on e1, so it cannot leave the e-file at all, and d3 is not even a square the knight could reach with its L-shaped move. When I presented the same position to ChatGPT, it also failed to identify the illegality of the suggested moves, highlighting the limitations of these language models in understanding the intricacies and rules of chess.
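This is easy to verify mechanically. The snippet below replays the moves with the python-chess package (one common choice for this kind of validation; the exact library I used is described later) and confirms that both suggestions are rejected.

import chess

# Replay Example 1 up to White's 9. d3.
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6",
         "O-O", "Nxe4", "Re1", "d5", "Nxe5", "d4", "Nxc6", "bxc6", "d3"]
board = chess.Board()
for san in moves:
    board.push_san(san)

# The black knight on e4 is absolutely pinned against the king on e8 by the rook on e1.
print(board.is_pinned(chess.BLACK, chess.E4))  # True

# Both of the model's suggestions are rejected as illegal in this position.
for suggestion in ["Nc5", "Nd3"]:
    try:
        board.parse_san(suggestion)
    except ValueError as error:
        print(f"{suggestion}: {error}")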
Example 2:
1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Nc3 Bb4 5. d3 O-O 6. a3 Bxc3+ 7. bxc3 d6 8. d4 exd4 9. cxd4 Nxe4 10. Qd3 Re8 11. d5 Nc5+ 12. ?
In this position, the model playing White attempts to capture the black knight on c5 with its queen (Qxc5, a square the queen on d3 cannot even reach) and also tries to capture the black knight on c6 with its d-pawn (dxc6). Both moves are illegal because the white king is in check: 11...Nc5+ uncovered the rook on e8 against the king on e1. When a king is under attack, the player must first address the check by moving the king to a safe square, blocking the check with another piece, or capturing the attacking piece; making any other move while in check is not permitted. The model fails to recognize this and proceeds with moves that violate the rules, highlighting its shaky grasp of even the basic principles governing check and its inability to identify and respond to such situations.
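Again, a quick check with python-chess (a sketch, not the exact code used for the games) shows that White is in check and that neither attempted move appears among the legal options.

import chess

# Replay Example 2 up to Black's 11...Nc5+, the discovered check.
moves = ["e4", "e5", "Nf3", "Nc6", "Bc4", "Nf6", "Nc3", "Bb4",
         "d3", "O-O", "a3", "Bxc3+", "bxc3", "d6", "d4", "exd4",
         "cxd4", "Nxe4", "Qd3", "Re8", "d5", "Nc5"]
board = chess.Board()
for san in moves:
    board.push_san(san)

print(board.is_check())  # True: the rook on e8 checks the white king on e1

# Every legal reply must deal with the check; Qxc5 and dxc6 are not among them.
print(sorted(board.san(move) for move in board.legal_moves))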
Last Game: OpenAI vs Azad Djan
After numerous unsuccessful attempts, I decided to change my approach. I engaged in a game with ChatGPT on its website and then loaded the same game on chess.com for further analysis and visualization. In the game below, I am playing as the white pieces.

OpenAI attempted illegal moves repeatedly, leading to its disqualification from the tournament. In the absence of a clear winner, I have humorously declared myself the champion, highlighting the limitations of language models in chess and the lighthearted nature of this unscientific experiment.
Coding Details:
For this experiment, I utilized a Google Colab notebook along with the OpenAI, Anthropic, and Google Gemini client libraries. Additionally, I employed a chess-specific library to handle visualization, validate the legality of moves, and enforce the other rules of the game.
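Below is a condensed sketch of that setup. The model names are placeholders, the Gemini client follows the same pattern and is omitted for brevity, and the python-chess calls stand in for whichever chess library you prefer.

# pip install openai anthropic python-chess
import chess
import chess.svg
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def ask_openai(prompt_text: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt_text}],
    )
    return response.choices[0].message.content

def ask_anthropic(prompt_text: str) -> str:
    message = anthropic_client.messages.create(
        model="claude-3-opus-20240229",  # placeholder model name
        max_tokens=20,
        messages=[{"role": "user", "content": prompt_text}],
    )
    return message.content[0].text

# The chess library handles move legality and board rendering in a Colab cell.
board = chess.Board()
board.push_san("e4")
svg = chess.svg.board(board)  # display with IPython.display.SVG(svg)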
Prompt:
Throughout the process, I found myself repeatedly modifying the prompt to address various issues. Initially, the language model struggled with standard algebraic notation, a common way to record chess moves. Then, it encountered difficulties in understanding whose turn it was and which color it was supposed to play. To overcome these challenges, I had to refine my prompt each time, adapting it to the specific requirements of the situation.
After several iterations, I finally arrived at the prompt shared below. Interestingly, during a meeting with OpenAI last week, one of their representatives expressed his opinion on the concept of ‘prompt engineering.’ He candidly described it as a bug, suggesting that the need for extensive prompt modification and adaptation is an inherent flaw in the current state of language models.
prompt = """
You are a chess grandmaster playing as {color}.
Notation: {board}
Response must be the next move only, without any comments. If the notation is empty, assume it is a new game and play.
Carefully analyze the position and make a legal move. Follow this chain of thought:
1. **Evaluate the Board**: Assess the current position of all pieces on the board. Identify any immediate threats to your pieces and any opportunities to capture opponent pieces.
2. **Identify Possible Moves**: List all legal moves available for the current position.
3. **Analyze Move Consequences**: For each potential move, consider the opponent's possible responses. Evaluate the board after each move to determine if it improves your position, captures an opponent's piece, protects your pieces, or puts the opponent in check.
4. **Select the Best Move**: Choose the move that provides the greatest advantage or least disadvantage, following standard chess principles such as controlling the center, developing pieces, and ensuring king safety.
5. **Double-Check Legality**: Ensure that the chosen move is legal according to chess rules.
Use strictly Standard Algebraic Notation for your response. The rules for Standard Algebraic Notation are:
1. **Pieces**: The first letter of the piece type is used (K for King, Q for Queen, R for Rook, B for Bishop, N for Knight). Pawns are not indicated by a letter, only by their destination square.
2. **File and Rank**: The move is indicated by the destination square (e.g., e4, d5).
3. **Captures**: Indicated by 'x' before the destination square (e.g., Bxe4 means a Bishop captures on e4).
4. **Checks**: Indicated by '+' after the move (e.g., e4+ means a check by moving a pawn to e4).
5. **Checkmate**: Indicated by '#' after the move (e.g., Qh5# means a checkmate by moving the Queen to h5).
6. **Castling**: Indicated by 'O-O' for kingside castling and 'O-O-O' for queenside castling.
7. **Promotion**: Indicated by '=' followed by the piece to which the pawn is promoted (e.g., e8=Q means a pawn moves to e8 and promotes to a Queen).
Do not add any comments.
Format your response accordingly:
"""
In conclusion, this unscientific experiment highlights the limitations of language models in understanding and adhering to the rules of chess. Despite their impressive language capabilities, they lack the specialized expertise of dedicated chess engines. However, the development of AI Agents in the near future may bridge this gap, potentially enabling AI systems to master complex domains like chess. The success of AI Agents in this regard would be a testament to the progress of AI systems as a whole, rather than an indication of the inherent abilities of language models. While the amusing outcome of this experiment serves as a reminder of the current limitations of LLMs, it also points to the exciting possibilities that lie ahead as AI continues to evolve.
I am the champion!