Bridging the Gap between Expert and Language Models:
Concept-guided Chess Commentary Generation and Evaluation
Abstract
Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, explaining or commenting on their decisions remains under-explored, although it is important for human education and model explainability. The outputs of expert models are accurate but difficult for humans to interpret. On the other hand, large language models (LLMs) produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative case of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary that is accurate, informative, and fluent.
Jaechang Kim1, Jinmin Goh1, Inseok Hwang1, Jaewoong Cho2, Jungseul Ok1 1POSTECH, 2KRAFTON
1 Introduction
Artificial intelligence (AI) has achieved superhuman performance in various decision-making tasks, particularly in abstract strategy games like chess and Go. Milestones such as Deep Blue’s victory over the world chess champion Campbell et al. (2002) and AlphaGo’s defeat of top human Go players Silver et al. (2017) highlight AI’s capabilities in solving complex problems. While these expert models deliver highly accurate decisions, they often lack interpretability, which is critical for human education and trust in AI systems. The strategic insights and rationales behind decisions are often explained through natural language commentary Chernev (2003); Polgar (2014). Large language models (LLMs) exhibit outstanding performance in generating fluent natural language. However, LLMs often struggle with hallucinations due to their limited capability in complex decision-making and lack of domain-specific knowledge.
We aim to bridge the gap between expert and language models. Specifically, we focus on the task of chess commentary generation to explain given decisions. Although chess is a rich testbed with extensive datasets and prior studies Zang et al. (2019); Lee et al. (2022); Feng et al. (2023), chess commentary generation poses two main challenges: (i) producing accurate and insightful commentary, which requires deep chess knowledge and linguistic ability, and (ii) developing evaluation metrics to assess commentary quality, which has been overlooked in previous research.
Although language models can generate fluent natural language, they lack the chess-specific knowledge required for chess commentary generation. Even a model Feng et al. (2023) trained on chess-related data struggles with reasoning about and understanding complex positions. One promising approach is to integrate expert models with language models. However, prior attempts Zang et al. (2019); Lee et al. (2022) that directly feed the decision-making process of expert models to language models are inadequate, because that process is hard for language models to interpret.
To address these limitations, we introduce an effective approach using concept-based explanations of expert models. By extracting and prioritizing concepts that the expert model focuses on, we guide the language model to concentrate on the most important aspects of the game. This results in commentary that is both linguistically fluent and strategically insightful. Figure 1 illustrates previous approaches and our approach. Our experiments demonstrate that our approach achieves human-level correctness in commentary generation, while outperforming baselines and human-generated comments in informativeness (relevance, completeness) and linguistic quality (clarity, fluency).
Evaluating chess commentary generation is another challenging task. Previous works Jhamtani et al. (2018); Zang et al. (2019); Lee et al. (2022) rely on similarity-based metrics such as BLEU, which are insufficient due to the inherently diverse nature of commentary. Different commentators may focus on distinct aspects of a position, such as attack strategies or defensive plans. In tasks like summarization or translation, which share the same challenges, LLM-based evaluation metrics Zhong et al. (2022); Liu et al. (2023) have been proposed to assess multiple dimensions. We adapt G-Eval Liu et al. (2023) by incorporating expert model guidance for chess knowledge. We measure the commentary’s informativeness (relevance, completeness) and linguistic quality (clarity, fluency). Through our experiments, we show that our proposed method correlates well with human judgments, offering a more reliable metric for commentary evaluation.
Our contributions are as follows:
- We propose an approach that integrates expert models with LLMs through concept-based explanations, facilitating transparent decision-making in chess commentary generation.
- We develop a prioritization mechanism that highlights important concepts and an LLM inference technique that enables the model to understand moves with concept guidance.
- We introduce and validate an LLM-based evaluation metric to assess the quality of chess commentary across multiple dimensions.
2 Related work
Chess commentary generation
Chess commentary generation is the task of generating a comment for a chess move. Jhamtani et al. (2018) first address the task by utilizing web-crawled data to form a chess commentary dataset, framing commentary generation as a sequence prediction problem. Building on this, Zang et al. (2019) incorporate domain-specific chess knowledge using internal chess models, improving the quality and contextual relevance of generated comments. Lee et al. (2022) integrate BART Lewis et al. (2020) and an external chess engine for more reliable move evaluation. However, their system classifies moves into predefined categories (e.g., excellent, good, inaccuracy, mistake, blunder), without a deeper understanding of the model’s decision-making process. In contrast, we leverage concept-based explanation to extract chess concepts from an expert model to understand the rationale behind the decision.
Concept-based explanation in chess
Concepts are high-level abstractions commonly shared within a community, enabling efficient communication. In chess, concepts such as "king safety" (i.e., all potential threats against the king) condense complex strategies into understandable terms, allowing players to communicate effectively without lengthy explanations. These concepts are understandable to both humans and language models, serving as a bridge between human intuition and neural networks. Concept-based explanations aim to make a model interpretable by aligning its internal decision-making process with these shared concepts, assuming that such concepts are linearly embedded in the representation space Kim et al. (2018); Alain and Bengio (2016); McGrath et al. (2022). This assumption is validated in chess domains Pálsson and Björnsson (2023); McGrath et al. (2022) for chess expert models like Stockfish Romstad et al., AlphaZero Silver et al. (2018), and their open-source versions, such as LeelaChessZero Authors (2024).
Prioritization of concepts
Yuksekgonul et al. (2023) train a post-hoc concept bottleneck model, and the classifier following the concept bottleneck model is directly interpreted as the global importance of concepts for a class. However, they focus on finding global concept importance per class, without addressing the varying significance of concepts for individual inputs. We address prioritization of concepts for individual inputs, or local importance, to determine the influence of each concept in specific situations.
Evaluation of natural language generation
Classical evaluation metrics for natural language generation (NLG) are based on similarity. Common metrics are BLEU Papineni et al. (2002) and ROUGE Lin (2004). However, these metrics fail to assess content quality Reiter and Belz (2009) and syntactic correctness Stent et al. (2005), and are insufficient to measure the reliability of NLG systems. Zhang et al. (2020); Zhao et al. (2019) compare similarity in the text embedding space to better measure semantic similarity.
Recently, moving beyond similarity, Yuan et al. (2021); Mehri and Eskenazi (2020) assess generated natural language in multiple dimensions, and Zhong et al. (2022); Liu et al. (2023) evaluate in multiple dimensions using language models. Using LLMs for evaluation is now common, and such evaluation methods are known to align with human evaluation, sometimes more closely than human evaluators agree with each other Rafailov et al. (2024); Chen et al. (2023). However, LLM-based evaluators have focused on summarization and translation tasks, and they still lack the domain-specific knowledge required for evaluating chess commentary.
Evaluating chess commentary is challenging due to its diverse nature, where commentaries on the same move may vary significantly depending on the focus, such as attack strategies, defensive plans, or comparison with other moves. Chess knowledge is essential for evaluating the correctness and relevance of these commentaries. Previous chess commentary studies Jhamtani et al. (2018); Zang et al. (2019); Lee et al. (2022) use classical metrics such as BLEU, ROUGE, or perplexity, but these metrics fall short for chess commentary, as they do not evaluate with domain-specific knowledge. While manual evaluation by human experts remains ideal, we propose an automatic evaluation method leveraging an LLM with chess knowledge.
3 Method: generation and evaluation
We propose two methods to address chess commentary generation (Section 3.1) and chess commentary evaluation (Section 3.2).
3.1 Concept-guided commentary generation
We propose Concept-guided Chess Commentary generation (CCC), which is a method for generating chess commentary by leveraging a chess expert model and its concept-based explanations. The method involves two key steps: 1) extracting concept vectors from a chess expert model (Section 3.1.1); and 2) generating commentary via an LLM using prioritized concepts that explain the given position and movement (Section 3.1.2). Figure 2 provides an overview of the proposed method.
3.1.1 Concept vector extraction
To make a chess expert model interpretable, we extract concept vectors that correspond to key concepts in chess. We follow a common approach Kim et al. (2018); Yuksekgonul et al. (2023) involving two steps: preparing a dataset for concept learning and extracting concept vectors by training a linear classifier. The concepts we focus on are adopted from Stockfish 8, a classical chess engine that can evaluate positions for their relevance to specific concepts (see Table 6). We collect 200,000 chess positions from the Lichess open database (https://database.lichess.org/#evals) and use Stockfish 8 to assign a score reflecting how strongly each position relates to these concepts. We then label the top 5% of positions with the highest scores as positive samples and the bottom 5% with the lowest scores as negative samples. This process results in a dataset of 20,000 positions for each concept, split equally between positive and negative samples. We employ LeelaChessZero T78, an open-source neural network-based chess model similar to AlphaZero, for extracting concept vectors. For the representation space, we use the final layer before the policy and value heads (layer 40). We then train a linear Support Vector Machine (SVM) (Cortes and Vapnik, 1995) to classify these samples. The resulting normal vector of the SVM classification boundary serves as the concept vector, and the distance from this boundary determines the concept score for any input position. This score quantifies how strongly a given board state aligns with the extracted concept.
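The extraction procedure above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the 2-D toy "activations", the stand-in engine scores, the helper names, and the simple subgradient-descent trainer (used here in place of an off-the-shelf linear SVM solver) are all assumptions made for illustration. The sketch labels the highest- and lowest-scoring positions, fits a linear classifier on them, and uses the signed distance from the decision boundary as the concept score.

```python
import random

def label_extremes(scores, fraction):
    """Return (positive, negative) index lists: the top and bottom `fraction` by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, int(len(scores) * fraction))
    return order[-k:], order[:k]

def train_linear_svm(xs, ys, epochs=100, lr=0.05, reg=0.01, seed=0):
    """Subgradient descent on the L2-regularized hinge loss; labels are +1 / -1."""
    rng = random.Random(seed)
    w, b = [0.0] * len(xs[0]), 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * (sum(wj * xj for wj, xj in zip(w, xs[i])) + b)
            w = [wj * (1.0 - lr * reg) for wj in w]  # weight decay from the L2 term
            if margin < 1.0:  # hinge loss is active: move toward the sample
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += lr * ys[i]
    return w, b

def concept_score(w, b, x):
    """Signed distance of x from the decision boundary w.x + b = 0."""
    norm = sum(wj * wj for wj in w) ** 0.5
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm

# Toy stand-ins (hypothetical): 2-D "activations" and an engine score
# that tracks the first coordinate only.
activations = [((i - 19.5) / 10, ((i * 7) % 10 - 4.5) / 10) for i in range(40)]
engine_scores = [a[0] for a in activations]

pos, neg = label_extremes(engine_scores, fraction=0.25)
xs = [activations[i] for i in pos + neg]
ys = [1] * len(pos) + [-1] * len(neg)
concept_vector, bias = train_linear_svm(xs, ys)
```

In this toy setup the learned concept vector points mainly along the first coordinate, which is the direction the stand-in engine score depends on, and positive samples receive higher concept scores than negative ones.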
3.1.2 Chess comment generation with an expert model and extracted concepts
Prioritization of concepts
Given a chess position and a specific move, our goal is to identify the concepts most relevant to explaining that movement. For the chess position, we compute the score for each concept by taking the dot product between the expert model representation of the position and the extracted concept vectors. These concept scores reflect how strongly each concept is reflected in the current position. To prioritize concepts, we compare the concept scores before and after the move. By analyzing the differences between pre-move and post-move scores, we identify which concepts are most influenced by the move. This allows us to assign priority to the concepts that explain the impact of the move.
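A minimal sketch of this prioritization step follows. The concept names, the three-dimensional toy representations, and the choice to rank by the absolute size of the change are assumptions for illustration; the text only states that pre-move and post-move scores are compared.

```python
def concept_scores(representation, concept_vectors):
    """Dot product between a position representation and each concept vector."""
    return {
        name: sum(r * v for r, v in zip(representation, vec))
        for name, vec in concept_vectors.items()
    }

def prioritize(before, after, concept_vectors, top_k=3):
    """Rank concepts by how much their score changes across the move."""
    pre = concept_scores(before, concept_vectors)
    post = concept_scores(after, concept_vectors)
    deltas = {name: post[name] - pre[name] for name in concept_vectors}
    # Assumption: the largest absolute change gets the highest priority.
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top_k]

# Hypothetical concept vectors and representations, for illustration only.
vectors = {
    "material": [1.0, 0.0, 0.0],
    "white_kingsafety": [0.0, 1.0, 0.0],
    "black_threats": [0.0, 0.0, 1.0],
}
before_move = [0.2, 0.5, 0.1]
after_move = [0.2, -0.4, 0.4]

ranked = prioritize(before_move, after_move, vectors)
```

Here the move leaves the material score unchanged, sharply lowers the king-safety score, and slightly raises the threat score, so king safety is ranked first. The sign of each difference is kept so that a downstream prompt could state whether the concept improved or worsened.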
Commentary generation via LLM
We generate chess commentary using an LLM and a chess expert model. Although a language model understands chess-specific notations and terms, it lacks the ability to perform chess-specific reasoning and complex analysis, which can result in hallucination. By integrating the chess expert model output, the LLM determines whether to focus on advantageous or disadvantageous aspects. However, since the chess expert model output consists of scalar values, the LLM can still generate incorrect comments. Concept-based explanation guides the LLM to focus on critical aspects. Figure 3 shows a typical example of a concept-guided comment.
To enhance the reasoning ability of the LLM, we employ few-shot prompting, Chain-of-Thought (CoT) prompting Wei et al. (2022), and chess-specific information. This approach provides the LLM with a deeper understanding of chess positions and guards against the use of wrongly prioritized concepts. Additionally, we enumerate all existing attacks on opponent pieces to prevent the LLM from mentioning non-existent pieces or illegal moves.
3.2 Automatic evaluation of commentary
Our evaluation approach, termed GCC-Eval, modifies and extends G-Eval to better address the specific challenges of evaluating chess commentary. The core components of GCC-Eval are: (i) multi-dimensional evaluation by an LLM, (ii) expert model evaluation for chess knowledge, (iii) Auto-CoT for score-only output, and (iv) weighted summation for non-integer scores. Note that our contributions are on the first and second aspects to ensure accurate chess commentary evaluation, focusing on informativeness and linguistic quality.
Evaluation dimensions
The evaluation covers four dimensions: relevance, completeness, clarity, and fluency. While clarity and fluency are general linguistic measures, relevance and completeness require a deep understanding of chess. To address this, we employ an expert model to augment the LLM’s capabilities when scoring relevance and completeness. This integration ensures that the commentary is not only linguistically sound but also informative from a domain-expert perspective.
The scoring prompts, including the expert evaluation and Auto-CoT reasoning, are described in Appendix A. For score computation, we adopt a weighted summation of score probabilities as follows:
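\[ \mathrm{score} = \sum_{i=1}^{n} p(s_i) \cdot s_i \tag{1} \]
where $s_1, \dots, s_n$ are the candidate scores defined in the prompt and $p(s_i)$ is the probability the LLM assigns to score $s_i$.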
This method allows for non-integer scores, capturing subtle nuances in the evaluation that would be missed by integer-only scoring schemes.
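As a small worked example of the weighted summation, the sketch below computes a probability-weighted score from a hypothetical distribution over the five candidate scores. The probabilities are made up for illustration, and the renormalization step is an added assumption for the case where the probabilities of the candidate scores do not sum exactly to one.

```python
def weighted_score(score_probabilities):
    """Probability-weighted average of candidate scores.

    `score_probabilities` maps each candidate score to the probability the
    evaluator assigns to it. Dividing by the total is an assumption that
    handles distributions which do not sum exactly to 1.
    """
    total = sum(score_probabilities.values())
    return sum(s * p for s, p in score_probabilities.items()) / total

# Hypothetical probabilities over a 1-5 scale (not taken from the paper).
example = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}
result = weighted_score(example)  # 0.05 + 0.20 + 0.60 + 1.60 + 1.25 = 3.70
```

An integer-only scheme would report 4 for this distribution (the most likely score), whereas the weighted sum gives 3.70 and so reflects the probability mass placed on lower scores.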
Comment generation methods | Correctness | Relevance | Completeness | Clarity | Fluency | Words per comment |
Reference | 0.62 | 0.52 | 0.30 | 0.60 | 0.62 | 15.6 |
GAC Jhamtani et al. (2018) | 0.63 | 0.46 | 0.15 | 0.66 | 0.64 | 8.9 |
GPT-4o | 0.36 | 0.49 | 0.40 | 0.72 | 0.84 | 27.1 |
GPT-4o + expert | 0.43 | 0.56 | 0.49 | 0.72 | 0.85 | 26.2 |
GPT-4o + expert + concept (CCC, ours) | 0.60 | 0.67 | 0.59 | 0.80 | 0.91 | 28.5 |
Types of errors | GPT-4o | GPT-4o + expert | GPT-4o + expert + concept |
Referring illegal move or non-existing pieces | 0.46 | 0.28 | 0.20 |
Wrong understanding of tactical/immediate advantage | 0.46 | 0.40 | 0.26 |
Wrong understanding of positional/long-term advantage | 0.28 | 0.26 | 0.28 |
Wrong evaluation of the move/position | 0.32 | 0.30 | 0.34 |
Metrics | Fleiss’ Kappa |
Correctness | 0.5393 |
Relevance | 0.2448 |
Completeness | 0.2449 |
Clarity | 0.1782 |
Fluency | 0.2328 |
Metrics | Correctness | | Relevance | | Completeness | | Clarity | | Fluency | |
BLEU-1 | 0.17 | 0.10 | -0.06 | 0.02 | -0.25 | -0.07 | -0.08 | 0.01 | -0.36 | -0.16 |
ROUGE-1 | 0.04 | -0.01 | -0.19 | -0.08 | -0.29 | -0.18 | -0.18 | -0.14 | -0.29 | -0.17 |
ROUGE-2 | 0.15 | 0.03 | -0.10 | -0.03 | -0.16 | -0.05 | -0.02 | 0.03 | -0.15 | 0.00 |
ROUGE-L | 0.08 | 0.01 | -0.18 | -0.08 | -0.29 | -0.18 | -0.16 | -0.14 | -0.29 | -0.17 |
GCC-Eval | - | - | 0.40 | 0.24 | 0.56 | 0.39 | 0.44 | 0.23 | 0.55 | 0.38 |
Comment generation methods | Relevance | Completeness | Clarity | Fluency |
Reference | 0.51 | 0.25 | 0.47 | 0.72 |
GAC Jhamtani et al. (2018) | 0.47 | 0.14 | 0.39 | 0.81 |
GPT-4o | 0.79 | 0.48 | 0.85 | 0.95 |
GPT-4o + expert | 0.81 | 0.49 | 0.75 | 0.90 |
GPT-4o + expert + concept (CCC, ours) | 0.89 | 0.54 | 0.88 | 1.00 |
Concepts | Accuracy | Precision | Recall |
Material | 0.93 | 0.93 | 0.94 |
Imbalance | 0.80 | 0.73 | 0.93 |
Pawns | 0.84 | 0.81 | 0.90 |
White Knights | 0.91 | 0.87 | 0.96 |
Black Knights | 0.91 | 0.87 | 0.97 |
White Bishop | 0.77 | 0.73 | 0.87 |
Black Bishop | 0.75 | 0.71 | 0.83 |
White Rooks | 1.00 | 1.00 | 1.00 |
Black Rooks | 0.99 | 0.99 | 1.00 |
White Queens | 0.74 | 0.71 | 0.79 |
Black Queens | 0.81 | 0.84 | 0.77 |
White Mobility | 0.99 | 0.99 | 1.00 |
Black Mobility | 0.98 | 0.96 | 0.99 |
White Kingsafety | 0.96 | 0.97 | 0.94 |
Black Kingsafety | 0.94 | 0.96 | 0.91 |
White Threats | 0.93 | 0.90 | 0.96 |
Black Threats | 0.93 | 0.90 | 0.97 |
White Space | 1.00 | 1.00 | 1.00 |
Black Space | 1.00 | 1.00 | 1.00 |
White Passedpawns | 0.98 | 0.98 | 0.98 |
Black Passedpawns | 0.92 | 0.91 | 0.94 |
4 Experiments
4.1 Experimental settings
Dataset
We evaluate our model using the Chess Commentary dataset introduced by Jhamtani et al. (2018). This dataset contains full chess games accompanied by user-generated commentary on specific moves, collected from an online chess forum (https://gameknot.com/). Following the train/valid/test split introduced by Jhamtani et al. (2018), we use only the test set for our experiments. Because the pre-processing code is unavailable, we manually align the raw data with the pre-processed data to ensure a fair comparison with GAC. Additionally, we exclude comments that cover multiple moves for simplicity of analysis.
Baselines
We compare the following methods:
- Reference: The reference comments from the GameKnot dataset.
- GAC Jhamtani et al. (2018): An LSTM model trained on the GameKnot dataset for generating chess commentary.
- GPT-4o: The GPT-4o model without expert model evaluations or concept guidance.
- GPT-4o + expert: This is the same GPT-4o model but augmented with evaluations from a chess expert model. Note that Lee et al. (2022) use BART with a chess expert model; GPT-4o + expert is a stronger variant of that approach because it uses a more powerful language model and a sufficient expert model.
4.2 Human evaluation
Human evaluation settings
We conduct a manual human evaluation to assess the quality of comments. For the reliability of the evaluation, we ensure every participant possesses sufficient chess knowledge to evaluate chess comments. Specifically, we recruit five participants from the university community and SNS. Each participant has a chess.com rapid rating above 1500, which is in the 99.51st percentile among chess players (retrieved in Oct 2024 from https://www.chess.com/leaderboard/live/rapid); the participants’ average rating is 1776. The human evaluation is conducted in a within-participant setup. For each move, each participant evaluates five versions of comments generated by five methods (i.e., four baselines and CCC), where the order of methods is randomized. A total of 50 moves are evaluated by the participants (i.e., a total of 250 comments). The evaluation takes approximately four hours to complete. Each participant is compensated with an amount equivalent to 73 USD. Our university IRB approved the evaluation plan. Appendix C summarizes the details of the human evaluation, including the instructions and questions used.
During the evaluation, participants are presented with a chessboard displaying a specific move, marked with a blue arrow. Alongside the moves, the corresponding commentaries are provided. Each participant is asked to rate the commentary across six questions: five evaluation metrics and one question for categorizing the type of incorrectness when applicable. The evaluated metrics are: correctness, relevance, completeness, clarity, and fluency. Relevance and completeness evaluate how informative and insightful the comment is, and clarity and fluency evaluate how linguistically natural it is. Relevance, completeness, clarity, and fluency are assessed using a five-point Likert scale, while correctness is evaluated using a three-point Likert scale, as the correctness of a comment is closer to a binary decision than to a scaled question. For clear presentation, the scores are rescaled to a range of 0 to 1.
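The exact rescaling formula is not given in the text. One natural choice, assumed here purely for illustration, is a min-max mapping that sends the lowest rating to 0 and the highest to 1, which makes the three-point and five-point scales directly comparable.

```python
def rescale_likert(rating, points):
    """Map a rating on a 1..points Likert scale to the range [0, 1].

    Assumption: min-max rescaling, so 1 -> 0.0 and `points` -> 1.0.
    """
    if not 1 <= rating <= points:
        raise ValueError("rating must lie between 1 and the number of scale points")
    return (rating - 1) / (points - 1)

five_point = [rescale_likert(r, 5) for r in range(1, 6)]   # relevance, completeness, clarity, fluency
three_point = [rescale_likert(r, 3) for r in range(1, 4)]  # correctness
```

Under this assumption the five-point scale maps to 0, 0.25, 0.5, 0.75, 1 and the three-point scale to 0, 0.5, 1.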
Main results
Table 1 presents the results of the human evaluation. Our proposed method, CCC, achieves the highest scores in all metrics except correctness, where it ranks second among the generation methods. Also, CCC outperforms the reference comments in every metric except correctness, where it remains comparable to the reference (0.60 vs. 0.62). The reference comments, collected from online sources, often contain grammatical mistakes and informal language, underscoring the limitation of similarity-based evaluation metrics. This highlights the need for evaluation metrics beyond similarity, especially when the quality of the reference comments is suboptimal. The use of expert models and concept guidance contributes significantly to the overall performance improvement, as evidenced by the higher scores across most metrics. While GPT-4o + expert shows only a slight improvement in correctness, it generates more detailed explanations, which in some cases lead to minor factual inaccuracies in the details, as illustrated in Figure 3. Although GAC exhibits the highest correctness, slightly outperforming CCC, we observe that this comes at the cost of less detailed explanations: its comments tend to be brief and therefore less informative in general, leading to lower completeness scores and shorter comments.
Detailed analysis
Table 2 provides a detailed analysis of the types of errors. The usage of the expert model and concept reduces simple errors, but errors requiring comprehensive understanding remain within the margin of error.
To validate the consistency of the human evaluation, we calculate inter-annotator agreement using Fleiss’ Kappa Fleiss and Cohen (1973) of ranks across different methods. Table 3 reports the agreement of the participants. The agreement for correctness is 0.54, indicating moderate agreement. This is notably higher than for the other metrics, suggesting that chess experts disagree less about correctness than about more subjective qualities like relevance or fluency.
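For reference, the standard Fleiss' Kappa computation can be sketched as follows. How the paper converts ranks into categories is not specified; the sketch assumes each item is one rated unit and each category is one possible rank, and the small count matrices are made-up examples.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Share of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)

# Five raters, two categories (hypothetical data).
perfect = fleiss_kappa([[5, 0], [0, 5], [5, 0], [0, 5]])  # all raters agree on every item
split = fleiss_kappa([[3, 2], [2, 3]])                    # raters nearly evenly divided
```

The first example yields a kappa of 1 (complete agreement), while the second yields -0.2, slightly below chance-level agreement.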
4.3 Automatic evaluation
To perform an automatic evaluation of generated chess commentaries, we employ our proposed metric, GCC-Eval. This metric is designed to assess both linguistic quality and domain-specific relevance in chess commentary. To validate its reliability, we calculate the correlation between GCC-Eval scores and human evaluations using the same dataset from prior human evaluation studies. As shown in Table 4, GCC-Eval consistently shows a higher correlation with human assessments across all evaluation criteria compared to traditional metrics, such as BLEU and ROUGE, which rely on surface-level similarity measures with reference comments. We further apply GCC-Eval to evaluate the performance of different chess commentary generation methods. The results in Table 5 indicate that CCC outperforms the baselines in all GCC-Eval metrics, showcasing the effectiveness of integrating domain-specific expertise and concept-based explanations.
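The correlation analysis can be sketched as below. The text does not state which correlation coefficients are used, so Pearson's r and Kendall's tau (the tau-a variant, without tie correction) are shown only as two common choices; the score lists are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-comment scores: human ratings vs. an automatic metric.
human = [0.25, 0.5, 0.75, 1.0]
metric = [0.3, 0.4, 0.9, 0.8]

r = pearson(human, metric)
tau = kendall_tau(human, metric)
```

In this toy data five of the six comment pairs are ordered the same way by both scorers, giving a tau of 2/3, and Pearson's r is about 0.88.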
4.4 Other experiments
Chess skills and knowledge of language model
While LLMs can generate linguistically sound commentary, they lack a deep, inherent understanding of chess strategies. Integrating expert models like chess engines compensates for this limitation, ensuring that the LLM’s output is both fluent and grounded in expert knowledge. To verify the chess skill level of LLMs, we evaluate how well the models solve mate-in-one chess problems; Table 7 reports the results. GPT-4o solves 57% of problems, while other language models are below 12%, even though ChessGPT Feng et al. (2023) is fine-tuned on chess-related documents. When the expert model evaluation result is given in the prompt, the LLM solves 95% of the problems, which is not surprising because the expert model evaluation includes the answer. While GPT-4o + expert includes the answer in the prompt, GPT-4o + concept also shows a significant improvement of 17.2%p, with only a simple hint that there is a mate. This implies that a proper concept serves as a powerful hint for the precise analysis of the position. For a more detailed explanation, refer to Appendix E.
Reliability of the concept-based explanation
We assess the reliability of the concept-based explanations. Table 6 shows that the average accuracy of the extracted chess concepts is 0.91, demonstrating that the model effectively identifies and utilizes key domain-specific concepts. This further supports the idea that concept-based explanations serve as a reliable source for guiding the LLM in generating chess comments.
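The reported average can be checked directly against the per-concept accuracies in Table 6. The sketch below copies those accuracy values and also shows, as a reminder, how accuracy, precision, and recall follow from confusion counts; the helper name and the example counts are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Accuracy column of Table 6, in table order.
table6_accuracy = {
    "Material": 0.93, "Imbalance": 0.80, "Pawns": 0.84,
    "White Knights": 0.91, "Black Knights": 0.91,
    "White Bishop": 0.77, "Black Bishop": 0.75,
    "White Rooks": 1.00, "Black Rooks": 0.99,
    "White Queens": 0.74, "Black Queens": 0.81,
    "White Mobility": 0.99, "Black Mobility": 0.98,
    "White Kingsafety": 0.96, "Black Kingsafety": 0.94,
    "White Threats": 0.93, "Black Threats": 0.93,
    "White Space": 1.00, "Black Space": 1.00,
    "White Passedpawns": 0.98, "Black Passedpawns": 0.92,
}

mean_accuracy = sum(table6_accuracy.values()) / len(table6_accuracy)
weakest = min(table6_accuracy, key=table6_accuracy.get)
```

The 21 accuracies sum to 19.08, so the mean is about 0.909, consistent with the reported 0.91. The lowest accuracy in the table is for White Queens (0.74).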
Interactive commentary generation
We also explore the potential of CCC for generating interactive and context-aware chess commentary. By augmenting the LLM with the decision-making capabilities of an expert model, it responds to flexible user questions, providing deeper insights beyond simple commentary on a move. The questions can be strategic intentions, long-term plans, and potential threats in a given chess position. An example of these interactive commentary capabilities and corresponding results are found in Appendix D. These experiments demonstrate that CCC is capable of generating not only accurate move annotations, but also high-quality interactive chess insights that meet different requirements of different users.
Language models | LLM | LLM + expert | LLM + concept (mate-in-one) |
GPT-4o | 0.564 | 0.982 | 0.736 |
GPT-4o-mini | 0.014 | 0.988 | 0.031 |
GPT-3.5-turbo | 0.036 | 0.988 | 0.056 |
ChessGPT Feng et al. (2023) | 0.118 | 0.563 | 0.175 |
5 Discussions
Language model as an explanation form
Our work shows that the CCC framework effectively transfers AI-driven chess knowledge to human users. Beyond concept-based explanation, language models can act as a crucial medium between the expert model’s internal reasoning and the end-user. This connection facilitates more intuitive and understandable feedback than traditional explanation methods such as saliency-based methods, which suffer from issues of inconsistency and unreliability. By employing a language-based form of explanation, the transparency of the explanation can be improved, making the evaluation of the model’s reliability more straightforward.
Fine-tuning with GCC-Eval
We validate that GCC-Eval is well-correlated with human evaluation. One promising direction to improve the quality of chess commentary is to incorporate GCC-Eval as a training objective, replacing human evaluators. By optimizing models to directly align with this evaluative criterion, we can better ensure that the generated commentary meets the standards of human chess experts. This approach offers a potential pathway toward more robust and human-aligned commentary systems in future applications.
6 Conclusions
In this paper, we propose methods for chess commentary generation (CCC) and evaluation (GCC-Eval). CCC integrates expert and language models through concept-based explanations, utilizing techniques such as prioritization, few-shot learning, and Chain-of-Thought prompting to align effectively with expert knowledge. CCC either surpasses or matches the quality of human-generated commentary, demonstrating the capability of LLMs to express expert-level understanding and potentially enhance learning for human users. We also present GCC-Eval, a multi-dimensional evaluation framework that incorporates chess-specific knowledge to assess chess commentary. The strong correlation between human evaluation and GCC-Eval validates its robustness. These findings underscore promising future research directions, including using a language model as an explanation method and using GCC-Eval for fine-tuning chess commentary generation models.
7 Limitations
Use of proprietary LLMs
We plan to release the source code and datasets used in our experiments. However, since we employed proprietary LLMs including GPT-4o, GPT-4o-mini, and GPT-3.5-turbo (from July to October 2024), it may not be possible to fully reproduce the results. Nonetheless, the proposed framework remains adaptable and can be further enhanced with the integration of more advanced LLMs. In addition, it would be interesting to further investigate the efficacy of our framework with smaller LLMs.
Educational purpose / comment for beginners
The main audience for commentary is often beginners or those with less knowledge than the commentator. In the human evaluation in Section 4.2, we assess the commentary from the viewpoint of expert chess players. Another human evaluation involving novice players could assess the educational impact of the comments. For the same purpose, Chen et al. (2023) propose counterfactual simulatability as an automatic evaluation metric of the improvement of students.
Beyond chess commentary
Although we focus on chess commentary generation, our method can be extended to other tasks that require comprehensive decision-making abilities and have an expert model. Empirical experiments on other tasks require finding appropriate tasks and corresponding expert models.
More concepts
Although we use concepts from Stockfish 8, there are other useful concepts such as fork, pin, double-pawn or open-file. We do not use the concepts because of insufficient concept labels, but they could be valuable, as the concept "mate-in-one" improves chess skill in Table 7.
Differences between concept evaluation function and extracted concept
In our work, we extract the concept vectors from an expert model. Although using oracle concept evaluation functions is relatively more accurate, there are two key reasons for using the extracted concepts. First, recent findings Schut et al. (2023) emphasize that expert models often possess super-human knowledge, capturing patterns and strategies not easily interpretable by humans. This implies that the extracted concepts can cover the comprehensive knowledge of the model, even when humans do not understand it and an oracle concept evaluation function is not available. Second, when the model has defects, the extracted concepts can be used to find the cause of failure. These two aspects motivate our use of extracted concepts.
References
- Alain and Bengio (2016) Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
- Authors (2024) The LCZero Authors. 2024. Leelachesszero. Available at https://lczero.org/.
- Campbell et al. (2002) Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. 2002. Deep Blue. Artif. Intell., 134:57–83.
- Chen et al. (2023) Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, and Kathleen McKeown. 2023. Do models explain themselves? counterfactual simulatability of natural language explanations. arXiv preprint arXiv:2307.08678.
- Chernev (2003) I. Chernev. 2003. Logical Chess : Move By Move: Every Move Explained. Rizzoli.
- Cortes and Vapnik (1995) Corinna Cortes and Vladimir Naumovich Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297.
- Feng et al. (2023) Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. 2023. Chessgpt: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200.
- Fleiss and Cohen (1973) Joseph L Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement, 33(3):613–619.
- Jhamtani et al. (2018) Harsh Jhamtani, Varun Gangal, Eduard Hovy, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Learning to generate move-by-move commentary for chess games from large-scale social forum data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1661–1671, Melbourne, Australia. Association for Computational Linguistics.
- Kim et al. (2018) Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668–2677. PMLR.
- Lee et al. (2022) Andrew Lee, David Wu, Emily Dinan, and Mike Lewis. 2022. Improving chess commentaries by combining language models with symbolic reasoning engines. arXiv preprint arXiv:2212.08195.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634.
- McGrath et al. (2022) Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. 2022. Acquisition of chess knowledge in alphazero. Proceedings of the National Academy of Sciences, 119(47):e2206625119.
- Mehri and Eskenazi (2020) Shikib Mehri and Maxine Eskenazi. 2020. Usr: An unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Pálsson and Björnsson (2023) Aðalsteinn Pálsson and Yngvi Björnsson. 2023. Unveiling concepts learned by a world-class chess-playing agent. In IJCAI, pages 4864–4872.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
- Polgar (2014) J. Polgar. 2014. Judit Polgar Teaches Chess 3: A Game of Queens. Judit Polgar teaches chess. Quality Chess UK LLP.
- Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
- Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
- (23) Tord Romstad, Marco Costalba, Joona Kiiski, and Gary Linscott. Stockfish chess engine. https://stockfishchess.org.
- Schut et al. (2023) Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, and Been Kim. 2023. Bridging the human-ai knowledge gap: Concept discovery and transfer in alphazero. arXiv preprint arXiv:2310.16410.
- Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. 2018. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.
- Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Hui, L. Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of go without human knowledge. Nature, 550:354–359.
- Stent et al. (2005) Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing, pages 341–351.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34.
- Yuksekgonul et al. (2023) Mert Yuksekgonul, Maggie Wang, and James Zou. 2023. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations.
- Zang et al. (2019) Hongyu Zang, Zhiwei Yu, and Xiaojun Wan. 2019. Automated chess commentator powered by neural chess engine. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5952–5961, Florence, Italy. Association for Computational Linguistics.
- Zhang et al. (2020) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
- Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.
- Zhong et al. (2022) Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197.
Appendix
Appendix A Details for GCC-Eval
The scoring prompt, which includes the expert evaluation and Auto-CoT reasoning, is illustrated in Figure A2.
Appendix B Reproduction of baselines
Although Jhamtani et al. (2018) provide the source code, the pre-processing files are missing. For a fair comparison, we align the raw files with the pre-processed files so that the reference text and chess move can be matched to the generated comments. For the same reason, we cannot reproduce Zang et al. (2019). Lee et al. (2022) do not share their source code, and because our baseline GPT-4o + expert shares the same idea, we do not reproduce it.
Appendix C Human evaluation
Every participant is Asian and fluent in reading and writing English. All participants have expert-level chess knowledge: each has a chess.com rapid rating above 1500, which corresponds to the 99.51st percentile among chess players, and the average rating is 1776. The human evaluation is conducted in a within-participant setup. For each move, each participant evaluates five versions of comments generated by five methods (i.e., four baselines and CCC), where the order of methods is randomized. A total of 50 moves are evaluated by the participants (i.e., a total of 250 comments). The evaluation takes approximately four hours to complete. Each participant is compensated by an amount equivalent to 73 USD. Our university IRB approved the evaluation plan. We explain the purpose of the research and the usage plan, and obtain consent from all participants. All evaluation results are anonymized and do not contain any personal information.
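The randomized within-participant presentation can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the script used in the study: the baseline labels are placeholders, and seeding the shuffle per participant is an assumption made here for reproducibility.

```python
import random

# Placeholder labels (assumption): the four baselines are not named here.
METHODS = ["baseline_1", "baseline_2", "baseline_3", "baseline_4", "CCC"]
NUM_MOVES = 50


def build_schedule(participant_seed, methods=METHODS, num_moves=NUM_MOVES):
    """For each move, return the methods in an independently shuffled order."""
    rng = random.Random(participant_seed)
    schedule = []
    for move_id in range(num_moves):
        order = list(methods)  # copy so the shared list is not mutated
        rng.shuffle(order)
        schedule.append((move_id, order))
    return schedule


schedule = build_schedule(participant_seed=0)
total_comments = sum(len(order) for _, order in schedule)
print(total_comments)  # 50 moves x 5 methods = 250
```

Shuffling per move, rather than once per participant, keeps a method from always appearing in the same position on screen.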
Appendix D Interactive commentary
Figure A5 shows an example of interactive comments, starting from CCC. The initial chess commentary is generated by CCC. If there are parts of the generated comments that are unclear or difficult to understand, users can engage with the system by asking follow-up questions to clarify any ambiguous or complex parts of the commentary. Similarly, they can request additional insights, such as alternative moves or a deeper analysis of the current game position.
This interactive approach enhances knowledge transfer between the AI and users, making expert-level chess understanding more accessible. By enabling two-way communication, the functionality of LLMs is extended, transforming the model from a static generator of text into an interactive learning tool that adapts to the needs and curiosity of the user. This capability promotes a more engaging and educational experience in chess commentary, expanding the role of LLMs in expert domains.
comment: Qxe4 is an even trade but misses a stronger opportunity with Qf2. While the move captures the passed e4 pawn, it exposes White to counterplay, particularly b5-b2, which could generate strong pressure on White’s queenside. Better would have been Qf2, maintaining positional advantage and limiting Black’s activity.
[Following question by user:] After the move, can black’s h4 knight survive?
[Response of LLM:] After 26. Qxe4, Black’s knight on h4 is under threat. The White queen can capture the knight with Qxh4. Black doesn’t have any immediate way to defend or save the knight on h4 effectively. Given that the knight is undefended, and White can simply take it on the next move, the knight cannot survive unless Black manages to create a very strong counterattack that would force White to deal with something else first, but that seems unlikely based on the current position. Thus, it looks like Black’s knight on h4 cannot survive and is likely lost after White’s next move.
Appendix E Chess skill evaluation details
We evaluate the chess skill of LLMs using mate-in-one puzzles from the Lichess database (https://database.lichess.org/#puzzles). The evaluation covers 1,000 puzzles. Evaluation prompts are shown in Figure A6. For GPT-4o + expert, we include expert model evaluation information in the prompt (Figure A6(a)). For GPT-4o + concept, we provide an explanation indicating that the board is in a mate-in-one situation (Figure A6(b)). For GPT-4o, GPT-4o-mini, GPT-3.5-turbo, and ChessGPT, we use a basic prompt for evaluation (Figure A6(c)).
Figure A6(a): prompt for GPT-4o + expert
You will be given a chess board, formatted with Forsyth-Edwards notation(FEN) string. Your task is to find the best move of this board. Please answer the best move in standard algebraic notation(SAN).
[User]
position: N6r/1p1k1ppp/2np4/b3p3/4P1b1/N1Q5/P4PPP/R3KB1R b KQ - 0 18
engine evaluation: best move - Bxc3#
Move(SAN formatted move only):
Figure A6(b): prompt for GPT-4o + concept
You will be given a chess board, formatted with Forsyth-Edwards notation(FEN) string. Your task is to find the best move of this board. You can make checkmate in one move. Please answer the best move in standard algebraic notation(SAN).
[User]
position: N6r/1p1k1ppp/2np4/b3p3/4P1b1/N1Q5/P4PPP/R3KB1R b KQ - 0 18
Move(SAN formatted move only):
Figure A6(c): basic prompt for GPT-4o, GPT-4o-mini, GPT-3.5-turbo, and ChessGPT
You will be given a chess board, formatted with Forsyth-Edwards notation(FEN) string. Your task is to find the best move of this board. Please answer the best move in standard algebraic notation(SAN).
[User]
position: N6r/1p1k1ppp/2np4/b3p3/4P1b1/N1Q5/P4PPP/R3KB1R b KQ - 0 18
Move(SAN formatted move only):
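The three prompt variants in Figure A6 and the accuracy computation over puzzles can be sketched as below. This is a minimal sketch under assumptions: the function names are hypothetical, `echo_engine_stub` is a stand-in for a real LLM call, passing the puzzle solution as the engine's best move is a simplification, and comparing answers after stripping check and mate symbols is an assumed matching rule that the text does not specify.

```python
BASE_SYSTEM = (
    "You will be given a chess board, formatted with Forsyth-Edwards notation(FEN) string. "
    "Your task is to find the best move of this board. "
)
CONCEPT_HINT = "You can make checkmate in one move. "
ANSWER_FORMAT = "Please answer the best move in standard algebraic notation(SAN)."


def build_prompt(fen, variant, engine_best_move=None):
    """Return (system, user) messages for variant 'expert', 'concept', or 'basic'."""
    system = BASE_SYSTEM + (CONCEPT_HINT if variant == "concept" else "") + ANSWER_FORMAT
    user = f"position: {fen}\n"
    if variant == "expert":
        user += f"engine evaluation: best move - {engine_best_move}\n"
    user += "Move(SAN formatted move only):"
    return system, user


def normalize(san):
    """Assumed matching rule: ignore whitespace and trailing '+' or '#'."""
    return san.strip().rstrip("+#")


def accuracy(puzzles, answer_fn, variant):
    """Fraction of puzzles for which answer_fn returns the solution move."""
    correct = 0
    for fen, solution in puzzles:
        # Simplification (assumption): the solution doubles as the engine's best move.
        system, user = build_prompt(fen, variant, engine_best_move=solution)
        if normalize(answer_fn(system, user)) == normalize(solution):
            correct += 1
    return correct / len(puzzles)


def echo_engine_stub(system, user):
    """Hypothetical stand-in for an LLM: copies the engine move if one is given."""
    prefix = "engine evaluation: best move - "
    for line in user.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return "Kd8"  # arbitrary fallback answer when no engine move is present


puzzles = [("N6r/1p1k1ppp/2np4/b3p3/4P1b1/N1Q5/P4PPP/R3KB1R b KQ - 0 18", "Bxc3#")]
print(accuracy(puzzles, echo_engine_stub, "expert"))  # 1.0
print(accuracy(puzzles, echo_engine_stub, "basic"))   # 0.0
```

Replacing `echo_engine_stub` with a function that queries a model would give the per-model accuracy; a stricter evaluation could instead check legality and checkmate with a chess library.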
Appendix F Licenses of artifacts
In this study, GPT-4o is used in compliance with its usage policy. ChessGPT is used under the terms of the Apache-2.0 license. The Lichess database is used according to the Creative Commons CC0 license. As there are no specific license statements for GameKnot and GAC, we treat them as released under the Creative Commons CC0 license.
All artifacts are used within their intended use.