Bridging the Gap between Expert and Language Models:
Concept-guided Chess Commentary Generation and Evaluation

Jaechang Kim¹, Jinmin Goh¹, Inseok Hwang¹, Jaewoong Cho², Jungseul Ok¹

¹POSTECH, ²KRAFTON

Abstract

Deep learning-based expert models have reached superhuman performance in decision-making domains such as chess and Go. However, explaining or commenting on their decisions remains under-explored, although it is important for human education and model explainability. The outputs of expert models are accurate yet difficult for humans to interpret. On the other hand, large language models (LLMs) produce fluent commentary but are prone to hallucinations due to their limited decision-making capabilities. To bridge this gap between expert models and LLMs, we focus on chess commentary as a representative case of explaining complex decision-making processes through language and address both the generation and evaluation of commentary. We introduce Concept-guided Chess Commentary generation (CCC) for producing commentary and GPT-based Chess Commentary Evaluation (GCC-Eval) for assessing it. CCC integrates the decision-making strengths of expert models with the linguistic fluency of LLMs through prioritized, concept-based explanations. GCC-Eval leverages expert knowledge to evaluate chess commentary based on informativeness and linguistic quality. Experimental results, validated by both human judges and GCC-Eval, demonstrate that CCC generates commentary that is accurate, informative, and fluent.


1 Introduction

Figure 1: Comparison of chess commentary generation methods. The red color indicates incorrect information.

Artificial intelligence (AI) has achieved superhuman performance in various decision-making tasks, particularly in abstract strategy games like chess and Go. Milestones such as Deep Blue’s victory over the world chess champion Campbell et al. (2002) and AlphaGo’s defeat of top human Go players Silver et al. (2017) highlight AI’s capabilities in solving complex problems. While these expert models deliver highly accurate decisions, they often lack interpretability, which is critical for human education and trust in AI systems. The strategic insights and rationales behind decisions are often explained through natural language commentary Chernev (2003); Polgar (2014). Large language models (LLMs) excel at generating fluent natural language. However, they often hallucinate because of their limited capability in complex decision-making and their lack of domain-specific knowledge.

We aim to bridge the gap between expert and language models. Specifically, we focus on the task of chess commentary generation to explain given decisions. Although chess is a resourceful testbed with extensive datasets and prior studies Zang et al. (2019); Lee et al. (2022); Feng et al. (2023), chess commentary generation poses two main challenges: (i) producing accurate and insightful commentary, which requires deep chess knowledge and linguistic ability, and (ii) developing evaluation metrics to assess commentary quality, which previous research has overlooked.

Although language models can generate fluent natural language, they lack the chess-specific knowledge required for chess commentary generation. Even a model trained on chess-related data Feng et al. (2023) struggles to reason about and understand complex positions. One promising approach is to integrate expert models with language models. However, prior attempts Zang et al. (2019); Lee et al. (2022) that directly feed the decision-making process of expert models to language models are inadequate, because that process is hard for language models to interpret.

To address these challenges, we introduce an effective approach using concept-based explanations of expert models. By extracting and prioritizing concepts that the expert model focuses on, we guide the language model to concentrate on the most important aspects of the game. This results in commentary that is both linguistically fluent and strategically insightful. Figure 1 illustrates previous approaches and our approach. Our experiments demonstrate that our approach achieves human-level correctness in commentary generation, while outperforming baselines and human-generated comments in informativeness (relevance, completeness) and linguistic quality (clarity, fluency).

Evaluating chess commentary generation is another challenging task. Previous works Jhamtani et al. (2018); Zang et al. (2019); Lee et al. (2022) rely on similarity-based metrics such as BLEU, which are insufficient due to the inherently diverse nature of commentary. Different commentators may focus on distinct aspects of a position, such as attack strategies or defensive plans. In tasks like summarization or translation, which share the same challenge, LLM-based evaluation metrics Zhong et al. (2022); Liu et al. (2023) have been proposed to assess multiple dimensions. We adapt G-Eval Liu et al. (2023) by incorporating expert model guidance for chess knowledge. We measure the commentary’s informativeness (relevance, completeness) and linguistic quality (clarity, fluency). Through our experiments, we show that our proposed method correlates well with human judgments, offering a more reliable metric for commentary evaluation.

Our contributions are as follows:

• We propose an approach that integrates expert models with LLMs through concept-based explanations, facilitating transparent decision-making in chess commentary generation.

• We develop a prioritization mechanism that highlights important concepts and an LLM inference technique that enables the model to understand moves with concept guidance.

• We introduce and validate an LLM-based evaluation metric to assess the quality of chess commentary across multiple dimensions.

2 Related work

Chess commentary generation

Chess commentary generation is the task of generating a comment for a chess move. Jhamtani et al. (2018) first address the task by utilizing web-crawled data to form a chess commentary dataset, framing commentary generation as a sequence prediction problem. Building on this, Zang et al. (2019) incorporate domain-specific chess knowledge using internal chess models, improving the quality and contextual relevance of generated comments. Lee et al. (2022) integrate BART Lewis et al. (2020) and an external chess engine for more reliable move evaluation. However, their system classifies moves into predefined categories (e.g., excellent, good, inaccuracy, mistake, blunder), without a deeper understanding of the model’s decision-making process. In contrast, we leverage concept-based explanation to extract chess concepts from an expert model and thereby understand the rationale behind its decisions.

Beyond chess commentary, Feng et al. (2023) fine-tune an LLM on chess-related data so that it acquires chess skills in addition to linguistic ability. However, we demonstrate that its understanding of chess knowledge is inferior to that of GPT-4o OpenAI (2023) (Section 4.4).

Concept-based explanation in chess

Concepts are high-level abstractions commonly shared within a community, enabling efficient communication. In chess, concepts such as "king safety" (i.e., all potential threats against the king) condense complex strategies into understandable terms, allowing players to communicate effectively without lengthy explanations. These concepts are understandable to both humans and language models, serving as a bridge between human intuition and neural networks. Concept-based explanations aim to make a model interpretable by aligning its internal decision-making process with these shared concepts, assuming that such concepts are linearly embedded in the representation space Kim et al. (2018); Alain and Bengio (2016); McGrath et al. (2022). This assumption has been validated in the chess domain Pálsson and Björnsson (2023); McGrath et al. (2022) for chess expert models such as Stockfish Romstad et al., AlphaZero Silver et al. (2018), and their open-source counterparts, such as LeelaChessZero Authors (2024).

Prioritization of concepts

Yuksekgonul et al. (2023) train a post-hoc concept bottleneck model, and the classifier following the concept bottleneck model is directly interpreted as the global importance of concepts for a class. However, they focus on finding global concept importance per class, without addressing the varying significance of concepts for individual inputs. We address prioritization of concepts for individual inputs, or local importance, to determine the influence of each concept in specific situations.

Evaluation of natural language generation

Classical evaluation metrics for natural language generation (NLG) are based on similarity. Common metrics are BLEU Papineni et al. (2002) and ROUGE Lin (2004). However, these metrics fail to assess content quality Reiter and Belz (2009) and syntactic correctness Stent et al. (2005), and are insufficient to measure the reliability of NLG systems. Zhang et al. (2020) and Zhao et al. (2019) instead compare texts in an embedding space to better capture semantic similarity.

More recently, moving beyond similarity, Yuan et al. (2021) and Mehri and Eskenazi (2020) assess generated text along multiple dimensions, and Zhong et al. (2022) and Liu et al. (2023) do so using language models. Using LLMs for evaluation is now common, and such methods are known to align with human evaluation, sometimes more closely than human evaluators agree with one another Rafailov et al. (2024); Chen et al. (2023). However, existing LLM-based evaluators focus on summarization and translation tasks, and they lack the domain-specific knowledge required for evaluating chess commentary.

Evaluating chess commentary is challenging due to its diverse nature: commentaries on the same move may vary significantly depending on their focus, such as attack strategies, defensive plans, or comparison with other moves. Chess knowledge is essential for evaluating the correctness and relevance of these commentaries. Previous chess commentary studies Jhamtani et al. (2018); Zang et al. (2019); Lee et al. (2022) use classical metrics such as BLEU, ROUGE, or perplexity, but these metrics fall short for chess commentary, as they do not draw on domain-specific knowledge. While manual evaluation by human experts remains ideal, we propose an automatic evaluation method leveraging an LLM with chess knowledge.

3 Method: generation and evaluation

We propose two methods to address chess commentary generation (Section 3.1) and chess commentary evaluation (Section 3.2).

3.1 Concept-guided commentary generation

We propose Concept-guided Chess Commentary generation (CCC), which is a method for generating chess commentary by leveraging a chess expert model and its concept-based explanations. The method involves two key steps: 1) extracting concept vectors from a chess expert model (Section 3.1.1); and 2) generating commentary via an LLM using prioritized concepts that explain the given position and movement (Section 3.1.2). Figure 2 provides an overview of the proposed method.

3.1.1 Concept vector extraction

To make a chess expert model interpretable, we extract concept vectors that correspond to key concepts in chess. We follow a common approach Kim et al. (2018); Yuksekgonul et al. (2023) involving two steps: preparing a dataset for concept learning and extracting concept vectors by training a linear classifier. The concepts we focus on are adopted from Stockfish 8, a classical chess engine that can evaluate positions for their relevance to specific concepts (see Table 6). We collect 200,000 chess positions from the Lichess open database (https://database.lichess.org/#evals) and use Stockfish 8 to assign a score reflecting how strongly each position relates to each concept. For each concept, we then label the $5\%$ of positions with the highest scores as positive samples and the $5\%$ with the lowest scores as negative samples. This process results in a dataset of 20,000 positions for each concept, split equally between positive and negative samples. To extract concept vectors, we employ LeelaChessZero T78, an open-source neural network-based chess model similar to AlphaZero. For the representation space, we use the final layer before the policy and value heads (layer 40). We then train a linear Support Vector Machine (SVM) (Cortes and Vapnik, 1995) to classify these samples. The resulting normal vector of the SVM classification boundary serves as the concept vector, and the distance from this boundary determines the concept score for any input position. This score quantifies how strongly a given board state aligns with the extracted concept.
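The labeling and probing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: random 2-D vectors stand in for layer-40 activations, a noisy copy of the first coordinate stands in for the Stockfish 8 concept score, the helper names (`label_extremes`, `train_linear_svm`, `concept_score`) are hypothetical, and a plain subgradient-descent hinge-loss solver replaces whichever SVM implementation was actually used.

```python
import random

def label_extremes(scores, fraction=0.05):
    """Indices of the top and bottom `fraction` of positions by oracle score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = max(1, round(len(scores) * fraction))
    return order[-k:], order[:k]  # (positives, negatives)

def train_linear_svm(xs, ys, epochs=200, lr=0.01, reg=0.01, seed=0):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 (concept present) or -1 (concept absent).
    """
    rng = random.Random(seed)
    w, b = [0.0] * len(xs[0]), 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = ys[i] * (sum(wj * xj for wj, xj in zip(w, xs[i])) + b)
            w = [wj * (1 - lr * reg) for wj in w]  # weight decay from the L2 term
            if margin < 1:  # sample violates the margin: push the boundary away
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += lr * ys[i]
    return w, b

def concept_score(w, b, x):
    """Signed score w.x + b, proportional to the distance from the boundary."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy data (assumption): 400 random "representations"; the oracle score
# depends mostly on coordinate 0, so the concept vector should point along it.
rng = random.Random(0)
reps = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
oracle = [r[0] + 0.1 * rng.gauss(0, 1) for r in reps]

pos, neg = label_extremes(oracle, fraction=0.05)
xs = [reps[i] for i in pos + neg]
ys = [1] * len(pos) + [-1] * len(neg)
w, b = train_linear_svm(xs, ys)  # w plays the role of the concept vector
```

With 200,000 positions and a 5% fraction, the same labeling rule yields 10,000 positives and 10,000 negatives per concept, matching the 20,000-position datasets described above.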

3.1.2 Chess comment generation with an expert model and extracted concepts

Prioritization of concepts

Given a chess position and a specific move, our goal is to identify the concepts most relevant to explaining that move. For the chess position, we compute the score for each concept by taking the dot product between the expert model representation of the position and the extracted concept vectors. These concept scores indicate how strongly each concept is present in the current position. To prioritize concepts, we compare the concept scores before and after the move. By analyzing the differences between pre-move and post-move scores, we identify which concepts are most influenced by the move. This allows us to assign priority to the concepts that explain the impact of the move.
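The prioritization step can be illustrated with a short sketch. The function name `prioritize_concepts`, the three-dimensional toy vectors, and the choice to rank by the absolute score change are assumptions made for illustration; the text above only states that pre-move and post-move scores are compared.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def prioritize_concepts(rep_before, rep_after, concept_vectors, top_k=3):
    """Rank concepts by how much their score changes across the move.

    rep_before / rep_after: expert-model representations of the position
    before and after the move (toy vectors here).
    concept_vectors: mapping from concept name to its extracted vector.
    """
    deltas = {
        name: dot(rep_after, vec) - dot(rep_before, vec)
        for name, vec in concept_vectors.items()
    }
    # Assumption: a larger absolute change means higher priority.
    ranked = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]

# Hypothetical concept vectors and representations.
concept_vectors = {
    "White Kingsafety": [1.0, 0.0, 0.0],
    "Black Threats": [0.0, 1.0, 0.0],
    "White Space": [0.0, 0.0, 1.0],
}
rep_before = [0.2, 0.5, 0.1]
rep_after = [0.3, -0.4, 0.1]

ranked = prioritize_concepts(rep_before, rep_after, concept_vectors, top_k=2)
```

In this toy case the move mostly changes the "Black Threats" score, so that concept would be passed to the LLM first. The sign of the change is kept in the output so that the prompt can say whether the concept became stronger or weaker.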

Commentary generation via LLM

We generate chess commentary using an LLM and a chess expert model. Although a language model understands chess-specific notations and terms, it lacks the ability to perform chess-specific reasoning and complex analysis, which can result in hallucination. By integrating the chess expert model output, the LLM determines whether to focus on advantageous or disadvantageous aspects. However, since the chess expert model output consists only of scalar values, the LLM can still generate incorrect comments. Concept-based explanation guides the LLM to focus on critical aspects. Figure 3 shows a typical example of a concept-guided comment.

To enhance the reasoning ability of the LLM, we employ few-shot prompting, Chain-of-Thought (CoT) prompting Wei et al. (2022), and chess-specific information. This approach provides the LLM with a deeper understanding of chess positions and guards against the use of wrongly prioritized concepts. Additionally, we enumerate all existing attacks on opponent pieces so that the LLM does not mention non-existent pieces or illegal moves.

3.2 Automatic evaluation of commentary

Our evaluation approach, termed GCC-Eval, modifies and extends G-Eval to better address the specific challenges of evaluating chess commentary. The core components of GCC-Eval are: (i) multi-dimensional evaluation by an LLM, (ii) expert model evaluation for chess knowledge, (iii) Auto-CoT for score-only output, and (iv) weighted summation for non-integer scores. Note that our contributions lie in the first and second components, which ensure accurate chess commentary evaluation with a focus on informativeness and linguistic quality.

Evaluation dimensions

The evaluation covers four dimensions: relevance, completeness, clarity, and fluency. While clarity and fluency are general linguistic measures, relevance and completeness require a deep understanding of chess. To address this, we employ an expert model to augment the LLM’s capabilities when scoring relevance and completeness. This integration ensures that the commentary is not only linguistically sound but also informative from a domain-expert perspective.

The scoring prompts, including the expert evaluation and Auto-CoT reasoning, are described in Appendix A. For score computation, we adopt a weighted summation of score probabilities as follows:

$$\mathrm{score}(x)=\sum_{s\in\{1,2,3,4,5\}} s \times p(s \mid x). \tag{1}$$

This method allows for non-integer scores, capturing subtle nuances in the evaluation that would be missed by integer-only scoring schemes.
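As a small worked example of Equation (1), the sketch below computes the weighted score from a probability distribution over the five ratings. The probabilities are made up for illustration, and two details are assumptions not stated in the text: the probabilities are renormalised over the five rating values, and the mapping to $[0,1]$ uses the linear rule $(\mathrm{score}-1)/4$.

```python
def weighted_score(probs):
    """Expected rating: sum of s * p(s | x) over the rating scale.

    probs maps each integer rating s to the probability the evaluator
    assigns to it. Renormalising is an assumption, made in case the
    five probabilities do not sum exactly to one.
    """
    total = sum(probs.values())
    return sum(s * p for s, p in probs.items()) / total

def rescale_to_unit(score, low=1, high=5):
    """Linear map from [low, high] to [0, 1] (assumed rescaling rule)."""
    return (score - low) / (high - low)

# Hypothetical rating distribution for one comment.
probs = {1: 0.0, 2: 0.1, 3: 0.2, 4: 0.5, 5: 0.2}
score = weighted_score(probs)    # 0.2 + 0.6 + 2.0 + 1.0 = 3.8
unit = rescale_to_unit(score)    # (3.8 - 1) / 4 = 0.7
```

An integer-only scheme would report 4 for this comment, whereas the weighted sum of 3.8 also reflects the probability mass placed on the lower ratings.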

Comment generation methods	Correctness	Relevance	Completeness	Clarity	Fluency	Words per comment
Reference	0.62	0.52	0.30	0.60	0.62	15.6
GAC Jhamtani et al. (2018)	0.63	0.46	0.15	0.66	0.64	8.9
GPT-4o	0.36	0.49	0.40	0.72	0.84	27.1
GPT-4o + expert	0.43	0.56	0.49	0.72	0.85	26.2
GPT-4o + expert + concept (CCC, ours)	0.60	0.67	0.59	0.80	0.91	28.5

Table 1: Average scores of human evaluation. Bold and underlined text indicate the best and second-best methods in each column, respectively. Numbers are rescaled to the range $[0,1]$.

Types of errors	GPT-4o	GPT-4o + expert	GPT-4o + expert + concept
Referring illegal move or non-existing pieces	0.46	0.28	0.20
Wrong understanding of tactical/immediate advantage	0.46	0.40	0.26
Wrong understanding of positional/long-term advantage	0.28	0.26	0.28
Wrong evaluation of the move/position	0.32	0.30	0.34

Table 2: Error rates by cause of incorrectness. Note that each question allows multiple answers. Error types in the lower rows require more comprehensive reasoning.

Metrics	$\kappa$
Correctness	0.5393
Relevance	0.2448
Completeness	0.2449
Clarity	0.1782
Fluency	0.2328

Table 3: Inter-annotator agreement of human evaluation measured by Fleiss’ kappa ($\kappa$).

Metrics	Correctness		Relevance		Completeness		Clarity		Fluency
Metrics	$\rho$	$\tau$	$\rho$	$\tau$	$\rho$	$\tau$	$\rho$	$\tau$	$\rho$	$\tau$
BLEU-1	0.17	0.10	-0.06	0.02	-0.25	-0.07	-0.08	0.01	-0.36	-0.16
ROUGE-1	0.04	-0.01	-0.19	-0.08	-0.29	-0.18	-0.18	-0.14	-0.29	-0.17
ROUGE-2	0.15	0.03	-0.10	-0.03	-0.16	-0.05	-0.02	0.03	-0.15	0.00
ROUGE-L	0.08	0.01	-0.18	-0.08	-0.29	-0.18	-0.16	-0.14	-0.29	-0.17
GCC-Eval	-	-	0.40	0.24	0.56	0.39	0.44	0.23	0.55	0.38

Table 4: Correlations between human and automatic evaluations. $\rho$ and $\tau$ denote Pearson correlation and Kendall’s tau correlation, respectively.

Comment generation methods	Relevance	Completeness	Clarity	Fluency
Reference	0.51	0.25	0.47	0.72
GAC Jhamtani et al. (2018)	0.47	0.14	0.39	0.81
GPT-4o	0.79	0.48	0.85	0.95
GPT-4o + expert	0.81	0.49	0.75	0.90
GPT-4o + expert + concept (CCC, ours)	0.89	0.54	0.88	1.00

Table 5: Automatic evaluation results using GCC-Eval. Numbers are rescaled to the range $[0,1]$.

Concepts	Accuracy	Precision	Recall
Material	0.93	0.93	0.94
Imbalance	0.80	0.73	0.93
Pawns	0.84	0.81	0.90
White Knights	0.91	0.87	0.96
Black Knights	0.91	0.87	0.97
White Bishop	0.77	0.73	0.87
Black Bishop	0.75	0.71	0.83
White Rooks	1.00	1.00	1.00
Black Rooks	0.99	0.99	1.00
White Queens	0.74	0.71	0.79
Black Queens	0.81	0.84	0.77
White Mobility	0.99	0.99	1.00
Black Mobility	0.98	0.96	0.99
White Kingsafety	0.96	0.97	0.94
Black Kingsafety	0.94	0.96	0.91
White Threats	0.93	0.90	0.96
Black Threats	0.93	0.90	0.97
White Space	1.00	1.00	1.00
Black Space	1.00	1.00	1.00
White Passedpawns	0.98	0.98	0.98
Black Passedpawns	0.92	0.91	0.94

Table 6: Test accuracy, precision and recall of concept-based explanations.

4 Experiments

4.1 Experimental settings

Dataset

We evaluate our model using the Chess Commentary dataset introduced by Jhamtani et al. (2018). This dataset contains full chess games accompanied by user-generated commentary on specific moves, collected from an online chess forum (https://gameknot.com/). Following the train/valid/test split introduced by Jhamtani et al. (2018), we use only the test set for our experiments. Because the pre-processing code is unavailable, we manually align the raw data with the pre-processed data to ensure a fair comparison with GAC. Additionally, we exclude comments that cover multiple moves to simplify the analysis.

Baselines

We compare the experimental results across several methods:

• Reference: The reference texts from the GameKnot dataset.

• GAC Jhamtani et al. (2018): An LSTM model trained on the GameKnot dataset for generating chess commentary.

• GPT-4o OpenAI (2023): The unmodified version of GPT-4o, accessed via the OpenAI API, with a temperature setting of $0.1$ to avoid noisy outputs. For a detailed comparison of LLMs, refer to Section 4.4.

• GPT-4o + expert: The same GPT-4o model, augmented with evaluations from a chess expert model. Note that Lee et al. (2022) combine BART with a chess expert model; GPT-4o + expert serves as a stronger counterpart because it uses a more powerful language model together with a sufficiently strong expert model.

4.2 Human evaluation

Human evaluation settings

We conduct a manual human evaluation to assess the quality of comments. For the reliability of the evaluation, we ensure that every participant possesses sufficient chess knowledge to evaluate chess comments. Specifically, we recruit five participants from the university community and social media. Each participant has a chess.com rapid rating above 1500, which corresponds to the 99.51st percentile among chess players (retrieved in Oct 2024 from https://www.chess.com/leaderboard/live/rapid), and the average rating is 1776. The human evaluation is conducted in a within-participant setup. For each move, each participant evaluates five versions of comments generated by five methods (i.e., four baselines and CCC), where the order of methods is randomized. A total of 50 moves are evaluated by the participants (i.e., a total of 250 comments). The evaluation takes approximately four hours to complete. Each participant is compensated with an amount equivalent to 73 USD. Our university IRB approved the evaluation plan. Appendix C summarizes the details of the human evaluation, including the instructions and questions used.

During the evaluation, participants are presented with a chessboard displaying a specific move, marked with a blue arrow. The corresponding commentaries are shown alongside the move. Each participant is asked to rate the commentary across six questions: five evaluation metrics and one question for categorizing the type of incorrectness when applicable. The evaluated metrics are correctness, relevance, completeness, clarity, and fluency. Relevance and completeness evaluate how informative and insightful the comment is, and clarity and fluency evaluate how linguistically natural it is. Relevance, completeness, clarity, and fluency are assessed using a five-point Likert scale, while correctness is evaluated using a three-point Likert scale, as the correctness of a comment is closer to a binary decision than to a scaled question. For clear presentation, the scores are rescaled to a range of 0 to 1.

Main results

Table 1 presents the results of the human evaluation. Our proposed method, CCC, achieves the highest scores in all metrics except correctness, where it trails GAC and the reference comments by a small margin (0.60 versus 0.63 and 0.62). CCC also outperforms the reference comments in every metric except correctness, where it remains comparable. The reference comments, collected from online sources, often contain grammatical mistakes and informal language, underscoring the limitation of similarity-based evaluation metrics. This highlights the need for evaluation metrics beyond similarity, especially when the quality of the reference comments is suboptimal. The use of expert models and concept guidance contributes significantly to the overall performance improvement, as evidenced by the higher scores across most metrics. While GPT-4o + expert shows only a slight improvement in correctness over GPT-4o, it generates more detailed explanations, which in some cases lead to minor factual inaccuracies in the details, as illustrated in Figure 3. Although GAC exhibits the highest correctness, slightly outperforming CCC, this comes at the cost of detail: its explanations tend to be brief and therefore less informative, leading to lower completeness scores and shorter comments.

Detailed analysis

Table 2 provides a detailed analysis of the types of errors. Using the expert model and concepts reduces simple errors, whereas the rates of errors that require comprehensive understanding stay roughly unchanged, within the margin of error.

To validate the consistency of the human evaluation, we calculate inter-annotator agreement using Fleiss’ kappa Fleiss and Cohen (1973) on the ranks across different methods. Table 3 reports the agreement among the participants. The agreement for correctness is 0.54, indicating moderate agreement. This is notably higher than for the other metrics, suggesting that correctness is less open to dispute among chess experts than more subjective qualities such as relevance or fluency.
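For readers unfamiliar with the statistic, the sketch below computes Fleiss’ kappa from a table of rating counts using the standard formula. The count matrix is invented for illustration (five raters, three categories); how the study's rank data are binned into categories is not specified here, so that part is left as an assumption.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of
    raters who assigned item i to category j. Every row must sum to
    the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Share of all assignments that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items          # mean observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical ratings: 4 items, 5 raters, 3 categories.
counts = [
    [5, 0, 0],
    [3, 2, 0],
    [0, 4, 1],
    [1, 1, 3],
]
kappa = fleiss_kappa(counts)
```

For this toy matrix the mean observed agreement is 0.575 and the chance agreement is 0.365, giving a kappa of about 0.33.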

4.3 Automatic evaluation

To perform an automatic evaluation of generated chess commentaries, we employ our proposed metric, GCC-Eval. This metric is designed to assess both linguistic quality and domain-specific relevance in chess commentary. To validate its reliability, we calculate the correlation between GCC-Eval scores and human evaluations using the same dataset from prior human evaluation studies. As shown in Table 4, GCC-Eval consistently shows a higher correlation with human assessments across all evaluation criteria compared to traditional metrics, such as BLEU and ROUGE, which rely on surface-level similarity measures with reference comments. We further apply GCC-Eval to evaluate the performance of different chess commentary generation methods. The results in Table 5 indicate that CCC outperforms the baselines in all GCC-Eval metrics, showcasing the effectiveness of integrating domain-specific expertise and concept-based explanations.
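The two correlation statistics reported in Table 4 can be computed as in the following sketch. The five score pairs are invented, and the Kendall variant shown is tau-a (no tie correction), which is an assumption since the text does not say which variant is used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-comment scores from humans and from an automatic metric.
human = [0.2, 0.4, 0.6, 0.8, 1.0]
metric = [0.3, 0.1, 0.5, 0.9, 0.7]

r = pearson(human, metric)          # 0.32 / 0.4 = 0.8
tau = kendall_tau_a(human, metric)  # (8 - 2) / 10 = 0.6
```

Pearson correlation measures linear agreement between the score values, while Kendall's tau depends only on how the two metrics order the comments, so reporting both shows whether the automatic metric tracks human scores in magnitude as well as in ranking.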

4.4 Other experiments

Chess skills and knowledge of language model

While LLMs can generate linguistically sound commentary, they lack a deep, inherent understanding of chess strategies. Integrating expert models such as chess engines compensates for this limitation, ensuring that the LLM’s output is both fluent and grounded in expert knowledge. To verify the chess skill level of LLMs, we evaluate how well the models solve mate-in-one chess problems (Table 7). GPT-4o solves 56.4% of the problems, while the other language models stay below 12%, even though ChessGPT Feng et al. (2023) is fine-tuned on chess-related documents. When the expert model evaluation result is given in the prompt, GPT-4o solves 98.2% of the problems, which is not surprising because the expert model evaluation includes the answer. GPT-4o + concept also shows a significant improvement of 17.2%p, with only a simple hint that a mate exists. This implies that a proper concept serves as a powerful hint for the precise analysis of the position. For a more detailed explanation, refer to Appendix E.
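The size of the concept-hint effect can be checked directly from Table 7. The snippet below copies the solve rates from that table and computes the gain of "LLM + concept" over the plain LLM in percentage points; the dictionary layout and the function name are illustrative only.

```python
# Solve rates copied from Table 7: (LLM alone, LLM + expert, LLM + concept).
table7 = {
    "GPT-4o": (0.564, 0.982, 0.736),
    "GPT-4o-mini": (0.014, 0.988, 0.031),
    "GPT-3.5-turbo": (0.036, 0.988, 0.056),
    "ChessGPT": (0.118, 0.563, 0.175),
}

def concept_gain_pp(rates):
    """Gain from the concept hint over the plain LLM, in percentage points."""
    base, _expert, concept = rates
    return round((concept - base) * 100, 1)

gains = {model: concept_gain_pp(rates) for model, rates in table7.items()}
```

The gain for GPT-4o is 17.2 percentage points, while the three weaker models gain between 1.7 and 5.7 points, which suggests that the hint helps most when the underlying model already has some ability to analyze the position.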

Reliability of the concept-based explanation

We assess the reliability of the concept-based explanations. Table 6 shows that the average accuracy of the extracted chess concepts is 0.91, demonstrating that the model effectively identifies and utilizes key domain-specific concepts. This further supports the idea that concept-based explanations serve as a reliable source for guiding the LLM in generating chess comments.
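The 0.91 average can be reproduced from the accuracy column of Table 6, as the short check below shows. The helper `binary_metrics` is included only to recall how accuracy, precision, and recall are defined for a binary concept probe; the confusion counts passed to it are hypothetical.

```python
# Accuracy column of Table 6, in row order (Material ... Black Passedpawns).
accuracies = [
    0.93, 0.80, 0.84, 0.91, 0.91, 0.77, 0.75, 1.00, 0.99, 0.74, 0.81,
    0.99, 0.98, 0.96, 0.94, 0.93, 0.93, 1.00, 1.00, 0.98, 0.92,
]
mean_accuracy = sum(accuracies) / len(accuracies)  # 19.08 / 21

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts for one concept probe on a balanced test split.
acc, prec, rec = binary_metrics(tp=90, fp=10, tn=90, fn=10)
```

The mean over the 21 concepts is about 0.909, which rounds to the reported 0.91. The per-concept values range from 0.74 (White Queens) to 1.00 (White Rooks, White Space, Black Space).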

Interactive commentary generation

We also explore the potential of CCC for generating interactive and context-aware chess commentary. By augmenting the LLM with the decision-making capabilities of an expert model, it responds to flexible user questions, providing deeper insights beyond simple commentary on a move. The questions can be strategic intentions, long-term plans, and potential threats in a given chess position. An example of these interactive commentary capabilities and corresponding results are found in Appendix D. These experiments demonstrate that CCC is capable of generating not only accurate move annotations, but also high-quality interactive chess insights that meet different requirements of different users.

Language models	LLM	LLM + expert	LLM + concept (mate-in-one)
GPT-4o	0.564	0.982	0.736
GPT-4o-mini	0.014	0.988	0.031
GPT-3.5-turbo	0.036	0.988	0.056
ChessGPT Feng et al. (2023)	0.118	0.563	0.175

Table 7: LLM chess skill evaluation on mate-in-one problems.

5 Discussions

Language model as an explanation form

Our work shows that the CCC framework effectively transfers AI-driven chess knowledge to human users. Beyond concept-based explanation, language models can act as a crucial medium between the expert model’s internal reasoning and the end-user. This connection facilitates more intuitive and understandable feedback than traditional explanation methods such as saliency-based ones, which suffer from inconsistency and unreliability. Employing a language-based form of explanation can improve the transparency of the explanation, making the evaluation of the model’s reliability more straightforward.

Fine-tuning with GCC-Eval

We validate that GCC-Eval correlates well with human evaluation. One promising direction for improving the quality of chess commentary is to incorporate GCC-Eval as a training objective, replacing human evaluators. By optimizing models to align directly with this evaluative criterion, we can better ensure that the generated commentary meets the standards of human chess experts. This approach offers a potential pathway toward more robust and human-aligned commentary systems in future applications.

6 Conclusions

In this paper, we propose methods for chess commentary generation (CCC) and evaluation (GCC-Eval). CCC integrates expert and language models through concept-based explanations, utilizing techniques such as prioritization, few-shot learning, and Chain-of-Thought prompting to align effectively with expert knowledge. CCC either surpasses or matches the quality of human-generated commentary, demonstrating the capability of LLMs to express expert-level understanding and potentially enhance learning for human users. We also present GCC-Eval, a multi-dimensional evaluation framework that incorporates chess-specific knowledge to assess chess commentary. The strong correlation between human evaluation and GCC-Eval validates its robustness. These findings underscore promising future research directions, including using a language model as an explanation method and using GCC-Eval to fine-tune chess commentary generation models.

7 Limitations

Use of proprietary LLMs

We plan to release the source code and datasets used in our experiments. However, since we employed proprietary LLMs including GPT-4o, GPT-4o-mini, and GPT-3.5-turbo (from July to October 2024), fully reproducing the results may be difficult. Nonetheless, the proposed framework remains adaptable and can be further enhanced by integrating more advanced LLMs. It would also be interesting to investigate the efficacy of our framework with smaller LLMs.

Educational purpose / comment for beginners

The main audience for commentary is often beginners or those with less knowledge than the commentator. In the human evaluation in Section 4.2, we assess the commentary from the viewpoint of expert chess players. A further human evaluation involving novice players could assess the educational impact of the comments. For the same purpose, Chen et al. (2023) propose counterfactual simulatability as an automatic evaluation metric for the improvement of students.

Beyond chess commentary

Although we focus on chess commentary generation, our method can be extended to other tasks that require comprehensive decision-making abilities and have an expert model. Empirical experiments on other tasks require finding appropriate tasks and corresponding expert models.

More concepts

Although we use concepts from Stockfish 8, there are other useful concepts such as fork, pin, double-pawn, or open-file. We do not use these concepts because of insufficient concept labels, but they could be valuable, as the concept "mate-in-one" improves chess skill in Table 7.

Differences between concept evaluation function and extracted concept

In our work, we extract the concept vectors from an expert model. Although oracle concept evaluation functions are more accurate, there are two key reasons for using the extracted concepts. First, recent findings Schut et al. (2023) emphasize that expert models often possess super-human knowledge, capturing patterns and strategies not easily interpretable by humans. This implies that the extracted concepts can cover the comprehensive knowledge of the model, even when humans do not understand it and no oracle concept evaluation function exists. Second, when the model has defects, the extracted concepts can be used to find the cause of failure. These two aspects motivate our use of extracted concepts.
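
To make the notion of an extracted concept concrete, the following is a minimal sketch, not the extraction procedure used in this paper. It assumes that hidden activations and binary concept labels are available, and it swaps in a difference of class means for a trained linear classifier. The activations are synthetic two-dimensional stand-ins for an expert model's hidden layer, and the names `fit_concept_vector` and `concept_score` are hypothetical.

```python
def fit_concept_vector(activations, labels):
    """Return (w, b) from the difference of class means (a stand-in for a linear probe)."""
    dim = len(activations[0])
    pos = [a for a, y in zip(activations, labels) if y]
    neg = [a for a, y in zip(activations, labels) if not y]

    def mean(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]

    mu_pos, mu_neg = mean(pos), mean(neg)
    # The concept vector points from the "absent" mean to the "present" mean.
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # The bias places the decision boundary midway between the two class means.
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b


def concept_score(w, b, activation):
    """Positive score: the concept is read as present in this activation."""
    return sum(wi * ai for wi, ai in zip(w, activation)) + b


# Synthetic 2-D "activations": dimension 0 encodes the concept, dimension 1 is unrelated.
labels = [i % 2 == 0 for i in range(20)]
activations = [
    [(1.0 if y else -1.0) + 0.1 * ((i % 5) - 2), 0.05 * (i % 3)]
    for i, y in enumerate(labels)
]

w, b = fit_concept_vector(activations, labels)
predictions = [concept_score(w, b, a) > 0 for a in activations]
probe_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

In this toy setting the recovered vector is dominated by the dimension that encodes the concept. The same score can be computed for any position the model processes, including positions where no hand-written oracle function is available.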

References

Alain and Bengio (2016) Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
Authors (2024) The LCZero Authors. 2024. Leelachesszero. Available at https://lczero.org/.
Campbell et al. (2002) Murray Campbell, A. Joseph Hoane, and Feng-hsiung Hsu. 2002. Deep blue. Artif. Intell., 134:57–83.
Chen et al. (2023) Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, and Kathleen McKeown. 2023. Do models explain themselves? counterfactual simulatability of natural language explanations. arXiv preprint arXiv:2307.08678.
Chernev (2003) I. Chernev. 2003. Logical Chess : Move By Move: Every Move Explained. Rizzoli.
Cortes and Vapnik (1995) Corinna Cortes and Vladimir Naumovich Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297.
Feng et al. (2023) Xidong Feng, Yicheng Luo, Ziyan Wang, Hongrui Tang, Mengyue Yang, Kun Shao, David Mguni, Yali Du, and Jun Wang. 2023. Chessgpt: Bridging policy learning and language modeling. arXiv preprint arXiv:2306.09200.
Fleiss and Cohen (1973) Joseph L Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement, 33(3):613–619.
Jhamtani et al. (2018) Harsh Jhamtani, Varun Gangal, Eduard Hovy, Graham Neubig, and Taylor Berg-Kirkpatrick. 2018. Learning to generate move-by-move commentary for chess games from large-scale social forum data. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1661–1671, Melbourne, Australia. Association for Computational Linguistics.
Kim et al. (2018) Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pages 2668–2677. PMLR.
Lee et al. (2022) Andrew Lee, David Wu, Emily Dinan, and Mike Lewis. 2022. Improving chess commentaries by combining language models with symbolic reasoning engines. arXiv preprint arXiv:2212.08195.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Liu et al. (2023) Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-eval: Nlg evaluation using gpt-4 with better human alignment. arXiv preprint arXiv:2303.16634.
McGrath et al. (2022) Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Martin Wattenberg, Demis Hassabis, Been Kim, Ulrich Paquet, and Vladimir Kramnik. 2022. Acquisition of chess knowledge in alphazero. Proceedings of the National Academy of Sciences, 119(47):e2206625119.
Mehri and Eskenazi (2020) Shikib Mehri and Maxine Eskenazi. 2020. Usr: An unsupervised and reference free evaluation metric for dialog generation. arXiv preprint arXiv:2005.00456.
OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
Pálsson and Björnsson (2023) Aðalsteinn Pálsson and Yngvi Björnsson. 2023. Unveiling concepts learned by a world-class chess-playing agent. In IJCAI, pages 4864–4872.
Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Polgar (2014) J. Polgar. 2014. Judit Polgar Teaches Chess 3: A Game of Queens. Judit Polgar teaches chess. Quality Chess UK LLP.
Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
(23) Tord Romstad, Marco Costalba, Joona Kiiski, and Gary Linscott. Stockfish chess engine. https://stockfishchess.org.
Schut et al. (2023) Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, and Been Kim. 2023. Bridging the human-ai knowledge gap: Concept discovery and transfer in alphazero. arXiv preprint arXiv:2310.16410.
Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. 2018. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144.
Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Hui, L. Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of go without human knowledge. Nature, 550:354–359.
Stent et al. (2005) Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing, pages 341–351.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Yuan et al. (2021) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34.
Yuksekgonul et al. (2023) Mert Yuksekgonul, Maggie Wang, and James Zou. 2023. Post-hoc concept bottleneck models. In The Eleventh International Conference on Learning Representations.
Zang et al. (2019) Hongyu Zang, Zhiwei Yu, and Xiaojun Wan. 2019. Automated chess commentator powered by neural chess engine. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5952–5961, Florence, Italy. Association for Computational Linguistics.
Zhang* et al. (2020) Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
Zhao et al. (2019) Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. Moverscore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.
Zhong et al. (2022) Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. arXiv preprint arXiv:2210.07197.

Appendix

Appendix A Details for GCC-Eval

(a) Example prompt of relevance.

(b) Example prompt of completeness.

(a) Example prompt of clarity.

(b) Example prompt of fluency.

Figure A2: Example prompts for GCC-Eval. The blue text in the figure changes according to the experimental conditions.

The scoring prompt, which includes the expert evaluation and Auto-CoT reasoning, is illustrated in Figure A2.

Appendix B Reproduction of baselines

Although Jhamtani et al. (2018) provide the source code, the pre-processing files are missing. For a fair comparison, we align the raw files with the pre-processed files so that the reference text and chess move can be matched to the generated comments. For the same reason, we cannot reproduce Zang et al. (2019). Lee et al. (2022) do not share their source code, and because our baseline GPT-4o + expert shares the same idea, we do not reproduce it.

Appendix C Human evaluation

Every participant is Asian and fluent in reading and writing English. All participants have expert-level chess knowledge: each has a chess.com rapid rating above 1500, which corresponds to the 99.51st percentile among chess players, and the average rating is 1776. The human evaluation is conducted in a within-participant setup. For each move, each participant evaluates five versions of comments generated by five methods (i.e., four baselines and CCC), where the order of methods is randomized. A total of 50 moves were evaluated by the participants (i.e., a total of 250 comments). The evaluation takes approximately four hours to complete. Each participant was compensated with an amount equivalent to 73 USD. Our university IRB approved the evaluation plan. We explained the purpose of the research and the usage plan, and obtained consent from all participants. All evaluation results are anonymized and do not contain any personal information.
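
The randomized presentation order can be sketched as follows. This is an illustration under assumptions rather than the script used in the study: the method labels, participant identifiers, and seed are placeholders, and the sketch assumes that every participant sees all 50 moves.

```python
import random

# Placeholder labels: four baselines and CCC (the baseline names are not the paper's).
METHODS = ["baseline_1", "baseline_2", "baseline_3", "baseline_4", "CCC"]


def build_schedule(participant_ids, num_moves=50, seed=0):
    """Shuffle the method order independently for each (participant, move) pair."""
    rng = random.Random(seed)  # fixed seed so the schedule can be regenerated
    schedule = {}
    for pid in participant_ids:
        for move_idx in range(num_moves):
            order = METHODS[:]  # copy so the master list is never mutated
            rng.shuffle(order)
            schedule[(pid, move_idx)] = order
    return schedule


schedule = build_schedule(["p1", "p2"])  # hypothetical participant identifiers
comments_per_participant = sum(
    len(order) for (pid, _), order in schedule.items() if pid == "p1"
)
print(comments_per_participant)  # 50 moves x 5 methods = 250
```

Shuffling per move, rather than once per participant, prevents a method from being identified by its position on the page.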

Figure A3 and Figure A4 show the instructions and questions we used for the human evaluation.

Appendix D Interactive commentary

Figure A5 shows an example of interactive comments, starting from CCC. The initial chess commentary is generated by CCC. If there are parts of the generated comments that are unclear or difficult to understand, users can engage with the system by asking follow-up questions to clarify any ambiguous or complex parts of the commentary. Similarly, they can request additional insights, such as alternative moves or a deeper analysis of the current game position.

This interactive approach enhances knowledge transfer between the AI and users, making expert-level chess understanding more accessible. By enabling two-way communication, the functionality of LLMs is extended, transforming the model from a static generator of text into an interactive learning tool that adapts to the needs and curiosity of the user. This capability promotes a more engaging and educational experience in chess commentary, expanding the role of LLMs in expert domains.

Appendix E Chess skill evaluation details

We conduct a chess skill evaluation for LLMs. We use mate-in-one puzzles from the Lichess database (https://database.lichess.org/#puzzles) and evaluate on 1,000 puzzles. Evaluation prompts are shown in Figure A6. For GPT-4o + expert, we include expert model evaluation information in the prompt (Figure A6(a)). For GPT-4o + concept, we provide an explanation indicating that the board is in a mate-in-one situation (Figure A6(b)). For GPT-4o, GPT-4o-mini, GPT-3.5-turbo, and ChessGPT, we use a basic prompt for evaluation (Figure A6(c)).
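
A scoring loop for this kind of evaluation might look like the sketch below. It is not the evaluation code used in this paper. It assumes the puzzle file follows the Lichess CSV column layout, that the first entry in `Moves` is the opponent's setup move and the second is the mating move, and that model answers are compared in UCI notation by exact match. The CSV row is a hand-made toy example (Fool's mate), not a row from the database, and `predict` stands in for a call to the language model.

```python
import csv
import io

# Hand-made toy row in the assumed Lichess column layout; not taken from the database.
TOY_CSV = """PuzzleId,FEN,Moves,Rating,RatingDeviation,Popularity,NbPlays,Themes,GameUrl,OpeningTags
toy01,rnbqkbnr/pppp1ppp/8/4p3/8/5P2/PPPPP1PP/RNBQKBNR w KQkq - 0 2,g2g4 d8h4,600,80,90,100,mate mateIn1 oneMove opening short,,
"""


def load_mate_in_one(csv_text, limit=1000):
    """Keep at most `limit` puzzles whose themes include mateIn1."""
    puzzles = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if "mateIn1" not in row["Themes"].split():
            continue
        moves = row["Moves"].split()
        # Assumed layout: moves[0] is played by the opponent, moves[1] is the solution.
        puzzles.append({"fen": row["FEN"], "setup": moves[0], "solution": moves[1]})
        if len(puzzles) >= limit:
            break
    return puzzles


def accuracy(puzzles, predict):
    """Fraction of puzzles where the predicted UCI move equals the listed solution."""
    if not puzzles:
        return 0.0
    correct = sum(1 for p in puzzles if predict(p).strip().lower() == p["solution"])
    return correct / len(puzzles)


puzzles = load_mate_in_one(TOY_CSV)
always_queen_h4 = lambda puzzle: "d8h4"  # stub in place of an LLM call
print(accuracy(puzzles, always_queen_h4))
```

Exact matching undercounts when a position has several mating moves. A stricter check would apply the predicted move with a chess library and test whether the resulting position is checkmate.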

(a) Example prompt of GPT-4o + expert.

(b) Example prompt of GPT-4o + concept "mateIn1".

Figure A6: Example prompts for chess skill evaluation with mate-in-one problems. The blue text in figures (a) and (b) indicates the differences from figure (c).

Appendix F Licenses of artifacts

In this study, GPT-4o is used in compliance with its usage policy. ChessGPT is used under the terms of the Apache-2.0 license. The Lichess database is used according to the Creative Commons CC0 license. As there are no specific license statements for GameKnot and GAC, we treat them as being under the Creative Commons CC0 license.

All artifacts are used within their intended use.
