Abstract
This paper introduces a new strategy to enhance the trustworthiness of Short Answer Scoring (SAS) systems used in educational settings. Although the development of scoring models with high accuracy has become feasible due to advancements in machine learning methods, particularly recent Transformers, there is a risk of shortcut learning using superficial cues present in training data, leading to behaviors that contradict rubric standards and thus raising issues of model trustworthiness. To address this issue, we introduce an efficient strategy that aligns the features of responses with rubric criteria, mitigating shortcut learning and enhancing model trustworthiness. Our approach includes a detection method that employs a feature attribution method to identify superficial cues and a correction method that re-trains the model to align with annotations related to the rubric, thereby suppressing these superficial cues. Our quantitative experiments demonstrate the effectiveness of our method in consistently suppressing superficial cues, contributing to more trustworthy automated scoring of descriptive questions.
1 Introduction
Short-answer questions are a format in which learners respond to prompts with concise answers, typically consisting of several dozen words [29, 46]. This type of question offers a comprehensive evaluation of a student’s knowledge and abilities, leading to its widespread use in the educational field [36]. We show an example of a typical short-answer problem scenario in Fig. 1. Given a question about mRNA, the student provides an answer, which is then graded by the teacher. In this specific case, the student discussed only two key elements (i.e., A and B), leaving out other important key elements such as C. Thus, based on the rubric criteria and key elements, a teacher would assign the student’s answer a score of 1.
A typical scenario of short-answer questions in education. In response to a student’s answer, the teacher evaluates the response based on a rubric. The rubric strictly defines the key elements that should be included in the answer, and the teacher grades the answer accordingly. We note that this specific example shown is taken from ASAP-SAS (https://www.kaggle.com/competitions/asap-sas/), a popular dataset in Short Answer Scoring (SAS).
Although short-answer questions are widely used in education, grading them is labor-intensive, requiring a grader to compare written responses against a pre-defined rubric meticulously. Especially in school education, there is increasing attention on minimizing this assessment burden on educators [33]. Concurrently, the rise in popularity of massive open online courses (MOOCs) [18] has amplified the need for swift and interactive grading of such short-answer questions [13]. To address these challenges, Short Answer Scoring (SAS) has been studied to automate the grading of short-answer questions [3, 20, 32].
However, in the practical implementation of SAS, concerns about the trustworthiness of SAS models have become a significant obstacle [12]. Generally, the performance of a model is evaluated by measuring its scoring ability on a validation set. However, even if a model demonstrates high performance on a validation set, it is not guaranteed to function correctly on unknown inputs. Shortcut learning, in which a model learns spurious correlations in the training data and makes incorrect predictions based on superficial cues not listed in the rubric [14], can occur even in models with high accuracy [27].
Using Fig. 1, an instance of shortcut learning can be illustrated. In this example, although the answer contains the important terms (i.e., ‘mRNA’, ‘nucleus’, ‘ribosomes’) necessary to fulfill some key elements in the rubric (i.e., A and B), another key element criterion (i.e., C) is unfulfilled. The concern is that typical SAS models will pick up superficial cues (i.e., ‘first used’ and ‘create’) that co-occur in training instances with the keywords that actually matter for C (i.e., ‘triplets’, ‘codons’, ‘rRNA’). Based on such shortcut learning, the model may incorrectly assume that the criterion of key element C is met and thus score the answer higher than its true score. Such erroneous scoring can significantly damage the model’s trustworthiness and potentially lead to distrust among learners. In the context of SAS tasks, a dedicated scoring model must be trained for each problem. Consequently, responses must be collected for each problem, which may lead to situations where sufficient data is unavailable. In such cases, the likelihood of encountering spurious correlations attributable to data scarcity increases.
To address this issue, we propose a method that utilizes scoring criteria to perform this task cost-effectively and efficiently. There are pre-prepared rubrics in SAS, and the criteria for adding points are indicated within these rubrics. In actual educational settings, teachers refer to these rubrics while grading, ensuring a natural consistency between the grading results and the rubrics. We focus on this aspect and propose a novel strategy that aligns the feature variables of the answers referred to during grading with the additive elements listed in the rubric. This approach prevents shortcut learning and ensures the trustworthiness of the model.
Overview of our proposed methodology. Our approach primarily comprises two phases: detection and correction. In the detection phase, we enable efficient identification of superficial cues by measuring feature importance using feature attribution, coupled with clustering for enhanced efficiency. In the correction phase, we sample answers from clusters selected on the viewer. We achieve effective correction of superficial cues by providing a minimal number of samples with rubric-based instructional signals and retraining through weight adjustments based on the loss between the instructional signals and feature importance.
In this work, towards improving the trustworthiness of SAS models, we present the first study to address the issue of shortcut learning in SAS. We propose a new method consisting of i) a detection method that efficiently identifies superficial cues in a pre-trained scoring model, and ii) a correction method that allows for low-cost modification of these identified cues (see Fig. 2).
In the detection method described in Sect. 3, we first use feature attribution, a technique from Explainable AI, to measure the features used by the model in terms of feature importance and visualize them on the viewer. Subsequently, by comparing the rubric with the features displayed on the viewer, we efficiently identify superficial cues that are inconsistent with the rubric.
In the correction method described in Sect. 4, we target samples containing superficial cues identified by the detection method. We provide annotation information related to the rubric for some of these samples. We retrain the scoring model to minimize the cosine distance between the current feature importance and the annotations. This retraining adjusts the weights of the scoring model to suppress superficial cues. In this retraining, we utilize only a subset of answer samples containing superficial cues, thereby reducing the costs associated with annotation and learning.
Moreover, we conduct quantitative experiments in Sect. 6.2 to validate the effectiveness of our proposed method. While quantitative evaluation of shortcut learning is prone to reproducibility issues due to its susceptibility to multiple factors, we perform stable evaluations by intentionally generating datasets prone to inducing shortcut learning. Additionally, by conducting the human operations within the proposed method under specific heuristic criteria, we create a situation in which favorable results cannot be selected intentionally. Despite this, our experiments demonstrate that our proposed method consistently suppresses superficial cues.
This study is the first attempt to correct shortcut learning in the context of short-answer questions. Furthermore, while prior studies in other domains corrected shortcut learning through dataset-specific approaches, the current study rectifies shortcut learning via a novel method involving feature attribution and a loss function that utilizes minimal annotated data. This aspect constitutes a unique characteristic of this research. To facilitate research in SAS, we publicly release all of our code, experiments, and models (Footnote 1).
2 Background
2.1 Short Answer Scoring (SAS)
Short-answer questions require learners to provide brief responses, usually a few dozen words [29, 46], to assess their knowledge of science, history, and reading comprehension. This format is widely used in education because it evaluates a student’s comprehensive knowledge and abilities [36].
At the same time, since short-answer questions are a descriptive form of questioning, instructors must meticulously grade them using rubrics. This format is characterized by a higher grading cost than multiple-choice formats, which utilize answer sheets. In multiple-choice questions, machines can quickly determine correctness, but short-answer questions require teachers to review and assess each student’s response individually. Furthermore, the number of teachers in educational institutions is significantly lower than the number of students, leading to situations where a single teacher must grade the responses of dozens of students. This is a highly time-consuming and labor-intensive task for teachers, making the efficiency of the grading process a critical issue.
In this context, a field of research known as Short Answer Scoring (SAS) is dedicated to automatically grading short-answer questions. Early studies in this field adopted an approach of constructing models with hand-crafted features based on rubrics and training sets [25, 26]. While this method achieved certain results, attaining accuracy comparable to human teachers was challenging. However, with the development of machine learning technology in recent years, machine learning approaches have become mainstream [32], including studies that grade by logistic regression [6, 16, 35], prediction by recurrent neural networks [19, 24], and classification of answers by semi-supervised clustering [22]. In addition, the rapid advancement of technology, starting with the Transformer [45], has led to the development of models with practical performance [4, 20, 43]. Consequently, the technology for automatic grading of short-answer questions has made significant progress, potentially contributing to reducing teachers’ workload and enhancing educational quality.
As SAS model performance improves, the focus of SAS research is shifting from improving model performance to addressing practical deployment challenges. One such challenge is ensuring the trustworthiness of SAS models. Prior work [11, 12] tackled the quality assurance of automatic scoring results by having human experts re-grade unreliable predictions, while other studies [37, 44] attempted to enhance the interpretability of SAS models by using feature attribution. In this study, we further advance these research efforts by exploring, for the first time, methods to alleviate shortcut learning in a SAS model using feature attribution.
2.2 Shortcut Learning Through Superficial Cues
In recent years, the performance of scoring models utilizing neural networks has significantly improved, establishing a dominant position in scoring tasks. However, the black-box nature of neural networks has been widely noted [8, 17, 21, 34], implying opacity in the model’s internal operations. Due to this black-box nature, it is challenging to clarify the criteria used by the model for scoring.
From a practical standpoint, this characteristic poses severe challenges to trustworthiness. Scoring models using neural networks learn scoring methods from datasets, but there is a risk of learning inappropriate features during this process. In particular, spurious correlations in the dataset may be erroneously learned by the model. These inappropriate features, known as ‘superficial cues’ [15], can contradict the criteria set in the scoring rubric. Especially in SAS, although the rubric defines specific criteria for scoring, high-scoring responses often tend to satisfy multiple criteria simultaneously. This situation can lead to co-occurrence among criteria, posing a risk of incorporating spurious correlations into the model. Consequently, responses that should be scored zero based on the rubric might unjustly receive high scores from the model.
Research on detecting and suppressing superficial cues arising from shortcut learning exists in several domains and tasks [27, 28]. However, our study is the first to aim at suppressing superficial cues in the SAS domain. While existing studies achieve suppression of superficial cues by modifying datasets based on analyzed bias information, our research presents a novel suppression method using feature attribution and annotations related to rubrics. This method is well-suited to the SAS domain, where strict criteria exist in the form of rubrics, and is a notable characteristic of our study. Furthermore, the method we propose in this research can be adapted to any SAS task where rubrics establish scoring criteria.
2.3 Feature Attribution
Recent scoring models have been constructed using neural networks. However, due to the black-box nature of neural networks, methods are needed to interpret the model when analyzing superficial cues. Consequently, we employ feature attribution techniques [41]. In SAS, by conducting a comparative analysis of the feature importance elucidated through feature attribution and the scoring rubric [5], it is possible to verify whether the model operates according to criteria equivalent to the rubric. If the importance of a feature is predominantly allocated to elements outside the rubric, there is a high likelihood that this feature is being utilized as a superficial cue in the scoring process.
Methods have been developed that approximate the model with an interpretable surrogate based on perturbations [23, 30] or derive explanations from gradients [38, 39, 41]; this study uses Integrated Gradients (IG) [42]. IG calculates the importance of input features using the model’s gradients, and its authors have mathematically proven its validity and useful properties. Defining the scoring model as f, the input features as \(\boldsymbol{x}\), the baseline input features as \(\boldsymbol{x}'\), and the output score as y, IG is represented by the following equation.
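Using this notation, and assuming the standard form of IG, the attribution assigned to the \(i\)-th input feature is

\[
\textrm{IG}_i(\boldsymbol{x}) = (x_i - x'_i) \int_{0}^{1} \frac{\partial f\bigl(\boldsymbol{x}' + \alpha(\boldsymbol{x} - \boldsymbol{x}')\bigr)}{\partial x_i}\, \textrm{d}\alpha . \qquad (1)
\]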
Recent scoring models are primarily constructed based on neural networks, which implement weight optimization through the backpropagation of errors. Consequently, gradient computation is facilitated, and Integrated Gradients (IG) enables efficient interpretation of machine learning models. Furthermore, the scoring models addressed in this study are envisioned to be large-scale, akin to BERT [7]; hence, perturbation-based feature attribution methods [23, 30] would entail substantial computational demands. Therefore, this research adopts IG as an efficient and effective feature attribution technique.
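For illustration, a minimal PyTorch sketch of this gradient-based attribution over token embeddings is given below. It approximates the integral in Eq. (1) with a Riemann sum; the assumption that the scoring model can be called directly on an embedding tensor, and the step count, are ours and not part of the original implementation.

```python
import torch

def integrated_gradients(score_fn, input_emb, baseline_emb, steps=32):
    """Riemann-sum approximation of IG over token embeddings.

    score_fn is assumed to map an embedding tensor of shape
    (1, seq_len, hidden) to a scalar score; this is a sketch, not the
    authors' implementation.
    """
    total_grad = torch.zeros_like(input_emb)
    for alpha in torch.linspace(0.0, 1.0, steps):
        interp = (baseline_emb + alpha * (input_emb - baseline_emb)).detach()
        interp.requires_grad_(True)
        score = score_fn(interp.unsqueeze(0)).squeeze()
        grad, = torch.autograd.grad(score, interp)
        total_grad += grad
    attributions = (input_emb - baseline_emb) * (total_grad / steps)
    # Token-level importance: aggregate over the hidden dimension.
    return attributions.sum(dim=-1)
```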
3 Methods for Detecting Superficial Cues
As mentioned in Sect. 2.2, the presence of superficial cues can potentially exert a significant negative impact on the trustworthiness of the scoring model. Consequently, we have devised a method to swiftly identify and eliminate superficial cues inherent in the scoring model. This section details the detection techniques for efficiently discovering superficial cues.
3.1 Clustering of Feature Importance
Feature attribution must be computed for each sample, necessitating the examination of many samples for validation. Consequently, a naive approach requires the validation of trustworthiness for a substantial volume of samples. This characteristic implies a significant verification cost each time a scoring model is deployed. Therefore, to efficiently identify superficial cues using feature attribution, we propose clustering to aggregate similar samples.
Short-answer questions are an information encapsulation problem, where the ideal response often comprises a substring from the reference text. Hence, many answer samples exhibit textual similarity. Focusing on this characteristic, we propose significantly reducing the number of answer samples subjected to superficial cue checks by employing clustering.
To implement clustering, it is necessary to define a distance relation over answers that incorporates feature importance. However, there is no clear guideline for defining such a distance over samples consisting of text and per-token importance. Therefore, we define the distance based on human intuition: two answers are considered close if the same features are assigned high importance in both. To realize this intuition, we weight word vectors generated on a word-count basis by the importance of features. Ultimately, we compute the distance between answers, including feature importance, as the cosine distance between the weighted word vectors.
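A sketch of this distance construction is shown below; the vocabulary mapping and the use of absolute importance values are illustrative assumptions.

```python
import numpy as np

def importance_weighted_vector(tokens, importances, vocab):
    """Count-based word vector in which each token's count is weighted by
    its (absolute) feature importance. `vocab` maps a word to its index
    and is an illustrative assumption."""
    vec = np.zeros(len(vocab))
    for tok, imp in zip(tokens, importances):
        if tok in vocab:
            vec[vocab[tok]] += abs(imp)
    return vec

def cosine_distance(u, v, eps=1e-9):
    """Cosine distance between two weighted word vectors."""
    return 1.0 - float(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
```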
Any clustering technique can be selected based on the desired properties; in this study, we employ agglomerative hierarchical clustering [2]. The advantage of this method is its ability to explicitly represent the relationships between clusters in a dendrogram, facilitating an intuitive understanding of these relationships. Additionally, since determining the number of clusters in advance is challenging, the ability to adjust the number of clusters dynamically is desirable. Agglomerative hierarchical clustering meets this requirement, as it does not necessitate recalculation when the number of clusters changes.
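A minimal SciPy sketch of this clustering step follows; the average-linkage criterion and the variable names are assumptions, since the paper does not specify them.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_weighted_vectors(X, n_clusters=8):
    """Agglomerative hierarchical clustering over importance-weighted
    word vectors X of shape (n_samples, vocab_size)."""
    distances = pdist(X, metric="cosine")        # pairwise cosine distances
    Z = linkage(distances, method="average")     # build the cluster hierarchy
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")  # cut into clusters
    # Z can be passed to scipy.cluster.hierarchy.dendrogram to draw the
    # dendrogram shown on the viewer.
    return Z, labels
```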
A snapshot of the detection system we developed. The system primarily comprises four components: (1) The Problem Selection UI (yellow), where users can select the questions to be detected. (2) The Condition Selection UI (green) allows for modifying parameters such as the number of clusters. Additionally, pressing the “clustering” button here displays the results. (3) The Information UI (blue) displays information related to clusters and rubrics. (4) The results UI (orange) displays samples and feature importance clustered for detecting superficial cues. (Color figure online)
3.2 Viewer System
Well-designed tooling is also critical for detecting superficial cues efficiently. Consequently, we have developed a Graphical User Interface (GUI) that enables intuitive detection of superficial cues. Figure 3 presents an overview of the GUI we developed.
The GUI primarily comprises four components: (1) Problem Selection UI (yellow), (2) Condition Selection UI (green), (3) Information UI (blue), and (4) Result UI (red). In the Problem Selection UI, users can select the scoring model to be verified from the list on the left. The Condition Selection UI allows for modifying experimental conditions such as scoring items, scores of answers, and the number of clusters. Additionally, the results under the specified conditions will be displayed by pressing the blue clustering button on the right. The Information UI displays the elbow diagram and rubric information under the conditions set in the Condition Selection UI. The elbow diagram can be used to determine the optimal number of clusters, and the rubric information can be utilized to identify superficial cues by comparing it with feature importance. The Result UI displays the aggregated results of feature importance through clustering. On the left, a dendrogram is shown, which can be used to examine the relationships between clusters. In its initial state, three samples from each cluster are displayed. It can be expanded by clicking to review all answers belonging to each cluster. By comparing the clustering results in (4) with the rubric information in (3), we can efficiently perform checks for superficial cues.
Moreover, emphasizing human verification through viewer visualization is crucial, primarily because rubrics are not always flawless. Occasionally, students may provide exceptional responses not covered by the rubric, leading teachers to award high scores. In such instances, the rubric requires modification, and involving humans in the detection process enables the accommodation of these cases.
4 Methods for Correcting Superficial Cues
As demonstrated in Sect. 2.2, the presence of superficial cues can significantly diminish the trustworthiness of the scoring model. Therefore, it is imperative to rectify any detected superficial cues promptly. This study proposes a method for correcting superficial cues swiftly and cost-effectively.
4.1 Retraining with Proposed Loss Function
To guide the model toward appropriate feature importance, we introduce an additional term into the loss function that adjusts the model’s weights. Typically, scoring models are trained as regression problems to predict scores. In this context, the most fundamental approach is to employ the Mean Squared Error (MSE) as the loss function and optimize it to train the model. In addition to this MSE, we incorporate a loss term based on feature importance. Let the scoring model’s output score be y, the expected output score be \(\hat{y}\), the feature importance be \(\boldsymbol{e}\), the annotation sequence be \(\boldsymbol{j}\), and the cosine distance function be \(\textrm{S}_{\textrm{c}}\). The following equation expresses the proposed loss function.
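A formulation consistent with this description (any weighting coefficient between the two terms is omitted here as an assumption) is

\[
\mathcal{L} = (y - \hat{y})^2 + \textrm{S}_{\textrm{c}}(\boldsymbol{e}, \boldsymbol{j}) . \qquad (2)
\]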
In this context, the annotation sequence \(\boldsymbol{j}\) is a vector of the same length as \(\boldsymbol{e}\), containing information based on the rubric: it assigns a value of 1 to input tokens corresponding to the rubric and 0 otherwise. Introducing this loss function enables the model to be retrained to possess the desired feature importance. The rationale for concurrently optimizing Mean Squared Error (MSE) in the first term is to prevent a decline in the predictive performance of the scoring model. However, since only a subset of samples is subject to retraining, a small dataset randomly sampled from the original dataset is also used for concurrent training with the MSE loss to mitigate performance degradation due to catastrophic forgetting [9].
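A minimal PyTorch sketch of this combined objective follows; the equal weighting of the two terms and the tensor layout are assumptions of the sketch.

```python
import torch.nn.functional as F

def rubric_aligned_loss(pred_score, gold_score, importance, annotation):
    """Score MSE plus cosine distance between per-token feature importance
    and the binary (0/1, float) rubric annotation, both of shape
    (batch, seq_len). Equal weighting of the terms is an assumption."""
    mse = F.mse_loss(pred_score, gold_score)
    cos_dist = 1.0 - F.cosine_similarity(importance, annotation, dim=-1)
    return mse + cos_dist.mean()
```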
4.2 Reduction of Annotated Data by Sampling
To retrain the scoring model using the loss function shown in Eq. (2), annotating answer samples with a sequence corresponding to the rubric is a costly task. Therefore, to streamline this process, we employ sampling using clusters. If clustering functions adequately, the answers within the same cluster will be similar. Consequently, we believe it is not always necessary to retrain with every sample in a cluster, even if it contains superficial cues. By sampling only a minimal number of answer samples from within the cluster, we can make the annotation process more efficient. Sampling is implemented by randomly extracting a predetermined number of samples from each cluster.
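A sketch of this per-cluster sampling step is given below; the per-cluster count and the data layout are illustrative assumptions.

```python
import random

def sample_for_annotation(cluster_to_answers, n_per_cluster=8, seed=0):
    """Randomly draw a fixed number of answers from each selected cluster
    for rubric annotation. `cluster_to_answers` maps a cluster id to a
    list of answer ids (an assumed layout)."""
    rng = random.Random(seed)
    return {
        cid: rng.sample(answers, min(n_per_cluster, len(answers)))
        for cid, answers in cluster_to_answers.items()
    }
```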
5 Experimental Settings
Generally, superficial cues are inherently unstable, influenced by the dataset and model used and even by random seeds. Therefore, experiments using standard datasets have significant issues with the stability and reproducibility of results. Even if retraining for correction is successful, the influence of random seeds cannot be ruled out. Hence, we conduct experiments on datasets containing spurious correlations prone to shortcut learning in order to evaluate the performance of the proposed method under conditions where superficial cues are firmly and stably expressed. Through this procedure, we verify whether our methodology can suppress superficial cues in a reproducible manner.
Furthermore, we assume that the developers of the scoring model will subjectively perform tasks such as defining criteria for what is considered a superficial cue and selecting clusters that contain superficial cues. However, performing similar operations while evaluating the proposed method makes it possible to cherry-pick the results. Therefore, in this validation experiment, we resolve this issue by establishing heuristic criteria for the human procedures conducted within the proposed method.
5.1 Dataset and Model Settings
In this study, the RIKEN Descriptive DatasetFootnote 2 [10, 24, 31] is adopted as the dataset for SAS. This dataset is constructed from students’ responses to multiple problems and is entirely written in Japanese. A notable characteristic is its inclusion of annotation information related to rubrics. Given that our correction method requires minimal human annotation, this dataset is highly suitable. Additionally, in Sect. 6.2, when quantitatively verifying the effectiveness of the proposed method, we utilize this annotation information.
The design of the scoring model is based on BERT [7], incorporating an attention mechanism [1] just before the output layer, following prior research [24]. We train this scoring model on a dataset with spurious correlations, as described in Sect. 5.2, and use it in subsequent experiments. It should be noted that the proposed method applies to any neural model capable of gradient calculation via backpropagation, thus extending its compatibility beyond BERT to other models.
Illustration of the procedure for creating biased datasets prone to inducing shortcut learning. We create a biased dataset by sequentially executing steps (i) to (v) as depicted in the figure. This modification ensures that data containing the designated word always scores high and data without it scores low.
5.2 Creating Datasets with Spurious Correlations
Shortcut learning is known to be strongly induced by spurious correlations [15]. Spurious correlations occur when a specific input feature unintentionally correlates with the output due to biases in the data or its inherent properties [40]. Especially in SAS datasets, it is frequently observed that high-scoring answers satisfy multiple scoring criteria simultaneously while low-scoring answers fail to meet any. Such biases in data become spurious correlations, contributing to the emergence of shortcut learning.
Utilizing this characteristic to evaluate the performance of our proposed method, we create a dataset prone to shortcut learning by removing a portion of the data from the dataset, thereby inducing data bias. The creation process is as follows: (i) Count the frequency of every word in the dataset. (ii) Exclude words that are part of the scoring rubric from the frequency-counted words and, for simplicity, also exclude words other than nouns. (iii) Designate one word that occurs in at most half of the answers as a candidate superficial cue. (iv) Among high-scoring answers, remove those that do not contain the candidate superficial cue. (v) Among low-scoring answers, remove those that contain the candidate superficial cue.
In Fig. 4, we illustrate the process of creating a dataset that is prone to inducing shortcut learning. This modification ensures that data containing the designated word always scores high and data without it scores low. In other words, a spurious correlation between the scoring score and the designated word is expected to occur, becoming a superficial cue. The rationale for selecting words with approximately half the occurrence frequency is to induce a moderate change in the dataset. If a high-frequency word is designated, the data to be removed becomes too small, causing minimal fluctuation in the dataset. Conversely, designating a low-frequency word results in a significantly smaller dataset. Therefore, words with a moderate frequency are chosen as candidates for superficial cues. For each problem and scoring item, five candidates for superficial cues are selected, and a biased dataset is created for each candidate word.
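As an illustration of steps (iv) and (v) above, a minimal sketch is given below; the representation of samples as (tokens, score) pairs and the score threshold separating high from low scores are assumptions.

```python
def make_biased_dataset(samples, cue_word, high_threshold):
    """Keep high-scoring answers only if they contain the candidate cue
    word, and low-scoring answers only if they do not (steps (iv)-(v)).
    `samples` is an assumed list of (tokens, score) pairs."""
    biased = []
    for tokens, score in samples:
        has_cue = cue_word in tokens
        if (score >= high_threshold and has_cue) or (score < high_threshold and not has_cue):
            biased.append((tokens, score))
    return biased
```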
Note that the features utilized by the model are determined through training during the learning phase. Consequently, there are instances where features adhering to the rubric are employed instead of the intended superficial cues. Such occurrences arise because we attempt to induce shortcut learning solely through co-occurrence, without editing the content of the data.
5.3 Heuristic Decision of Cluster Size
We assume that human developers visually identify superficial cues in the proposed detection method. Furthermore, we hypothesize that human developers make appropriate choices in determining the number of clusters during clustering and selecting clusters for retraining. However, in experiments to verify the effectiveness of retraining in the proposed method, if we specify the number of clusters and the clusters to be used for retraining, it becomes possible to obtain cherry-picked results. Therefore, we ensure the validity of the results by heuristically selecting clusters based on specific criteria.
When determining the number of clusters heuristically, humans desire that each cluster is sufficiently separated from others and that the constituent samples within each cluster are as closely approximated as possible. This requirement implies that the distance between samples within each cluster must be sufficiently small. From this, we determine the number of clusters during clustering by utilizing the distance relationships among samples within a cluster. Specifically, the optimal number of clusters is identified as the number at which the rate of decrease in the average distance between samples within each cluster changes most significantly. This heuristic method of determining the number of clusters is known as the elbow method. When the number of clusters is k, and the average distance to the centroid within a cluster is denoted as \(a_k\), the optimal number of clusters \(\hat{k}\) is estimated using the second-order difference as shown in the following equation.
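With \(a_k\) as defined above, a formulation consistent with this second-order-difference criterion (reconstructed here under that assumption) is

\[
\hat{k} = \mathop{\arg\max}_{k}\, \bigl( a_{k-1} - 2a_k + a_{k+1} \bigr) . \qquad (3)
\]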
The minimum number of clusters that can be produced by clustering is 2, but because the second-order difference requires neighboring values, the smallest number of clusters we consider is 3. Furthermore, to prevent individual clusters from becoming too small, the maximum number of clusters used in the experiment is set to 10.
5.4 Heuristic Selection of Clusters
In the proposed method, it is assumed that humans select clusters dominated by superficial cues for retraining. This selection is also made based on heuristic criteria. When humans make the selection, it is presumed that they check the feature importance assigned to samples within the cluster, compare it with the rubric, and verify whether superficial cues are included in the feature importance. We simulate this process heuristically. Using the annotation information from the RIKEN Descriptive Dataset, we check whether the highest feature importance value for each sample within a cluster is assigned to the annotation span corresponding to the scoring criteria; in other words, we check whether high importance is allocated to features outside the rubric. If most samples within a cluster have their highest feature importance outside the annotation, most of the samples in that cluster do not conform to the rubric, and we identify the cluster as containing superficial cues and select it for retraining.
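This heuristic can be sketched as follows; the data layout and the 0.5 majority threshold are assumptions of the sketch.

```python
import numpy as np

def contains_superficial_cues(cluster_samples, majority=0.5):
    """Flag a cluster for retraining when most of its samples place their
    highest feature importance outside the rubric annotation span.
    Each sample is (importances, annotation_mask), both of length seq_len."""
    outside = 0
    for importances, annotation_mask in cluster_samples:
        top_token = int(np.argmax(importances))
        if annotation_mask[top_token] == 0:  # top token lies outside the rubric span
            outside += 1
    return outside / len(cluster_samples) > majority
```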
Our experiment’s schematic representation involves the following steps: 1) Initially, we create a biased dataset conducive to shortcut learning and train a scoring model using this dataset. 2) Subsequently, we apply our detection method to the trained scoring model to analyze superficial cues and ascertain their detectability. 3) If superficial cues are detected, we retrain the model using a subset of samples and annotations through our correction method. 4) Finally, we reapply the detection method to the retrained model to visualize the results concerning superficial cues and verify their elimination through retraining.
6 Experimental Results
In this section, we conduct experimental verification of the performance of the proposed method. Specifically, we investigate whether the proposed detection method can identify superficial cues and whether the proposed correction method can effectively suppress these superficial cues. Figure 5 illustrates the procedure of our validation experiment. In Sect. 6.1, we verify the efficacy of our proposed method through a step-by-step analysis of a single validation experiment, while in Sect. 6.2, we quantitatively validate the effectiveness of the correction of superficial cues through multiple validation experiments.
This figure demonstrates the operation of our proposed detection method. The features highlighted in red are presumed to have been used by the model for scoring, and their translations into English are displayed in orange for the reader’s comprehension. Information and translations related to “green garden,” a key element of the rubric, are also shown in blue. In this experiment, the scoring model was trained on a generated dataset in which “organisms” readily serves as a superficial cue. (Color figure online)
6.1 Demonstration of the Proposed Methods in a Single Case
This section demonstrates the process of detecting and correcting superficial cues embedded in the scoring model through the proposed method, followed by retraining.
In the demonstration of this section, we utilized the answer scripts for problem Y14_2115 in the RIKEN Descriptive Dataset as the data source. Analysis of results for other problems is documented in Sect. 6.2. Furthermore, we generated a dataset that induces shortcut learning based on the procedure outlined in Sect. 5.2. During this process, the word “organisms,” one of the five words designated as candidates for superficial cues under rubric scoring item A, was used for the demonstration. Subsequently, this generated dataset was employed for training the scoring model. After training the scoring model, we applied the detection method explained in Sect. 3. Specifically, we calculated the feature importance for each answer script using Integrated Gradients (IG), clustered similar answer scripts and feature importances together, and displayed them on the viewer system. Figure 6 shows a snapshot of the viewer system. Although the number of clusters was set to eight according to the rate of decrease in average within-cluster distance discussed in Sect. 5.3, the figure illustrates only four extracted clusters. The figure shows that the designated word “organisms” was assigned significant importance, as visible on the viewer. Therefore, it can be deduced that “organisms” was learned as a superficial cue by the scoring model. Additionally, the clustering was successful, as samples with similar highlighted features were aggregated within the same cluster. Through this demonstration, we have confirmed that our detection method can efficiently identify superficial cues as intended.
Next, following the criteria established in Sect. 5.4, clusters containing superficial cues were selected, and retraining was conducted using the method described in Sect. 4. The number of samples drawn from each cluster for this retraining was set to eight; this number was convenient because we utilized four GPUs for parallel training of the scoring model, making any number divisible by four most suitable. Consequently, 24 samples were targeted for retraining, to which we applied binary annotation information based on the rubric before executing the retraining. After retraining, feature importance visualization was again performed using the detection method. The results are presented in Fig. 7. This figure shows that the importance of “organisms,” which was a superficial cue learned by the model before retraining, has been eliminated through retraining. Furthermore, to ascertain whether the scoring model’s performance was maintained after retraining, we measured the scoring performance of the model before and after modification. The results revealed that the scoring performance (QWK) of the pre-retraining scoring model was 0.996, and that of the post-retraining model was 0.993. Hence, we confirmed that the original scoring performance was almost entirely retained even after applying the proposed method. This demonstration shows that our correction method is capable of suppressing superficial cues. In the next section, we present the proposed method’s results on a broader range of datasets.
We retrained the scoring model using our proposed correction method and reanalyzed shortcut learning through the detection technique. The results confirm that the features utilized by the model for scoring, highlighted in red, align with the rubric-based annotation information indicated in blue. (Color figure online)
We present the results of counting the instances containing superficial cues in the samples of the generated dataset, both before and after retraining. The gray graph represents the count of superficial cues before retraining, while the orange graph indicates the count after retraining. The X-axis represents the experimental results for each dataset: Y14_xxxx denotes the underlying question of the dataset, the letter that follows indicates the key element subject to scoring, and the final number signifies the index of the word considered a candidate superficial cue, ranging from 0 to 4. (Color figure online)
Illustration of the variation in scoring performance (QWK) before and after the retraining of the scoring model. QWK, an index used in scoring tasks, has an optimal score of 1.0. The upper section represents the performance on known data used during training, while the lower section pertains to the performance on unknown data. The gray graph denotes the performance of the scoring model prior to retraining, and the orange graph represents the performance post-retraining. The X-axis represents the experimental results for each dataset: Y14_xxxx denotes the underlying question of the dataset, the letter that follows indicates the key element subject to scoring, and the final number signifies the index of the word considered a candidate superficial cue, ranging from 0 to 4. (Color figure online)
6.2 Quantitative Evaluation of the Proposed Methods
Next, we will examine whether suppressing superficial cues is feasible across multiple situations. Additionally, we will quantitatively analyze the effectiveness of suppressing superficial cues.
Initially, we conducted detection experiments targeting multiple situations. We chose Y14_1213, Y14_1224, Y14_2115, Y14_2123, Y14_2214, and Y14_2223 from the RIKEN Descriptive Dataset. For the experiments, we used only generated datasets containing more than 100 instances, focusing on the five words presented as candidates for superficial cues.
Subsequently, similar to Sect. 6.1, we visualized superficial cues using our proposed detection method. Then, we selected clusters based on the criteria outlined in Sect. 5.4. The results showed that only nine experimental settings had one or more clusters in which the number of answer samples containing superficial cues exceeded the sampling count. As mentioned in Sect. 5.2, the manipulation of the dataset involved only the removal of approximately half of the answer samples and did not involve stronger interventions such as direct alteration of input sentences or output scores. This result suggests that, in most settings, the scoring model was trained correctly, with feature importance assigned to the original rubric criteria.
Afterwards, we applied our proposed correction method to the nine experimental settings. To determine whether superficial cues were suppressed through retraining, we counted the samples in which the highest feature importance was assigned to text spans outside the rubric, considering them to contain superficial cues, and investigated whether their number changed before and after retraining. If superficial cues were suppressed through retraining, the number of samples with high feature importance assigned to spans outside the rubric should decrease. This investigation method is consistent with the policy for determining the presence of superficial cues in Sect. 5.4. Figure 8 illustrates the change in the number of samples containing superficial cues before and after retraining. The figure reveals a significant reduction in samples containing superficial cues post-retraining. Additionally, superficial cues diminished in far more samples than the at most 80 samples (when the number of clusters is set to 10) used for retraining with our correction method, implying that our approach effectively suppresses superficial cues even in data not directly used for retraining. These experimental results validate the efficiency gained through sampling.
Additionally, similar to Sect. 6.1, we investigated whether the scoring model’s performance (QWK) declined before and after retraining. The results in Fig. 9 indicate that retraining did not significantly degrade original scoring performance. Therefore, our proposed method effectively suppresses superficial cues without compromising the performance of the scoring model, as confirmed by quantitative experiments.
7 Conclusion
In this study, we proposed a method to efficiently identify superficial cues stemming from shortcut learning in trained scoring models for the task of Short Answer Scoring (SAS). The proposed method consists of two components: a detection method for identifying superficial cues and a correction method for eliminating them.
In the detection method, we employ feature attribution techniques to calculate the feature importance used by the model, and efficiently identify superficial cues through a viewer system. In the correction method, for some samples containing identified superficial cues, we provide annotation information related to the rubric, and suppress superficial cues efficiently by minimizing the cosine distance between feature importance and annotations, thereby adjusting the weights of the scoring model.
Furthermore, to ensure the reproducibility of the validation experiments, we verified the effectiveness of the proposed method using a generated dataset that induces shortcut learning and a controlled procedure. Through the validation experiments, we confirmed that our proposed method possesses the intended detection and correction capabilities in an environment that guarantees reproducibility without compromising the original scoring performance.
Our research represents the inaugural study addressing the issue of shortcut learning in SAS. Our contributions will enable the development of scoring models that are not only more trustworthy but also consistent with their rubrics.
A limitation of this study is that we could not conduct experiments with the proposed method in practical settings; thus, our findings are limited to demonstrating its potential effectiveness. As for future prospects, we are considering using the proposed method in grading models for open online courses in practical settings. This approach could address the issue of shortcut learning in real-world scenarios and allow us to analyze how enhancing the trustworthiness of grading models through the correction of superficial cues affects learners.
References
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate, September 2014
Bridges, C.C.: Hierarchical cluster analysis. Psychol. Rep. 18(3), 851–854 (1966)
Burstein, J., Kaplan, R., Wolff, S., Lu, C.: Using lexical semantic techniques to classify free-responses. In: Breadth and Depth of Semantic Lexicons (1996)
Camus, L., Filighera, A.: Investigating transformers for automatic short answer grading. In: Bittencourt, I.I., Cukurova, M., Muldner, K., Luckin, R., Millán, E. (eds.) AIED 2020. LNCS (LNAI), vol. 12164, pp. 43–48. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52240-7_8
Cardozo, S., et al.: Explainer divergence scores (EDS): some Post-Hoc explanations may be effective for detecting unknown spurious correlations (2022)
Del Gobbo, E., Guarino, A., Cafarelli, B., Grilli, L.: GradeAid: a framework for automatic short answers grading in educational contexts-design, implementation and evaluation. Knowl. Inf. Syst. 1–40 (2023)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding, pp. 4171–4186, June 2019
Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning, February 2017
French, R.M.: Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3(4), 128–135 (1999)
Funayama, H., Asazuma, Y., Matsubayashi, Y., Mizumoto, T., Inui, K.: Reducing the cost: cross-prompt pre-finetuning for short answer scoring. In: Wang, N., Rebolledo-Mendez, G., Matsuda, N., Santos, O.C., Dimitrova, V. (eds.) Artificial Intelligence in Education. AIED 2023. LNCS, vol. 13916, pp. 78–89. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-36272-9_7
Funayama, H., et al.: Preventing critical scoring errors in short answer scoring with confidence estimation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp. 237–243. Association for Computational Linguistics, Online, July 2020
Funayama, H., Sato, T., Matsubayashi, Y., Mizumoto, T., Suzuki, J., Inui, K.: Balancing cost and quality: an exploration of human-in-the-loop frameworks for automated short answer scoring. In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., Dimitrova, V. (eds.) Artificial Intelligence in Education. AIED 2022. LNCS, vol. 13355, pp. 465–476. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-11644-5_38
Galhardi, L.B., Brancher, J.D.: Machine learning approach for automatic short answer grading: a systematic review. In: Simari, G.R., Fermé, E., Gutiérrez Segura, F., Rodríguez Melquiades, J.A. (eds.) IBERAMIA 2018. LNCS (LNAI), vol. 11238, pp. 380–391. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03928-8_31
Gao, J., Lanchantin, J., Soffa, M.L., Qi, Y.: Black-Box generation of adversarial text sequences to evade deep learning classifiers. In: 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56, May 2018
Geirhos, R., et al.: Shortcut learning in deep neural networks, April 2020
Gomaa, W.H., Fahmy, A.A.: Ans2vec: a scoring system for short answers. In: Hassanien, A.E., Azar, A.T., Gaber, T., Bhatnagar, R., F. Tolba, M. (eds.) AMLTA 2019. AISC, vol. 921, pp. 586–595. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-14118-9_59
Hassija, V., et al.: Interpreting Black-Box models: a review on explainable artificial intelligence. Cognit. Comput. (2023)
Knox, J.: Massive open online courses (MOOCs). In: Peters, M.A. (eds.) Encyclopedia of Educational Philosophy and Theory, pp. 1372–1378, LNCS. Springer, Singapore (2017). https://doi.org/10.1007/978-981-287-588-4_219
Kumar, S., Chakrabarti, S., Roy, S.: Earth mover’s distance pooling over Siamese LSTMs for automatic short answer grading. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2046–2052. IJCAI’17, AAAI Press, August 2017
Kumar, Y., Aggarwal, S., Mahata, D., Shah, R.R., Kumaraguru, P., Zimmermann, R.: Get IT scored using AutoSAS – an automated system for scoring short answers. AAAI 33(01), 9662–9669 (2019)
Lipton, Z.C.: The mythos of model interpretability, June 2016
Lui, A.K.F., Ng, S.C., Cheung, S.W.N.: A framework for effectively utilising human grading input in automated short answer grading. Int. J. Mob. Learn. Organ. 16(3), 266 (2022)
Lundberg, S., Lee, S.I.: A unified approach to interpreting model predictions, May 2017
Mizumoto, T., et al.: Analytic score prediction and justification identification in automated short answer scoring, pp. 316–325, August 2019
Mohler, M., Bunescu, R., Mihalcea, R.: Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 752–762. Association for Computational Linguistics, Portland, Oregon, USA, June 2011
Mohler, M., Mihalcea, R.: Text-to-text semantic similarity for automatic short answer grading. In: Lascarides, A., Gardent, C., Nivre, J. (eds.) Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pp. 567–575. Association for Computational Linguistics, Athens, Greece, March 2009
Nauta, M., Walsh, R., Dubowski, A., Seifert, C.: Uncovering and correcting shortcut learning in machine learning models for skin cancer diagnosis. Diagnostics (Basel) 12(1) (2021)
Ou, S., et al.: Erratum: author correction: machine learning model to project the impact of COVID-19 on US motor gasoline demand. Nat. Energy 5(12), 1051–1052 (2020)
Rademakers, J., Ten Cate, T.J., Bär, P.R.: Progress testing with short answer questions. Med. Teach. 27(7), 578–582 (2005)
Ribeiro, M.T., Singh, S., Guestrin, C.: Why should i trust you?: explaining the predictions of any classifier, February 2016
RIKEN: RIKEN dataset for short answer assessment (2020)
Riordan, B., Horbach, A., Cahill, A., Zesch, T., Lee, C.M.: Investigating neural architectures for short answer scoring. In: Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 159–168. Association for Computational Linguistics, Stroudsburg, PA, USA (2017)
Roy, S., Narahari, Y., Deshmukh, O.D.: A perspective on computer assisted assessment techniques for short free-text answers. In: Ras, E., Joosten-ten Brinke, D. (eds.) CAA 2015. CCIS, vol. 571, pp. 96–109. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27704-2_10
Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019)
Saha, S., Dhamecha, T.I., Marvaniya, S., Sindhgatta, R., Sengupta, B.: Sentence level or token level features for automatic short answer grading?: use both. In: Penstein Rosé, C., et al. (eds.) AIED 2018. LNCS (LNAI), vol. 10947, pp. 503–517. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93843-1_37
Sam, A.H., et al.: Very-short-answer questions: reliability, discrimination and acceptability. Med. Educ. 52(4), 447–455 (2018)
Sato, T., Funayama, H., Hanawa, K., Inui, K.: Plausibility and faithfulness of feature attribution-based explanations in automated short answer scoring. In: Rodrigo, M.M., Matsuda, N., Cristea, A.I., Dimitrova, V. (eds.) Artificial Intelligence in Education. AIED 2022. LNCS, vol. 13355, pp. 231–242. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-11644-5_19
Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences, April 2017
Shrikumar, A., Greenside, P., Shcherbina, A., Kundaje, A.: Not just a black box: learning important features through propagating activation differences, May 2016
Simon, H.A.: Spurious correlation: a causal interpretation. J. Am. Stat. Assoc. 49(267), 467–479 (1954)
Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps, December 2013
Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks, March 2017
Sung, C., Dhamecha, T.I., Mukhi, N.: Improving short answer grading using transformer-based pre-training. In: Isotani, S., Millán, E., Ogan, A., Hastings, P., McLaren, B., Luckin, R. (eds.) AIED 2019. LNCS (LNAI), vol. 11625, pp. 469–481. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-23204-7_39
Tornqvist, M., Mahamud, M., Mendez Guzman, E., Farazouli, A.: ExASAG: explainable framework for automatic short answer grading. In: Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), pp. 361–371. Association for Computational Linguistics, Toronto, Canada, July 2023
Vaswani, A., et al.: Attention is all you need, June 2017
Weigle, S.C., Yang, W., Montee, M.: Exploring reading processes in an academic reading test using Short-Answer questions. Lang. Assess. Q. 10(1), 28–48 (2013)
Acknowledgements
This research was supported by JSPS Grant-in-Aid for Scientific Research 22H0-0524 and JST Next-Generation Challenging Researchers Program JPMJSP2114. We also appreciate the members of the Tohoku NLP group for their frequent participation in discussions during the research.