Unsupervised Multiple Choices Question Answering
via Universal Corpus

Abstract

Unsupervised question answering is a promising yet challenging task, which alleviates the burden of building large-scale annotated data in a new domain. It motivates us to study the unsupervised multiple-choice question answering (MCQA) problem. In this paper, we propose a novel framework designed to generate synthetic MCQA data based solely on contexts from the universal domain, without relying on any form of manual annotation. Possible answers are extracted and used to produce related questions; we then leverage both named entities (NE) and knowledge graphs to discover plausible distractors and form complete synthetic samples. Experiments on multiple MCQA datasets demonstrate the effectiveness of our method.

Index Terms— Natural Language Processing, Unsupervised Multiple Choices Question Answering, Knowledge Graphs

1 INTRODUCTION

Question Answering (QA) is an important topic in natural language understanding [1, 2, 3]. Within QA, Multiple Choices Question Answering (MCQA) tasks require the model to select the answer from a set of answer candidates by employing reasoning [4, 5, 6, 3, 7]. A common approach is fine-tuning a pre-trained language model on a task-specific dataset [8]. However, such task-specific datasets are scarce, as they are only available for limited domains and languages [9]. This means that a large number of annotated samples must be collected before this process can be applied to a new domain, which is time-consuming and resource-intensive [10].

Recently, some unsupervised methods have been proposed for Extractive Question Answering (EQA) tasks. For example, Lewis et al. [9] explored several unsupervised methods for generating question-answer pairs and showed that models trained on the obtained data perform comparably to those trained on the original data. Fabbri et al. [8] and Li et al. [11] further extended this idea with template-based question generation and iterative data refinement, but their methods are still only applicable to EQA tasks. There have also been some attempts at MCQA without supervision. Liu and Lee [12] assumed the absence of correct answer labels, but directly trained a QA model on the contexts, questions, and answer candidate sets. Ren and Zhu [13] focused on distractor generation, trying to construct a complete sample from the given context, question, and correct answer. Nevertheless, these methods still depend on a certain amount of data in the target domain, such as the contexts and questions, which limits their application scenarios.

In this paper, we propose a two-stage unsupervised MCQA framework for a special case in which no labeled sample is available, only a universal corpus. We aim to construct natural questions, correct answers, and related contexts in an unsupervised manner, and then to generate plausible and reliable distractors to complete the answer candidate set. Motivated by the recent progress in Unsupervised Extractive Question Answering [9], we generate question-answer pairs in the first stage. Named entities (NE) are extracted from the context and treated as the “correct” answers. Questions are then generated in a cloze-filling way via unsupervised machine translation models, yielding a series of QA pairs. In the second stage, we introduce answer distractors. Here we propose a hybrid method that generates high-quality distractors with the aid of both NEs and knowledge graphs (KGs); these distractors are used to form the answer candidate set. In experiments, we show that our unsupervised MCQA method achieves reasonable results using RoBERTa [14] as the backbone. We also illustrate that the quality of answer distractors matters, and that our hybrid generation approach contributes to better performance.

Fig. 1: An overview of our method. In the first stage, we extract the answers $a$ from the context $c$, then generate their corresponding questions $q$. In the second stage, we use a hybrid method, KG-NE, to generate distractors, thus building the answer candidate set $\mathcal{C}$.

Our contributions are summarized as follows. Firstly, we are among the first to study the unsupervised MCQA task without data in the target domain, and we propose a two-stage approach consisting of QA pair generation and distractor generation. Secondly, our extensive experiments verify the validity of our method, as well as the impact of different answer distractor generation methods.

2 METHOD

Previous works in unsupervised MCQA [13, 15, 16] usually assume the availability of a certain amount of data in the target domain, such as questions, answer candidates or correct answers, which lowers their applicability. Our setting is more challenging but meaningful: no target data is given. We are required to construct a set of MCQA samples using only the contexts from a universal corpus. Each sample consists of the context $c$, the question $q$, and the set of answer candidates $\mathcal{C}$ in which the correct answer $a$ is labeled. Figure 1 shows an overview of our unsupervised MCQA method. It consists of two stages: 1) we generate the question $q$ and its corresponding answer $a$ in an extractive way; 2) we generate distractors to construct the answer candidate set $\mathcal{C}$, thus obtaining an MCQA sample.
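
As a minimal sketch of the sample structure described above, the following Python dataclass holds the context $c$, the question $q$, the candidate set $\mathcal{C}$, and the index of the correct answer $a$. The class name, field names, and the example passage are assumptions made for illustration; the paper does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MCQASample:
    """One synthetic multiple-choice sample (names are illustrative only)."""

    context: str           # passage c taken from the universal corpus
    question: str          # generated question q
    candidates: List[str]  # candidate set C: the correct answer plus distractors
    label: int             # index of the correct answer a inside candidates

    def __post_init__(self) -> None:
        # Basic sanity checks: the label must point at a candidate,
        # and no candidate may appear twice.
        if not 0 <= self.label < len(self.candidates):
            raise ValueError("label must index into candidates")
        if len(set(self.candidates)) != len(self.candidates):
            raise ValueError("candidates must be distinct")

    @property
    def answer(self) -> str:
        return self.candidates[self.label]


# Toy example (invented passage, not taken from the paper's data).
sample = MCQASample(
    context="Marie Curie was born in Warsaw in 1867.",
    question="Where was Marie Curie born?",
    candidates=["Paris", "Warsaw", "Vienna", "Prague"],
    label=1,
)
print(sample.answer)
```

Storing the label as an index rather than a string keeps the correct answer tied to its position in the candidate set, which is the form a multiple-choice classifier typically consumes.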

2.1 Question and Answer Generation

Similar to unsupervised EQA [9], we build QA samples from a task-agnostic open-domain source corpus. We identify all the named entities (NE) with specific NER tags (listed in Table 1), and treat them as the correct answers for potential questions. Such an extraction process can be conducted via open-source NLP libraries (e.g., spaCy [17]) without extra training. Then, for each extracted answer, we generate a question to form a question-answer pair. To this end, we first mask the NE answer in the context to obtain a cloze, then transform the cloze into a natural question in a manner similar to machine translation [18]. We adopt a seq2seq-based NMT model [19] trained on nonparallel corpora of clozes and questions to conduct this translation task. We generate five types of questions, who, where, what, when, and how (e.g., “how long”, “how many”), based on the types of the entities.

TYPE	DESCRIPTION
PERSON	People, including fictional.
NORP	Nationalities or religious or political groups.
FAC	Buildings, airports, highways, bridges, etc.
ORG	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOC	Non-GPE locations, mountain ranges, bodies of water.
PRODUCT	Objects, vehicles, foods, etc. (Not services.)
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language.
DATE	Absolute or relative dates or periods.
TIME	Times smaller than a day.
PERCENT	Percentage, including (%).
MONEY	Monetary values, including unit.
QUANTITY	Measurements, as of weight or distance.

Table 1: The types of named entities.
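
The masking step of Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mask token, the helper names, and in particular the mapping from NER tags to the five question types are hypothetical, since the exact assignment is not given here. The cloze-to-question translation itself is done by the NMT model and is not reproduced.

```python
from typing import Optional

MASK = "[MASK]"  # placeholder token; the actual mask symbol is an assumption

# Assumed mapping from the NER tags of Table 1 to the five question types.
# The exact assignment used by the authors is not stated in the text.
QUESTION_WORD = {
    "PERSON": "who", "NORP": "who", "ORG": "who",
    "GPE": "where", "LOC": "where", "FAC": "where",
    "DATE": "when", "TIME": "when",
    "PERCENT": "how", "MONEY": "how", "QUANTITY": "how",
    "PRODUCT": "what", "EVENT": "what", "WORK_OF_ART": "what",
    "LAW": "what", "LANGUAGE": "what",
}


def make_cloze(sentence: str, start: int, end: int) -> str:
    """Replace the answer span sentence[start:end] with the mask token."""
    return sentence[:start] + MASK + sentence[end:]


def question_word(ner_tag: str) -> Optional[str]:
    """Return the question type for a tag, or None if the tag is not used."""
    return QUESTION_WORD.get(ner_tag)


# Toy example: the entity span would normally come from an NER tagger.
sentence = "Marie Curie was born in Warsaw in 1867."
answer = "Warsaw"
start = sentence.index(answer)
cloze = make_cloze(sentence, start, start + len(answer))
print(cloze, "->", question_word("GPE"))
```

In a full pipeline, the character offsets would be supplied by the NER library rather than by a string search, which avoids masking the wrong occurrence when an entity appears more than once.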

2.2 Distractor Generation

Given a question and its corresponding answer, we seek methods to generate answer distractors that complete its candidate set. A straightforward way is the Random method, where we randomly select the answers of other questions and treat them as distractors for the current QA pair. However, according to Pho et al. [20], there should be high syntactic and semantic homogeneity between the answer and the distractors to make the task more difficult. A good distractor should therefore have the same NE type as the correct answer and a similar semantic meaning. We thus provide three further methods aimed at generating high-quality distractors: 1) an NE (named entity)-based method, whose candidates are aware of the NE type; 2) a KG (knowledge graph)-based method, whose candidates are aware of the semantic meaning; and 3) a hybrid KG-NE approach combining the merits of both.

NE-based. A simple method selects distractors that have the same NE type as the gold answer. Since the NER tags have already been identified during answer generation, this is realized by sampling, as distractors, the answers of other questions that share the same NE type.
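
A minimal sketch of this sampling step is given below. The function name, the pool format, and the parameters `k` and `seed` are assumptions introduced for illustration; the text does not state how many distractors are drawn per question.

```python
import random
from typing import List, Sequence, Tuple


def ne_based_distractors(
    answer: str,
    ner_tag: str,
    answer_pool: Sequence[Tuple[str, str]],
    k: int = 3,
    seed: int = 0,
) -> List[str]:
    """Sample up to k distinct answers of other questions with the same NE type.

    answer_pool holds (answer_text, ner_tag) pairs collected during answer
    extraction; k and seed are illustrative parameters.
    """
    # Keep only same-type answers that differ from the gold answer.
    same_type = sorted(
        {text for text, tag in answer_pool if tag == ner_tag and text != answer}
    )
    rng = random.Random(seed)
    return rng.sample(same_type, min(k, len(same_type)))


# Toy pool of previously extracted answers (invented for this example).
pool = [
    ("Warsaw", "GPE"), ("Paris", "GPE"), ("Vienna", "GPE"), ("Prague", "GPE"),
    ("Berlin", "GPE"), ("Marie Curie", "PERSON"), ("1867", "DATE"),
]
distractors = ne_based_distractors("Warsaw", "GPE", pool, k=3)
print(distractors)
```

Dropping the `tag == ner_tag` condition turns the same function into the Random baseline, which makes the difference between the two methods explicit.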

KG-based. One drawback of the NE-based method is that answer candidates with the same NE type are not guaranteed to be semantically similar, which may make the task less challenging for QA models. Motivated by prior work on distractor generation [13] and knowledge graph-based question answering [21], we address this issue with the aid of an external knowledge base. Specifically, we use ConceptNet [22], a general-domain knowledge graph, as our knowledge base. We concatenate the question and the answer to build the input representation vector, then follow Feng et al. [23] to retrieve a subgraph of ConceptNet consisting of entities that are closely related to the input. Based on the subgraph, we use a pre-trained language model to further estimate the relevance between the entities and the input. The entities with the top-K largest scores are then selected as the distractors.
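
The final top-K selection can be illustrated as follows. The sketch assumes the relevance scores have already been computed by the language model; the scores, entity names, and function name below are made up for the example, and the subgraph retrieval and scoring steps are not reproduced.

```python
import heapq
from typing import Dict, List


def top_k_distractors(relevance: Dict[str, float], answer: str, k: int = 3) -> List[str]:
    """Return the k entities with the largest relevance scores, excluding the answer."""
    candidates = [
        (score, entity)
        for entity, score in relevance.items()
        if entity.lower() != answer.lower()
    ]
    # nlargest sorts by score first, then by entity name to break ties.
    return [entity for _, entity in heapq.nlargest(k, candidates)]


# Hypothetical relevance scores for entities in a retrieved subgraph.
scores = {"kilometre": 0.91, "mile": 0.88, "metre": 0.95, "banana": 0.05, "yard": 0.70}
print(top_k_distractors(scores, answer="metre", k=3))
# -> ['kilometre', 'mile', 'yard']
```

The correct answer is filtered out before ranking because it is, by construction, highly relevant to its own question and would otherwise be returned as a distractor.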

KG-NE. We provide a hybrid method, KG-NE, that aims to combine the benefits of the KG-based and NE-based methods. The distractors generated by the NE-based method may vary widely in their semantic meanings. Although the KG-based approach can provide distractors that have high semantic relevance to the answer, the KG sometimes fails to recognize the entities in the input, making it hard to conduct the follow-up subgraph retrieval and relevance scoring operations. Among the five types of generated questions (Who, What, Where, When, and How), we observe that the KG-based approach only works well for the How-type questions (refer to Table 4 for details). Therefore, KG-NE selects the generation method according to the question type: for How-type questions we apply the KG-based method, while the remaining four types are left to NE-based generation.
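
The type-based routing of KG-NE amounts to a simple dispatch, sketched below. The function signature and the two stand-in generators are hypothetical; in practice they would be the KG-based and NE-based procedures described above.

```python
from typing import Callable, List

# A distractor generator maps (question, answer) to a list of distractors.
DistractorFn = Callable[[str, str], List[str]]


def kg_ne_distractors(
    question_type: str,
    question: str,
    answer: str,
    kg_based: DistractorFn,
    ne_based: DistractorFn,
) -> List[str]:
    """Use the KG-based generator for how-type questions, NE-based otherwise."""
    generator = kg_based if question_type.lower() == "how" else ne_based
    return generator(question, answer)


# Stand-in generators so that the sketch runs on its own.
def fake_kg(question: str, answer: str) -> List[str]:
    return ["kg-distractor"]


def fake_ne(question: str, answer: str) -> List[str]:
    return ["ne-distractor"]


print(kg_ne_distractors("How", "How long is the river?", "6,650 km", fake_kg, fake_ne))
print(kg_ne_distractors("Who", "Who wrote the book?", "Jane Austen", fake_kg, fake_ne))
```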

3 EXPERIMENTS

3.1 Setup

Implementation. Similar to Lewis et al. [9], we use data from the English Wikipedia for constructing the synthetic datasets. We use spaCy for NER in answer extraction and adopt the unsupervised NMT model provided by Lewis et al. [9] for question generation. For distractor generation, we adopt ConceptNet and the RoBERTa-large model [14] provided by Yasunaga et al. [21] for constructing the entity graph and obtaining the relevance score, respectively. The synthetic datasets are derived from 101,500 passages, of which 100,000 and 1,500 passages are used as the training and development sets, respectively.

We use the RoBERTa-base model [14] as the backbone of the QA model in our evaluation. For each of the synthetic / annotated datasets, we train a model for 3 epochs with a batch size of 16. We use the Adam optimizer [24] with a learning rate of 5e-5.

We also fine-tune Llama 2-7B [25] with LoRA [26] on 50,000 samples generated by the KG-NE method. The fine-tuning runs for 3 epochs, with a batch size of 4 and a learning rate of 1e-4.

Evaluation datasets. We use four annotated MCQA datasets: SWAG [27], ARC [28], CommonsenseQA [29], and SocialIQA [30]. We denote all of them as “annotated datasets” for clarity. (Footnote 1: The testing sets for SWAG and CommonsenseQA are half of the validation sets.)

Baselines. Besides the proposed methods, we consider three baselines: “Random”, the simple random method mentioned in Section 2.2; Sliding Window (SW), which calculates the overlap between the options and the question and context to obtain the answer [31]; and Knowledge Representation Learning (KRL), a zero-shot method proposed by Banerjee and Baral [16]. For large language models (LLMs), we consider Llama-7B [32], Llama 2-7B, Llama 2-13B, and ChatGPT-3.5-Turbo.
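
To make the idea behind the SW baseline concrete, the sketch below scores each option by the largest number of context tokens, within one sliding window, that also occur in the question or the option. It is a deliberately simplified, unweighted variant written for illustration, not the exact baseline of [31]; the tokenizer, function names, and example passage are assumptions.

```python
from typing import List, Sequence

PUNCT = ".,!?;:\"'()"


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    stripped = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in stripped if tok]


def window_score(context: str, question: str, option: str) -> int:
    """Best overlap between any context window and the question + option tokens."""
    target = set(tokenize(question)) | set(tokenize(option))
    tokens = tokenize(context)
    size = max(len(target), 1)  # window length equals the target set size
    best = 0
    for start in range(max(len(tokens) - size + 1, 1)):
        window = tokens[start:start + size]
        best = max(best, sum(tok in target for tok in window))
    return best


def predict(context: str, question: str, options: Sequence[str]) -> int:
    """Return the index of the option with the highest window score."""
    scores = [window_score(context, question, opt) for opt in options]
    return scores.index(max(scores))


context = ("Marie Curie was born in Warsaw in 1867. "
           "She later moved to Paris to study physics.")
question = "Where was Marie Curie born?"
options = ["Paris", "Warsaw", "Vienna"]
print(options[predict(context, question, options)])
```

Because such a scorer needs no training, it relies entirely on lexical overlap between the options and the passage.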

Method	ARC-Easy	SWAG	CommonsenseQA	SocialIQA
Random	37.2 $\pm$ 1.3	37.0 $\pm$ 2.4	42.2 $\pm$ 1.6	36.2 $\pm$ 1.1
NE-based	38.3 $\pm$ 0.8	46.0 $\pm$ 2.0	39.6 $\pm$ 1.0	38.9 $\pm$ 0.8
KG-based	31.0 $\pm$ 0.6	41.4 $\pm$ 2.7	26.8 $\pm$ 1.1	37.2 $\pm$ 0.4
KG-NE	38.6 $\pm$ 0.4	49.9 $\pm$ 0.6	42.8 $\pm$ 1.1	39.5 $\pm$ 0.5

Table 2: The accuracy (%) of QA models on the four benchmark test sets after training on synthetic datasets generated by different approaches. The results are averaged over 3 random seeds along with the standard deviations.
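
Each cell in Table 2 is a mean and standard deviation over three seeds. The aggregation can be sketched as below; whether the sample or the population standard deviation was used is not stated, so the use of the sample version here is an assumption, and the per-seed accuracies are invented purely to show the computation.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation), both rounded to one decimal."""
    return round(mean(accuracies), 1), round(stdev(accuracies), 1)


# Hypothetical accuracies (%) from three seeds; not values from the paper.
seed_accuracies = [40.0, 41.0, 42.0]
avg, std = summarize(seed_accuracies)
print(f"{avg} $\\pm$ {std}")
```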

Method	ARC-Easy	SWAG	CommonsenseQA	SocialIQA
Random	38.2 $\pm$ 1.4	45.5 $\pm$ 5.1	42.5 $\pm$ 1.9	35.2 $\pm$ 0.8
SW [31]	24.8 $\pm$ 5.8	38.0 $\pm$ 2.1	22.4 $\pm$ 4.6	32.8 $\pm$ 1.4
KRL [16]	33.0	-	38.8	48.5
NE-based	40.8 $\pm$ 1.0	48.5 $\pm$ 0.7	39.1 $\pm$ 0.7	42.4 $\pm$ 0.9
KG-based	30.7 $\pm$ 1.3	43.1 $\pm$ 2.0	27.4 $\pm$ 1.1	37.8 $\pm$ 0.4
KG-NE	39.5 $\pm$ 0.2	53.3 $\pm$ 0.5	43.7 $\pm$ 1.4	41.6 $\pm$ 0.2
Llama 2-7B*	77.7 $\pm$ 0.6	55.8 $\pm$ 1.5	58.2 $\pm$ 0.5	53.6 $\pm$ 0.9
Llama-7B	28.34	27.23	22.68	25.71
Llama 2-7B	32.46	25.41	25.37	31.22
Llama 2-13B	72.31	41.66	40.95	48.30
ChatGPT	84.8	75.3	73.1	71.6
Supervised	94.7 $^{\dagger}$	94.1 $^{\dagger}$	83.3 $^{\dagger}$	84.3 $^{\dagger}$

Table 3: The accuracy (%) of QA models (RoBERTa-base and Llama 2-7B) after training on different synthetic datasets, where models are selected by the validation performance on the original annotated datasets. The results are averaged over 3 random seeds. Llama 2-7B* is the model obtained by fine-tuning Llama 2-7B on the synthetic data generated by KG-NE. Llama-7B, Llama 2-7B, Llama 2-13B, and ChatGPT show the results obtained directly from the original LLMs.

$\dagger$: The results are from several leaderboards, including Leaderboard-ARC-E, -SWAG, -CSQA and -SocialIQA.

3.2 Results

Table 2 shows the QA models’ performance after training on the synthetic datasets generated by different methods. Note that we use the synthetic development set for model selection, as we assume that no sample in the target domain is available. Other than over-fitting, the remaining gap to supervised performance (see Table 3) could be attributed to the large domain gap between the source dataset (English Wikipedia) and the target datasets (SWAG, ARC, CommonsenseQA and SocialIQA). The methods that consider the NE types during distractor generation (“NE-based”, “KG-NE”) generally achieve better testing performance than the other approaches. In particular, the hybrid method, “KG-NE”, further improves on “NE-based” by a margin, showing the effectiveness of considering both the NE types and the semantic meanings. However, solely adopting the KG-based method leads to a performance drop. As noted in Section 2.2, the KG-based method sometimes fails to generate MCQA data from a passage, resulting in a much smaller training set. The poor performance of the KG-based method can also be examined from the perspective of the generated question types; please refer to Section 3.3 for details. We also observe that the SW method, although effective in its original setting (where only the correct answer $a$ is not provided), performs badly in our setting, where the model has to construct the MCQA data from scratch to train itself. Under such a challenging setting, the proposed KG-NE method shows robustness and yields the highest performance.

3.3 Additional Analysis

Using annotated dev sets. Table 3 shows the results when we slightly relax the strict data availability assumption and use the development sets of the annotated data for model selection. In general, the NE-based method benefits from this setting, improving over its results in Table 2. On the other hand, the KG-based method encounters a performance drop, which also partly affects the performance of the hybrid method. All three of our approaches outperform the SW method. Compared to the KRL method, our NE-based and KG-NE methods perform worse on SocialIQA but better on ARC-Easy and CommonsenseQA (KRL reports no result on SWAG). Overall, our methods compare favorably with both baselines. We also compare our unsupervised methods with the state-of-the-art supervised models; the comparison indicates that there is still large room for improvement, which we leave as a future direction.

Large language models. For LLMs, Table 3 shows that the proposed corpus can improve an LLM’s performance significantly. When tested directly, the original LLMs often failed due to hallucination and struggled to understand the information behind the questions. As a result, the performance of Llama-7B and Llama 2-7B was even worse than that of the fine-tuned RoBERTa-base. Surprisingly, after fine-tuning Llama 2-7B on the corpus we generated, it achieved impressive results, even surpassing Llama 2-13B on all four datasets.

Generated question types. Table 4 shows that the KG-based method yields a severely unbalanced distribution of question types. It is prone to generating How-type questions, while only a small number of Who-type, Where-type and What-type samples are produced. Such an unbalanced distribution may be another reason for the poor performance of QA models in this case. In contrast, the hybrid method leads to more balanced question types, benefiting QA models trained on samples derived from it.

Ques-Type	NE-based	KG-based	KG-NE
Who	47975	94	24243
Where	19366	192	24352
What	4512	5	4510
When	15736	4033	24132
How	13911	39176	24263
Total	101500	43500	101500

Table 4: The number of the generated questions, using NE-based/KG-based/KG-NE method, in terms of different types.
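
The imbalance discussed above can be quantified directly from the counts in Table 4. The short script below recomputes the totals and the percentage share of each question type; only the variable and function names are introduced here. For the KG-based method, How-type questions account for roughly 90% of the samples.

```python
from typing import Dict

# Question counts copied from Table 4.
COUNTS: Dict[str, Dict[str, int]] = {
    "NE-based": {"Who": 47975, "Where": 19366, "What": 4512, "When": 15736, "How": 13911},
    "KG-based": {"Who": 94, "Where": 192, "What": 5, "When": 4033, "How": 39176},
    "KG-NE": {"Who": 24243, "Where": 24352, "What": 4510, "When": 24132, "How": 24263},
}


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of each question type within one method's synthetic dataset."""
    total = sum(counts.values())
    return {qtype: 100.0 * n / total for qtype, n in counts.items()}


for method, counts in COUNTS.items():
    formatted = ", ".join(f"{q}: {p:.1f}%" for q, p in shares(counts).items())
    print(f"{method} (total {sum(counts.values())}): {formatted}")
```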

3.4 Quality Analysis of Synthetic Data

We conducted a manual review of two fundamental aspects of the synthetic data: the quality of the generated questions and the appropriateness of the candidate options. Our evaluation revealed that a significant portion of the questions suffered from grammatical errors and lacked contextual alignment. Moreover, the NE-based method often produced candidate options that did not align well with the questions, although there were exceptions. In contrast, the KG-based method consistently maintained semantic coherence between questions and candidate options. However, it primarily generated How-type questions and produced fewer questions overall. Finally, our findings suggest that the KG-NE method outperforms the alternative methods in our study, offering a more promising approach.

4 CONCLUSION

In this work, we handle the MCQA task in an unsupervised manner under a fully non-annotated scenario, where no target data is given and only a universal corpus can be utilized. We propose a two-stage framework featuring question-answer pair generation and KG-NE based distractor generation to construct the synthetic data for model training. The experimental results on multiple datasets verify the effectiveness of our approach and the impact of various distractor generation methods.

References

[1] Yu Cao, Meng Fang, and Dacheng Tao, “BAG: Bi-directional attention entity graph convolutional network for multi-hop reasoning question answering,” in EMNLP, 2019.
[2] Yinya Huang, Meng Fang, Yu Cao, Liwei Wang, and Xiaodan Liang, “DAGN: Discourse-aware graph network for logical reasoning,” in NAACL, 2021.
[3] Qin Zhang, Shangsi Chen, Dongkuan Xu, Qingqing Cao, Xiaojun Chen, Trevor Cohn, and Meng Fang, “A survey for efficient open domain question answering,” in ACL, 2023.
[4] Sicheng Yu, Hao Zhang, Wei Jing, and Jing Jiang, “Context modeling with evidence filter for multiple choice question answering,” in ICASSP, 2022.
[5] Xiyan Liu, Yidong Shi, Ruifang Liu, Ge Bai, and Yanyi Chen, “Narrow down before selection: A dynamic exclusion model for multiple-choice qa,” in ICASSP, 2023.
[6] Qin Zhang, Shangsi Chen, Meng Fang, and Xiaojun Chen, “Joint reasoning with knowledge subgraphs for multiple choice question answering,” Information Processing & Management, vol. 60, no. 3, pp. 103297, 2023.
[7] Meng Fang, Shilong Deng, Yudi Zhang, Zijing Shi, Ling Chen, Mykola Pechenizkiy, and Jun Wang, “Large language models are neurosymbolic reasoners,” in AAAI, 2024.
[8] Alexander Fabbri, Patrick Ng, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang, “Template-based question generation from retrieved sentences for improved unsupervised question answering,” in ACL, 2020.
[9] Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel, “Unsupervised question answering by cloze translation,” in ACL, 2019.
[10] Meng Fang, Yuan Li, and Trevor Cohn, “Learning how to active learn: A deep reinforcement learning approach,” in EMNLP, 2017.
[11] Zhongli Li, Wenhui Wang, Li Dong, Furu Wei, and Ke Xu, “Harvesting and refining question-answer pairs for unsupervised QA,” in ACL, 2020.
[12] Chi-Liang Liu and Hung-yi Lee, “Unsupervised multiple choices question answering: Start learning from basic knowledge,” in ACL, 2021.
[13] Siyu Ren and Kenny Q Zhu, “Knowledge-driven distractor generation for cloze-style multiple choice questions,” in AAAI, 2021.
[14] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[15] Jeroen Offerijns, Suzan Verberne, and Tessa Verhoef, “Better distractions: Transformer-based distractor generation and multiple choice question filtering,” arXiv preprint arXiv:2010.09598, 2020.
[16] Pratyay Banerjee and Chitta Baral, “Self-supervised knowledge triplet learning for zero-shot question answering,” in EMNLP, 2020.
[17] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd, “spaCy: Industrial-strength natural language processing in Python,” 2020.
[18] Zi-Yi Dou and Nanyun Peng, “Zero-shot commonsense question answering with cloze translation and consistency optimization,” arXiv preprint arXiv:2201.00136, 2022.
[19] Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato, “Phrase-based & neural unsupervised machine translation,” in EMNLP, 2018.
[20] Van-Minh Pho, Thibault André, Anne-Laure Ligozat, Brigitte Grau, Gabriel Illouz, and Thomas François, “Multiple choice question corpus analysis for distractor characterization,” in LREC, 2014.
[21] Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec, “QA-GNN: Reasoning with language models and knowledge graphs for question answering,” in NAACL-HLT, 2021.
[22] Robyn Speer, Joshua Chin, and Catherine Havasi, “ConceptNet 5.5: An open multilingual graph of general knowledge,” in AAAI, 2017.
[23] Yanlin Feng, Xinyue Chen, Bill Yuchen Lin, Peifeng Wang, Jun Yan, and Xiang Ren, “Scalable multi-hop relational reasoning for knowledge-aware question answering,” in EMNLP, 2020.
[24] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
[25] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[26] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “LoRA: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[27] Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi, “SWAG: A large-scale adversarial dataset for grounded commonsense inference,” in EMNLP, 2018.
[28] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord, “Think you have solved question answering? Try ARC, the AI2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.
[29] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant, “CommonsenseQA: A question answering challenge targeting commonsense knowledge,” in NAACL-HLT, 2019.
[30] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi, “Social IQa: Commonsense reasoning about social interactions,” in EMNLP-IJCNLP, 2019.
[31] Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw, “MCTest: A challenge dataset for the open-domain machine comprehension of text,” in EMNLP, 2013.
[32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
