1 Introduction
High-precision extraction of structured data from unstructured text has valuable applications in medicine [72], finance [22], science [50, 54, 73], and law enforcement [12, 35, 71]. In this article, we explore information extraction (IE) use cases in criminal justice. This area is of particular interest because the existing data sources, whether publicly available [3, 15, 55, 65] or restricted [39], are usually derived from routinely collected administrative data, which often omits key information such as victim details or mitigating and aggravating circumstances [81]. This information, and much more, can however be found in long unstructured documents such as trial transcripts. Making it easily accessible would, among other things, enable much-needed research into bias in the judicial system.
There are several ways to achieve this. While humans can often extract information from unstructured documents with near-perfect accuracy, their work is time-intensive, costly, and possibly traumatizing (e.g., when the content is violent or sexual) [19]. On the other hand, even state-of-the-art automated methods still cannot extract information from text with human-level accuracy [30, 79]. The main focus of our article is therefore human–computer collaborative approaches to IE, with an emphasis on the tradeoff between processing speed and extraction accuracy.
Our study is centered around three distinct datasets and two IE tasks. Two of the datasets contain information about criminal cases from legal documents and news articles. For both, the task is to extract structured (tabular) information about the victims, defendants, and situational circumstances. The third dataset contains online communications between convicted child sexual predators and police officers posing as minors; the task is to identify specific predatory behavior patterns [18]. In all cases, the text documents are too long for manual extraction at scale [37], and the high precision necessary for the extracted data to be useful to researchers rules out fully automated approaches. Unlike in most of the existing literature on human–machine collaboration [e.g., 5, 62], all our tasks are examples of the under-explored scenario where computers achieve lower accuracy but can significantly speed up processing.
In Section 2, we discuss how this type of human–computer complementarity affects design choices within collaborative IE, and identify a gap among existing solutions (see Table 1). In Section 3, we fill this gap by developing ELICIT, a new human-validated IE tool which leverages the speed of modern machine learning models and the near-perfect accuracy of humans. ELICIT is built on weak supervision approaches [63, 80], utilizing multiple independent algorithms to provide suggestions from which the user selects the final answer. In Section 4, we perform a case study comparing manual extraction and ELICIT, with two of this paper's authors serving as the human validators. ELICIT achieves precision on par with manual annotation, significantly outperforming the state of the art among automated IE tools [63, 80], while using orders of magnitude less processing time (Section 4.2). In Section 4.3, we further explore: (i) increasing the number of computer-generated suggestions to improve recall; (ii) deferral as a way to trade off performance for time; and (iii) ranking and fine-tuning as auxiliary ways to use the validated data to attain even better performance.
Our main contributions are summarized below:
We analyze the relative strengths of various human–computer approaches to IE from long unstructured texts, in the context of two criminal justice use cases.
We design and implement ELICIT, an interactive IE tool which combines weak supervision and human validation to enable IE that is faster than, yet almost as accurate as, manual annotation.
Creating three new datasets, we demonstrate that ELICIT significantly speeds up the annotation process while maintaining near-perfect precision. Accuracy can be further traded off for processing time via a deferral setting.
While we focus on criminal justice use cases, we believe ELICIT can be useful in other areas where the imperfect accuracy of automated methods and the slowness and other drawbacks of manual annotation pose a problem (e.g., legal texts, medical records, meta-analyses of literature).
3 ELICIT: A System for User-validated Information Extraction from Text
Comparing existing tools (Section 2.2) to our requirements (Section 2.1.3) reveals a gap. Manual and assisted annotation methods (Sections 2.2.1 and 2.2.2) are time-intensive and provide little shielding, while approaches that do not validate the final extractions (Sections 2.2.5 and 2.2.6) tend to compromise accuracy. Human-in-the-loop systems (Sections 2.2.3 and 2.2.4) offer a balance, but the only existing open-source tool, Rubrix [17], is not optimized for the extraction of multiple interdependent variables from long documents that is common to our use cases (Section 2.1). Therefore, a new system is required.
3.1 System Design
Based on the requirements of Section 2.1.3, we designed and implemented ELICIT, a flexible, accurate, and time-efficient tool for IE from text. Its core functionality falls into the human validation category (Section 2.2.3), with a specialization in the extraction of interdependent variables from long, complex texts. ELICIT consists of two high-level stages (Figure 1): (1) automated passage retrieval and label suggestion, and (2) human label validation. The prediction accuracy relies on the user's ability to infer the correct label from the retrieved passages, while the time-efficiency gains come mainly from the automated passage retrieval. Since the user only interacts with text excerpts, they are shielded from having to engage with the entirety of the potentially disturbing text. The automated passage retrieval and label suggestion also make step (1) reproducible and step (2) easily comparable between annotators.
To ensure flexibility, the automated first step utilizes weak supervision, a method for combining multiple "weak labels" generated by labeling functions [63]. For intuition, weak supervision is analogous to assigning labels by combining answers from multiple annotators, where each annotator provides a label and a corresponding confidence level. The final label is then a weighted average of the predicted labels. The weights are proportional to (re)calibrated confidence, where the calibration is used to adjust for miscalibrated annotators. In ELICIT, we use different automated labeling methods as "annotators". Instead of a single final label, we produce a list ranked by average calibrated confidence for the human to validate (see Appendix G for details). Beyond its flexibility, a key advantage of weak supervision for our use cases is that it can enable high overall accuracy even if the labeling functions are not individually performant.
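To make the combination step concrete, the sketch below shows one way the weak labels could be pooled into a confidence-ranked suggestion list. It is a minimal Python illustration under our own data layout and function names, not ELICIT's actual implementation (which is described in Appendix G).

```python
from collections import defaultdict

def rank_suggestions(weak_labels, calibrators):
    """Pool weak labels into label values ranked by average calibrated confidence.

    weak_labels: iterable of (function_name, label_value, raw_confidence, snippet)
                 tuples, one per suggestion produced by a labeling function.
    calibrators: dict mapping function_name to a callable that maps the function's
                 raw confidence to a calibrated probability, adjusting for
                 "miscalibrated annotators".
    """
    confidences = defaultdict(list)  # label_value -> calibrated confidences
    evidence = defaultdict(list)     # label_value -> (confidence, snippet, function_name)
    for fn_name, label, raw_conf, snippet in weak_labels:
        conf = calibrators[fn_name](raw_conf)
        confidences[label].append(conf)
        evidence[label].append((conf, snippet, fn_name))

    # Rank label values by average calibrated confidence, highest first, and attach
    # the supporting snippets so the user can validate each suggestion.
    avg = lambda lab: sum(confidences[lab]) / len(confidences[lab])
    ranked = sorted(confidences, key=avg, reverse=True)
    return [(lab, avg(lab), sorted(evidence[lab], key=lambda e: e[0], reverse=True))
            for lab in ranked]
```

For instance, if a keyword-based function and an LLM-based function both suggest "female" with calibrated confidences 0.6 and 0.9, the label would be presented with an average score of 0.75, together with both supporting snippets.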
3.2 Core Features and User Interface
Leveraging large language models. In ELICIT, the user defines a set of questions (e.g., "Was the victim vulnerable?") and an ensemble of labeling functions (ranging from keyword lookup to neural nets; see Setup in Figure 1). Each labeling function is tasked with: (a) identifying a part of the text relevant to the question; and (b) suggesting an answer label for the user to validate (e.g., "Victim was vulnerable"). The most successful labeling functions we tested utilize large language models [11, 21, 33] to achieve one or both of the above tasks. Large language models can achieve impressive results on passage retrieval and information extraction tasks [2, 34], although not without limitations and biases [7].
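The sketch below illustrates the labeling-function contract implied by tasks (a) and (b), using a toy keyword-lookup implementation. The interface, class names, and confidence value are our own simplifications rather than ELICIT's exact API; an LLM-backed labeling function would implement the same interface.

```python
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class Candidate:
    label: str          # suggested answer, e.g., "Victim was vulnerable"
    confidence: float   # raw confidence, later calibrated and averaged
    snippet: str        # retrieved passage shown to the user as an explanation

class LabelingFunction(Protocol):
    def __call__(self, question: str, document: str) -> List[Candidate]:
        ...

class KeywordLookup:
    """Toy labeling function: suggests its label whenever one of its keywords appears."""

    def __init__(self, label: str, keywords: List[str], window: int = 150):
        self.label, self.keywords, self.window = label, keywords, window

    def __call__(self, question: str, document: str) -> List[Candidate]:
        lowered = document.lower()
        candidates = []
        for kw in self.keywords:
            pos = lowered.find(kw.lower())
            if pos != -1:
                # (a) passage retrieval: a window of text around the keyword;
                # (b) label suggestion: this function's fixed label with a modest confidence.
                snippet = document[max(0, pos - self.window): pos + self.window]
                candidates.append(Candidate(self.label, 0.5, snippet))
        return candidates
```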
Providing explanations. To enable human verification, label predictions must be accompanied by explanations. For the tasks we consider (Section 2.1), an explanation is simply a relevant snippet of the original text. If the provided snippet is insufficient, the user can open a pop-up window containing a larger section of the text. Presenting only relevant snippets is not only faster than manual reading, but also shelters the user from sensitive, graphic, and otherwise problematic content. This reduces both harm to the user and their mental fatigue.
User interface. The user validates candidate labels (e.g., "Victim is female.") within the user interface (Figure 2). Too many candidates may overwhelm the user. To streamline use and reduce fatigue, we merge predictions of the same label if their explanations (snippets) significantly overlap. In the example in Figure 2, the user is asked to validate the victim's sex. Rows correspond to suggested label values, and the extended snippets for "female" are unrolled. The explanation with the highest confidence was highlighted by three of the five labeling functions.
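The overlap-based merging could look roughly like the sketch below. This is our own illustration; the character-overlap measure, the 0.8 threshold, and the data layout are assumptions rather than ELICIT's exact rule. Each merged entry records which labeling functions support it, which is what yields counts such as "3-out-of-5" in the interface.

```python
def span_overlap(a, b):
    """Fraction of the shorter of two (start, end) character spans covered by their intersection."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return inter / max(1, min(a[1] - a[0], b[1] - b[0]))

def merge_candidates(candidates, threshold=0.8):
    """Merge suggestions of the same label whose explanation spans overlap heavily.

    candidates: list of dicts with keys 'label', 'span', 'confidence', and 'functions'
                (a set of labeling-function names supporting the suggestion).
    """
    merged = []
    for cand in sorted(candidates, key=lambda c: c["confidence"], reverse=True):
        for kept in merged:
            if kept["label"] == cand["label"] and span_overlap(kept["span"], cand["span"]) >= threshold:
                kept["functions"] |= set(cand["functions"])  # record multi-function support
                break
        else:
            merged.append(dict(cand, functions=set(cand["functions"])))
    return merged
```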
Note that each label value (e.g., male, female) can be supported by multiple explanations (e.g., "She was described ..."). These are ordered in the UI by a ranking function, with the highest-confidence explanations placed on the left. Our ranking model is similar to Reference [64], except that we compute scores on a per-explanation rather than per-document level (see Figure E1 and Appendix G.2 for details). See the supplementary materials for a video demonstration of ELICIT's user interface.
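As a purely illustrative stand-in for the per-explanation scoring (the actual model follows Reference [64] and Appendix G.2; the features and the linear scorer below are assumptions), each explanation can be scored independently and the list sorted so the most confident explanations come first:

```python
import numpy as np

def order_explanations(explanations, featurize, weights):
    """Order the explanations for one label value by a per-explanation score, highest first.

    explanations: list of (label_value, calibrated_confidence, snippet) tuples.
    featurize:    callable mapping one explanation to a feature vector, e.g.,
                  (calibrated confidence, number of supporting functions, snippet length).
    weights:      learned weight vector of the same dimension as the features.
    """
    scores = [float(np.dot(weights, featurize(e))) for e in explanations]
    order = np.argsort(scores)[::-1]  # indices of explanations, best score first
    return [explanations[i] for i in order]
```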
Top-K validation. To improve recall, labeling functions can nominate up to K candidate tuples of the form (label value, confidence score, explanation) for each label, instead of one per document (see Annotation in Figure 1). We refer to this as top-K validation. Increasing K improves recall, but also increases the burden on the user.
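A minimal sketch of top-K candidate collection, assuming each labeling function returns (label value, confidence, explanation) tuples as described above (the function name and pooling details are illustrative):

```python
def collect_top_k(labeling_functions, question, document, k=3):
    """Pool up to k (label value, confidence, explanation) tuples per labeling function.

    Each labeling function may return several candidate tuples for the document;
    keeping its k most confident ones, rather than a single best guess, improves
    recall at the cost of more suggestions for the user to validate.
    """
    pooled = []
    for lf in labeling_functions:
        candidates = sorted(lf(question, document), key=lambda c: c[1], reverse=True)
        pooled.extend(candidates[:k])
    return pooled  # pooled candidates are then merged, ranked, and shown to the user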
3.3 Advanced Features
Continual adaptation of candidate ranking. As the user validates the extracted information, a new tabular dataset is created. We can use this newly created data to calibrate the confidence output by the labeling functions, and to continually adapt the ranking function to user needs as in [8] (see Appendix G.2 for details). This improves the chance that the correct answer is presented to the user first, making validation even faster.
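One simple way to realize the recalibration part of this step is sketched below, under the assumption that each labeling function is calibrated separately with isotonic regression; ELICIT's actual procedure is described in Appendix G.2, and the function name here is ours.

```python
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_confidences, user_accepted):
    """Fit a calibrator for one labeling function from the user's validation feedback.

    raw_confidences: raw confidence scores of that function's past suggestions.
    user_accepted:   1.0 if the user validated the suggestion as correct, else 0.0.
    Returns a callable mapping an array of new raw confidences to calibrated probabilities.
    """
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(raw_confidences, user_accepted)
    return calibrator.predict
```

Refitting such calibrators periodically as more validations arrive, and feeding the calibrated scores back into the ranking, is one way to push the correct suggestion toward the first position.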
Fine-tuning labeling functions. Beyond adapting the ranking, the data provided by user feedback can be used to continually improve (fine-tune) the labeling functions. This can be especially useful for improving recall among the automatically generated candidates. If a labeling function is updated and a new candidate is found for an already labeled document, ELICIT’s user interface alerts the user to this fact.
Deferral. ELICIT can be combined with deferral (Section 2.2.4), where only candidates with low assigned confidence are validated by the user. This allows a further tradeoff between time and accuracy, and works well only if the confidence scores are well calibrated. In Section 4, we assume the automated tool comes with its own confidence score, which is used to decide whether to defer to the user via ELICIT. We note that the automated extraction algorithm need not be one of the ELICIT labeling functions. This allows specialization, i.e., fine-tuning one system for automated extraction (e.g., SNORKEL [63]) and ELICIT for assisting the user (calibrated scores and explanations, high recall, and so on).
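A minimal sketch of the deferral rule (the threshold value and data layout are illustrative): predictions whose confidence from the automated system clears a threshold are accepted directly, and the remainder are routed to the user for validation in ELICIT.

```python
def route_with_deferral(predictions, threshold=0.9):
    """Split automated predictions into auto-accepted and deferred-to-human sets.

    predictions: iterable of (document_id, label_value, confidence) triples from the
                 automated extraction system (which need not be an ELICIT labeling function).
    threshold:   minimum confidence for accepting a prediction without human review;
                 lowering it saves human time at the cost of accuracy.
    """
    auto_accepted, deferred = [], []
    for doc_id, label, confidence in predictions:
        target = auto_accepted if confidence >= threshold else deferred
        target.append((doc_id, label, confidence))
    return auto_accepted, deferred
```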
5 Ethics and Social Impact
Data. All the data used in this study are publicly available. For Task 1, although the original sentencing remarks and news articles contained full names, we removed these when creating the datasets. The individuals could be re-identified by locating the original publications; however, as the data is in the public domain, this is unlikely to result in further harm to their privacy. For Task 2, the data is publicly available via the Perverted Justice website [59]. The website contains conversations between adults and adult decoys pretending to be minors; no actual minors were involved in the conversations. The website only publishes chats that have led to convictions. We note that it is highly unlikely that the offenders consented to the publication of these conversations. While contestable, this data is the only publicly available source in the domain and has been widely used in the relevant literature.
Potential impact of this research. The goal of this work is to facilitate the creation of new tabular datasets, which would enable new avenues of quantitative and mixed-methods research. This work opens new routes to fill critical data gaps, which is much needed to improve our understanding of the criminal justice system [81]. However, we acknowledge that our research can lead to the creation of low-quality, misleading, or cherry-picked datasets if used irresponsibly. We stress the necessity of accompanying any created dataset with detailed documentation of the extraction methodology, including how ambiguity was handled, true/false positive/negative rates from sample testing, and disclosure of any other known biases and errors.
6 Conclusion
We present a framework for human-validated information extraction from text. We centered our investigation around two use cases from the criminal justice domain. Based on their commonalities, we identified several key functionality requirements: accuracy, time-efficiency, flexibility, reproducibility, and reducing the user's exposure to harmful content. We reviewed the suitability of different collaborative settings and tools with respect to the identified requirements. Since none satisfied all our requirements, we developed ELICIT, which we release as open source.
ELICIT is a flexible tool useful for a variety of information extraction tasks. Its design is inspired by weak supervision approaches: we use a set of algorithmic annotators (labeling functions) to identify relevant pieces of text, which are then validated by the user. Compared with manual annotation, ELICIT can attain comparable accuracy at a fraction of the time. This is achieved by leveraging the complementary strengths of machines and humans in our use cases: speed and accuracy, respectively.
We perform a case study, evaluating ELICIT on three extraction tasks based on our criminal justice use cases. In each case, we achieve accuracy close to manual annotation with orders of magnitude lower time investment. ELICIT significantly outperforms automated extraction on both precision and recall. We demonstrate that recall can be further improved by using the already validated data for fine-tuning. We further quantify the tradeoff between human effort and performance within a deferral setup. Finally, based on our own experience of extracting information both manually and with ELICIT, we found that using ELICIT required less emotional strain than manual annotation.
Our framework can be particularly effective for the extraction of factual information from very long documents when high precision is an essential requirement. Beyond helping human annotators do their work more effectively, we believe the continual learning component of our system (Section 4.3.2) is a promising direction for improving machine performance via human feedback. Learning from human feedback is a topic of growing importance in machine learning [e.g., 13, 29, 36], where fine-tuning language models on human feedback can yield significant gains [6, 14, 56, 70]. These approaches largely rely on human annotators hired specifically for the purpose of ranking model outputs based on their quality. In contrast, ELICIT uses the interactions with its users to become more performant, demonstrating a complementary avenue for effectively collecting and learning from human feedback. We hope ELICIT's value-led design, combined with engineering choices informed by task-specific requirements, inspires further research in this direction.