4.2.1 Interview.
We conducted separate semi-structured interviews with four experienced legal workers: one lawyer, one prosecutor, and two judges. All of them work in Beijing; three are men and one is a woman. Although face-to-face interviews involve only a small sample, they allow more in-depth questioning and discussion, broadening and deepening the understanding of the research problem [56]. The interviewees were all well experienced in legal practice and came from representative legal occupations. Each interview took about 30 minutes, and interviewees were compensated about $100 for their participation. The audio was recorded for later analysis.
To begin each interview, the interviewer introduced the proposed taxonomy in detail, including the three criteria, the hierarchical structure, and the six intent categories, as shown in Figure 1 and Table 1. Each interview was centered on two open questions plus a short series of follow-up questions.
The first question asked about the intent taxonomy's coverage and rationality. For example, we first asked, "What do you think of the coverage of this taxonomy? Can it cover all your information needs in daily legal search? If not, is there anything else that needs to be added?" We then followed up with questions about the concrete categories, e.g., "How about the XXX category? What do you think about the definition and characteristics of this category?" These follow-up questions were good triggers for open discussion. We collected rich comments and views on the intent categories from the perspectives of diverse legal occupations, which further helped us revise the taxonomy.
The second question asked about the importance of the different categories in the interviewee's daily search, for example, "Among these intent categories, which do you think are more critical or occur more often in your daily search? And why?" We designed this question to obtain explicit feedback on the importance of different intents in the practice of legal case retrieval. Unlike user surveys or search logs, interviews yield much more fine-grained explanations in this respect despite the small sample size.
Results. After completing all the interviews, we analyzed the records. The main results are summarized as follows:
(1) Regarding the first question, all the interviewees stated that the proposed taxonomy has good coverage of daily needs in legal case retrieval. No new categories were proposed.
(2) Regarding the comments on each intent category (i.e., the follow-up questions), the Analysis category attracted plenty of discussion. The lawyer and the judges, who usually deal with such analytical reports (e.g., similar case search reports), indicated that this type (Analysis) could be covered by the other categories: although it highly depends on the individual case, the underlying information need is still to learn about a specific legal problem (most of the time, Characterization or Penalty). The prosecutor indicated that he seldom had this type of intent; the potential situation he came up with is that, when dealing with a difficult legal issue (e.g., one about Procedure), he might afterward sum it up in an analytical report, such as personal learning material.
(3) Other comments on the concrete categories centered on the categories under Criterion 3. Specifically, the Characterization category should also include situations of innocence and of non-prosecution (from the prosecutor). Under the Procedure intent, they usually search for legal requirements related to jurisdiction or avoidance (from the lawyer). Regarding the Penalty intent, all the interviewees mentioned that it has attracted increasing attention in recent judicial practice, yet it is much harder to satisfy with current legal case search systems; precision was especially emphasized under this intent. Last but not least, they all suggested that these three intents call for more diversified results than the Particular Case(s) intent while requiring higher precision and recall than the Interest intent.
(4) Regarding the second question, all the interviewees suggested that Characterization and Penalty are the most important and common intents in their daily search. In particular, Penalty was emphasized again by the prosecutor and the lawyer separately. The prosecutor and the lawyer also mentioned that Procedure is highly significant: although the Procedure intent is less common than the above two categories in legal case search, retrieval centered on procedural requirements, when needed, is both valuable and difficult.
Revisit and Revision. Based on the interview responses, we held further iterative discussions on the taxonomy and finally reached an agreement on the revision, as illustrated in Figure 1. Specifically, we removed the Analysis category, since it could be covered by the remaining intent categories; it is better viewed as a context that triggers a legal case search rather than as an independent intent category. Furthermore, the first two authors re-coded the user survey and the sampled search logs that were used to establish the taxonomy in Section 4.1, and the revised taxonomy still had good coverage. To summarize, we arrived at a revised taxonomy composed of three criteria and five intent categories, as illustrated in Figure 1(b). Meanwhile, the in-depth discussions and exhaustive feedback helped us further clarify the definitions of the intent categories and gave us a qualitative view of the importance of different categories in practice.
4.2.2 Editorial User Study.
To verify the revised taxonomy, we further conducted an editorial user study, in which we recruited three external legal experts to annotate users' search intents in the user survey responses and query logs described in Section 4.1. Unlike the establishment stage, this time we utilized all the sampled query logs (600 sessions in total). The three annotators were all graduate students majoring in law and qualified in legal practice. They all reported using legal case retrieval regularly and being familiar with current legal case search engines, and they all signed a consent form before participating.
At the beginning of the study, we introduced the revised taxonomy in detail, providing the annotators with the criteria, the taxonomy structure, and the description and examples of each intent category. In addition to the five intent categories, we provided two further choices: Others (O), meaning that the search intent does not belong to any of the proposed categories, and Multi (M), meaning that the underlying intent seems to fall into multiple categories. For these two additional choices, we asked the annotators to explain their selection; for example, an annotator who selected M needed to specify which intent categories the search task might fall into. The annotators were required to annotate the underlying search intent of each survey response based on the answers to the three open questions, and to annotate the search intent of each session according to its queries. After all the annotators confirmed a good understanding of the taxonomy, they annotated the survey responses and query logs independently.
On average, it took each annotator about 1.5 hours to annotate the user survey responses and about 7 hours to annotate the query logs. Each annotator was paid about $12 per hour of annotation. For label aggregation, we used a majority vote; in particular, if all three annotators gave different labels for a sample, we tagged it as Multi (M).
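To make this aggregation rule concrete, the following is a minimal Python sketch (the function name aggregate_labels and the category codes, e.g., "Ch" for Characterization, are our own shorthand rather than artifacts of the study):

```python
from collections import Counter

# Category codes assumed here: the five taxonomy intents plus "O" (Others) and "M" (Multi).
def aggregate_labels(labels: list[str]) -> str:
    """Majority vote over one sample's three annotations; a three-way tie
    (all annotators disagree) falls back to "M" (Multi)."""
    label, freq = Counter(labels).most_common(1)[0]
    if freq >= 2:      # at least two annotators agree
        return label
    return "M"         # three completely different labels -> Multi

# Illustrative usage (hypothetical annotations):
print(aggregate_labels(["Ch", "Ch", "Pe"]))   # -> "Ch"
print(aggregate_labels(["Ch", "Pr", "Pe"]))   # -> "M"
```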
Results. Fleiss's Kappa [12] \(\kappa\) among the three annotators is 0.62 for the user survey annotation, reaching substantial agreement (0.61-0.80). For the query log annotation, \(\kappa\) among the three annotators is 0.58, reaching moderate agreement (0.41-0.60). Compared with the survey, where users described their search scenarios explicitly, the query logs were vaguer for intent labeling, which explains the slight drop in \(\kappa\) [55, 65]. Given the relatively large number of categories, the inter-annotator consistency is acceptable [5, 65] for both datasets, suggesting that the taxonomy can be easily understood and its categories clearly distinguished.
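For reference, Fleiss's \(\kappa\) can be computed from an \(N \times k\) count matrix (N samples, k categories, a fixed number of annotators per sample) as in the sketch below; the toy matrix is purely illustrative and is not our annotation data:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (N_samples x k_categories) matrix whose entries
    count how many annotators assigned that category to that sample."""
    N, k = counts.shape
    n = counts.sum(axis=1)[0]                  # annotators per sample (constant)
    p_j = counts.sum(axis=0) / (N * n)         # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-sample agreement
    P_bar = P_i.mean()                         # mean observed agreement
    P_e = np.square(p_j).sum()                 # expected chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy example with 3 annotators and 3 categories (not our real data):
toy = np.array([[3, 0, 0],
                [2, 1, 0],
                [0, 3, 0],
                [1, 1, 1]])
print(round(fleiss_kappa(toy), 2))
```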
Table 3 shows the proportion of each intent category. Less than 1% of the search tasks were annotated as Others in both the user survey and the query log datasets, indicating that the proposed taxonomy has good coverage of users' intents in legal case retrieval.
Regarding the five categories in the taxonomy, the overall distributions are similar in both datasets, especially for the three intents classified under Criterion 3. In particular, the Characterization intent accounts for about 50%, indicating that it is a fundamental and common task in legal case retrieval. Consistently, recent research and benchmarks [34, 46, 47] designed tasks mainly based on this category. Meanwhile, the proportions of Penalty and Procedure are lower than that of Characterization but still non-trivial, which also aligns with the feedback collected in the interviews; these two are likewise primary tasks in the legal decision process. Besides, as the interviewees pointed out (Section 4.2.1), the need for Penalty has been growing and is increasingly important in recent years.
Meanwhile, the distributions of Particular Case(s) and Interest differ across the two datasets: a higher proportion of the Particular Case(s) intent is observed in the search logs, while the Interest intent rarely appears there. We believe that users' explicit responses in the survey better reflect their real information needs, whereas the query logs are only implicit indicators. Besides, the search engine itself might bias user behavior; for example, if the search engine is not good at satisfying Interest needs, users may avoid using it for this intent, and vice versa. According to the survey, we also noted that some users would use Web search engines rather than legal databases under the Interest intent.
Mixture Analysis. Nearly 10% of search tasks are tagged as Multi in both datasets. Note that Multi in Table 3 consists of two parts: samples for which at least two annotators labeled Multi (7.27% and 4.85% in the survey and logs, respectively) and samples for which the three annotators gave completely different labels (4.55% and 5.18% in the survey and logs, respectively). To analyze these cases in more depth, we visualize the co-occurrence of different intents in Figure 3. For each sample in the first part, we manually processed the annotators' explanations and extracted all potential intents; for each sample in the second part, we considered all annotated categories as possible intents. Then, each pair in a sample's possible intent set counts as one co-occurrence. For example, the intent set "Ch+Pr+Pe" contributes one occurrence to each of the "Ch-Pe," "Ch-Pr," and "Pe-Pr" pairs. The numbers in Figure 3 are normalized by the total number of pairs.
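A minimal sketch of this pair-counting and normalization step is given below, assuming each Multi sample has already been reduced to a set of possible intents (the intent codes and the sample sets are illustrative only):

```python
from collections import Counter
from itertools import combinations

# Each element is the set of possible intents extracted for one Multi sample (hypothetical data).
possible_intent_sets = [
    {"Ch", "Pe"},          # e.g., Characterization + Penalty
    {"Ch", "Pr", "Pe"},    # contributes Ch-Pe, Ch-Pr, and Pe-Pr once each
    {"Ch", "Pe"},
]

pair_counts = Counter()
for intents in possible_intent_sets:
    for a, b in combinations(sorted(intents), 2):   # every unordered pair, counted once
        pair_counts[(a, b)] += 1

# Normalize each pair count by the total number of pairs.
total_pairs = sum(pair_counts.values())
co_occurrence = {pair: cnt / total_pairs for pair, cnt in pair_counts.items()}
print(co_occurrence)
```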
As shown in Figure 3, Characterization and Penalty form the most frequently co-occurring pair in both the user survey and the query log data, accounting for around 50%, which suggests that users may search for both needs simultaneously. Meanwhile, we observe that Procedure usually co-occurs with these two intents, which also aligns with the hierarchical structure of the intent taxonomy. Generally speaking, the query logs, where user intents can only be inferred implicitly, involve more types of co-occurring potential intents than the user survey. These results suggest that retrieval methods which explicitly recognize multi-intent queries are needed.