5.1 XAI needs: Participants desired more information about AI, especially practically useful information that can improve their collaboration with AI
Based on open-ended questions and the survey we developed from the XAI Question Bank [66], we found that while participants were generally curious about AI system details, only those with high-AI background and/or high-domain interest were willing to actively seek out this information (Sec. 5.1.1). However, participants unanimously expressed a need for information that can improve their collaboration with the AI system (Sec. 5.1.2).
5.1.1 Participants were generally curious about AI system details, but curiosity levels differed based on AI background and domain interest.
Like most other AI applications, Merlin does not provide much information about its underlying technology. Hence, when we asked participants what they knew about the app’s AI, all replied that they didn’t know much about system details, although those with high-AI background (P6, P11, P13, P18, P19) had detailed guesses about the app’s data, model architectures, and training algorithms.
So what did participants want to know? According to the survey results, participants wanted to know everything about the app’s AI. For all questions in the survey, most if not all participants responded they “know (or have a good idea of) the answer” and/or are “curious to know (more).” That is, participants were curious about overall system details (questions in the Data, Output, Performance, How, Transparency categories), as well as how the AI reasons and makes judgments on specific inputs (questions in the Why, Why not, What if, How to be that, How to still be this categories). We report the full survey results in the supplementary material.
But how curious are they, really? When we tempered self-reported levels of curiosity with interview questions about the effort participants were willing to invest to satisfy that curiosity, the picture changed. “I wouldn’t go tremendously out of my way to find the answer to these questions” (P12) was a sentiment shared by many participants (P1, P5, P6, P7, P9, P10, P12, P13, P16, P20). For instance, P5 said: “If there’s an opportunity that arises, I’d love to ask about it [...] but I don’t think I would be contacting people at Cornell [app developers].” Other participants were open to searching around a bit (P9, P10), listening to talks or podcasts (P12), or reading some documentation if easily available (P1, P6, P7, P13, P16, P20), but didn’t want to take the initiative to seek out more information about the AI system, as described by the questions in the survey.
Exceptions were some participants with high-AI background (P11, P18, P19) or notably high interest in birds (P1, P4, P8). P11, P18, and P19, likely because they develop AI systems in their work, were very curious about the app’s AI and were willing to go to the extent of reaching out to the app developers (P11, P18) or playing with the data themselves (P19). For example, P19 said: “I’d love to talk to one of the engineers and pick their brain [...] or get some data and play with it myself.” P1, P4, and P8 had medium-AI background, but their exceptionally high interest in birds seemed to fuel their curiosity about the app’s AI. They were particularly curious about how the AI tackles difficult identifications such as mockingbirds that mimic other birds or birds that are difficult for experienced human birders to identify (e.g., “little brown birds”).
In contrast, participants with low-to-medium AI background (P7, P8, P9, P10, P12, P16) had lower explainability needs. For instance, P7, P8, and P10 had little-to-no interest in how the AI reasons and makes judgments on specific inputs. P8 said questions in the Why, Why not, What if, How to be that, How to still be this categories were not what they would ever think about on their own. P7 expressed more bluntly that they prefer to keep the AI as a black box: “No, I don’t want to ruin the mystique.” P9, P12, and P16, on the other hand, became more curious during the interview; however, their responses suggest that they were not very curious about the AI in their natural use environment prior to the interview.
In short, all participants were interested in learning more about the AI, but only those with high-AI background and/or high-domain interest were willing to expend effort to gain more information about the AI’s system details.
5.1.2 Participants desired information that can improve their collaboration with AI.
Participants’ expressed needs for explanation shifted, however, when our interview questions moved away from gauging their curiosity about AI system details, and towards querying their use of the app. While participants’ needs for system details differed based on background and interest, they unanimously expressed a need for practically useful information that could improve their collaboration with the AI system.
To begin, participants wanted a general understanding of the AI’s capabilities and limitations (P1, P4, P5, P16, P19, P20). P1 described a number of ways this understanding would help their use of the app: “It would definitely first help me understand more about when certain identifications may be more or less reliable. But also it will help me supply better inputs to the app to try and get the best quality identification results that I can” (P1). Participants had already tried to gain this understanding by pushing the AI to its limits (P4, P5, P16, P19, P20). Some had tried to fool the AI with non-bird sounds (e.g., sounds of other animals, bird impersonations) to understand when it works and when it breaks (P4, P5, P16, P19). Others had conducted more rigorous experimentation by altering their input (e.g., clip the audio recording, remove location information) and observing changes in the AI’s output to understand what factors influence the AI’s output and how (P4, P20).
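To make concrete the kind of informal probing participants described, the following minimal sketch varies one input factor at a time and compares the resulting top predictions. The identify function is a hypothetical stand-in for the app’s classifier, not Merlin’s actual API, and the variations shown are only illustrative.

# Minimal sketch of input-ablation probing; `identify` is a hypothetical stand-in
# for the app's classifier and returns (species, score) pairs.

def identify(audio_clip, location=None):
    # Placeholder for the real model: here it just returns a dummy result.
    return [("House Wren", 0.62), ("Marsh Wren", 0.30)]

def probe(audio_clip, location):
    baseline = identify(audio_clip, location)                          # full input
    no_location = identify(audio_clip, location=None)                  # drop metadata
    clipped = identify(audio_clip[: len(audio_clip) // 2], location)   # shorter clip
    # Comparing top predictions shows which factors the model appears to lean on.
    return {
        "baseline": baseline[0],
        "without_location": no_location[0],
        "clipped_audio": clipped[0],
    }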
Another frequently expressed need was for a display of the AI’s confidence (P1, P2, P3, P4, P6, P13, P18, P20). Participants wanted this information to better determine when to trust the AI’s output. Concretely, P2 asked for percentage-based confidence scores: “If it doesn’t give a percentage [...] I just don’t have a gauge of how correct it is” (P2). P7 asked that the AI qualify its output by saying “it may not be the exact match” or give a general answer (e.g., “we don’t know the exact species but this bird is in the Wren family”).
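As an illustration of what such a display could look like, here is a minimal sketch assuming the classifier exposes softmax probabilities; the threshold and the family-level fallback are illustrative choices inspired by P2’s and P7’s suggestions, not features of Merlin.

# Minimal sketch of a confidence display with a family-level fallback.
SPECIES_THRESHOLD = 0.80  # illustrative cutoff below which we give a coarser answer

def present_result(probabilities, taxonomy):
    # probabilities: dict mapping species name -> softmax probability
    # taxonomy: dict mapping species name -> family name (e.g., "Marsh Wren" -> "Wren")
    species, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= SPECIES_THRESHOLD:
        # Confident enough for a species-level answer with a percentage, as P2 asked for.
        return f"{species} ({confidence:.0%} confidence)"
    # Otherwise qualify the output, as P7 suggested: give a family-level answer.
    family = taxonomy.get(species, "an unknown")
    return f"Not an exact match; this bird is likely in the {family} family ({confidence:.0%})"

print(present_result({"Marsh Wren": 0.55, "House Wren": 0.40},
                     {"Marsh Wren": "Wren", "House Wren": "Wren"}))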
Lastly, participants wanted the AI to give more detailed outputs (P2, P10, P11, P12). They demanded information that would help them verify the AI’s output. For instance, P10 wanted the AI to “highlight the time period of the [sound] clip that it calls a certain species” because it is hard to know which sound corresponds to which bird when multiple birds are singing at once. Going a step further, P2, P11, and P12 wanted the AI to specify the type of bird sound it heard. Currently, the verification process is arduous because each bird species has a number of songs and calls, as well as more specific sounds such as juvenile calls, flock calls, and alarm calls. They said the suggested features would make the verification process easier and provide more information about how the AI has made its identification, with which they can more readily check the AI’s output and determine whether to trust it.
In sum, when we queried participants about their actual, real-world use of the app, they expressed a desire for information which could improve their use of the app, particularly in deciding whether or not to trust the AI’s outputs. Intriguingly, they expressed these desires before we showed them our mock-ups of what XAI explanations for the app might look like. This suggests that these XAI needs were not prompted solely by seeing XAI explanations.
5.2 XAI uses: Participants intended to use explanations for calibrating trust, improving their own task skills, collaborating more effectively with AI, and giving constructive feedback to developers
Next, when we showed XAI explanations to participants, they were excited to use them for various purposes beyond understanding the AI’s outputs: for determining when to trust the AI (Sec. 5.2.1), which is a well-known use and commonly-stated motivation for XAI [39, 79, 111, 137, 140], but also for learning to perform the task better on their own (Sec. 5.2.2), changing their behavior to supply better inputs to the AI (Sec. 5.2.3), and giving feedback to the developers to improve the AI (Sec. 5.2.4), which are less discussed uses in existing literature.
5.2.1 Participants intended to use explanations to determine when to trust AI.
Many participants said they would use explanations to determine when to believe the app’s identification result (P1, P4, P8, P11, P13, P18, P20). The need underlying this use is consistent with the aforementioned need for information that helps them decide when to trust the AI. While looking at different explanation mock-ups, participants gave examples of when their trust would increase and decrease. For instance, participants said they would feel more confident in the AI’s output when heatmap-based explanations show that the AI is “looking at the right things” (P8) and when example-based explanations show example photos that look similar to their input photo. Conversely, they said they would feel more skeptical when heatmap-based explanations suggest that an “artifact was important” (P8), when concept-based explanations have errors in their concept recognition (e.g., says there is a long beak when there is not) (P18), and when prototype-based explanations match photo regions and prototypes that don’t look similar to them (P4). These findings confirm existing literature [39, 79, 111, 137, 140] and suggest that trust calibration will be an important use of XAI.
5.2.2 Participants desired to learn via explanations to better achieve the task on their own.
Intriguingly, a greater number of participants said that they intended to use explanations to improve their task skills (P1, P2, P4, P6, P7, P8, P9, P10, P11, P13, P15, P17, P19, P20). Participants viewed the AI as a teacher and were keen to learn the features it looks at via explanations, so they can look for these features in the future when they are birding on their own. Participants were aware that the features the AI looks at may be different from what expert human birders look at, but they weren’t very concerned about the potential differences. One participant even said it would be interesting if the AI finds new ways of identifying birds and explanations can “call attention towards things that people did not really think of before” (P1). Still, participants preferred that explanation forms be similar to those of human birders. We elaborate on this point further in Sec. 5.3.
Overall, participants were excited about how explanations could make birding more accessible for themselves and others who lack access to expert resources (e.g., mentoring from human birders):
“It [the explanation] is kind of training or giving me more information and I’m kind of learning these things [what features to look at]. Whereas before, birders or ornithologists are learning this from mentors or teachers in the field. But those opportunities are limited based on social relations, privilege, how closely you are connected to birding groups and stuff. And so it will be much more openly accessible if that kind of more comparative identification knowledge was accessible through just an app.” – P1
Even participants with high-domain background, whose main goal for using the app was not to obtain such knowledge, appreciated the educational value of explanations and said explanations would help them learn faster (P16).
These findings are closely related to recent works by Goyal and colleagues [47] and Pazzani and colleagues [96]. They demonstrated that XAI explanations help non-bird-experts (graduate students in machine learning [47] and undergraduate students in psychology, cognitive science, or linguistics courses [96]) learn to distinguish birds. While their experiments employed relatively easy tasks, i.e., assigning bird images to one of two species options, they showed the potential of learning from AI via XAI explanations. Although [47, 96] did not establish that this is a need that people have, our work provides empirical evidence for it, suggesting learning from AI as another important use case for XAI.
We postulate that this use case stemmed from Merlin’s status as an expert AI system. Many AI applications are deployed to automate tasks that are easy for people (e.g., face verification, customer service chatbots) in settings where it is costly or implausible to have humans in the loop. In contrast, Merlin possesses expertise that most people don’t have and would need to invest time and effort to gain. This expertise is likely the source of Merlin explanations’ educational value. In other types of AI applications, end-users may not intend to learn from AI via explanations.
5.2.3 Participants viewed explanations as an opportunity to be better AI-collaborators.
Participants also saw explanations as an opportunity for action. They looked for feedback on their own behavior that would in turn enable them to help the AI better achieve the task (P1, P7, P9, P20). P20 said explanations, by providing insights into how the AI got an identification wrong, can help them figure out the answer to: “What would I have to do to change this photo to make it [AI] understand it better?” Participants sought out opportunities to improve their own collaborative skills when working with the AI to achieve a task, because at the end of the day they wanted to achieve the best possible outcomes:
“You’re still trying to look for the right bird. So if you can adjust human behavior to get the right answer out of the robot [AI], then that’s helpful.” – P20
Because of this need, participants were critical of XAI approaches they thought didn’t provide actionable feedback. For instance, P9 questioned the utility of heatmap and example-based explanations: “How is it helpful to the user in the future? Besides just being cool and interesting? How does it change the user’s use of the app? Does it make you take a different photo?” They critiqued that these approaches don’t help them help the AI be more correct.
We view this intended use of XAI explanations as an extension of participants’ current efforts to help out the AI. When describing their use of the app, participants mentioned several different ways they help the AI perform better. Some were smaller adjustments on the spot, such as facing the microphone closer to the bird and getting a sufficiently long recording for Sound ID (P9). Others were more involved, such as the efforts P1 described as part of their “general workflow” for using Photo ID:
“I basically don’t use images that are either too blurry or do not feature the bird in an unobstructed manner. I know from my personal experience using it that Merlin works a lot better if it has a more silhouetted side profile shot of the bird. [...] So I try to feed Merlin photos taken from similar angles, also in acceptable lighting conditions. I might have to boost the contrast or the brightness of a picture artificially to feed it into Merlin to get better results. If there’s no real contrast, then it’s much harder to get credible results.” – P1
In short, participants viewed the AI as a collaborator. They have already found ways to better work with it, and they intended to use XAI explanations to further improve their collaboration. To this end, they wanted explanations to provide actionable feedback on their own behavior so that they can supply better inputs to the AI.
5.2.4 Participants saw explanations as a medium to give feedback to developers and improve AI.
Finally, participants with high-AI background intended to use explanations as a medium to give feedback to developers and contribute to improving the AI (P13, P18, P19). These participants mentioned that explanations, by providing more information to end-users about how the AI produced its output, enable end-users to give more detailed feedback. This feedback can then help developers improve the AI system. P13 illustrated this process using prototype-based explanations as an example:
“The fact that it [AI] identifies parts of the tree, that’s a great opportunity [to have end-users] tap that region and say ‘not a part of the bird’ so that you can get the users helping you to do some curation and labeling on the images, which someone could review or whatever. You can make much higher quality models by getting this sort of labeling right.” – P13
P18 suggested a similar feedback process for example-based explanations. They said that when end-users disagree with the provided examples of similar-looking birds, they can correct them by saying “no, I think it actually looks more like bird number three,” which would help developers align the AI’s notion of perceptual similarity with that of humans and improve the AI.
Lastly, P19 described XAI’s potential for creating a positive feedback loop that helps both end-users and the AI system:
“So there’s a feedback loop here, right? Because if that [teaching people to better identify birds] is your goal, and you’re successful in doing that, then you’re able to rely on people to verify their data, contribute solid data, and that data can help inform Merlin, which makes Merlin better, which makes it do its job better. [...] I think no matter what, it [providing explanations] is kind of beneficial.” – P19
P13 and P18 shared this view and said they would be excited to help developers improve the app by providing feedback via explanations. P18, in particular, expressed a strong desire to contribute. They had already been signing up for beta versions of the app, and the first answer they gave to the question “What would you like to know more about Merlin?” was: “How I can contribute more” (P18).
In short, participants with high-AI background desired to use explanations to help improve the AI, so that they can achieve better outcomes with it in the future. We interpret this as another example of participants viewing the AI as a collaborator with whom they work.
5.3 XAI perceptions: Participants preferred part-based explanations that resemble human reasoning and explanations
In this last results section, we describe how participants perceived the four XAI approaches we mocked up: Heatmap (Sec. 5.3.1), Example (Sec. 5.3.2), Concept (Sec. 5.3.3), and Prototype (Sec. 5.3.4). We also summarize concerns expressed toward explanations (Sec. 5.3.5), and explore how existing XAI approaches might satisfy end-users’ explainability needs and goals identified in the previous sections.
5.3.1 Heatmap-based explanations: Most mixed opinions.
We received the most mixed reviews for heatmap-based explanations. Participants who liked heatmaps described them as “fun” (P15), “aesthetically pleasing” (P3), and intuitive—“it’s very easy, it hits you right away” (P9). Some participants were positive because they often use heatmaps in their work and find them helpful for representing information (P12, P19). Conversely, a few participants expressed a strong dislike (P14, P16), e.g., “I hate those things [...] They are simply not intuitive” (P14). P20 didn’t like heatmaps as an explanation form because “heatmaps feel like they should be related to weather,” revealing individual differences in perception.
Regarding utility, some said heatmaps help them understand how the AI had made a mistake (P7, P9, P13). For instance, P19 said they could see how the AI made a mistake for the Marsh Wren photo because the heatmap (in Fig. 2) did not highlight areas that are important for distinguishing different species of Wrens (e.g., Marsh Wren has a white eyebrow that House Wren doesn’t). However, many participants criticized that the shown heatmaps were too coarse and uninformative (P1, P2, P3, P4, P6, P10, P11, P16, P17, P19). “It’s just highlighting the bird” was a common remark. Participants said heatmaps would be more helpful if they highlighted a few salient features of the bird, much as human birders focus on a few field markers when identifying birds.
Finally, some participants thought heatmap-based explanations were inherently limited by their form. P9, P11, and P17 said heatmaps were unsatisfying because they don’t answer the “why” question. Regarding heatmaps’ highlighted regions, P17 asked: “Yes it’s important, but why was it important?” Other participants were dissatisfied because heatmaps lacked actionable information (P9, P11). They said knowing which parts of the photo were important to the AI does not help them change their behavior to help the AI be more correct in future uses.
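For reference, heatmap explanations of this kind are commonly produced with gradient-based saliency methods such as Grad-CAM. The minimal sketch below illustrates the idea; the ResNet backbone and target layer are placeholders, not Merlin’s internals or the source of our mock-ups.

# Minimal Grad-CAM-style sketch with a placeholder backbone.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights=None).eval()  # placeholder classifier, not Merlin's model
target_layer = model.layer4                   # last convolutional block

activations, gradients = {}, {}

def _capture(module, inputs, output):
    # Save the layer's activations and register a hook to save their gradients.
    activations["value"] = output
    output.register_hook(lambda grad: gradients.update(value=grad))

target_layer.register_forward_hook(_capture)

def gradcam(image, class_idx):
    # image: (1, 3, H, W) tensor; returns an (H, W) heatmap with values in [0, 1].
    scores = model(image)
    model.zero_grad()
    scores[0, class_idx].backward()
    acts, grads = activations["value"], gradients["value"]    # both (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)            # per-channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0].detach()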
5.3.2 Example-based explanations: Intuitive but uninformative.
There was a consensus among participants that example-based explanations are “really easily understandable.” However, opinions diverged regarding their utility. Some found them helpful for determining when to trust the AI (P4, P5, P17, P20) since they themselves can compare their input photo to the examples in the explanations. P4 noted that example-based explanations feel “much more collaborative” since they allow end-users to do their own moderation of the provided information. P19, on the other hand, was concerned that they would “open the door for user error.” Especially for difficult identifications where there are only subtle differences between candidate birds, P19 said example-based explanations wouldn’t help non-bird-expert end-users arrive at a more accurate identification.
Many participants described example-based explanations as rather uninformative (P1, P4, P6, P8, P10, P11, P12, P18). Some thought they didn’t add much information to the example photos already shown in the app with the identification result (P1, P6, P10, P11). They understood the difference between the two: example-based explanations convey what the AI considers similar to the input photo, while the currently provided example photos are part of a fixed bird description and independent of the input. Still, they thought the explanations were not very useful. Some even preferred the current example photos because they are high-quality and well-curated (P1, P6).
Another frequent criticism of example-based explanations was that they are too general and impression-based (P4, P8, P10, P12, P18). Participants were frustrated that they don’t communicate which features the AI used to make its identifications; e.g., P8 said “This kind of tells you nothing.” Due to this lack of specificity, many mentioned that example-based explanations were not helpful for their various intended uses, ranging from understanding the AI’s reasoning to supplying better inputs to the AI to improving their own bird identification skills.
5.3.3 Concept-based explanations: Well-liked overall but overwhelming to some.
Participants were largely positive towards concept-based explanations. Most of the praise concerned their part-based form. They liked that the AI’s output was broken down into chunks that human birders reason with, i.e., concepts (P3, P4, P11). “This is what a person looks for basically when they’re identifying a bird,” remarked P3. Relatedly, participants liked that concept-based explanations resemble the way bird identifications are taught and shared between birders (P3, P8, P17). P17 said, “before all this technology, this is exactly how you would basically learn to ID a bird.” For these reasons, participants mentioned that concept-based explanations seem helpful for learning to identify birds on their own.
Participants also mentioned other use cases where concept-based explanations can help. For instance, P11 said they would allow people to check the AI’s output more thoroughly because people can agree or disagree with the explanation at the level of individual concepts. As an example, they said they would not believe the AI’s output if the explanation says there are red feathers in the photo when there are not. Participants also liked that the shown explanations provided a final score for the output because they display the AI’s confidence in the identification (P1, P5, P17). P5 said such scores would be particularly helpful when they are comparing similar-looking candidate birds.
Nonetheless, participants mentioned a few areas of improvement. Several participants pointed out that the concepts in the shown explanations (e.g., long beak, black feathers, white body) were too general (P1, P4, P5, P10). They suggested adopting birders’ language and describing birds with more specific terms such as “underbelly, chest, rump, wing, wingbars, neck, head, cap” (P4). Participants also recommended making the numbers in the explanations as easily understandable as possible (P6, P9, P12, P13, P15, P16, P18). P6 pointed out that the current concept coefficients are confusing: “I have no idea what any of the numbers mean? Like is 1.7 good?” Specifying which numbers are good and bad and constraining the coefficients’ range may mitigate some of this confusion. Even with these changes, however, concept-based explanations may not be everyone’s cup of tea. Some participants shared that they found the explanation form inherently overwhelming and less attractive (P5, P13, P16, P20). P16 shared: “I sort of tune out with numbers after a while.” P20 also expressed their preference for more visual explanations: “I’m such a visual person that stuff like this would go right over my head and make no sense for the most part.”
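One low-cost way to address P6’s confusion, sketched below under the assumption that the explanation is a weighted sum of concept scores, is to rescale each concept’s raw contribution into a signed share of the total evidence; the concept names and weights here are illustrative only, not values from Merlin or our mock-ups.

# Minimal sketch: turn raw concept coefficients into signed percentage contributions.
def concept_contributions(concept_scores, weights):
    # concept_scores: dict concept -> detected strength in [0, 1]
    # weights: dict concept -> coefficient for the predicted species (e.g., the 1.7 P6 saw)
    raw = {c: concept_scores[c] * weights[c] for c in concept_scores}
    total = sum(abs(v) for v in raw.values()) or 1.0
    # Express each concept as a signed percentage of the overall evidence,
    # so readers can immediately tell which numbers are "good" or "bad."
    return {c: round(100 * v / total) for c, v in raw.items()}

print(concept_contributions(
    {"long beak": 0.9, "black feathers": 0.7, "white body": 0.2},
    {"long beak": 1.7, "black feathers": 1.1, "white body": -0.4},
))
# -> {'long beak': 64, 'black feathers': 32, 'white body': -3}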
5.3.4 Prototype-based explanations: Most preferred.
Many participants picked prototype-based explanations as their favorite (P2, P3, P4, P6, P7, P9, P10, P12, P13, P15, P16, P17, P19, P20). The part-based form was clearly preferred, for reasons similar to those mentioned for concept-based explanations. P15 and P20 said prototype-based explanations are analogous to how they think about birds, and P1 that they are analogous to how birders teach each other. Between prototypes and concepts, participants tended to prefer prototypes for their visual nature and information content: prototype-based explanations locate and draw a box around relevant bird parts in the user-input photo, whereas concept-based explanations only list the bird parts. P4 summarized the advantages: “It makes a very clear match between the photo that you’re looking at and a larger base of what this bird should look like. It also skips over the whole language issue and is incredibly visual which I really appreciate.” Participants also noted that prototype-based explanations can help many uses, e.g., learning how to identify new birds (P2, P8, P13, P15, P19, P20), understanding how the AI is working (P11, P13, P15, P16, P20), spotting the AI’s mistakes (P4, P13), and changing their own behavior to supply better inputs to the AI (P20).
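To clarify the form participants preferred, the sketch below shows, in the spirit of prototype-based methods such as ProtoPNet, how an input photo region can be matched to a learned prototype and mapped back to a bounding box; the feature extractor and prototype vector are placeholders, not the models behind our mock-ups.

# Minimal sketch of matching a photo region to a learned prototype.
import torch.nn.functional as F

def best_patch_match(feature_map, prototype):
    # feature_map: (C, h, w) convolutional features of the input photo (placeholder backbone)
    # prototype: (C,) learned prototype vector, e.g., a "striped wing" patch
    C, h, w = feature_map.shape
    patches = feature_map.permute(1, 2, 0).reshape(-1, C)             # (h*w, C)
    sims = F.cosine_similarity(patches, prototype.unsqueeze(0), dim=1)
    idx = int(sims.argmax())
    return divmod(idx, w), float(sims.max())                          # (row, col), similarity

def to_pixel_box(cell, grid_hw, image_hw):
    # Map the best-matching feature-grid cell back to a rough pixel box for display.
    (row, col), (h, w), (H, W) = cell, grid_hw, image_hw
    cell_h, cell_w = H // h, W // w
    return (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)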
Complaints about prototype-based explanations were mostly minor. Some participants described the current version as “cluttered” and “difficult to see” (P1, P4, P5, P6, P11) and made UI design recommendations, e.g., having one prototype-photo region match pop up at a time (P11). Participants also mentioned that some prototypes were ambiguous (P2, P11, P18). For instance, P11 said they had to “examine the prototype and the example to figure out what the concept was that they corresponded to.” As a solution, P2 suggested providing a textual description of the prototype. Another complaint was that some prototypes (e.g., feet) were uninteresting (P1, P13, P18). “Very few bird species are differentiated based on their feet,” remarked P1. To solve this problem, participants suggested curating prototypes with domain experts and end-users so that the explanation focuses on salient and interesting features, i.e., those that human birders would use to identify birds.
Finally, several participants suggested combining prototype-based explanations with other approaches (P2, P4, P11, P12, P16, P18, P19). Concretely, P2 suggested combining them with heatmap-based explanations; P2, P12, P16, and P18 with concept-based explanations; and P4 and P11 with example-based explanations. P19 didn’t specify an approach. Regarding the combination, some suggestions were general (e.g., show both types of explanations) while others were more specific (e.g., add concept labels to prototypes). P12 and P18 particularly advocated for using information from multiple sources (e.g., photo, sound, location) both for improving the AI’s performance and for explaining its results to end-users.
5.3.5 Concerns about XAI explanations.
Participants were overall excited to see XAI explanations in Merlin; however, some expressed concerns regarding the faithfulness and potential negative effects of explanations. In particular, participants who were familiar with XAI questioned how faithfully the shown approaches would explain the app’s identification process, if they were to be implemented in the app (P6, P10). For example, P6 said example-based explanations feel like “cheating interpretability” unless the AI actually makes identifications using clustering or other techniques that group similar photos together. Regarding concept-based explanations, P6 and P10 asked whether they implied that the AI system is interpretable-by-design and actually reasons in two steps (first concept recognition, then bird identification), or whether they were post-hoc explanations produced by a separate “explainer” system. These questions highlight the importance and challenges of communicating what XAI explanations are actually showing. In some cases, explanations of XAI explanations (“meta-explanations”) may be more complex than the XAI explanations themselves.
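The distinction P6 and P10 raised can be made concrete with a minimal sketch; the components below are made-up placeholders that simply contrast an interpretable-by-design pipeline with a post-hoc explainer.

# Minimal sketch contrasting interpretable-by-design and post-hoc explanation pipelines.
def interpretable_by_design(photo, concept_model, species_from_concepts):
    # Step 1: recognize concepts; step 2: identify the species from those concepts alone.
    concepts = concept_model(photo)
    species = species_from_concepts(concepts)
    return species, concepts      # the explanation is the pipeline's own intermediate state

def post_hoc(photo, black_box, explainer):
    # The deployed model predicts; a separate explainer rationalizes that prediction,
    # which may or may not faithfully reflect the black box's actual computation.
    species = black_box(photo)
    explanation = explainer(photo, species)
    return species, explanation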
Another concern was that explanations might lead to mistrust or overtrust in AI systems. P20 said a convincing explanation for a misidentification would be “detrimental” to end-users who are trying to learn bird identification on their own, because they might more readily believe in the misidentification and accumulate wrong knowledge. Similarly, P19 said explanations might encourage end-users to “double down on the incorrect identification,” and even create a negative feedback loop if the AI system relies on end-users to input or verify data. These concerns are consistent with findings from recent research [57, 101] that people tend to believe in AI outputs when given explanations for them, and raise caution against negative effects explanations might have on end-users, irrespective of XAI designers’ intent.