1 Introduction
Suppose the owner of an artificial intelligence (AI) product named “G3” ran a before/after user study to find out whether potential customers had better user experiences (UX) with G3’s new version than with an older, not very good version of G3, and that the study results came out like
Figure 1.
The product owner should be somewhat pleased—the product UX clearly improved. Of the 13 measured UX outcome variables (y-axis), 11 were positive. In fact, the asterisks indicate that these 11 differences between the old and new versions of G3 were statistically significant. Still, the effect sizes were mainly small (yellow bars), with only two moderate effect sizes (blue bars).
What the product owner would now like to know is: who was included in those positive effects, and who was left out? What further changes are needed to enable G3 to better support more of the potential customers?
These kinds of questions are human-AI interaction (HAI) questions about the UX quality that AI products offer their customers. In this article, we abbreviate the concept of HAI user experiences as HAI-UX.
One method that the HAI community currently uses to improve AI products’ UX is to develop and apply guidelines for HAI. At least three major companies—Apple, Google, and Microsoft—have each proposed guidelines, providing high-level advice for how to improve HAI, such as “Consider offering multiple options when requesting explicit feedback” [
64], “Let users give feedback” [
47], and “Support efficient correction” [
6]. In fact, G3’s product owner followed a guideline from the Microsoft set, and doing so improved the product (
Figure 1).
In this article, we consider how to measure beyond just whether products like G3 improve their HAI-UX. We investigate how to know who, of all the diverse humans who could be using products like G3, is included in our HAI-UX improvements and who has been left out. Applying the concept of inclusivity to human interactions with AI products, we will say that AI product A is more inclusive to some particular group of people than product B is, if product A provides those particular people with measurably better UX outcomes than product B does.
The primary groups of interest in this article are those who are diverse in the ways they go about
problem-solving. We use the term
problem-solving to mean any situation in which people are engaged in solving a problem, such as deciding whether and how to accept or reject an AI’s recommendations. We consider participants’ problem-solving diversity via the set of five problem-solving style spectra from the inclusive design method known as Gender Inclusiveness Magnifier (GenderMag) [
19]. We use the term
problem-solving styles to refer to the approaches individuals take when trying to solve a problem. GenderMag’s five problem-solving style spectra capture people’s diverse attitudes toward risk, levels of computer self-efficacy, motivations for using technology, information processing styles, and styles of learning technology.
For example, “risk-averse” is one endpoint of the risk attitude spectrum. Applying risk-aversion to technology, risk-averse users may be hesitant to invoke a new feature for fear that it may have undesirable side-effects (e.g., privacy), may waste their time, may not be worth learning about, and so forth. At the other end of the spectrum, “risk-tolerant” users may be more willing to take such risks, even if the feature has not been proven to work yet and requires additional time to understand [
53,
98,
125].
In this investigation, we consider how the UX of people with diverse problem-solving styles was impacted by design differences in AI-powered systems like G3 above. Specifically, we gathered 1,016 participants’ five GenderMag problem-solving styles and investigated inclusivity differences in 16 pairs of AI products. Each pair of AI products had a controlled difference: one AI product applied an HAI guideline from the Amershi et al. guidelines set [
6], and its counterpart violated that guideline. All AI products were productivity software (e.g., Microsoft PowerPoint, etc.) that had added AI features. An earlier investigation on the same data, reported in Li et al. [
83], investigated the “whether” questions of these data, i.e., whether HAI-UX differences occurred between each pair of AI products. That investigation found that participants’ HAI-UX outcomes were generally better when the guidelines were applied;
Figure 1 is in fact one example of their findings. Our investigation instead focuses on “who” questions, i.e., who was included (and who was not) in the HAI-UX outcome changes, from the perspective of participants’ diverse problem-solving styles.
To show how analyzing HAI-UX data by these five problem-solving styles can reveal actionable insights into how to improve an AI product’s inclusivity, this article presents a detailed analysis of one of these problem-solving style types, namely attitudes toward risk. However, space constraints prevent providing detailed analyses for all five of these problem-solving style types, so this article summarizes the remaining four problem-solving style types’ results with an eye toward generality; we also provide detailed analyses of all five problem-solving style types in the Appendices. We selected attitudes toward risk as the problem-solving style type to present in detail because of the preponderance of recent research literature and popular perception focusing on risks with AI, such as risks of inaccuracies, of privacy loss, of excessive or insufficient trust, of job loss, and more [e.g.,
31,
35,
45,
57,
61,
62,
72,
118]. We investigate the following:
RQ1-Risk: When the HAI guidelines are violated vs. applied to AI products, how inclusive are the resulting AI products to users with diverse attitudes toward risk?
RQ2-AllStyles: How inclusive are such products to users with diverse values of GenderMag’s other four problem-solving styles?
We also investigate the relationship between participants’ problem-solving style diversity and their demographic diversity. Our reason for relating problem-solving diversity with demographic diversity is that knowing the demographic disparities in who a product serves may not lead to actionable ways to address those disparities; for example, if one gender is left out of high-quality UX with an AI product, how should that be fixed? In contrast, problem-solving style disparities often do suggest actionable ways forward; for example, if risk-averse participants are left out of high-quality UX, perhaps the product should be clearer about the risks of using it (e.g., its privacy impacts):
RQ3-Demographic Diversity: How does AI product users’ problem-solving diversity align with their demographic diversity?
Thus, the new contributions of our research are as follows:
—
Measuring HAI-UX inclusivity: Presents an approach to measure inclusivity of an AI product’s UX.
—
Risk-inclusivity in HAI-UX: Uses the approach to reveal which of the participants with diverse attitudes toward risk are well-supported by a set of 16 AI products and which are not.
—
Beyond risk-inclusivity in HAI-UX: Generalizes the above results to the other four GenderMag problem-solving style spectra.
—
Actionable inclusivity in HAI-UX: Reveals whether the above results suggest actionable steps an HAI practitioner can take to make an AI product more inclusive.
—
Problem-solving diversity and demographic diversity: Reveals relationships between participants’ problem-solving styles and their intersectional gender-and-age demographic diversity, to enable HAI practitioners to bring actionable results from problem-solving diversity investigations to bear on demographic disparities.
—
Implications for practitioners: Suggests concrete ways HAI practitioners can use the approach on their own products, as well as starting points for developing new criteria, guidelines, and/or onboarding processes for designing their own AI products to be more inclusive of and equitable to diverse customers.
3 Methodology
To investigate our research questions, we performed 18 independent experiments, one for each of Amershi et al.’s 18 HAI guidelines [
6] (listed earlier in
Section 2.1). We used these experiments to perform two investigations. Investigation One, reported in Li et al. [
83], investigated the impacts of violating/applying these guidelines. Investigation Two, which is the one we report in this article, investigated potential disparities in the UX of participants with diverse problem-solving style values. Our investigation used GenderMag’s five problem-solving style spectra—the spectrum of participants’ attitudes toward risk, of their computer self-efficacy, of their motivations, of their information processing styles, and of their learning styles (by process vs. by tinkering).
To answer these research questions, we generated the following statistical hypotheses before data collection. For any dependent variable and any of the five problem-solving styles, our statistical hypotheses between applications (app) and violations (vio) of any guideline were as follows:
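Stated schematically (a reconstruction consistent with the paired comparisons described in Section 3.4; the subscripting is ours), for a given dependent variable, problem-solving style group, and guideline:
\[
H_{0}\colon \mu_{\mathit{app}} = \mu_{\mathit{vio}} \qquad\text{vs.}\qquad H_{1}\colon \mu_{\mathit{app}} \neq \mu_{\mathit{vio}},
\]
where \(\mu_{\mathit{app}}\) and \(\mu_{\mathit{vio}}\) denote that group’s mean response to the dependent variable for the guideline-application and guideline-violation AI products, respectively.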
3.1 Study Design
The experiments’ context was productivity software, such as document editors, slide editors, search engines, e-mail applications, and spreadsheet applications. Each experiment was a 2 \(\times\) 2 factorial experiment, where each factor had two levels.
The first factor, the “guideline adherence” factor, was within-subjects, and the factor’s levels were “guideline violation” and “guideline application.” For any one guideline’s experiment, the “guideline violation” condition violated that particular HAI guideline; for example, in Guideline 1’s experiment (make clear what the system can do), the “guideline violation” did not make clear what the system can do. Similarly, in Guideline 11’s experiment (make clear why the system did what it did), the “guideline violation” did not make clear why the system did what it did. In contrast, the “guideline application” level applied each guideline; for example, in Guideline 1’s experiment (make clear what the system can do), the “guideline application” condition added clarifying information about what the system can do.
The second factor, the “AI performance” factor, was between-subjects. This factor’s levels were “AI optimal” and “AI sub-optimal.” In the “AI optimal” level, the AI sometimes made mistakes but worked well most of the time, whereas in the “AI sub-optimal” level, the AI sometimes made mistakes and sometimes worked well.
In each experiment, both the product that violated the guideline and the product that applied it were represented by vignettes, as in several other works in HAI [
1,
32,
81,
85,
91]. The vignettes were developed in two phases: in the first phase, two researchers went through an iterative brainstorming process, where they independently thought about how the 18 guidelines might show up in productivity software, drafting between 5 and 8 interaction scenarios for each guideline. Then, the researchers collaborated to review, rewrite, and sometimes replace the scenarios. In the second phase, the researchers adhered to Auspurg et al.’s [
10] best practices to make the vignettes simple, clear, and realistic. In cases where text alone could not convey the interaction clearly, images were added to promote understandability. Before deploying the study, each vignette went through two rounds of piloting. In the first round, each vignette received feedback from seven Human-Computer Interaction researchers not familiar with the project, and changes were made based on that feedback. In the second round, we piloted the updated vignettes on Amazon Mechanical Turk (MTurk) with five participants per vignette; no issues were identified from this second pilot. Each of the final vignettes was composed of three parts: (1) a product/feature introduction; (2) a description of what the AI feature did; and (3) a summary of how well the AI performed.
Figure 3 provides an example of the two vignettes from the experiment for Guideline 1 (“Make clear what the system can do”). In the first part, the only difference between the two conditions’ vignettes was the product name (Ione vs. Kelso), generic names given to the products to distinguish them from each other and to avoid the influence of prior familiarity with a real product. The second part manipulates the “guideline adherence” factor. In
Figure 3(a), part 2 states:
“We will help you improve your presentation style”
without giving specific examples or details, thus violating the guideline by
not making clear what the system can do. In contrast,
Figure 3(b)’s part 2 applies the guideline to make clear exactly what the system can do, stating:
“As you practice your presentation, we will give you feedback about your presentation style: how fast you speak, use of filler words (such as ‘um’ and ‘like’), use of inappropriate words (such as ‘damn’).”
We will refer to the vignette that violated the guideline as the Violation AI product. Similarly, we will refer to the vignette that applied the guideline as the Application AI product.
Table 2 lists the questions that the participants responded to for both the
Violation AI product and the
Application AI product. These dependent variables gather information about different dimensions of participants’ UX. The first five questions (control, secure, inadequate, uncertain, and productive) follow Benedek and Miner’s [
14] approach of measuring end users’ feelings in UX. Perceived usefulness was taken from Reichheld and Markey [
102] and has been known to relate to acceptance and use of AI-infused systems. The last four questions (suspicious, harmful, reliable, and trust) came directly from Jian et al. [
66], who focused on scales for trust in automated systems. Each question was answered on a 7-point scale ranging from “extremely unlikely” (encoded as 1) to “extremely likely” (encoded as 7).
3.2 Participants and Procedures
A total of 1,300 participants were recruited from Amazon MTurk, a popular crowdsourcing platform. To ensure quality data, participants had to meet certain performance criteria on MTurk before they could participate in the study, such as having at least 100 approved human intelligence tasks (HITs) and having above a 95% acceptance rate on the platform. Additionally, participants had to be located in the USA and be at least 18 years old. After workers accepted the HIT, they were presented with an Institutional Review Board consent form and then answered three screening questions. The first two asked about their familiarity with productivity software and the last confirmed that they met the minimum age requirement. Upon completion of the screening survey, participants were provided $0.20.
Once participants had completed the screening survey, they were randomly assigned to one (and only one) of the 18 experiments, one for each guideline. First, participants saw, at random, either the
Violation AI product or
Application AI product, such as the example provided in
Figure 3. Participants then responded to the UX questions shown in
Table 2, asked in a random order. Once participants completed their responses for the first AI product, they saw the second product and answered the same UX questions in another random order.
Once participants had seen both products and answered the UX questions for each, they were asked to select which product they preferred and explain why they preferred it. As detailed by Li et al. [
83], one of the authors read the open-ended answers provided in each factorial survey repeatedly, until codes began to emerge. Then, the codes were recorded and each comment was coded. Other team members conducted spot checks to verify the qualitative coding. Participants were then asked two manipulation check questions,
one closed- and one open-ended. The closed-ended manipulation check asked participants whether or not they agreed with text that mirrored the guidelines themselves (e.g., “make clear what the system can do,” “make clear why the system did what it did”). For example, if participants in Guideline 1’s experiment agreed that the
Application AI product made clear what the system can do, and they disagreed with the statement for the
Violation AI product, then they passed the manipulation check. The open-ended manipulation check asked participants to “…briefly describe the differences between Kelso and Ione” (the fictitious names randomly assigned to the
Violation AI product and
Application AI product). The open-ended answers were qualitatively coded to check whether or not each participant had successfully perceived the experimental manipulation.
Participants then filled out a questionnaire with their demographic data, including their age, self-identified gender, race, highest education level, and field of employment.
They also filled out the problem-solving style questionnaire (
Section 3.4) and were paid a bonus of $5 for completing the experiment.
3.3 Investigation One Results Summary
As mentioned in
Section 1, Investigation One, reported in Li et al. [
83], compared UX outcomes of AI products that had applied the guidelines against AI products that had not. That investigation’s measures were generalized eta-squared (
\(\eta^{2}\)) effect sizes for each of the dependent variables in each of the experiments.
The primary takeaway from Investigation One was that, for most of the guidelines, participants perceived the
Application AI products as more useful and as providing better UX than the
Violation AI products.
Figure 4 shows thumbnails of their results for each guideline’s experiment. The more color-filled each thumbnail, the larger the positive effect sizes were for that guideline’s experiment. For example, G3’s thumbnail shows significant differences with small or medium effect sizes on most of the HAI-UX aspects measured. G6’s experiment produced particularly strong results. Its thumbnail is almost filled with color, indicating that G6’s experiment produced significant differences on all HAI-UX measures, with medium or large effect sizes for all but one.
In addition, Investigation One’s analysis informed two aspects of Investigation Two’s analysis. First, Investigation One’s analysis revealed that 2 of the 18 experiments failed the manipulation checks (
Section 3.2)—the experiments for Guidelines 2 and 16—and as such were dropped from Investigation One. Thus, our investigation also drops those two experiments, which leaves a total of 1,043 participants across the remaining 16 experiments. Second, Investigation One’s analysis of these remaining 16 experiments revealed that the AI optimality factor (
Section 3.1) was significant in only one of these experiments. This resulted in Investigation One dropping this experimental factor, and we do the same for Investigation Two.
3.4 Investigation Two (Current Investigation) Data Analysis
This article’s investigation analyzes the same independent experiments’ data from a new perspective: the inclusivity that the violation vs. application AI products afforded diverse participants. Specifically, we consider diversity in terms of participants’ diverse problem-solving styles (RQ1-Risk and RQ2-AllStyles) and their diverse gender/age demographics (RQ3-Demographic Diversity).
To collect demographics, we used a questionnaire asking participants their gender identity and age group. To collect participants’ diverse problem-solving styles, we used the GenderMag facets survey [
55], a validated survey that measures participants’ values of the five GenderMag problem-solving style types enumerated earlier in
Section 2.1, termed “facets” in GenderMag publications. Each problem-solving style type has multiple Likert-style questions that run from
Disagree Completely (encoded as a 1) to
Agree Completely (encoded as a 9), a few examples of which are shown in
Table 3. For example, using this instrument, if one participant answers the first question (top row) closer to
Agree Completely than a second participant, the first participant is considered to be more risk-averse than the second participant. Of the 1,043 participants, 27 failed at least one attention check in the problem-solving style survey, leaving 1,016 participants for this investigation. Appendix
A lists the full questionnaire, including the attention checks.
The GenderMag survey has previously been validated in multiple ways. Hamid et al. [
55] summarize the six-step validation process; among the steps were literature searches, multiple statistical analyses, demographic validation, and problem-solving style validation. Particularly relevant to this article was Guizani et al.’s [
53] participant validation of the problem-solving styles the survey captures. In that study, participants took the survey, then spoke aloud throughout problem-solving tasks. Participants’ in-the-moment verbalizations when problem-solving validated their own questionnaire responses 78% of the time, a reasonably good measure of consistency [
48].
To score a participant’s problem-solving style values, we summed up that participant’s responses to the risk questions, then the self-efficacy questions, and so on. Each sum is the participant’s “score” for that problem-solving style. Comparing these scores reveals a participant’s placement in that problem-solving style type compared to others in the same peer group, such as among computer science professors, or among residents of eldercare facilities, and so forth; in our case, the peer group is the adult productivity software users who participated in the study.
These scores formed 16 distributions, one for each experiment (e.g., see
Figure 5 for the risk score distributions). Using each experiment’s median,
which is robust against outliers, we then defined participants as being either more risk-averse than their peers (i.e., above the median) or more risk-tolerant, and similarly for the other four problem-solving styles.
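To make the scoring and median-split procedure concrete, here is a minimal sketch in Python/pandas. The column names (participant_id, experiment, and the per-item columns) are hypothetical placeholders; the actual survey items and attention checks are those listed in Appendix A.

```python
import pandas as pd

# Hypothetical item-to-facet mapping; the real items appear in Appendix A.
FACET_ITEMS = {
    "risk": ["risk_q1", "risk_q2", "risk_q3"],
    "self_efficacy": ["se_q1", "se_q2", "se_q3"],
    # ...motivations, information processing, and learning style items...
}

def score_and_split(responses: pd.DataFrame) -> pd.DataFrame:
    """Sum each participant's 1-9 Likert answers per problem-solving style,
    then compare each score to the median of that participant's experiment."""
    out = responses[["participant_id", "experiment"]].copy()
    for facet, items in FACET_ITEMS.items():
        score = responses[items].sum(axis=1)                          # the facet "score"
        median = score.groupby(responses["experiment"]).transform("median")
        out[f"{facet}_score"] = score
        # For the risk facet, scores above the experiment's median were labeled
        # risk-averse; which endpoint an above-median score maps to depends on
        # the facet's item wording.
        out[f"{facet}_above_median"] = score > median
    return out
```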
To analyze the dependent variables for each of the 16 experiments, we used
t-tests after ensuring that the assumptions held, as follows. To investigate inclusivity (
Section 4), we compared within-subjects using paired
t-tests, treating each
Violation AI product as a “before” and
Application AI product as an “after.” As
Table 4 shows, each of the 16 experiments had over 30 participants, suggesting approximate normality of each experiment’s sampling distribution by the Central Limit Theorem.
In addition, in cases where the sample size fell beneath 30, we used Shapiro-Wilk tests to validate that the underlying reference distribution was not significantly different from normal (i.e.,
\(p\geq 0.05\)).
Satisfying these assumptions indicated that the above
t-tests were appropriate analysis techniques for these data. Each of the 16 experiments was designed with pre-planned hypotheses for each dependent variable, so we do not report statistical corrections in the body of this article. As other researchers [
8,
99] point out, statistical corrections (e.g., Bonferroni, Holm Bonferroni, Benjamini-Hochberg) are necessary only if “…a large number of tests are carried out without pre-planned hypotheses” [
8,
9]. Still, we recognize that not all readers may agree with this choice, so we also show all the Holm-Bonferroni corrections [
58] in Appendix
D.
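As an illustration of this pipeline (not the authors’ exact code), the per-group comparison for one dependent variable could be run with SciPy and statsmodels roughly as follows; the effect-size convention in the last line of the function is one common choice for paired data.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def paired_comparison(vio_scores, app_scores):
    """Paired t-test treating the Violation product as 'before' and the
    Application product as 'after' for one participant group."""
    vio = np.asarray(vio_scores, dtype=float)
    app = np.asarray(app_scores, dtype=float)
    diffs = app - vio
    shapiro_p = None
    if len(diffs) < 30:
        # Check that the paired differences are not significantly non-normal
        # (want p >= 0.05) before relying on the t-test.
        _, shapiro_p = stats.shapiro(diffs)
    t, p = stats.ttest_rel(app, vio)
    d = diffs.mean() / diffs.std(ddof=1)   # paired-samples Cohen's d (one common convention)
    return {"t": t, "p": p, "d": d, "shapiro_p": shapiro_p}

# Holm-Bonferroni adjustment across a family of p-values (cf. Appendix D):
# reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
```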
4 Results: What Participants’ Risk Styles Revealed
RQ1-Risk considers the 16 pairs of AI products described in
Section 3—one violating a guideline and its counterpart applying that guideline—and how the two differed in their inclusivity of risk-diverse participants’ HAI experiences. (We will generalize beyond risk in
Section 5.)
In this article, we measure whether/how applying a guideline to an AI product
changed the product’s inclusivity toward some particular group of participants. For any UX dependent variable in an AI product, we will say the HAI-UX is
more (less) inclusive to a group of participants if the Application AI product’s result for that variable is
significantly higher (lower) than the Violation AI product’s result for
those participants.
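In our notation (a schematic restatement of this definition, not the article’s original formalism): for a participant group \(g\) and UX variable \(y\), let
\[
\Delta_{g}(y) \;=\; \bar{y}^{\,g}_{\mathit{app}} - \bar{y}^{\,g}_{\mathit{vio}} .
\]
The Application AI product is counted as more inclusive to \(g\) on \(y\) when \(\Delta_{g}(y)>0\) and the paired test of Section 3.4 is significant, less inclusive when \(\Delta_{g}(y)<0\) and the test is significant, and as producing no inclusivity change otherwise.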
To answer this question, we performed an in-depth analysis of all HAI-UX measurements’ inclusivity by considering participants’ attitudes toward risk. HAI-UX inclusivity could change in only four possible ways: (1) inclusivity changes for both the risk-averse and risk-tolerant participants, (2) inclusivity changes for neither of them, (3) inclusivity changes for the risk-averse only, and (4) inclusivity changes for the risk-tolerant only. As
Table 5 shows, instances of all of these categories occurred.
Table 5 also shows that the risk results fell mainly in categories (1) and (2) above. Perhaps most important, the table shows that whenever applying a guideline produced a change in inclusivity, it was almost always a
positive change for at least some risk-attitude group of participants—without loss of inclusivity for the other group.
4.1 When Everybody Gained: More Inclusivity for Both the Risk-Averse and the Risk-Tolerant
At first glance, it may appear that the “when everybody gained” category of results is a natural consequence of the overall success rates shown by Investigation One. For example, many of the experiments that produced strong positive results in Investigation One also did so in Investigation Two, as with G6–G9. Still, even the G6–G9 experiments reveal relationships between outcomes and perceptions of risk that shed new light on the whys of these results.
For example, consider Guideline 8. In this experiment, the AI-powered feature was a design helper to automatically provide design suggestions for alternative layouts in a presentation application. In the Violation AI product’s vignette, the feature’s behavior was: “You are working on a slide and Design Helper pops up, showing you some design suggestions. You do not need any design help at this time, but there is no way to hide the design suggestions.” In contrast, the Application AI product’s vignette started out the same, but its last sentence was: “You do not need any design help at this time, so you click on a button visible on screen to hide the design suggestions.”
The second row of
Table 6 shows one of Guideline 8’s outcomes,
with the
risk-averse participants’ suspicions of the
Violation AI product (
left hatched boxplot) significantly worse than their suspicions of the
Application AI product (
left clear boxplot).
(
t(25) = 3.2354,
p = 0.003, and
d = 0.648).
Likewise, the
risk-tolerant participants also were significantly less suspicious of the
Application AI product than of the
Violation AI product (
t(34) = 3.0020,
p = 0.005, and
d = 0.507), as shown in the
right boxplots.
Yet, despite their agreement on these outcomes, participants’ free-text remarks showed that their reasoning differed with their attitudes toward risk. The risk facet is nuanced—it includes aversion/tolerance toward risks to privacy/security, of producing low-quality work, of wasting too much time, of having trouble with the product, and so forth. In Guideline 8’s experiment, about a fourth (14/61) of the participants’ comments focused on the second of these, the risk of low-quality work.
This focus on the risk of low-quality work was especially true of the risk-averse participants: 31% (8/26) of this experiment’s risk-averse participants wrote about preferring the increased control they had over their work quality with the Application AI product.
G08-1921-risk-averse: “…very convenient and still make me feel very much in control of my choices.”
G08-3619-risk-averse: “I don’t trust [Application AI product]…, but the fact I can turn the feature off lets me be in more control.”
Even the more risk-tolerant were worried about this type of risk, and 17% (6/35) of these participants expressed the same sentiments. However, for these more risk-tolerant participants, annoyance and frustration also figured prominently in their reasoning (26%: 9/35), compared to only one risk-averse participant expressing this sentiment.
G08-2831-risk-tolerant: “Because I can get rid of the content that might… influence me to do something stupid. If I am going to do something stupid it will be my idea.”
G08-3681-risk-tolerant: “…[Application AI product] would allow me more freedom, and be less annoying with its suggestions, even when they are wrong.”
G08-2627-risk-tolerant: “Without an option to turn off an unnecessary feature, I would be extremely frustrated…as it would be a severe distraction… never would I use [Violation AI product]…”
Comments like these, when coupled with the
risk-averse and
risk-tolerant participants’ feeling both significantly more in control and less suspicious of the
Application AI product, suggest relationships between an expectation of risk and four particular HAI-UX inclusivity outcomes. As
Table 5 shows, across all seven experiments where the
Application AI product gained inclusivity for both the risk-averse and risk-tolerant participants’ (not)-suspicious outcome (row 7), it also gained inclusivity for their certainty (row 4), control (row 1), and trust outcomes (row 10).
This result provides insight into why the five experiments that gained the most inclusivity across the risk spectrum participants—G6, G7, G8, G9, and G15—did as well as they did. What these five
guidelines have in common is that they all give users more control over the product. What their five
experiments have in common (from
Table 5) is that, in all of them, the
Application AI product gained inclusivity in all four of the above variables:
4.2 When Nobody Gained: No Inclusivity Improvements for Either Risk Group
Not all the results were as positive for diversity. Some guideline applications did not change inclusivity outcomes for either of the two groups, measured in this article as no significant difference in HAI-UX outcomes for either the risk-averse or the risk-tolerant participants between the Violation AI product and the Application AI product. This was the second-most prevalent category, occurring 44 times across 10 experiments.
Consider Guideline 4’s (“show contextually relevant information”) results; examples are in
Table 7. Guideline 4’s experiment produced nine instances of the “nobody gained” category. In that experiment, the application was a document editor, and the AI-powered feature was an acronym explainer. The
Violation AI product violated the guideline:
“When you highlight an acronym to see what it stands for, [Violation] shows you a standard list of possible definitions taken from a popular acronym dictionary.” In contrast, the
Application AI product:
“When you highlight an acronym to see what it stands for, [Application] shows you definitions that are used in your workplace and pertain to the topic of the current document.”
In some ways, the participants’ reasoning for their unchanging responses to the Violation AI product vs. the Application AI product echoed those of the previous subsection, namely wanting to avoid the risk of low-quality work. As in the previous section, this reasoning was especially common among the risk-averse participants (34% = 10/29), although 22% (8/36) of the risk-tolerant also used it. However, whereas in the previous section participants cited this risk as a reason to prefer the Application AI product, in this section they cited it as a liability of the Application AI product.
G4-4098-risk-averse: “[Violation AI product] may make mistakes… but its use of a generic dictionary makes it easier to recognize mistakes… With [Application AI product], I would be more likely to miss mistakes.”
G4-3799-risk-averse: “…if [Application AI product] were to make a mistake on me, I would have a hard time trusting it because I did not make any part of the decision.”
Guideline 4 also raised privacy concerns among some participants:
G4-3905-risk-averse: “… I would be nervous that [Application AI product] is pulling data from things like my other software and my browsing history.”
G4-3947-risk-tolerant: “I don’t like the idea of [Application AI product] taking definitions from my workplace. It makes me worry I’m being listened to…”
In the “everybody gained” category (previous section), the five most inclusive guidelines across the risk spectrum revealed a relationship between risk-inclusivity and the trust, control, certainty, and (not)-suspicious outcomes. The five
least inclusive guidelines as per
Table 5—G1, G4, G5, G12, and G13—show that the relationship persisted in the “nobody gained” category. None of G1, G4, G5, or G12 produced any inclusivity gains for any of these interrelated variables, and G13 showed only two such gains. These results not only confirm
Result #2, but also provide a complement to
Result #3:
4.3 Selective Inclusivity: Who Gained, Who Did Not, and Why?
The third and fourth categories of HAI-UX inclusivity changes were inclusivity gains for the
risk-averse participants only or the
risk-tolerant participants only. Neither category was very large, with 13 and 27 total instances, respectively.
Table 8 shows a few examples.
Despite the relatively small totals, the fourth category, that of bringing gains to the
risk-tolerant only, reveals a unique pattern shared by three experiments—Guideline 3’s, Guideline 13’s, and Guideline 18’s. As a column-wise reading of
Table 5 shows, in these three experiments, inclusivity gains for the
risk-tolerant participants abounded, but the
risk-averse participants rarely gained.
Guideline 3’s experiment offers a case in point. In that experiment, the Application AI product provides services only when it decides the user’s current task/environment would benefit. The Application AI product’s vignette applied this guideline by stopping e-mail notifications “…when you are busy.”
As
Table 5 shows, for Guideline 3’s
risk-tolerant participants, the
Application AI product showed inclusivity gains on every dependent variable except one. However, only four of these gains extended to the
risk-averse participants.
Why such differences? For Guideline 3’s experiment, the risk-averse participants’ concerns about risks to their work or their privacy appeared to outweigh the benefits of fewer notifications, whereas for the risk-tolerant, the balance seemed to tip the other way. For example, 26% (8/31) of the risk-averse participants explicitly brought up concerns about these risks, but only 11% (4/37) of the risk-tolerant did.
G3-3504-risk-averse: “…Also, I wouldn’t be sure that [Application AI product] would be able to accurately qualify my activities.”
G3-3054-risk-averse: “[Application AI product]… would have to be able to monitor your online activity…that would be a little invasion of privacy…”
A risk-oriented commonality among these three experiments lies in what these AI products were actually doing. All three of these Application AI products learn from the user’s own data, as opposed to learning mainly from huge datasets mostly consisting of other people’s data. Specifically, both Guideline 3’s and Guideline 13’s AI products involved learning from the user’s own context and behaviors, and Guideline 18’s AI product involved learning from that user’s e-mails. In the latter case, the AI product also moved that user’s e-mails around, adding the risk that the user might not find some of their e-mails later.
Two other Application AI products, Guideline 4’s and Guideline 12’s, also had this attribute. Guideline 4 involved learning from the user’s contexts, and Guideline 12 learned from the user’s recent interactions. Most of these two products’ outcomes were in the “nobody gained” category: the risk-averse were very uncomfortable with these products, and even the risk-tolerant saw too many risks (e.g., recall the discussion of Guideline 4’s experiment in the previous subsection).
These five guidelines’
Application AI products were the only ones with this attribute. For these five experiments, the
risk-averse participants hardly ever experienced any inclusivity gains (from
Table 5).
4.4 The Risk Results and Actionability
The AI products in these experiments were designed to isolate the effects of applying vs. violating each guideline. However, if they were real products for sale, the products’ owners would probably hope to make each product well received by as many of its customers as possible.
The risk results provide actionable ideas for such product owners, for seven of these AI products.
Figure 6 points out which products those were.
The green boxes in the figure mark those products: they are G1’s, G3’s, G4’s, G5’s, G12’s, G13’s, and G18’s. For example,
Section 4.1 revealed how lack of user control affected some of the low-performing AI products (e.g., G1’s, G4’s, G5’s, G12’s, and G13’s). An actionable implication is that these products might improve by offering users more control. As another example,
Section 4.3 revealed the sensitivity the
risk-averse participants had to products that potentially did “too much” with
their data (e.g., G3’s, G4’s, G12’s, G13’s, and G18’s products). One actionable idea for those products would be to provide information on what else the user’s personal data are used for and how long these data are stored. More generally, results like these suggest that a way to improve AI products favored by only the
risk-averse participants or only the
risk-tolerant participants is to attend to risk-oriented attributes of the product that were not tolerated well by participants at that end of the risk spectrum.
5 Results: Beyond Risk—the Other Four Problem-Solving Styles
Section 4 considered only one type of problem-solving diversity, namely participants’ diverse attitudes toward risk. We now turn to
RQ2-AllStyles, which asks “How inclusive are such products to users with diverse values of GenderMag’s other four problem-solving styles?” Although space does not permit an in-depth analysis of each remaining problem-solving style—motivations, learning style, computer self-efficacy, and information processing style—we summarize in this section whether and how the risk results of
Section 4 generalize to analogous results. If they do, we also consider whether those new results add anything to our understanding of the UX the AI products offer to diverse problem-solvers. Full analyses for all of these styles are in the Appendices.
Recall from
Table 1 that GenderMag uses two personas, “Abi” and “Tim,” to identify the distinguished endpoints of each of GenderMag’s five problem-solving style types. As per the table’s definitions, we classify participants as more
“Abi”-like if they had any of the following problem-solving style values: more
risk-averse,
lower computer self-efficacy,
task-oriented motivations for using technology, a comprehensive information processing style, or a process-oriented style of learning technology. Participants nearer the opposite endpoint of these problem-solving spectra are classified as more
“Tim”-like; i.e., more
risk-tolerant, had
higher computer self-efficacy, had
tech-oriented motivations, had a more
selective information processing style, or learned more by
tinkering. As in other persona research [
3], we use these persona names for two reasons: (1) to provide a vocabulary for an associated collection of traits (recall
Section 2), and (2) to emphasize which of the “distinguished endpoints” of each problem-solving style type tend to co-occur, helping to keep clear which problem-solving value belongs to the underserved population.
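For reference when reading Tables 5 and 9–12, the endpoint labeling can be summarized as a simple lookup table; the key names below are our own shorthand, and the endpoint assignments follow Table 1.

```python
# "Abi"-like vs. "Tim"-like endpoints for each GenderMag problem-solving style,
# per Table 1's definitions (key names are ours).
FACET_ENDPOINTS = {
    "risk_attitude":          {"Abi": "risk-averse",         "Tim": "risk-tolerant"},
    "computer_self_efficacy": {"Abi": "lower self-efficacy", "Tim": "higher self-efficacy"},
    "motivations":            {"Abi": "task-oriented",       "Tim": "tech-oriented"},
    "information_processing": {"Abi": "comprehensive",       "Tim": "selective"},
    "learning_style":         {"Abi": "process-oriented",    "Tim": "by tinkering"},
}
```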
As the upcoming
Tables 9–
12, and earlier
Table 5 show, the first result of these analyses was very good news for most of the
Application AI products. As was also true of the risk diversity results, whenever an
Application AI product produced a change in inclusivity, it was almost always a
positive change for at least some group of participants—without loss of inclusivity for the other group. For example, as
Table 9 shows, whenever applying a guideline produced a gain for either the task-motivated group or the tech-motivated group, it almost never produced an inclusivity loss for the other group, with only one exception.
One result from
RQ2-AllStyles was
who was advantaged across these 16 product pairs. Note that the GenderMag assignment of endpoints to “Abi” vs. “Tim” followed widespread statistical skews of genders toward these particular styles [
20]; previous research has shown that “Abi” styles have statistical tendencies to cluster, and so do the “Tim” styles. Thus, one might expect color patterns in risk’s results (
Table 5) to be similar to the columnar color patterns in, say, the motivations results (
Table 9).
However, this sometimes did not happen. For example,
Table 5 visually contained twice as many
T cells for the
risk-tolerant participants as there were
A cells for the
risk-averse participants (27 vs. 13 respectively). But Tables
9–
12 show that who gained more inclusivity advantages depended on which problem-solving style was considered. For example,
Table 9 reverses
Table 5’s trend, with more
A cells for the task-oriented participants than
T cells for the tech-oriented.
This trend sometimes occurred even within individual experiments, as demonstrated in
Figure 7 for Guideline 13’s experiment. Guideline 13’s product was a presentation app, and the AI feature was a design helper that recommended designs for alternative layouts. When participants saw the
Violation AI product, they were told that “
…Violation AI product has not learned your preferences and blue designs appear in the same place among the suggested designs as the first time you used it,” whereas the
Application AI product “
…has learned your preferences and now features blue designs prominently.” Considering participants’ attitudes toward risk (first column), the
risk-tolerant participants derived the most benefit from the
Application AI product. However, the second column shows these same participants’ data but instead considering their motivations. This adds to the risk results; the
Application AI product benefited both the
risk-tolerant and also those with
task-oriented motivations.
For HAI practitioners, results like these suggest that different design decisions can appeal to different problem-solving styles for different reasons. For example, 46% (13/28) of the participants with task-oriented motivations mentioned how efficient they would become with the Application AI product or how much time they would save while using it:
G13-2178-task-oriented: “I like to have software that anticipates my needs, because it makes working more efficient.”
G13-4099-task-oriented: “It is more efficient to see designs similar to those I have used before…it will take me less time to find them.”
G13-2740-task-oriented: “It [the Application AI product] learned my preferences quicker which in time will save me time and trouble.”
Although the participants with tech-oriented “Tim”-like motivations also raised efficiency and time savings, they did so less frequently than their task-oriented peers (only 17%—5/29):
G13-662-tech-oriented: “Because it saves time than starting from scratch every time I use it [the Application AI product].”
G13-662-tech-oriented: “I prefer [Application AI product] because of its ability to learn my preferences…thus helping me to work more efficiently.”
Comments like these also have ties to the research literature. In that body of work, people who are more
task-oriented prefer to use technologies to accomplish their task, using methods they are already familiar and comfortable with [
19,
21,
24,
60,
87,
114]. Task-oriented people do so in an attempt to focus on the tasks that they care about, which might explain why these task-oriented participants commented so frequently on how the
Application AI product saved them time; if the product saves them time, then the task-oriented participants could achieve their task more quickly, devoting more time to what they care about rather than having to spend additional time recreating designs.
As the examples and tables in this section have shown, the types of participants who did and did not benefit from the changes in the Application AI product varied by problem-solving style—attending to only the risk style did not tell the whole story. This suggests that HAI practitioners wanting to create a more inclusive AI-powered product for those with diverse problem-solving styles should consider all five of GenderMag’s problem-solving styles.
7 Discussion
7.1 Inclusivity and Equity: Complements in HAI-UX Fairness
Ideas about fairness in AI, what it is, and how to achieve it have recently received substantial attention [e.g.,
18,
41,
50,
57]. Research and conversations in this area usually refer to algorithmic or data fairness—but the ideas are also relevant to HAI-UX fairness.
In considering any type of fairness, two concepts often drive the discussion—inclusivity and equity. So far, this article has considered inclusivity but not equity.
A way to think about inclusivity in HAI is as an “outcome-oriented” concept that applies within a specific group of people. As shown in earlier sections, when an AI product somehow led to disadvantageous outcomes for some particular group of participants (e.g., risk-averse participants), then that product was not inclusive to that group.
Although this article’s inclusivity results revealed who the guidelines were helping the most and who was being left out, they do not answer how much more inclusivity progress an AI product still needs to make and for whom. Measuring equity can help to answer this question. Like inclusivity, equity in HAI is also an “outcome-oriented” concept—but it applies to between-group comparisons. For example, if an AI product’s UX for two groups (e.g., risk-averse participants and risk-tolerant participants) were of the same high—or low—quality, then the product was equitable.
Ideally, one would like the inclusivity gains an
Application AI product achieved to result in a final outcome that is equitable to the two groups. To explore how useful a measure of equity would be to our investigation’s results, we measured equity of a dependent variable’s outcome for a given product as the absence of a significant difference between the two participant groups.
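Under this working definition, checking equity for one product and one dependent variable reduces to a between-group comparison. The sketch below illustrates the idea; the Welch variant of the t-test and the higher-is-better assumption are our choices for illustration, not necessarily the authors’ exact procedure.

```python
from scipy import stats

def equity_state(averse_scores, tolerant_scores, alpha=0.05):
    """Equity = no significant between-group difference in the outcome for the
    same product. Assumes higher scores are better; reverse-code negatively
    worded items (e.g., 'suspicious', 'harmful') before calling."""
    t, p = stats.ttest_ind(averse_scores, tolerant_scores, equal_var=False)  # Welch's t-test
    if p >= alpha:
        return "="                    # equitable outcome
    return "A" if t > 0 else "T"      # which risk group the outcome favors
```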
Table 14 shows risk-group equity outcomes by this measure, superimposed on the risk-group inclusivity outcomes.
For example,
Table 14’s G15 column shows that the G15
Application AI product achieved inclusivity gains two times for
risk-averse participants (orange cells) only, and three times for
risk-tolerant participants (blue cells) only. The G15 column further shows that those five targeted gains, along with the inclusivity gains experienced by everyone, ultimately led to fully equitable outcomes across the risk spectrum (“=” markings). Thus, applying the G15 guideline ended up targeting exactly who it needed to target to bring everyone up to an equitable state.
In total,
Table 14 shows that the guidelines’ resulting
Application AI products almost always produced equitable outcomes across the risk spectrum. Specifically, 129/160 outcomes (81%) were equitable, marked by “=” in the table. Of the 31/160 outcomes that were
inequitable, only 3/160 (2%) favored the
risk-averse (“A”) and 28/160 (17%) favored the
risk-tolerant (“T”).
Of course, equity does not always mean success. The G15
Application AI product produced entirely equitable HAI-UX (Table 14) but only moderately positive HAI-UX (revisit Figure 6). In contrast, the G1 Application AI product produced mostly equitable but extremely low HAI-UX, and the G6 Application AI product produced entirely equitable and very positive HAI-UX.
Due to space limitations, we do not present equity results in detail for risk or for the other four problem-solving styles. Still, our limited exploration here shows the additional value measuring equity can bring, so we advocate for measuring equity as well as inclusivity as a way to fully understand who is being included vs. who is being left out.
7.2 Practical Implications for HAI Practitioners
Measuring inclusivity and equity can bring practical benefits to HAI practitioners. As our results show, incorporating users’ problem-solving styles into AI products’ HAI-UX work can sometimes point out where and why mismatches arise between a group of users and an AI product. Some particularly actionable examples were given in
Section 4.4. A way to gather participants’ problem-solving styles would be to incorporate the validated survey [
55] we used into user testing.
Armed with this new information, HAI practitioners could gain actionable insights in use-cases like the following:
HAI Practice Use-Case 1: To see which problem-solving groups of users are being left behind on an AI product with a problematic HAI-UX, measure the product’s equity state. To do so, for each problem-solving style, HAI practitioners could test whether UX outcomes differ significantly between the “A” participant group and the “T” participant group.
HAI Practice Use-Case 2: To see who a particular AI product change/new feature has benefited, measure
inclusivity changes. To do so, HAI practitioners could compare “A” participants before the change vs. after the change, and likewise for “T” participants, as in the examples in
Sections 4 and
5.
HAI Practice Use-Case 3: After an AI product has changed, complement a measure of its
equity state (as per Use-Case 1) with a measure of
dependent variable final outcomes (e.g., as in
Figure 6). This combination shows not only final equity state but also how successful the AI product’s HAI-UX is for each group of participants.
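Pulling the three use-cases together, a practitioner’s per-variable summary could reuse the paired_comparison and equity_state helpers sketched in Sections 3.4 and 7.1; the function and argument names below are hypothetical.

```python
def hai_ux_summary(before_A, after_A, before_T, after_T, alpha=0.05):
    """Summarize one dependent variable for one AI product change, where
    'A'/'T' are the two groups for a chosen problem-solving style and the
    before/after arrays are paired ratings from the same participants."""
    incl_A = paired_comparison(before_A, after_A)   # Use-Case 2: inclusivity change for "A"
    incl_T = paired_comparison(before_T, after_T)   # Use-Case 2: inclusivity change for "T"
    equity = equity_state(after_A, after_T, alpha)  # Use-Cases 1 and 3: equity state after the change
    return {
        "A_gained": incl_A["p"] < alpha and incl_A["d"] > 0,
        "T_gained": incl_T["p"] < alpha and incl_T["d"] > 0,
        "equitable_after": equity == "=",
        "favors": None if equity == "=" else equity,
    }
```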
Our results suggest that doing measures like these can provide new, valuable information on who is being included, who is being left out, and how a product can improve.
7.3 Threats to Validity and Limitations
As with every empirical study [
76,
129], our investigation has limitations and threats to validity.
In any study, researchers cannot ask participants every possible question; they must balance research goals with participant fatigue. As such, the dependent variables we analyzed may not have captured all information about people’s reactions. For example, some participants’ free-text remarks suggested outcomes that our Likert-style questionnaires did not cover, such as mentions of privacy concerns while interacting with certain products. Because the study was not designed with a dependent variable about privacy, we cannot be certain whether remarks such as these indicated only isolated cases or more prevalent phenomena.
Another threat was how to handle missing data. Since participants had the option to say “I don’t know” for any of the questions, we had to decide whether to (1) impute the data or (2) drop the “I don’t know” values, costing degrees of freedom in our statistical tests. We chose the latter, because although there are many imputation methods to leverage (e.g., hot-deck, cold-deck, and regression), any inferences are then limited to the imputed data, rather than the original data.
Another threat was how to handle the number of statistical tests we ran. As mentioned in
Section 3.4, we did not report statistically corrected results in this article because every test corresponded to a pre-planned hypothesis [
8,
9]. That said, we recognize that some readers may not agree with this decision, so we also provide all Holm-Bonferroni corrected results in Appendix
D.
Also, we chose to use vignettes rather than a real system. Each approach has its own advantages: using vignettes allows enough control to restrict the experimental variation to the independent variable alone, and this isolation was critical to our statistical power. In contrast, a real system’s strength is realism in the external world, but at the cost of such controls. Because this was a set of controlled experiments, we chose control, leaving external validity questions (faithfulness to real-world conditions) to other studies.
Other threats to validity could arise from the particular pairing of vignette to product and/or from participants associating a vignette with a specific real product with which they had familiarity. We attempted to avert the latter by randomly assigning generic names (Ione and Kelso) instead of real product names, but participants may have still imagined their favorite productivity software. If this occurred, it would contribute an extra source of variation in these data.
Although the productivity software and GenderMag problem-solving styles have been shown to be viable/useful in countries around the world, the participants in our study were restricted to those who lived in the USA at the time of the study. As such, the results in this article cannot be generalized to other countries around the world. However, since the methodology is not U.S.-specific, replicating the study with participants from additional countries should be straightforward.
One limitation of this investigation is that its results cannot be generalized to AI-powered systems outside of productivity software. This suggests the need to investigate HAI-UX impacts on diverse problem-solvers across a spectrum of domains, from low-stakes domains (e.g., music recommender systems) to high-stakes domains (e.g., automated healthcare or autonomous vehicles).
Threats and limitations like these can only be addressed through additional studies across a spectrum of empirical methods and situations, to isolate different independent variables of study and establish generality of findings across different AI applications, measurements, and populations.