1 Introduction
Suppose the owner of an artificial intelligence (AI) product named “G3” ran a before/after user study to find out whether potential customers had better user experiences (UX) with G3’s new version than with an older, not very good version of G3, and that the study results came out like
Figure 1.
The product owner should be somewhat pleased—the product UX clearly improved. Of the 13 measured UX outcome variables (y-axis), 11 were positive. In fact, the asterisks indicate that these 11 differences between the old and new versions of G3 were statistically significant. Still, the effect sizes were mainly small (yellow bars), with only two moderate effect sizes (blue bars).
What the product owner would now like to know is: who was included in those positive effects, and who was left out? What further changes are needed to enable G3 to better support more of the potential customers?
These kinds of questions are human-AI interaction (HAI) questions about the UX quality that AI products offer their customers. In this article, we abbreviate the concept of HAI user experiences as HAI-UX.
One method that the HAI community currently uses to improve AI products’ UX is to develop and apply guidelines for HAI. At least three major companies—Apple, Google, and Microsoft—have each proposed guidelines, providing high-level advice for how to improve HAI, such as “Consider offering multiple options when requesting explicit feedback” [
64], “Let users give feedback” [
47], and “Support efficient correction” [
6]. In fact, G3’s product owner followed a guideline from the Microsoft set, and doing so improved the product (
Figure 1).
In this article, we consider how to measure beyond just whether products like G3 improve their HAI-UX. We investigate how to know who, of all the diverse humans who could be using products like G3, is included in our HAI-UX improvements and who has been left out. Applying the concept of inclusivity to human interactions with AI products, we will say that AI product A is more inclusive to some particular group of people than product B is, if product A provides those particular people with measurably better UX outcomes than product B does.
The primary groups of interest in this article are those who are diverse in the ways they go about
problem-solving. We use the term
problem-solving to mean any situation in which people are engaged in solving a problem, such as deciding whether and how to accept or reject an AI’s recommendations. We consider participants’ problem-solving diversity via the set of five problem-solving style spectra from the inclusive design method known as Gender Inclusiveness Magnifier (GenderMag) [
19]. We use the term
problem-solving styles to refer to the approaches individuals take when trying to solve a problem. GenderMag’s five problem-solving style spectra capture people’s diverse attitudes toward risk, levels of computer self-efficacy, motivations for using technology, information processing styles, and styles of learning technology.
For example, “risk-averse” is one endpoint of the risk attitude spectrum. Applying risk-aversion to technology, risk-averse users may be hesitant to invoke a new feature for fear that it may have undesirable side-effects (e.g., privacy), may waste their time, may not be worth learning about, and so forth. At the other end of the spectrum, “risk-tolerant” users may be more willing to take such risks, even if the feature has not been proven to work yet and requires additional time to understand [
53,
98,
125].
In this investigation, we consider how the UX of people with diverse problem-solving styles was impacted by design differences in AI-powered systems like G3 above. Specifically, we gathered 1,016 participants’ five GenderMag problem-solving styles and investigated inclusivity differences in 16 pairs of AI products. Each pair of AI products had a controlled difference: one AI product applied an HAI guideline from the Amershi et al. guidelines set [
6], and its counterpart violated that guideline. All AI products were productivity software (e.g., Microsoft PowerPoint, etc.) that had added AI features. An earlier investigation on the same data, reported in Li et al. [
83], investigated the “whether” questions of these data, i.e., whether HAI-UX differences occurred between each pair of AI products. That investigation found that participants’ HAI-UX outcomes were generally better when the guidelines were applied;
Figure 1 is in fact one example of their findings. Our investigation instead focuses on “who” questions, i.e., who was included (and who was not) in the HAI-UX outcome changes, from the perspective of participants’ diverse problem-solving styles.
To show how analyzing HAI-UX data by these five problem-solving styles can reveal actionable insights into how to improve an AI product’s inclusivity, this article presents a detailed analysis of one of these problem-solving style types, namely attitudes toward risk. However, space constraints prevent providing detailed analyses for all five of these problem-solving style types, so this article summarizes the remaining four problem-solving style types’ results with an eye toward generality; we also provide detailed analyses of all five problem-solving style types in the Appendices. We selected attitudes toward risk as the problem-solving style type to present in detail because of the preponderance of recent research literature and popular perception focusing on risks with AI, such as risks of inaccuracies, of privacy loss, of excessive or insufficient trust, of job loss, and more [e.g.,
31,
35,
45,
57,
61,
62,
72,
118]. We investigate the following:
RQ1-Risk: When the HAI guidelines are violated vs. applied to AI products, how inclusive are the resulting AI products to users with diverse attitudes toward risk?
RQ2-AllStyles: How inclusive are such products to users with diverse values of GenderMag’s other four problem-solving styles?
We also investigate the relationship between participants’ problem-solving style diversity and their demographic diversity. Our reason for relating problem-solving diversity with demographic diversity is that knowing the demographic disparities in who a product serves may not lead to actionable ways to address those disparities; for example, if one gender is left out of high-quality UX with an AI product, how should that be fixed? In contrast, problem-solving style disparities often do suggest actionable ways forward; for example, if risk-averse participants are left out of high-quality UX, perhaps the product should be clearer about the risks of using it (e.g., its privacy impacts):
RQ3-Demographic Diversity: How does AI product users’ problem-solving diversity align with their demographic diversity?
Thus, the new contributions of our research are as follows:
—
Measuring HAI-UX inclusivity: Presents an approach to measure inclusivity of an AI product’s UX.
—
Risk-inclusivity in HAI-UX: Uses the approach to reveal which of the participants with diverse attitudes toward risk are well-supported by a set of 16 AI products and which are not.
—
Beyond risk-inclusivity in HAI-UX: Generalizes the above results to the other four GenderMag problem-solving style spectra.
—
Actionable inclusivity in HAI-UX: Reveals whether the above results suggest actionable steps an HAI practitioner can take to make an AI product more inclusive.
—
Problem-solving diversity and demographic diversity: Reveals relationships between participants’ problem-solving styles and their intersectional gender-and-age demographic diversity, to enable HAI practitioners to bring actionable results from problem-solving diversity investigations to bear on demographic disparities.
—
Implications for practitioners: Suggests concrete ways HAI practitioners can use the approach on their own products, as well as starting points for developing new criteria, guidelines, and/or onboarding processes for designing their own AI products to be more inclusive of and equitable to diverse customers.
3 Methodology
To investigate our research questions, we performed 18 independent experiments, one for each of Amershi et al.’s 18 HAI guidelines [
6] (listed earlier in
Section 2.1). We used these experiments to perform two investigations. Investigation One, reported in Li et al. [
83], investigated the impacts of violating/applying these guidelines. Investigation Two, which is the one we report in this article, investigated potential disparities in the UX of participants with diverse problem-solving style values. Our investigation used GenderMag’s five problem-solving style spectra—the spectrum of participants’ attitudes toward risk, of their computer self-efficacy, of their motivations, of their information processing styles, and of their learning styles (by process vs. by tinkering).
To answer these research questions, we generated the following statistical hypotheses before data collection. For any dependent variable and any of the five problem-solving styles, our statistical hypotheses between applications (app) and violations (vio) of any guideline were as follows:
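Stated schematically (a reconstruction consistent with the paired comparisons described in Section 3.4; the subscripting is ours), for a given dependent variable, problem-solving style group, and guideline:
\[
H_{0}\colon \mu_{\mathit{app}} = \mu_{\mathit{vio}} \qquad\text{vs.}\qquad H_{1}\colon \mu_{\mathit{app}} \neq \mu_{\mathit{vio}},
\]
where \(\mu_{\mathit{app}}\) and \(\mu_{\mathit{vio}}\) denote that group’s mean response to the dependent variable for the guideline-application and guideline-violation AI products, respectively.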
3.1 Study Design
The experiments’ context was productivity software, such as document editors, slide editors, search engines, e-mail applications, and spreadsheet applications. Each experiment was a 2 \(\times\) 2 factorial experiment, where each factor had two levels.
The first factor, the “guideline adherence” factor, was within-subjects, and the factor’s levels were “guideline violation” and “guideline application.” For any one guideline’s experiment, the “guideline violation” condition violated that particular HAI guideline; for example, in Guideline 1’s experiment (make clear what the system can do), the “guideline violation” did not make clear what the system can do. Similarly, in Guideline 11’s experiment (make clear why the system did what it did), the “guideline violation” did not make clear why the system did what it did. In contrast, the “guideline application” level applied each guideline; for example, in Guideline 1’s experiment (make clear what the system can do), the “guideline application” condition added clarifying information about what the system can do.
The second factor, the “AI performance” factor, was between-subjects. This factor’s levels were “AI optimal” and “AI sub-optimal.” In the “AI optimal” level, the AI sometimes made mistakes but worked well most of the time, whereas in the “AI sub-optimal” level, the AI sometimes made mistakes and sometimes worked well.
In each experiment, both the product that violated the guideline and the product that applied it were represented by vignettes, as in several other works in HAI [
1,
32,
81,
85,
91]. The vignettes were developed in two phases: in the first phase, two researchers went through an iterative brainstorming process, where they independently thought about how the 18 guidelines might show up in productivity software, drafting between 5 and 8 interaction scenarios for each guideline. Then, the researchers collaborated to review, rewrite, and sometimes replace the scenarios. In the second phase, the researchers adhered to Auspurg et al.’s [
10] best practices to make the vignettes simple, clear, and realistic. In cases where text alone could not convey the interaction clearly, images were added to promote understandability. Before deploying the study, each vignette went through two rounds of piloting. In the first round, each vignette received feedback from seven Human-Computer Interaction researchers not familiar with the project, and changes were made based on that feedback. In the second round, we piloted the updated vignettes on Amazon Mechanical Turk (MTurk) with five participants per vignette; no issues were identified from this second pilot. Each of the final vignettes was composed of three parts: (1) a product/feature introduction; (2) a description of what the AI feature did; and (3) a summary of how well the AI performed.
Figure 3 provides an example of the two vignettes from the experiment for Guideline 1 (“Make clear what the system can do”). In the first part, the only difference between the two conditions’ vignettes was the product name (Ione vs. Kelso), generic names given to the products to distinguish them from each other and to avoid the influence of prior familiarity with a real product. The second part manipulates the “guideline adherence” factor. In
Figure 3(a), part 2 states:
“We will help you improve your presentation style”
without giving specific examples or details, thus violating the guideline by
not making clear what the system can do. In contrast,
Figure 3(b)’s part 2 applies the guideline to make clear exactly what the system can do, stating:
“As you practice your presentation, we will give you feedback about your presentation style: how fast you speak, use of filler words (such as ‘um’ and ‘like’), use of inappropriate words (such as ‘damn’).”
We will refer to the vignette that violated the guideline as the Violation AI product. Similarly, we will refer to the vignette that applied the guideline as the Application AI product.
Table 2 lists the questions that the participants responded to for both the
Violation AI product and the
Application AI product. These dependent variables gather information about different dimensions of participants’ UX. The first five questions (control, secure, inadequate, uncertain, and productive) follow Benedek and Miner’s [
14] approach of measuring end users’ feelings in UX. Perceived usefulness was taken from Reichheld and Markey [
102] and has been known to relate to acceptance and use of AI-infused systems. The last four questions (suspicious, harmful, reliable, and trust) came directly from Jian et al. [
66], who focused on scales for trust in automated systems. Each question was answered on a 7-point scale ranging from “extremely unlikely” (encoded as 1) to “extremely likely” (encoded as 7).
3.2 Participants and Procedures
A total of 1,300 participants were recruited from Amazon MTurk, a popular crowdsourcing platform. To ensure quality data, participants had to meet certain performance criteria on MTurk before they could participate in the study, such as having at least 100 approved human intelligence tasks (HITs) and having above a 95% acceptance rate on the platform. Additionally, participants had to be located in the USA and be at least 18 years old. After workers accepted the HIT, they were presented with an Institutional Review Board consent form and then answered three screening questions. The first two asked about their familiarity with productivity software and the last confirmed that they met the minimum age requirement. Upon completion of the screening survey, participants were provided $0.20.
Once participants had completed the screening survey, they were randomly assigned to one (and only one) of the 18 experiments, one for each guideline. First, participants saw, at random, either the
Violation AI product or
Application AI product, such as the example provided in
Figure 3. Participants then responded to the UX questions shown in
Table 2, asked in a random order. Once participants completed their responses for the first AI product, they saw the second product and answered the same UX questions in another random order.
Once participants had seen both products and answered the UX questions for each, they were asked to select which product they preferred and explain why they preferred it. As detailed by Li et al. [
83], one of the authors read the open-ended answers provided in each factorial survey repeatedly, until codes began to emerge. Then, the codes were recorded and each comment was coded. Other team members conducted spot checks to verify the qualitative coding. Participants were then asked two manipulation check questions,
one closed- and one open-ended. The closed-ended manipulation check asked participants whether or not they agreed with text that mirrored the guidelines themselves (e.g., “make clear what the system can do,” “make clear why the system did what it did”). For example, if participants in Guideline 1’s experiment agreed that the
Application AI product made clear what the system can do, and they disagreed with the statement for the
Violation AI product, then they passed the manipulation check. The open-ended manipulation check asked participants to “…briefly describe the differences between Kelso and Ione” (the fictitious names randomly assigned to the
Violation AI product and
Application AI product). The open-ended answers were qualitatively coded to check whether or not each participant had successfully perceived the experimental manipulation.
Participants then filled out a questionnaire with their demographic data, including their age, self-identified gender, race, highest education level, and field of employment.
They also filled out the problem-solving style questionnaire (
Section 3.4) and were paid a bonus of $5 for completing the experiment.
3.3 Investigation One Results Summary
As mentioned in
Section 1, Investigation One, reported in Li et al. [
83], compared UX outcomes of AI products that had applied the guidelines against AI products that had not. That investigation’s measures were generalized eta-squared (
\(\eta^{2}\)) effect sizes for each of the dependent variables in each of the experiments.
The primary takeaway from Investigation One was that, for most of the guidelines, participants perceived the
Application AI products as more useful and as providing better UX than the
Violation AI products.
Figure 4 shows thumbnails of their results for each guideline’s experiment. The more color-filled each thumbnail, the larger the positive effect sizes were for that guideline’s experiment. For example, G3’s thumbnail shows significant differences with small or medium effect sizes on most of the HAI-UX aspects measured. G6’s experiment produced particularly strong results. Its thumbnail is almost filled with color, indicating that G6’s experiment produced significant differences on all HAI-UX measures, with medium or large effect sizes for all but one.
In addition, Investigation One’s analysis informed two aspects of Investigation Two’s analysis. First, Investigation One’s analysis revealed that 2 of the 18 experiments failed the manipulation checks (
Section 3.2)—the experiments for Guidelines 2 and 16—and as such were dropped from Investigation One. Thus, our investigation also drops those two experiments, which leaves a total of 1,043 participants across the remaining 16 experiments. Second, Investigation One’s analysis of these remaining 16 experiments revealed that the AI optimality factor (
Section 3.1) was significant in only one of these experiments. This resulted in Investigation One dropping this experimental factor, and we do the same for Investigation Two.
3.4 Investigation Two (Current Investigation) Data Analysis
This article’s investigation analyzes the same independent experiments’ data from a new perspective: the inclusivity that the violation vs. application AI products afforded diverse participants. Specifically, we consider diversity in terms of participants’ diverse problem-solving styles (RQ1-Risk and RQ2-AllStyles) and their diverse gender/age demographics (RQ3-Demographic Diversity).
To collect demographics, we used a questionnaire asking participants their gender identity and age group. To collect participants’ diverse problem-solving styles, we used the GenderMag facets survey [
55], a validated survey that measures participants’ values of the five GenderMag problem-solving style types enumerated earlier in
Section 2.1, termed “facets” in GenderMag publications. Each problem-solving style type has multiple Likert-style questions that run from
Disagree Completely (encoded as a 1) to
Agree Completely (encoded as a 9), a few examples of which are shown in
Table 3. For example, using this instrument, if one participant answers the first question (top row) closer to
Agree Completely than a second participant, the first participant is considered to be more risk-averse than the second participant. Of the 1,043 participants, 27 failed at least one attention check in the problem-solving style survey, leaving 1,016 participants for this investigation. Appendix
A lists the full questionnaire, including the attention checks.
The GenderMag survey has previously been validated in multiple ways. Hamid et al. [
55] summarize the six-step validation process; among the steps were literature searches, multiple statistical analyses, demographic validation, and problem-solving style validation. Particularly relevant to this article was Guizani et al.’s [
53] participant validation of the problem-solving styles the survey captures. In that study, participants took the survey, then spoke aloud throughout problem-solving tasks. Participants’ in-the-moment verbalizations when problem-solving validated their own questionnaire responses 78% of the time, a reasonably good measure of consistency [
48].
To score a participant’s problem-solving style values, we summed up that participant’s responses to the risk questions, then the self-efficacy questions, and so on. Each sum is the participant’s “score” for that problem-solving style. Comparing these scores reveals a participant’s placement in that problem-solving style type compared to others in the same peer group, such as among computer science professors, or among residents of eldercare facilities, and so forth; in our case, the peer group is the adult productivity software users who participated in the study.
These scores formed 16 distributions, one for each experiment (e.g., see
Figure 5 for the risk score distributions). Using each experiment’s median,
which is robust against outliers, we then defined participants as being either more risk-averse than their peers (i.e., above the median) or more risk-tolerant, and similarly for the other four problem-solving styles.
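To make the scoring and median-split procedure concrete, here is a minimal sketch in Python/pandas. The column names (participant_id, experiment, and the per-item columns) are hypothetical placeholders; the actual survey items and attention checks are those listed in Appendix A.

```python
import pandas as pd

# Hypothetical item-to-facet mapping; the real items appear in Appendix A.
FACET_ITEMS = {
    "risk": ["risk_q1", "risk_q2", "risk_q3"],
    "self_efficacy": ["se_q1", "se_q2", "se_q3"],
    # ...motivations, information processing, and learning style items...
}

def score_and_split(responses: pd.DataFrame) -> pd.DataFrame:
    """Sum each participant's 1-9 Likert answers per problem-solving style,
    then compare each score to the median of that participant's experiment."""
    out = responses[["participant_id", "experiment"]].copy()
    for facet, items in FACET_ITEMS.items():
        score = responses[items].sum(axis=1)                          # the facet "score"
        median = score.groupby(responses["experiment"]).transform("median")
        out[f"{facet}_score"] = score
        # For the risk facet, scores above the experiment's median were labeled
        # risk-averse; which endpoint an above-median score maps to depends on
        # the facet's item wording.
        out[f"{facet}_above_median"] = score > median
    return out
```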
To analyze the dependent variables for each of the 16 experiments, we used
t-tests after ensuring that the assumptions held, as follows. To investigate inclusivity (
Section 4), we compared within-subjects using paired
t-tests, treating each
Violation AI product as a “before” and
Application AI product as an “after.” As
Table 4 shows, each of the 16 experiments had over 30 participants, suggesting approximate normality of each experiment’s sampling distribution by the Central Limit Theorem.
In addition, in cases where the sample size fell beneath 30, we used Shapiro-Wilk tests to validate that the underlying reference distribution was not significantly different from normal (i.e.,
\(p\geq 0.05\)).
Satisfying these assumptions indicated that the above
t-tests were appropriate analysis techniques for these data. Each of the 16 experiments was designed with pre-planned hypotheses for each dependent variable, so we do not report statistical corrections in the body of this article. As other researchers [
8,
99] point out, statistical corrections (e.g., Bonferroni, Holm Bonferroni, Benjamini-Hochberg) are necessary only if “…a large number of tests are carried out without pre-planned hypotheses” [
8,
9]. Still, we recognize that not all readers may agree with this choice, so we also show all the Holm-Bonferroni corrections [
58] in Appendix
D.
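As an illustration of this pipeline (not the authors’ exact code), the per-group comparison for one dependent variable could be run with SciPy and statsmodels roughly as follows; the effect-size convention in the last line of the function is one common choice for paired data.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def paired_comparison(vio_scores, app_scores):
    """Paired t-test treating the Violation product as 'before' and the
    Application product as 'after' for one participant group."""
    vio = np.asarray(vio_scores, dtype=float)
    app = np.asarray(app_scores, dtype=float)
    diffs = app - vio
    shapiro_p = None
    if len(diffs) < 30:
        # Check that the paired differences are not significantly non-normal
        # (want p >= 0.05) before relying on the t-test.
        _, shapiro_p = stats.shapiro(diffs)
    t, p = stats.ttest_rel(app, vio)
    d = diffs.mean() / diffs.std(ddof=1)   # paired-samples Cohen's d (one common convention)
    return {"t": t, "p": p, "d": d, "shapiro_p": shapiro_p}

# Holm-Bonferroni adjustment across a family of p-values (cf. Appendix D):
# reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
```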
4 Results: What Participants’ Risk Styles Revealed
RQ1-Risk considers the 16 pairs of AI products described in
Section 3—one violating a guideline and its counterpart applying that guideline—and how the two differed in their inclusivity of risk-diverse participants’ HAI experiences. (We will generalize beyond risk in
Section 5.)
In this article, we measure whether/how applying a guideline to an AI product
changed the product’s inclusivity toward some particular group of participants. For any UX dependent variable in an AI product, we will say the HAI-UX is
more (less) inclusive to a group of participants if the Application AI product’s result for that variable is
significantly higher (lower) than the Violation AI product’s result for
those participants.
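In our notation (a schematic restatement of this definition, not the article’s original formalism): for a participant group \(g\) and UX variable \(y\), let
\[
\Delta_{g}(y) \;=\; \bar{y}^{\,g}_{\mathit{app}} - \bar{y}^{\,g}_{\mathit{vio}} .
\]
The Application AI product is counted as more inclusive to \(g\) on \(y\) when \(\Delta_{g}(y)>0\) and the paired test of Section 3.4 is significant, less inclusive when \(\Delta_{g}(y)<0\) and the test is significant, and as producing no inclusivity change otherwise.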
To answer this question, we performed an in-depth analysis of all HAI-UX measurements’ inclusivity by considering participants’ attitudes toward risk. HAI-UX inclusivity could change in only four possible ways: (1) inclusivity changes for both the risk-averse and risk-tolerant participants, (2) inclusivity changes for neither of them, (3) inclusivity changes for the risk-averse only, and (4) inclusivity changes for the risk-tolerant only. As
Table 5 shows, instances of all of these categories occurred.
Table 5 also shows that the risk results fell mainly in categories (1) and (2) above. Perhaps most important, the table shows that whenever applying a guideline produced a change in inclusivity, it was almost always a
positive change for at least some risk-attitude group of participants—without loss of inclusivity for the other group.
4.1 When Everybody Gained: More Inclusivity for Both the Risk-Averse and the Risk-Tolerant
At first glance, it may appear that the “when everybody gained” category of results is a natural consequence of the overall success rates shown by Investigation One. For example, many of the experiments that produced strong positive results in Investigation One also did so in Investigation Two, as with G6–G9. Still, even the G6–G9 experiments reveal relationships between outcomes and perceptions of risk that shed new light on the whys of these results.
For example, consider Guideline 8. In this experiment, the AI-powered feature was a design helper to automatically provide design suggestions for alternative layouts in a presentation application. In the Violation AI product’s vignette, the feature’s behavior was: “You are working on a slide and Design Helper pops up, showing you some design suggestions. You do not need any design help at this time, but there is no way to hide the design suggestions.” In contrast, the Application AI product’s vignette started out the same, but its last sentence was: “You do not need any design help at this time, so you click on a button visible on screen to hide the design suggestions.”
The second row of
Table 6 shows one of Guideline 8’s outcomes,
with the
risk-averse participants’ suspicions of the
Violation AI product (
left hatched boxplot) significantly worse than their suspicions of the
Application AI product (
left clear boxplot).
(
t(25) = 3.2354,
p = 0.003, and
d = 0.648).
Likewise, the
risk-tolerant participants also were significantly less suspicious of the
Application AI product than of the
Violation AI product (
t(34) = 3.0020,
p = 0.005, and
d = 0.507), as shown in the
right boxplots.
Yet, despite their agreement on these outcomes, participants’ free-text remarks showed that their reasoning differed with their attitudes toward risk. The risk facet is nuanced—it includes aversion/tolerance toward risks to privacy/security, of producing low-quality work, of wasting too much time, of having trouble with the product, and so forth. In Guideline 8’s experiment, about a fourth (14/61) of the participants’ comments focused on the second of these, the risk of low-quality work.
This focus on the risk of low-quality work was especially true of the risk-averse participants: 31% (8/26) of this experiment’s risk-averse participants wrote about preferring the increased control they had over their work quality with the Application AI product.
G08-1921-risk-averse: “…very convenient and still make me feel very much in control of my choices.”
G08-3619-risk-averse: “I don’t trust [Application AI product]…, but the fact I can turn the feature off lets me be in more control.”
Even the more risk-tolerant were worried about this type of risk, and 17% (6/35) of these participants expressed the same sentiments. However, for these more risk-tolerant participants, annoyance and frustration also figured prominently in their reasoning (26%: 9/35), compared to only one risk-averse participant expressing this sentiment.
G08-2831-risk-tolerant: “Because I can get rid of the content that might… influence me to do something stupid. If I am going to do something stupid it will be my idea.”
G08-3681-risk-tolerant: “…[Application AI product] would allow me more freedom, and be less annoying with its suggestions, even when they are wrong.”
G08-2627-risk-tolerant: “Without an option to turn off an unnecessary feature, I would be extremely frustrated…as it would be a severe distraction… never would I use [Violation AI product]…”
Comments like these, when coupled with the
risk-averse and
risk-tolerant participants’ feeling both significantly more in control and less suspicious of the
Application AI product, suggest relationships between an expectation of risk and four particular HAI-UX inclusivity outcomes. As
Table 5 shows, across all seven experiments where the
Application AI product gained inclusivity for both the risk-averse and risk-tolerant participants’ (not)-suspicious outcome (row 7), it also gained inclusivity for their certainty (row 4), control (row 1), and trust outcomes (row 10).
This result provides insight into why the five experiments that gained the most inclusivity across the risk spectrum participants—G6, G7, G8, G9, and G15—did as well as they did. What these five
guidelines have in common is that they all give users more control over the product. What their five
experiments have in common (from
Table 5) is that, in all of them, the
Application AI product gained inclusivity in all four of the above variables:
4.2 When Nobody Gained: No Inclusivity Improvements for Either Risk Group
Not all the results were as positive for diversity. Some guideline applications did not change inclusivity outcomes for either of the two groups, measured in this article as no significant difference in HAI-UX outcomes for either the risk-averse or the risk-tolerant participants between the Violation AI product and the Application AI product. This was the second-most prevalent category, occurring 44 times across 10 experiments.
Consider Guideline 4’s (“show contextually relevant information”) results; examples are in
Table 7. Guideline 4’s experiment produced nine instances of the “nobody gained” category. In that experiment, the application was a document editor, and the AI-powered feature was an acronym explainer. The
Violation AI product violated the guideline:
“When you highlight an acronym to see what it stands for, [Violation] shows you a standard list of possible definitions taken from a popular acronym dictionary.” In contrast, the
Application AI product:
“When you highlight an acronym to see what it stands for, [Application] shows you definitions that are used in your workplace and pertain to the topic of the current document.”
In some ways, the participants’ reasoning for their unchanging responses to the Violation AI product vs. the Application AI product echoed those of the previous subsection, namely wanting to avoid the risk of low-quality work. As in the previous section, this reasoning was especially common among the risk-averse participants (34% = 10/29), although 22% (8/36) of the risk-tolerant also used it. However, whereas in the previous section participants cited this risk as a reason to prefer the Application AI product, in this section they cited it as a liability of the Application AI product.
G4-4098-risk-averse: “[Violation AI product] may make mistakes… but its use of a generic dictionary makes it easier to recognize mistakes… With [Application AI product], I would be more likely to miss mistakes.”
G4-3799-risk-averse: “…if [Application AI product] were to make a mistake on me, I would have a hard time trusting it because I did not make any part of the decision.”
Guideline 4 also raised privacy concerns among some participants:
G4-3905-risk-averse: “… I would be nervous that [Application AI product] is pulling data from things like my other software and my browsing history.”
G4-3947-risk-tolerant: “I don’t like the idea of [Application AI product] taking definitions from my workplace. It makes me worry I’m being listened to…”
In the “everybody gained” category (previous section), the five most inclusive guidelines across the risk spectrum revealed a relationship between risk-inclusivity and the trust, control, certainty, and (not)-suspicious outcomes. The five
least inclusive guidelines as per
Table 5—G1, G4, G5, G12, and G13—show that the relationship persisted in the “nobody gained” category. None of G1, G4, G5, or G12 produced any inclusivity gains for any of these interrelated variables, and G13 showed only two such gains. These results not only confirm
Result #2, but also provide a complement to
Result #3:
4.3 Selective Inclusivity: Who Gained, Who Did Not, and Why?
The third and fourth categories of HAI-UX inclusivity changes were inclusivity gains for the
risk-averse participants only or the
risk-tolerant participants only. Neither category was very large, with 13 and 27 total instances, respectively.
Table 8 shows a few examples.
Despite the relatively small totals, the fourth category, that of bringing gains to the
risk-tolerant only, reveals a unique pattern shared by three experiments—Guideline 3’s, Guideline 13’s, and Guideline 18’s. As a column-wise reading of
Table 5 shows, in these three experiments, inclusivity gains for the
risk-tolerant participants abounded, but the
risk-averse participants rarely gained.
Guideline 3’s experiment offers a case in point. In that experiment, the Application AI product provides services only when it decides the user’s current task/environment would benefit. The Application AI product’s vignette applied this guideline by stopping e-mail notifications “…when you are busy.”
As
Table 5 shows, for Guideline 3’s
risk-tolerant participants, the
Application AI product showed inclusivity gains on every dependent variable except one. However, only four of these gains extended to the
risk-averse participants.
Why such differences? For Guideline 3’s experiment, the risk-averse participants’ concerns about risks to their work or their privacy appeared to outweigh the benefits of fewer notifications, whereas for the risk-tolerant, the balance seemed to tip the other way. For example, 26% (8/31) of the risk-averse participants explicitly brought up concerns about these risks, but only 11% (4/37) of the risk-tolerant did.
G3-3504-risk-averse: “…Also, I wouldn’t be sure that [Application AI product] would be able to accurately qualify my activities.”
G3-3054-risk-averse: “[Application AI product]… would have to be able to monitor your online activity…that would be a little invasion of privacy…”
A risk-oriented commonality among these three experiments lies in what these AI products were actually doing. All three of these Application AI products learn from the user’s own data, as opposed to learning mainly from huge datasets mostly consisting of other people’s data. Specifically, both Guideline 3’s and Guideline 13’s AI products involved learning from the user’s own context and behaviors, and Guideline 18’s AI product involved learning from that user’s e-mails. In the latter case, the AI product also moved that user’s e-mails around, adding the risk that the user might not find some of their e-mails later.
Two other Application AI products, Guideline 4’s and Guideline 12’s, also had this attribute. Guideline 4 involved learning from the user’s contexts, and Guideline 12 learned from the user’s recent interactions. Most of these two products’ outcomes were in the “nobody gained” category: the risk-averse were very uncomfortable with these products, and even the risk-tolerant saw too many risks (e.g., recall the discussion of Guideline 4’s experiment in the previous subsection).
These five guidelines’
Application AI products were the only ones with this attribute. For these five experiments, the
risk-averse participants hardly ever experienced any inclusivity gains (from
Table 5).
4.4 The Risk Results and Actionability
The AI products in these experiments were designed to isolate the effects of applying vs. violating each guideline. However, if they were real products for sale, the products’ owners would probably hope to make each product well received by as many of its customers as possible.
The risk results provide actionable ideas for such product owners, for seven of these AI products.
Figure 6 points out which products those were.
The green boxes in the figure mark those products: they are G1’s, G3’s, G4’s, G5’s, G12’s, G13’s, and G18’s. For example,
Section 4.1 revealed how lack of user control affected some of the low-performing AI products (e.g., G1’s, G4’s, G5’s, G12’s, and G13’s). An actionable implication is that these products might improve by offering users more control. As another example,
Section 4.3 revealed the sensitivity the
risk-averse participants had to products that potentially did “too much” with
their data (e.g., G3’s, G4’s, G12’s, G13’s, and G18’s products). One actionable idea for those products would be to provide information on what else the user’s personal data are used for and how long these data are stored. More generally, results like these suggest that a way to improve AI products favored by only the
risk-averse participants or only the
risk-tolerant participants is to attend to risk-oriented attributes of the product that were not tolerated well by participants at that end of the risk spectrum.
5 Results: Beyond Risk—the Other Four Problem-Solving Styles
Section 4 considered only one type of problem-solving diversity, namely participants’ diverse attitudes toward risk. We now turn to
RQ2-AllStyles, which asks “How inclusive are such products to users with diverse values of GenderMag’s other four problem-solving styles?” Although space does not permit an in-depth analysis of each remaining problem-solving style—motivations, learning style, computer self-efficacy, and information processing style—we summarize in this section whether and how the risk results of
Section 4 generalize to analogous results. If they do, we also consider whether those new results add anything to our understanding of the UX the AI products offer to diverse problem-solvers. Full analyses for all of these styles are in the Appendices.
Recall from
Table 1 that GenderMag uses two personas, “Abi” and “Tim,” to identify the distinguished endpoints of each of GenderMag’s five problem-solving style types. As per the table’s definitions, we classify participants as more
“Abi”-like if they had any of the following problem-solving style values: more
risk-averse,
lower computer self-efficacy,
task-oriented motivations for using technology, a comprehensive information processing style, or a process-oriented style of learning technology. Participants nearer the opposite endpoint of these problem-solving spectra are classified as more
“Tim”-like; i.e., more
risk-tolerant, had
higher computer self-efficacy, had
tech-oriented motivations, had a more
selective information processing style, or learned more by
tinkering. As in other persona research [
3], we use these persona names for two reasons: (1) to provide a vocabulary for an associated collection of traits (recall
Section 2), and (2) to emphasize which of the “distinguished endpoints” of each problem-solving style type tend to co-occur, helping to keep clear which problem-solving value belongs to the underserved population.
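For reference when reading Tables 5 and 9–12, the endpoint labeling can be summarized as a simple lookup table; the key names below are our own shorthand, and the endpoint assignments follow Table 1.

```python
# "Abi"-like vs. "Tim"-like endpoints for each GenderMag problem-solving style,
# per Table 1's definitions (key names are ours).
FACET_ENDPOINTS = {
    "risk_attitude":          {"Abi": "risk-averse",         "Tim": "risk-tolerant"},
    "computer_self_efficacy": {"Abi": "lower self-efficacy", "Tim": "higher self-efficacy"},
    "motivations":            {"Abi": "task-oriented",       "Tim": "tech-oriented"},
    "information_processing": {"Abi": "comprehensive",       "Tim": "selective"},
    "learning_style":         {"Abi": "process-oriented",    "Tim": "by tinkering"},
}
```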
As the upcoming
Tables 9–
12, and earlier
Table 5 show, the first result of these analyses was very good news for most of the
Application AI products. As was also true of the risk diversity results, whenever an
Application AI product produced a change in inclusivity, it was almost always a
positive change for at least some group of participants—without loss of inclusivity for the other group. For example, as
Table 9 shows, whenever applying a guideline produced a gain for either the task-motivated group or the tech-motivated group, it almost never produced an inclusivity loss for the other group, with only one exception.
One result from
RQ2-AllStyles was
who was advantaged across these 16 product pairs. Note that the GenderMag assignment of endpoints to “Abi” vs. “Tim” followed widespread statistical skews of genders toward these particular styles [
20]; previous research has shown that “Abi” styles have statistical tendencies to cluster, and so do the “Tim” styles. Thus, one might expect color patterns in risk’s results (
Table 5) to be similar to the columnar color patterns in, say, the motivations results (
Table 9).
However, this sometimes did not happen. For example,
Table 5 visually contained twice as many
T cells for the
risk-tolerant participants as there were
A cells for the
risk-averse participants (27 vs. 13 respectively). But Tables
9–
12 show that who gained more inclusivity advantages depended on which problem-solving style was considered. For example,
Table 9 reverses
Table 5’s trend, with more
A cells for the task-oriented participants than
T cells for the tech-oriented.
This trend sometimes occurred even within individual experiments, as demonstrated in
Figure 7 for Guideline 13’s experiment. Guideline 13’s product was a presentation app, and the AI feature was a design helper that recommended designs for alternative layouts. When participants saw the
Violation AI product, they were told that “
…Violation AI product has not learned your preferences and blue designs appear in the same place among the suggested designs as the first time you used it,” whereas the
Application AI product “
…has learned your preferences and now features blue designs prominently.” Considering participants’ attitudes toward risk (first column), the
risk-tolerant participants derived the most benefit from the
Application AI product. However, the second column shows these same participants’ data but instead considering their motivations. This adds to the risk results; the
Application AI product benefited both the
risk-tolerant and also those with
task-oriented motivations.
For HAI practitioners, results like these suggest that different design decisions can appeal to different problem-solving styles for different reasons. For example, 46% (13/28) of the participants with task-oriented motivations mentioned how efficient they would become with the Application AI product or how much time they would save while using it:
G13-2178-task-oriented: “I like to have software that anticipates my needs, because it makes working more efficient.”
G13-4099-task-oriented: “It is more efficient to see designs similar to those I have used before…it will take me less time to find them.”
G13-2740-task-oriented: “It [the Application AI product] learned my preferences quicker which in time will save me time and trouble.”
Although the participants with tech-oriented “Tim”-like motivations also raised efficiency and time savings, they did so less frequently than their task-oriented peers (only 17%—5/29):
G13-662-tech-oriented: “Because it saves time than starting from scratch every time I use it [the Application AI product].”
G13-662-tech-oriented: “I prefer [Application AI product] because of its ability to learn my preferences…thus helping me to work more efficiently.”
Comments like these also have ties to the research literature. In that body of work, people who are more
task-oriented prefer to use technologies to accomplish their task, using methods they are already familiar and comfortable with [
19,
21,
24,
60,
87,
114]. Task-oriented people do so in an attempt to focus on the tasks that they care about, which might explain why these task-oriented participants commented so frequently on how the
Application AI product saved them time; if the product saves them time, then the task-oriented participants could achieve their task more quickly, devoting more time to what they care about rather than having to spend additional time recreating designs.
As the examples and tables in this section have shown, the types of participants who did and did not benefit from the changes in the Application AI product varied by problem-solving style—attending to only the risk style did not tell the whole story. This suggests that HAI practitioners wanting to create a more inclusive AI-powered product for those with diverse problem-solving styles should consider all five of GenderMag’s problem-solving styles.
7 Discussion
7.1 Inclusivity and Equity: Complements in HAI-UX Fairness
Ideas about fairness in AI, what it is, and how to achieve it have recently received substantial attention [e.g.,
18,
41,
50,
57]. Research and conversations in this area usually refer to algorithmic or data fairness—but the ideas are also relevant to HAI-UX fairness.
In considering any type of fairness, two concepts often drive the discussion—inclusivity and equity. So far, this article has considered inclusivity but not equity.
A way to think about inclusivity in HAI is as an “outcome-oriented” concept that applies within a specific group of people. As shown in earlier sections, when an AI product somehow led to disadvantageous outcomes for some particular group of participants (e.g., risk-averse participants), then that product was not inclusive to that group.
Although this article’s inclusivity results revealed who the guidelines were helping the most and who was being left out, they do not answer how much more inclusivity progress an AI product still needs to make and for whom. Measuring equity can help to answer this question. Like inclusivity, equity in HAI is also an “outcome-oriented” concept—but it applies to between-group comparisons. For example, if an AI product’s UX for two groups (e.g., risk-averse participants and risk-tolerant participants) were of the same high—or low—quality, then the product was equitable.
Ideally, one would like the inclusivity gains an
Application AI product achieved to result in a final outcome that is equitable to the two groups. To explore how useful a measure of equity would be to our investigation’s results, we measured equity of a dependent variable’s outcome for a given product as the absence of a significant difference between the two participant groups.
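Under this working definition, checking equity for one product and one dependent variable reduces to a between-group comparison. The sketch below illustrates the idea; the Welch variant of the t-test and the higher-is-better assumption are our choices for illustration, not necessarily the authors’ exact procedure.

```python
from scipy import stats

def equity_state(averse_scores, tolerant_scores, alpha=0.05):
    """Equity = no significant between-group difference in the outcome for the
    same product. Assumes higher scores are better; reverse-code negatively
    worded items (e.g., 'suspicious', 'harmful') before calling."""
    t, p = stats.ttest_ind(averse_scores, tolerant_scores, equal_var=False)  # Welch's t-test
    if p >= alpha:
        return "="                    # equitable outcome
    return "A" if t > 0 else "T"      # which risk group the outcome favors
```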
Table 14 shows risk-group equity outcomes by this measure, superimposed on the risk-group inclusivity outcomes.
For example,
Table 14’s G15 column shows that the G15
Application AI product achieved inclusivity gains two times for
risk-averse participants (orange cells) only, and three times for
risk-tolerant participants (blue cells) only. The G15 column further shows that those five targeted gains, along with the inclusivity gains experienced by everyone, ultimately led to fully equitable outcomes across the risk spectrum (“=” markings). Thus, applying the G15 guideline ended up targeting exactly who it needed to target to bring everyone up to an equitable state.
In total,
Table 14 shows that the guidelines’ resulting
Application AI products almost always produced equitable outcomes across the risk spectrum. Specifically, 129/160 outcomes (81%) were equitable, marked by “=” in the table. Of the 31/160 outcomes that were
inequitable, only 3/160 (2%) favored the
risk-averse (“A”) and 28/160 (17%) favored the
risk-tolerant (“T”).
Of course, equity does not always mean success. The G15
Application AI product produced entirely equitable HAI-UX (Table 14) but only moderately positive HAI-UX (revisit Figure 6). In contrast, the G1 Application AI product produced mostly equitable but extremely low HAI-UX, and the G6 Application AI product produced entirely equitable and very positive HAI-UX.
Due to space limitations, we do not present equity results in detail for risk or for the other four problem-solving styles. Still, our limited exploration here shows the additional value measuring equity can bring, so we advocate for measuring equity as well as inclusivity as a way to fully understand who is being included vs. who is being left out.
7.2 Practical Implications for HAI Practitioners
Measuring inclusivity and equity can bring practical benefits to HAI practitioners. As our results show, incorporating users’ problem-solving styles into AI products’ HAI-UX work can sometimes point out where and why mismatches arise between a group of users and an AI product. Some particularly actionable examples were given in
Section 4.4. A way to gather participants’ problem-solving styles would be to incorporate the validated survey [
55] we used into user testing.
Armed with this new information, HAI practitioners could gain actionable insights in use-cases like the following:
HAI Practice Use-Case 1: To see which problem-solving groups of users are being left behind on an AI product with a problematic HAI-UX, measure the product’s equity state. To do so, for each problem-solving style, HAI practitioners could test whether UX outcomes differ significantly between the “A” participant group and the “T” participant group.
HAI Practice Use-Case 2: To see who a particular AI product change/new feature has benefited, measure
inclusivity changes. To do so, HAI practitioners could compare “A” participants before the change vs. after the change, and likewise for “T” participants, as in the examples in
Sections 4 and
5.
HAI Practice Use-Case 3: After an AI product has changed, complement a measure of its
equity state (as per Use-Case 1) with a measure of
dependent variable final outcomes (e.g., as in
Figure 6). This combination shows not only final equity state but also how successful the AI product’s HAI-UX is for each group of participants.
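Pulling the three use-cases together, a practitioner’s per-variable summary could reuse the paired_comparison and equity_state helpers sketched in Sections 3.4 and 7.1; the function and argument names below are hypothetical.

```python
def hai_ux_summary(before_A, after_A, before_T, after_T, alpha=0.05):
    """Summarize one dependent variable for one AI product change, where
    'A'/'T' are the two groups for a chosen problem-solving style and the
    before/after arrays are paired ratings from the same participants."""
    incl_A = paired_comparison(before_A, after_A)   # Use-Case 2: inclusivity change for "A"
    incl_T = paired_comparison(before_T, after_T)   # Use-Case 2: inclusivity change for "T"
    equity = equity_state(after_A, after_T, alpha)  # Use-Cases 1 and 3: equity state after the change
    return {
        "A_gained": incl_A["p"] < alpha and incl_A["d"] > 0,
        "T_gained": incl_T["p"] < alpha and incl_T["d"] > 0,
        "equitable_after": equity == "=",
        "favors": None if equity == "=" else equity,
    }
```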
Our results suggest that doing measures like these can provide new, valuable information on who is being included, who is being left out, and how a product can improve.
7.3 Threats to Validity and Limitations
As with every empirical study [
76,
129], our investigation has limitations and threats to validity.
In any study, researchers cannot ask participants every possible question; they must balance research goals with participant fatigue. As such, the dependent variables we analyzed may not have captured all information about people’s reactions. For example, some participants’ free-text remarks suggested outcomes that our Likert-style questionnaires did not cover, such as mentions of privacy concerns while interacting with certain products. Because the study was not designed with a dependent variable about privacy, we cannot be certain whether remarks such as these indicated only isolated cases or more prevalent phenomena.
Another threat was how to handle missing data. Since participants had the option to say “I don’t know” for any of the questions, we had to decide whether to (1) impute the data or (2) drop the “I don’t know” values, costing degrees of freedom in our statistical tests. We chose the latter, because although there are many imputation methods to leverage (e.g., hot-deck, cold-deck, and regression), any inferences are then limited to the imputed data, rather than the original data.
Another threat was how to handle the number of statistical tests we ran. As mentioned in
Section 3.4, we did not report statistically corrected results in this article because every test corresponded to a pre-planned hypothesis [
8,
9]. That said, we recognize that some readers may not agree with this decision, so we also provide all Holm-Bonferroni corrected results in Appendix
D.
Also, we chose to use vignettes rather than a real system. Each approach has its own advantages: using vignettes allows enough control to restrict the experimental variation to the independent variable alone, and this isolation was critical to our statistical power. In contrast, a real system’s strength is realism in the external world, but at the cost of such controls. Because this was a set of controlled experiments, we chose control, leaving external validity questions (faithfulness to real-world conditions) to other studies.
Other threats to validity could arise from the particular pairing of vignette to product and/or from participants associating a vignette with a specific real product with which they had familiarity. We attempted to avert the latter by randomly assigning generic names (Ione and Kelso) instead of real product names, but participants may have still imagined their favorite productivity software. If this occurred, it would contribute an extra source of variation in these data.
Although the productivity software and GenderMag problem-solving styles have been shown to be viable/useful in countries around the world, the participants in our study were restricted to those who lived in the USA at the time of the study. As such, the results in this article cannot be generalized to other countries around the world. However, since the methodology is not U.S.-specific, replicating the study with participants from additional countries should be straightforward.
One limitation of this investigation is that its results cannot be generalized to AI-powered systems outside of productivity software. This suggests the need to investigate HAI-UX impacts on diverse problem-solvers across a spectrum of domains, from low-stakes domains (e.g., music recommender systems) to high-stakes domains (e.g., automated healthcare or autonomous vehicles).
Threats and limitations like these can only be addressed through additional studies across a spectrum of empirical methods and situations, to isolate different independent variables of study and establish generality of findings across different AI applications, measurements, and populations.