1 Introduction
In the field of Human-Robot Interaction, we typically run experiments to see how human participants react to robots in some way [26]. For instance, this might involve exploring how certain robot behaviours affect people’s perceptions of that robot [29, 51], or whether robots can facilitate learning of a second language [50, 49], or even how children play with them [7]. However, unlike (most) robots, people can be highly unpredictable. In other words, they are noisy. Given an identical task on two separate occasions, they are unlikely to complete it in precisely the same way. Thus, to determine whether any observed effects in our experiments are meaningful, rather than just noise, we use statistical methods, most often Null Hypothesis Statistical Testing (NHST).
In NHST, we test our hypothesis by statistically comparing our results against a hypothesis of no effect [42]. What this means is that tests of statistical significance are a way of deciding between two possible explanations for an observed effect. One explanation, referred to as the null hypothesis, is a statement that there is no difference or relationship between phenomena or populations, and that the observed effect occurred by chance. The alternative hypothesis, in contrast, posits that the effect observed in a sample reflects an effect that exists in the general population. Importantly, NHST rests on the assumption that the null hypothesis is true and determines how likely it is that the observed effect would have occurred if there was no effect in the general population.
To illustrate, imagine you are building a robot for conversational therapy and need to decide on the appearance of the robot. You hypothesise that a robot with more human-like features will be trusted more than one that is more robot-like. You spend several weeks in the workshop building a human-like and a robot-like robot. You bring in participants and randomly assign them to one of two conditions. One half of your participant group interacts with the human-like robot, the other half interacts with the more robot-like robot. After the interaction you measure trust using a questionnaire. The null hypothesis is that there is no difference in trust between the two conditions: Having a robot with more human-like features does not increase or decrease trust compared to a robot that appears more robot-like. Using the results and details of the study, NHST methods can be used to calculate how likely it would be that an effect at least as large as the one you found would occur by chance.
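As a purely hypothetical illustration in R, the decision procedure can be sketched in a few lines; the group means, standard deviation, and sample sizes below are invented for this example and do not come from any study:

# Hypothetical trust scores on a 1-7 questionnaire scale (simulated data)
set.seed(42)
human_like <- rnorm(30, mean = 5.2, sd = 1.0)  # condition 1: human-like robot
robot_like <- rnorm(30, mean = 4.8, sd = 1.0)  # condition 2: robot-like robot

# NHST: independent-samples t-test against the null hypothesis of no difference;
# the reported p-value is the probability of a difference at least this large
# arising by chance if the null hypothesis were true
t.test(human_like, robot_like, var.equal = TRUE)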
When it comes to experimental research using NHST, there are two types of error of which researchers have to be aware. Type I errors are false positives; the null hypothesis is rejected in favour of the alternative hypothesis when the alternative is actually false [19]. In contrast, Type II errors are false negatives; failing to reject the null hypothesis when the alternative hypothesis is true [19]. To ensure that our conclusions are valid—and that the study is telling us something meaningful—it is important that researchers control for these two types of error.
Controlling for Type I errors in statistical analysis is fairly straightforward. The probability of a Type I error is known as \(\alpha\) (alpha). When conducting statistical significance tests, researchers can control for the probability of a Type I error by setting an acceptable \(\alpha\) or significance level [40, 36]. In practice, researchers control for Type I errors by only accepting statistical results as “significant” if the probability of obtaining such a result under the null hypothesis is very small (e.g., \(p \lt 0.05\)). When conducting statistical analyses, tests of significance report a p-value, which denotes the probability of observing results at least as extreme as those observed, if the null hypothesis is true. The \(\alpha\) level acts as a threshold such that p-values smaller than \(\alpha\) are considered “significant”; i.e., the result would be unlikely if the null hypothesis were true. In most cases, the accepted \(\alpha\) level is \(\alpha = 0.05\) (5% chance of a Type I error) [13].
Controlling for Type II errors is, arguably, slightly more complicated. The probability of a Type II error (false negative) is commonly denoted as \(\beta\) (beta). This value is used to calculate the power of a study such that \(\textrm{power} = 1 - \beta\) [38]. To ensure that a study has a low risk of producing a false negative finding, studies must be designed to have sufficient power to find a meaningful difference. That is, as researchers, we want to avoid conducting studies that are either underpowered or overpowered, as both can produce misleading results. As with the alpha level, a researcher can adjust the power value to reflect what they consider an acceptable probability of a Type II error in the statistical tests for their study. Cohen [13] reasoned that a good balance between \(\alpha\) and \(\beta\) would be to have a 5% chance of a Type I error and a 20% chance of a Type II error (i.e., a power of 80%; \(\textrm{power} = 1 - \beta = 0.8\)). These values have since been generally accepted as a good default when conducting behavioural research.
So, how does one design a study to control the probability of a false negative (Type II error) result? By performing a power analysis to inform the design of the study, specifically, its sample size. While there are a number of factors that influence the probability of Type II errors (e.g., \(\alpha\), effect size, sample size, and whether the statistical test used is one- or two-tailed [36]), one of the few factors that can be actively controlled by a researcher is sample size [36]. Therefore, power calculations are conducted to establish how many participants a researcher needs to recruit for their study to have a good chance of detecting an effect of the expected size. For small effect sizes, larger sample sizes are generally required to achieve the intended power. From a planning perspective, power analyses in these cases allow researchers to realistically consider whether enough participants can be recruited or whether a different recruitment method (e.g., online vs. in-person) or study design (within- vs. between-subjects) should be adopted. Conversely, detecting a large effect at the same level of power requires fewer participants. Here, power analyses allow researchers to ensure that they do not spend valuable resources recruiting more participants than necessary.
The goal of behavioural research is to conduct studies that tell us something about the general population. For example, in an HRI context one might want to provide evidence that a robot tutor is significantly more effective at promoting learning than a human tutor. To test this, a study could be run comparing the learning gains of two groups: one that is taught by the robot tutor and one taught by the human tutor. A t-test comparing group means produces a p-value of \(p = 0.03\). Because we chose an alpha level of \(\alpha = 0.05\), we consider this result to be significant, and it effectively demonstrates that, if the null hypothesis were true, then a result at least this extreme would occur only 3% of the time [18, 28].
Now let us say that, to achieve a power level of \(\textrm{power} = 0.8\) and detect an effect size of \(d = 0.5\), the study needed 64 participants in each group (128 participants total, as indicated by Figure 1). Unfortunately, only 30 participants were recruited in each group (60 in total). By referring back to Figure 1, we can see that this resulted in the study only achieving a power level of roughly 0.475, i.e., the study was underpowered. This changes how the findings are interpreted. Studies that are underpowered are more susceptible to random variation in sample means. One consequence of this is that there is a higher chance of producing a large effect size and of reaching statistical significance [20]. These large, significant effects, however, come with a high degree of uncertainty due to the low statistical power, and it is important that this uncertainty be taken into account. That is, rather than concluding that having a robot tutor produces significantly greater learning gains than having a human tutor, it would be more accurate to state that “our results suggest that having a robot tutor may result in better learning than having a human tutor for this task, but the ‘true’ effect may be smaller than that found, and additional studies with greater power are needed to establish a better estimate of the true effect.”
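The same numbers can be checked without reading them off a figure. A minimal sketch using the pwr package for R is shown below; the exact values printed may differ marginally from those read off Figure 1:

library(pwr)

# A priori: participants per group needed to detect d = 0.5 with
# alpha = 0.05 and power = 0.80 (two-sided, independent samples)
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80, type = "two.sample")
# n is just under 64, i.e., 64 participants per group (128 in total) after rounding up

# Achieved power with only 30 participants per group
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, type = "two.sample")
# power is roughly 0.48: the study is underpowered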
Power analyses allow us to calculate the number of participants needed, while controlling for both \(\alpha\) and \(\beta\), so that we can be confident in our results. They are also extremely valuable to readers, as they allow the results presented in a research paper to be properly interpreted. However, there seems to be a lack of power analyses reported in papers in the field of HRI. To examine whether or not this is the case, the next section provides a brief look at how often papers published in the field of HRI include reports of power calculations.
4 Under- and Overpowered Studies
4.1 Underpowered Studies
It is worth noting that we are not arguing that every study must ensure adequate power but simply that the power analysis should be reported. There may be good reasons to go ahead with an underpowered study; for example, if it focuses on a population from which obtaining a sufficient sample size is not realistic (such as persons of advanced age). Underpowered studies can also be useful in raising awareness around a new research avenue, or as a stepping-stone for larger studies.
At the same time, running an underpowered study might also raise ethical concerns, because it requires investments from participants (even if it is just time) in a study whose ability to generate insights might be limited. This concern is particularly acute when dealing with vulnerable participant groups, children, and so on; in other words, precisely the situations in which underpowered studies are likely because, as mentioned above, it may simply not be possible to recruit sufficient numbers from these populations. Since this is an ethical concern, it falls within the remit of the relevant ethical committee or authority to assess it against the possible benefits of the study. A power analysis can help the committee reach the appropriate decision.
If there are no concerns preventing the study from being run, then it can also be published. In that case, it remains important for authors to provide the power analysis, acknowledge the underpowered nature of the study (including an explanation of why this was unavoidable), and ensure that conclusions are phrased accordingly; in particular, avoiding overly assertive or strong claims. At the same time, it is also important for editors and reviewers to realise that there is no “magical” power threshold with automatic rejection on one side of it. As with p-values, confidence intervals, and so on, such measures exist on a continuum, and it is ultimately up to the reader to decide what confidence they have in the results. However, for readers to be able to make an informed decision, they need to be provided with the necessary descriptors, including a power analysis.
To illustrate how such instances could be addressed in publications, we suggest that, if the power analysis done during planning suggests an unrealistic or unattainable sample size and a new, smaller sample size is selected, then researchers can use a plot similar to that presented in Figure 1 to calculate the power level achieved with this sample. This power level can then be reported alongside a brief explanation of why the “preferred” sample size could not be recruited.
4.2 Overpowered Studies
Recently, especially during the COVID-19 pandemic, online data collection, also known as “crowd sourcing,” has proven a very popular way to collect experimental data. Many studies, not only in HRI [6, 9], now rely on online crowd sourcing, as data can be collected with less effort, often at lower cost, in a shorter time, and from a much wider population. This is a promising development, especially when we are prevented from letting participants interact with real robots. However, as recruiting more participants increases the power of a study, it brings with it the problem of potentially overpowering a study [24].
A large number of participants (hundreds or even thousands of participants are not an exception in online HRI studies) increases the statistical power of a study. However, it also makes it more likely that low p-values will be observed [46]. In other words, increasing the number of participants increases the probability of getting a significant result even for very small differences between conditions. So, while your results are statistically significant, they might well be scientifically uninteresting.
Overpowered studies are not wrong in a statistical sense, but they are ethically questionable when only significance is reported, without also reporting effect sizes and discussing the relevance of the effect found. In addition, overpowered studies waste resources. Time, money, and participants are valuable to most of us, and correctly assessing the number of participants needed to answer a research question means that no more resources are used than strictly needed.
5 How to Calculate Power
Numerous tools exist for calculating sample size, including G*Power [16], PASS [32], SAS [44], the pwr package for R [11], and the Power and Sample Size website [27]. The following examples were conducted using G*Power.
To run a power analysis, we first need to collect/define a few key pieces of information:
(1) What statistical test will be used?
(2) What is the chosen value for \(\alpha\)?
(3) What is the size of the effect that we predict we will find?
Question 1: We consider only the statistical test planned for testing the main research question. So, if a study is looking at the effect of robot tutor vs. human tutor on learning, then the main test might be comparing the groups’ average test scores. If run as a between-subjects study, then the test would be an Independent Samples T-test or 1-way ANOVA. Alternatively, such a study could be run using a within-subjects design (thus reducing the required sample size), which would require a Repeated-Measures ANOVA [19].
Question 2: Answering question 2 is, in most cases, a simple task of deciding whether or not to stick with the standard \(\alpha = 0.05\) significance level (although using a lower \(\alpha\) has been argued to improve replication [21]).
Question 3: There are a couple of different ways to answer this question. Effect size quantifies the magnitude of an experimental effect. In an experiment comparing an experimental group to a control group, the effect size quantifies the difference between the two group means [12, 39]. In correlation analyses with two or more variables, or studies where there are more than two groups, the effect size measures the strength of the association between variables [39]. It can be thought of as measuring the correlation between the effect and the dependent variable. Estimating the effect size is arguably the most difficult step in this process. One way of obtaining an estimate of the effect size is by looking at existing, similar research and calculating an average effect size that represents an estimate of the population effect size. However, this approach is subject to potential bias introduced by the “file drawer effect” [3], whereby the preference for publishing statistically significant results has likely resulted in published effect sizes being an over-estimation of the actual population effect size. A direct consequence of this is that future studies that use this method for estimating effect sizes may be underpowered. Some have therefore advised that researchers take a conservative approach when using this method [15].
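For instance, a literature-based estimate might be assembled as follows; the d values below are invented for illustration, and the simple unweighted mean is only a rough stand-in for a proper meta-analytic estimate that would weight each study by its precision:

# Hypothetical Cohen's d values extracted from previously published,
# similar studies (invented numbers, for illustration only)
published_d <- c(0.62, 0.35, 0.48, 0.71, 0.30)

# A simple unweighted mean as a rough estimate of the population effect size
mean(published_d)   # 0.49

# Given likely publication bias, a more conservative value could be used
# instead, e.g., a lower quantile of the published estimates
quantile(published_d, 0.25)   # 0.35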
An alternative approach is to conduct a pilot study before conducting the full study. This approach, however, does come with its own problems, one being that the variability of effect sizes found during small pilot studies will be large [8]. Because of this, researchers may be less inclined to conduct a full study if the pilot study reveals a small or non-significant effect, while pilot studies that reveal large effects might be over-estimations and lead to underpowered full studies [3, 2].
A third option is to use suggested effect size values based on what test we plan to use and whether we expect to find a small, medium, or large effect [14, 17]. For example, if our test is a comparison of independent means (i.e., an independent t-test) and we want to be able to detect a medium effect size, then Cohen [14] suggests an estimated effect size of \(d = 0.5\). It should be noted, at this point, that Cohen’s definitions of what constitutes a “small,” “medium,” and “large” effect size are inconsistent across different measures of effect size. A recent paper by Correll et al. [15] provides an in-depth discussion of this. We therefore recommend caution when taking this approach. In our examples below, we demonstrate one of the recommendations made in Correll et al. [15]: that is, to convert between a standard effect size, e.g., \(\eta^{2}\), and other measures, rather than relying on the different definitions provided by Cohen [14].
To our knowledge, there is currently no “perfect” approach to estimating effect size for power analyses. The best recommendation we can provide, therefore, is to use caution and utilise conservative effect size estimates. The following sections provide examples of how to calculate required sample size for a few different studies.
5.1 Example 1 - Independent Samples T-test
For these demonstrations of power calculations, we use the example study of comparing the effect of a robot vs. a human tutor on learning gains, and we use G*Power to perform the power analysis. In this first example, the hypothetical study is simply intended to compare the effect of tutor on learning. Participants are taught either by a robot or a human tutor and then tested at the end of the learning phase. The average test scores of each group will be compared using an independent t-test.
The effect size for the power analysis is obtained by calculating an estimate of the population effect size based on previous, similar studies. This reveals an average effect size of \(d = 0.46\). The chosen alpha level is \(\alpha = 0.05\), and the power level is \(1 - \beta = 0.8\). The G*Power software calculates that the required sample size is 152 (76 in each group) (see Figure 2).
This is a rather large required sample size and might not be achievable. One way to reduce this requirement is to instead use a within-subjects design, which reduces the total required sample size to 40 (see Figure 3).
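For readers without access to G*Power, a minimal sketch of the same two calculations using the pwr package for R [11] is given below; note that the paired calculation treats \(d = 0.46\) as the standardised mean difference of the paired scores (dz), which equals the between-subjects d only when the correlation between the repeated measures is 0.5:

library(pwr)

# Between-subjects version: independent-samples t-test,
# d = 0.46, alpha = 0.05, power = 0.80
pwr.t.test(d = 0.46, sig.level = 0.05, power = 0.80, type = "two.sample")
# n is roughly 75 per group; rounding up gives 76 per group, 152 in total (cf. Figure 2)

# Within-subjects version: paired t-test with the same effect size (dz = 0.46)
pwr.t.test(d = 0.46, sig.level = 0.05, power = 0.80, type = "paired")
# n is roughly 39 pairs; rounding up gives 40 participants (cf. Figure 3)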
5.2 Example 2 - 2-way ANOVA
To demonstrate a power analysis for a 2-way ANOVA, the study needs to involve two independent variables. Let us therefore imagine that we want to look at the effect of both tutor (robot vs. human) and subject difficulty (easy vs. hard) on learning gains. The study also uses a between-subjects design, requiring four independent groups. Here, following one of the recommendations from Correll et al. [15], we calculate a “medium” effect size based on Cohen’s medium value of \(\eta^{2} = 0.06\) [13]. G*Power can do this conversion and gives the equivalent \(f = 0.253\). As in the previous example, we calculate the required sample size using an alpha level of \(\alpha = 0.05\) and a power level of \(1 - \beta = 0.8\). G*Power gives a required sample size of 125 participants (31 or 32 per group) (see Figure 4).
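The conversion from \(\eta^{2}\) to f can also be done by hand, and the result can be loosely cross-checked with the pwr package; note that pwr parameterises the noncentrality of the F test slightly differently from G*Power, so the implied total sample size may differ by a few participants:

# Convert Cohen's "medium" eta-squared into the effect size f used by G*Power
eta2 <- 0.06
f <- sqrt(eta2 / (1 - eta2))
f   # roughly 0.253, matching the value used above

# Approximate cross-check with the pwr package: general linear model F test,
# numerator df = 1 for one main effect (or the interaction) of a 2 x 2 design
library(pwr)
pwr.f2.test(u = 1, f2 = f^2, sig.level = 0.05, power = 0.80)
# v (denominator df) comes out in the low 120s; adding back the 4 cell means
# gives a total N close to, though not exactly, the 125 reported by G*Power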
5.3 Example 3 - Repeated Measures ANOVA
As a third example, let us propose a study using a within-subjects design. The study is the same as the previous example, where we examine the effect of tutor (robot vs. human) and task difficulty (easy vs. hard) on learning gains, but participants experience all four conditions. The estimated effect size is again \(f = 0.253\), and we choose the parameters \(\alpha = 0.05\) and \(1 - \beta = 0.8\). There are no between-subjects factors, so the study has one group and four measurements. A pilot study can be used to get an estimate of the correlation between repeated measures. In this case, we will imagine this shows a correlation of roughly \(\rho = 0.5\) and that the sphericity assumption was not violated (therefore \(\epsilon = 1\)). This analysis reveals that the sample size required to ensure the chosen power level is 23 (see Figure 5).
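The pwr package has no direct equivalent of this repeated-measures calculation, but the result can be sanity-checked in base R using the noncentral F distribution. The noncentrality formula below is our reading of G*Power’s parameterisation for within-subjects factors and should be treated as an assumption; G*Power itself remains the reference here:

# Sanity check of the repeated-measures result using the noncentral F distribution.
# Assumed noncentrality (following G*Power's within-factors parameterisation):
#   lambda = f^2 * N * m * eps / (1 - rho)
f <- 0.253; rho <- 0.5; eps <- 1; m <- 4; alpha <- 0.05
N <- 23   # sample size suggested by G*Power (Figure 5)

lambda <- f^2 * N * m * eps / (1 - rho)
df1 <- (m - 1) * eps            # numerator degrees of freedom
df2 <- (N - 1) * (m - 1) * eps  # denominator degrees of freedom
Fcrit <- qf(1 - alpha, df1, df2)

# Achieved power: probability of exceeding the critical F under the alternative
pf(Fcrit, df1, df2, ncp = lambda, lower.tail = FALSE)
# roughly 0.8, consistent with N = 23 meeting the chosen power level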
6 A Note on Post Hoc Power
The examples we have provided here are examples of a priori power analyses—calculations done before data collection to inform study design. Another type of power analysis is the post hoc analysis, conducted after the study has been run. There are a few ways to approach this. The first can be thought of as simply doing the power analysis after the study. In this instance, the power analysis is conducted using the same method as an a priori analysis: using the chosen alpha and target power values and an estimated effect size, one calculates how many participants would have been required to achieve the target power. This can then be compared with the actual sample size to assess the study. Alternatively, one can use the chosen alpha level, the estimated effect size, and the actual sample size to estimate the achieved power level. Note that in both of these cases it is the estimated effect size, not the effect size observed in the study, that is used in the analysis. These methods can be useful for interpreting the study’s results, much as a priori power calculations are, and for providing a possible explanation for inconclusive results. For instance, if a study did not reveal any effect, then this kind of post hoc analysis might suggest that the study was underpowered and that a second, better powered study should be conducted.
Another approach that is often seen is to use the actual sample size and the observed effect size to calculate power. This has also been referred to as “observed power” or “retrospective power” [41]. However, it has been argued that this approach is not as meaningful as once thought. The purpose of a power analysis is to determine the likelihood that a chosen statistical test will detect an effect at least as large as the “real” or population effect, assuming there is one. Power analyses that use the observed effect size, in contrast, determine the likelihood that the test will produce a statistically significant result, assuming that the population effect is the same as the observed effect [48, 35, 41, 45]. This amounts to a very strong assumption. Additionally, it has been shown that there is a one-to-one relationship between p-values and this kind of observed power statistic, such that tests with larger p-values always have low “observed” power [25]. This kind of post hoc power analysis, therefore, offers little additional insight in the case of non-significant results.
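A quick numerical illustration of that one-to-one relationship, again using the pwr package and a hypothetical two-sample design with 30 participants per group (the numbers are invented purely for illustration):

library(pwr)

# Construct a hypothetical result that lands exactly on the significance
# threshold (p = .05) for a two-sample t-test with 30 participants per group
n <- 30
t_crit <- qt(0.975, df = 2 * n - 2)   # critical t for alpha = .05, two-sided
d_obs  <- t_crit * sqrt(2 / n)        # observed d corresponding to p = .05

# "Observed" power computed from the observed effect size and the actual n
pwr.t.test(n = n, d = d_obs, sig.level = 0.05, type = "two.sample")
# power is roughly 0.50: a result that is only just significant always has
# roughly 50% observed power, so the statistic adds nothing beyond the p-value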
7 Conclusion
This article presents a discussion of the importance of power calculations in response to the observation that power seems to be under-reported in the field of HRI. The first step to addressing this is to encourage researchers to perform power calculations and report these calculations in their papers.
One possible reason that power analyses are under-reported might be that the majority of researchers have not received sufficient training on the use and importance of power analyses and sample size calculations. This is partly due to the multi-disciplinary nature of HRI. Just as one would not expect a psychologist to have received formal training on how to develop or program a robot, one would not expect someone trained in robotics or software engineering to have also received training on how to design experiments involving human participants. Fortunately, there is a wide range of educational resources available on power analyses and the importance of calculating sample size, including informational and instructional books [34, 19, 23, 31], articles [16, 43, 30, 36, 38, 5], and videos [47, 22, 10, 33]. Many of these resources are freely available, and links to the YouTube videos and channels can be found in the reference section of this article.
Another way to educate researchers on power analysis, which may be more beneficial and potentially reach more people, would be to introduce training and educational workshops at more conferences, where the focus is on learning skills and techniques rather than on presenting recent research. In HRI, these types of workshops could be run not only to teach researchers about power analysis, but also other skills such as statistical analysis methods, coding for social robotics, and even introductions to newly developed datasets, so that researchers can gain hands-on experience of what a dataset contains and how it might be used.
In terms of the current state of the field, the lack of power analyses in existing research is concerning in that it suggests that the conclusions that have so far been drawn may not be accurate. Another, hugely important, next step is therefore to replicate. In general, more emphasis on the importance of replications is desperately needed in most, if not all, scientific fields. This is a problem that is difficult to address from within the research community, for several reasons. First, most journals prioritise, or even require, novelty in the studies that they publish. A recent review of over 1,000 psychology journals found that only 3% stated that they welcomed replication studies for publication [37]. Additionally, funding does not appear to be widely available for replication studies, considering that the Netherlands Organisation for Scientific Research made headlines in 2016 with the first grant programme dedicated to replication studies [4]. Thus, encouraging replications is not a trivial issue and requires the coordinated action of researchers, funding agencies, and journals. In the short term, however, providing educational resources and ensuring the use of power analyses will enable us to be more confident in the conclusions that we draw from future research.