Talk:P-value: Difference between revisions

Revision as of 22:49, 8 September 2021


Misleading examples

The examples given are rather misleading. For example, in the section about the rolling of two dice, the article says: "In this case, a single roll provides a very weak basis (that is, insufficient data) to draw a meaningful conclusion about the dice."

However, it makes no attempt to explain why this is so, and a slight alteration of the conditions of the experiment renders this statement false.

Consider a hustler/gambler who has two sets of apparently identical dice, one of which is loaded and the other fair. If he forgets which is which, and then rolls one set and gets two sixes immediately, it is quite clear that he has identified the loaded set.

The example relies upon the underlying assumption that dice are almost always fair, and therefore it would take more than a single roll to convince you that they are not. However, this assumption is never made explicit, which might mislead people into supposing that a 0.05 p-value would never be sufficient to establish statistical significance. Richard Cant — Preceding unsigned comment added by 152.71.70.77 (talk)
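For concreteness, a minimal sketch of the hustler scenario in Python (the comment above does not say how heavily the dice are loaded or what the prior odds are; this assumes each loaded die shows a six with probability 1/2 and a 50/50 prior over which set is which):

# Posterior probability that the rolled set is the loaded one, given that a
# single roll of the pair showed two sixes. Assumed (not stated above): each
# loaded die shows a six with probability 1/2, fair dice with probability 1/6,
# and a 50/50 prior over which set is the loaded one.
p_two_sixes_loaded = 0.5 ** 2        # P(two sixes | loaded pair) = 1/4
p_two_sixes_fair = (1 / 6) ** 2      # P(two sixes | fair pair)   = 1/36
prior_loaded = 0.5
posterior = (p_two_sixes_loaded * prior_loaded /
             (p_two_sixes_loaded * prior_loaded + p_two_sixes_fair * (1 - prior_loaded)))
print(posterior)                     # 0.9: the loaded set is the clear favourite

Under a flat prior the single roll is already decisive, which is the point of the comment; under a strong prior that dice are fair, it would not be.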

Alternating Coin Flips Example Should Be Removed

"By the second test statistic, the data yield a low p-value, suggesting that the pattern of flips observed is very, very unlikely. There is no "alternative hypothesis" (so only rejection of the null hypothesis is possible) and such data could have many causes. The data may instead be forged, or the coin may be flipped by a magician who intentionally alternated outcomes.

This example demonstrates that the p-value depends completely on the test statistic used and illustrates that p-values can only help researchers to reject a null hypothesis, not consider other hypotheses."

Why would there be "no alternative hypothesis?" Whenever there is a null hypothesis (H0), there must be an alternative hypothesis ("not H0"). In this case, the null hypothesis is that the coin-flipping is not biased toward alternation. Consequently, the alternative hypothesis is that the coin-flipping IS biased toward alternation. It seems that the author of this passage did not understand what "alternative hypothesis" means. The same confusion is apparent in the claim that p-values can't help researchers "consider other hypotheses." There are other problems with the passage as well (e.g., the unencyclopedic phrase "very, very" and, as another editor noted, a highly arbitrary description). I suggest getting rid of the whole section, which is completely unsourced, is full of questionable claims, is likely to cause confusion, and serves no apparent function in the article. — Preceding unsigned comment added by 23.242.198.189 (talk) 01:50, 24 July 2019 (UTC)[reply]
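For reference, a minimal sketch of the quoted claim that the p-value depends on the test statistic. The passage does not give the actual sequence, so this assumes 20 perfectly alternating flips of a fair coin and compares a "number of heads" statistic with a "number of alternations" statistic:

from math import comb

flips = "HT" * 10                       # assumed: 20 perfectly alternating flips
n = len(flips)

# Statistic 1: number of heads; one-sided p-value for "at least this many heads".
heads = flips.count("H")
p_heads = sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n
print(p_heads)                          # about 0.588: 10 heads in 20 flips is unremarkable

# Statistic 2: number of alternations between consecutive flips. For a fair
# coin the 19 "did it change?" indicators are i.i.d. Bernoulli(1/2).
changes = sum(a != b for a, b in zip(flips, flips[1:]))
p_alt = sum(comb(n - 1, k) for k in range(changes, n)) / 2 ** (n - 1)
print(p_alt)                            # about 1.9e-06: extreme by this statistic

The same data give a p-value near 0.6 by one statistic and near two in a million by the other, which is the dependence the quoted passage describes.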

Also, the very concept of coin-flipping that is biased toward alternation is quite odd and not particularly realistic outside of a fake-data scenario. The examples of trick coins that are biased towards one side or the other are much more intuitive, and thus much more useful in my opinion. 23.242.198.189 (talk) 06:55, 24 July 2019 (UTC)[reply]

What on Earth is "in my opinion" supposed to mean in an unsigned "contribution"?
FWIW, I agree with that opinion. I have neither seen nor ever heard of a coin being biased to alternate and cannot imagine how one might be made.
David Lloyd-Jones (talk) 08:19, 4 May 2020 (UTC)[reply]
I imagine that "in my opinion" means the same thing in an unsigned contribution that it means in a signed contribution. I don't see why that should be confusing or why there would be a need to put quotes around "contribution." 99.47.245.32 (talk) 20:16, 2 January 2021 (UTC)[reply]

Actually, most of the examples are problematic, completely unsourced, and should be removed. For instance, the "sample size dependence" example says: "If the coin was flipped only 5 times, the p-value would be 2/32 = 0.0625, which is not significant at the 0.05 level. But if the coin was flipped 10 times, the p-value would be 2/1024 ≈ 0.002, which is significant at the 0.05 level." Huh? How can you say what the p-value will be without knowing what the results of the coin-flips will be? And the "one roll of a pair of dice" example appears to be nonsensical; it's not even clear how the test statistic (the sum of the rolled numbers) is supposed to relate to the null hypothesis that the dice are fair, and the idea of computing a p-value from a single data point is very odd in itself. Thus, the example doesn't seem very realistic or useful for understanding how p-values work, and actually risks causing confusion and misunderstanding. Therefore, I suggest that the article would be improved by removing all the "examples" except for the one entitled "coin flipping." 131.179.60.237 (talk) 20:42, 24 July 2019 (UTC)[reply]
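For what it's worth, the quoted figures do work out if the example implicitly assumes that every flip lands the same way (all heads or all tails) and a two-sided test; a quick check under that assumption:

# Two-sided p-value when all n flips of a fair coin land the same way
# (the assumption the quoted example appears to leave implicit).
def p_all_alike(n):
    return 2 / 2 ** n

print(p_all_alike(5))    # 0.0625 = 2/32
print(p_all_alike(10))   # 0.001953125, i.e. 2/1024 ≈ 0.002

That assumption should of course be stated explicitly if the example is kept.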

This dreadful article nowhere tells us what a p-value test is, nor how one is calculated. It merely pretends to. The whole thing is just a lot of blather about some p-value tests people have reported under the pretence of telling us "what p-values do" or something of the sort.
The promiscuous and incompetent use of commas leaves two or three lists of supposed distinctions muddy and ambiguous.
Given the somewhat flamboyant and demonstrative use of X's and Greek letters, my impression is that this was written by a statistician of only moderate competence who regards himself, almost certainly a him, as so far above us all that he need not actually focus on the questions at hand.
David Lloyd-Jones (talk) 08:19, 4 May 2020 (UTC)[reply]
Indeed, the article is hopeless. I made some changes a year ago (see "Talk Archive 2") and explained on the talk pages why and what I had done, but that work has been undone by editors who did not understand the difficulties I had referred to. I think the article should start by describing the concept of a statistical model: namely, a family of possible probability distributions of some data. Then one should talk about a hypothesis: that's a subset of the possible probability distributions. Then a test statistic. Finally one can give the correct definition of p-value: the largest probability which any null-hypothesis model gives to observing a value of the statistic as large as, or larger than, the value actually observed. I know it is a complex and convoluted definition. But one can give lots of examples of varying levels of complexity. Finally one can write statements about p-values which are actually true, such as the fact that *if* the null hypothesis *fixes* the probability distribution of your statistic, and if that statistic is continuously distributed, *then* your p-value is uniformly distributed between 0 and 1 if the null hypothesis is true. I know that "truth" is not a criterion which Wikipedia editors may use. But hopefully, enough reliable sources exist to support my claims. What is presently written in the article on this subject is nonsense. Richard Gill (talk) 14:52, 22 June 2020 (UTC)[reply]
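A minimal simulation of the uniformity claim above (the comment does not tie it to a particular test; this assumes a one-sided z-test of H0: mu = 0 with known variance 1 and n = 30 observations per experiment):

import numpy as np
from scipy import stats

# Under H0 the z statistic has a fixed, continuous distribution, so the
# one-sided p-value should be uniform on (0, 1); in particular P(p <= u) = u.
rng = np.random.default_rng(0)
n, reps = 30, 100_000
data = rng.normal(loc=0.0, scale=1.0, size=(reps, n))   # data simulated under H0
z = np.sqrt(n) * data.mean(axis=1)                       # z statistic, sigma = 1 known
p = stats.norm.sf(z)                                     # one-sided p-values
print(np.mean(p < 0.05))                                 # close to 0.05
print(np.mean(p < 0.50))                                 # close to 0.50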
I have made a whole lot of changes. Richard Gill (talk) 16:48, 22 June 2020 (UTC)[reply]

The article is moving in a good direction, thanks Richard Gill. A point about reader expectations with regards to the article: talk of p-values almost always occurs in the context of NHST; the 'Basic concepts' section is essentially an outline of NHST, but the article nowhere names NHST and Null hypothesis significance testing is a redirect to Statistical inference, an article that is probably not the best introduction to the topic (we also have a redirect from the mishyphenated Null-hypothesis significance-testing to Statistical hypothesis testing). I suggest tweaking the 'Basic concepts' section so that NHST is defined there and have NHST redirect to this article. — Charles Stewart (talk) 19:51, 22 June 2020 (UTC)[reply]

Thanks, Chalst; I have made some more changes in the same direction, namely to distinguish between original data X and a statistic T. This also led to further adjustments to the material on one-sided versus two-sided tests and then to the example of 20 coin tosses. I'm glad more people are looking at this article! It's very central in statistics. The topic is difficult, no doubt about it. Richard Gill (talk) 12:28, 30 June 2020 (UTC)[reply]

I reconfigured the Basic Concepts section, building on Gill110951's work and Chalst's comments. I tried to clarify what null hypothesis testing is, what we do in it, and the importance of p-values to it. I focused on framing p-values in terms of rejecting the null hypothesis, and tried to explain the importance of also looking at real-world relevance. (I'm not sure if I should put this here or in a separate section, but it seemed a continuation of what Gill110951 did) TryingToUnderstand11 (talk) 09:55, 20 August 2021 (UTC)[reply]

Case of a composite null hypothesis

If a null hypothesis is composite, there is not just one null-hypothesis probability that your test statistic exceeds the value actually observed, but many. For instance, consider a one-sided test that a normal mean is less than or equal to zero versus the alternative that it is strictly larger than zero. The p-value is computed "at mu = 0" since this gives the *largest* possible probability under the null hypothesis. The ASA definition is inadequate. It is correct for simple null hypotheses, but not in general. We need an authoritative published general-case definition, to show people that they cannot rely narrowly on what the ASA said. Richard Gill (talk) 13:15, 24 August 2020 (UTC)[reply]
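In symbols (a sketch of the normal-mean example above, assuming known variance sigma^2, sample size n and observed sample mean \bar{x}):

\[
p \;=\; \sup_{\mu \le 0} P_\mu\!\left(\bar X \ge \bar x\right)
\;=\; P_{\mu = 0}\!\left(\bar X \ge \bar x\right)
\;=\; 1 - \Phi\!\left(\frac{\bar x}{\sigma/\sqrt{n}}\right),
\]

since P_\mu(\bar X \ge \bar x) increases with \mu, so the supremum over the composite null \mu \le 0 is attained at the boundary value \mu = 0.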

Notice that an important characteristic of the p-value is that in order to have a hypothesis test of level alpha, one rejects if and only if the p-value is less than or equal to alpha. An alternative definition of the p-value is: the smallest significance level such that one could still (just) reject the null hypothesis at that level of significance. The level of significance is defined as the *largest* probability under the null hypothesis of (incorrectly) rejecting the null. Hence the p-value is the largest probability under the null of exceeding (or equalling) the observed value of the test statistic. At least, for a one-sided test based on *large* values of some statistic.
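Written out (a sketch, with \Theta_0 the null hypothesis and R_\alpha the nested rejection regions of the level-\alpha tests):

\[
\alpha(R) \;=\; \sup_{\theta \in \Theta_0} P_\theta(X \in R),
\qquad
p(x) \;=\; \inf\{\alpha : x \in R_\alpha\}
\;=\; \sup_{\theta \in \Theta_0} P_\theta\bigl(T(X) \ge T(x)\bigr),
\]

the last equality holding for a one-sided test that rejects for large values of a statistic T.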

I know that "truth" is not an allowed criterion for inclusion on Wikipedia. One must have reliable sources, and notability. But I do think it is important at least not to tell lies on Wikipedia if it can be avoided, even if many authorities do it all the time. Perhaps those are "lies for children". But children will hopefully grow up and they must be able to find out that they were told a simplified version of the truth. (That's just my personal opinion). Richard Gill (talk) 13:47, 24 August 2020 (UTC)[reply]

Here is a good literature reference: Chapter 17 Section 4 (p. 216) [sorry, I first had "Chapter 7"] of Larry Wasserman, "All of Statistics: A Concise Course in Statistical Inference", Springer, 1st edition (2002). Richard Gill (talk) 15:15, 24 August 2020 (UTC).[reply]

And another: Section 3.3 of Testing Statistical Hypotheses (third edition), E. L. Lehmann and Joseph P. Romano (2005), Springer. Richard Gill (talk) 16:08, 24 August 2020 (UTC)[reply]


The editor's point is well taken that in the case of one-sided tests, one could argue that a nonzero effect in the uninteresting direction constitutes a "null hypothesis" scenario, and the probability distribution of the test statistic depends on the exact size of that uninteresting effect. But that appears to be an equivocal definition of the null hypothesis. The p-value is computed based on the point null hypothesis distribution, and there is only one of those. If that issue needs clarification, that clarification should be sourced, clearly explained, and placed in a note somewhere about the special case of one-sided tests--not in the lede, not in the main definition of the p-value, and not using unsourced, unexplained, potentially confusing phrases such as "best probability." Understanding what a p-value is presents enough of a challenge without adding unnecessary complications!
I have examined the two sources that Richard Gill claims support the "best/largest probability" phrasing. On the contrary, they both support the standard phrasing. Indeed, Lehmann & Romano (2005, p. 64) define the p-value as P0(X > x), where P0 is the probability under the null hypothesis, X is a random-variable test statistic, and x is the observed value of X. And Wasserman (2004) defines the p-value as: "the probability (under H0) of observing a value of the test statistic the same as or more extreme than what was actually observed" (p. 158). The phrases "best probability" and "largest probability" do not appear.
Thus, I see three compelling reasons that it's more appropriate to simply say "probability," rather than "best probability" or "largest probability":
(1) The latter phrasings appear to be nonstandard, as authoritative sources (including the ASA statement) appear to consistently use the former.
(2) The phrases "best probability" and "largest probability" are unclear and potentially confusing.
(3) The arguments by the one editor pushing for the nonstandard phrasing are directly contradicted by that editor's own provided sources. 23.242.198.189 (talk) 09:45, 15 December 2020 (UTC)[reply]
Sorry, anonymous editor, you are referring to the places in the text where the authors define the P-value for the case of a simple null hypothesis. If the null hypothesis is composite, P0(X > x) is not even defined!!!! Please register as a user so we can discuss this further. Richard Gill (talk) 16:10, 17 February 2021 (UTC)[reply]
I checked the two books I had cited before. For Wasserman, see Chapter 17 Section 4 (p. 216) [I had formerly referred to Chapter 7 by mistake]; for Lehmann and Romano, see Lemma 3.3.1, formula (3.12), which defines the p-value for composite hypotheses. Richard Gill (talk) 14:40, 20 February 2021 (UTC)[reply]

If you think the definition should include the words "best probability" or "largest probability," your task is simple: Provide authoritative sources that use that language. You haven't done that. Instead, you've provided two sources that DON'T use that language. There's really nothing to discuss. 23.242.198.189 (talk) 19:30, 13 March 2021 (UTC)[reply]

The sources I refer to use strict, concise, precise mathematical language. If you like, I can write out their formulas in words. If necessary we can reproduce the formulas in the article. If you can’t read their formulas, then you have a problem: you will have to rely on those who can. I have a huge number of statistics text books in my office, which I haven’t visited for more than a year. I don’t fancy buying 20 eBooks right now. Richard Gill (talk) 13:07, 15 June 2021 (UTC)[reply]

In particular, Wasserman writes in a displayed formula "p-value equals the supremum over all theta in Theta_0 of the probability under theta that T(X^n) is greater than or equal to T(x^n)". Here, T(.) is the statistic you are using (a function of your data); x^n is the data you actually observed; X^n is the data thought of as a random vector with a probability distribution that depends on some parameter theta. Theta_0 is the set of parameter values that constitute the null hypothesis.
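In symbols, the displayed formula being described is

\[
p\text{-value} \;=\; \sup_{\theta \in \Theta_0} P_\theta\bigl(T(X^n) \ge T(x^n)\bigr).
\]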

Lehmann and Romano, first of all, define the *size* of a test as the supremum over theta in Omega_H (the null hypothesis) of the probability that the data lies in the rejection region. Next, they suppose that we have a family of tests with nested rejection regions, each with its own size; obviously, size increases as the rejection region gets larger. They then define the p-value to be the smallest size at which the test rejects (formula 3.11). The result is summarized in their Lemma 3.3.1: the p-value is a random variable such that for *all* theta in the null hypothesis, and for any number u between zero and one, the probability that the p-value is less than or equal to u is itself less than or equal to u. I think that "best probability" or "largest probability" are excellent ways to translate "supremum" into plain English, while not bothering the reader with the fact that a supremum might only be approached arbitrarily closely, not necessarily achieved. In common examples the supremum is achieved, i.e. it is a maximum. The biggest, or indeed, the best probability...
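In symbols, the construction just described amounts to (with S_\alpha the nested rejection regions and \Omega_H the null hypothesis):

\[
\hat p \;=\; \inf\{\alpha : X \in S_\alpha\},
\qquad
P_\theta(\hat p \le u) \;\le\; u
\quad \text{for all } \theta \in \Omega_H \text{ and all } 0 \le u \le 1.
\]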

Notice, by the way, that the ASA statement is controversial. The American Statistical Association has charged a committee with publishing a re-appraisal of p-values and significance testing and it has just come out, https://www.e-publications.org/ims/submission/AOAS/user/submissionFile/51526?confirm=79a17040 . It is not so simplistic or dogmatic.

Here is another reference: Fetsje Bijma, Marianne Jonker, Aad van der Vaart (2017), An Introduction to Mathematical Statistics, Amsterdam University Press, ISBN 9789048536115, Definition 4.19. The p-value is equal to the supremum, taken over theta in Theta_0, of the probability under theta that T (your test statistic) is greater than or equal to t (the observed value of your chosen statistic).

Amusingly, David Cox's Principles of Statistical Inference (2006) does not explicitly define p-values for general composite hypotheses. He focusses on ways of finding methods for getting exact p-values, for instance by conditioning on ancillary statistics, or approximate p-values, for instance from asymptotic theory. Another standard text which avoids the subject is the one by Hogg, McKean and Craig. The nearest they get is by saying "Moreover, sometimes alpha is called the 'maximum of probabilities of committing an error of Type I' and the 'maximum of the power of the test when H0 is true'." This is strange language: they should say that alpha is sometimes *defined* in this way. Consequently, the p-value is *defined* in the way done by the other more mathematically precise authors whom I have cited here. Hogg and Craig say that they want to warn the reader that they may come across usage of language different from their own. Their book is entitled "Introduction to Mathematical Statistics", but in fact they are not very mathematical. Richard Gill (talk) 17:22, 20 June 2021 (UTC)[reply]