Talk:P-value
This is the talk page for discussing improvements to the P-value article. This is not a forum for general discussion of the article's subject.
Article policies
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL
Archives: 1, 2 · Auto-archiving period: 90 days
This article has not yet been rated on Wikipedia's content assessment scale. It is of interest to multiple WikiProjects.
Please add the quality rating to the {{WikiProject banner shell}} template instead of this project banner. See WP:PIQA for details.
Misleading examples
The examples given are rather misleading. For example, in the section about the rolling of two dice the article says: "In this case, a single roll provides a very weak basis (that is, insufficient data) to draw a meaningful conclusion about the dice."
However, it makes no attempt to explain why this is so, and a slight alteration of the conditions of the experiment renders this statement false.
Consider a hustler/gambler who has two sets of apparently identical dice, one of which is loaded and the other fair. If he forgets which is which, and then rolls one set and gets two sixes immediately, it is quite clear that he has identified the loaded set.
The example relies upon the underlying assumption that dice are almost always fair, and therefore that it would take more than a single roll to convince you that they are not. However, this assumption is never made explicit, which might mislead people into supposing that a p-value of 0.05 would never be sufficient to establish statistical significance. Richard Cant — Preceding unsigned comment added by 152.71.70.77 (talk)
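For what it's worth, the arithmetic behind this thread is simple to check. A minimal sketch, using the article's own choice of the sum of the two faces as test statistic and the fair-dice null hypothesis (a double six is then the most extreme possible outcome):

```python
from itertools import product

# Null hypothesis: both dice are fair. Test statistic: the sum of the two faces.
# Observed result: a double six, i.e. a sum of 12, the largest possible value.
outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls
p_value = sum(a + b >= 12 for a, b in outcomes) / len(outcomes)
print(p_value)   # 1/36 ≈ 0.028, below the conventional 0.05 threshold
```

So the single roll does yield p ≈ 0.028 < 0.05; whether that warrants concluding the dice are loaded is exactly the prior-assumption question raised above.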
Alternating Coin Flips Example Should Be Removed
"By the second test statistic, the data yield a low p-value, suggesting that the pattern of flips observed is very, very unlikely. There is no "alternative hypothesis" (so only rejection of the null hypothesis is possible) and such data could have many causes. The data may instead be forged, or the coin may be flipped by a magician who intentionally alternated outcomes.
This example demonstrates that the p-value depends completely on the test statistic used and illustrates that p-values can only help researchers to reject a null hypothesis, not consider other hypotheses."
Why would there be "no alternative hypothesis?" Whenever there is a null hypothesis (H0), there must be an alternative hypothesis ("not H0"). In this case, the null hypothesis is that the coin-flipping is not biased toward alternation. Consequently, the alternative hypothesis is that the coin-flipping IS biased toward alternation. It seems that the author of this passage did not understand what "alternative hypothesis" means. The same confusion is apparent in the claim that p-values can't help researchers "consider other hypotheses." There are other problems with the passage as well (e.g., the unencyclopedic phrase "very, very" and, as another editor noted, a highly arbitrary description). I suggest getting rid of the whole section, which is completely unsourced, is full of questionable claims, is likely to cause confusion, and serves no apparent function in the article. — Preceding unsigned comment added by 23.242.198.189 (talk) 01:50, 24 July 2019 (UTC)
Also, the very concept of coin-flipping that is biased toward alternation is quite odd and not particularly realistic outside of a fake-data scenario. The examples of trick coins that are biased towards one side or the other are much more intuitive, and thus much more useful in my opinion. 23.242.198.189 (talk) 06:55, 24 July 2019 (UTC)
- What on Earth is "in my opinion" supposed to mean in an unsigned "contribution"?
- FWIW, I agree with that opinion. I have neither seen nor ever heard of a coin being biased to alternate and cannot imagine how one might be made.
- David Lloyd-Jones (talk) 08:19, 4 May 2020 (UTC)
- I imagine that "in my opinion" means the same thing in an unsigned contribution that it means in a signed contribution. I don't see why that should be confusing or why there would be a need to put quotes around "contribution." 99.47.245.32 (talk) 20:16, 2 January 2021 (UTC)
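Setting the signature dispute aside: for anyone trying to follow the disputed passage, here is one way its alternation example could be made precise. A hypothetical sketch only; the passage names no test statistic or sample size, so the choice of 20 flips and of the number of alternations as the statistic are both assumptions:

```python
from scipy.stats import binom

# Null hypothesis: fair, independent coin flips. Test statistic: the number of
# alternations, i.e. successive flips that differ. Under the null, each of the
# n-1 successive pairs alternates independently with probability 1/2, so the
# count is Binomial(n-1, 1/2).
n = 20                                         # assumed number of flips
observed = n - 1                               # a perfectly alternating sequence
p_value = binom.sf(observed - 1, n - 1, 0.5)   # P(alternations >= n-1)
print(p_value)                                 # (1/2)**19 ≈ 1.9e-6
```

The same sequence scored by the usual count-of-heads statistic has 10 heads in 20 flips and looks entirely unremarkable, which appears to be the point the passage was reaching for: the p-value depends on the choice of test statistic.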
Actually, most of the examples are problematic, completely unsourced, and should be removed. For instance, the "sample size dependence" example says: "If the coin was flipped only 5 times, the p-value would be 2/32 = 0.0625, which is not significant at the 0.05 level. But if the coin was flipped 10 times, the p-value would be 2/1024 ≈ 0.002, which is significant at the 0.05 level." Huh? How can you say what the p-value will be without knowing what the results of the coin-flips will be? And the "one roll of a pair of dice" example appears to be nonsensical; it's not even clear how the test statistic (the sum of the rolled numbers) is supposed to relate to the null hypothesis that the dice are fair, and the idea of computing a p-value from a single data point is very odd in itself. Thus, the example doesn't seem very realistic or useful for understanding how p-values work, and actually risks causing confusion and misunderstanding. Therefore, I suggest that the article would be improved by removing all the "examples" except for the one entitled "coin flipping." 131.179.60.237 (talk) 20:42, 24 July 2019 (UTC)
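For the record, the quoted numbers are only intelligible under an assumption the passage never states: that every flip lands heads and the test of the fair-coin null is two-sided. A sketch of that reading:

```python
# Assumed scenario: n flips, all heads, two-sided test of a fair coin.
# The most extreme outcomes are all-heads and all-tails, so the two-sided
# p-value is 2 * (1/2)**n.
for n in (5, 10):
    print(n, 2 * 0.5**n)   # 5 -> 0.0625 (= 2/32), 10 -> 0.001953125 (≈ 2/1024)
```

Which rather confirms the complaint: without that unstated assumption, the example makes no sense.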
- This dreadful article nowhere tells us what a p-value test is, nor how one is calculated. It merely pretends to. The whole thing is just a lot of blather about some p-value tests people have reported under the pretence of telling us "what p-values do" or something of the sort.
- The promiscuous and incompetent use of commas leaves two or three lists of supposed distinctions muddy and ambiguous.
- Given the somewhat flamboyant and demonstrative use of X's and Greek letters, my impression is that this was written by a statistician of only moderate competence who regards himself (almost certainly a "him") as so far above us all that he need not actually focus on the questions at hand.
- David Lloyd-Jones (talk) 08:19, 4 May 2020 (UTC)
- Indeed, the article is hopeless. I made some changes a year ago (see "Talk Archive 2") and explained on the talk pages why and what I had done, but that work has been undone by editors who did not understand the difficulties I had referred to. I think the article should start by describing the concept of a statistical model: namely, a family of possible probability distributions of some data. Then one should talk about a hypothesis: that's a subset of the possible probability distributions. Then a test statistic. Finally one can give the correct definition of the p-value as the largest probability, under any null hypothesis model, of observing a value of the statistic as large as the one actually observed, or larger. I know it is a complex and convoluted definition. But one can give lots of examples of varying levels of complexity. Finally one can write statements about p-values which are actually true, such as the fact that *if* the null hypothesis *fixes* the probability distribution of your statistic, and if that statistic is continuously distributed, *then* your p-value is uniformly distributed between 0 and 1 when the null hypothesis is true. I know that "truth" is not a criterion which Wikipedia editors may use. But hopefully, enough reliable sources exist to support my claims. What is presently written in the article on this subject is nonsense. Richard Gill (talk) 14:52, 22 June 2020 (UTC)
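The uniformity claim at the end of the preceding comment is easy to check numerically. A minimal simulation sketch; the sample size, number of replications, seed, and the choice of a one-sided z-statistic are all invented for illustration:

```python
import numpy as np
from scipy.stats import norm, kstest

# Simple null hypothesis: data are i.i.d. N(0, 1). The statistic is the
# standardized sample mean, which is continuously distributed, so under the
# null the one-sided p-value should be Uniform(0, 1).
rng = np.random.default_rng(0)
n, reps = 30, 100_000
means = rng.standard_normal((reps, n)).mean(axis=1)   # data generated under H0
p_values = norm.sf(means * np.sqrt(n))                # one-sided p-values
print(kstest(p_values, "uniform"))                    # no evidence against uniformity
```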
- I have made a whole lot of changes. Richard Gill (talk) 16:48, 22 June 2020 (UTC)
The article is moving in a good direction, thanks Richard Gill. A point about reader expectations with regard to the article: talk of p-values almost always occurs in the context of NHST; the 'Basic concepts' section is essentially an outline of NHST, but the article nowhere names NHST, and Null hypothesis significance testing is a redirect to Statistical inference, an article that is probably not the best introduction to the topic (we also have a redirect from the mishyphenated Null-hypothesis significance-testing to Statistical hypothesis testing). I suggest tweaking the 'Basic concepts' section so that NHST is defined there and having NHST redirect to this article. — Charles Stewart (talk) 19:51, 22 June 2020 (UTC)
- Thanks Chalst; I have made some more changes in the same direction, namely to distinguish between the original data X and a statistic T. This also led to further adjustments to the material on one-sided versus two-sided tests, and then to the example of 20 coin tosses. I'm glad more people are looking at this article! It's very central in statistics. The topic is difficult, no doubt about it. Richard Gill (talk) 12:28, 30 June 2020 (UTC)
Case of a composite null hypothesis
If a null hypothesis is composite, there is not just one null-hypothesis probability that your test statistic exceeds the value actually observed, but many. For instance, consider a one-sided test that a normal mean is less than or equal to zero versus the alternative that it is strictly larger than zero. The p-value is computed "at mu = 0", since this gives the *largest* possible probability under the null hypothesis. The ASA definition is inadequate: it is correct for simple null hypotheses, but not in general. We need an authoritative published general-case definition, to show people that they cannot rely narrowly on what the ASA said. Richard Gill (talk) 13:15, 24 August 2020 (UTC)
Notice that an important characteristic of the p-value is that, in order to have a hypothesis test of level alpha, one rejects if and only if the p-value is less than or equal to alpha. An alternative definition of the p-value is: the smallest significance level at which one could still (just) reject the null hypothesis. The level of significance is defined as the *largest* probability under the null hypothesis of (incorrectly) rejecting the null. Hence the p-value is the largest probability under the null of exceeding (or equalling) the observed value of the test statistic. At least, for a one-sided test based on *large* values of some statistic.
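A numerical sketch of the point made in these comments, assuming the one-sided normal-mean test described above; the sample size n = 25, unit variance, and the observed statistic t = 1.8 are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

# Composite null H0: mu <= 0, statistic T = sqrt(n) * (sample mean), sigma = 1.
# For each mu in the null, P_mu(T >= t) = 1 - Phi(t - sqrt(n)*mu); this is
# increasing in mu, so its supremum over mu <= 0 is attained at mu = 0.
n, t_obs = 25, 1.8
for mu in np.linspace(-0.5, 0.0, 6):   # a grid of null values of mu
    print(f"mu = {mu:5.2f}   P_mu(T >= t) = {norm.sf(t_obs - np.sqrt(n) * mu):.4f}")
print("p-value = largest of these =", norm.sf(t_obs))   # attained at mu = 0
```

Running this shows the rejection probability under the null growing monotonically toward its maximum at mu = 0, which is why the p-value is computed there.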
I know that "truth" is not an allowed criterion for inclusion on Wikipedia. One must have reliable sources, and notability. But I do think it is important at least not to tell lies on Wikipedia if it can be avoided, even if many authorities do it all the time. Perhaps those are "lies for children". But children will hopefully grow up and they must be able to find out that they were told a simplified version of the truth. (That's just my personal opinion). Richard Gill (talk) 13:47, 24 August 2020 (UTC)
Here is a good literature reference: Larry Wasserman, "All of Statistics: A Concise Course in Statistical Inference", Springer (2004), Chapter 10. Richard Gill (talk) 15:15, 24 August 2020 (UTC)
And another: E. L. Lehmann and Joseph P. Romano, "Testing Statistical Hypotheses" (third edition), Springer (2005), Section 3.3. Richard Gill (talk) 16:08, 24 August 2020 (UTC)
- The editor's point is well taken that in the case of one-sided tests, one could argue that a nonzero effect in the uninteresting direction constitutes a "null hypothesis" scenario, and the probability distribution of the test statistic depends on the exact size of that uninteresting effect. But that appears to be an equivocal definition of the null hypothesis. The p-value is computed based on the point null hypothesis distribution, and there is only one of those. If that issue needs clarification, that clarification should be sourced, clearly explained, and placed in a note somewhere about the special case of one-sided tests--not in the lede, not in the main definition of the p-value, and not using unsourced, unexplained, potentially confusing phrases such as "best probability." Understanding what a p-value is presents enough of a challenge without adding unnecessary complications!
- I have examined the two sources that Richard Gill claims support the "best/largest probability" phrasing. On the contrary, they both support the standard phrasing. Indeed, Lehmann & Romano (2005, p. 64) define the p-value as P0(X > x), where P0 is the probability under the null hypothesis, X is a random-variable test statistic, and x is the observed value of X. And Wasserman (2004) defines the p-value as: "the probability (under H0) of observing a value of the test statistic the same as or more extreme than what was actually observed" (p. 158). The phrases "best probability" and "largest probability" do not appear.
- Thus, I see three compelling reasons that it's more appropriate to simply say "probability," rather than "best probability" or "largest probability":
(1) The latter phrasings appear to be nonstandard, as authoritative sources (including the ASA statement) appear to consistently use the former. (2) The phrases "best probability" and "largest probability" are unclear and potentially confusing. (3) The arguments by the one editor pushing for the nonstandard phrasing are directly contradicted by that editor's own provided sources. 23.242.198.189 (talk) 09:45, 15 December 2020 (UTC)
- Sorry, anonymous editor, you are referring to the places in the text where the authors define the p-value for the case of a simple null hypothesis. If the null hypothesis is composite, P0(X > x) is not even defined! Please register as a user so we can discuss this further. Richard Gill (talk) 16:10, 17 February 2021 (UTC)
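For readers following this exchange, the two definitions at issue can be written side by side; a sketch in the notation of the sources cited above, for a test that rejects for large values of the statistic T. For a simple null hypothesis, the formula both books state applies directly:
\[ H_0 : \theta = \theta_0, \qquad p(t) = P_{\theta_0}(T \ge t). \]
For a composite null hypothesis there is no single null distribution \(P_0\), and the general definition Gill describes is
\[ H_0 : \theta \in \Theta_0, \qquad p(t) = \sup_{\theta \in \Theta_0} P_{\theta}(T \ge t), \]
which reduces to the first formula when \(\Theta_0 = \{\theta_0\}\).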