A new quantile estimator with weights based on a subsampling approach

Firat Ozdemir

2020, British Journal of Mathematical and Statistical Psychology

Quantiles are widely used in both theoretical and applied statistics, and it is important to be able to deploy appropriate quantile estimators. To improve performance in the lower and upper quantiles, especially with small sample sizes, a new quantile estimator is introduced which is a weighted average of all order statistics. The new estimator, denoted NO, has desirable asymptotic properties. Moreover, in most experimental settings it is more efficient than four competing estimators: the Harrell–Davis quantile estimator, the default quantile estimator of the R programming language, the Sfakianakis–Verginis SV2 quantile estimator and a kernel quantile estimator. The NO quantile estimator is also used to compare two independent groups with a percentile bootstrap method and, as expected, it is more successful than other estimators in controlling Type I error rates.

© 2020 The British Psychological Society

Gözde Navruz* and A. Fırat Özdemir
Department of Statistics, Faculty of Sciences, Dokuz Eylül University, İzmir, Turkey

1. Introduction

The estimation of a population quantile is frequently of interest in fields such as psychology, medicine, environment, biology and marketing. Let X be a continuous random variable with cumulative distribution function F(x). The qth quantile of a population is denoted by $Q(q)$ and is defined in terms of the functional inverse of the cumulative distribution function evaluated at q, namely $Q(q) = F^{-1}(q) = \inf\{x : F(x) \ge q\}$ for a predetermined $0 < q < 1$. This statement can also be expressed as $P(X \le Q(q)) = q$, which means that 100q% of the observations are less than or equal to the population quantile Q(q). For instance, if X has a standard normal distribution, its .95 quantile is 1.645; that is, $Q(0.95) = 1.645$ and $P(X \le 1.645) = 0.95$.
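As a small numerical check of the definition above, the following sketch evaluates $Q(q) = F^{-1}(q)$ for the standard normal distribution with Python's standard library. The helper name `population_quantile` is hypothetical and is not part of the paper.

```python
from statistics import NormalDist


def population_quantile(q, dist=None):
    """Return Q(q) = F^{-1}(q); the default distribution is N(0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must satisfy 0 < q < 1")
    if dist is None:
        dist = NormalDist()          # standard normal, as in the example in the text
    return dist.inv_cdf(q)


q95 = population_quantile(0.95)      # close to 1.645
coverage = NormalDist().cdf(q95)     # P(X <= Q(0.95)), close to 0.95
```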
*Correspondence should be addressed to Gözde Navruz, Department of Statistics, Faculty of Sciences, Dokuz Eylül University, Tınaztepe Campus, 35390 Buca, İzmir, Turkey (email: [email protected]). DOI:10.1111/bmsp.12198

There are three main approaches for obtaining a quantile estimator: using a single order statistic; taking the weighted average of two order statistics; and taking the weighted average of all order statistics. Although the second approach has a slight advantage over the first, neither is desirable, because using one or two order statistics can lead to larger variance and consequently lower efficiency. Using a weighted average of all order statistics improves efficiency. The literature offers a great number of ways to obtain weights for the order statistics, each leading to a different quantile estimator. Harrell and Davis (1982) derived a distribution-free quantile estimator which uses all order statistics to estimate a population quantile. They also calculated a jackknife variance estimator of the proposed estimator and stated that their estimator is more efficient than the traditional ones based on one or two order statistics. For instance, $T_q = (1-g)X_{(i)} + gX_{(i+1)}$ is one of the traditional quantile estimators that uses a weighted average of two consecutive order statistics, where $(n+1)q = i + g$, i is the integer part of (n + 1)q, and n is used here and in what follows for the sample size. The Harrell–Davis (HD) quantile estimator is generally 1.1 times as efficient as $T_q$; however, using the HD quantile estimator for small sample sizes and extreme quantiles is not recommended. In the same year, Kaigh and Lachenbruch (1982) derived an alternative to the conventional quantile estimator which is obtained by averaging an appropriate subsample quantile over all subsamples of a fixed size.
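The traditional estimator $T_q$ described above can be sketched as follows. The function name `traditional_quantile` is hypothetical, and the handling of positions that fall outside the sample (returning the smallest or largest observation) is an assumption, since the text does not say what to do when $X_{(0)}$ or $X_{(n+1)}$ would be needed.

```python
import math


def traditional_quantile(sample, q):
    """T_q = (1 - g) X_(i) + g X_(i+1), where (n + 1) q = i + g."""
    xs = sorted(sample)
    n = len(xs)
    pos = (n + 1) * q
    i = math.floor(pos)              # integer part of (n + 1) q
    g = pos - i                      # fractional part
    if i < 1:                        # assumption: fall back to the sample minimum
        return xs[0]
    if i >= n:                       # assumption: fall back to the sample maximum
        return xs[-1]
    # order statistics are 1-based in the text, lists are 0-based here
    return (1 - g) * xs[i - 1] + g * xs[i]


t_median = traditional_quantile([1, 2, 3, 4, 5, 6, 7, 8, 9], 0.5)
t_quartile = traditional_quantile(list(range(1, 11)), 0.25)
```

Only two neighbouring order statistics receive non-zero weight here, which is the source of the larger variance mentioned in the text.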
The Kaigh–Lachenbruch estimator uses all order statistics, and the results of their study show that it has smaller mean squared error (MSE) than the conventional quantile estimator using one order statistic. Hyndman and Fan (1996) compared the quantile estimator definitions in common statistical computer packages by writing them in the same manner, analysing their motivation and investigating some of their properties. Their study is restricted to quantile estimators based on one or two order statistics. They draw attention to the difficulty of comparing definitions, since there are many equivalent ways of defining quantiles. Therefore, they suggested using the median-unbiased estimator, which they called definition 8, to avoid confusion between statistical packages. Note that the default quantile estimator of the R programming language (henceforth referred to as the R quantile estimator) is given in definition 7 of their study. Yoshizawa, Sen, and Davis (1985) showed the asymptotic equivalence of the sample median and the Harrell–Davis median estimator, which allows the asymptotic properties of the sample median to be used for the Harrell–Davis median estimator.

Quantile estimators that take a weighted average of all the order statistics are distinguished by their weight functions. Basically, the weights can be based on a subsampling or a kernel approach; the Harrell–Davis and Kaigh–Lachenbruch quantile estimators are important examples of the former. Kaigh and Cheng (1991) also followed the subsampling approach for the weight functions of the quantile estimators. They derived a new statistic which is closely associated with the Harrell–Davis and Kaigh–Lachenbruch estimators. Their estimator was found to be successful in efficiency comparisons. Sheather and Marron (1990) emphasize the importance of choosing the weight function when obtaining a quantile estimator as a weighted average of all order statistics. Kernel quantile estimators are an important class of this type.
Let K be a density function symmetric about zero and let $K_h(\cdot) = h^{-1}K(\cdot/h)$, where the bandwidth $h \to 0$ as $n \to \infty$. One example of a kernel quantile estimator is obtained by writing

$$KQ_q = \sum_{i=1}^{n}\left[\int_{(i-1)/n}^{i/n} K_h(t-q)\,dt\right]X_{(i)}.$$

Sheather and Marron briefly explained the asymptotic properties of kernel quantile estimators and proposed a data-based bandwidth selection method. They also compared the performance of kernel quantile estimators with those of the Harrell–Davis and Kaigh–Lachenbruch estimators. Previous to this, Yang (1985) proposed a kernel quantile estimator and estimated its mean squared convergence rate. This kernel quantile estimator is given by

$$S_n(q) = n^{-1}\sum_{i=1}^{n}X_{(i)}K_h(i/n - q),$$

where $K_h(\cdot) = h^{-1}K(\cdot/h)$ and the bandwidth $h \to 0$ as $n \to \infty$. Clearly, this estimator weights the $X_{(i)}$ for which i/n is close to q more heavily than the $X_{(i)}$ for which i/n is far from q. Falk (1985) studied the asymptotic normality of kernel quantile estimators.

Another estimator is obtained by jackknifing the kernel quantile estimator (McCune & McCune, 1991). Suppose that $Q_n(q, K, h_1)$ and $Q_n(q, K, h_2)$ are two kernel quantile estimators which use the same kernel function K with different bandwidths $h_1$ and $h_2$. Then, for some constant $R \ne 1$, McCune and McCune (1991) proposed the generalized jackknife estimator

$$G(Q_n(q, K, h_1), Q_n(q, K, h_2)) = \frac{Q_n(q, K, h_1) - R\,Q_n(q, K, h_2)}{1 - R}.$$

It is also possible to obtain a quantile estimator by using Bernstein polynomials, as in Cheng (1995) and Pepelyshev, Rafajłowicz, and Steland (2014). The Bernstein polynomial quantile estimator for the qth quantile is given by

$$\hat{Q}_n^B(q) = \sum_{i=1}^{n}\binom{n-1}{i-1}q^{i-1}(1-q)^{n-i}X_{(i)},$$

and Cheng (1995) investigated the asymptotic properties of $\hat{Q}_n^B$.
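The weighting schemes above can be illustrated with a short sketch of Yang's kernel estimator $S_n(q)$, the generalized jackknife combination and the Bernstein polynomial estimator. The function names are hypothetical, and the Gaussian kernel is an assumption made only for illustration: the text requires just a density symmetric about zero.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an assumed choice of K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def yang_kernel_quantile(sample, q, h):
    """S_n(q) = n^{-1} * sum_i X_(i) * K_h(i/n - q), with K_h(u) = K(u/h) / h."""
    xs = sorted(sample)
    n = len(xs)
    return sum(x * gaussian_kernel((i / n - q) / h) / h
               for i, x in enumerate(xs, start=1)) / n


def generalized_jackknife(est_h1, est_h2, r):
    """G = (Q_n(q, K, h1) - R * Q_n(q, K, h2)) / (1 - R), for R != 1."""
    if r == 1:
        raise ValueError("R must differ from 1")
    return (est_h1 - r * est_h2) / (1.0 - r)


def bernstein_quantile(sample, q):
    """Sum over i of C(n-1, i-1) q^(i-1) (1-q)^(n-i) X_(i)."""
    xs = sorted(sample)
    n = len(xs)
    return sum(math.comb(n - 1, i - 1) * q ** (i - 1) * (1 - q) ** (n - i) * x
               for i, x in enumerate(xs, start=1))


data = [2.1, 0.4, 1.7, 3.3, 0.9, 2.8, 1.2, 2.4]
s_narrow = yang_kernel_quantile(data, 0.5, h=0.10)
s_wide = yang_kernel_quantile(data, 0.5, h=0.20)
s_jack = generalized_jackknife(s_narrow, s_wide, r=0.5)   # R = 0.5 is arbitrary
b_median = bernstein_quantile(data, 0.5)
```

The Bernstein weights are binomial probabilities and sum to one exactly, whereas the weights in $S_n(q)$ sum to one only approximately for finite n.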
Moreover, let $Q_n(x)$ be the empirical quantile function; by applying the Bernstein–Durrmeyer smoothing operator to the sample quantile function, the Bernstein–Durrmeyer estimator of the quantile function is obtained as

$$D_N(Q_n(x)) = (N+1)\sum_{k=0}^{N}\tilde{a}_k B_k^N(x), \quad x \in [0, 1],$$

where the $\tilde{a}_k$ correspond to weights, simplified by $\hat{a}_k = n^{-1}\sum_{i=1}^{n}X_{(i)}B_k^N\left(\frac{i-1}{n-1}\right)$, and the $B_k^N(x)$ are Bernstein polynomials of degree N. Pepelyshev et al. (2014) studied this type of quantile estimator in terms of MSE, convergence rate and asymptotic distribution.

Parrish (1990) compared the performances of ten nonparametric quantile estimators when sampling randomly from a normal distribution. This work involves the Harrell–Davis estimator, the Kaigh–Lachenbruch estimator and some other conventional quantile estimators based on one or two order statistics. Dielman, Lowry, and Pfaffenberger (1994) broadened Parrish's work to include non-normal distributions.

One of the most recent quantile estimators was derived by Sfakianakis and Verginis (2008). They used weighted averages of all order statistics and aimed to obtain efficient estimators with small samples. Although the HD estimator was almost the best estimator according to the literature, the Sfakianakis and Verginis estimators (SV1, SV2 and SV3) outperformed the HD estimator and some other conventional estimators in some experimental settings.

As exemplified, there are a great number of studies on quantile estimators, some of which define an approach for obtaining a quantile estimator, while others compare the performance of the existing quantile estimators. Although the HD quantile estimator performs well according to the literature, there is a need for a new quantile estimator, especially for extreme quantiles with small sample sizes. When the purpose is to estimate a population quantile, a reasonable approach is to use a weighted average of all order statistics.
For this reason, specifying the weights becomes an important detail. For the purpose of improving the performance of the available quantile estimators, especially when extreme quantiles are considered with small sample sizes, a new quantile estimator is proposed. Note that the performance criteria that are expected to be improved are efficiency as well as control of the actual Type I error rate when the quantile estimator is used in some hypothesis-testing procedure. Clearly, the new proposed quantile estimator is a weighted average of all order statistics. We now give a description, discuss its asymptotic properties and investigate its performance.

2. The NO quantile estimator

2.1. Description of the new NO quantile estimator

We begin by setting $S_0 = (L, X_{(1)})$, $S_1 = [X_{(1)}, X_{(2)})$, ..., $S_{n-1} = [X_{(n-1)}, X_{(n)})$, $S_n = [X_{(n)}, U)$, where $X_{(1)}, \ldots, X_{(n)}$ are the order statistics of the sample, and U and L are upper and lower bounds for X (which may be $+\infty$ and $-\infty$, respectively). The qth quantile of the population, $Q_q$, lies in one of these intervals. Define the random variables

$$d_i = \begin{cases} 1, & X_i \le Q_q, \\ 0, & X_i > Q_q, \end{cases}$$

where $d_i \sim \text{Bernoulli}(q)$. Then their sum $N = d_1 + d_2 + \ldots + d_n$ has a binomial distribution with probability of success q, provided that the $d_i$ are independent. So $P(Q_q \in S_i) = P(N = i) = B(i, n, q)$, i = 0, 1, ..., n.

Consider the point estimator of $Q_q$, namely $Q'_{q,i}$, conditioned on the event $Q_q \in S_i$, i = 0, 1, ..., n. When $Q'_{q,i} = qX_{(i)} + (1-q)X_{(i+1)}$ is assumed, an estimator of $Q_q$ can be obtained by evaluating the expected value $E\left(\sum_{i=0}^{n}Q'_{q,i}\right)$. Let $Q'_{q,i} = qX_{(i)} + (1-q)X_{(i+1)}$ and assume the proxies

$$Q'_{q,0} - Q'_{q,1} = Q'_{q,1} - Q'_{q,2}, \tag{1}$$

$$Q'_{q,n} - Q'_{q,n-1} = Q'_{q,n-1} - Q'_{q,n-2}. \tag{2}$$

Then $E\left(\sum_{i=0}^{n}Q'_{q,i}\right)$ is evaluated as follows:

$$\begin{aligned}
E\left(\sum_{i=0}^{n}Q'_{q,i}\right) &= E\left(\sum_{i=0}^{n}\left(qX_{(i)} + (1-q)X_{(i+1)}\right)\right) = \sum_{i=0}^{n}E\left(qX_{(i)} + (1-q)X_{(i+1)}\right) \\
&= \sum_{i=0}^{n}B(i, n, q)\left(qX_{(i)} + (1-q)X_{(i+1)}\right) \\
&= \underbrace{B(0, n, q)\left(qX_{(0)} + (1-q)X_{(1)}\right)}_{(*)} + \underbrace{\sum_{i=1}^{n-1}B(i, n, q)\left(qX_{(i)} + (1-q)X_{(i+1)}\right)}_{(***)} \\
&\quad + \underbrace{B(n, n, q)\left(qX_{(n)} + (1-q)X_{(n+1)}\right)}_{(**)}.
\end{aligned} \tag{3}$$

Using the proxy in equation (1),

$$Q'_{q,0} = 2Q'_{q,1} - Q'_{q,2}, \tag{4}$$

$$\begin{aligned}
qX_{(0)} + (1-q)X_{(1)} &= 2\left(qX_{(1)} + (1-q)X_{(2)}\right) - \left(qX_{(2)} + (1-q)X_{(3)}\right) \\
&= 2qX_{(1)} + 2(1-q)X_{(2)} - qX_{(2)} - (1-q)X_{(3)} \\
&= 2qX_{(1)} + (2-3q)X_{(2)} - (1-q)X_{(3)};
\end{aligned} \tag{5}$$

the expression in (3) denoted by (*) is obtained as

$$B(0, n, q)\left(qX_{(0)} + (1-q)X_{(1)}\right) = B(0, n, q)\left(2qX_{(1)} + (2-3q)X_{(2)} - (1-q)X_{(3)}\right). \tag{6}$$

Using the proxy in equation (2),

$$Q'_{q,n} = 2Q'_{q,n-1} - Q'_{q,n-2}, \tag{7}$$

$$\begin{aligned}
qX_{(n)} + (1-q)X_{(n+1)} &= 2\left(qX_{(n-1)} + (1-q)X_{(n)}\right) - \left(qX_{(n-2)} + (1-q)X_{(n-1)}\right) \\
&= 2qX_{(n-1)} + 2(1-q)X_{(n)} - qX_{(n-2)} - (1-q)X_{(n-1)} \\
&= 2(1-q)X_{(n)} + (3q-1)X_{(n-1)} - qX_{(n-2)};
\end{aligned} \tag{8}$$

the expression in (3) denoted by (**) is obtained as

$$B(n, n, q)\left(qX_{(n)} + (1-q)X_{(n+1)}\right) = B(n, n, q)\left((2-2q)X_{(n)} + (3q-1)X_{(n-1)} - qX_{(n-2)}\right). \tag{9}$$

Finally, the expression in (3) denoted by (***) is evaluated as

$$\begin{aligned}
\sum_{i=1}^{n-1}B(i, n, q)\left(qX_{(i)} + (1-q)X_{(i+1)}\right) &= B(1, n, q)\left(qX_{(1)} + (1-q)X_{(2)}\right) + B(2, n, q)\left(qX_{(2)} + (1-q)X_{(3)}\right) + \ldots \\
&\quad + B(n-2, n, q)\left(qX_{(n-2)} + (1-q)X_{(n-1)}\right) + B(n-1, n, q)\left(qX_{(n-1)} + (1-q)X_{(n)}\right) \\
&= B(1, n, q)qX_{(1)} + \sum_{i=1}^{n-2}\left(B(i, n, q)(1-q) + B(i+1, n, q)q\right)X_{(i+1)} \\
&\quad + B(n-1, n, q)(1-q)X_{(n)}.
\end{aligned} \tag{10}$$

Combining equations (6), (9) and (10) yields

$$\begin{aligned}
&B(0, n, q)2qX_{(1)} + B(0, n, q)(2-3q)X_{(2)} - B(0, n, q)(1-q)X_{(3)} + B(1, n, q)qX_{(1)} \\
&\quad + \sum_{i=1}^{n-2}\left(B(i, n, q)(1-q) + B(i+1, n, q)q\right)X_{(i+1)} + B(n-1, n, q)(1-q)X_{(n)} \\
&\quad + B(n, n, q)(2-2q)X_{(n)} + B(n, n, q)(3q-1)X_{(n-1)} - B(n, n, q)qX_{(n-2)}.
\end{aligned} \tag{11}$$

The new quantile estimator $NO_q$ is obtained as

$$\begin{aligned}
NO_q &= \left(B(0, n, q)2q + B(1, n, q)q\right)X_{(1)} + B(0, n, q)(2-3q)X_{(2)} - B(0, n, q)(1-q)X_{(3)} \\
&\quad + \sum_{i=1}^{n-2}\left(B(i, n, q)(1-q) + B(i+1, n, q)q\right)X_{(i+1)} - B(n, n, q)qX_{(n-2)} \\
&\quad + B(n, n, q)(3q-1)X_{(n-1)} + \left(B(n-1, n, q)(1-q) + B(n, n, q)(2-2q)\right)X_{(n)},
\end{aligned} \tag{12}$$

where q is the considered quantile with 0 < q < 1 and the $B(i, n, q)$ are the binomial probabilities for i = 0, 1, ..., n. Note that, in the computation of $B(i, n, q)$, q is used for the probability of success instead of the classical notation p.

With the objective of investigating the asymptotic properties of the new estimator, the asymptotic distribution of $X_{(i)}$ and of functions of order statistics was first studied, where i = [nq] + 1 and [nq] is the integer part of nq for 0 < q < 1 (Arnold, Balakrishnan, & Nagaraja, 1992; Stigler, 1974). Since the new estimator is also a linear function of the order statistics with weights built from binomial probabilities, it can be stated that the new estimator is asymptotically normally distributed.

3. Design of the simulation study

3.1. Asymptotic properties

A simulation study is conducted to examine the asymptotic properties of the NO quantile estimator in practice. The aim is to verify whether estimated quantile values converge to population quantiles and whether the difference between estimated and population quantiles converges to zero as the sample size increases. Two other criteria for the NO quantile estimator are asymptotic variance and asymptotic MSE. It is expected that the variance and MSE of the estimator decrease as n increases.

Four different g-and-h distributions with different g and h parameters are used, namely, standard normal (g = h = 0), asymmetric and heavy-tailed (g = h = 0.5), asymmetric and light-tailed (g = 0.5, h = 0), and symmetric and heavy-tailed (g = 0, h = 0.5). The g-and-h distribution allows us to understand how a distribution differs from the normal; the parameters g and h are concerned with skewness and kurtosis, respectively.
Let Z be a random variable that has a standard normal distribution. Data are generated from the g-and-h distribution (Hoaglin, 1985) by using the transformation $X = [(\exp(gZ) - 1)\exp(hZ^2/2)]/g$ when $g \ne 0$ and $X = Z\exp(hZ^2/2)$ when g = 0. Note that the symbol h is used not only for the kurtosis parameter of the g-and-h distribution but also for the bandwidth in kernel quantile estimators. The nominal significance level is set at $\alpha = .05$. Lower, middle and upper quantiles are all considered: q = .05, .1, .5, .9, .95. Also, sample sizes are varied from 10 to 10,000 in order to show the effect of increasing the sample size. All simulations are based on 10,000 replications and were carried out using the R programming language, version 3.5.1.

3.2. Relative efficiencies

In the second part of the simulation study, the relative efficiencies of the NO quantile estimator over the HD estimator, the R quantile estimator, the SV2 quantile estimator and a kernel quantile estimator are calculated for different experimental settings. Their computation is briefly explained here.

For any random sample of size n, let $X_{(1)} \le X_{(2)} \le \ldots \le X_{(n)}$ denote the order statistics. Let $Y \sim \text{Beta}(a, b)$, with a = (n + 1)q and b = (n + 1)(1 – q). The probability density function of Y is

$$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,y^{a-1}(1-y)^{b-1}, \quad 0 \le y \le 1, \tag{13}$$

where $\Gamma$ is the gamma function. Given that $W_i = P[(i-1)/n \le Y \le i/n]$, the Harrell–Davis estimate of the qth quantile is obtained as

$$HD_q = \sum_{i=1}^{n}W_iX_{(i)}.$$

Let c = nq + m – i, m = 1 – q and i = [m + nq], where i is the integer part of m + nq. The R quantile estimator is obtained as

$$R_q = (1-c)X_{(i)} + cX_{(i+1)}.$$

The SV2 quantile estimator is the second estimator of Sfakianakis and Verginis (2008). Although Sfakianakis and Verginis derived three quantile estimators (SV1, SV2, SV3), here only SV2 is considered because Navruz and Özdemir (2018) stated that it performed best in most of the cases they considered.
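The NO estimator of equation (12) and the R quantile estimator defined above can be sketched in Python as follows. This is an illustrative reading of the formulas, not the authors' R implementation (which the paper says is available on request). The function names are hypothetical, and the requirement n ≥ 3 is an assumption that follows from equation (12) referring to $X_{(3)}$ and $X_{(n-2)}$.

```python
import math


def binom_pmf(i, n, q):
    """Binomial probability B(i, n, q) with success probability q."""
    return math.comb(n, i) * q ** i * (1 - q) ** (n - i)


def no_weights(n, q):
    """Weights of X_(1), ..., X_(n) in equation (12)."""
    if n < 3:
        raise ValueError("equation (12) needs at least three observations")
    b = [binom_pmf(i, n, q) for i in range(n + 1)]
    w = [0.0] * (n + 1)                  # w[k] is the weight of X_(k); w[0] unused
    # term i = 0, rewritten with proxy (1), see equation (6)
    w[1] += b[0] * 2 * q
    w[2] += b[0] * (2 - 3 * q)
    w[3] -= b[0] * (1 - q)
    # terms i = 1, ..., n - 1, see equation (10)
    for i in range(1, n):
        w[i] += b[i] * q
        w[i + 1] += b[i] * (1 - q)
    # term i = n, rewritten with proxy (2), see equation (9)
    w[n] += b[n] * (2 - 2 * q)
    w[n - 1] += b[n] * (3 * q - 1)
    w[n - 2] -= b[n] * q
    return w[1:]


def no_quantile(sample, q):
    """Weighted average of all order statistics with the NO weights."""
    xs = sorted(sample)
    return sum(wi * x for wi, x in zip(no_weights(len(xs), q), xs))


def r_quantile(sample, q):
    """R_q = (1 - c) X_(i) + c X_(i+1), with m = 1 - q and i = [m + nq]."""
    xs = sorted(sample)
    n = len(xs)
    m = 1 - q
    i = math.floor(n * q + m)
    c = n * q + m - i
    if i >= n:                           # guard against rounding at the upper end
        return xs[-1]
    return (1 - c) * xs[i - 1] + c * xs[i]


sample = [4.2, 1.1, 3.7, 2.9, 5.6, 0.8, 2.2, 3.1, 4.9, 1.9]
no_upper = no_quantile(sample, 0.9)
r_upper = r_quantile(sample, 0.9)
```

Accumulating the three groups of terms separately mirrors the derivation and keeps the small-n cases, where $X_{(3)}$ and $X_{(n-2)}$ coincide with other order statistics, consistent with equation (11). The weights sum to one, although some of them can be negative.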
Additionally, a kernel quantile estimator is considered which is based on replacing the distribution function F(x) with its kernel estimator and then using the quasi-inverse of F(x). It is referred to as the kernel estimator in this study and can be calculated using the npquantile function, which requires the R package np.

The HD quantile estimator is considered because it is cited in the literature as one of the best quantile estimators based on all order statistics. The SV2 quantile estimator, which again uses all order statistics, is considered because it is one of the most recent estimators and, according to the literature, outperforms the HD estimator (Navruz & Özdemir, 2018; Sfakianakis & Verginis, 2008). The default quantile estimator of the R programming language is used because it is a successful representative of quantile estimators based on two order statistics. Similarly, the kernel quantile estimator is used since it is an important example of the quantile estimators based on kernel approximation. Relative efficiencies are as follows:

$$\text{relative efficiency over R} = \frac{MSE(R)}{MSE(NO)}, \tag{14}$$

$$\text{relative efficiency over HD} = \frac{MSE(HD)}{MSE(NO)}, \tag{15}$$

$$\text{relative efficiency over kernel} = \frac{MSE(\text{kernel})}{MSE(NO)}, \tag{16}$$

$$\text{relative efficiency over SV2} = \frac{MSE(SV2)}{MSE(NO)}. \tag{17}$$

Sample sizes are chosen as n = 20 and n = 40. The quantiles q = .05, .1, .5, .9, .95 are considered with the distributions given in Section 3.1.

4. Results

4.1. Asymptotic properties

The estimated quantile values are given in Tables 1 and 2, for the distributions g = 0, h = 0 and g = 0.5, h = 0.5, respectively. As the sample size increases, the estimated quantile values converge to the population values, which illustrates the asymptotic consistency of the NO quantile estimator. The results for the distributions g = 0.5, h = 0 and g = 0, h = 0.5 are not included since they are similar to those in Tables 1 and 2. In Tables 3 and 4, the mean squared error, variance and bias values are given.
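The simulation quantities used in this part of the study (MSE, variance, bias and the relative efficiency ratios of equations (14) to (17)) can be sketched with a small Monte Carlo routine. This is not the authors' code. The two estimators compared here, the R quantile estimator and the traditional $T_q$, are stand-ins chosen to keep the block short, and all function names are hypothetical. The true quantile is computed by transforming the normal quantile, which assumes h ≥ 0 so that the g-and-h transformation is increasing.

```python
import math
import random
import statistics
from statistics import NormalDist


def g_and_h(z, g, h):
    """Transform a standard normal value z into a g-and-h value."""
    if g == 0:
        return z * math.exp(h * z * z / 2)
    return (math.exp(g * z) - 1) * math.exp(h * z * z / 2) / g


def type7(sample, q):
    """R quantile estimator; (n - 1) q is the zero-based position nq + m - 1."""
    xs = sorted(sample)
    pos = (len(xs) - 1) * q
    i = math.floor(pos)
    c = pos - i
    return xs[i] if c == 0 else (1 - c) * xs[i] + c * xs[i + 1]


def tq(sample, q):
    """Traditional T_q; positions outside the sample are clamped (assumption)."""
    xs = sorted(sample)
    n = len(xs)
    pos = (n + 1) * q
    i = min(max(math.floor(pos), 1), n - 1)
    g = min(max(pos - i, 0.0), 1.0)
    return (1 - g) * xs[i - 1] + g * xs[i]


def simulate(estimators, q, n, g, h, reps, seed):
    """Return {name: (mse, variance, bias)}, using the same samples for all estimators."""
    rng = random.Random(seed)
    true_q = g_and_h(NormalDist().inv_cdf(q), g, h)   # valid for h >= 0
    estimates = {name: [] for name in estimators}
    for _ in range(reps):
        sample = [g_and_h(rng.gauss(0.0, 1.0), g, h) for _ in range(n)]
        for name, est in estimators.items():
            estimates[name].append(est(sample, q))
    out = {}
    for name, values in estimates.items():
        mean_est = statistics.fmean(values)
        bias = mean_est - true_q
        variance = statistics.pvariance(values, mu=mean_est)
        mse = statistics.fmean([(v - true_q) ** 2 for v in values])
        out[name] = (mse, variance, bias)
    return out


def relative_efficiency(mse_reference, mse_new):
    """Ratio as in equations (14)-(17): values above 1 favour the new estimator."""
    return mse_reference / mse_new


res = simulate({"R": type7, "T": tq}, q=0.9, n=20, g=0.5, h=0.0, reps=500, seed=1)
ratio = relative_efficiency(res["T"][0], res["R"][0])
```

With the variance computed using divisor `reps`, each MSE equals the variance plus the squared bias, which gives a quick consistency check on tabulated values.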
The MSE and variance values decrease as the sample size increases. In addition, the bias values converge to zero. Note that the results for the distributions g = 0.5, h = 0 and g = 0, h = 0.5 are omitted since they are similar to those in Tables 3 and 4.

4.2. Relative efficiency

The relative efficiency results are given in Tables 5 and 6. Ratios greater than 1 are marked as bold. The NO quantile estimator is more efficient than the HD quantile estimator, the R quantile estimator, the kernel quantile estimator and the SV2 quantile estimator in 27, 23, 28 and 32 cases out of 40, respectively. In addition, there are six results that are smaller than, but very close to, 1.

5. An application: comparing two independent groups through quantiles

One of the main purposes in applied statistics is to determine whether there is a significant difference between two populations. The classic and most common way of doing this is by comparing two independent groups with a method based on just one reference point such as the arithmetic mean, as in Student's t test. But Student's t test has some restrictive assumptions, and it is clear that comparing groups using more than one reference point gives the researcher a deeper insight into the case under consideration. Moreover, it enables an understanding of how different subpopulations of groups compare. For instance, consider a study that aims to assess the effectiveness of an intervention in terms of reducing depressive symptoms in older adults. In such a study, the lower and upper tails of the populations may respond differently to the experimental method. That is, the intervention may be more effective for participants with higher levels of depression, corresponding to the upper quantiles of the groups. Conversely, it may be less effective or harmful for participants with lower levels of depression, corresponding to the lower quantiles of the groups.
In such situations, comparing the control and experimental groups through quantiles is more informative (Wilcox, Erceg-Hurn, Clark, & Carlson, 2014).

Table 1. Estimated quantile values, g = 0 and h = 0

| q | n = 10 | n = 100 | n = 1,000 | n = 10,000 | Population quantile |
|---|---|---|---|---|---|
| .05 | -1.211244 | -1.633582 | -1.643446 | -1.64448 | -1.644854 |
| .1 | -1.061011 | -1.274546 | -1.28098 | -1.281652 | -1.281552 |
| .5 | 0.001266 | 0.000811 | 0.000710 | 0.000057 | 0 |
| .9 | 1.062493 | 1.275200 | 1.280973 | 1.281580 | 1.281552 |
| .95 | 1.209876 | 1.630211 | 1.643179 | 1.644827 | 1.644854 |

Table 2. Estimated quantile values, g = 0.5 and h = 0.5

| q | n = 10 | n = 100 | n = 1,000 | n = 10,000 | Population quantile |
|---|---|---|---|---|---|
| .05 | -1.512341 | -2.369799 | -2.216067 | -2.206738 | -2.204211 |
| .1 | -1.348625 | -1.478920 | -1.432176 | -1.427140 | -1.426551 |
| .5 | 0.151886 | 0.007774 | 0.000345 | 0.000098 | 0.000238 |
| .9 | 3.047103 | 2.923225 | 2.726090 | 2.708180 | 2.706450 |
| .95 | 3.405389 | 5.838127 | 5.082168 | 5.023742 | 5.019086 |

Table 3. MSE, variance and bias values, g = 0 and h = 0

| Measure | q | n = 10 | n = 100 | n = 1,000 | n = 10,000 |
|---|---|---|---|---|---|
| MSE | .05 | 0.48649 | 0.03570 | 0.00411 | 0.00044 |
| MSE | .1 | 0.26576 | 0.02444 | 0.00272 | 0.00028 |
| MSE | .5 | 0.11494 | 0.01424 | 0.00151 | 0.00015 |
| MSE | .9 | 0.26082 | 0.02432 | 0.00272 | 0.00029 |
| MSE | .95 | 0.48967 | 0.03513 | 0.00423 | 0.00044 |
| Variance | .05 | 0.29847 | 0.03557 | 0.00411 | 0.00044 |
| Variance | .1 | 0.21712 | 0.02439 | 0.00272 | 0.00028 |
| Variance | .5 | 0.11494 | 0.01424 | 0.00151 | 0.00015 |
| Variance | .9 | 0.21283 | 0.02428 | 0.00272 | 0.00029 |
| Variance | .95 | 0.30046 | 0.03492 | 0.00422 | 0.00044 |
| Bias | .05 | 0.43361 | 0.01127 | 0.00141 | 0.00037 |
| Bias | .1 | 0.22054 | 0.00701 | 0.00057 | 0.00010 |
| Bias | .5 | 0.00127 | 0.00081 | 0.00071 | 0.00006 |
| Bias | .9 | 0.21906 | 0.00635 | 0.00058 | 0.00003 |
| Bias | .95 | 0.43498 | 0.01464 | 0.00167 | 0.00003 |

Wilcox et al. (2014) compared two independent groups through the lower and upper quantiles by using a percentile bootstrap based method in conjunction with the HD quantile estimator. According to the results of that study, the actual Type I error rates were controlled in many cases.
However, when the extreme quantiles were considered with small sample sizes, the actual Type I error rates exceeded the nominal level. For example, when comparing two independent groups through q = .9 with n = 20, the actual Type I error rate may exceed .1 at the nominal significance level $\alpha = .05$. On the other hand, when comparing q = .95, the minimum sample size for controlling the actual Type I error rate is n = 50.

Navruz and Özdemir (2018) compared the performance of the HD, Sfakianakis–Verginis and R quantile estimators in terms of controlling the actual Type I error rate when these estimators are used to compare two independent groups based on a percentile bootstrap. The Sfakianakis–Verginis quantile estimators had actual Type I error rates closer to the nominal level than the HD and R quantile estimators. However, none of these estimators could control the actual Type I error rates with small sample sizes when extreme quantiles were of interest. For example, when the sample size was n = 10 and q = .95, the actual Type I error rates were .141, .1175, .1037, .125 and .1172 for the quantile estimators HD, SV1, SV2, SV3 and R, respectively.

Table 4. MSE, variance and bias values, g = 0.5 and h = 0.5

| Measure | q | n = 10 | n = 100 | n = 1,000 | n = 10,000 |
|---|---|---|---|---|---|
| MSE | .05 | 2.00253 | 0.40213 | 0.03081 | 0.00319 |
| MSE | .1 | 1.10596 | 0.08665 | 0.00819 | 0.00084 |
| MSE | .5 | 0.29699 | 0.01519 | 0.00154 | 0.00015 |
| MSE | .9 | 18.74674 | 0.74941 | 0.05918 | 0.00596 |
| MSE | .95 | 24.49199 | 6.09278 | 0.32991 | 0.03235 |
| Variance | .05 | 1.52385 | 0.37471 | 0.03067 | 0.00318 |
| Variance | .1 | 1.09989 | 0.08665 | 0.00816 | 0.00084 |
| Variance | .5 | 0.27399 | 0.01513 | 0.00154 | 0.00015 |
| Variance | .9 | 18.6307 | 0.70242 | 0.05879 | 0.00595 |
| Variance | .95 | 21.88797 | 5.42195 | 0.32593 | 0.03232 |
| Bias | .05 | 0.69187 | 0.16559 | 0.01186 | 0.00253 |
| Bias | .1 | 0.07793 | 0.05237 | 0.00562 | 0.00059 |
| Bias | .5 | 0.15165 | 0.00754 | 0.00011 | 0.00014 |
| Bias | .9 | 0.34065 | 0.21677 | 0.01964 | 0.00173 |
| Bias | .95 | 1.61370 | 0.81904 | 0.06308 | 0.00466 |

The performance of the NO quantile estimator was investigated when comparing two independent groups through quantiles with a percentile bootstrap based method. The criterion was control of the actual Type I error rates at the $\alpha = .05$ nominal significance level. To interpret the results, Bradley's (1978) liberal criterion of robustness was considered, so that actual Type I error rates should fall within the interval (0.025, 0.075).

To explain the proposed approach in formal terms, the aim is to test $H_0: \theta_{q1} = \theta_{q2}$, where $\theta_{qj}$ is the qth quantile corresponding to the jth group (j = 1, 2). Let $X_{ij}$ be a random sample from the jth group, where j = 1, 2 and $i = 1, \ldots, n_j$. Then, a bootstrap sample from each group is generated by resampling with replacement, yielding $X^*_{ij}$. The estimate of the qth quantile for group j is obtained from this bootstrap sample and denoted by $\hat{\theta}^*_j$. The group difference $d^* = \hat{\theta}^*_1 - \hat{\theta}^*_2$ is defined, and this process is repeated B times in order to obtain $d^*_b$, b = 1, ..., B. These $d^*_b$ values are then put in ascending order, $d^*_{(1)} \le \ldots \le d^*_{(B)}$. Let $\ell = \alpha B/2$ be rounded to the nearest integer and $u = B - \ell$. An approximate confidence interval for $\theta_1 - \theta_2$ is obtained as $(d^*_{(\ell+1)}, d^*_{(u)})$. Finally, a generalized p-value can be evaluated.
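The resampling steps described above can be sketched as follows. The sketch also returns the generalized p-value, computed as $2\min(\hat{p}^*, 1 - \hat{p}^*)$ with $\hat{p}^* = (A + 0.5C)/B$, where A and C count the bootstrap differences below zero and equal to zero. The function names are hypothetical, and the R quantile estimator is used as a stand-in because any quantile estimator with the signature `estimator(sample, q)` can be plugged in. Python's `round` resolves exact halves to the nearest even integer, which may differ from other rounding conventions.

```python
import math
import random


def type7(sample, q):
    """Stand-in quantile estimator (R default); replace with any estimator(sample, q)."""
    xs = sorted(sample)
    pos = (len(xs) - 1) * q
    i = math.floor(pos)
    c = pos - i
    return xs[i] if c == 0 else (1 - c) * xs[i] + c * xs[i + 1]


def percentile_bootstrap(x1, x2, estimator, q, alpha=0.05, n_boot=2000, rng=None):
    """Percentile bootstrap comparison of the qth quantiles of two independent groups.

    Returns the approximate confidence interval (d*_(l+1), d*_(u)) for the
    difference of quantiles and the generalized p-value.
    """
    rng = rng or random.Random()
    diffs = []
    for _ in range(n_boot):
        b1 = rng.choices(x1, k=len(x1))      # resample group 1 with replacement
        b2 = rng.choices(x2, k=len(x2))      # resample group 2 with replacement
        diffs.append(estimator(b1, q) - estimator(b2, q))
    diffs.sort()
    ell = round(alpha * n_boot / 2)
    u = n_boot - ell
    ci = (diffs[ell], diffs[u - 1])          # 1-based d*_(l+1) and d*_(u)
    a = sum(d < 0 for d in diffs)            # number of negative differences
    c = sum(d == 0 for d in diffs)           # number of zero differences
    p_hat = (a + 0.5 * c) / n_boot
    p_value = 2 * min(p_hat, 1 - p_hat)
    return ci, p_value


group1 = [12, 15, 9, 21, 17, 14, 10, 19, 16, 13]
group2 = [14, 18, 11, 25, 20, 15, 13, 22, 19, 16]
ci, p_value = percentile_bootstrap(group1, group2, type7, q=0.9,
                                   n_boot=500, rng=random.Random(2020))
```

The null hypothesis is rejected when the p-value is below the nominal level, or equivalently when the interval excludes zero.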
Let A be the number of times that d \0 and let C be the number of times that d ¼ 0. Then pb ¼ ðA þ 0:5C Þ=B and the generalized p-value is 2 min pb ; 1 pb . Obviously, if the p-value is less than the nominal significance level a, then H0 is rejected (Wilcox, 2017). Note that sample sizes n = 10, 20, 40 were taken with the aim of investigating the effect of both small- and large-sample cases. Two independent groups were separately compared through the quantiles q = .05, .1, .5, .9, .95, in order to be able to detect differences occurring not only in the middle but also in the tails. Additionally, data were generated from g-and-h distributions. Actual Type I error rate results for g-and-h distributions are given in Tables 7 and 8. The number of bootstrap samples chosen was B = 2,000, and all simulations were done with 10,000 replications. Problems may occur some in hypothesis-testing situations when there are tied values among observations. In addition to continuous distributions, the beta-binomial q q q q q n = 40 = = = = = = = = = = .05 .1 .5 .9 .95 .05 .1 .5 .9 .95 0.99070 1.21651 1.17301 1.19235 0.99029 1.18587 1.25219 1.14274 1.23146 1.19510 0.86779 1.24822 1.13870 1.06228 0.87260 1.07484 1.28075 1.13071 1.11119 1.06741 Note. Ratios greater than 1 are marked as bold. q q q q q n = 20 1.44649 1.50502 0.87404 1.54114 1.39977 1.930088 1.042931 1.034191 1.446043 1.81566 3.63880 1.56109 1.07802 1.77638 1.755187 1.54521 1.91612 0.89401 1.82750 1.53817 7.91514 0.92285 1.00308 1.21385 7.37383 20.66868 0.81206 1.00466 1.49783 24.1464 Over HD Over SV2 Over kernel Over HD Over R g = 0, h = 0.5 g = 0, h = 0 Table 5. 
Relative efficiencies, g = 0, h = 0 and g = 0, h = 0.5 0.50976 1.05482 1.00416 1.07138 0.57109 0.90941 1.02860 1.01084 1.05383 0.89765 Over R 1.28632 0.85344 1.04161 0.69710 0.87792 46.99993 0.59904 1.10621 0.75784 1.50014 Over kernel 0.45170 0.68276 1.03389 8.59226 27.29152 0.69539 0.21479 1.06844 19.43331 73.37133 Over SV2 A new quantile estimator 11 q q q q q n = 40 = = = = = = = = = = .05 .1 .5 .9 .95 .05 .1 .5 .9 .95 0.82677 0.65111 1.06729 0.43523 1.50999 0.64505 0.51581 0.99870 0.14283 1.50302 4.01514 2.40974 29.79292 1.07614 1.07959 1.25373 0.56500 1.08051 0.58080 1.06360 3.12290 1.12714 1.09157 2.06626 3.25799 5.70539 2.05658 1.20460 3.20148 3.72061 5.22908 3.53368 16.46816 1.19951 1.24078 0.99342 0.46756 1.03346 0.46256 0.97806 Note. Ratios greater than 1 are marked as bold. q q q q q n = 20 10.29910 1.47630 0.99902 1.58808 44.73590 7.63720 4.29891 0.98569 13.01179 70.94930 Over HD Over SV2 Over kernel Over HD Over R g = 0.5, h = 0.5 g = 0.5, h = 0 Table 6. Relative efficiencies, g = 0.5, h = 0 and g = 0.5, h = 0.5 0.55739 1.51427 1.00073 1.49951 0.24361 0.92118 4.97690 0.99296 5.30639 0.86593 Over R 1.14281 0.88163 1.32614 0.48524 0.33714 12.29856 1.14173 1.16343 0.49120 0.65192 Over kernel 0.51389 0.72183 1.10908 22.28215 195.45970 0.97258 0.47268 1.23359 36.65349 220.34480 Over SV2 12 € G€ozde Navruz and A. Fırat Ozdemir A new quantile estimator 13 Table 7. Actual Type I error rates, g = 0, h = 0 and g = 0, h = 0.5 g = 0, h = 0 g = 0, h = 0.5 n n n n n n = = = = = = 10 20 40 10 20 40 q = .05 q = .1 q = .5 q = .9 q = .95 0.0418 0.0576 0.0596 0.0498 0.0648 0.0590 0.0578 0.0616 0.0502 0.0702 0.0700 0.0563 0.0484 0.0450 0.0466 0.0489 0.0423 0.0444 0.0643 0.0591 0.0490 0.0729 0.0667 0.0550 0.0372 0.0548 0.0553 0.0505 0.0630 0.0606 Table 8. 
Actual Type I error rates, g = 0.5, h = 0 and g = 0.5, h = 0.5 g = 0.5, h = 0 g = 0.5, h = 0.5 n n n n n n = = = = = = 10 20 40 10 20 40 q = .05 q = .1 q = .5 q = .9 q = .95 0.0298 0.0554 0.0578 0.0453 0.0618 0.0625 0.0540 0.0591 0.0513 0.0723 0.0676 0.0524 0.0512 0.0441 0.0434 0.0471 0.0425 0.0430 0.0622 0.0618 0.0466 0.0697 0.0703 0.0529 0.0445 0.0576 0.0555 0.0526 0.0718 0.0618 distribution is used for the discrete case since it allows the effect of tied values to be observed. A variable with beta-binomial distribution is distributed as binomial with parameter p, and the probability of success p has a beta distribution with parameters r and s. It has the following probability function, where B is the complete beta function: PðX ¼ xÞ ¼ Bðm x þ r; x þ sÞ : ðm þ 1ÞBðm x þ 1; x þ 1ÞBðr; sÞ ð18Þ In this study m = 10 was used, which means the possible values for X are the integers 0, 1, . . ., 10. Also, the values for r and s were taken as s = 9 and r = 1, 2, 3, 9. Note that with r = s = 9 the distribution is bell-shaped and symmetric with mean 5. When two independent groups are compared through quantiles by using a percentile bootstrap method in conjunction with the NO quantile estimator, the actual Type I error rates were saved within the stated bounds 0.025 and 0.075 for all continuous distribution settings. Actual Type I error rates are given in Table 9 for the beta-binomial distribution with parameters r = 3, s = 9 and m = 10; other betabinomial distribution settings are omitted since they are same as in Table 9, except for 4 results out of 60. That is, the ability to control the Type I error rate is not affected by the existence of tied values. Table 9. Actual Type I error rates, beta-binomial distribution, r = 3, s = 9, m = 10 n = 10 n = 20 n = 40 q = .05 q = .1 q = .5 q = .9 q = .95 0.0395 0.0467 0.0596 0.0653 0.0674 0.0479 0.0480 0.0466 0.0479 0.0503 0.0598 0.0597 0.0289 0.0356 0.0626 14 € G€ozde Navruz and A. 
Actual Type I error rates can also be controlled by a normal approximation. The normality of the quantile estimates of two independent groups was investigated and confidence intervals were obtained. However, the normal approximation showed no advantage over the percentile bootstrap in maintaining the nominal Type I error rate, so the results are omitted. In other words, the NO quantile estimator performs very well when it is used with a percentile bootstrap method to compare two independent groups, even for extreme quantiles with small sample sizes, and this approach might be an alternative when a researcher aims to compare groups at more than one reference point.

In addition, the power implications of the NO and HD quantile estimators were analysed. When interpreting power results, attention is focused on the cases where the actual Type I error rates of HD lie within the (0.025, 0.075) interval (Navruz & Özdemir, 2018). With sample sizes n = 10 and n = 20, HD kept its Type I error rate within this interval only for the median. When the sample size increases to n = 40, HD had results in (0.025, 0.075) not only for the median, but also for q = .1 and q = .9. For these settings, the NO quantile estimator generally has higher power than HD.

5.1. Conclusion

In this paper the current studies on quantile estimation are briefly reviewed. The ideas behind quantile estimators are investigated and possible drawbacks of existing quantile estimators are identified. Attention is focused on the need for a new quantile estimator, and the NO quantile estimator is proposed.

This study essentially aims to introduce the NO quantile estimator. After explaining some theoretical results on asymptotic normality of the proposed estimator, asymptotic properties such as consistency, unbiasedness, variance and MSE are investigated practically by conducting a simulation study.
The results verified that the NO quantile estimator has very desirable asymptotic properties. The relative efficiencies of the NO quantile estimator over the HD quantile estimator, the R quantile estimator, the SV2 quantile estimator and a kernel quantile estimator are examined, and it is observed that in most simulation settings the NO quantile estimator is more efficient than these other estimators.

Another performance criterion is explored in an application on comparing two independent groups. Comparing two independent groups through quantiles allows deeper insight than a comparison based on a single measure of location. The NO quantile estimator is used in conjunction with a percentile bootstrap method. According to the results, the selected nominal significance level was controlled even when extreme quantiles were being compared with small sample sizes. Furthermore, the effect of tied values is considered by using beta-binomial distributions as a discrete case. The method is very successful even in the presence of tied values.

The NO quantile estimator provides more efficient estimates for population quantiles, especially for extreme ones. It is computationally practical and has desirable asymptotic properties. In addition, it is possible to use the NO quantile estimator in applied statistics. As a final conclusion, using the NO quantile estimator to estimate population quantiles in conjunction with hypothesis-testing applications might be an appropriate alternative for all researchers. R functions related to the proposed NO quantile estimator are available from the corresponding author on request.

Conflicts of interest

All authors declare no conflict of interest.

Author Contributions

Gözde Navruz: Methodology; Software; Writing – original draft. A. Fırat Özdemir: Supervision; Writing – review & editing.

Data availability statement

Data are available on request from the authors.

References

Arnold, B. C., Balakrishnan, N., & Nagaraja, H. N. (1992). A first course in order statistics. New York, NY: John Wiley & Sons.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical & Statistical Psychology, 31, 144–152. https://doi.org/10.1111/j.2044-8317.1978.tb00581.x
Cheng, C. (1995). The Bernstein polynomial estimator of a smooth quantile function. Statistics & Probability Letters, 24, 321–330. https://doi.org/10.1016/0167-7152(94)00190-J
Dielman, T., Lowry, C., & Pfaffenberger, R. (1994). A comparison of quantile estimators. Communications in Statistics – Simulation and Computation, 23, 355–371. https://doi.org/10.1080/03610919408813175
Falk, M. (1985). Asymptotic normality of the kernel quantile estimator. Annals of Statistics, 13, 428–433. https://doi.org/10.1214/aos/1176346605
Harrell, F. E., & Davis, C. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635–640. https://doi.org/10.2307/2335999
Hoaglin, D. C. (1985). Summarizing shape numerically: The g-and-h distributions. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Exploring data tables, trends, and shapes. New York, NY: Wiley-Interscience.
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. American Statistician, 50, 361–365. https://doi.org/10.2307/2684934
Kaigh, W. D., & Cheng, C. (1991). Subsampling quantile estimators and uniformity criteria. Communications in Statistics – Theory and Methods, 20, 539–560. https://doi.org/10.1080/03610929108830514
Kaigh, W. D., & Lachenbruch, P. A. (1982). A generalized quantile estimator. Communications in Statistics – Theory and Methods, 11, 2217–2238. https://doi.org/10.1080/03610926208828383
Mc Cune, E. D., & Mc Cune, S. L. (1991). Jackknifed kernel quantile estimators. Communications in Statistics – Theory and Methods, 20, 2719–2725. https://doi.org/10.1080/03610929108830661
Navruz, G., & Özdemir, A. F. (2018). Quantile estimation and comparing two independent groups with an approach based on percentile bootstrap. Communications in Statistics – Theory and Methods, 47, 2119–2138. https://doi.org/10.1080/03610918.2017.1335410
Parrish, R. S. (1990). Comparison of quantile estimators in normal sampling. Biometrics, 46, 247–257. https://doi.org/10.2307/2531649
Pepelyshev, A., Rafajłowicz, E., & Steland, A. (2014). Estimation of the quantile function using Bernstein-Durrmeyer polynomials. Journal of Nonparametric Statistics, 26, 1–20. https://doi.org/10.1080/10485252.2013.826355
Sfakianakis, M. E., & Verginis, D. G. (2008). A new family of nonparametric quantile estimators. Communications in Statistics – Theory and Methods, 37, 337–345. https://doi.org/10.1080/03610910701790491
Sheather, S. J., & Marron, J. S. (1990). Kernel quantile estimators. Journal of the American Statistical Association, 85, 410–416. https://doi.org/10.1080/01621459.1990.10476214
Stigler, S. M. (1974). Linear functions of order statistics with smooth weight functions. Annals of Statistics, 2, 676–693. https://doi.org/10.1214/aos/1176342756
Wilcox, R. R. (2017). Introduction to robust estimation and hypothesis testing (4th ed.). Amsterdam, the Netherlands: Academic Press.
Wilcox, R. R., Erceg-Hurn, D. M., Clark, F., & Carlson, M. (2014). Comparing two independent groups via the lower and upper quantiles. Journal of Statistical Computation and Simulation, 84, 1543–1551. https://doi.org/10.1080/00949655.2012.754026
Yang, S. S. (1985). A smooth nonparametric estimator of a quantile function. Journal of the American Statistical Association, 80, 1004–1011. https://doi.org/10.2307/2288567
Yoshizawa, C. N., Sen, P. K., & Davis, C. E. (1985). Asymptotic equivalence of the Harrell-Davis median estimator and the sample median. Communications in Statistics – Theory and Methods, 14, 2129–2136. https://doi.org/10.1080/03610928508829034

Received 27 February 2019; revised version received 14 December 2019
