1. Introduction
In applied multivariate statistical analysis one is frequently faced with the problem of estimating the multivariate normal (mvn) distribution (or, equivalently, integrating the mvn density) not only for a range of correlation or covariance structures, but also for a number of dimensions (i.e., variables) that can span several orders of magnitude. In applications for which only one or a few instances of the distribution, and of low dimensionality, must be estimated, conventional numerical methods based on, e.g., Newton-Cotes formulæ, Gaussian quadrature and orthogonal polynomials, or tetrachoric series, may offer satisfactory combinations of computational speed and estimation precision.
Increasingly, however, statistical analysis of large datasets requires many evaluations of very high-dimensional
mvn distributions—often as an incidental part of some larger analysis—and places severe demands on the requisite speed and accuracy of numerical methods. We confront the need to estimate the high-dimensional
mvn integral in statistical genetics, and particularly in genetic analyses of extended pedigrees (i.e., large, multi-generational collections of related individuals). A typical exercise is variance component analysis of a discrete trait (e.g., a qualitative or categorical measurement of some disease or other condition of interest) under a liability threshold model [1,2,3]. Maximum-likelihood estimation of the model parameters in such an application can easily require tens or hundreds of evaluations of the mvn distribution for which n ≈ 100–1000 or greater [4,5,6,7], and situations in which the number of dimensions is far greater still are not unrealistic.
In such problems the dimensionality of the model distribution is determined by the product of the total number of individuals in the pedigree(s) to be analyzed and the number of discrete phenotypes jointly analyzed [1,8]. For univariate traits studied in small pedigrees, such as sibships (sets of individuals born to the same parents) and nuclear families (sibships and their parents), the dimensionality is typically small (n ≈ 2–10), but analysis of multivariate phenotypes in large extended pedigrees routinely necessitates estimation of mvn distributions for which n can easily reach several thousand [2,3,7]. A single variance component-based linkage analysis of a univariate discrete phenotype in a set of extended pedigrees involves estimating these high-dimensional mvn distributions at hundreds of locations in the genome [3,9,10]. In these numerically intensive applications, estimation of the mvn distribution represents the main computational bottleneck, and the performance of the algorithms used to estimate it is of paramount importance.
Here we report the results of a simulation-based comparison of the performance of two algorithms for estimation of the high-dimensional mvn distribution: the widely used Mendell-Elston (me) approximation [1,8,11,12] and the Genz Monte Carlo (mc) procedure [13,14]. Each of these methods is well known, but previous studies have not investigated their properties for very large numbers of dimensions. Using the conventional numerical desiderata of estimation accuracy, execution time, and computational efficiency, we examine the performance of these algorithms and identify aspects of the overall mvn problem to which each method is particularly well suited.
In Section 2 we give a brief overview of techniques for estimating the mvn distribution. The me and Genz mc algorithms are reviewed in Section 3. Procedures for exercising the algorithms and comparing their performance are described in Section 4, and the results of the comparisons are presented in Section 5. In Section 6 we consider the interpretation and broader implications of our results for future applications.
2. Background
Numerical methods for estimation of the mvn distribution have a long and fascinating history of development, and many interesting accounts from varied perspectives have been presented ([15,16,17,18], and references therein). Classical approaches to the problem have generally been variations on a few standard methods [19,20]. Often, some form of numerical quadrature is involved, in which an estimate of the integral is formed by accumulating a weighted sum of integrand values at a sequence of abscissæ covering the region of integration [21,22,23,24]. Tetrachoric series expansions [25,26] offer another approach to the problem, although these series may converge slowly, and in fact do not converge at all at some points in the correlation space for a given number of dimensions [27]. Other approaches have involved quadrature applied to an integral transformation of the tetrachoric series [19,28,29], and decomposition of the multidimensional probability into a product of conditional univariate probabilities [1,8,11,12,30,31,32,33].
In practice, the utility and applicability of any algorithm for estimating the mvn distribution is overwhelmingly constrained by the dimensionality of the problem. Fast and accurate algorithms have been described for evaluation of the univariate and multivariate normal distributions for 'small' numbers of dimensions; for the frequently encountered univariate case [34] and the bivariate and trivariate cases [35,36,37,38], several algorithms are available that can in principle provide any desired accuracy. For somewhat higher numbers of dimensions, several algorithms (error-bounded and not) based on quadrature have been developed and their relative performance compared [17,21,22,23,24]. Monte Carlo approaches to the problem have also been developed that have desirable statistical properties and exhibit good scale properties with the number of dimensions [13,14,39,40].
As the number of dimensions grows, many approaches to estimating the mvn distribution become impractical. Conventional series approximations and quadrature methods grow unwieldy, and the computational burden for algorithms using these methods rapidly becomes prohibitive [13,14,19,20,41]. However, methods of estimation based on reduction or transformation of the joint n-variate distribution to a series of (typically univariate) integrals continue to scale favorably with the number of dimensions.
3. Algorithms
We examined the performance of two algorithms that appear particularly well suited to estimation of the high-dimensional mvn distribution. The first of these is the Mendell-Elston (me) procedure (Algorithm 1), a deterministic, non-error-bounded procedure that approximates the mvn distribution as a sequence of conditional univariate normal integrals [1,8,11,12]. The second is the elegant transformation and estimation procedure described by Genz [13,14] (Algorithm 2). In this approach the original n-variate distribution is transformed into an easily sampled integral over the (n − 1)-dimensional unit hypercube and estimated by Monte Carlo methods (e.g., [42,43]).
Algorithm 1 Mendell-Elston Estimation of the mvn Distribution [12].
Estimate the standardized n-variate mvn distribution, having zero mean and correlation matrix R, between vector-valued limits a and b. The function φ(z) is the univariate normal density at z, and Φ(z) is the corresponding univariate normal distribution. See Hasstedt [12] for discussion of the approximation, extensions, and applications.
1. input n, a, b, R
2. initialize f = 1
3. for i = 1, 2, …, n
   (a) [update the total probability] p ← Φ(b_i) − Φ(a_i); f ← f · p; if i = n return f
   (b) [peel variable i] z_i ← [φ(a_i) − φ(b_i)]/p; v_i ← 1 + [a_i φ(a_i) − b_i φ(b_i)]/p − z_i²
   (c) [condition the remaining variables] for j, k = i + 1, …, n:
       s_j ← √[1 − r_ij²(1 − v_i)]
       a_j ← (a_j − r_ij z_i)/s_j,  b_j ← (b_j − r_ij z_i)/s_j
       r_jk ← [r_jk − r_ij r_ik(1 − v_i)]/(s_j s_k)
       [end loop over j, k]
   [end loop over i]
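To make the mechanics concrete, a minimal C sketch of Algorithm 1 follows. The function and variable names are ours rather than those of any published implementation, and the truncated-moment and conditioning formulas are the standard ones discussed by Hasstedt [12]; infinite limits of integration are handled by taking x·φ(x) → 0.

```c
/* Minimal sketch of the Mendell-Elston (ME) approximation, following the
 * structure of Algorithm 1.  Names are ours; formulas are the standard
 * truncated-normal moments and conditioning updates (cf. Hasstedt [12]).
 * Uses C99 conveniences (erfc, isfinite, variable-length arrays). */
#include <math.h>

static double phi(double z) { return 0.3989422804014327 * exp(-0.5 * z * z); }
static double Phi(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }
static double xphi(double x) { return isfinite(x) ? x * phi(x) : 0.0; } /* x*phi(x) -> 0 at +/-inf */

/* a, b: limits of integration; r: n x n correlation matrix, row-major.
 * All three arrays are overwritten as variables are peeled off. */
double mvn_me(int n, double *a, double *b, double *r)
{
    double f = 1.0;
    for (int i = 0; i < n; i++) {
        /* (a) update the total probability */
        double p = Phi(b[i]) - Phi(a[i]);
        f *= p;
        if (i == n - 1 || p <= 0.0)          /* done, or degenerate region */
            return f;

        /* (b) peel variable i: mean z and variance v after truncation to [a_i, b_i] */
        double z = (phi(a[i]) - phi(b[i])) / p;
        double v = 1.0 + (xphi(a[i]) - xphi(b[i])) / p - z * z;

        /* (c) condition the remaining variables on the peeled variable */
        double s[n];                          /* rescaling factors */
        for (int j = i + 1; j < n; j++) {
            double rij = r[i * n + j];
            s[j] = sqrt(1.0 - rij * rij * (1.0 - v));
            a[j] = (a[j] - rij * z) / s[j];
            b[j] = (b[j] - rij * z) / s[j];
        }
        for (int j = i + 1; j < n; j++)
            for (int k = i + 1; k < n; k++)
                if (j != k)
                    r[j * n + k] = (r[j * n + k] - r[i * n + j] * r[i * n + k] * (1.0 - v))
                                   / (s[j] * s[k]);
    }
    return f;
}
```

Note that the arrays a, b, and r are modified in place as each variable is peeled off, mirroring step 3(c) of the algorithm.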
The me approximation is extremely fast, and broadly accurate over much of the parameter space [1,8,17,41]. The chief source of error in the approximation derives from the assumption that, at each stage of conditioning, the selected and unselected variables continue to distribute in approximately normal fashion [1]. This assumption is analytically true only for the initial stage(s) of selection and conditioning [17]; in subsequent stages the assumption is violated to a greater or lesser degree and introduces error into the approximation [31,33,44,45]. Consequently, the me approximation is most accurate for small correlations and for selection in the tails of the distribution, thereby minimizing departures from normality following selection and conditioning. Conversely, the error in the me approximation is greatest for larger correlations and selection closer to the mean [1].
Algorithm 2 Genz Monte Carlo Estimation of the mvn Distribution [13].
Estimate the m-variate mvn distribution having covariance matrix Σ, between vector-valued limits a and b, to an accuracy ε with probability 1 − α, or until the maximum number of integrand evaluations N_max is reached. The procedure returns the estimated probability F, the estimation error Δ, and the number of iterations N. The function Φ(x) is the univariate normal distribution at x, and Φ⁻¹(u) is the corresponding inverse function; U(0, 1) is a source of uniform random deviates on (0, 1); and γ_α is the two-tailed Gaussian confidence factor corresponding to α. See Genz [13,14] for discussion, a worked example, and suggestions for optimizing algorithm performance.
1. input m, Σ, a, b, ε, α, N_max
2. compute the lower-triangular Cholesky decomposition C of Σ
3. initialize N = 0, Intsum = 0, Varsum = 0, d_1 = Φ(a_1/c_11), e_1 = Φ(b_1/c_11), f_1 = e_1 − d_1
4. repeat
   (a) generate uniform deviates w_1, w_2, …, w_{m−1} from U(0, 1)
   (b) for i = 2, 3, …, m:
       y_{i−1} ← Φ⁻¹(d_{i−1} + w_{i−1}(e_{i−1} − d_{i−1}))
       d_i ← Φ[(a_i − Σ_{j<i} c_ij y_j)/c_ii],  e_i ← Φ[(b_i − Σ_{j<i} c_ij y_j)/c_ii]
       f_i ← (e_i − d_i) f_{i−1}
   (c) update N ← N + 1, δ ← (f_m − Intsum)/N, Intsum ← Intsum + δ, Varsum ← (N − 2)Varsum/N + δ²
   (d) Δ ← γ_α √Varsum
   until Δ ≤ ε or N = N_max
5. F ← Intsum
6. return F, Δ, N
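Again for concreteness, a minimal C sketch of Algorithm 2 is given below, assuming the lower-triangular Cholesky factor has already been computed (step 2) and using the C library rand() as a stand-in for the uniform source U(0, 1); the names are ours, and a production implementation would substitute a better generator and a dedicated inverse-normal routine.

```c
/* Minimal sketch of Genz Monte Carlo estimation (Algorithm 2).  chol[]
 * holds the lower-triangular Cholesky factor of the covariance matrix,
 * stored row-major.  Names ours; C99 (erfc, variable-length arrays). */
#include <math.h>
#include <stdlib.h>

static double Phi(double x) { return 0.5 * erfc(-x / sqrt(2.0)); }

static double PhiInv(double u)           /* bisection: crude but adequate here */
{
    double lo = -10.0, hi = 10.0;
    for (int i = 0; i < 80; i++) {
        double mid = 0.5 * (lo + hi);
        if (Phi(mid) < u) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

double mvn_genz_mc(int m, const double *chol, const double *a, const double *b,
                   double eps, double gamma, long nmax, double *err, long *niter)
{
    double intsum = 0.0, varsum = 0.0;
    double d1 = Phi(a[0] / chol[0]), e1 = Phi(b[0] / chol[0]);
    double y[m];
    long   N = 0;

    do {                                 /* step 4: sample until converged */
        double d = d1, e = e1, f = e1 - d1;
        for (int i = 1; i < m; i++) {
            double w = (rand() + 0.5) / ((double)RAND_MAX + 1.0);
            y[i - 1] = PhiInv(d + w * (e - d));
            double t = 0.0;              /* sum over j < i of c_ij * y_j */
            for (int j = 0; j < i; j++) t += chol[i * m + j] * y[j];
            d = Phi((a[i] - t) / chol[i * m + i]);
            e = Phi((b[i] - t) / chol[i * m + i]);
            f *= e - d;
        }
        N++;
        double delta = (f - intsum) / N; /* running mean and error, per Genz [13] */
        intsum += delta;
        varsum  = (N - 2) * varsum / N + delta * delta;
        *err    = gamma * sqrt(varsum);
    } while (*err > eps && N < nmax);

    *niter = N;
    return intsum;                       /* F; *err holds Delta */
}
```

The recursion in step (c) is Genz's running estimate of the variance of the mean, so the loop exits as soon as the estimated error γ_α·√Varsum falls below ε.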
Despite taking somewhat different approaches to the problem of estimating the mvn distribution, these algorithms have some features in common. Most significantly, both algorithms reformulate the initial n-dimensional integral as a series of univariate integrals. This feature facilitates imposing an initial ordering on the variables to minimize the potential loss of precision as the integral estimate is accumulated. In similar fashion, prioritizing the variables appropriately can also help minimize the error introduced into the me method by violations of its underlying assumptions [17].
4. Algorithm Comparison
4.1. Program Implementation
Programs implementing the me and mc approximations were written in ansi c following the published algorithms [12,13]. Implementation of the me approximation follows the procedure described by Hasstedt [12] for likelihood evaluation of arbitrary mixtures of mvn densities and distributions. Although the algorithm in [12] is presented in the context of statistical genetics, it is a completely general formulation of the me method, suitable for any application requiring estimation of the mvn distribution. Implementation of the mc approximation directly follows the algorithm presented by Genz [13].
To facilitate testing, a simple driver program was written for each algorithm. The driver accepts arguments defining the estimation problem (e.g., the number of dimensions, the correlations, and the limits of integration) and any algorithm-specific parameters (e.g., convergence criteria). It then initializes the problem (i.e., generates the correlation matrix and limits of integration), calls the algorithm, records its execution time, and reports the results. For the deterministic me algorithm there are no essential user options; the only input quantities are those defining the mvn distribution and the region of integration. The driver program for the Genz mc algorithm provides options for setting parameters unique to Monte Carlo estimation, such as the (maximum) error in the estimate and the (maximum) allowed number of iterations (integrand evaluations) [13].
The actual software implementation of the estimation procedures and their respective driver programs is not critical; experiments with multiple independent implementations of these algorithms have shown consistent and reliable performance irrespective of programming language or style [2,3,7,10,46]. Attention to programming esoterica—e.g., selective use of alternative numerical techniques according to the region of integration, supplementing iterative estimation with functional approximations or table lookup, or devolving the original integral into a sequence of conditional oligovariate (rather than univariate) problems—could conceivably yield modest improvements in execution time in some applications.
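A hypothetical driver in the spirit described above might look as follows; it builds the exchangeable correlation matrix used in Section 4.2, fixes the limits of integration (the values shown are illustrative only, not those of the study), and times a call to the me sketch given after Algorithm 1.

```c
/* Hypothetical test driver (names and default values ours).  Compile
 * together with the mvn_me() sketch given after Algorithm 1. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

double mvn_me(int n, double *a, double *b, double *r);  /* ME sketch */

int main(int argc, char **argv)
{
    int    n   = (argc > 1) ? atoi(argv[1]) : 10;   /* dimensions  */
    double rho = (argc > 2) ? atof(argv[2]) : 0.5;  /* correlation */

    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    double *r = malloc((size_t)n * n * sizeof *r);
    for (int i = 0; i < n; i++) {
        a[i] = -INFINITY;            /* lower limit */
        b[i] = 0.0;                  /* upper limit (illustrative) */
        for (int j = 0; j < n; j++)
            r[i * n + j] = (i == j) ? 1.0 : rho;
    }

    clock_t t0 = clock();
    double  f  = mvn_me(n, a, b, r);
    double  dt = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("n = %d  rho = %.2f  estimate = %.6g  time = %.3g s\n", n, rho, f, dt);
    free(a); free(b); free(r);
    return 0;
}
```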
4.2. Test Problems
For validating and comparing the mc and me algorithms it is important to have a source of independently determined values of the mvn distribution against which to compare the approximations returned by each algorithm. For many purposes it may be sufficient to refer to tables of the mvn distribution that have been generated for special cases of the correlation matrix [15,18,47,48,49,50,51]. Here, however, as in similar numerical studies [1,8,14,41], values of the mvn distribution were computed independently for correlation matrices defined by

R = (1 − ρ) I_n + ρ J_n,

where n is the number of dimensions, I_n is the n × n identity matrix, J_n is the n × n matrix of ones, and ρ is a correlation coefficient. For R of this form, the n-variate mvn distribution at the upper limits b = (b_1, b_2, …, b_n)′ can be reduced to the single integral

F_n(b) = ∫_{−∞}^{∞} φ(t) ∏_{i=1}^{n} Φ[(b_i − √ρ t)/√(1 − ρ)] dt,

where φ(t) is the univariate normal density at t and Φ(·) is the corresponding univariate normal distribution [18,47,49,50]. This result involves only univariate normal functions and can be computed to desired accuracy using standard numerical methods (e.g., [43]).
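For example, the reference values can be computed with ordinary one-dimensional quadrature; the following C sketch applies Simpson's rule to the reduction above on a truncated range (the range, step count, and names are our choices, not prescriptions from the literature).

```c
/* Reference values for the equicorrelated test problems: Simpson's-rule
 * quadrature of the one-dimensional reduction given above.  C99 (erfc). */
#include <math.h>

static double Phi(double z) { return 0.5 * erfc(-z / sqrt(2.0)); }
static double phi(double z) { return 0.3989422804014327 * exp(-0.5 * z * z); }

/* P(X_1 <= b_1, ..., X_n <= b_n) for corr(X_i, X_j) = rho, 0 <= rho < 1 */
double mvn_equicorr(int n, const double *b, double rho)
{
    const double lim = 8.0;       /* integrand is negligible beyond +/- 8 */
    const int    steps = 4000;    /* even number of Simpson intervals     */
    double h = 2.0 * lim / steps, sum = 0.0;

    for (int k = 0; k <= steps; k++) {
        double t = -lim + k * h;
        double g = phi(t);
        for (int i = 0; i < n; i++)
            g *= Phi((b[i] - sqrt(rho) * t) / sqrt(1.0 - rho));
        sum += g * ((k == 0 || k == steps) ? 1.0 : (k % 2 ? 4.0 : 2.0));
    }
    return sum * h / 3.0;
}
```

For ρ = 0 the routine reduces to the product of univariate probabilities, and for n = 1 it returns Φ(b_1), which provides convenient sanity checks.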
4.3. Test Conditions
Two series of comparisons were conducted. In the first series, algorithms were compared using correlation matrices R of the form described in Section 4.2, with the number of dimensions n running from 3 to 10 by 1 and with a range of fixed values of the correlation ρ; the lower and upper limits of integration were held fixed across variables and conditions.
In the second series of comparisons, correlation matrices were generated with values of ρ drawn randomly from a uniform distribution [52,53]; the lower limits of integration remained fixed at the values used in the first series, but the upper limits were chosen randomly from a uniform distribution.
For the Genz mc algorithm an initial estimate was generated using a modest number of iterations (the actual value was not critical); then, if necessary, iterations were continued in further blocks until the requested estimation accuracy ε was achieved [13,14]. Under the usual assumption that independent Monte Carlo estimates distribute normally about the true integral value I, the 100(1 − α)% confidence interval for I is F ± γ_α s_F, where F is the estimated value, s_F is the standard error of F, γ_α is the Monte Carlo confidence factor for the standard error, and α is the Type I error probability. Therefore, to achieve an error less than ε with probability 1 − α, the algorithm samples the integral until γ_α s_F ≤ ε. For all results reported here a single value of α (and hence of γ_α) was used throughout.
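For reference, the confidence factor γ_α is simply the two-tailed Gaussian quantile Φ⁻¹(1 − α/2), which a driver might compute as in the following self-contained sketch (bisection is used for simplicity; familiar values are γ ≈ 1.960 for α = 0.05 and γ ≈ 2.576 for α = 0.01).

```c
/* Two-tailed Gaussian confidence factor: gamma_alpha = PhiInv(1 - alpha/2),
 * found by bisection on Phi.  Function name ours. */
#include <math.h>

static double Phi(double x) { return 0.5 * erfc(-x / sqrt(2.0)); }

double confidence_factor(double alpha)
{
    double lo = 0.0, hi = 10.0, u = 1.0 - 0.5 * alpha;
    for (int i = 0; i < 80; i++) {
        double mid = 0.5 * (lo + hi);
        if (Phi(mid) < u) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}
```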
4.4. Test Comparisons
Three aspects of algorithm performance were compared: the error in the estimate, the computation time required to generate the estimate, and the relative efficiency of estimation. One can invent many additional interesting and contextually relevant comparisons examining various aspects of estimation quality and algorithm performance, but the criteria used here have been applied in other studies (e.g., [39]), are simple to quantify, and are broadly relevant and effective for delineating areas of the mvn problem space in which each method performs more or less optimally.
The estimation error is the difference between the estimate returned by the algorithm and the independently computed expectation. The computation time is the execution time required for the algorithm to return an estimate; for the mc procedure this quantity includes the (comparatively trivial) time required to obtain the Cholesky decomposition of the correlation matrix. The relative efficiency is the time-weighted ratio of the variances of the estimates (see, e.g., [39]). Thus, if t_mc and t_me, respectively, denote the execution times of the mc and me algorithms, and mse_mc and mse_me the corresponding mean squared errors in the mc and me estimates, then the relative efficiency is

RE = (mse_me/mse_mc) × (t_me/t_mc),

i.e., the product of the relative mean squared error and the relative execution time. The measure is somewhat ad hoc, and in practical applications the choice of algorithm should ultimately be informed by pragmatic considerations, but—ceteris paribus—values RE > 1 tend to favor the Genz mc algorithm, and values RE < 1 tend to favor the me algorithm.
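Expressed in code, the measure is a one-line computation (function name ours) that makes the directionality explicit.

```c
/* Relative efficiency as defined above: the product of the relative mean
 * squared error and the relative execution time.  Values greater than 1
 * favor the Genz mc algorithm; values less than 1 favor me. */
double relative_efficiency(double mse_me, double mse_mc, double t_me, double t_mc)
{
    return (mse_me / mse_mc) * (t_me / t_mc);
}
```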
4.5. Computing Platforms
Numerical methods are of little use if they are ill-suited to the hardware available to the user. Both the me and Genz mc algorithms involve the manipulation of large, nonsparse matrices, and the mc method also makes heavy use of random number generation, so there seemed no compelling reason a priori to expect these algorithms to exhibit similar scale characteristics with respect to computing resources. Algorithm comparisons were therefore conducted on a variety of computers having wildly different configurations of cpu, clock frequency, installed ram, and hard drive capacity, including an intrepid Intel 386/387 system (25 mhz, 5 mb ram), a Sun sparcstation-5 workstation (160 mhz, 1 gb ram), a Sun sparcstation-10 server (50 mhz, 10 gb ram), a Mac G4 PowerPC (1.5 ghz, 2 gb ram), and a MacBook Pro with Intel Core i7 (2.5 ghz, 16 gb ram). As expected, clock frequency was found to be the primary factor determining overall execution speed, but both algorithms performed robustly and proved entirely practical for use even with modest hardware. We did not, however, further investigate the effect of computer resources on algorithm performance, and all results reported below are independent of any specific test platform.
5. Results
5.1. Error
The errors in the estimates returned by each method are shown in Figure 1 for a single 'replication', i.e., a single application of each algorithm returning one (convergent) estimate. The figure illustrates the qualitatively different behavior of the two estimation procedures—the deterministic approximation returned by the me algorithm, and the stochastic estimate returned by the Genz mc algorithm.
Estimates from the mc algorithm are well within the requested maximum error for all values of the correlation coefficient and throughout the range of dimensions considered. Errors are unbiased as well; there is no indication of systematic under- or over-estimation with either correlation or number of dimensions.
In contrast, the error in the estimate returned by the me method, though not generally excessive, is strongly systematic. For small correlations, or for moderate correlations and small numbers of dimensions, the error is comparable in magnitude to that from mc estimation but is consistently biased. For larger correlations, the error begins to exceed that of the corresponding mc estimate, and the desired distribution can be significantly under- or overestimated even for a small number of dimensions. This pattern of error in the me approximation reflects the underlying assumption of multivariate normality of both the marginal and conditional distributions following variable selection [1,8,17]. The assumption is viable for small correlations and for integrals of low dimensionality (which require fewer iterations of selection and conditioning); errors are quickly compounded, and the approximation deteriorates, as the assumption becomes increasingly implausible.
Although bias in the estimates returned by the me method is strongly dependent on the correlation among the variables, this feature should not discourage use of the algorithm. For example, estimation bias would not be expected to prejudice likelihood-based model optimization and estimation of model parameters, which are determined by the location of likelihood extrema. However, estimation bias could conceivably vitiate likelihood-ratio tests involving functions of the actual likelihood values. The latter may become of particular concern in applications that accumulate and compare likelihoods over a collection of independent data under varying model parameterizations.
5.2. Mean Execution Time
Relative mean execution time for the me and mc algorithms is summarized in Figure 2 for 100 replications of each algorithm. As absolute execution times for a given application can vary by several orders of magnitude depending on computing resources, the figure presents the ratio of the mean execution times of the two methods, which was found to be effectively independent of the computing platform.
For estimation of the mvn distribution in moderately few dimensions the me approximation is exceptionally fast, and the mean execution time of the mc method can be markedly greater, roughly 10-fold greater for some combinations of correlation and dimensionality and as much as 1000-fold greater for others. For small correlations the execution time of the mc method becomes comparable with that of the me method as the number of dimensions increases. For the largest numbers of dimensions considered, the Monte Carlo method can be substantially faster, nearly 10-fold to nearly 20-fold depending on the correlation.
The scale properties of mean execution time for the me and mc algorithms with respect to correlation and number of dimensions may be important considerations for specific applications. The me method exhibits virtually no variation in execution time with the strength of the correlation, which may be an attractive feature in applications for which the correlations are highly variable and the dimensionality of the problem does not vary greatly. For the mc method, execution time increases approximately 10-fold as the correlation increases from the smallest to the largest values considered, but is approximately constant with respect to the number of dimensions. This behavior would be desirable in applications for which correlations tend to be small but the number of dimensions varies considerably.
5.3. Relative Performance
In view of the statistical virtues of the mc estimate but the favorable execution times of the me approximation, it is instructive to compare the algorithms in terms of a metric incorporating both of these aspects of performance. For this purpose we use the time- and error-weighted ratio described by Deák [39], and compare the performance of the algorithms for randomly chosen correlations and regions of integration (see Section 4.3). As applied here, values of this ratio greater than one tend to favor the Genz mc method, and values less than one tend to favor the me method.
The relative mean execution times, mean squared errors, and mean time-weighted efficiencies of the mc and me methods are summarized in Figure 3. Although me estimates can be markedly faster to compute in these replications, the mean squared error of the mc estimates is consistently 10–100-fold smaller, making the mc procedure statistically preferable on this basis alone. Measured by their time-weighted relative efficiency, however, the disparity in performance is less extreme; the me algorithm is comparatively efficient for smaller numbers of dimensions, beyond which the mc algorithm becomes the more efficient approach.
6. Discussion
Statistical methodology for the analysis of large datasets is demanding increasingly efficient estimation of the mvn distribution for ever larger numbers of dimensions. In statistical genetics, for example, variance component models for the analysis of continuous and discrete multivariate data in large, extended pedigrees routinely require estimation of the mvn distribution for numbers of dimensions ranging from a few tens to a few tens of thousands. Such applications reflexively (and understandably) place a premium on the sheer speed of execution of numerical methods, and statistical niceties such as estimation bias and error boundedness—critical to hypothesis testing and robust inference—often become secondary considerations.
We investigated two algorithms for estimating the high-dimensional mvn distribution. The me algorithm is a fast, deterministic, non-error-bounded procedure, and the Genz mc algorithm is a Monte Carlo approximation specifically tailored to estimation of the mvn. These algorithms are of comparable complexity, but they also exhibit important differences in their performance with respect to the number of dimensions and the correlations between variables. We find that the me algorithm, although extremely fast, may ultimately prove unsatisfactory if an error-bounded estimate is required, or (at least) some estimate of the error in the approximation is desired. The Genz mc algorithm, despite taking a Monte Carlo approach, proved to be sufficiently fast to be a practical alternative to the me algorithm. Under certain conditions the mc method is competitive with, and can even outperform, the me method. The mc procedure also returns unbiased estimates of desired precision, and is clearly preferable on purely statistical grounds. The mc method has excellent scale characteristics with respect to the number of dimensions, and greater overall estimation efficiency for high-dimensional problems; the procedure is somewhat more sensitive to the correlation between variables, but this is not expected to be a significant concern unless the variables are known to be (consistently) strongly correlated.
For our purposes it has been sufficient to implement the Genz mc algorithm without incorporating specialized sampling techniques to accelerate convergence. In fact, as was pointed out by Genz [13], transformation of the mvn probability into the unit hypercube makes it possible for simple Monte Carlo integration to be surprisingly efficient. We expect, however, that our results are mildly conservative, i.e., that they underestimate the efficiency of the Genz mc method relative to the me approximation. In intensive applications it may be advantageous to implement the Genz mc algorithm using a more sophisticated sampling strategy, e.g., non-uniform 'random' sampling [54], importance sampling [55,56], or subregion (stratified) adaptive sampling [13,57]. These sampling designs vary in their approach, applicability to a given problem, and computational overhead, but their common objective is to estimate the integral as efficiently as possible for a given amount of sampling effort. (For discussion of these and other variance-reduction techniques in Monte Carlo integration, see [42,43].)
Finally, in choosing between these or other procedures for estimating the mvn distribution, it is helpful to observe a pragmatic distinction between applications that are deterministic and those that are genuinely stochastic in nature. The computational merits of fast execution time, accuracy, and precision may be advantageous for the analysis of well-behaved problems of a deterministic nature, yet be comparatively inessential for inherently statistical investigations. In many applications, some sacrifice in the speed of the algorithm (but not, as Figure 1 reveals, in the accuracy of estimation) could surely be tolerated in exchange for desirable statistical properties that promote robust inference [58]. These properties include unbiased estimation of the likelihood, an estimate of the error instead of fixed error bounds (or no error bound at all), the ability to combine independent estimates into a variance-weighted mean, favorable scale properties with respect to the number of dimensions and the correlation between variables, and potentially increased robustness to poorly conditioned covariance matrices [20,42]. For many practical problems requiring the high-dimensional mvn distribution, the Genz mc algorithm clearly has much to recommend it.