{{Bayesian statistics}}
In [[Bayesian inference]], the '''Bernstein–von Mises theorem''' provides the basis for using Bayesian credible sets for confidence statements in [[parametric model|parametric models]]. It states that under some conditions, a posterior distribution converges in [[total variation distance]], in the limit of infinite data, to a multivariate [[normal distribution]] centred at the maximum likelihood estimator <math>\widehat{\theta}_n</math> with covariance matrix <math>n^{-1}\mathcal{I}(\theta_0)^{-1}</math>, where <math>\theta_0</math> is the true population parameter and <math>\mathcal{I}(\theta_0)</math> is the [[Fisher information]] matrix at <math>\theta_0</math>:
:<math>\left\|P(\theta\mid x_1,\dots,x_n) - \mathcal{N}\left(\widehat{\theta}_n, n^{-1}\mathcal{I}(\theta_0)^{-1}\right)\right\|_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0.</math>
The Bernstein–von Mises theorem links [[Bayesian inference]] with [[frequentist inference]]. It assumes there is some true probabilistic process that generates the observations, as in frequentism, and then studies the quality of Bayesian methods of recovering that process and making uncertainty statements about that process. In particular, it states that asymptotically, many Bayesian credible sets of a certain credibility level <math>\alpha</math> will act as confidence sets of confidence level <math>\alpha</math>, which allows a frequentist interpretation of Bayesian uncertainty statements.
==Statement==
Consider independent, identically distributed observations <math>X_1, \ldots, X_n</math> from a model <math>(P_\theta\,:\,\theta\in\Theta)</math> with parameter space <math>\Theta\subseteq\mathbb{R}^k</math>, and let <math>\theta_0\in\Theta</math> be the true parameter. Suppose the following conditions hold:
# The model admits densities <math>(p_\theta\,:\,\theta\in\Theta)</math> with respect to some measure <math>\mu</math>.
# The Fisher information matrix <math>\mathcal{I}(\theta_0)</math> is nonsingular.
# The model is differentiable in quadratic mean. That is, there exists a measurable function <math>f:\mathcal{X}\rightarrow\mathbb{R}^k</math> such that <math>\int\left[\sqrt{p_\theta(x)} - \sqrt{p_{\theta_0}(x)}- \frac{1}{2}(\theta - \theta_0)^\top f(x)\sqrt{p_{\theta_0}(x)}\right]^2 \mathrm{d}\mu(x) = o(||\theta - \theta_0||^2) </math> as <math>\theta \rightarrow \theta_0</math>.
# For every <math>\varepsilon > 0</math>, there exists a sequence of test functions <math>\phi_n:\mathcal{X}^n \rightarrow [0, 1]</math> such that <math>\mathbb{E}_{\mathbf{X} \sim P^n_{\theta_0}}\left[\phi_n(\mathbf{X})\right] \rightarrow 0</math> and <math>\sup_{\theta \,:\, ||\theta-\theta_0||>\varepsilon} \mathbb{E}_{\mathbf{X}\sim P^n_{\theta}}\left[1 - \phi_n(\mathbf{X})\right] \rightarrow 0</math> as <math>n \rightarrow \infty</math>.
# The prior measure is absolutely continuous with respect to the Lebesgue measure in a neighborhood of <math>\theta_0</math>, with a continuous positive density at <math>\theta_0</math>.
Then for any estimator <math>\widehat{\theta}_n</math> satisfying <math>\sqrt{n}( {\widehat{\theta}}_n - \theta_0) \xrightarrow{d} \mathcal{N}(0, {\mathcal{I}}^{-1}(\theta_0))</math>, the posterior distribution <math>\Pi_n</math> of <math>\theta\mid X_1, \ldots, X_n</math> satisfies<blockquote><math>{\left|\left|\Pi_n - \mathcal{N}\left(\widehat{\theta}_n, \frac{1}{n}{\mathcal{I}}^{-1}({\theta_0})\right)\right|\right|}_{\mathrm{TV}} \xrightarrow{P_{\theta_0}} 0</math></blockquote>as <math>n\rightarrow \infty</math>.
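The convergence in total variation can be checked numerically in a conjugate model. The following sketch (our own illustration, not part of the theorem's statement) uses Bernoulli data with a uniform Beta(1, 1) prior, so the posterior is exactly Beta(1 + s, 1 + n − s); the total variation distance to the normal limit predicted by the theorem shrinks as <math>n</math> grows. All variable names and the chosen parameter values are illustrative.

```python
import numpy as np
from scipy import stats

def tv_to_bvm_normal(theta0, n, seed=0):
    """Approximate the total variation distance between the exact Beta
    posterior and the normal limit predicted by Bernstein-von Mises."""
    rng = np.random.default_rng(seed)
    s = rng.binomial(n, theta0)                  # observed number of successes
    theta_hat = s / n                            # maximum likelihood estimator
    posterior = stats.beta(1 + s, 1 + n - s)     # exact posterior, uniform prior
    # For Bernoulli, I(theta_0) = 1 / (theta_0 * (1 - theta_0)), so the
    # predicted limit is N(theta_hat, theta_0 * (1 - theta_0) / n).
    limit = stats.norm(theta_hat, np.sqrt(theta0 * (1 - theta0) / n))
    # TV distance = (1/2) * integral of |p - q|, approximated on a fine grid.
    grid = np.linspace(1e-6, 1 - 1e-6, 20001)
    diff = np.abs(posterior.pdf(grid) - limit.pdf(grid))
    return 0.5 * np.sum(diff) * (grid[1] - grid[0])

tv_small_n = tv_to_bvm_normal(0.3, 20)
tv_large_n = tv_to_bvm_normal(0.3, 2000)
print(tv_small_n, tv_large_n)  # the distance decreases as n grows
```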
==Bernstein–von Mises and maximum likelihood estimation==
Under certain regularity conditions, the [[maximum likelihood estimator]] is an asymptotically efficient estimator and can thus be used as <math>\widehat{\theta}_n</math> in the theorem statement. This then yields that the posterior distribution converges in total variation distance to the asymptotic distribution of the maximum likelihood estimator, which is commonly used to construct frequentist confidence sets.
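As a consequence, for large samples a Bayesian credible interval and the Wald confidence interval built from the MLE nearly coincide. The sketch below (our own illustration, with hypothetical parameter values) compares the two for Bernoulli data under a uniform prior, where the MLE is the sample proportion.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0, n = 0.4, 5000
s = rng.binomial(n, theta0)

theta_hat = s / n                                # maximum likelihood estimator
se = np.sqrt(theta_hat * (1 - theta_hat) / n)    # plug-in asymptotic standard error
z = stats.norm.ppf(0.975)
wald = (theta_hat - z * se, theta_hat + z * se)  # frequentist 95% Wald interval

posterior = stats.beta(1 + s, 1 + n - s)         # exact posterior, uniform prior
credible = (posterior.ppf(0.025), posterior.ppf(0.975))  # 95% equal-tailed credible interval

print(wald)
print(credible)  # nearly identical to the Wald interval for this n
```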
==Implications==
Different summary statistics such as the [[Mode (statistics)|mode]] and mean may behave differently in the posterior distribution. In Freedman's examples, the posterior density and its mean can converge on the wrong result, but the posterior mode is consistent and will converge on the correct result.
==Notes==
{{Reflist}}
==References==
*{{cite book|last=Hartigan |first=J. A. |authorlink=John A. Hartigan |chapter=Asymptotic Normality of Posterior Distributions |title=Bayes Theory |location=New York |publisher=Springer |year=1983 |isbn= |doi=10.1007/978-1-4613-8242-3_11 }}
*{{cite book |last=Le Cam |first=Lucien |authorlink=Lucien Le Cam |title=Asymptotic Methods in Statistical Decision Theory |chapter=Approximately Gaussian Posterior Distributions |pages=336–345 |location=New York |publisher=Springer |year=1986 |isbn=0-387-96307-3 }}
*{{cite book|last=van der Vaart|first=A. W. |title=Asymptotic Statistics
|year=1998|publisher=Cambridge University Press|isbn= 0-521-49603-9|chapter=Bernstein–von Mises Theorem}}