STAT 450 Lecture 14


The Central Limit Theorem

If $X_1, X_2, \cdots$ are iid with mean 0 and variance 1 then $n^{1/2}\bar{X}$ converges in distribution to N(0,1). That is,

\begin{displaymath}P(n^{1/2}\bar{X} \le x ) \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2} dy
\, .
\end{displaymath}

Proof: Let $m(t)$ be the moment generating function of an individual $X$. Since $E(X)=0$ and $Var(X)=1$ we have $m(0)=1$, $m^\prime(0)=0$ and $m^{\prime\prime}(0)=1$, so Taylor's theorem gives $m(s) = 1 + s^2/2 + s^3 m^{\prime\prime\prime}(\theta)/6$ for some $\theta$ between 0 and $s$. Then
\begin{align*}E(e^{tn^{1/2}\bar{X}})& = E(e^{tn^{-1/2}(X_1+\cdots+X_n)})
\\
& = \left[m(tn^{-1/2})\right]^n
\\
& = \left[1 + \frac{t^2}{2n} + \frac{t^3 m^{\prime\prime\prime}(\theta)}{6n^{3/2}}\right]^n
\\
& \to e^{t^2/2}
\end{align*}
This is the moment generating function of a N(0,1) random variable, so we are done by our theorem. Notice that the proof uses a specific form of the remainder term in Taylor's theorem, together with the fact that if $x_n \to x$ then

\begin{displaymath}\lim_{n\to\infty} (1+x_n/n)^n = e^x
\end{displaymath}

(In the next homework I ask you to compute this limit rigorously in a special case by taking logs and using L'Hopital's rule.)
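
Numerically, the limit is easy to watch. Here is a minimal Python sketch, where the sequence $x_n = 1 + 1/\sqrt{n}$ is an arbitrary illustrative choice converging to $x=1$:

\begin{verbatim}
import math

# Watch (1 + x_n/n)^n approach e^x for a sequence x_n -> x.
# Illustrative choice: x_n = 1 + 1/sqrt(n), so x = 1.
for n in [10, 100, 10**4, 10**6]:
    x_n = 1 + 1 / math.sqrt(n)
    print(n, (1 + x_n / n) ** n, math.exp(1))
\end{verbatim}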

The next section of material is extra.

Remarks:

1.
The error using the central limit theorem to approximate a density or a probability is proportional to $n^{-1/2}$.

2.
This is improved to $n^{-1}$ for symmetric densities, for which the skewness $\gamma=0$.

3.
Using more terms in the Taylor expansion permits more accurate approximations, called Edgeworth expansions. These expansions are asymptotic: the series usually does not converge as more terms are added. When n=25, say, including the second term may improve the approximation, while including the third or fourth term makes it worse.

Multivariate convergence in distribution

Definition: $X_n\in R^p$ converges in distribution to $X\in R^p$ if

\begin{displaymath}E(g(X_n)) \to E(g(X))
\end{displaymath}

for each bounded continuous real valued function $g$ on $R^p$.

This is equivalent to either of

Cramér-Wold Device: $a^t X_n$ converges in distribution to $a^t X$ for each $a \in R^p$

or

Convergence of Characteristic Functions:

\begin{displaymath}E(e^{ia^tX_n}) \to E(e^{ia^tX})
\end{displaymath}

for each $a \in R^p$.

Extensions of the CLT

1.
If $Y_1,Y_2,\cdots$ are iid in $R^p$ with mean $\mu$ and variance-covariance matrix $\Sigma$ then $n^{1/2}(\bar{Y}-\mu)$ converges in distribution to $MVN(0,\Sigma)$.

2.
If for each $n$ we have independent random variables $X_{n1},\ldots,X_{nn}$ with $E(X_{ni}) = 0$ and $Var(\sum_i X_{ni}) = 1$, and

\begin{displaymath}\sum E(\vert X_{ni}\vert^3) \to 0
\end{displaymath}

then $\sum_i X_{ni}$ converges in distribution to N(0,1). This is the Lyapunov central limit theorem.

3.
As in the Lyapunov CLT, but replace the third moment condition with

\begin{displaymath}\sum E(X_{ni}^2 1(\vert X_{ni}\vert > \epsilon)) \to 0
\end{displaymath}

for each $\epsilon > 0$ then again $\sum_i X_{ni}$ converges in distribution to N(0,1). This is the Lindeberg central limit theorem. (Lyapunov's condition implies Lindeberg's.)

4.
There are extensions to random variables which are not independent. Examples include the m-dependent central limit theorem, the martingale central limit theorem, and the central limit theorem for mixing processes.

5.
Many important random variables are not sums of independent random variables. We handle these with Slutsky's theorem and the $\delta$ method.

Slutsky's Theorem: If $X_n$ converges in distribution to $X$ and $Y_n$ converges in distribution (or in probability) to $c$, a constant, then $X_n+Y_n$ converges in distribution to $X+c$.

Warning: the hypothesis that the limit of $Y_n$ be constant is essential.

The delta method: Suppose a sequence $Y_n$ of random variables converges to a constant $y$, and that if we define $X_n = a_n(Y_n-y)$ then $X_n$ converges in distribution to some random variable $X$. Suppose that $f$ is a differentiable function on the range of $Y_n$. Then $a_n(f(Y_n)-f(y))$ converges in distribution to $f^\prime(y) X$. If $X_n$ is in $R^p$ and $f$ maps $R^p$ to $R^q$ then $f^\prime$ is the $q\times p$ matrix of first derivatives of the components of $f$.

Example: Suppose $X_1,\ldots,X_n$ are a sample from a population with mean $\mu$, variance $\sigma^2$, and third and fourth central moments $\mu_3$ and $\mu_4$. Then

\begin{displaymath}n^{1/2}(s^2-\sigma^2) \Rightarrow N(0,\mu_4-\sigma^4)
\end{displaymath}

where $\Rightarrow $ is notation for convergence in distribution. For simplicity I define $s^2 = \overline{X^2} -{\bar{X}}^2$.

We take $Y_n$ to be the vector with components $(\overline{X^2},\bar{X})$. Then $Y_n$ converges to $y=(\mu^2+\sigma^2,\mu)$. Take $a_n = n^{1/2}$. Then

\begin{displaymath}n^{1/2}(Y_n-y)
\end{displaymath}

converges in distribution to $MVN(0,\Sigma)$ with

\begin{displaymath}\Sigma = \left[\begin{array}{cc} \mu_4-\sigma^4 & \mu_3 -\mu(\mu^2+\sigma^2)\\
\mu_3-\mu(\mu^2+\sigma^2) & \sigma^2 \end{array} \right]
\end{displaymath}

Define $f(x_1,x_2) = x_1-x_2^2$. Then $s^2 = f(Y_n)$ and the gradient of $f$ has components $(1,-2x_2)$. This leads to

\begin{displaymath}n^{1/2}(s^2-\sigma^2) \approx n^{1/2}(1, -2\mu)\left[\begin{array}{c} \overline{X^2} -
(\mu^2 + \sigma^2)
\\
\bar{X} -\mu
\end{array}\right]
\end{displaymath}

which converges in distribution to the law of $(1,-2\mu) Y$, which is $N(0,a^t \Sigma a)$ where $a=(1,-2\mu)^t$. This boils down to $N(0, \mu_4-\sigma^4)$.

Remark: In this sort of problem it is best to learn to recognize that the sample variance is unaffected by subtracting $\mu$ from each X. Thus there is no loss in assuming $\mu=0$ which simplifies $\Sigma$ and a.

Special case: if the observations are $N(\mu,\sigma^2)$ then $\mu_3 = 0$ and $\mu_4=3\sigma^4$. Our calculation gives

\begin{displaymath}n^{1/2} (s^2-\sigma^2) \Rightarrow N(0,2\sigma^4)
\end{displaymath}

You can divide through by $\sigma^2$ and get

\begin{displaymath}n^{1/2}(\frac{s^2}{\sigma^2}-1) \Rightarrow N(0,2)
\end{displaymath}

In fact $(n-1)s^2/\sigma^2$ has a $\chi_{n-1}^2$ distribution (when $s^2$ is defined with divisor $n-1$), and so the usual central limit theorem shows that

\begin{displaymath}(n-1)^{-1/2} [(n-1)s^2/\sigma^2 - (n-1)] \Rightarrow N(0,2)
\end{displaymath}

(using the fact that a $\chi^2_1$ variable has mean 1 and variance 2). Factoring out $n-1$ gives the assertion that

\begin{displaymath}(n-1)^{1/2}(s^2/\sigma^2-1) \Rightarrow N(0,2)
\end{displaymath}

which is our $\delta$ method calculation except that it uses $n-1$ instead of $n$. This difference is unimportant, as can be checked using Slutsky's theorem.
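
This limit law is easy to check by simulation. Here is a minimal Python sketch (numpy assumed; the sample size n, the number of replications N, and the values of $\mu$ and $\sigma$ are arbitrary illustrative choices):

\begin{verbatim}
import numpy as np

# Monte Carlo check that n^{1/2}(s^2 - sigma^2) is roughly N(0, 2 sigma^4)
# for normal data.  n, N, mu, sigma are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, N, mu, sigma = 200, 20000, 3.0, 2.0
X = rng.normal(mu, sigma, size=(N, n))
s2 = (X**2).mean(axis=1) - X.mean(axis=1)**2  # s^2 = mean(X^2) - (mean X)^2
Z = np.sqrt(n) * (s2 - sigma**2)
print(Z.var(), 2 * sigma**4)  # sample variance should be close to 2 sigma^4
\end{verbatim}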

The next section of material is extra.

Monte Carlo

The last method of distribution theory that I will review is Monte Carlo simulation. Suppose you have some random variables $X_1,\ldots,X_n$ whose joint distribution is specified and a statistic $T(X_1,\ldots,X_n)$ whose distribution you want to know. To compute something like $P(T > t)$ for some specific value of $t$ we appeal to the limiting relative frequency interpretation of probability: $P(T>t)$ is the limit of the proportion of trials, in a long sequence of trials, in which $T > t$ occurs. We use a (pseudo) random number generator to generate a sample $X_1,\ldots,X_n$ and then calculate the statistic, getting $T_1$. Then we generate a new sample (independently of our first, say) and calculate $T_2$. We repeat this a large number of times, say $N$, and just count up how many of the $T_k$ are larger than $t$. If there are $M$ such $T_k$ we estimate that $P(T>t) = M/N$.

The quantity $M$ has a Binomial$(N, p=P(T>t))$ distribution. The variance of $M/N$ is then $p(1-p)/N$, which is estimated by $M(N-M)/N^3$; the standard error is the square root of this. This permits us to assess the accuracy of our study.

Notice that the standard deviation of $M/N$ is $\sqrt{p(1-p)}/\sqrt{N}$, so that improving the accuracy by a factor of 2 requires 4 times as many samples. This makes Monte Carlo a relatively time-consuming method of calculation. There are a number of tricks to make the method more accurate (though they only change the constant of proportionality; the SE is still inversely proportional to the square root of the sample size).
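
Here is a minimal Python sketch of the whole recipe (the statistic $T$, taken to be the mean of $n$ Exponential(1) variables, and the constants $n$, $N$ and $t$ are all arbitrary illustrative choices):

\begin{verbatim}
import numpy as np

# Estimate P(T > t) by Monte Carlo, where T is the mean of n iid
# Exponential(1) variables.  n, N and t are arbitrary illustrative choices.
rng = np.random.default_rng(1)
n, N, t = 10, 100000, 1.3
T = rng.exponential(1.0, size=(N, n)).mean(axis=1)  # N replicates of T
M = np.sum(T > t)
p_hat = M / N
se = np.sqrt(M * (N - M) / N**3)  # estimated standard error of M/N
print(p_hat, se)
\end{verbatim}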

Generating the Sample

Most computer languages have a facility for generating pseudo uniform random numbers, that is, variables $U$ which have (approximately, of course) a Uniform[0,1] distribution. Other distributions are generated by transformation:

Exponential: $X=-\log U$ has an exponential distribution:

\begin{displaymath}P(X > x) = P(-\log(U) > x) =P(U \le e^{-x})=e^{-x}
\end{displaymath}
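
In Python this transformation is a one-liner (a minimal sketch, with numpy assumed):

\begin{verbatim}
import numpy as np

# Exponential(1) variables via X = -log(U) with U ~ Uniform[0,1].
rng = np.random.default_rng(2)
U = rng.uniform(size=100000)
X = -np.log(U)
print(X.mean(), X.var())  # both should be near 1 for Exponential(1)
\end{verbatim}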

Random uniforms generated on the computer sometimes have only 6 or 7 digits or so of detail. This can make the tail of your distribution grainy. If $U$ were actually a multiple of $10^{-6}$, for instance, then the largest possible value of $X$ would be $6\log(10)$. There are algorithms which ameliorate this problem.

Normal: In general, if $F$ is a continuous cdf and $U$ is Uniform[0,1], then $Y=F^{-1}(U)$ has cdf $F$ because

\begin{displaymath}P(Y \le y) = P(F^{-1}(U) \le y) = P(U \le F(y))= F(y)
\end{displaymath}

This is almost the technique used for the exponential distribution (there $X=-\log U = F^{-1}(1-U)$, and $1-U$ is also Uniform[0,1]). For the normal distribution $F=\Phi$ ($\Phi$ is a common notation for the standard normal cdf) there is no closed form for $F^{-1}$. You could use a numerical algorithm to compute $F^{-1}$, or you could use the following Box-Müller trick. Generate $U_1, U_2$, two independent Uniform[0,1] variables. Define $Y_1 = \sqrt{-2\log(U_1)} \cos(2\pi U_2)$ and $Y_2 = \sqrt{-2\log(U_1)} \sin(2\pi U_2)$. Then you can check, using the change of variables formula, that $Y_1$ and $Y_2$ are independent N(0,1) variables.
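
A short Python sketch of the Box-Müller recipe (the sample size is an arbitrary choice):

\begin{verbatim}
import numpy as np

# Box-Muller: two independent uniforms give two independent N(0,1) variables.
rng = np.random.default_rng(3)
U1 = rng.uniform(size=50000)
U2 = rng.uniform(size=50000)
R = np.sqrt(-2 * np.log(U1))
Y1 = R * np.cos(2 * np.pi * U2)
Y2 = R * np.sin(2 * np.pi * U2)
print(Y1.mean(), Y1.var(), np.corrcoef(Y1, Y2)[0, 1])  # near 0, 1, 0
\end{verbatim}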

Acceptance Rejection

If you can't easily calculate $F^{-1}$ but you know the density $f$, you can try the acceptance-rejection method. Find a density $g$ and a constant $c$ such that $f(x) \le cg(x)$ for each $x$, and such that $G^{-1}$ is computable or you otherwise know how to generate observations $W_1, W_2, \ldots$ independently from $g$. Generate $W_1$. Compute $p=f(W_1)/(cg(W_1)) \le 1$. Generate a Uniform[0,1] random variable $U_1$ independent of all the $W$s and let $Y=W_1$ if $U_1 \le p$. Otherwise get a new $W$ and a new $U$ and repeat until you find a $U_i \le f(W_i)/(cg(W_i))$. Let $Y$ be the last $W$ you generated. This $Y$ has density $f$.
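
Here is a Python sketch; the particular choice of target and envelope (half-normal $f$ with Exponential(1) envelope $g$, for which $c=\sqrt{2e/\pi}$ makes $f \le cg$) is an illustrative assumption, not part of the general method:

\begin{verbatim}
import math
import numpy as np

# Acceptance-rejection: target f = half-normal, envelope g = Exponential(1).
# c = sqrt(2e/pi) satisfies f(x) <= c g(x) for all x >= 0.
rng = np.random.default_rng(4)
c = math.sqrt(2 * math.e / math.pi)

def f(x):  # half-normal density on [0, infinity)
    return math.sqrt(2 / math.pi) * math.exp(-x**2 / 2)

def g(x):  # Exponential(1) density
    return math.exp(-x)

def draw():
    while True:
        W = rng.exponential(1.0)  # W has density g
        if rng.uniform() <= f(W) / (c * g(W)):
            return W              # accepted: W has density f

sample = np.array([draw() for _ in range(20000)])
print(sample.mean(), math.sqrt(2 / math.pi))  # half-normal mean
\end{verbatim}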

Markov Chain Monte Carlo

In the last 10 years the following tactic has become popular, particularly for generating multivariate observations. If $W_1, W_2, \ldots$ is an (ergodic) Markov chain with stationary transitions and the stationary initial distribution of $W$ has density $f$, then you can get random variables which have the marginal density $f$ by starting off the Markov chain and letting it run for a long time: the marginal distributions of the $W_i$ converge to $f$. So you can estimate things like $\int_A f(x) dx$ by computing the fraction of the $W_i$ which land in $A$.

There are now many versions of this technique, including Gibbs sampling and the Metropolis-Hastings algorithm. (The technique was invented in the 1950s by physicists: Metropolis et al. One of the authors of that paper was Edward Teller, the ``father of the hydrogen bomb''.)
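
Here is a hedged Python sketch of a random walk Metropolis-Hastings chain; the target N(0,1), the proposal step size, the chain length and the burn-in are all arbitrary illustrative choices:

\begin{verbatim}
import numpy as np

# Random-walk Metropolis chain whose stationary density is N(0,1).
rng = np.random.default_rng(5)

def log_f(x):  # log of the target density, up to an additive constant
    return -x**2 / 2

chain = np.empty(50000)
w = 0.0
for i in range(chain.size):
    proposal = w + rng.normal(0.0, 1.0)  # symmetric proposal
    if np.log(rng.uniform()) <= log_f(proposal) - log_f(w):
        w = proposal                     # accept the move
    chain[i] = w

kept = chain[5000:]             # discard an initial burn-in
print(kept.mean(), kept.var())  # should be near 0 and 1
\end{verbatim}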

Importance Sampling

If you want to compute

\begin{displaymath}\theta \equiv E(T(X)) = \int T(x) f(x) dx
\end{displaymath}

you can instead generate observations $X_1,\ldots,X_n$ from a different density $g$ and then compute

\begin{displaymath}\hat\theta = n^{-1} \sum T(X_i) f(X_i)/g(X_i)
\end{displaymath}

Then
\begin{align*}E(\hat\theta) &= n^{-1} \sum E(T(X_i) f(X_i)/g(X_i))
\\
& = \int [T(x) f(x)/g(x)] g(x) dx
\\
& = \int T(x) f(x) dx
\\
& = \theta
\end{align*}
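
A small Python sketch of this identity; as illustrative assumptions I take $T(x)=x^2$, $f$ the N(0,1) density, and $g$ the heavier-tailed N(0,4) density (note $g$ must be positive wherever $Tf$ is nonzero):

\begin{verbatim}
import numpy as np

# Importance sampling: draw from g, reweight by f/g to estimate E_f(T(X)).
rng = np.random.default_rng(6)
n = 100000
X = rng.normal(0.0, 2.0, size=n)            # X_i drawn from g = N(0,4)
f = np.exp(-X**2 / 2) / np.sqrt(2 * np.pi)  # target density f = N(0,1)
g = np.exp(-X**2 / 8) / np.sqrt(8 * np.pi)  # sampling density g = N(0,4)
theta_hat = np.mean(X**2 * f / g)           # estimate of E_f(X^2)
print(theta_hat)                            # should be near 1
\end{verbatim}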

Variance reduction

Consider the problem of estimating the distribution of the sample mean for a Cauchy random variable. The Cauchy density is

\begin{displaymath}f(x) = \frac{1}{\pi(1+x^2)}
\end{displaymath}

We generate uniforms $U_1, \ldots, U_n$ and then define $X_i = \tan(\pi(U_i-1/2))$. Then we compute $T=\bar{X}$. Now to estimate $p=P(T > t)$ we would use

\begin{displaymath}\hat p = \sum_{i=1}^N 1(T_i > t) /N
\end{displaymath}

after generating N samples of size n. This estimate is unbiased and has standard error $\sqrt{p(1-p)/N}$.

We can improve this estimate by remembering that $-X_i$ also has a Cauchy distribution. Take $S_i=-T_i$; then $S_i$ has the same distribution as $T_i$. Then we try (for $t>0$)

\begin{displaymath}\tilde p = [\sum_{i=1}^N 1(T_i > t) + \sum_{i=1}^N 1(S_i > t) ]/(2N)
\end{displaymath}

which is the average of two estimates like $\hat p$. The variance of $\tilde p$ is

\begin{displaymath}(4N)^{-1} Var(1(T_i > t)+1(S_i > t)) = (4N)^{-1} Var(1(\vert T_i\vert > t))
\end{displaymath}

which is

\begin{displaymath}\frac{2p(1-2p)}{4N} = \frac{p(1-2p)}{2N}
\end{displaymath}

Notice that the variance has an extra 2 in the denominator and that the numerator is also smaller, particularly for $p$ near 1/2. So this method of variance reduction means that fewer than half as many samples are needed to achieve the same accuracy.
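
Here is a Python sketch of both estimates ($n$, $N$ and $t$ are arbitrary illustrative choices; the exact value of $p$ is printed for comparison):

\begin{verbatim}
import numpy as np

# Antithetic estimate of p = P(T > t) for T the mean of n Cauchy variables.
rng = np.random.default_rng(7)
n, N, t = 10, 100000, 1.0
U = rng.uniform(size=(N, n))
X = np.tan(np.pi * (U - 0.5))  # Cauchy via the inverse cdf
T = X.mean(axis=1)             # the mean of n Cauchys is again Cauchy
p_hat = np.mean(T > t)                             # basic estimate
p_tilde = (np.mean(T > t) + np.mean(-T > t)) / 2   # antithetic estimate
print(p_hat, p_tilde, 0.5 - np.arctan(t) / np.pi)  # exact p for comparison
\end{verbatim}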

Regression estimates

Suppose we want to compute

\begin{displaymath}\theta = E(\vert Z\vert)
\end{displaymath}

where $Z$ is standard normal. We generate $N$ iid N(0,1) variables $Z_1,\ldots,Z_N$ and compute $\hat\theta = \sum\vert Z_i\vert/N$. But we know that $E(Z_i^2) = 1$, and it is easy to see that $\hat\theta$ is positively correlated with $\sum Z_i^2 / N$. So we consider using

\begin{displaymath}\tilde\theta = \hat\theta -c(\sum Z_i^2/N-1)
\end{displaymath}

Notice that $E(\tilde\theta) = \theta$ and

\begin{displaymath}Var(\tilde\theta) = Var(\hat\theta) -2c Cov(\hat\theta, \sum Z_i^2/N)
+c^2 Var(\sum Z_i^2/N)
\end{displaymath}

The value of c which minimizes this is

\begin{displaymath}c=\frac{Cov(\hat\theta, \sum Z_i^2/N)}{Var(\sum Z_i^2/N)}
\end{displaymath}

and this value can be estimated by regressing the $\vert Z_i\vert$ on the $Z_i^2$!
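
A Python sketch of the whole control variate calculation ($N$ is an arbitrary choice; the true value $E\vert Z\vert = \sqrt{2/\pi}$ is printed for comparison):

\begin{verbatim}
import numpy as np

# Control variate: estimate theta = E|Z| using Z^2 (known mean 1) as control.
rng = np.random.default_rng(8)
N = 100000
Z = rng.normal(size=N)
theta_hat = np.abs(Z).mean()
# c = Cov(|Z|, Z^2)/Var(Z^2), i.e. the regression slope of |Z| on Z^2.
c = np.cov(np.abs(Z), Z**2)[0, 1] / np.var(Z**2, ddof=1)
theta_tilde = theta_hat - c * ((Z**2).mean() - 1)
print(theta_hat, theta_tilde, np.sqrt(2 / np.pi))  # true value E|Z|
\end{verbatim}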

Statistical Inference

Definition: A model is a family $\{ P_\theta; \theta \in \Theta\}$ of possible distributions for some random variable $X$. (Our data set is $X$, so $X$ will generally be a big vector or matrix, or an even more complicated object.)

We will assume throughout this course that the true distribution $P$ of $X$ is in fact some $P_{\theta_0}$ for some $\theta_0 \in \Theta$. We call $\theta_0$ the true value of the parameter. Notice that this assumption will be wrong; we hope it is not wrong in an important way. If we are very worried that it is wrong, we enlarge our model, putting in more distributions and making $\Theta$ bigger.

Our goal is to observe the value of X and then guess $\theta_0$ or some property of $\theta_0$. We will consider the following classic mathematical versions of this:

1.
Point estimation: we must compute an estimate $\hat\theta =
\hat\theta(X)$ which lies in $\Theta$ (or something close to $\Theta$).

2.
Point estimation of a function of $\theta$: we must compute an estimate $\hat\phi = \hat\phi(X)$ of $\phi=g(\theta)$.

3.
Interval (or set) estimation. We must compute a set C=C(X) in $\Theta$ which we think will contain $\theta_0$.

4.
Hypothesis testing: We must decide whether or not $\theta_0\in\Theta_0$ where $\Theta_0 \subset \Theta$.

5.
Prediction: we must guess the value of an observable random variable $Y$ whose distribution depends on $\theta_0$. Typically $Y$ is the value of the variable $X$ in a repetition of the experiment.

There are several schools of statistical thinking with different views on how these problems should be approached.

For the next several weeks we do only the Neyman-Pearson approach, though we use that approach to evaluate the quality of likelihood methods.





Richard Lockhart
1999-10-14