STAT 450 Lecture 15
Reading for Today's Lecture:
Goals of Today's Lecture:
- Introduce basic goals of statistical inference.
- Define likelihood, log-likelihood, score functions.
- Introduce maximum likelihood estimation.
Today's notes
Statistical Inference
Definition: A model is a family $\{P_\theta : \theta \in \Theta\}$
of possible distributions for some random variable X. (Our
data set is X, so X will generally be a big vector or matrix or even
more complicated object.)
We will assume throughout this course that the true distribution P of X is in fact some
$P_{\theta_0}$ for some $\theta_0 \in \Theta$.
We call $\theta_0$ the true value of the parameter. Notice that this assumption
will be wrong; we hope it is not wrong in an important way. If we are very
worried that it is wrong, we enlarge our model, putting in more distributions
and making $\Theta$ bigger.
Our goal is to observe the value of X and then guess $\theta_0$ or some
property of $\theta_0$.
We will consider the following classic mathematical
versions of this:
1. Point estimation: we must compute an estimate $\hat\theta = \hat\theta(X)$
which lies in $\Theta$ (or something close to $\Theta$).
2. Point estimation of a function of $\theta$:
we must compute an estimate $\hat\phi = \hat\phi(X)$ of $\phi = g(\theta)$.
3. Interval (or set) estimation. We must compute a set $C = C(X)$ in $\Theta$
which we think will contain $\theta_0$.
4. Hypothesis testing: we must decide whether or not $\theta_0 \in \Theta_0$,
where $\Theta_0 \subset \Theta$.
5. Prediction: we must guess the value of an observable random
variable Y whose distribution depends on $\theta_0$.
Typically Y is the value of the variable X in a repetition of the experiment.
There are several schools of statistical thinking with different views
on how these problems should be approached. The main schools of thought may
be summarized roughly as follows:
- Neyman Pearson: In this school of thought a statistical
procedure should be evaluated by its long run frequency performance.
You imagine repeating the data collection exercise many times,
independently. The quality of a procedure is measured by its average
performance when the true distribution of the X values is $P_{\theta_0}$.
- Bayes: In this school of thought we treat $\theta$ as being
random, just like X. We compute the conditional distribution of what
we don't know given what we do know. In particular we ask how a procedure
will work on the data we actually got - no averaging over data we might
have got.
- Likelihood: This school tries to combine the previous two
by looking only at the data we actually got but trying to avoid
treating $\theta$ as random.
For the next several weeks we do only the Neyman Pearson approach,
though we use that approach to evaluate the quality of likelihood
methods.
Likelihood
Suppose you toss a coin 6 times and get Heads twice. If p is the
probability of getting H, then the probability of getting 2 heads in 6 tosses is
\[ \binom{6}{2} p^2 (1-p)^4 = 15 p^2 (1-p)^4 . \]
This probability, thought of as a function of p, is the likelihood
function for this particular data.
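As a quick numerical illustration (a minimal sketch, not part of the original notes; the grid size is arbitrary), we can evaluate this likelihood on a grid of p values and locate its peak:

import numpy as np

# Likelihood of 2 heads in 6 tosses, as a function of p
def likelihood(p):
    return 15 * p**2 * (1 - p)**4

p = np.linspace(0, 1, 1001)              # grid of candidate values of p
L = likelihood(p)
print("peak near p =", p[np.argmax(L)])  # about 0.333, i.e. 2/6

Calculus gives the same answer: the derivative of $p^2(1-p)^4$ vanishes at $p = 1/3$, the observed proportion of heads.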
Definition: A model is a family $\{P_\theta : \theta \in \Theta\}$
of possible distributions for some random variable X.
Typically the model is described by specifying
$\{f(x; \theta) : \theta \in \Theta\}$, the set of possible densities of X.
Definition: The likelihood function is the function L whose domain
is $\Theta$ and whose values are given by
\[ L(\theta) = f(X; \theta) . \]
The key point is to think about how the density depends on $\theta$, not
about how it depends on X. Notice that X, the observed value of the
data, has been plugged into the formula for the density. Notice also
that the coin tossing example is like this but with f being the discrete
density. We use the likelihood in most of our inference problems:
1. Point estimation: we must compute an estimate $\hat\theta = \hat\theta(X)$
which lies in $\Theta$.
The maximum likelihood estimate (MLE) of $\theta$ is the value $\hat\theta$
which maximizes $L(\theta)$ over $\theta \in \Theta$, if such a $\hat\theta$ exists.
2. Point estimation of a function of $\theta$:
we must compute an estimate $\hat\phi = \hat\phi(X)$ of $\phi = g(\theta)$.
We use $\hat\phi = g(\hat\theta)$, where $\hat\theta$ is the MLE of $\theta$.
3. Interval (or set) estimation. We must compute a set $C = C(X)$ in $\Theta$
which we think will contain $\theta_0$.
We will use $C = \{\theta : L(\theta) \ge c\}$ for a suitable c.
4. Hypothesis testing: we must decide whether or not $\theta_0 \in \Theta_0$,
where $\Theta_0 \subset \Theta$.
We base our decision on the likelihood ratio
\[ \frac{\sup\{L(\theta) : \theta \in \Theta_0\}}{\sup\{L(\theta) : \theta \in \Theta \setminus \Theta_0\}} . \]
Maximum Likelihood Estimation
To find an MLE we maximize L. This is a typical function
maximization problem which we approach by setting the gradient
of L equal to 0 and then checking to see that the root is
a maximum, not a minimum or saddle point.
We begin by examining some likelihood plots in examples:
Cauchy Data
We have a sample $X_1, \ldots, X_n$ from the Cauchy density
\[ f(x; \theta) = \frac{1}{\pi (1 + (x - \theta)^2)} . \]
The likelihood function is
\[ L(\theta) = \prod_{i=1}^n \frac{1}{\pi (1 + (X_i - \theta)^2)} . \]
Here are some plots of this function for 6 samples of size 5.
Here are close up views of these plots for $\theta$ between -2 and 2.
Now for sample size 25.
Here are close up views of these plots for $\theta$ between -2 and 2.
I want you to notice the following points:
- The likelihood functions have peaks near the true value of $\theta$
(which is 0 for the data sets I generated).
- The peaks are narrower for the larger sample size.
- The peaks have a more regular shape for the larger value of n.
- I actually plotted $L(\theta)/L(\hat\theta)$, which has exactly
the same shape as L but runs from 0 to 1 on the vertical scale.
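The original figures are not reproduced here, but a sketch along the following lines (numpy and matplotlib assumed; the seed and panel layout are illustrative, not the original data sets) regenerates pictures of this kind, rescaled so each peak is at 1 as described above:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)           # seed chosen arbitrarily
theta = np.linspace(-10, 10, 2001)       # grid of theta values

fig, axes = plt.subplots(2, 3, figsize=(10, 6))
for ax in axes.ravel():
    x = rng.standard_cauchy(size=5)      # one sample of size 5, true theta = 0
    # L(theta) = prod_i 1 / (pi * (1 + (X_i - theta)^2))
    L = np.prod(1.0 / (np.pi * (1 + (x[:, None] - theta)**2)), axis=0)
    ax.plot(theta, L / L.max())          # rescaled so the peak is at 1
fig.savefig("cauchy-likelihoods.png")

Changing size=5 to size=25 gives the larger-sample versions.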
To maximize this likelihood we would have to differentiate L and
set the result equal to 0. Notice that L is a product of n terms;
by the product rule the derivative is then
\[ L'(\theta) = \sum_{i=1}^n \frac{2(X_i - \theta)}{\pi (1 + (X_i - \theta)^2)^2}
\prod_{j \ne i} \frac{1}{\pi (1 + (X_j - \theta)^2)} , \]
which is quite unpleasant. It is much easier to work with the logarithm
of L, since the log of a product is a sum and the logarithm is monotone
increasing.
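There is also a practical reason to prefer the logarithm: for even moderately large n the product of densities underflows double precision arithmetic, while the sum of logs stays well scaled. A hypothetical sketch (the seed and n are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(size=500)        # a large sample, true theta = 0
theta = 0.0

terms = 1.0 / (np.pi * (1 + (x - theta)**2))
print(np.prod(terms))                    # underflows to 0.0 at this n
print(np.sum(np.log(terms)))             # log-likelihood: an ordinary number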
Definition: The Log Likelihood function is
\[ \ell(\theta) = \log L(\theta) . \]
For the Cauchy problem we have
\[ \ell(\theta) = -\sum_{i=1}^n \log(1 + (X_i - \theta)^2) - n \log \pi . \]
Here are the logarithms of the likelihoods plotted above:
I want you to notice the following points:
- The log likelihood functions with n=25 have pretty smooth
shapes which look rather parabolic.
- For n=5 there are plenty of local maxima and minima of $\ell$.
You can see that the likelihood will tend to 0 as $\theta \to \pm\infty$,
so that the maximum of $\ell$ will occur at a root of $\ell'$,
the derivative of $\ell$ with respect to $\theta$.
Definition: The Score Function is the gradient of $\ell$:
\[ U(\theta) = \frac{\partial \ell}{\partial \theta} . \]
The MLE $\hat\theta$ usually solves the Likelihood Equations
\[ U(\theta) = 0 . \]
In our Cauchy example we find
\[ U(\theta) = \sum_{i=1}^n \frac{2 (X_i - \theta)}{1 + (X_i - \theta)^2} . \]
Here are some plots of the score functions for n=5 for our
Cauchy data sets. Each score is plotted beneath a plot of the
corresponding $\ell$.
Notice that there are often multiple roots of the likelihood equations.
Here is n=25:
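One way to exhibit these multiple roots numerically is to scan U for sign changes and refine each bracket with a root finder. A sketch, assuming scipy is available (the seed and sample are illustrative, not the data sets plotted above):

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.standard_cauchy(size=5)          # sample of size 5, true theta = 0

def U(theta):
    # Cauchy score: sum of 2 (X_i - theta) / (1 + (X_i - theta)^2)
    d = x - theta
    return np.sum(2 * d / (1 + d**2))

grid = np.linspace(-10, 10, 4001)
vals = np.array([U(t) for t in grid])
brackets = np.nonzero(np.diff(np.sign(vals)) != 0)[0]
roots = [brentq(U, grid[i], grid[i + 1]) for i in brackets]
print("roots of U(theta) = 0:", roots)   # may find one root or several

Depending on the sample, the list may contain a single root or three or more; only one of them is the global maximum of $\ell$.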
The Binomial Distribution
If X has a Binomial(n, p) distribution then
\[ L(p) = \binom{n}{X} p^X (1-p)^{n-X} \]
and
\[ U(p) = \frac{X}{p} - \frac{n - X}{1 - p} . \]
The function L is 0 at p=0 and at p=1 unless X=0 or X=n, so for
$0 < X < n$ the MLE must be found by setting
U=0 and getting
\[ \hat p = \frac{X}{n} . \]
For X=n the log-likelihood has derivative
\[ U(p) = \frac{n}{p} > 0 \]
for all p, so that the likelihood is an increasing
function of p, which is maximized at $\hat p = 1$.
Similarly, when X=0 the maximum is at $\hat p = 0$.
In all cases, then, the MLE is $\hat p = X/n$.
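A quick numerical confirmation (a hypothetical sketch; the grid bounds just avoid log(0) at the endpoints) that the grid maximum of the log-likelihood lands at X/n, including the boundary cases:

import numpy as np

n = 6
p = np.linspace(1e-6, 1 - 1e-6, 100001)  # open interval: avoids log(0)
for X in (0, 2, 6):
    # log-likelihood up to the constant log C(n, X)
    ell = X * np.log(p) + (n - X) * np.log(1 - p)
    print(f"X={X}: grid max at p={p[np.argmax(ell)]:.4f}, X/n={X/n:.4f}")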
The Normal Distribution
Now we have $X_1, \ldots, X_n$ iid $N(\mu, \sigma^2)$. There are
two parameters, $\theta = (\mu, \sigma)$. We find
\[ \ell(\mu, \sigma) = -\frac{\sum (X_i - \mu)^2}{2\sigma^2} - n \log \sigma - \frac{n}{2} \log(2\pi) \]
and
\[ U(\mu, \sigma) = \begin{pmatrix} \sum (X_i - \mu)/\sigma^2 \\[4pt] \sum (X_i - \mu)^2/\sigma^3 - n/\sigma \end{pmatrix} . \]
Notice that U is a function with two components because $\theta$ has
two components.
Setting the score equal to 0 and solving gives
\[ \hat\mu = \bar X \]
and
\[ \hat\sigma = \left[ \frac{1}{n} \sum (X_i - \bar X)^2 \right]^{1/2} . \]
You need to check that this is actually a maximum. To do so
you compute one more derivative. The matrix H of second
derivatives of $\ell$ is
\[ H(\mu, \sigma) = \begin{pmatrix} -n/\sigma^2 & -2\sum (X_i - \mu)/\sigma^3 \\[4pt]
-2\sum (X_i - \mu)/\sigma^3 & -3\sum (X_i - \mu)^2/\sigma^4 + n/\sigma^2 \end{pmatrix} . \]
Plugging in the MLE gives
\[ H(\hat\mu, \hat\sigma) = \begin{pmatrix} -n/\hat\sigma^2 & 0 \\[4pt] 0 & -2n/\hat\sigma^2 \end{pmatrix} . \]
This matrix is negative definite: both its eigenvalues are negative.
So $(\hat\mu, \hat\sigma)$ must be a local maximum.
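A sketch (numpy assumed; the simulated data are illustrative) computing the MLEs and confirming the sign of the Hessian eigenvalues at the maximum:

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=100)    # iid N(1, 4) sample
n = len(x)

mu_hat = x.mean()                               # sample mean
sigma_hat = np.sqrt(np.mean((x - mu_hat)**2))   # divisor n, not n - 1

# H at the MLE is diagonal: diag(-n/sigma^2, -2n/sigma^2)
H = np.diag([-n / sigma_hat**2, -2 * n / sigma_hat**2])
print("MLEs:", mu_hat, sigma_hat)
print("eigenvalues of H:", np.linalg.eigvalsh(H))  # both negative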
Here is a contour plot of the normal log likelihood for two
data sets with n=10 and n=100.
Here are perspective plots of the same.
Notice that the contours are quite ellipsoidal for
the larger sample size.
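Contour plots of this kind can be regenerated with a sketch like the following (matplotlib assumed; the simulated samples stand in for the original data sets):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
mu, sigma = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(0.5, 2, 200))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, n in zip(axes, (10, 100)):
    x = rng.normal(size=n)               # true mu = 0, sigma = 1
    # log-likelihood, dropping the constant -(n/2) log(2 pi)
    ell = -0.5 * ((x[:, None, None] - mu)**2).sum(axis=0) / sigma**2 \
          - n * np.log(sigma)
    ax.contour(mu, sigma, ell, levels=30)
    ax.set_title(f"n = {n}")
    ax.set_xlabel("mu")
    ax.set_ylabel("sigma")
fig.savefig("normal-loglik-contours.png")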
We now turn to theory to explain the features of these
plots, at least approximately in large samples.
Richard Lockhart
1999-10-14