
Summarizing the relationship - fitting a straight line to data

This is the most common objective method for fitting a special type of curve (a straight line) to a set of data. The ideas behind this method can be readily extended to more complex cases.

This method is so popular because the computations behind it can be done by hand (for small problems) and are easily programmed into a computer.

However, just because you have a hammer doesn't mean that everything is a nail. Every problem should be investigated carefully to see if the technique is appropriate.

This method involves fitting a straight line to the data points to obtain the best fit. We assume that you know the equation of a line, but will review some important properties.

Equation for a line In previous courses at high school or in linear algebra, the equation of a straight line is often written $ y = mx +b$ where $ m$ is the slope and $ b$ is the intercept.

Just to be difficult (just kidding), statisticians usually write a fitted linear relationship between $ X$ and $ Y$ as $ \widehat{Y} = b_{0} + b_{1} X$ where $ b_0$ is the intercept and $ b_1$ is the slope. There is a good reason for this notation - in more advanced classes, you will see that it extends easily to more complex cases, such as a model with several predictors, $ \widehat{Y} = b_{0} + b_{1} X_{1} + b_{2} X_{2}$, whereas the former does not.

Use JMP here to fit a line to the cereal data and interpret the slopes and intercepts.

How is the line fit? How is the best fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible.

Mathematically, the least-squares line is the line that minimizes $ {1\over n} \sum \left( Y_i-\widehat{Y}_i \right)^2 $. Because $ n$ is fixed, this is the same line that minimizes the sum of squares itself. This formal definition is not that important - the concept in the previous paragraph is important.
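To make the criterion concrete, the short Python sketch below computes the sum of squared vertical deviations for a few candidate lines. The data values, the candidate lines, and the function name `sum_squared_deviations` are made up for illustration and do not come from the course data.

```python
# Hypothetical data, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sum_squared_deviations(b0, b1, xs, ys):
    """Sum of squared vertical deviations of the points from the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# A few candidate (intercept, slope) pairs to compare.
candidates = [(2.2, 0.6), (2.0, 0.6), (2.2, 0.7), (1.0, 1.0)]

for b0, b1 in candidates:
    sse = sum_squared_deviations(b0, b1, xs, ys)
    print(f"intercept={b0}, slope={b1}, sum of squares={sse:.2f}")

# The candidate with the smallest sum of squares is the best of the four.
best = min(candidates, key=lambda c: sum_squared_deviations(c[0], c[1], xs, ys))
```

For these made-up points, the line with intercept 2.2 and slope 0.6 gives the smallest sum of squares of the four candidates (about 2.4). The least-squares line is the one that wins this comparison against every possible line, not just a handful of candidates.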

It is possible to write out a formula for the estimated intercept and slope, but who cares - let the computer do the dirty work.
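For the curious, the standard formulas are $ b_{1} = \sum (X_i-\overline{X})(Y_i-\overline{Y}) / \sum (X_i-\overline{X})^2$ and $ b_{0} = \overline{Y} - b_{1}\overline{X}$. The following is a minimal Python sketch of that computation, not a replacement for a statistics package. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1), the least-squares intercept and slope for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # estimated slope (requires the X values not all equal)
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1


# Hypothetical data, invented purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(xs, ys)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

For these made-up points the sketch gives an intercept of about 2.2 and a slope of about 0.6. In practice a package such as JMP reports these estimates along with much more information.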

The equation of the fitted line is $ \widehat{Y}_{i}=b_{0} +b_{1} X_{i}$ where $ b_{0}$ is the estimated intercept, and $ b_{1}$ is the estimated slope. The symbol $ \widehat{Y}$ indicates that we are referring to the estimated line and not to a line in the entire population.

Show how to fit the straight line in JMP and how to extract information from the summary table shown by JMP.

Predictions Once the best fitting line is found, it can be used to make predictions for new values of $ X$. All that is done is to substitute the new value of $ X$ into the equation and compute the predicted value $ \widehat{Y}$.
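As a small illustration of the substitution step, the Python sketch below plugs new values of $ X$ into a fitted line. The intercept and slope used here (2.2 and 0.6) and the function name `predict` are hypothetical values chosen for the example, not estimates from the course data.

```python
# Hypothetical fitted coefficients, assumed for illustration only.
b0 = 2.2  # estimated intercept
b1 = 0.6  # estimated slope


def predict(x_new):
    """Predicted value Y-hat obtained by substituting x_new into the fitted line."""
    return b0 + b1 * x_new


for x_new in [2.5, 4.0, 6.0]:
    print(f"X = {x_new}: predicted Y = {predict(x_new):.2f}")
```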

Residuals After any curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e. the residual from fitting a straight line is found as: $ \mathrm{residual}_{i} = Y_{i} - (b_{0}+b_{1}X_{i}) = Y_{i}-\widehat{Y}_{i}$.
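The residual calculation can be sketched in a few lines of Python. The data below are made up for illustration, and the coefficients 2.2 and 0.6 are the least-squares intercept and slope for those made-up points. None of these values come from the course data.

```python
# Hypothetical data and the least-squares line fitted to it (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = 2.2, 0.6

# Predicted value for each observed X, then observed minus predicted.
predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, r in zip(xs, ys, predicted, residuals):
    print(f"X={x}, Y={y}, predicted={y_hat:.2f}, residual={r:+.2f}")

# For a least-squares line with an intercept, the residuals sum to zero
# (up to floating-point rounding).
print(f"sum of residuals = {sum(residuals):.10f}")
```

A positive residual means the point lies above the fitted line and a negative residual means it lies below. Because the residuals from a least-squares line with an intercept always sum to zero, it is their pattern, not their total, that indicates whether the line is a reasonable summary.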

A residual plot can be constructed and will be explained later in the course. Residual plots are useful to see if the fitted line is a reasonable summary of the data.


Copyright 2008: Carl J. Schwarz cschwarz@stat.sfu.ca