Next: Summarizing the relationship - Up: X=interval or ratio variable, Previous: X=interval or ratio variable,   Contents

Scatterplots

A scatterplot is used when both variables are interval or ratio scale.

Usually the response variable is plotted along the vertical axis and the explanatory variable is plotted along the horizontal axis. It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn't matter which goes on which axis (this usually happens only when finding the correlation between variables).

For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]

What to look for in a scatterplot

Overall pattern.
What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of another. The plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of another variable. The plot will have a downward slope. What happens when there is "no association" between the two variables?
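The definition above can be checked numerically. The following is a minimal sketch in Python (not JMP), assuming made-up fat and calorie values rather than the actual cereal dataset; the function name `association_direction` is hypothetical. It sums the products of deviations from the means: a product is positive when both values are above (or both below) their averages.

```python
from statistics import mean

def association_direction(xs, ys):
    """Classify the direction of association from the sign of the
    sum of products of deviations from the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Each product is positive when x and y fall on the same side of
    # their means, and negative when they fall on opposite sides.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if cross > 0:
        return "positive"
    if cross < 0:
        return "negative"
    return "none"

# Made-up illustrative values (NOT the cereal dataset).
fat = [0, 1, 1, 2, 3, 5]
calories = [90, 100, 110, 110, 120, 150]
print(association_direction(fat, calories))
```

With real data the sum is rarely exactly zero; "no association" shows up as a sum that is small relative to the spread of the two variables, and as a plot with no upward or downward drift.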

Form of the relationship.
Is it linear (the points seem to cluster around a straight line) or is it curvilinear (the points seem to form a curve)?

Strength of association.
Are the points clustered tightly around the curve? If the points have a lot of scatter above or below the trend line, then the association is not very strong. On the other hand, if the amount of scatter above or below the trend line is very small, then there is a strong association. Beware of plots with lots of white space around the data points. This is usually a sign that someone is trying to be dishonest about the strength of the relationship as portrayed by the data.
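One common numerical companion to judging scatter by eye is the correlation coefficient. The sketch below, in Python with made-up data, computes Pearson's r for a tightly clustered set of points and a loosely clustered one. The helper `pearson_r` and both datasets are assumptions for illustration, and r only measures the strength of a linear association.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data.
x = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # little scatter about a line
loose = [5, 1, 9, 4, 12, 8]               # upward trend, much more scatter

print(round(pearson_r(x, tight), 3))
print(round(pearson_r(x, loose), 3))
```

The tightly clustered points give an r close to 1, while the loosely clustered points give a noticeably smaller positive r.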

Outliers?
Are there any points that seem to be unusual? Outliers are values that are unusually far from the trend curve, i.e. they are farther away from the trend curve than you would expect from the usual level of scatter. There is no formal rule for detecting outliers; use common sense. [If you set the role of a variable to be a label, it is easy to identify such points.]

One's usual initial suspicion about any outlier is that it is a mistake, e.g. a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary. Do you keep the data point in even though it doesn't follow the trend line, or do you drop the data point because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier. If there is very little difference in the final outcome, don't worry about it.
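As a rough illustration of repeating an analysis with and without a suspect point, here is a minimal Python sketch that fits a least-squares slope twice. The data are made up, and the helper names `ls_slope` and `slope_with_and_without` are hypothetical; in class the same comparison would be done in JMP.

```python
from statistics import mean

def ls_slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def slope_with_and_without(xs, ys, idx):
    """Return (slope using all points, slope with point idx dropped)."""
    full = ls_slope(xs, ys)
    xs_dropped = xs[:idx] + xs[idx + 1:]
    ys_dropped = ys[:idx] + ys[idx + 1:]
    return full, ls_slope(xs_dropped, ys_dropped)

# Made-up data: the last point is the suspected outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 30]

full, dropped = slope_with_and_without(x, y, 5)
print(f"slope with outlier:    {full:.2f}")
print(f"slope without outlier: {dropped:.2f}")
```

Here the two slopes differ substantially, so the outlier matters and deserves a closer look. Had they been nearly equal, the choice to keep or drop the point would make little practical difference.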

In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole in the Antarctic was missed because the computers were programmed to ignore readings that were so low that 'they must be in error'!

Lurking variables.
A lurking variable is a third variable that is related to both variables and may confound the association.

For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental and driven by population growth.

Sometimes the lurking variable is a 'grouping' variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses.
[Figure lurking-var: salary versus years of experience for nurses]

The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.
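The same reversal can be reproduced with a few numbers. The sketch below is in Python and uses entirely made-up salary and experience values with hypothetical group labels "A" and "B" (they are not the data behind the nurses plot). It fits a least-squares slope within each group and then to the pooled data.

```python
from statistics import mean

def ls_slope(xs, ys):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Made-up records: (group, years of experience, salary in $1000s).
records = [
    ("A", 1, 50), ("A", 2, 52), ("A", 3, 54),
    ("B", 8, 30), ("B", 9, 32), ("B", 10, 34),
]

# Slope within each group (the lurking grouping variable held fixed).
group_slopes = {}
for g in sorted({r[0] for r in records}):
    xs = [r[1] for r in records if r[0] == g]
    ys = [r[2] for r in records if r[0] == g]
    group_slopes[g] = ls_slope(xs, ys)

# Slope when the grouping variable is ignored and the data are pooled.
pooled_slope = ls_slope([r[1] for r in records], [r[2] for r in records])

print(group_slopes)
print(round(pooled_slope, 2))
```

Within each group salary rises with experience, yet the pooled slope is negative because the more experienced group sits at a lower salary level overall. Plotting each group with its own symbol makes this visible at a glance.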

We will now demonstrate how to use JMP to give different fibre-groups a different symbol. [From the Rows menu, use Where to select rows. Then assign those rows different symbols using the Rows->Markers menu.]


Copyright 2008: Carl J. Schwarz cschwarz@stat.sfu.ca