# Chapter 10: Re-expressing Data: Get It Straight

Creating a model is a mechanical process; knowing when it is appropriate to use it is a thinking/analysis process. A useful model is the ultimate goal.

Linear models have tools that are relatively simple to understand and interpret: slope and y-intercept. We can verify that a linear model is appropriate by checking the conditions and looking at the residual plot. Curved models can be fit, but they are relatively more difficult to calculate.

First approach: make sure that a re-expression can be meaningful.

• Once we re-express, decide if the model is appropriate
  o Create the model
  o Plot the residuals
  o If there is still a curve, build another model
  o When the model has random, scattered residuals, we interpret using the model
• A scatter plot shows a mixture of “signal” and “noise”
  o Signal is the underlying association between the variables
  o Noise is the random variation unaccounted for by the association
  o Example: in a scatter plot of height and weight, tall people generally weigh more; however, not all 6 ft tall people weigh the same. That variation is the noise.
  o We want a regression model that describes the signal (the underlying relationship between height and weight)
  o The residual plot shows us the variation that is not explained by the model
  o If the plot is random (just noise), then we have captured the whole signal
  o If a curve remains in the residual plot, then we missed some of the signal, meaning we need to look for a better model
• When the appropriate model is found, then
  o Ask how strong the model is
  o Look at the pattern
  o R²: when interpreting, keep in mind that it still measures variability, but it is variability in the re-expressed variables and NOT the original
  o Correlation is the strength of a linear association, so discuss “r” only if the re-expression makes the relationship linear

GOALS of re-expression:
1. Make the distribution of a variable more symmetric.
2. Make the spread of several groups more alike, even if the centers differ.
3. Make the form of a scatter plot more nearly linear.
4. Make the scatter in a scatter plot spread out evenly rather than thickening at one end.
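For reference, the residuals and R² mentioned in the checklist can be written out with the standard definitions below. The symbols are notation added here, not taken from the notes: $y_i$ is an observed value, $\hat{y}_i$ is the model's prediction for it, and $\bar{y}$ is the mean of the observed values.

$$
e_i = y_i - \hat{y}_i, \qquad R^2 = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2}
$$

After a re-expression, $y_i$ stands for the re-expressed value (for example $\log y_i$), which is why R² then describes variability in the re-expressed variable rather than the original one.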

The Ladder of Powers (p. 227)

| Power | Name | Comment |
|---|---|---|
| 2 | Square of the data values | Try with unimodal distributions that are skewed to the left |
| 1 | Raw data | Data with positive and negative values and no bounds are less likely to benefit from re-expression |
| ½ | Square root of the data values | Counts often benefit from a square root re-expression |
| “0” | Logarithms | Measurements that cannot be negative often do well with a log re-expression |
| -1/2 | Reciprocal square root | An uncommon re-expression, but can be useful |
| -1 | Reciprocal of the data | Ratios of two quantities (e.g., mph) often do well with a reciprocal |
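As a rough illustration of how the rungs of the ladder translate into arithmetic, here is a minimal Python sketch. The function name `reexpress` and the sample values are made up for this example, and base-10 logs are an arbitrary choice for the “0” rung.

```python
import math

def reexpress(values, power):
    """Apply one rung of the Ladder of Powers to a list of numbers.

    power == 0 is the ladder's "0" rung, which stands for the logarithm
    (base 10 here, an arbitrary choice). Any other power p computes v ** p.
    """
    if power == 0:
        # Logs require strictly positive data; math.log10 raises otherwise.
        return [math.log10(v) for v in values]
    return [v ** power for v in values]

counts = [1, 4, 9, 16]                 # made-up values, for illustration only
print(reexpress(counts, 2))            # squares
print(reexpress(counts, 0.5))          # square roots
print(reexpress(counts, -1))           # reciprocals
print(reexpress([10, 100, 1000], 0))   # logarithms
```

Moving down the ladder (2, 1, ½, “0”, -1/2, -1) pulls in large values more and more strongly, which is why the lower rungs are the ones to try when a plot bends or spreads out at the high end.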

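Example 1: Graph the following and look at the scatter plot.

| Wt | MPG | Wt | MPG |
|---|---|---|---|
| 2700 | 25 | 2885 | 27 |
| 2305 | 33 | 2695 | 28 |
| 2300 | 34 | 2680 | 25 |
| 2485 | 26 | 2655 | 27 |
| 2260 | 32 | 3065 | 22 |
| 2345 | 29 | 2970 | 25 |
| 2325 | 26 | 2690 | 24 |
| 2340 | 35 | 2910 | 26 |
| 2675 | 28 | 2975 | 25 |
| 1900 | 34 | 2920 | 21 |
| 2355 | 25 | 2575 | 24 |
| 2055 | 35 | 2935 | 23 |
| 3110 | 20 | 2920 | 27 |
| 2885 | 27 | 3640 | 17 |
| 2850 | 19 | 3295 | 21 |
| 2695 | 30 | 3380 | 21 |
| 2175 | 33 | 3145 | 22 |
| 2215 | 30 | 3200 | 22 |
| 2790 | 25 | 3610 | 23 |
| 2640 | 26 | 2885 | 23 |
| 2485 | 28 | 4000 | 17 |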

1. Describe the shape. Is it linear? __________
2. Find the regression equation: __________
3. Look at the residual plot: __________
4. Do a re-expression by using the reciprocal of y: __________
5. Is the data more linear? __________ What does the residual plot look like? __________
6. What is the prediction line for the re-expressed data? __________
7. What is the predicted mpg for a car weighing 3500 lbs? __________
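The steps of Example 1 can also be sketched in Python instead of on a calculator. This is only one way to set it up: it assumes numpy is available, the helper name `fit_line` is made up here, and it assumes that the i-th weight in each block of the data pairs with the i-th MPG value.

```python
import numpy as np

# Data from Example 1. ASSUMPTION: within each block of the original table,
# the i-th weight is paired with the i-th MPG value.
wt = np.array([
    2700, 2305, 2300, 2485, 2260, 2345, 2325, 2340, 2675, 1900, 2355,
    2055, 3110, 2885, 2850, 2695, 2175, 2215, 2790, 2640, 2485,
    2885, 2695, 2680, 2655, 3065, 2970, 2690, 2910, 2975, 2920, 2575,
    2935, 2920, 3640, 3295, 3380, 3145, 3200, 3610, 2885, 4000,
], dtype=float)
mpg = np.array([
    25, 33, 34, 26, 32, 29, 26, 35, 28, 34, 25,
    35, 20, 27, 19, 30, 33, 30, 25, 26, 28,
    27, 28, 25, 27, 22, 25, 24, 26, 25, 21, 24,
    23, 27, 17, 21, 21, 22, 22, 23, 23, 17,
], dtype=float)

def fit_line(x, y):
    """Least-squares line y-hat = a + b*x; returns (a, b, residuals)."""
    b, a = np.polyfit(x, y, 1)        # polyfit returns the slope first
    residuals = y - (a + b * x)
    return a, b, residuals

# Steps 2-3: straight-line fit on the raw data, plus its residuals.
a_raw, b_raw, res_raw = fit_line(wt, mpg)

# Steps 4-6: re-express y as 1/MPG and fit again.
a_rec, b_rec, res_rec = fit_line(wt, 1.0 / mpg)

# Step 7: predict on the re-expressed scale, then undo the reciprocal.
pred_recip = a_rec + b_rec * 3500
pred_mpg = 1.0 / pred_recip

print(f"raw fit:        mpg-hat     = {a_raw:.3f} + {b_raw:.5f} * wt")
print(f"reciprocal fit: (1/mpg)-hat = {a_rec:.5f} + {b_rec:.7f} * wt")
print(f"predicted mpg at 3500 lbs: {pred_mpg:.1f}")
```

The residual arrays `res_raw` and `res_rec` are what you would plot against weight to judge whether a curve remains. Note the last step: the model predicts 1/MPG, so the prediction has to be inverted to get back to miles per gallon.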

Ex 2: Given the following data on population growth in the US, determine if a linear model is appropriate. Scale the years (i.e., make 1800 a 1 for the 1st year, 1825 a 2 for the 2nd year, etc.).

| Year | Population (in millions) |
|---|---|
| 1800 | 5 |
| 1825 | 11 |
| 1850 | 23 |
| 1875 | 44 |
| 1900 | 76 |
| 1925 | 114 |
| 1950 | 151 |
| 1975 | 215 |
| 2000 | 285 |

1) Enter the data into your calculator and look at the scatter plot. Describe.

2) Describe the residual plot:

3) Which method from the Ladder of Powers looks like it might be a good re-expression?

4) Do the re-expression and describe the results.

5) When you are satisfied with the re-expression, write out the model (the equation)

6) With the model, predict the population for 1960.
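The questions in Ex 2 can be explored with a short Python sketch that tries several rungs of the ladder and reports R² and the residuals for each. It is a sketch under stated assumptions, not the intended answer: numpy is assumed, the function names are made up here, the candidate powers (1, ½, “0”) are just examples, and 1960 is placed on the scaled axis by extending the 25-year scaling linearly (so 1960 becomes 7.4).

```python
import numpy as np

year = np.array([1800, 1825, 1850, 1875, 1900, 1925, 1950, 1975, 2000])
pop = np.array([5, 11, 23, 44, 76, 114, 151, 215, 285], dtype=float)  # millions

def scale_year(calendar_year):
    """1800 -> 1, 1825 -> 2, ...; other years fall in between (assumption)."""
    return (calendar_year - 1800) / 25 + 1

def forward(values, power):
    """Re-express values; power 0 means log base 10 (the "0" rung)."""
    return np.log10(values) if power == 0 else values ** power

def inverse(values, power):
    """Undo the re-expression to get back to millions of people."""
    return 10.0 ** values if power == 0 else values ** (1.0 / power)

x = scale_year(year)

def fit_reexpressed(power):
    """Fit re-expressed population vs. scaled year; return a, b, residuals, R^2."""
    z = forward(pop, power)
    b, a = np.polyfit(x, z, 1)
    residuals = z - (a + b * x)
    r2 = 1 - np.sum(residuals ** 2) / np.sum((z - z.mean()) ** 2)
    return a, b, residuals, r2

def predict_population(calendar_year, power):
    """Predict on the re-expressed scale, then transform back."""
    a, b, _, _ = fit_reexpressed(power)
    return inverse(a + b * scale_year(calendar_year), power)

for power in (1, 0.5, 0):
    a, b, residuals, r2 = fit_reexpressed(power)
    print(f"power {power}: R^2 = {r2:.4f}, residuals = {np.round(residuals, 3)}")

print(f"1960 prediction with power 0.5: {predict_population(1960, 0.5):.0f} million")
```

Each R² here describes variability in the re-expressed population, so the values are not directly comparable across rungs; the residual pattern is the better guide to which re-expression to keep.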

Logarithms: If none of the data values are zero or negative, logarithms often are the method of choice. You can take the log of y, the log of x, or the log of both variables.

Chart from page 233:

| Model Name | x-axis | y-axis | Comment | Equation |
|---|---|---|---|---|
| Exponential | x | log(y) | This model is the “0” power on the ladder approach, useful for values that grow by percentage increases | $\log \hat{y} = a + bx$ |
| Logarithmic | log(x) | y | A wide range of x-values, or a scatter plot descending rapidly at the left but leveling off toward the right | $\hat{y} = a + b \log x$ |
| Power | log(x) | log(y) | The Goldilocks model: when one of the ladder’s powers is too big and the next is too small, this one might be just right | $\log \hat{y} = a + b \log x$ |
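The three log models differ only in which axis gets logged before the straight-line fit. A minimal Python sketch of that idea follows; it assumes numpy and base-10 logs, and the names `MODELS`, `fit_log_model`, and `predict` are made up for this illustration. The check at the end uses values generated from an exact power law, not real data.

```python
import numpy as np

# For each model: (take log of x?, take log of y?)
MODELS = {
    "exponential": (False, True),   # log y-hat = a + b x
    "logarithmic": (True, False),   # y-hat     = a + b log x
    "power":       (True, True),    # log y-hat = a + b log x
}

def fit_log_model(x, y, kind):
    """Least-squares fit on the re-expressed axes; returns (a, b)."""
    log_x, log_y = MODELS[kind]
    u = np.log10(x) if log_x else np.asarray(x, dtype=float)
    v = np.log10(y) if log_y else np.asarray(y, dtype=float)
    b, a = np.polyfit(u, v, 1)      # polyfit returns the slope first
    return a, b

def predict(x_new, a, b, kind):
    """Predict y in original units, undoing the log on y if one was taken."""
    log_x, log_y = MODELS[kind]
    u = np.log10(x_new) if log_x else x_new
    v = a + b * u
    return 10.0 ** v if log_y else v

# Sanity check on values generated from y = 2 * x**3 (not real data).
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 2 * x ** 3
a, b = fit_log_model(x, y, "power")
print(a, b)                          # close to log10(2) and 3
print(predict(5.0, a, b, "power"))   # close to 2 * 5**3 = 250
```

For the exponential and power models the fitted line predicts log(y), so a prediction in original units needs the antilog, which is what the last line of `predict` does.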

Ex 3: p. 242 #17

1) Enter data into your calculator and describe the scatter plot:

2) Find the regression of Salary vs. year and plot the residuals. Describe.

3) Re-express the data:

4) What model would you use to report the trend in salaries?

5) Predict the salary for 2007.

Ex 4: The decades listed represent: 1 = 1900-1910, 2 = 1911-1920. Create a model to predict future increases in life expectancy.

| Decade | Life Expectancy |
|---|---|
| 1 | 48.6 |
| 2 | 54.4 |
| 3 | 59.7 |
| 4 | 62.1 |
| 5 | 66.5 |
| 6 | 67.4 |
| 7 | 68.0 |
| 8 | 70.7 |
| 9 | 72.7 |
| 10 | 74.9 |

1. Enter data and look at scatter plot. Describe.

2. Run the regression and look at the residual plot. Describe.

3. What model would work better?

4. Write the model (equation) of the re-expressed data.

5. Predict the life expectancy for decade 11.
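For Ex 4, one candidate worth trying is the logarithmic model (log of the decade on the x-axis), since the values rise quickly at first and then level off. The Python sketch below compares it with the straight-line fit. It assumes numpy, the helper name `fit_with_r2` is made up here, and whether this is the best re-expression is still a judgment to make from the residual plot.

```python
import numpy as np

decade = np.arange(1, 11, dtype=float)
life = np.array([48.6, 54.4, 59.7, 62.1, 66.5, 67.4, 68.0, 70.7, 72.7, 74.9])

def fit_with_r2(x, y):
    """Least-squares line y-hat = a + b*x; returns a, b, residuals, R^2."""
    b, a = np.polyfit(x, y, 1)       # polyfit returns the slope first
    residuals = y - (a + b * x)
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return a, b, residuals, r2

# Steps 1-2: straight line on the raw decades.
a_lin, b_lin, res_lin, r2_lin = fit_with_r2(decade, life)

# Steps 3-4: logarithmic model, y-hat = a + b * log10(decade).
a_log, b_log, res_log, r2_log = fit_with_r2(np.log10(decade), life)

# Step 5: prediction for decade 11. No back-transformation is needed
# because only x was re-expressed.
pred_11 = a_log + b_log * np.log10(11)

print(f"linear:      R^2 = {r2_lin:.4f}, residuals = {np.round(res_lin, 2)}")
print(f"logarithmic: R^2 = {r2_log:.4f}, residuals = {np.round(res_log, 2)}")
print(f"predicted life expectancy for decade 11: {pred_11:.1f}")
```

Because life expectancy itself is not re-expressed in either fit, the two R² values describe the same variable and can be compared directly.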

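Why not just use the curve? We could, but again the calculations and mathematics are more difficult. It is better to re-express the data to straighten the plot, so that the simpler tools of the linear model (slope, y-intercept, residual plot) still apply.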

