Chapter One

The Nature of Econometrics and Economic Data

Chapter 1 discusses the scope of econometrics and raises general issues that result from the application of econometric methods. Section 1.3 examines the kinds of data sets that are used in business, economics, and other social sciences. Section 1.4 provides an intuitive discussion of the difficulties associated with the inference of causality in the social sciences.

1.1 WHAT IS ECONOMETRICS?
Imagine that you are hired by your state government to evaluate the effectiveness of a publicly funded job training program. Suppose this program teaches workers various ways to use computers in the manufacturing process. The twenty-week program offers courses during nonworking hours. Any hourly manufacturing worker may participate, and enrollment in all or part of the program is voluntary. You are to determine what, if any, effect the training program has on each worker's subsequent hourly wage.

Now suppose you work for an investment bank. You are to study the returns on different investment strategies involving short-term U.S. Treasury bills to decide whether they comply with implied economic theories.

The task of answering such questions may seem daunting at first. At this point, you may only have a vague idea of the kind of data you would need to collect. By the end of this introductory econometrics course, you should know how to use econometric methods to formally evaluate a job training program or to test a simple economic theory.

Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and implementing government and business policy. The most common application of econometrics is the forecasting of such important macroeconomic variables as interest rates, inflation rates, and gross domestic product. While forecasts of economic indicators are highly visible and are often widely published, econometric methods can be used in economic areas that have nothing to do with macroeconomic forecasting. For example, we will study the effects of political campaign expenditures on voting outcomes. We will consider the effect of school spending on student performance in the field of education. In addition, we will learn how to use econometric methods for forecasting economic time series.
Econometrics has evolved as a separate discipline from mathematical statistics because the former focuses on the problems inherent in collecting and analyzing nonexperimental economic data. Nonexperimental data are not accumulated through controlled experiments on individuals, firms, or segments of the economy. (Nonexperimental data are sometimes called observational data to emphasize the fact that the researcher is a passive collector of the data.) Experimental data are often collected in laboratory environments in the natural sciences, but they are much more difficult to obtain in the social sciences. While some social experiments can be devised, it is often impossible, prohibitively expensive, or morally repugnant to conduct the kinds of controlled experiments that would be needed to address economic issues. We give some specific examples of the differences between experimental and nonexperimental data in Section 1.4.

Naturally, econometricians have borrowed from mathematical statisticians whenever possible. The method of multiple regression analysis is the mainstay in both fields, but its focus and interpretation can differ markedly. In addition, economists have devised new techniques to deal with the complexities of economic data and to test the predictions of economic theories.

1.2 STEPS IN EMPIRICAL ECONOMIC ANALYSIS
Econometric methods are relevant in virtually every branch of applied economics. They come into play either when we have an economic theory to test or when we have a relationship in mind that has some importance for business decisions or policy analysis. An empirical analysis uses data to test a theory or to estimate a relationship.

How does one go about structuring an empirical economic analysis? It may seem obvious, but it is worth emphasizing that the first step in any empirical analysis is the careful formulation of the question of interest. The question might deal with testing a certain aspect of an economic theory, or it might pertain to testing the effects of a government policy. In principle, econometric methods can be used to answer a wide range of questions.

In some cases, especially those that involve the testing of economic theories, a formal economic model is constructed. An economic model consists of mathematical equations that describe various relationships. Economists are well-known for their building of models to describe a vast array of behaviors. For example, in intermediate microeconomics, individual consumption decisions, subject to a budget constraint, are described by mathematical models. The basic premise underlying these models is utility maximization. The assumption that individuals make choices to maximize their well-being, subject to resource constraints, gives us a very powerful framework for creating tractable economic models and making clear predictions. In the context of consumption decisions, utility maximization leads to a set of demand equations. In a demand equation, the quantity demanded of each commodity depends on the price of the good, the prices of substitute and complementary goods, the consumer's income, and the individual's characteristics that affect taste. These equations can form the basis of an econometric analysis of consumer demand.
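A demand equation of the kind just described can be sketched in code. The sketch below is purely illustrative and not from the text: the log-linear functional form, the function name quantity_demanded, and all elasticity values are hypothetical choices made here for concreteness.

```python
import math

def quantity_demanded(price, income, price_substitute):
    """A stylized log-linear demand equation (all elasticities are made up:
    own-price -1.2, income +0.8, cross-price for a substitute +0.3)."""
    return math.exp(2.0
                    - 1.2 * math.log(price)
                    + 0.8 * math.log(income)
                    + 0.3 * math.log(price_substitute))

# A higher own price should lower quantity demanded, other factors fixed
q1 = quantity_demanded(price=10.0, income=50000.0, price_substitute=8.0)
q2 = quantity_demanded(price=11.0, income=50000.0, price_substitute=8.0)
print(q1 > q2)
```

Because the own-price elasticity is negative, the sketch reproduces the familiar downward-sloping demand curve; an econometric analysis would estimate such elasticities from data rather than assuming them.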
Economists have used basic economic tools, such as the utility maximization framework, to explain behaviors that at first glance may appear to be noneconomic in nature. A classic example is Becker’s (1968) economic model of criminal behavior.
EXAMPLE 1.1 (Economic Model of Crime)

In a seminal article, Nobel prize winner Gary Becker postulated a utility maximization framework to describe an individual’s participation in crime. Certain crimes have clear economic rewards, but most criminal behaviors have costs. The opportunity costs of crime prevent the criminal from participating in other activities such as legal employment. In addition, there are costs associated with the possibility of being caught and then, if convicted, the costs associated with incarceration. From Becker’s perspective, the decision to undertake illegal activity is one of resource allocation, with the benefits and costs of competing activities taken into account. Under general assumptions, we can derive an equation describing the amount of time spent in criminal activity as a function of various factors. We might represent such a function as

y = f(x1, x2, x3, x4, x5, x6, x7),                                    (1.1)

where

y = hours spent in criminal activities
x1 = "wage" for an hour spent in criminal activity
x2 = hourly wage in legal employment
x3 = income other than from crime or employment
x4 = probability of getting caught
x5 = probability of being convicted if caught
x6 = expected sentence if convicted
x7 = age

Other factors generally affect a person’s decision to participate in crime, but the list above is representative of what might result from a formal economic analysis. As is common in economic theory, we have not been specific about the function f( ) in (1.1). This function depends on an underlying utility function, which is rarely known. Nevertheless, we can use economic theory—or introspection—to predict the effect that each variable would have on criminal activity. This is the basis for an econometric analysis of individual criminal activity.

Formal economic modeling is sometimes the starting point for empirical analysis, but it is more common to use economic theory less formally, or even to rely entirely on intuition. You may agree that the determinants of criminal behavior appearing in equation (1.1) are reasonable based on common sense; we might arrive at such an equation directly, without starting from utility maximization. This view has some merit, although there are cases where formal derivations provide insights that intuition can overlook.
Here is an example of an equation that was derived through somewhat informal reasoning.
EXAMPLE 1.2 (Job Training and Worker Productivity)

Consider the problem posed at the beginning of Section 1.1. A labor economist would like to examine the effects of job training on worker productivity. In this case, there is little need for formal economic theory. Basic economic understanding is sufficient for realizing that factors such as education, experience, and training affect worker productivity. Also, economists are well aware that workers are paid commensurate with their productivity. This simple reasoning leads to a model such as

wage = f(educ, exper, training)                                    (1.2)

where wage is hourly wage, educ is years of formal education, exper is years of workforce experience, and training is weeks spent in job training. Again, other factors generally affect the wage rate, but (1.2) captures the essence of the problem.

After we specify an economic model, we need to turn it into what we call an econometric model. Since we will deal with econometric models throughout this text, it is important to know how an econometric model relates to an economic model.

Take equation (1.1) as an example. The form of the function f( ) must be specified before we can undertake an econometric analysis. A second issue concerning (1.1) is how to deal with variables that cannot reasonably be observed. For example, consider the wage that a person can earn in criminal activity. In principle, such a quantity is well-defined, but it would be difficult if not impossible to observe this wage for a given individual. Even variables such as the probability of being arrested cannot realistically be obtained for a given individual, but at least we can observe relevant arrest statistics and derive a variable that approximates the probability of arrest. Many other factors affect criminal behavior that we cannot even list, let alone observe, but we must somehow account for them. The ambiguities inherent in the economic model of crime are resolved by specifying a particular econometric model:

crime = β0 + β1 wagem + β2 othinc + β3 freqarr + β4 freqconv + β5 avgsen + β6 age + u,                                    (1.3)

where crime is some measure of the frequency of criminal activity, wagem is the wage that can be earned in legal employment, othinc is the income from other sources (assets, inheritance, etc.), freqarr is the frequency of arrests for prior infractions (to approximate the probability of arrest), freqconv is the frequency of conviction, and avgsen is the average sentence length after conviction. The choice of these variables is determined by the economic theory as well as data considerations. The term u contains unobserved factors, such as the wage for criminal activity, moral character, family background, and errors in measuring things like criminal activity and the probability of arrest. We could add family background variables to the model, such as number of siblings, parents' education, and so on, but we can never eliminate u entirely. In fact, dealing with this error term or disturbance term is perhaps the most important component of any econometric analysis.

The constants β0, β1, …, β6 are the parameters of the econometric model, and they describe the directions and strengths of the relationship between crime and the factors used to determine crime in the model. A complete econometric model for Example 1.2 might be

wage = β0 + β1 educ + β2 exper + β3 training + u,                                    (1.4)

where the term u contains factors such as "innate ability," quality of education, family background, and the myriad other factors that can influence a person's wage. If we are specifically concerned about the effects of job training, then β3 is the parameter of interest.

For the most part, econometric analysis begins by specifying an econometric model, without consideration of the details of the model's creation. We generally follow this approach, largely because careful derivation of something like the economic model of crime is time consuming and can take us into some specialized and often difficult areas of economic theory. Economic reasoning will play a role in our examples, and we will merge any underlying economic theory into the econometric model specification. In the economic model of crime example, we would start with an econometric model such as (1.3) and use economic reasoning and common sense as guides for choosing the variables. While this approach loses some of the richness of economic analysis, it is commonly and effectively applied by careful researchers.

Once an econometric model such as (1.3) or (1.4) has been specified, various hypotheses of interest can be stated in terms of the unknown parameters. For example, in equation (1.3) we might hypothesize that wagem, the wage that can be earned in legal employment, has no effect on criminal behavior. In the context of this particular econometric model, the hypothesis is equivalent to β1 = 0.

An empirical analysis, by definition, requires data. After data on the relevant variables have been collected, econometric methods are used to estimate the parameters in the econometric model and to formally test hypotheses of interest. In some cases, the econometric model is used to make predictions in either the testing of a theory or the study of a policy's impact. Because data collection is so important in empirical work, Section 1.3 will describe the kinds of data that we are likely to encounter.
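As a preview of what "estimating the parameters" means, here is a minimal sketch of ordinary least squares applied to a simplified, one-regressor version of equation (1.4), wage = β0 + β1·training + u. The sample values below are invented for illustration only; the text develops estimation methods in later chapters.

```python
# Hypothetical sample: weeks of job training and hourly wage for six workers
training = [0, 2, 4, 6, 8, 10]
wage = [8.1, 8.9, 10.2, 10.8, 12.1, 12.9]

n = len(training)
mean_t = sum(training) / n
mean_w = sum(wage) / n

# OLS slope: sample covariance of (training, wage) over sample variance of training
b1 = sum((t - mean_t) * (w - mean_w) for t, w in zip(training, wage)) \
     / sum((t - mean_t) ** 2 for t in training)
# OLS intercept: the fitted line passes through the sample means
b0 = mean_w - b1 * mean_t
print(round(b0, 3), round(b1, 3))
```

The estimate b1 is the sample analogue of β1: the estimated wage change associated with one more week of training in this (made-up) sample.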

1.3 THE STRUCTURE OF ECONOMIC DATA
Economic data sets come in a variety of types. While some econometric methods can be applied with little or no modification to many different kinds of data sets, the special features of some data sets must be accounted for or should be exploited. We next describe the most important data structures encountered in applied work.
Cross-Sectional Data
A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries, or a variety of other units, taken at a given point in time. Sometimes the data on all units do not correspond to precisely the same time period. For example, several families may be surveyed during different weeks within a year. In a pure cross section analysis we would ignore any minor timing differences in collecting the data. If a set of families was surveyed during different weeks of the same year, we would still view this as a cross-sectional data set.

An important feature of cross-sectional data is that we can often assume that they have been obtained by random sampling from the underlying population. For example, if we obtain information on wages, education, experience, and other characteristics by randomly drawing 500 people from the working population, then we have a random sample from the population of all working people. Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies the analysis of cross-sectional data. A review of random sampling is contained in Appendix C.

Sometimes random sampling is not appropriate as an assumption for analyzing cross-sectional data. For example, suppose we are interested in studying factors that influence the accumulation of family wealth. We could survey a random sample of families, but some families might refuse to report their wealth. If, for example, wealthier families are less likely to disclose their wealth, then the resulting sample on wealth is not a random sample from the population of all families. This is an illustration of a sample selection problem, an advanced topic that we will discuss in Chapter 17.

Another violation of random sampling occurs when we sample from units that are large relative to the population, particularly geographical units. The potential problem in such cases is that the population is not large enough to reasonably assume the observations are independent draws. For example, if we want to explain new business activity across states as a function of wage rates, energy prices, corporate and property tax rates, services provided, quality of the workforce, and other state characteristics, it is unlikely that business activities in states near one another are independent. It turns out that the econometric methods that we discuss do work in such situations, but they sometimes need to be refined. For the most part, we will ignore the intricacies that arise in analyzing such situations and treat these problems in a random sampling framework, even when it is not technically correct to do so.

Cross-sectional data are widely used in economics and other social sciences. In economics, the analysis of cross-sectional data is closely aligned with the applied microeconomics fields, such as labor economics, state and local public finance, industrial organization, urban economics, demography, and health economics. Data on individuals, households, firms, and cities at a given point in time are important for testing microeconomic hypotheses and evaluating economic policies.

The cross-sectional data used for econometric analysis can be represented and stored in computers. Table 1.1 contains, in abbreviated form, a cross-sectional data set on 526 working individuals for the year 1976. (This is a subset of the data in the file WAGE1.RAW.) The variables include wage (in dollars per hour), educ (years of education), exper (years of potential labor force experience), female (an indicator for gender), and married (marital status). These last two variables are binary (zero-one) in nature
Table 1.1 A Cross-Sectional Data Set on Wages and Other Individual Characteristics

obsno    wage     educ    exper    female    married
1        3.10     11      2        1         0
2        3.24     12      22       1         1
3        3.00     11      2        0         0
4        6.00     8       44       0         1
5        5.30     12      7        0         0
...
525      11.56    16      5        0         1
526      3.50     14      5        1         0

and serve to indicate qualitative features of the individual. (The person is female or not; the person is married or not.) We will have much to say about binary variables in Chapter 7 and beyond.

The variable obsno in Table 1.1 is the observation number assigned to each person in the sample. Unlike the other variables, it is not a characteristic of the individual. All econometrics and statistics software packages assign an observation number to each data unit. Intuition should tell you that, for data such as that in Table 1.1, it does not matter which person is labeled as observation one, which person is called observation two, and so on. The fact that the ordering of the data does not matter for econometric analysis is a key feature of cross-sectional data sets obtained from random sampling.

Different variables sometimes correspond to different time periods in cross-sectional data sets. For example, in order to determine the effects of government policies on long-term economic growth, economists have studied the relationship between growth in real per capita gross domestic product (GDP) over a certain period (say, 1960 to 1985) and variables determined in part by government policy in 1960 (government consumption as a percentage of GDP and adult secondary education rates). Such a data set might be represented as in Table 1.2, which constitutes part of the data set used in the study of cross-country growth rates by De Long and Summers (1991).
Table 1.2 A Data Set on Economic Growth Rates and Country Characteristics

obsno    country      gpcrgdp    govcons60    second60
1        Argentina    0.89       9            32
2        Austria      3.32       16           50
3        Belgium      2.56       13           69
4        Bolivia      1.24       18           12
...
61       Zimbabwe     2.30       17           6

The variable gpcrgdp represents average growth in real per capita GDP over the period 1960 to 1985. The fact that govcons60 (government consumption as a percentage of GDP) and second60 (percent of adult population with a secondary education) correspond to the year 1960, while gpcrgdp is the average growth over the period from 1960 to 1985, does not lead to any special problems in treating this information as a cross-sectional data set. The observations are listed alphabetically by country, but there is nothing about this ordering that affects any subsequent analysis.
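The claim that ordering carries no information in a cross section can be checked directly. A minimal sketch using the visible rows of Table 1.2: shuffling the observations leaves any sample statistic, such as the average growth rate, essentially unchanged (up to floating-point rounding).

```python
import random

# (country, gpcrgdp) pairs from the visible rows of Table 1.2
rows = [("Argentina", 0.89), ("Austria", 3.32),
        ("Belgium", 2.56), ("Bolivia", 1.24)]

mean_growth = sum(g for _, g in rows) / len(rows)
random.shuffle(rows)  # reorder the observations arbitrarily
mean_growth_shuffled = sum(g for _, g in rows) / len(rows)
print(mean_growth, mean_growth_shuffled)
```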

Time Series Data
A time series data set consists of observations on a variable or several variables over time. Examples of time series data include stock prices, money supply, the consumer price index, gross domestic product, annual homicide rates, and automobile sales figures. Because past events can influence future events and lags in behavior are prevalent in the social sciences, time is an important dimension in a time series data set. Unlike the arrangement of cross-sectional data, the chronological ordering of observations in a time series conveys potentially important information.

A key feature of time series data that makes it more difficult to analyze than cross-sectional data is the fact that economic observations can rarely, if ever, be assumed to be independent across time. Most economic and other time series are related, often strongly related, to their recent histories. For example, knowing something about the gross domestic product from last quarter tells us quite a bit about the likely range of the GDP during this quarter, since GDP tends to remain fairly stable from one quarter to the next. While most econometric procedures can be used with both cross-sectional and time series data, more needs to be done in specifying econometric models for time series data before standard econometric methods can be justified. In addition, modifications and embellishments to standard econometric techniques have been developed to account for and exploit the dependent nature of economic time series and to address other issues, such as the fact that some economic variables tend to display clear trends over time.

Another feature of time series data that can require special attention is the frequency at which the data are collected. In economics, the most common frequencies are daily, weekly, monthly, quarterly, and annually. Stock prices are recorded at daily intervals (excluding Saturday and Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series are tabulated monthly, including inflation and employment rates. Other macro series are recorded less frequently, such as every three months (every quarter). Gross domestic product is an important example of a quarterly series. Other time series, such as infant mortality rates for states in the United States, are available only on an annual basis.

Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, which can be an important factor in a time series analysis. For example, monthly data on housing starts differ across the months simply due to changing weather conditions. We will learn how to deal with seasonal time series in Chapter 10.

Table 1.3 contains a time series data set obtained from an article by Castillo-Freeman and Freeman (1992) on minimum wage effects in Puerto Rico. The earliest year in the data set is the first observation, and the most recent year available is the last

Table 1.3 Minimum Wage, Unemployment, and Related Data for Puerto Rico

obsno    year    avgmin    avgcov    unemp    gnp
1        1950    0.20      20.1      15.4     878.7
2        1951    0.21      20.7      16.0     925.0
3        1952    0.23      22.6      14.8     1015.9
...
37       1986    3.35      58.1      18.9     4281.6
38       1987    3.35      58.2      16.8     4496.7

observation. When econometric methods are used to analyze time series data, the data should be stored in chronological order. The variable avgmin refers to the average minimum wage for the year, avgcov is the average coverage rate (the percentage of workers covered by the minimum wage law), unemp is the unemployment rate, and gnp is the gross national product. We will use these data later in a time series analysis of the effect of the minimum wage on employment.
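The remark that time series observations must be kept in chronological order can be illustrated with the visible rows of Table 1.3. A small sketch, computing first differences of the minimum wage, a transformation that only makes sense when the observations are in time order:

```python
# Visible rows of Table 1.3: year and average minimum wage
years = [1950, 1951, 1952]
avgmin = [0.20, 0.21, 0.23]

# First differences require chronological order: change_t = x_t - x_{t-1}
changes = [round(avgmin[t] - avgmin[t - 1], 2) for t in range(1, len(avgmin))]
print(changes)  # [0.01, 0.02]
```

Had the rows been shuffled, as is harmless for a cross section, these year-over-year changes would be meaningless.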

Pooled Cross Sections
Some data sets have both cross-sectional and time series features. For example, suppose that two cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990. In 1985, a random sample of households is surveyed for variables such as income, savings, family size, and so on. In 1990, a new random sample of households is taken using the same survey questions. In order to increase our sample size, we can form a pooled cross section by combining the two years. Because random samples are taken in each year, it would be a fluke if the same household appeared in the sample during both years. (The size of the sample is usually very small compared with the number of households in the United States.) This important factor distinguishes a pooled cross section from a panel data set.

Pooling cross sections from different years is often an effective way of analyzing the effects of a new government policy. The idea is to collect data from the years before and after a key policy change. As an example, consider the following data set on housing prices taken in 1993 and 1995, when there was a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 and on 270 houses for 1995. One way to store such a data set is given in Table 1.4. Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 520 correspond to the 270 houses sold in 1995. While the order in which we store the data turns out not to be crucial, keeping track of the year for each observation is usually very important. This is why we enter year as a separate variable.

A pooled cross section is analyzed much like a standard cross section, except that we often need to account for secular differences in the variables across time. In fact, in addition to increasing the sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has changed over time.
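The construction of a pooled cross section can be sketched directly. The prices below are illustrative values only; the point is that year is carried as its own variable, as the text describes.

```python
# Hypothetical house prices from two separate random samples
sample_1993 = [85500, 67300, 134000]
sample_1995 = [65000, 182400, 97500]

# Pool the two cross sections, tagging each observation with its year
pooled = ([{"year": 1993, "hprice": p} for p in sample_1993]
          + [{"year": 1995, "hprice": p} for p in sample_1995])

def mean_price(year):
    """Average price among the pooled observations for a given year."""
    prices = [r["hprice"] for r in pooled if r["year"] == year]
    return sum(prices) / len(prices)

print(mean_price(1993), mean_price(1995))
```

Keeping year as a variable is what allows the analysis to compare the before-policy and after-policy samples within the single pooled data set.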

Panel or Longitudinal Data
A panel data (or longitudinal data) set consists of a time series for each cross-sectional member in the data set. As an example, suppose we have wage, education, and employment history for a set of individuals followed over a ten-year period. Or we might collect information, such as investment and financial data, about the same set of firms over a five-year time period. Panel data can also be collected on geographical units. For example, we can collect data for the same set of counties in the United States on immigration flows, tax rates, wage rates, government expenditures, etc., for the years 1980, 1985, and 1990.

The key feature of panel data that distinguishes it from a pooled cross section is the fact that the same cross-sectional units (individuals, firms, or counties in the above
Table 1.4 Pooled Cross Sections: Two Years of Housing Prices

obsno    year    hprice    proptax    sqrft    bdrms    bthrms
1        1993    85500     42         1600     3        2.0
2        1993    67300     36         1440     3        2.5
3        1993    134000    38         2000     4        2.5
...
250      1993    243600    41         2600     4        3.0
251      1995    65000     16         1250     2        1.0
252      1995    182400    20         2200     4        2.0
253      1995    97500     15         1540     3        2.0
...
520      1995    57200     16         1100     2        1.5
1.5

examples) are followed over a given time period. The data in Table 1.4 are not considered a panel data set because the houses sold are likely to be different in 1993 and 1995; if there are any duplicates, the number is likely to be so small as to be unimportant. In contrast, Table 1.5 contains a two-year panel data set on crime and related statistics for 150 cities in the United States.

There are several interesting features in Table 1.5. First, each city has been given a number from 1 through 150. Which city we decide to call city 1, city 2, and so on, is irrelevant. As with a pure cross section, the ordering in the cross section of a panel data set does not matter. We could use the city name in place of a number, but it is often useful to have both.
Table 1.5 A Two-Year Panel Data Set on City Crime Statistics

obsno    city    year    murders    population    unem    police
1        1       1986    5          350000        8.7     440
2        1       1990    8          359200        7.2     471
3        2       1986    2          64300         5.4     75
4        2       1990    1          65100         5.5     75
...
297      149     1986    10         260700        9.6     286
298      149     1990    6          245000        9.8     334
299      150     1986    25         543000        4.3     520
300      150     1990    32         546200        5.2     493

A second useful point is that the two years of data for city 1 fill the first two rows or observations. Observations 3 and 4 correspond to city 2, and so on. Since each of the 150 cities has two rows of data, any econometrics package will view this as 300 observations. This data set can be treated as two pooled cross sections, where the same cities happen to show up in each year. But, as we will see in Chapters 13 and 14, we can also use the panel structure to respond to questions that cannot be answered by simply viewing this as a pooled cross section.

In organizing the observations in Table 1.5, we place the two years of data for each city adjacent to one another, with the first year coming before the second in all cases. For just about every practical purpose, this is the preferred way for ordering panel data sets. Contrast this organization with the way the pooled cross sections are stored in Table 1.4. In short, the reason for ordering panel data as in Table 1.5 is that we will need to perform data transformations for each city across the two years.

Because panel data require replication of the same units over time, panel data sets, especially those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. Not surprisingly, observing the same units over time leads to several advantages over cross-sectional data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having multiple observations on the same units allows us to control for certain unobserved characteristics of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate causal inference in situations where inferring causality would be very difficult if only a single cross section were available.

A second advantage of panel data is that it often allows us to study the importance of lags in behavior or the result of decision making. This information can be significant since many economic policies can be expected to have an impact only after some time has passed.

Most books at the undergraduate level do not contain a discussion of econometric methods for panel data. However, economists now recognize that some questions are difficult, if not impossible, to answer satisfactorily without panel data. As you will see, we can make considerable progress with simple panel data analysis, a method which is not much more difficult than dealing with a standard cross-sectional data set.
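The transformation behind these advantages can be previewed with a sketch based on the visible rows of Table 1.5: subtracting each city's 1986 value from its 1990 value removes the influence of anything that is constant within a city over time, which is one simple way panel data help control for unobserved characteristics.

```python
# (city, year, murders) rows mirroring the visible part of Table 1.5
rows = [
    (1, 1986, 5), (1, 1990, 8),
    (2, 1986, 2), (2, 1990, 1),
]

# Group the two years of data for each city, as the panel layout intends
by_city = {}
for city, year, murders in rows:
    by_city.setdefault(city, {})[year] = murders

# Within-city change in murders from 1986 to 1990
diffs = {city: y[1990] - y[1986] for city, y in by_city.items()}
print(diffs)  # {1: 3, 2: -1}
```

This is why the text recommends storing the two years for each city adjacent to one another: transformations like this difference are computed city by city across the two rows.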

A Comment on Data Structures
Part 1 of this text is concerned with the analysis of cross-sectional data, as this poses the fewest conceptual and technical difficulties. At the same time, it illustrates most of the key themes of econometric analysis. We will use the methods and insights from cross-sectional analysis in the remainder of the text.

While the econometric analysis of time series uses many of the same tools as cross-sectional analysis, it is more complicated due to the trending, highly persistent nature of many economic time series. Examples that have traditionally been used to illustrate the manner in which econometric methods can be applied to time series data are now widely believed to be flawed. It makes little sense to use such examples initially, since this practice will only reinforce poor econometric practice. Therefore, we will postpone the treatment of time series econometrics until Part 2, when the important issues concerning trends, persistence, dynamics, and seasonality will be introduced.

In Part 3, we treat pooled cross sections and panel data explicitly. The analysis of independently pooled cross sections and simple panel data analysis are fairly straightforward extensions of pure cross-sectional analysis. Nevertheless, we will wait until Chapter 13 to deal with these topics.

1.4 CAUSALITY AND THE NOTION OF CETERIS PARIBUS IN ECONOMETRIC ANALYSIS
In most tests of economic theory, and certainly for evaluating public policy, the economist’s goal is to infer that one variable has a causal effect on another variable (such as crime rate or worker productivity). Simply finding an association between two or more variables might be suggestive, but unless causality can be established, it is rarely compelling. The notion of ceteris paribus—which means “other (relevant) factors being equal”—plays an important role in causal analysis. This idea has been implicit in some of our earlier discussion, particularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it.
13

Chapter 1

The Nature of Econometrics and Economic Data

You probably remember from introductory economics that most economic questions are ceteris paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the effect of changing the price of a good on its quantity demanded, while holding all other factors—such as income, prices of other goods, and individual tastes—fixed. If other factors are not held fixed, then we cannot know the causal effect of a price change on quantity demanded.

Holding other factors fixed is critical for policy analysis as well. In the job training example (Example 1.2), we might be interested in the effect of another week of job training on wages, with all other components being equal (in particular, education and experience). If we succeed in holding all other relevant factors fixed and then find a link between job training and wages, we can conclude that job training has a causal effect on worker productivity.

While this may seem pretty simple, even at this early stage it should be clear that, except in very special cases, it will not be possible to literally hold all else equal. The key question in most empirical studies is: Have enough other factors been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising this issue. In most serious applications, the number of factors that can affect the variable of interest—such as criminal activity or wages—is immense, and the isolation of any particular variable may seem like a hopeless effort. However, we will eventually see that, when carefully applied, econometric methods can simulate a ceteris paribus experiment.

At this point, we cannot yet explain how econometric methods can be used to estimate ceteris paribus effects, so we will consider some problems that can arise in trying to infer causality in economics. We do not use any equations in this discussion. For each example, the problem of inferring causality disappears if an appropriate experiment can be carried out. 
Thus, it is useful to describe how such an experiment might be structured, and to observe that, in most cases, obtaining experimental data is impractical. It is also helpful to think about why the available data fails to have the important features of an experimental data set. We rely for now on your intuitive understanding of terms such as random, independence, and correlation, all of which should be familiar from an introductory probability and statistics course. (These concepts are reviewed in Appendix B.) We begin with an example that illustrates some of these important issues.
EXAMPLE 1.3 (Effects of Fertilizer on Crop Yield)

Some early econometric studies [for example, Griliches (1957)] considered the effects of new fertilizers on crop yields. Suppose the crop under consideration is soybeans. Since fertilizer amount is only one factor affecting yields—some others include rainfall, quality of land, and presence of parasites—this issue must be posed as a ceteris paribus question. One way to determine the causal effect of fertilizer amount on soybean yield is to conduct an experiment, which might include the following steps. Choose several one-acre plots of land. Apply different amounts of fertilizer to each plot and subsequently measure the yields; this gives us a cross-sectional data set. Then, use statistical methods (to be introduced in Chapter 2) to measure the association between yields and fertilizer amounts.
As described earlier, this may not seem like a very good experiment, because we have said nothing about choosing plots of land that are identical in all respects except for the amount of fertilizer. In fact, choosing plots of land with this feature is not feasible: some of the factors, such as land quality, cannot even be fully observed. How do we know the results of this experiment can be used to measure the ceteris paribus effect of fertilizer? The answer depends on the specifics of how fertilizer amounts are chosen. If the levels of fertilizer are assigned to plots independently of other plot features that affect yield—that is, other characteristics of plots are completely ignored when deciding on fertilizer amounts— then we are in business. We will justify this statement in Chapter 2.
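The logic of this experiment can be sketched in a short simulation. Every number below (the yield equation, the effect size, the fertilizer range) is invented for illustration; the point is only that when fertilizer amounts are assigned independently of plot characteristics, the simple yield-fertilizer association recovers the ceteris paribus effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of one-acre plots

# Unobserved plot characteristics (land quality, rainfall, ...)
land_quality = rng.normal(size=n)

# Fertilizer amounts assigned independently of plot characteristics
fertilizer = rng.uniform(0, 300, size=n)  # hypothetical pounds per acre

# Hypothetical yield equation: the true ceteris paribus effect is 0.05
soy_yield = 120 + 0.05 * fertilizer + 10 * land_quality + rng.normal(size=n)

# Slope of the best-fit line through the (fertilizer, yield) scatter
slope = np.cov(fertilizer, soy_yield)[0, 1] / np.var(fertilizer, ddof=1)
print(round(slope, 3))  # should be close to the true effect, 0.05
```

Because fertilizer is uncorrelated with the unobserved land quality here, the simple association is an unbiased guide to the causal effect, which is the claim justified formally in Chapter 2.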

The next example is more representative of the difficulties that arise when inferring causality in applied economics.
EXAMPLE 1.4 (Measuring the Return to Education)

Labor economists and policy makers have long been interested in the “return to education.” Somewhat informally, the question is posed as follows: If a person is chosen from the population and given another year of education, by how much will his or her wage increase? As with the previous examples, this is a ceteris paribus question, which implies that all other factors are held fixed while another year of education is given to the person. We can imagine a social planner designing an experiment to get at this issue, much as the agricultural researcher can design an experiment to estimate fertilizer effects. One approach is to emulate the fertilizer experiment in Example 1.3: Choose a group of people, randomly give each person an amount of education (some people have an eighth grade education, some are given a high school education, etc.), and then measure their wages (assuming that each then works in a job). The people here are like the plots in the fertilizer example, where education plays the role of fertilizer and wage rate plays the role of soybean yield. As with Example 1.3, if levels of education are assigned independently of other characteristics that affect productivity (such as experience and innate ability), then an analysis that ignores these other factors will yield useful results. Again, it will take some effort in Chapter 2 to justify this claim; for now we state it without support.

Unlike the fertilizer-yield example, the experiment described in Example 1.4 is infeasible. The moral issues, not to mention the economic costs, associated with randomly determining education levels for a group of individuals are obvious. As a logistical matter, we could not give someone only an eighth grade education if he or she already has a college degree. Even though experimental data cannot be obtained for measuring the return to education, we can certainly collect nonexperimental data on education levels and wages for a large group by sampling randomly from the population of working people. Such data are available from a variety of surveys used in labor economics, but these data sets have a feature that makes it difficult to estimate the ceteris paribus return to education.
People choose their own levels of education, and therefore education levels are probably not determined independently of all other factors affecting wage. This problem is a feature shared by most nonexperimental data sets.

One factor that affects wage is experience in the work force. Since pursuing more education generally requires postponing entering the work force, those with more education usually have less experience. Thus, in a nonexperimental data set on wages and education, education is likely to be negatively associated with a key variable that also affects wage. It is also believed that people with more innate ability often choose higher levels of education. Since higher ability leads to higher wages, we again have a correlation between education and a critical factor that affects wage.

The omitted factors of experience and ability in the wage example have analogs in the fertilizer example. Experience is generally easy to measure and therefore is similar to a variable such as rainfall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the fertilizer example. As we will see throughout this text, accounting for other observed factors, such as experience, when estimating the ceteris paribus effect of another variable, such as education, is relatively straightforward. We will also find that accounting for inherently unobservable factors, such as ability, is much more problematic. It is fair to say that many of the advances in econometric methods have tried to deal with unobserved factors in econometric models.

One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that in the fertilizer example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who chose the fertilizer levels thought it would be better to put more fertilizer on the higher quality plots of land. 
(Agricultural researchers should have a rough idea about which plots of land are better quality, even though they may not be able to fully quantify the differences.) This situation is completely analogous to the level of schooling being related to unobserved ability in Example 1.4. Because better land leads to higher yields, and more fertilizer was used on the better plots, any observed relationship between yield and fertilizer might be spurious.
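A small variation on the earlier simulation sketch (again with invented numbers) shows the spurious association: when the assistant fertilizes the better plots more heavily, the observed yield-fertilizer slope overstates the true effect because fertilizer now also proxies for land quality:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

land_quality = rng.normal(size=n)

# The assistant puts more fertilizer on higher quality plots, so the
# amount applied now depends on an unobserved factor.
fertilizer = 150 + 40 * land_quality + rng.normal(scale=20, size=n)

# Same hypothetical yield equation: the true effect is still 0.05
soy_yield = 120 + 0.05 * fertilizer + 10 * land_quality + rng.normal(size=n)

slope = np.cov(fertilizer, soy_yield)[0, 1] / np.var(fertilizer, ddof=1)
print(round(slope, 3))  # well above 0.05: fertilizer picks up land quality
```

The data-generating process is identical except for how fertilizer is assigned, which is exactly the difference between experimental and nonexperimental data.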
EXAMPLE 1.5 (The Effect of Law Enforcement on City Crime Levels)

The issue of how best to prevent crime has been, and will probably continue to be, with us for some time. One especially important question in this regard is: Does the presence of more police officers on the street deter crime?

The ceteris paribus question is easy to state: If a city is randomly chosen and given 10 additional police officers, by how much would its crime rates fall? Another way to state the question is: If two cities are the same in all respects, except that city A has 10 more police officers than city B, by how much would the two cities' crime rates differ?

It would be virtually impossible to find pairs of communities identical in all respects except for the size of their police force. Fortunately, econometric analysis does not require this. What we do need to know is whether the data we can collect on community crime levels and the size of the police force can be viewed as experimental. We can certainly imagine a true experiment involving a large collection of cities where we dictate how many police officers each city will use for the upcoming year.
While policies can be used to affect the size of police forces, we clearly cannot tell each city how many police officers it can hire. If, as is likely, a city’s decision on how many police officers to hire is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. In fact, one way to view this problem is to see that a city’s choice of police force size and the amount of crime are simultaneously determined. We will explicitly address such problems in Chapter 16.

The first three examples we have discussed have dealt with cross-sectional data at various levels of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring causality in time series problems.

EXAMPLE 1.6 (The Effect of the Minimum Wage on Unemployment)

An important, and perhaps contentious, policy issue concerns the effect of the minimum wage on unemployment rates for various groups of workers. While this problem can be studied in a variety of data settings (cross-sectional, time series, or panel data), time series data are often used to look at aggregate effects. An example of a time series data set on unemployment rates and minimum wages was given in Table 1.3. Standard supply and demand analysis implies that, as the minimum wage is increased above the market clearing wage, we slide up the demand curve for labor and total employment decreases. (Labor supply exceeds labor demand.) To quantify this effect, we can study the relationship between employment and the minimum wage over time. In addition to some special difficulties that can arise in dealing with time series data, there are possible problems with inferring causality. The minimum wage in the United States is not determined in a vacuum. Various economic and political forces impinge on the final minimum wage for any given year. (The minimum wage, once determined, is usually in place for several years, unless it is indexed for inflation.) Thus, it is probable that the amount of the minimum wage is related to other factors that have an effect on employment levels. We can imagine the U.S. government conducting an experiment to determine the employment effects of the minimum wage (as opposed to worrying about the welfare of low wage workers). The minimum wage could be randomly set by the government each year, and then the employment outcomes could be tabulated. The resulting experimental time series data could then be analyzed using fairly simple econometric methods. But this scenario hardly describes how minimum wages are set. If we can control enough other factors relating to employment, then we can still hope to estimate the ceteris paribus effect of the minimum wage on employment. 
In this sense, the problem is very similar to the previous cross-sectional examples.

Even when economic theories are not most naturally described in terms of causality, they often have predictions that can be tested using econometric methods. The following is an example of this approach.
EXAMPLE 1.7 (The Expectations Hypothesis)

The expectations hypothesis from financial economics states that, given all information available to investors at the time of investing, the expected return on any two investments is the same. For example, consider two possible investments with a three-month investment horizon, purchased at the same time: (1) Buy a three-month T-bill with a face value of $10,000, for a price below $10,000; in three months, you receive $10,000. (2) Buy a six-month T-bill (at a price below $10,000) and, in three months, sell it as a three-month T-bill. Each investment requires roughly the same amount of initial capital, but there is an important difference. For the first investment, you know exactly what the return is at the time of purchase because you know the initial price of the three-month T-bill, along with its face value. This is not true for the second investment: while you know the price of a six-month T-bill when you purchase it, you do not know the price you can sell it for in three months. Therefore, there is uncertainty in this investment for someone who has a three-month investment horizon.

The actual returns on these two investments will usually be different. According to the expectations hypothesis, the expected return from the second investment, given all information at the time of investment, should equal the return from purchasing a three-month T-bill. This theory turns out to be fairly easy to test, as we will see in Chapter 11.
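To make the two strategies concrete, here is a back-of-the-envelope comparison with hypothetical prices; all dollar figures are invented, and only the structure of the two returns comes from the example:

```python
# Hypothetical T-bill prices; face value $10,000.
face = 10_000.0

# Strategy 1: buy a three-month T-bill and hold it to maturity.
p3 = 9_880.0                  # purchase price today (invented)
ret1 = (face - p3) / p3       # return is known at the time of purchase

# Strategy 2: buy a six-month T-bill and sell it in three months,
# when it has become a three-month bill.
p6 = 9_760.0                  # purchase price today (invented)
p6_resale = 9_905.0           # resale price in three months, not known today
ret2 = (p6_resale - p6) / p6  # realized return, observable only after the fact

print(f"three-month bill, held to maturity: {ret1:.4%}")
print(f"six-month bill, sold after three months: {ret2:.4%}")
```

In any single episode the two realized returns will differ; the expectations hypothesis says only that, conditional on what investors knew at purchase, the *expected* returns are equal.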

SUMMARY
In this introductory chapter, we have discussed the purpose and scope of econometric analysis. Econometrics is used in all applied economic fields to test economic theories, to inform government and private policy makers, and to predict economic time series. Sometimes an econometric model is derived from a formal economic model, but in other cases econometric models are based on informal economic reasoning and intuition. The goal of any econometric analysis is to estimate the parameters in the model and to test hypotheses about these parameters; the values and signs of the parameters determine the validity of an economic theory and the effects of certain policies.

Cross-sectional, time series, pooled cross-sectional, and panel data are the most common types of data structures that are used in applied econometrics. Data sets involving a time dimension, such as time series and panel data, require special treatment because of the correlation across time of most economic time series. Other issues, such as trends and seasonality, arise in the analysis of time series data but not cross-sectional data.

In Section 1.4, we discussed the notions of ceteris paribus and causal inference. In most cases, hypotheses in the social sciences are ceteris paribus in nature: all other relevant factors must be fixed when studying the relationship between two variables. Because of the nonexperimental nature of most data collected in the social sciences, uncovering causal relationships is very challenging.
KEY TERMS
Causal Effect
Ceteris Paribus
Cross-Sectional Data Set
Data Frequency
Econometric Model
Economic Model
Empirical Analysis
Experimental Data
Nonexperimental Data
Observational Data
Panel Data
Pooled Cross Section
Random Sampling
Time Series Data


Chapter Two

The Simple Regression Model

The simple regression model can be used to study the relationship between two variables. For reasons we will see, the simple regression model has limitations as a general tool for empirical analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret the simple regression model is good practice for studying multiple regression, which we'll do in subsequent chapters.

2.1 DEFINITION OF THE SIMPLE REGRESSION MODEL
Much of applied econometric analysis begins with the following premise: y and x are two variables, representing some population, and we are interested in "explaining y in terms of x," or in "studying how y varies with changes in x." We discussed some examples in Chapter 1, including: y is soybean crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; y is a community crime rate and x is number of police officers.

In writing down a model that will "explain y in terms of x," we must confront three issues. First, since there is never an exact relationship between two variables, how do we allow for other factors to affect y? Second, what is the functional relationship between y and x? And third, how can we be sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)?

We can resolve these ambiguities by writing down an equation relating y to x. A simple equation is

    y = \beta_0 + \beta_1 x + u.    (2.1)

Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear regression model. It is also called the two-variable linear regression model or bivariate linear regression model because it relates the two variables x and y. We now discuss the meaning of each of the quantities in (2.1). (Incidentally, the term “regression” has origins that are not especially important for most modern econometric applications, so we will not explain it here. See Stigler [1986] for an engaging history of regression analysis.)
When related by (2.1), the variables y and x have several different names used interchangeably, as follows. y is called the dependent variable, the explained variable, the response variable, the predicted variable, or the regressand. x is called the independent variable, the explanatory variable, the control variable, the predictor variable, or the regressor. (The term covariate is also used for x.) The terms “dependent variable” and “independent variable” are frequently used in econometrics. But be aware that the label “independent” here does not refer to the statistical notion of independence between random variables (see Appendix B). The terms “explained” and “explanatory” variables are probably the most descriptive. “Response” and “control” are used mostly in the experimental sciences, where the variable x is under the experimenter’s control. We will not use the terms “predicted variable” and “predictor,” although you sometimes see these. Our terminology for simple regression is summarized in Table 2.1.

Table 2.1 Terminology for Simple Regression

    y                      x
    -------------------    ----------------------
    Dependent Variable     Independent Variable
    Explained Variable     Explanatory Variable
    Response Variable      Control Variable
    Predicted Variable     Predictor Variable
    Regressand             Regressor

The variable u, called the error term or disturbance in the relationship, represents factors other than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as being unobserved. You can usefully think of u as standing for "unobserved."

Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other factors in u are held fixed, so that the change in u is zero, \Delta u = 0, then x has a linear effect on y:

    \Delta y = \beta_1 \Delta x  if  \Delta u = 0.    (2.2)

Thus, the change in y is simply \beta_1 multiplied by the change in x. This means that \beta_1 is the slope parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary interest in applied economics. The intercept parameter \beta_0 also has its uses, although it is rarely central to an analysis.
23

Part 1

Regression Analysis with Cross-Sectional Data

EXAMPLE 2.1 (Soybean Yield and Fertilizer)

Suppose that soybean yield is determined by the model

    yield = \beta_0 + \beta_1 fertilizer + u,    (2.3)

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer on yield, holding other factors fixed. This effect is given by \beta_1. The error term u contains factors such as land quality, rainfall, and so on. The coefficient \beta_1 measures the effect of fertilizer on yield, holding other factors fixed: \Delta yield = \beta_1 \Delta fertilizer.
EXAMPLE 2.2 (A Simple Wage Equation)

A model relating a person's wage to observed education and other unobserved factors is

    wage = \beta_0 + \beta_1 educ + u.    (2.4)

If wage is measured in dollars per hour and educ is years of education, then \beta_1 measures the change in hourly wage given another year of education, holding all other factors fixed. Some of those factors include labor force experience, innate ability, tenure with current employer, work ethic, and innumerable other things.

The linearity of (2.1) implies that a one-unit change in x has the same effect on y, regardless of the initial value of x. This is unrealistic for many economic applications. For example, in the wage-education example, we might want to allow for increasing returns: the next year of education has a larger effect on wages than did the previous year. We will see how to allow for such possibilities in Section 2.4.

The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus conclusions about how x affects y. We just saw in equation (2.2) that \beta_1 does measure the effect of x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors fixed, when we are ignoring all those other factors?

As we will see in Section 2.5, we are only able to get reliable estimators of \beta_0 and \beta_1 from a random sample of data when we make an assumption restricting how the unobservable u is related to the explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus effect, \beta_1. Because u and x are random variables, we need a concept grounded in probability.

Before we state the key assumption about how x and u are related, there is one assumption about u that we can always make. As long as the intercept \beta_0 is included in the equation, nothing is lost by assuming that the average value of u in the population is zero.
Mathematically,

    E(u) = 0.    (2.5)

Importantly, assumption (2.5) says nothing about the relationship between u and x but simply makes a statement about the distribution of the unobservables in the population. Using the previous examples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to have an average of zero in the population of all cultivated plots. The same is true of the unobserved factors in Example 2.2. Without loss of generality, we can assume that things such as average ability are zero in the population of all working people. If you are not convinced, you can work through Problem 2.2 to see that we can always redefine the intercept in equation (2.1) to make (2.5) true.

We now turn to the crucial assumption regarding how u and x are related. A natural measure of the association between two random variables is the correlation coefficient. (See Appendix B for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in which u and x should be unrelated in equation (2.1). But it does not go far enough, because correlation measures only linear dependence between u and x. Correlation has a somewhat counterintuitive feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such as x^2. (See Section B.4 for further discussion.) This possibility is not acceptable for most regression purposes, as it causes problems for interpreting the model and for deriving statistical properties.

A better assumption involves the expected value of u given x. Because u and x are random variables, we can define the conditional distribution of u given any value of x.
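This counterintuitive feature is easy to verify numerically. In the constructed example below (not from the text), x is symmetric around zero and u is built from x^2, so u is uncorrelated with x yet perfectly correlated with x^2:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)   # draws symmetric around zero
u = x**2 - np.mean(x**2)       # a function of x with (sample) mean zero

corr_u_x = np.corrcoef(u, x)[0, 1]      # essentially zero
corr_u_x2 = np.corrcoef(u, x**2)[0, 1]  # one: u is linear in x^2

print(round(corr_u_x, 3), round(corr_u_x2, 3))
```

Here u is clearly not "unrelated" to x, yet the correlation coefficient between them is (up to sampling noise) zero, which is why zero correlation alone is too weak an assumption.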
In particular, for any x, we can obtain the expected (or average) value of u for that slice of the population described by the value of x. The crucial assumption is that the average value of u does not depend on the value of x. We can write this as

    E(u|x) = E(u) = 0,    (2.6)

where the second equality follows from (2.5). The first equality in equation (2.6) is the new assumption, called the zero conditional mean assumption. It says that, for any given value of x, the average of the unobservables is the same and therefore must equal the average value of u in the entire population.

Let us see what (2.6) entails in the wage example. To simplify the discussion, assume that u is the same as innate ability. Then (2.6) requires that the average level of ability is the same regardless of years of education. For example, if E(abil|8) denotes the average ability for the group of all people with eight years of education, and E(abil|16) denotes the average ability among people in the population with 16 years of education, then (2.6) implies that these must be the same. In fact, the average ability level must be the same for all education levels. If, for example, we think that average ability increases with years of education, then (2.6) is false. (This would happen if, on average, people with more ability choose to become more educated.) As we cannot observe innate ability, we have no way of knowing whether or not average ability is the

same for all education levels. But this is an issue that we must address before applying simple regression analysis.

In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots, then (2.6) will hold: the average land quality will not depend on the amount of fertilizer. However, if more fertilizer is put on the higher quality plots of land, then the expected value of u changes with the level of fertilizer, and (2.6) fails.

QUESTION 2.1: Suppose that a score on a final exam, score, depends on classes attended (attend) and unobserved factors that affect exam performance (such as student ability):

    score = \beta_0 + \beta_1 attend + u.    (2.7)

When would you expect this model to satisfy (2.6)?

Assumption (2.6) gives \beta_1 another interpretation that is often useful. Taking the expected value of (2.1) conditional on x and using E(u|x) = 0 gives

    E(y|x) = \beta_0 + \beta_1 x.    (2.8)

Equation (2.8) shows that the population regression function (PRF), E(y|x), is a linear function of x. The linearity means that a one-unit increase in x changes the expected value of y by the amount \beta_1.

[Figure 2.1: E(y|x) as a linear function of x.]

For any given value of x, the distribution of y is centered about E(y|x), as illustrated in Figure 2.1. When (2.6) is true, it is useful to break y into two components. The piece \beta_0 + \beta_1 x is sometimes called the systematic part of y—that is, the part of y explained by x—and u is called the unsystematic part, or the part of y not explained by x. We will use assumption (2.6) in the next section for motivating estimates of \beta_0 and \beta_1. This assumption is also crucial for the statistical analysis in Section 2.5.
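The PRF in (2.8) can be illustrated with a quick simulation. If we generate a population satisfying the zero conditional mean assumption (the parameter values 1 and 2 below are arbitrary choices, not from the text), the average of y within a narrow slice of x lines up with b0 + b1*x:

```python
import numpy as np

rng = np.random.default_rng(3)
b0, b1 = 1.0, 2.0                     # arbitrary population parameters
x = rng.uniform(0, 10, size=200_000)
u = rng.normal(size=x.size)           # E(u|x) = 0 holds by construction
y = b0 + b1 * x + u

# The average of y within a narrow slice of x tracks the PRF value b0 + b1*x
for center in (2.0, 5.0, 8.0):
    slice_mean = np.mean(y[np.abs(x - center) < 0.1])
    print(center, round(slice_mean, 2), b0 + b1 * center)
```

Each printed slice mean sits close to the corresponding PRF value, mirroring Figure 2.1: the distribution of y at each x is centered about E(y|x).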

2.2 DERIVING THE ORDINARY LEAST SQUARES ESTIMATES
Now that we have discussed the basic ingredients of the simple regression model, we will address the important issue of how to estimate the parameters \beta_0 and \beta_1 in equation (2.1). To do this, we need a sample from the population. Let {(x_i, y_i): i = 1, …, n} denote a random sample of size n from the population. Since these data come from (2.1), we can write

    y_i = \beta_0 + \beta_1 x_i + u_i    (2.9)

for each i. Here, u_i is the error term for observation i since it contains all factors affecting y_i other than x_i.

As an example, x_i might be the annual income and y_i the annual savings for family i during a particular year. If we have collected data on 15 families, then n = 15. A scatter plot of such a data set is given in Figure 2.2, along with the (necessarily fictitious) population regression function. We must decide how to use these data to obtain estimates of the intercept and slope in the population regression of savings on income.

There are several ways to motivate the following estimation procedure. We will use (2.5) and an important implication of assumption (2.6): in the population, u has a zero mean and is uncorrelated with x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero:

    E(u) = 0    (2.10)

    Cov(x, u) = E(xu) = 0,    (2.11)

where the first equality in (2.11) follows from (2.10). (See Section B.4 for the definition and properties of covariance.) In terms of the observable variables x and y and the unknown parameters \beta_0 and \beta_1, equations (2.10) and (2.11) can be written as

    E(y - \beta_0 - \beta_1 x) = 0    (2.12)

and

    E[x(y - \beta_0 - \beta_1 x)] = 0,    (2.13)

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution of (x, y) in the population. Since there are two unknown parameters to estimate, we might hope that equations (2.12) and (2.13) can be used to obtain good estimators of \beta_0 and \beta_1. In fact, they can be.

[Figure 2.2: Scatterplot of savings and income for 15 families, and the population regression E(savings|income) = \beta_0 + \beta_1 income.]

Given a sample of data, we choose estimates \hat{\beta}_0 and \hat{\beta}_1 to solve the sample counterparts of (2.12) and (2.13):

    n^{-1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0    (2.14)

    n^{-1} \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.    (2.15)

This is an example of the method of moments approach to estimation. (See Section C.4 for a discussion of different estimation approaches.) These equations can be solved for \hat{\beta}_0 and \hat{\beta}_1.

Using the basic properties of the summation operator from Appendix A, equation (2.14) can be rewritten as

    \bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x},    (2.16)

where \bar{y} = n^{-1} \sum_{i=1}^{n} y_i is the sample average of the y_i, and likewise for \bar{x}. This equation allows us to write \hat{\beta}_0 in terms of \hat{\beta}_1, \bar{y}, and \bar{x}:

    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.    (2.17)

Therefore, once we have the slope estimate β̂1, it is straightforward to obtain the intercept estimate β̂0, given ȳ and x̄.

Dropping the n⁻¹ in (2.15) (since it does not affect the solution) and plugging (2.17) into (2.15) yields

Σ_{i=1}^n xi(yi - (ȳ - β̂1x̄) - β̂1xi) = 0,

which, upon rearrangement, gives

Σ_{i=1}^n xi(yi - ȳ) = β̂1 Σ_{i=1}^n xi(xi - x̄).

From basic properties of the summation operator [see (A.7) and (A.8)],

Σ_{i=1}^n xi(xi - x̄) = Σ_{i=1}^n (xi - x̄)²  and  Σ_{i=1}^n xi(yi - ȳ) = Σ_{i=1}^n (xi - x̄)(yi - ȳ).

Therefore, provided that

Σ_{i=1}^n (xi - x̄)² > 0,  (2.18)

the estimated slope is

β̂1 = Σ_{i=1}^n (xi - x̄)(yi - ȳ) / Σ_{i=1}^n (xi - x̄)².  (2.19)

Equation (2.19) is simply the sample covariance between x and y divided by the sample variance of x. (See Appendix C. Dividing both the numerator and the denominator by n - 1 changes nothing.) This makes sense because β1 equals the population covariance divided by the variance of x when E(u) = 0 and Cov(x,u) = 0. An immediate implication is that if x and y are positively correlated in the sample, then β̂1 is positive; if x and y are negatively correlated, then β̂1 is negative.

Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: (2.18) is true provided the xi in the sample are not all equal to the same value. If (2.18) fails, then we have either been unlucky in obtaining our sample from the population or we have not specified an interesting problem (x does not vary in the population). For example, if y = wage and x = educ, then (2.18) fails only if everyone in the sample has the same amount of education (for example, if everyone is a high school graduate; see Figure 2.3). If just one person has a different amount of education, then (2.18) holds, and the OLS estimates can be computed.
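The formulas (2.17) and (2.19) can be sketched directly in code. The income and savings numbers below are made up purely for illustration; they are not the data behind Figure 2.2.

```python
# Minimal sketch of the OLS formulas: the slope (2.19) is the sample
# covariance of x and y over the sample variance of x, and the intercept
# (2.17) makes the fitted line pass through the point of sample means.

def ols_simple(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # equation (2.19): slope = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    if sxx == 0:
        # condition (2.18) fails: x does not vary in the sample
        raise ValueError("x is constant; slope is not defined")
    b1 = sxy / sxx
    # equation (2.17): intercept = ybar - b1 * xbar
    b0 = ybar - b1 * xbar
    return b0, b1

# hypothetical income (x) and savings (y) figures for five families
income = [30.0, 45.0, 60.0, 75.0, 90.0]
savings = [2.0, 4.5, 5.0, 8.0, 9.5]
b0, b1 = ols_simple(income, savings)
```

Note how the code raises an error exactly when condition (2.18) fails, mirroring the discussion above.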
Part 1
Regression Analysis with Cross-Sectional Data

Figure 2.3
A scatterplot of wage against education when educi = 12 for all i.

The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates of β0 and β1. To justify this name, for any β̂0 and β̂1, define a fitted value for y when x = xi, such as

ŷi = β̂0 + β̂1xi,  (2.20)

for the given intercept and slope. This is the value we predict for y when x = xi. There is a fitted value for each observation in the sample. The residual for observation i is the difference between the actual yi and its fitted value:

ûi = yi - ŷi = yi - β̂0 - β̂1xi.  (2.21)

Again, there are n such residuals. (These are not the same as the errors in (2.9), a point we return to in Section 2.5.) The fitted values and residuals are indicated in Figure 2.4. Now, suppose we choose β̂0 and β̂1 to make the sum of squared residuals,

Σ_{i=1}^n ûi² = Σ_{i=1}^n (yi - β̂0 - β̂1xi)²,  (2.22)

Chapter 2
The Simple Regression Model

Figure 2.4
Fitted values and residuals.

as small as possible. The appendix to this chapter shows that the conditions necessary for (β̂0, β̂1) to minimize (2.22) are given exactly by equations (2.14) and (2.15), without n⁻¹. Equations (2.14) and (2.15) are often called the first order conditions for the OLS estimates, a term that comes from optimization using calculus (see Appendix A). From our previous calculations, we know that the solutions to the OLS first order conditions are given by (2.17) and (2.19). The name "ordinary least squares" comes from the fact that these estimates minimize the sum of squared residuals. Once we have determined the OLS intercept and slope estimates, we form the OLS regression line:

ŷ = β̂0 + β̂1x,  (2.23)

where it is understood that β̂0 and β̂1 have been obtained using equations (2.17) and (2.19). The notation ŷ, read as "y hat," emphasizes that the predicted values from equation (2.23) are estimates. The intercept, β̂0, is the predicted value of y when x = 0, although in some cases it will not make sense to set x = 0. In those situations, β̂0 is not, in itself, very interesting. When using (2.23) to compute predicted values of y for various values of x, we must account for the intercept in the calculations. Equation (2.23) is also called the sample regression function (SRF) because it is the estimated version of the population regression function E(y|x) = β0 + β1x. It is important to remember that the PRF is something fixed, but unknown, in the population. Since the SRF is
obtained for a given sample of data, a new sample will generate a different slope and intercept in equation (2.23). In most cases the slope estimate, which we can write as

β̂1 = Δŷ/Δx,  (2.24)

is of primary interest. It tells us the amount by which ŷ changes when x increases by one unit. Equivalently,

Δŷ = β̂1Δx,  (2.25)

so that given any change in x (whether positive or negative), we can compute the predicted change in y. We now present several examples of simple regression obtained by using real data. In other words, we find the intercept and slope estimates with equations (2.17) and (2.19). Since these examples involve many observations, the calculations were done using an econometric software package. At this point, you should be careful not to read too much into these regressions; they are not necessarily uncovering a causal relationship. We have said nothing so far about the statistical properties of OLS. In Section 2.5, we consider statistical properties after we explicitly impose assumptions on the population model equation (2.1).
EXAMPLE 2.3 (CEO Salary and Return on Equity)

For the population of chief executive officers, let y be annual salary (salary) in thousands of dollars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1452.6 indicates a salary of $1,452,600. Let x be the average return on equity (roe) for the CEO's firm for the previous three years. (Return on equity is defined in terms of net income as a percentage of common equity.) For example, if roe = 10, then average return on equity is 10 percent. To study the relationship between this measure of firm performance and CEO compensation, we postulate the simple model

salary = β0 + β1roe + u.

The slope parameter β1 measures the change in annual salary, in thousands of dollars, when return on equity increases by one percentage point. Because a higher roe is good for the company, we think β1 > 0. The data set CEOSAL1.RAW contains information on 209 CEOs for the year 1990; these data were obtained from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 1989, and 1990 is 17.18 percent, with the smallest and largest values being 0.5 and 56.3 percent, respectively. Using the data in CEOSAL1.RAW, the OLS regression line relating salary to roe is

salaryhat = 963.191 + 18.501 roe,  (2.26)

where the intercept and slope estimates have been rounded to three decimal places; we use "salary hat" to indicate that this is an estimated equation. How do we interpret the equation? First, if the return on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 since salary is measured in thousands. Next, we can write the predicted change in salary as a function of the change in roe: Δsalaryhat = 18.501 (Δroe). This means that if the return on equity increases by one percentage point, Δroe = 1, then salary is predicted to change by about 18.5, or $18,500. Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary. We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose roe = 30. Then salaryhat = 963.191 + 18.501(30) = 1518.221, which is just over $1.5 million. However, this does not mean that a particular CEO whose firm had an roe = 30 earns $1,518,221. There are many other factors that affect salary. This is just our prediction from the OLS regression line (2.26). The estimated line is graphed in Figure 2.5, along with the population regression function E(salary|roe). We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data will give a different regression line, which may or may not be closer to the population regression line.
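The arithmetic above can be reproduced directly from the estimated line (2.26), using the rounded coefficients reported in the text:

```python
# Predicted CEO salary (in thousands of dollars) from the estimated
# line (2.26): salaryhat = 963.191 + 18.501*roe.

def predict_salary(roe):
    return 963.191 + 18.501 * roe

at_zero = predict_salary(0)    # the intercept: 963.191, i.e. $963,191
at_30 = predict_salary(30)     # 1518.221, i.e. just over $1.5 million
# because the line is linear, a one-point increase in roe always changes
# the prediction by the slope, 18.501
change_per_point = predict_salary(11) - predict_salary(10)
```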

Figure 2.5
The OLS regression line salaryhat = 963.191 + 18.501 roe and the (unknown) population regression function E(salary|roe) = β0 + β1roe.

Part 1
Regression Analysis with Cross-Sectional Data

EXAMPLE 2.4 (Wage and Education)

For the population of people in the work force in 1976, let y = wage, where wage is measured in dollars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ denote years of schooling; for example, educ = 12 corresponds to a complete high school education. Since the average wage in the sample is $5.90, the consumer price index indicates that this amount is equivalent to $16.64 in 1997 dollars. Using the data in WAGE1.RAW, where n = 526 individuals, we obtain the following OLS regression line (or sample regression function):

wagehat = -0.90 + 0.54 educ.  (2.27)

We must interpret this equation with caution. The intercept of -0.90 literally means that a person with no education has a predicted hourly wage of -90 cents an hour. This, of course, is silly. It turns out that no one in the sample has less than eight years of education, which helps to explain the crazy prediction for a zero education value. For a person with eight years of education, the predicted wage is wagehat = -0.90 + 0.54(8) = 3.42, or $3.42 per hour (in 1976 dollars). The slope estimate in (2.27) implies that one more year of education increases hourly wage by 54 cents an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16, or $2.16 per hour. These are fairly large effects. Because of the linear nature of (2.27), another year of education increases the wage by the same amount, regardless of the initial level of education. In Section 2.4, we discuss some methods that allow for nonconstant marginal effects of our explanatory variables.

QUESTION 2.2
The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What is this value in 1997 dollars? (Hint: You have enough information in Example 2.4 to answer this question.)
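One way to work with the deflation hinted at in Example 2.4 is a back-of-the-envelope sketch: the 1976 sample average wage of $5.90 is stated to correspond to $16.64 in 1997 dollars, which implies a conversion factor of 16.64/5.90 that can be applied to any 1976 amount.

```python
# Convert a 1976-dollar amount to 1997 dollars using the CPI ratio
# implied by the figures quoted in Example 2.4 ($5.90 -> $16.64).

factor_1976_to_1997 = 16.64 / 5.90
wage_1976 = 3.42                              # predicted wage at educ = 8
wage_1997 = wage_1976 * factor_1976_to_1997   # roughly $9.65 per hour
```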
EXAMPLE 2.5 (Voting Outcomes and Campaign Expenditures)

The file VOTE1.RAW contains data on election outcomes and campaign expenditures for 173 two-party races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the election outcome (including the quality of the candidates and possibly the dollar amounts spent by A and B). Nevertheless, we can estimate a simple regression model to find out whether spending more relative to one's challenger implies a higher percentage of the vote. The estimated equation using the 173 observations is

voteAhat = 40.90 + 0.306 shareA.  (2.28)

This means that, if the share of Candidate A's expenditures increases by one percentage point, Candidate A receives almost one-third of a percentage point more of the total vote. Whether or not this is a causal effect is unclear, but the result is what we might expect.

In some cases, regression analysis is not used to determine causality but to simply look at whether two variables are positively or negatively related, much like a standard correlation analysis. An example of this occurs in Problem 2.12, where you are asked to use data from Biddle and Hamermesh (1990) on time spent sleeping and working to investigate the tradeoff between these two factors.

QUESTION 2.3
In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which means 60 percent)? Does this answer seem reasonable?
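Predictions from (2.28) work the same way as in the salary example; a quick numerical sketch using the reported coefficients:

```python
# Predicted vote percentage for Candidate A from the estimated line
# (2.28): voteAhat = 40.90 + 0.306*shareA.

def predict_voteA(shareA):
    return 40.90 + 0.306 * shareA

vote_at_60 = predict_voteA(60)   # 40.90 + 18.36 = 59.26 percent
```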

A Note on Terminology
In most cases, we will indicate the estimation of a relationship through OLS by writing an equation such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS regression has been run without actually writing out the equation. We will often indicate that equation (2.23) has been obtained by OLS by saying that we

run the regression of y on x,  (2.29)

or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the dependent variable and which is the independent variable: we always regress the dependent variable on the independent variable. For specific applications, we replace y and x with their names. Thus, to obtain (2.26), we regress salary on roe, and to obtain (2.28), we regress voteA on shareA. When we use such terminology in (2.29), we will always mean that we plan to estimate the intercept, β̂0, along with the slope, β̂1. This case is appropriate for the vast majority of applications. Occasionally, we may want to estimate the relationship between y and x assuming that the intercept is zero (so that x = 0 implies ŷ = 0); we cover this case briefly in Section 2.6. Unless explicitly stated otherwise, we always estimate an intercept along with a slope.

2.3 MECHANICS OF OLS
In this section, we cover some algebraic properties of the fitted OLS regression line. Perhaps the best way to think about these properties is to realize that they are features of OLS for a particular sample of data. They can be contrasted with the statistical properties of OLS, which require deriving features of the sampling distributions of the estimators. We will discuss statistical properties in Section 2.5. Several of the algebraic properties we are going to derive will appear mundane. Nevertheless, having a grasp of these properties helps us to figure out what happens to the OLS estimates and related statistics when the data are manipulated in certain ways, such as when the measurement units of the dependent and independent variables change.

Fitted Values and Residuals
We assume that the intercept and slope estimates, β̂0 and β̂1, have been obtained for the given sample of data. Given β̂0 and β̂1, we can obtain the fitted value ŷi for each observation. [This is given by equation (2.20).] By definition, each fitted value ŷi is on the OLS regression line. The OLS residual associated with observation i, ûi, is the difference between yi and its fitted value, as given in equation (2.21). If ûi is positive, the line underpredicts yi; if ûi is negative, the line overpredicts yi. The ideal case for observation i is ûi = 0, but in most cases every residual is nonzero. In other words, none of the data points need actually lie on the OLS line.
EXAMPLE 2.6 (CEO Salary and Return on Equity)

Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted values, called salaryhat, and the residuals, called uhat. (The signs of the residuals follow from uhat = salary - salaryhat.)

Table 2.2
Fitted Values and Residuals for the First 15 CEOs

obsno    roe     salary    salaryhat    uhat
  1      14.1    1095      1224.058     -129.0581
  2      10.9    1001      1164.854     -163.8542
  3      23.5    1122      1397.969     -275.9692
  4       5.9     578      1072.348     -494.3484
  5      13.8    1368      1218.508      149.4923
  6      20.0    1145      1333.215     -188.2151
  7      16.4    1078      1266.611     -188.6108
  8      16.3    1094      1264.761     -170.7606
  9      10.5    1237      1157.454       79.54626
 10      26.3     833      1449.773     -616.7726
 11      25.9     567      1442.372     -875.3721
 12      26.8     933      1459.023     -526.0231
 13      14.8    1339      1237.009      101.9911
 14      22.3     937      1375.768     -438.7678
 15      56.3    2011      2004.808        6.191895

The first four CEOs have lower salaries than what we predicted from the OLS regression line (2.26); in other words, given only the firm’s roe, these CEOs make less than what we predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from the OLS regression line.
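The first few rows of Table 2.2 can be reproduced from the estimated line (2.26). Because the text reports the coefficients rounded to three decimals, the fitted values below match the table only up to small rounding differences.

```python
# Fitted values and residuals for the first five CEOs, computed from the
# rounded coefficients of (2.26): salaryhat = 963.191 + 18.501*roe.

def fitted_salary(roe):
    return 963.191 + 18.501 * roe

roe = [14.1, 10.9, 23.5, 5.9, 13.8]
salary = [1095, 1001, 1122, 578, 1368]
salaryhat = [fitted_salary(r) for r in roe]
uhat = [s - sh for s, sh in zip(salary, salaryhat)]
# the first four residuals are negative (salary below the line); the
# fifth is positive, as noted in the text
```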

Algebraic Properties of OLS Statistics
There are several useful algebraic properties of OLS estimates and their associated statistics. We now cover the three most important of these.

(1) The sum, and therefore the sample average, of the OLS residuals is zero. Mathematically,

Σ_{i=1}^n ûi = 0.  (2.30)

This property needs no proof; it follows immediately from the OLS first order condition (2.14), when we remember that the residuals are defined by ûi = yi - β̂0 - β̂1xi. In other words, the OLS estimates β̂0 and β̂1 are chosen to make the residuals add up to zero (for any data set). This says nothing about the residual for any particular observation i.

(2) The sample covariance between the regressors and the OLS residuals is zero. This follows from the first order condition (2.15), which can be written in terms of the residuals as

Σ_{i=1}^n xiûi = 0.  (2.31)

The sample average of the OLS residuals is zero, so the left hand side of (2.31) is proportional to the sample covariance between xi and ûi.

(3) The point (x̄, ȳ) is always on the OLS regression line. In other words, if we take equation (2.23) and plug in x̄ for x, then the predicted value is ȳ. This is exactly what equation (2.16) shows us.
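The three properties can be checked numerically on any data set; here is a minimal sketch on made-up numbers, using the slope and intercept formulas (2.17) and (2.19).

```python
# Numerical check of the algebraic properties of OLS: the residuals sum
# to zero (2.30), the residuals are orthogonal to the regressor (2.31),
# and the point (xbar, ybar) lies on the fitted line.

def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
b0, b1 = ols(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_resid = sum(resid)                                   # property (1)
sum_x_resid = sum(xi * ui for xi, ui in zip(x, resid))   # property (2)
mean_on_line = b0 + b1 * (sum(x) / len(x))               # property (3)
```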

EXAMPLE 2.7 (Wage and Education)

For the data in WAGE1.RAW, the average hourly wage in the sample is 5.90, rounded to two decimal places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line (2.27), we get wagehat = -0.90 + 0.54(12.56) = 5.8824, which equals 5.9 when rounded to the first decimal place. The reason these figures do not exactly agree is that we have rounded the average wage and education, as well as the intercept and slope estimates. If we did not initially round any of the values, we would get the answers to agree more closely, but this has little practical effect.

Writing each yi as its fitted value, plus its residual, provides another way to interpret an OLS regression. For each i, write

yi = ŷi + ûi.  (2.32)

From property (1) above, the average of the residuals is zero; equivalently, the sample average of the fitted values, ŷi, is the same as the sample average of the yi, so the mean of the ŷi equals ȳ. Further, properties (1) and (2) can be used to show that the sample covariance between ŷi and ûi is zero. Thus, we can view OLS as decomposing each yi into two parts, a fitted value and a residual. The fitted values and residuals are uncorrelated in the sample.

Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares (SSR) (also known as the sum of squared residuals), as follows:

SST = Σ_{i=1}^n (yi - ȳ)².  (2.33)

SSE = Σ_{i=1}^n (ŷi - ȳ)².  (2.34)

SSR = Σ_{i=1}^n ûi².  (2.35)

SST is a measure of the total sample variation in the yi; that is, it measures how spread out the yi are in the sample. If we divide SST by n - 1, we obtain the sample variance of y, as discussed in Appendix C. Similarly, SSE measures the sample variation in the ŷi (where we use the fact that the average of the ŷi is ȳ), and SSR measures the sample variation in the ûi. The total variation in y can always be expressed as the sum of the explained variation SSE and the unexplained variation SSR. Thus,

SST = SSE + SSR.  (2.36)

Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator covered in Appendix A. Write

Σ_{i=1}^n (yi - ȳ)² = Σ_{i=1}^n [(yi - ŷi) + (ŷi - ȳ)]²
  = Σ_{i=1}^n [ûi + (ŷi - ȳ)]²
  = Σ_{i=1}^n ûi² + 2 Σ_{i=1}^n ûi(ŷi - ȳ) + Σ_{i=1}^n (ŷi - ȳ)²
  = SSR + 2 Σ_{i=1}^n ûi(ŷi - ȳ) + SSE.

Now (2.36) holds if we show that

Σ_{i=1}^n ûi(ŷi - ȳ) = 0.  (2.37)

But we have already claimed that the sample covariance between the residuals and the fitted values is zero, and this covariance is just (2.37) divided by n - 1. Thus, we have established (2.36). Some words of caution about SST, SSE, and SSR are in order. There is no uniform agreement on the names or abbreviations for the three quantities defined in equations (2.33), (2.34), and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here. Unfortunately, the explained sum of squares is sometimes called the "regression sum of squares." If this term is given its natural abbreviation, it can easily be confused with the term residual sum of squares. Some regression packages refer to the explained sum of squares as the "model sum of squares." To make matters even worse, the residual sum of squares is often called the "error sum of squares." This is especially unfortunate because, as we will see in Section 2.5, the errors and the residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the sum of squared residuals. We prefer to use the abbreviation SSR to denote the sum of squared residuals, because it is more common in econometric packages.
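The decomposition (2.36) can be verified numerically on a small made-up sample; the cross term in the expansion vanishes because the residuals are uncorrelated with the fitted values.

```python
# Numerical check of SST = SSE + SSR, equation (2.36), using the OLS
# formulas (2.17) and (2.19) on invented data.

def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [1.2, 2.8, 2.9, 4.4, 5.1]
b0, b1 = ols(x, y)
ybar = sum(y) / len(y)
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)                 # (2.33)
sse = sum((yh - ybar) ** 2 for yh in yhat)              # (2.34)
ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # (2.35)
```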

Goodness-of-Fit
So far, we have no way of measuring how well the explanatory or independent variable, x, explains the dependent variable, y. It is often useful to compute a number that summarizes how well the OLS regression line fits the data. In the following discussion, be sure to remember that we assume that an intercept is estimated along with the slope. Assuming that the total sum of squares, SST, is not equal to zero, which is true except in the very unlikely event that all the yi equal the same value, we can divide (2.36) by SST to get 1 = SSE/SST + SSR/SST. The R-squared of the regression, sometimes called the coefficient of determination, is defined as
R² = SSE/SST = 1 - SSR/SST.  (2.38)

R² is the ratio of the explained variation to the total variation, and thus it is interpreted as the fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides another way of computing R². From (2.36), the value of R² is always between zero and one, since SSE can be no greater than SST. When interpreting R², we usually multiply it by 100 to change it into a percent: 100·R² is the percentage of the sample variation in y that is explained by x. If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case, R² = 1. A value of R² that is nearly equal to zero indicates a poor fit of the OLS line: very little of the variation in the yi is captured by the variation in the ŷi (which all lie on the OLS regression line). In fact, it can be shown that R² is equal to the square of the sample correlation coefficient between yi and ŷi. This is where the term "R-squared" came from. (The letter R was traditionally used to denote an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)
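The two routes to R-squared can be compared on made-up data: the ratio SSE/SST from (2.38), and the squared sample correlation between the yi and the ŷi. They agree, which is where the name comes from.

```python
# Two equivalent computations of R-squared on an invented sample.

import math

def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

def corr(a, b):
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    cov = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - abar) ** 2 for ai in a) *
                           sum((bi - bbar) ** 2 for bi in b))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.5, 2.2, 4.1, 4.6]
b0, b1 = ols(x, y)
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

r2_from_sse = sum((yh - ybar) ** 2 for yh in yhat) / \
              sum((yi - ybar) ** 2 for yi in y)      # SSE/SST, eq. (2.38)
r2_from_corr = corr(y, yhat) ** 2                    # squared correlation
```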
EXAMPLE 2.8 (CEO Salary and Return on Equity)

In the CEO salary regression, we obtain the following:

salaryhat = 963.191 + 18.501 roe
n = 209, R² = 0.0132.  (2.39)

We have reproduced the OLS regression line and the number of observations for clarity. Using the R-squared (rounded to four decimal places) reported for this equation, we can see how much of the variation in salary is actually explained by the return on equity. The answer is: not much. The firm’s return on equity explains only about 1.3% of the variation in salaries for this sample of 209 CEOs. That means that 98.7% of the salary variations for these CEOs is left unexplained! This lack of explanatory power may not be too surprising since there are many other characteristics of both the firm and the individual CEO that should influence salary; these factors are necessarily included in the errors in a simple regression analysis.

In the social sciences, low R-squareds in regression equations are not uncommon, especially for cross-sectional analysis. We will discuss this issue more generally under multiple regression analysis, but it is worth emphasizing now that a seemingly low R-squared does not necessarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good estimate of the ceteris paribus relationship between salary and roe; whether or not this is true does not depend directly on the size of R-squared. Students who are first learning econometrics tend to put too much weight on the size of the R-squared in evaluating regression equations. For now, be aware that using R-squared as the main gauge of success for an econometric analysis can lead to trouble. Sometimes, the explanatory variable does explain a substantial part of the sample variation in the dependent variable, as in the following example.

EXAMPLE 2.9 (Voting Outcomes and Campaign Expenditures)

In the voting outcome equation in (2.28), R² = 0.505. Thus, the share of campaign expenditures explains just over 50 percent of the variation in the election outcomes for this sample. This is a fairly sizable portion.

2.4 UNITS OF MEASUREMENT AND FUNCTIONAL FORM
Two important issues in applied economics are (1) understanding how changing the units of measurement of the dependent and/or independent variables affects OLS estimates and (2) knowing how to incorporate popular functional forms used in economics into regression analysis. The mathematics needed for a full understanding of functional form issues is reviewed in Appendix A.

The Effects of Changing Units of Measurement on OLS Statistics
In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity was measured as a percent (rather than as a decimal). It is crucial to know how salary and roe are measured in this example in order to make sense of the estimates in equation (2.39). We must also know that OLS estimates change in entirely expected ways when the units of measurement of the dependent and independent variables change. In Example 2.3, suppose that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol has a simple relationship to the salary measured in thousands of dollars: salardol = 1,000·salary. We do not need to actually run the regression of salardol on roe to know that the estimated equation is:

salardolhat = 963,191 + 18,501 roe.  (2.40)

We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, then salardolhat = 963,191, so the predicted salary is $963,191 [the same value we obtained from equation (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; again, this is what we concluded from our earlier analysis of equation (2.39). Generally, it is easy to figure out what happens to the intercept and slope estimates when the dependent variable changes units of measurement. If the dependent variable is multiplied by the constant c, which means each value in the sample is multiplied by c, then the OLS intercept and slope estimates are also multiplied by c. (This assumes nothing has changed about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to salardol.

We can also use the CEO salary example to see what happens when we change the units of measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of roe; thus, roedec = 0.23 means a return on equity of 23 percent. To focus on changing the units of measurement of the independent variable, we return to our original dependent variable, salary, which is measured in thousands of dollars. When we regress salary on roedec, we obtain

salaryhat = 963.191 + 1,850.1 roedec.  (2.41)

QUESTION 2.4
Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?

The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should be. Changing roe by one percentage point is equivalent to Δroedec = 0.01. From (2.41), if Δroedec = 0.01, then Δsalaryhat = 1,850.1(0.01) = 18.501, which is what is obtained by using (2.39). Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope coefficient is multiplied or divided by c, respectively. The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on equity. In general, changing the units of measurement of only the independent variable does not affect the intercept. In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression. We can also ask what happens to R² when the unit of measurement of either the independent or the dependent variable changes. Without doing any algebra, we should know the result: the goodness-of-fit of the model should not depend on the units of measurement of our variables. For example, the amount of variation in salary explained by the return on equity should not depend on whether salary is measured in dollars or in thousands of dollars, or on whether return on equity is a percent or a decimal. This intuition can be verified mathematically: using the definition of R², it can be shown that R² is, in fact, invariant to changes in the units of y or x.
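These scaling rules can be checked on a small made-up sample: multiplying y by c multiplies both estimates by c; dividing x by c multiplies the slope by c and leaves the intercept alone.

```python
# Numerical check of the units-of-measurement rules, using the OLS
# formulas (2.17) and (2.19) on invented data.

def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.1, 2.0, 2.4, 3.5, 4.2]
b0, b1 = ols(x, y)

# scale the dependent variable by c = 1000 (e.g. thousands -> dollars)
b0_c, b1_c = ols(x, [1000 * yi for yi in y])

# divide the independent variable by 100 (e.g. percent -> decimal)
b0_d, b1_d = ols([xi / 100 for xi in x], y)
```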

Incorporating Nonlinearities in Simple Regression
So far we have focused on linear relationships between the dependent and independent variables. As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression analysis by appropriately defining the dependent and independent variables. Here we will cover two possibilities that often appear in applied work. In reading applied work in the social sciences, you will often encounter regression equations where the dependent variable appears in logarithmic form. Why is this done? Recall the wage-education example, where we regressed hourly wage on years of education. We obtained a slope estimate of 0.54 [see equation (2.27)], which means that each additional year of education is predicted to increase hourly wage by 54 cents.

Because of the linear nature of (2.27), 54 cents is the increase for either the first year of education or the twentieth year; this may not be reasonable. Suppose, instead, that the percentage increase in wage is the same given one more year of education. Model (2.27) does not imply a constant percentage increase: the percentage increase depends on the initial wage. A model that gives (approximately) a constant percentage effect is

log(wage) = β0 + β1educ + u,  (2.42)

where log(·) denotes the natural logarithm. (See Appendix A for a review of logarithms.) In particular, if Δu = 0, then

%Δwage ≈ (100·β1)Δeduc.  (2.43)

Notice how we multiply β1 by 100 to get the percentage change in wage given one additional year of education. Since the percentage change in wage is the same for each additional year of education, the change in wage for an extra year of education increases as education increases; in other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write wage = exp(β0 + β1educ + u). This equation is graphed in Figure 2.6, with u = 0.

Figure 2.6: $wage = \exp(\beta_0 + \beta_1 educ)$, with $\beta_1 > 0$. [Plot of wage against educ; the curve rises at an increasing rate.]


Part 1

Regression Analysis with Cross-Sectional Data

Estimating a model such as (2.42) is straightforward when using simple regression. Just define the dependent variable, y, to be $y = \log(wage)$. The independent variable is $x = educ$. The mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas (2.17) and (2.19). In other words, we obtain $\hat\beta_0$ and $\hat\beta_1$ from the OLS regression of log(wage) on educ.
EXAMPLE 2.10 (A Log Wage Equation)

Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the following relationship:

$\widehat{\log(wage)} = 0.584 + 0.083\, educ$
$n = 526, \quad R^2 = 0.186.$   (2.44)

The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases by 8.3 percent for every additional year of education. This is what economists mean when they refer to the "return to another year of education." It is important to remember that the main reason for using the log of wage in (2.42) is to impose a constant percentage effect of education on wage. Once equation (2.42) is obtained, the natural log of wage is rarely mentioned. In particular, it is not correct to say that another year of education increases log(wage) by 8.3%. The intercept in (2.42) is not very meaningful, as it gives the predicted log(wage) when educ = 0. The R-squared shows that educ explains about 18.6 percent of the variation in log(wage) (not wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between wage and schooling. If there are "diploma effects," then the twelfth year of education (graduation from high school) could be worth much more than the eleventh year. We will learn how to allow for this kind of nonlinearity in Chapter 7.
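The mechanics of Example 2.10 can be sketched in a few lines of code. The data below are simulated stand-ins for the wage sample (the coefficients 0.584 and 0.083 are built into the data-generating process), so only the form of the calculation, not the exact estimates, matches the example:

```python
import numpy as np

# Simulated stand-in for the wage data (hypothetical, not the actual WAGE1 sample).
rng = np.random.default_rng(0)
educ = rng.integers(8, 19, size=526).astype(float)
log_wage = 0.584 + 0.083 * educ + rng.normal(0, 0.48, size=526)

# OLS slope and intercept, formulas (2.19) and (2.17), with y = log(wage).
y, x = log_wage, educ
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# R-squared: fraction of the variation in log(wage) explained by educ.
resid = y - b0 - b1 * x
r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

print(f"log(wage)-hat = {b0:.3f} + {b1:.3f} educ  (R^2 = {r2:.3f})")
# 100*b1 is the estimated percent increase in wage per extra year of education.
```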

Another important use of the natural log is in obtaining a constant elasticity model.

EXAMPLE 2.11 (CEO Salary and Firm Sales)

We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, measured in millions of dollars. A constant elasticity model is

$\log(salary) = \beta_0 + \beta_1 \log(sales) + u,$   (2.45)

where $\beta_1$ is the elasticity of salary with respect to sales. This model falls under the simple regression model by defining the dependent variable to be $y = \log(salary)$ and the independent variable to be $x = \log(sales)$. Estimating this equation by OLS gives

$\widehat{\log(salary)} = 4.822 + 0.257\, \log(sales)$
$n = 209, \quad R^2 = 0.211.$   (2.46)

The coefficient of log(sales) is the estimated elasticity of salary with respect to sales. It implies that a 1 percent increase in firm sales increases CEO salary by about 0.257 percent, the usual interpretation of an elasticity.

The two functional forms covered in this section will often arise in the remainder of this text. We have covered models containing natural logarithms here because they appear so frequently in applied work. The interpretation of such models will not be much different in the multiple regression case.

It is also useful to note what happens to the intercept and slope estimates if we change the units of measurement of the dependent variable when it appears in logarithmic form. Because the change to logarithmic form approximates a proportionate change, it makes sense that nothing happens to the slope. We can see this by writing the rescaled variable as $c_1 y_i$ for each observation $i$. The original equation is $\log(y_i) = \beta_0 + \beta_1 x_i + u_i$. If we add $\log(c_1)$ to both sides, we get $\log(c_1) + \log(y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$, or $\log(c_1 y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$. (Remember that the sum of the logs is equal to the log of their product, as shown in Appendix A.) Therefore, the slope is still $\beta_1$, but the intercept is now $\log(c_1) + \beta_0$. Similarly, if the independent variable is log(x), and we change the units of measurement of x before taking the log, the slope remains the same, but the intercept changes. You will be asked to verify these claims in Problem 2.9.

We end this subsection by summarizing four combinations of functional forms available from using either the original variable or its natural log. In Table 2.3, x and y stand for the variables in their original form. The model with y as the dependent variable and x as the independent variable is called the level-level model, because each variable appears in its level form. The model with log(y) as the dependent variable and x as the independent variable is called the log-level model. We will not explicitly discuss the level-log model here, because it arises less often in practice. In any case, we will see examples of this model in later chapters.
Table 2.3 Summary of Functional Forms Involving Logarithms

  Model         Dependent Variable   Independent Variable   Interpretation of $\beta_1$
  level-level   y                    x                      $\Delta y = \beta_1 \Delta x$
  level-log     y                    log(x)                 $\Delta y = (\beta_1/100)\%\Delta x$
  log-level     log(y)               x                      $\%\Delta y = (100\beta_1)\Delta x$
  log-log       log(y)               log(x)                 $\%\Delta y = \beta_1 \%\Delta x$

The last column in Table 2.3 gives the interpretation of $\beta_1$. In the log-level model, $100\beta_1$ is sometimes called the semi-elasticity of y with respect to x. As we mentioned in Example 2.11, in the log-log model, $\beta_1$ is the elasticity of y with respect to x. Table 2.3 warrants careful study, as we will refer to it often in the remainder of the text.
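The unit-rescaling claim above (slope unchanged, intercept shifted by $\log(c_1)$) is easy to verify numerically. This sketch uses simulated data, and $c_1 = 1000$ is an arbitrary rescaling factor:

```python
import numpy as np

def ols(x, y):
    """Return (intercept, slope) from the simple OLS formulas."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = np.exp(1.0 + 0.3 * x + rng.normal(0, 0.2, size=200))  # y > 0 so log(y) is defined

c1 = 1000.0  # e.g., measuring y in dollars instead of thousands of dollars
b0, b1 = ols(x, np.log(y))
b0_r, b1_r = ols(x, np.log(c1 * y))

print(b1_r - b1)              # zero: the slope is unaffected
print(b0_r - b0, np.log(c1))  # the intercept shifts by exactly log(c1)
```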

The Meaning of “Linear” Regression
The simple regression model that we have studied in this chapter is also called the simple linear regression model. Yet, as we have just seen, the general model also allows for certain nonlinear relationships. So what does "linear" mean here? You can see by looking at equation (2.1) that $y = \beta_0 + \beta_1 x + u$. The key is that this equation is linear in the parameters $\beta_0$ and $\beta_1$. There are no restrictions on how y and x relate to the original explained and explanatory variables of interest. As we saw in Examples 2.7 and 2.8, y and x can be natural logs of variables, and this is quite common in applications. But we need not stop there. For example, nothing prevents us from using simple regression to estimate a model such as $cons = \beta_0 + \beta_1 \sqrt{inc} + u$, where cons is annual consumption and inc is annual income.

While the mechanics of simple regression do not depend on how y and x are defined, the interpretation of the coefficients does depend on their definitions. For successful empirical work, it is much more important to become proficient at interpreting coefficients than to become efficient at computing formulas such as (2.19). We will get much more practice with interpreting the estimates in OLS regression lines when we study multiple regression.

There are plenty of models that cannot be cast as a linear regression model because they are not linear in their parameters; an example is $cons = 1/(\beta_0 + \beta_1 inc) + u$. Estimation of such models takes us into the realm of the nonlinear regression model, which is beyond the scope of this text. For most applications, choosing a model that can be put into the linear regression framework is sufficient.
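To see that "linear in parameters" is what matters, here is a sketch estimating a model with $x = \sqrt{inc}$ by the unchanged OLS formulas; the consumption and income numbers are simulated and purely illustrative:

```python
import numpy as np

# Hypothetical data: cons = b0 + b1*sqrt(inc) + u is linear in b0 and b1,
# so the usual OLS formulas apply once we define the regressor as sqrt(inc).
rng = np.random.default_rng(2)
inc = rng.uniform(10, 100, size=300)                       # annual income
cons = 5.0 + 12.0 * np.sqrt(inc) + rng.normal(0, 4, size=300)

x, y = np.sqrt(inc), cons                                  # define x = sqrt(inc)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)   # close to the true values 5 and 12
```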

2.5 EXPECTED VALUES AND VARIANCES OF THE OLS ESTIMATORS
In Section 2.1, we defined the population model $y = \beta_0 + \beta_1 x + u$, and we claimed that the key assumption for simple regression analysis to be useful is that the expected value of u given any value of x is zero. In Sections 2.2, 2.3, and 2.4, we discussed the algebraic properties of OLS estimation. We now return to the population model and study the statistical properties of OLS. In other words, we now view $\hat\beta_0$ and $\hat\beta_1$ as estimators for the parameters $\beta_0$ and $\beta_1$ that appear in the population model. This means that we will study properties of the distributions of $\hat\beta_0$ and $\hat\beta_1$ over different random samples from the population. (Appendix C contains definitions of estimators and reviews some of their important properties.)

Unbiasedness of OLS
We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future reference, it is useful to number these assumptions using the prefix “SLR” for simple linear regression. The first assumption defines the population model.

ASSUMPTION SLR.1 (LINEAR IN PARAMETERS): In the population model, the dependent variable y is related to the independent variable x and the error (or disturbance) u as

$y = \beta_0 + \beta_1 x + u,$   (2.47)

where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.

To be realistic, y, x, and u are all viewed as random variables in stating the population model. We discussed the interpretation of this model at some length in Section 2.1 and gave several examples. In the previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity models). We are interested in using data on y and x to estimate the parameters $\beta_0$ and, especially, $\beta_1$. We assume that our data were obtained as a random sample. (See Appendix C for a review of random sampling.)

ASSUMPTION SLR.2 (RANDOM SAMPLING): We can use a random sample of size n, $\{(x_i, y_i): i = 1, 2, \dots, n\}$, from the population model.

We will have to address failure of the random sampling assumption in later chapters that deal with time series analysis and sample selection problems. Not all cross-sectional samples can be viewed as outcomes of random samples, but many can be.

We can write (2.47) in terms of the random sample as

$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \dots, n,$   (2.48)

where $u_i$ is the error or disturbance for observation $i$ (for example, person $i$, firm $i$, city $i$, and so on). Thus, $u_i$ contains the unobservables for observation $i$ that affect $y_i$. The $u_i$ should not be confused with the residuals, $\hat u_i$, that we defined in Section 2.3. Later on, we will explore the relationship between the errors and the residuals. For interpreting $\beta_0$ and $\beta_1$ in a particular application, (2.47) is most informative, but (2.48) is also needed for some of the statistical derivations. The relationship (2.48) can be plotted for a particular outcome of data as shown in Figure 2.7.

In order to obtain unbiased estimators of $\beta_0$ and $\beta_1$, we need to impose the zero conditional mean assumption that we discussed in some detail in Section 2.1. We now explicitly add it to our list of assumptions.

ASSUMPTION SLR.3 (ZERO CONDITIONAL MEAN):

$E(u|x) = 0.$

Figure 2.7: Graph of $y_i = \beta_0 + \beta_1 x_i + u_i$. [The population regression function $E(y|x) = \beta_0 + \beta_1 x$ (PRF) is a straight line; each observation $y_i$ lies above or below it by its error $u_i$.]

For a random sample, this assumption implies that $E(u_i|x_i) = 0$ for all $i = 1, 2, \dots, n$. In addition to restricting the relationship between u and x in the population, the zero conditional mean assumption, coupled with the random sampling assumption, allows for a convenient technical simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional on the values of the $x_i$ in our sample. Technically, in statistical derivations, conditioning on the sample values of the independent variable is the same as treating the $x_i$ as fixed in repeated samples. This process involves several steps. We first choose n sample values for $x_1, x_2, \dots, x_n$ (these can be repeated). Given these values, we then obtain a sample on y (effectively by obtaining a random sample of the $u_i$). Next, another sample of y is obtained, using the same values for $x_1, \dots, x_n$. Then another sample of y is obtained, again using the same $x_i$. And so on.

The fixed-in-repeated-samples scenario is not very realistic in nonexperimental contexts. For instance, in sampling individuals for the wage-education example, it makes little sense to think of choosing the values of educ ahead of time and then sampling individuals with those particular levels of education. Random sampling, where individuals are chosen randomly and their wage and education are both recorded, is representative of how most data sets are obtained for empirical analysis in the social sciences. Once we assume that $E(u|x) = 0$ and we have random sampling, nothing is lost in derivations by treating the $x_i$ as nonrandom. (The danger is that the fixed-in-repeated-samples assumption always implies that $u_i$ and $x_i$ are independent.) In deciding when

simple regression analysis is going to produce unbiased estimators, it is critical to think in terms of Assumption SLR.3. Once we have agreed to condition on the xi , we need one final assumption for unbiasedness.
ASSUMPTION SLR.4 (SAMPLE VARIATION IN THE INDEPENDENT VARIABLE): In the sample, the independent variables $x_i$, $i = 1, 2, \dots, n$, are not all equal to the same constant. This requires some variation in x in the population.

We encountered Assumption SLR.4 when we derived the formulas for the OLS estimators; it is equivalent to $\sum_{i=1}^n (x_i - \bar{x})^2 > 0$. Of the four assumptions made, this is the least important because it essentially never fails in interesting applications. If Assumption SLR.4 does fail, we cannot compute the OLS estimators, which means statistical analysis is irrelevant.

Using the fact that $\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n (x_i - \bar{x})y_i$ (see Appendix A), we can write the OLS slope estimator in equation (2.19) as

$\hat\beta_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2}.$   (2.49)
Because we are now interested in the behavior of $\hat\beta_1$ across all possible samples, $\hat\beta_1$ is properly viewed as a random variable. We can write $\hat\beta_1$ in terms of the population coefficients and errors by substituting the right-hand side of (2.48) into (2.49). We have

$\hat\beta_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})y_i}{s_x^2} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{s_x^2},$   (2.50)

where we have defined the total variation in the $x_i$ as $s_x^2 = \sum_{i=1}^n (x_i - \bar{x})^2$ in order to simplify the notation. (This is not quite the sample variance of the $x_i$ because we do not divide by $n - 1$.) Using the algebra of the summation operator, write the numerator of $\hat\beta_1$ as

$\sum_{i=1}^n (x_i - \bar{x})\beta_0 + \sum_{i=1}^n (x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^n (x_i - \bar{x})u_i = \beta_0 \sum_{i=1}^n (x_i - \bar{x}) + \beta_1 \sum_{i=1}^n (x_i - \bar{x})x_i + \sum_{i=1}^n (x_i - \bar{x})u_i.$   (2.51)


As shown in Appendix A, $\sum_{i=1}^n (x_i - \bar{x}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x})x_i = \sum_{i=1}^n (x_i - \bar{x})^2 = s_x^2$. Therefore, we can write the numerator of $\hat\beta_1$ as $\beta_1 s_x^2 + \sum_{i=1}^n (x_i - \bar{x})u_i$. Writing this over the denominator gives

$\hat\beta_1 = \beta_1 + \dfrac{\sum_{i=1}^n (x_i - \bar{x})u_i}{s_x^2} = \beta_1 + (1/s_x^2)\sum_{i=1}^n d_i u_i,$   (2.52)

where $d_i = x_i - \bar{x}$. We now see that the estimator $\hat\beta_1$ equals the population slope $\beta_1$, plus a term that is a linear combination of the errors $\{u_1, u_2, \dots, u_n\}$. Conditional on the values of the $x_i$, the randomness in $\hat\beta_1$ is due entirely to the errors in the sample. The fact that these errors are generally different from zero is what causes $\hat\beta_1$ to differ from $\beta_1$. Using the representation in (2.52), we can prove the first important statistical property of OLS.

THEOREM 2.1 (UNBIASEDNESS OF OLS): Using Assumptions SLR.1 through SLR.4,

$E(\hat\beta_0) = \beta_0 \quad \text{and} \quad E(\hat\beta_1) = \beta_1$   (2.53)

for any values of $\beta_0$ and $\beta_1$. In other words, $\hat\beta_0$ is unbiased for $\beta_0$, and $\hat\beta_1$ is unbiased for $\beta_1$.

PROOF:

In this proof, the expected values are conditional on the sample values of the independent variable. Since $s_x^2$ and $d_i$ are functions only of the $x_i$, they are nonrandom in the conditioning. Therefore, from (2.52),

$E(\hat\beta_1) = \beta_1 + E\left[(1/s_x^2)\sum_{i=1}^n d_i u_i\right] = \beta_1 + (1/s_x^2)\sum_{i=1}^n E(d_i u_i) = \beta_1 + (1/s_x^2)\sum_{i=1}^n d_i E(u_i) = \beta_1 + (1/s_x^2)\sum_{i=1}^n d_i \cdot 0 = \beta_1,$

where we have used the fact that the expected value of each $u_i$ (conditional on $\{x_1, x_2, \dots, x_n\}$) is zero under Assumptions SLR.2 and SLR.3.

The proof for $\hat\beta_0$ is now straightforward. Average (2.48) across $i$ to get $\bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{u}$, and plug this into the formula for $\hat\beta_0$:

$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat\beta_1 \bar{x} = \beta_0 + (\beta_1 - \hat\beta_1)\bar{x} + \bar{u}.$

Then, conditional on the values of the $x_i$,

$E(\hat\beta_0) = \beta_0 + E[(\beta_1 - \hat\beta_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat\beta_1)]\bar{x},$

since $E(\bar{u}) = 0$ by Assumptions SLR.2 and SLR.3. But we showed that $E(\hat\beta_1) = \beta_1$, which implies that $E[(\hat\beta_1 - \beta_1)] = 0$. Thus, $E(\hat\beta_0) = \beta_0$. Both of these arguments are valid for any values of $\beta_0$ and $\beta_1$, and so we have established unbiasedness.

Remember that unbiasedness is a feature of the sampling distributions of $\hat\beta_1$ and $\hat\beta_0$, which says nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow "typical," then our estimate should be "near" the population value. Unfortunately, it is always possible that we could obtain an unlucky sample that gives us a point estimate far from $\beta_1$, and we can never know for sure whether this is the case. You may want to review the material on unbiased estimators in Appendix C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.

Unbiasedness generally fails if any of our four assumptions fail, so it is important to think about the veracity of each assumption for a particular application. As we have already discussed, if Assumption SLR.4 fails, then we will not be able to obtain the OLS estimates. Assumption SLR.1 requires that y and x be linearly related, with an additive disturbance. This can certainly fail, but we also know that y and x can be chosen to yield interesting nonlinear relationships. Dealing with the failure of (2.47) requires more advanced methods that are beyond the scope of this text.

Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross section when samples are not representative of the underlying population; in fact, some data sets are constructed by intentionally oversampling different parts of the population. We will discuss problems of nonrandom sampling in Chapters 9 and 17.

The assumption we should concentrate on for now is SLR.3. If SLR.3 holds, the OLS estimators are unbiased. Likewise, if SLR.3 fails, the OLS estimators generally will be biased. There are ways to determine the likely direction and size of the bias, which we will study in Chapter 3.
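Theorem 2.1 can be illustrated by a small Monte Carlo experiment in the fixed-in-repeated-samples spirit described earlier; all parameter values here are made up for the simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1, n, reps = 1.0, 2.0, 50, 5000

x = rng.uniform(0, 5, size=n)          # the x_i, held fixed across samples
sst_x = np.sum((x - x.mean()) ** 2)

b1_hats = np.empty(reps)
for r in range(reps):
    u = rng.normal(0, 1, size=n)       # E(u|x) = 0 holds by construction (SLR.3)
    y = beta0 + beta1 * x + u
    b1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / sst_x

# Any single estimate can be far from beta1, but the average over many
# samples is very close to it: the sampling distribution is centered at beta1.
print(b1_hats.mean())
```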
The possibility that x is correlated with u is almost always a concern in simple regression analysis with nonexperimental data, as we indicated with several examples in Section 2.1. Using simple regression when u contains factors affecting y that are also correlated with x can result in spurious correlation: that is, we find a relationship between y and x that is really due to other unobserved factors that affect y and also happen to be correlated with x.

EXAMPLE 2.12 (Student Math Performance and the School Lunch Program)

Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school lunch program on student performance. If anything, we expect the lunch program to have a positive ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat regular meals becomes eligible for the school lunch program, his or her performance should improve. Let lnchprg denote the percentage of students who are eligible for the lunch program. Then a simple regression model is

$math10 = \beta_0 + \beta_1 lnchprg + u,$   (2.54)

where u contains school and student characteristics that affect overall school performance. Using the data in MEAP93.RAW on 408 Michigan high schools for the 1992–93 school year, we obtain

$\widehat{math10} = 32.14 - 0.319\, lnchprg$
$n = 408, \quad R^2 = 0.171.$

This equation predicts that if student eligibility in the lunch program increases by 10 percentage points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do we really believe that higher participation in the lunch program actually causes worse performance? Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects student performance and is highly correlated with eligibility in the lunch program. Variables such as school quality and resources are also contained in u, and these are likely correlated with lnchprg. It is important to remember that the estimate $-0.319$ is only for this particular sample, but its sign and magnitude make us suspect that u and x are correlated, so that simple regression is biased.
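The bias story in Example 2.12 is easy to mimic with simulated data: below, an unobserved factor ("poverty," a made-up variable) lowers the outcome and is positively correlated with the regressor, so the simple regression slope comes out negative even though the built-in causal effect of x is positive. Everything here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 408
poverty = rng.uniform(0, 1, size=n)                              # unobserved, lives in u
x = np.clip(100 * poverty + rng.normal(0, 10, size=n), 0, 100)   # lunch eligibility, %
# True ceteris paribus effect of x is +0.1, but poverty drags y down:
y = 40 + 0.1 * x - 30 * poverty + rng.normal(0, 5, size=n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(b1)   # negative: the omitted factor, not x, drives the estimate
```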

In addition to omitted variables, there are other reasons for x to be correlated with u in the simple regression model. Since the same issues arise in multiple regression analysis, we will postpone a systematic treatment of the problem until then.

Variances of the OLS Estimators
In addition to knowing that the sampling distribution of $\hat\beta_1$ is centered about $\beta_1$ ($\hat\beta_1$ is unbiased), it is important to know how far we can expect $\hat\beta_1$ to be from $\beta_1$ on average. Among other things, this allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The measure of spread in the distribution of $\hat\beta_1$ (and $\hat\beta_0$) that is easiest to work with is the variance or its square root, the standard deviation. (See Appendix C for a more detailed discussion.)

It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1 through SLR.4, but the expressions would be somewhat complicated. Instead, we add an assumption that is traditional for cross-sectional analysis. This assumption states that the variance of the unobservable, u, conditional on x, is constant. This is known as the homoskedasticity or "constant variance" assumption.
ASSUMPTION SLR.5 (HOMOSKEDASTICITY):

$Var(u|x) = \sigma^2.$

We must emphasize that the homoskedasticity assumption is quite distinct from the zero conditional mean assumption, $E(u|x) = 0$. Assumption SLR.3 involves the expected value of u, while Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in showing that $\hat\beta_0$ and $\hat\beta_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance calculations for

$\hat\beta_0$ and $\hat\beta_1$ and because it implies that ordinary least squares has certain efficiency properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then the distribution of u given x does not depend on x, and so $E(u|x) = E(u) = 0$ and $Var(u|x) = \sigma^2$. But independence is sometimes too strong an assumption.

Because $Var(u|x) = E(u^2|x) - [E(u|x)]^2$ and $E(u|x) = 0$, we have $\sigma^2 = E(u^2|x)$, which means $\sigma^2$ is also the unconditional expectation of $u^2$. Therefore, $\sigma^2 = E(u^2) = Var(u)$, because $E(u) = 0$. In other words, $\sigma^2$ is the unconditional variance of u, and so $\sigma^2$ is often called the error variance or disturbance variance. The square root of $\sigma^2$, $\sigma$, is the standard deviation of the error. A larger $\sigma$ means that the distribution of the unobservables affecting y is more spread out.

It is often useful to write Assumptions SLR.3 and SLR.5 in terms of the conditional mean and conditional variance of y:

$E(y|x) = \beta_0 + \beta_1 x$   (2.55)

$Var(y|x) = \sigma^2.$   (2.56)

In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is constant. This situation is graphed in Figure 2.8, where $\beta_0 > 0$ and $\beta_1 > 0$.

Figure 2.8: The simple regression model under homoskedasticity. [The conditional densities $f(y|x)$ at $x_1$, $x_2$, $x_3$ all have the same spread, centered on the line $E(y|x) = \beta_0 + \beta_1 x$.]


When Var(u|x) depends on x, the error term is said to exhibit heteroskedasticity (or nonconstant variance). Since Var(u|x) = Var(y|x), heteroskedasticity is present whenever Var(y|x) is a function of x.
EXAMPLE 2.13 (Heteroskedasticity in a Wage Equation)

In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that $E(u|educ) = 0$, and this implies $E(wage|educ) = \beta_0 + \beta_1 educ$. If we also make the homoskedasticity assumption, then $Var(u|educ) = \sigma^2$ does not depend on the level of education, which is the same as assuming $Var(wage|educ) = \sigma^2$. Thus, while average wage is allowed to increase with education level (it is this rate of increase that we are interested in describing), the variability in wage about its mean is assumed to be constant across all education levels. This may not be realistic. It is likely that people with more education have a wider variety of interests and job opportunities, which could lead to more wage variability at higher levels of education. People with very low levels of education have very few opportunities and often must work at the minimum wage; this serves to reduce wage variability at low education levels. This situation is shown in Figure 2.9. Ultimately, whether Assumption SLR.5 holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.

Figure 2.9: Var(wage|educ) increasing with educ. [The conditional densities $f(wage|educ)$ at educ = 8, 12, 16 become more spread out as educ rises, centered on $E(wage|educ) = \beta_0 + \beta_1 educ$.]


With the homoskedasticity assumption in place, we are ready to prove the following:
THEOREM 2.2 (SAMPLING VARIANCES OF THE OLS ESTIMATORS): Under Assumptions SLR.1 through SLR.5,

$Var(\hat\beta_1) = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sigma^2/s_x^2$   (2.57)

and

$Var(\hat\beta_0) = \dfrac{\sigma^2\, n^{-1}\sum_{i=1}^n x_i^2}{\sum_{i=1}^n (x_i - \bar{x})^2},$   (2.58)

where these are conditional on the sample values $\{x_1, \dots, x_n\}$.

PROOF: We derive the formula for $Var(\hat\beta_1)$, leaving the other derivation as an exercise. The starting point is equation (2.52): $\hat\beta_1 = \beta_1 + (1/s_x^2)\sum_{i=1}^n d_i u_i$. Since $\beta_1$ is just a constant, and we are conditioning on the $x_i$, $s_x^2$ and $d_i = x_i - \bar{x}$ are also nonrandom. Furthermore, because the $u_i$ are independent random variables across $i$ (by random sampling), the variance of the sum is the sum of the variances. Using these facts, we have

$Var(\hat\beta_1) = (1/s_x^2)^2\, Var\left(\sum_{i=1}^n d_i u_i\right) = (1/s_x^2)^2 \sum_{i=1}^n d_i^2\, Var(u_i) = (1/s_x^2)^2 \sum_{i=1}^n d_i^2 \sigma^2 \quad [\text{since } Var(u_i) = \sigma^2 \text{ for all } i]$

$= \sigma^2 (1/s_x^2)^2 \sum_{i=1}^n d_i^2 = \sigma^2 (1/s_x^2)^2 s_x^2 = \sigma^2/s_x^2,$

which is what we wanted to show.

The formulas (2.57) and (2.58) are the "standard" formulas for simple regression analysis, and they are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence intervals and hypothesis testing in multiple regression analysis.

For most purposes, we are interested in $Var(\hat\beta_1)$. It is easy to summarize how this variance depends on the error variance, $\sigma^2$, and the total variation in $\{x_1, x_2, \dots, x_n\}$, $s_x^2$. First, the larger the error variance, the larger is $Var(\hat\beta_1)$. This makes sense, since more variation in the unobservables affecting y makes it more difficult to precisely estimate $\beta_1$. On the other hand, more variability in the independent variable is preferred: as the variability in the $x_i$ increases, the variance of $\hat\beta_1$ decreases. This also makes intuitive

sense, since the more spread out the sample of independent variables is, the easier it is to trace out the relationship between E(y|x) and x. That is, the easier it is to estimate $\beta_1$. If there is little variation in the $x_i$, then it can be hard to pinpoint how E(y|x) varies with x. As the sample size increases, so does the total variation in the $x_i$. Therefore, a larger sample size results in a smaller variance for $\hat\beta_1$.

This analysis shows that, if we are interested in $\beta_1$ and we have a choice, then we should choose the $x_i$ to be as spread out as possible. This is sometimes possible with experimental data, but rarely do we have this luxury in the social sciences: usually we must take the $x_i$ that we obtain via random sampling. Sometimes we have an opportunity to obtain larger sample sizes, although this can be costly.

[QUESTION 2.5: Show that, when estimating $\beta_0$, it is best to have $\bar{x} = 0$. What is $Var(\hat\beta_0)$ in this case? Hint: for any sample of numbers, $\sum_{i=1}^n x_i^2 \ge \sum_{i=1}^n (x_i - \bar{x})^2$, with equality only if $\bar{x} = 0$.]

For the purposes of constructing confidence intervals and deriving test statistics, we will need to work with the standard deviations of $\hat\beta_1$ and $\hat\beta_0$, $sd(\hat\beta_1)$ and $sd(\hat\beta_0)$. Recall that these are obtained by taking the square roots of the variances in (2.57) and (2.58). In particular, $sd(\hat\beta_1) = \sigma/s_x$, where $\sigma$ is the square root of $\sigma^2$, and $s_x$ is the square root of $s_x^2$.
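The variance formula (2.57) can also be checked by simulation, again holding the $x_i$ fixed across replications (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, n, reps = 2.0, 40, 20000
x = rng.uniform(0, 10, size=n)                 # fixed regressor values
sst_x = np.sum((x - x.mean()) ** 2)            # s_x^2, the total variation in x

b1_hats = np.empty(reps)
for r in range(reps):
    y = 1.0 + 0.5 * x + rng.normal(0, sigma, size=n)
    b1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / sst_x

print(b1_hats.var())          # empirical sampling variance of beta1-hat
print(sigma ** 2 / sst_x)     # the theoretical value from (2.57)
```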

Estimating the Error Variance
The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to $Var(\hat\beta_1)$ and $Var(\hat\beta_0)$. But these formulas are unknown, except in the extremely rare case that $\sigma^2$ is known. Nevertheless, we can use the data to estimate $\sigma^2$, which then allows us to estimate $Var(\hat\beta_1)$ and $Var(\hat\beta_0)$.

This is a good place to emphasize the difference between the errors (or disturbances) and the residuals, since this distinction is crucial for constructing an estimator of $\sigma^2$. Equation (2.48) shows how to write the population model in terms of a randomly sampled observation as $y_i = \beta_0 + \beta_1 x_i + u_i$, where $u_i$ is the error for observation $i$. We can also express $y_i$ in terms of its fitted value and residual as in equation (2.32): $y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat u_i$. Comparing these two equations, we see that the error shows up in the equation containing the population parameters, $\beta_0$ and $\beta_1$. On the other hand, the residuals show up in the estimated equation with $\hat\beta_0$ and $\hat\beta_1$. The errors are never observable, while the residuals are computed from the data. We can use equations (2.32) and (2.48) to write the residuals as a function of the errors:

$\hat u_i = y_i - \hat\beta_0 - \hat\beta_1 x_i = (\beta_0 + \beta_1 x_i + u_i) - \hat\beta_0 - \hat\beta_1 x_i,$

or

$\hat u_i = u_i - (\hat\beta_0 - \beta_0) - (\hat\beta_1 - \beta_1)x_i.$   (2.59)

Although the expected value of $\hat\beta_0$ equals $\beta_0$, and similarly for $\hat\beta_1$, $\hat u_i$ is not the same as $u_i$. The difference between them does have an expected value of zero.

Now that we understand the difference between the errors and the residuals, we can

n

return to estimating

2

. First,

2

E(u2), so an unbiased “estimator” of

2

is n

1 i 1

ui2.

Unfortunately, this is not a true estimator, because we do not observe the errors ui . But, we do have estimates of the ui , namely the OLS residuals ui. If we replace the errors ˆ n with the OLS residuals, have n

1 i 1

u i2 ˆ

SSR/n. This is a true estimator, because it

gives a computable rule for any sample of data on x and y. One slight drawback to this estimator is that it turns out to be biased (although for large n the bias is small). Since it is easy to compute an unbiased estimator, we use that instead. The estimator SSR/n is biased essentially because it does not account for two restrictions that must be satisfied by the OLS residuals. These restrictions are given by the two OLS first order conditions: n n

ui ˆ i 1

0, i 1

x i ui ˆ

0.

(2.60)

One way to view these restrictions is this: if we know n 2 of the residuals, we can always get the other two residuals by using the restrictions implied by the first order conditions in (2.60). Thus, there are only n 2 degrees of freedom in the OLS residuals [as opposed to n degrees of freedom in the errors. If we replace ui with ui in (2.60), ˆ the restrictions would no longer hold.] The unbiased estimator of 2 that we will use makes a degrees-of-freedom adjustment: ˆ
2

1 (n 2) i

n

u i2 ˆ
1

SSR/(n

2).

(2.61)

(This estimator is sometimes denoted $s^2$, but we continue to use the convention of putting "hats" over estimators.)
THEOREM 2.3 (UNBIASED ESTIMATION OF $\sigma^2$): Under Assumptions SLR.1 through SLR.5,

$E(\hat\sigma^2) = \sigma^2.$

PROOF:

If we average equation (2.59) across all i and use the fact that the OLS residuals average out to zero, we have 0 u ( ˆ0 ¯ ( ˆ1 ¯; 0) 1)x subtracting this ˆ1 from (2.59) gives ui ˆ (ui u) ¯ ( x). Therefore, ui2 n (ui ¯ ˆ un)2 ¯ ( ˆ1 1)(xi 2 x )2 2(ui u )( ˆ1 ¯ ¯ x ). Summing across all i gives ¯ u2 ˆi (ui u)2 ¯ 1) (xi 1)(xi n n

( ˆ1

1

)2 i 1

(xi

x )2 ¯

2( ˆ1

1

) i 1

ui (xi

x ). Now, the expected value of the first ¯

i 1

i 1

term is (n 1) 2, something that is shown in Appendix C. The expected value of the second 2 2 2 term is simply 2 because E[( ˆ1 Var( ˆ1) /sx . Finally, the third term can be 1) ] 2 2 2 ˆ1 written as 2( . Putting these three terms 1) s x; taking expectations gives 2 n together gives E i 1

u2 ˆi

(n

1)

2

2

2

2

(n

2) 2, so that E[SSR/(n

2)]

2

.
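The unbiasedness result in Theorem 2.3 is easy to illustrate by simulation. The sketch below is a minimal Monte Carlo check; the population values (beta0, beta1, sigma) and the use of numpy are illustrative choices, not from the text:

```python
import numpy as np

# Monte Carlo illustration of Theorem 2.3: E[SSR/(n-2)] = sigma^2.
# The population values below are arbitrary illustrative choices.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 0.5, 2.0
n, reps = 50, 20_000

x = rng.uniform(0, 10, size=n)      # regressor held fixed across replications
xd = x - x.mean()
sxx = xd @ xd                       # sum of squared deviations of x

sig2_hats = np.empty(reps)
for r in range(reps):
    u = rng.normal(0.0, sigma, size=n)
    y = beta0 + beta1 * x + u
    b1 = (xd @ (y - y.mean())) / sxx        # OLS slope
    b0 = y.mean() - b1 * x.mean()           # OLS intercept
    resid = y - b0 - b1 * x
    sig2_hats[r] = (resid @ resid) / (n - 2)   # SSR/(n-2), equation (2.61)

print(np.mean(sig2_hats))   # should be close to sigma**2 = 4
```

Averaging the SSR/(n − 2) estimates over many samples recovers the true error variance; using SSR/n instead would come in systematically below it.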


Part 1

Regression Analysis with Cross-Sectional Data

If $\hat\sigma^2$ is plugged into the variance formulas (2.57) and (2.58), then we have unbiased estimators of $\mathrm{Var}(\hat\beta_1)$ and $\mathrm{Var}(\hat\beta_0)$. Later on, we will need estimators of the standard deviations of $\hat\beta_1$ and $\hat\beta_0$, and this requires estimating $\sigma$. The natural estimator of $\sigma$ is

$\hat\sigma = \sqrt{\hat\sigma^2}$,    (2.62)

and is called the standard error of the regression (SER). (Other names for $\hat\sigma$ are the standard error of the estimate and the root mean squared error, but we will not use these.) Although $\hat\sigma$ is not an unbiased estimator of $\sigma$, we can show that it is a consistent estimator of $\sigma$ (see Appendix C), and it will serve our purposes well.

The estimate $\hat\sigma$ is interesting because it is an estimate of the standard deviation in the unobservables affecting y; equivalently, it estimates the standard deviation in y after the effect of x has been taken out. Most regression packages report the value of $\hat\sigma$ along with the R-squared, intercept, slope, and other OLS statistics (under one of the several names listed above). For now, our primary interest is in using $\hat\sigma$ to estimate the standard deviations of $\hat\beta_0$ and $\hat\beta_1$. Since $\mathrm{sd}(\hat\beta_1) = \sigma/s_x$, the natural estimator of $\mathrm{sd}(\hat\beta_1)$ is

$\mathrm{se}(\hat\beta_1) = \hat\sigma/s_x = \hat\sigma \Big/ \left(\sum_{i=1}^n (x_i - \bar x)^2\right)^{1/2}$;

this is called the standard error of $\hat\beta_1$. Note that $\mathrm{se}(\hat\beta_1)$ is viewed as a random variable when we think of running OLS over different samples of y; this is because $\hat\sigma$ varies with different samples. For a given sample, $\mathrm{se}(\hat\beta_1)$ is a number, just as $\hat\beta_1$ is simply a number when we compute it from the given data. Similarly, $\mathrm{se}(\hat\beta_0)$ is obtained from $\mathrm{sd}(\hat\beta_0)$ by replacing $\sigma$ with $\hat\sigma$. The standard error of any estimate gives us an idea of how precise the estimator is. Standard errors play a central role throughout this text; we will use them to construct test statistics and confidence intervals for every econometric procedure we cover, starting in Chapter 4.
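These quantities are straightforward to compute directly from the formulas. The numpy sketch below uses simulated data with arbitrary illustrative parameter values to obtain $\hat\sigma$ (the SER) and $\mathrm{se}(\hat\beta_1)$:

```python
import numpy as np

# Computing sigma-hat (the SER) and se(beta1-hat) from (2.61)-(2.62).
# The data are simulated; the true values (2.0, 0.7, sd 1.5) are
# illustrative choices only.
rng = np.random.default_rng(1)
n = 100
x = rng.normal(5, 2, size=n)
y = 2.0 + 0.7 * x + rng.normal(0, 1.5, size=n)

sst_x = np.sum((x - x.mean()) ** 2)                    # s_x^2
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x   # OLS slope
b0 = y.mean() - b1 * x.mean()                          # OLS intercept

resid = y - b0 - b1 * x
sigma2_hat = resid @ resid / (n - 2)   # unbiased estimator of sigma^2, (2.61)
ser = np.sqrt(sigma2_hat)              # sigma-hat, the SER, (2.62)
se_b1 = ser / np.sqrt(sst_x)           # standard error of beta1-hat

print(ser, se_b1)
```

With the true error standard deviation set to 1.5, the SER should land near that value, and se(β̂₁) shrinks as the sample variation in x grows, exactly as the formula indicates.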

2.6 REGRESSION THROUGH THE ORIGIN
In rare cases, we wish to impose the restriction that, when x = 0, the expected value of y is zero. There are certain relationships for which this is reasonable. For example, if income (x) is zero, then income tax revenues (y) must also be zero. In addition, there are problems where a model that originally has a nonzero intercept is transformed into a model without an intercept.

Formally, we now choose a slope estimator, which we call $\tilde\beta_1$, and a line of the form

$\tilde y = \tilde\beta_1 x$,    (2.63)

where the tildes over $\tilde\beta_1$ and $\tilde y$ are used to distinguish this problem from the much more common problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through the origin because the line (2.63) passes through the point $x = 0$, $\tilde y = 0$. To obtain the slope estimate in (2.63), we still rely on the method of ordinary least squares, which in this case minimizes the sum of squared residuals

$\sum_{i=1}^n (y_i - \tilde\beta_1 x_i)^2$.    (2.64)

Using calculus, it can be shown that $\tilde\beta_1$ must solve the first order condition

$\sum_{i=1}^n x_i (y_i - \tilde\beta_1 x_i) = 0$.    (2.65)

From this we can solve for $\tilde\beta_1$:

$\tilde\beta_1 = \dfrac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$,    (2.66)

provided that not all the $x_i$ are zero, a case we rule out. Note how $\tilde\beta_1$ compares with the slope estimate when we also estimate the intercept (rather than set it equal to zero). These two estimates are the same if, and only if, $\bar x = 0$. [See equation (2.19) for $\hat\beta_1$.] Obtaining an estimate of $\beta_1$ using regression through the origin is not done very often in applied work, and for good reason: if the intercept $\beta_0 \neq 0$, then $\tilde\beta_1$ is a biased estimator of $\beta_1$. You will be asked to prove this in Problem 2.8.
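The bias that arises from forcing the line through the origin when $\beta_0 \neq 0$ shows up clearly in a quick numerical comparison; the data and parameter values below are simulated illustrations:

```python
import numpy as np

# Slope through the origin, equation (2.66), versus the usual OLS slope.
# The true intercept is set to 3 (illustrative), so the origin fit is biased.
rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)
y = 3.0 + 0.5 * x + rng.normal(0, 1, size=200)   # true slope 0.5, intercept 3

b1_origin = np.sum(x * y) / np.sum(x ** 2)       # (2.66): no intercept
b1_ols = (np.sum((x - x.mean()) * (y - y.mean()))
          / np.sum((x - x.mean()) ** 2))         # usual OLS slope

print(b1_origin, b1_ols)   # origin slope is pulled away from 0.5
```

Because the omitted positive intercept must be absorbed somewhere, the through-the-origin slope overstates the true slope here, while the ordinary OLS slope stays near 0.5.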

SUMMARY
We have introduced the simple linear regression model in this chapter, and we have covered its basic properties. Given a random sample, the method of ordinary least squares is used to estimate the slope and intercept parameters in the population model. We have demonstrated the algebra of the OLS regression line, including computation of fitted values and residuals, and the obtaining of predicted changes in the dependent variable for a given change in the independent variable.

In Section 2.4, we discussed two issues of practical importance: (1) the behavior of the OLS estimates when we change the units of measurement of the dependent variable or the independent variable; (2) the use of the natural log to allow for constant elasticity and constant semi-elasticity models.

In Section 2.5, we showed that, under the four Assumptions SLR.1 through SLR.4, the OLS estimators are unbiased. The key assumption is that the error term u has zero mean given any value of the independent variable x. Unfortunately, there are reasons to think this is false in many social science applications of simple regression, where the omitted factors in u are often correlated with x. When we add the assumption that the variance of the error given x is constant, we get simple formulas for the sampling variances of the OLS estimators. As we saw, the variance of the slope estimator $\hat\beta_1$ increases as the error variance increases, and it decreases when there is more sample variation in the independent variable. We also derived an unbiased estimator for $\sigma^2 = \mathrm{Var}(u)$.

In Section 2.6, we briefly discussed regression through the origin, where the slope estimator is obtained under the assumption that the intercept is zero. Sometimes this is useful, but it appears infrequently in applied work.

Much work is left to be done. For example, we still do not know how to test hypotheses about the population parameters, $\beta_0$ and $\beta_1$. Thus, although we know that OLS is unbiased for the population parameters under Assumptions SLR.1 through SLR.4, we have no way of drawing inference about the population. Other topics, such as the efficiency of OLS relative to other possible procedures, have also been omitted.

The issues of confidence intervals, hypothesis testing, and efficiency are central to multiple regression analysis as well. Since the way we construct confidence intervals and test statistics is very similar for multiple regression—and because simple regression is a special case of multiple regression—our time is better spent moving on to multiple regression, which is much more widely applicable than simple regression. Our purpose in Chapter 2 was to get you thinking about the issues that arise in econometric analysis in a fairly simple setting.

KEY TERMS
Coefficient of Determination
Constant Elasticity Model
Control Variable
Covariate
Degrees of Freedom
Dependent Variable
Elasticity
Error Term (Disturbance)
Error Variance
Explained Sum of Squares (SSE)
Explained Variable
Explanatory Variable
First Order Conditions
Fitted Value
Heteroskedasticity
Homoskedasticity
Independent Variable
Intercept Parameter
Ordinary Least Squares (OLS)
OLS Regression Line
Population Regression Function (PRF)
Predicted Variable
Predictor Variable
Regressand
Regression Through the Origin
Regressor
Residual
Residual Sum of Squares (SSR)
Response Variable
R-squared
Sample Regression Function (SRF)
Semi-elasticity
Simple Linear Regression Model
Slope Parameter
Standard Error of $\hat\beta_1$
Standard Error of the Regression (SER)
Sum of Squared Residuals
Total Sum of Squares (SST)
Zero Conditional Mean Assumption

PROBLEMS
2.1 Let kids denote the number of children ever born to a woman, and let educ denote years of education for the woman. A simple model relating fertility to years of education is

$kids = \beta_0 + \beta_1 educ + u$,

where u is the unobserved error.


(i) What kinds of factors are contained in u? Are these likely to be correlated with level of education?
(ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? Explain.

2.2 In the simple linear regression model $y = \beta_0 + \beta_1 x + u$, suppose that $E(u) \neq 0$. Letting $\alpha_0 = E(u)$, show that the model can always be rewritten with the same slope, but a new intercept and error, where the new error has a zero expected value.

2.3 The following table contains the ACT scores and the GPA (grade point average) for 8 college students. Grade point average is based on a four-point scale and has been rounded to one digit after the decimal.

Student   GPA   ACT
1         2.8   21
2         3.4   24
3         3.0   26
4         3.5   27
5         3.6   29
6         3.0   25
7         2.7   25
8         3.7   30

(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and slope estimates in the equation

$\widehat{GPA} = \hat\beta_0 + \hat\beta_1 ACT$.

Comment on the direction of the relationship. Does the intercept have a useful interpretation here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by 5 points?
(ii) Compute the fitted values and residuals for each observation and verify that the residuals (approximately) sum to zero.
(iii) What is the predicted value of GPA when ACT = 20?
(iv) How much of the variation in GPA for these 8 students is explained by ACT? Explain.

2.4 The data set BWGHT.RAW contains data on births to women in the United States. Two variables of interest are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average number of cigarettes the mother smoked

per day during pregnancy (cigs). The following simple regression was estimated using data on n = 1,388 births:

$\widehat{bwght} = 119.77 - 0.514\,cigs$.

(i) What is the predicted birth weight when cigs = 0? What about when cigs = 20 (one pack per day)? Comment on the difference.
(ii) Does this simple regression necessarily capture a causal relationship between the child's birth weight and the mother's smoking habits? Explain.

2.5 In the linear consumption function

$\widehat{cons} = \hat\beta_0 + \hat\beta_1 inc$,

the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat\beta_1$, while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat\beta_0/inc + \hat\beta_1$. Using observations for 100 families on annual income and consumption (both measured in dollars), the following equation is obtained:

$\widehat{cons} = -124.84 + 0.853\,inc$
$n = 100, \; R^2 = 0.692$.

(i) Interpret the intercept in this equation and comment on its sign and magnitude.
(ii) What is predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.

2.6 Using data from 1988 for houses sold in Andover, MA, from Kiel and McClain (1995), the following equation relates housing price (price) to the distance from a recently built garbage incinerator (dist):

$\widehat{\log(price)} = 9.40 + 0.312\,\log(dist)$
$n = 135, \; R^2 = 0.162$.

(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?
(ii) Do you think simple regression provides an unbiased estimator of the ceteris paribus elasticity of price with respect to dist? (Think about the city's decision on where to put the incinerator.)
(iii) What other factors about a house affect its price? Might these be correlated with distance from the incinerator?

2.7 Consider the savings function

$sav = \beta_0 + \beta_1 inc + u, \quad u = \sqrt{inc}\cdot e$,

where e is a random variable with $E(e) = 0$ and $\mathrm{Var}(e) = \sigma_e^2$. Assume that e is independent of inc.
(i) Show that $E(u\,|\,inc) = 0$, so that the key zero conditional mean assumption (Assumption SLR.3) is satisfied. [Hint: If e is independent of inc, then $E(e\,|\,inc) = E(e)$.]


(ii) Show that $\mathrm{Var}(u\,|\,inc) = \sigma_e^2\, inc$, so that the homoskedasticity Assumption SLR.5 is violated. In particular, the variance of sav increases with inc. [Hint: $\mathrm{Var}(e\,|\,inc) = \mathrm{Var}(e)$ if e and inc are independent.]
(iii) Provide a discussion that supports the assumption that the variance of savings increases with family income.

2.8 Consider the standard simple regression model $y = \beta_0 + \beta_1 x + u$ under Assumptions SLR.1 through SLR.4. Thus, the usual OLS estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased for their respective population parameters. Let $\tilde\beta_1$ be the estimator of $\beta_1$ obtained by assuming the intercept is zero (see Section 2.6).
(i) Find $E(\tilde\beta_1)$ in terms of the $x_i$, $\beta_0$, and $\beta_1$. Verify that $\tilde\beta_1$ is unbiased for $\beta_1$ when the population intercept ($\beta_0$) is zero. Are there other cases where $\tilde\beta_1$ is unbiased?
(ii) Find the variance of $\tilde\beta_1$. (Hint: The variance does not depend on $\beta_0$.)
(iii) Show that $\mathrm{Var}(\tilde\beta_1) \le \mathrm{Var}(\hat\beta_1)$. [Hint: For any sample of data, $\sum_{i=1}^n x_i^2 \ge \sum_{i=1}^n (x_i - \bar x)^2$, with strict inequality unless $\bar x = 0$.]
(iv) Comment on the tradeoff between bias and variance when choosing between $\hat\beta_1$ and $\tilde\beta_1$.

2.9 (i) Let $\hat\beta_0$ and $\hat\beta_1$ be the intercept and slope from the regression of $y_i$ on $x_i$, using n observations. Let $c_1$ and $c_2$, with $c_2 \neq 0$, be constants. Let $\tilde\beta_0$ and $\tilde\beta_1$ be the intercept and slope from the regression of $c_1 y_i$ on $c_2 x_i$. Show that $\tilde\beta_1 = (c_1/c_2)\hat\beta_1$ and $\tilde\beta_0 = c_1 \hat\beta_0$, thereby verifying the claims on units of measurement in Section 2.4. [Hint: To obtain $\tilde\beta_1$, plug the scaled versions of x and y into (2.19). Then, use (2.17) for $\tilde\beta_0$, being sure to plug in the scaled x and y and the correct slope.]
(ii) Now let $\tilde\beta_0$ and $\tilde\beta_1$ be from the regression of $(c_1 + y_i)$ on $(c_2 + x_i)$ (with no restriction on $c_1$ or $c_2$). Show that $\tilde\beta_1 = \hat\beta_1$ and $\tilde\beta_0 = \hat\beta_0 + c_1 - c_2\hat\beta_1$.

COMPUTER EXERCISES
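Each exercise below involves fitting a simple regression and reporting the results "in the usual form" (intercept, slope, n, and R-squared). A small reusable helper, written in Python as one possible choice of software, is sketched here; the arrays are simulated stand-ins, since the .RAW data files are not reproduced in the text:

```python
import numpy as np

# A reusable simple-OLS helper for the exercises: returns the intercept,
# slope, and R-squared. The data below are simulated stand-ins for
# variables like prate and mrate, not the actual 401K.RAW data.
def simple_ols(y, x):
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    rsq = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return b0, b1, rsq

rng = np.random.default_rng(3)
mrate = rng.uniform(0, 3, size=500)                    # hypothetical match rates
prate = 80 + 5 * mrate + rng.normal(0, 10, size=500)   # hypothetical participation
b0, b1, rsq = simple_ols(prate, mrate)
print(f"prate-hat = {b0:.2f} + {b1:.2f} mrate, n = 500, R^2 = {rsq:.3f}")
```

After loading any of the data sets named below into arrays, the same helper applies unchanged.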
2.10 The data in 401K.RAW are a subset of data analyzed by Papke (1995) to study the relationship between participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm contributes to each worker's plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.
(i) Find the average participation rate and the average match rate in the sample of plans.
(ii) Now estimate the simple regression equation

$\widehat{prate} = \hat\beta_0 + \hat\beta_1 mrate$,

and report the results along with the sample size and R-squared.
(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.
(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is happening here.

(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

2.11 The data set in CEOSAL2.RAW contains information on chief executive officers for U.S. corporations. The variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as company CEO.
(i) Find the average salary and the average tenure in the sample.
(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure as a CEO?
(iii) Estimate the simple regression model

$\log(salary) = \beta_0 + \beta_1 ceoten + u$,

and report your results in the usual form. What is the (approximate) predicted percentage increase in salary given one more year as a CEO?

2.12 Use the data in SLEEP75.RAW from Biddle and Hamermesh (1990) to study whether there is a tradeoff between the time spent sleeping per week and the time spent in paid work. We could use either variable as the dependent variable. For concreteness, estimate the model

$sleep = \beta_0 + \beta_1 totwrk + u$,

where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.
(i) Report your results in equation form along with the number of observations and $R^2$. What does the intercept in this equation mean?
(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a large effect?

2.13 Use the data in WAGE2.RAW to estimate a simple regression explaining monthly salary (wage) in terms of IQ score (IQ).
(i) Find the average salary and average IQ in the sample. What is the standard deviation of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard deviation equal to 15.)
(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a constant dollar amount. Use this model to find the predicted increase in wage for an increase in IQ of 15 points. Does IQ explain most of the variation in wage?
(iii) Now estimate a model where each one-point increase in IQ has the same percentage effect on wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?

2.14 For the population of firms in the chemical industry, let rd denote annual expenditures on research and development, and let sales denote annual sales (both are in millions of dollars).
(i) Write down a model (not an estimated equation) that implies a constant elasticity between rd and sales. Which parameter is the elasticity?
(ii) Now estimate the model using the data in RDCHEM.RAW.
Write out the estimated equation in the usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what this elasticity means.

APPENDIX 2A

Minimizing the Sum of Squared Residuals

We show that the OLS estimates $\hat\beta_0$ and $\hat\beta_1$ do minimize the sum of squared residuals, as asserted in Section 2.2. Formally, the problem is to characterize the solutions $\hat\beta_0$ and $\hat\beta_1$ to the minimization problem

$\min_{b_0, b_1} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$,

where $b_0$ and $b_1$ are the dummy arguments for the optimization problem; for simplicity, call this function $Q(b_0, b_1)$. By a fundamental result from multivariable calculus (see Appendix A), a necessary condition for $\hat\beta_0$ and $\hat\beta_1$ to solve the minimization problem is that the partial derivatives of $Q(b_0, b_1)$ with respect to $b_0$ and $b_1$ must be zero when evaluated at $(\hat\beta_0, \hat\beta_1)$: $\partial Q(\hat\beta_0,\hat\beta_1)/\partial b_0 = 0$ and $\partial Q(\hat\beta_0,\hat\beta_1)/\partial b_1 = 0$. Using the chain rule from calculus, these two equations become

$-2\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$,

$-2\sum_{i=1}^n x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$.

These two equations are just (2.14) and (2.15) multiplied by $-2n$ and, therefore, are solved by the same $\hat\beta_0$ and $\hat\beta_1$.

How do we know that we have actually minimized the sum of squared residuals? The first order conditions are necessary but not sufficient conditions. One way to verify that we have minimized the sum of squared residuals is to write, for any $b_0$ and $b_1$,

$Q(b_0, b_1) = \sum_{i=1}^n \left(y_i - \hat\beta_0 - \hat\beta_1 x_i + (\hat\beta_0 - b_0) + (\hat\beta_1 - b_1)x_i\right)^2$

$= \sum_{i=1}^n \left(\hat u_i + (\hat\beta_0 - b_0) + (\hat\beta_1 - b_1)x_i\right)^2$

$= \sum_{i=1}^n \hat u_i^2 + n(\hat\beta_0 - b_0)^2 + (\hat\beta_1 - b_1)^2 \sum_{i=1}^n x_i^2 + 2(\hat\beta_0 - b_0)(\hat\beta_1 - b_1)\sum_{i=1}^n x_i$,

where we have used equations (2.30) and (2.31). The sum of squared residuals does not depend on $b_0$ or $b_1$, while the sum of the last three terms can be written as

$\sum_{i=1}^n \left[(\hat\beta_0 - b_0) + (\hat\beta_1 - b_1)x_i\right]^2$,

as can be verified by straightforward algebra. Because this is a sum of squared terms, the smallest it can be is zero. Therefore, $Q(b_0, b_1)$ is smallest when $b_0 = \hat\beta_0$ and $b_1 = \hat\beta_1$.
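The argument above can also be confirmed numerically: perturbing the OLS estimates in any direction can never lower Q. A short sketch with simulated data (the population values and perturbation grid are illustrative):

```python
import numpy as np

# Numerical check of Appendix 2A: Q(b0, b1) is smallest at the OLS estimates.
rng = np.random.default_rng(4)
x = rng.normal(size=60)
y = 1.0 + 2.0 * x + rng.normal(size=60)   # illustrative true values

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def Q(c0, c1):
    r = y - c0 - c1 * x
    return r @ r

q_ols = Q(b0, b1)
# Every perturbation of the OLS estimates should weakly increase Q,
# since Q(b0+d0, b1+d1) = Q(b0, b1) + sum((d0 + d1*x_i)^2).
worse = all(Q(b0 + d0, b1 + d1) >= q_ols
            for d0 in (-0.5, 0, 0.5) for d1 in (-0.5, 0, 0.5))
print(worse)   # True
```

The identity in the comment is exactly the algebraic decomposition derived above, so the check holds for any data set, not just this simulated one.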


Chapter Three

Multiple Regression Analysis: Estimation

In Chapter 2, we learned how to use simple regression analysis to explain a dependent variable, y, as a function of a single independent variable, x. The primary drawback in using simple regression analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how x affects y: the key assumption, SLR.3—that all other factors affecting y are uncorrelated with x—is often unrealistic.

Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to explicitly control for many other factors that simultaneously affect the dependent variable. This is important both for testing economic theories and for evaluating policy effects when we must rely on nonexperimental data. Because multiple regression models can accommodate many explanatory variables that may be correlated, we can hope to infer causality in cases where simple regression analysis would be misleading.

Naturally, if we add more factors to our model that are useful for explaining y, then more of the variation in y can be explained. Thus, multiple regression analysis can be used to build better models for predicting the dependent variable. An additional advantage of multiple regression analysis is that it can incorporate fairly general functional form relationships. In the simple regression model, only one function of a single explanatory variable can appear in the equation. As we will see, the multiple regression model allows for much more flexibility.

Section 3.1 formally introduces the multiple regression model and further discusses the advantages of multiple regression over simple regression. In Section 3.2, we demonstrate how to estimate the parameters in the multiple regression model using the method of ordinary least squares. In Sections 3.3, 3.4, and 3.5, we describe various statistical properties of the OLS estimators, including unbiasedness and efficiency.

The multiple regression model is still the most widely used vehicle for empirical analysis in economics and other social sciences. Likewise, the method of ordinary least squares is popularly used for estimating the parameters of the multiple regression model.

3.1 MOTIVATION FOR MULTIPLE REGRESSION

The Model with Two Independent Variables
We begin with some simple examples to show how multiple regression analysis can be used to solve problems that cannot be solved by simple regression.

The first example is a simple variation of the wage equation introduced in Chapter 2 for obtaining the effect of education on hourly wage:

$wage = \beta_0 + \beta_1 educ + \beta_2 exper + u$,    (3.1)

where exper is years of labor market experience. Thus, wage is determined by the two explanatory or independent variables, education and experience, and by other unobserved factors, which are contained in u. We are still primarily interested in the effect of educ on wage, holding fixed all other factors affecting wage; that is, we are interested in the parameter $\beta_1$.

Compared with a simple regression analysis relating wage to educ, equation (3.1) effectively takes exper out of the error term and puts it explicitly in the equation. Because exper appears in the equation, its coefficient, $\beta_2$, measures the ceteris paribus effect of exper on wage, which is also of some interest.

Not surprisingly, just as with simple regression, we will have to make assumptions about how u in (3.1) is related to the independent variables, educ and exper. However, as we will see in Section 3.2, there is one thing of which we can be confident: since (3.1) contains experience explicitly, we will be able to measure the effect of education on wage, holding experience fixed. In a simple regression analysis—which puts exper in the error term—we would have to assume that experience is uncorrelated with education, a tenuous assumption.

As a second example, consider the problem of explaining the effect of per student spending (expend) on the average standardized test score (avgscore) at the high school level. Suppose that the average test score depends on funding, average family income (avginc), and other unobservables:

$avgscore = \beta_0 + \beta_1 expend + \beta_2 avginc + u$.    (3.2)

The coefficient of interest for policy purposes is $\beta_1$, the ceteris paribus effect of expend on avgscore. By including avginc explicitly in the model, we are able to control for its effect on avgscore. This is likely to be important because average family income tends to be correlated with per student spending: spending levels are often determined by both property and local income taxes. In simple regression analysis, avginc would be included in the error term, which would likely be correlated with expend, causing the OLS estimator of $\beta_1$ in the two-variable model to be biased.

In the two previous examples, we have shown how observable factors other than the variable of primary interest [educ in equation (3.1), expend in equation (3.2)] can be included in a regression model. Generally, we can write a model with two independent variables as

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$,    (3.3)

where $\beta_0$ is the intercept, $\beta_1$ measures the change in y with respect to $x_1$, holding other factors fixed, and $\beta_2$ measures the change in y with respect to $x_2$, holding other factors fixed.
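Although estimation is not taken up until Section 3.2, a model like (3.3) can already be fit mechanically by least squares on the design matrix [1, x1, x2]. A numpy sketch with simulated data; the coefficient values and the use of `lstsq` are illustrative choices, not from the text:

```python
import numpy as np

# Fitting the two-regressor model (3.3) by least squares on [1, x1, x2].
# Data are simulated; the true betas (1.0, 0.8, 0.2) are illustrative.
rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(12, 2, size=n)    # e.g., years of education
x2 = rng.normal(10, 5, size=n)    # e.g., years of experience
y = 1.0 + 0.8 * x1 + 0.2 * x2 + rng.normal(0, 1, size=n)

X = np.column_stack([np.ones(n), x1, x2])          # intercept column first
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes the SSR
print(beta_hat)   # roughly [1.0, 0.8, 0.2]
```

Because x1 and x2 both appear in the design matrix, the estimated coefficient on x1 measures its effect holding x2 fixed, which is exactly the ceteris paribus logic described above.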

Multiple regression analysis is also useful for generalizing functional relationships between variables. As an example, suppose family consumption (cons) is a quadratic function of family income (inc):

$cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$,    (3.4)

where u contains other factors affecting consumption. In this model, consumption depends on only one observed factor, income; so it might seem that it can be handled in a simple regression framework. But the model falls outside simple regression because it contains two functions of income, inc and $inc^2$ (and therefore three parameters, $\beta_0$, $\beta_1$, and $\beta_2$). Nevertheless, the consumption function is easily written as a regression model with two independent variables by letting $x_1 = inc$ and $x_2 = inc^2$.

Mechanically, there will be no difference in using the method of ordinary least squares (introduced in Section 3.2) to estimate equations as different as (3.1) and (3.4). Each equation can be written as (3.3), which is all that matters for computation. There is, however, an important difference in how one interprets the parameters. In equation (3.1), $\beta_1$ is the ceteris paribus effect of educ on wage. The parameter $\beta_1$ has no such interpretation in (3.4). In other words, it makes no sense to measure the effect of inc on cons while holding $inc^2$ fixed, because if inc changes, then so must $inc^2$! Instead, the change in consumption with respect to the change in income—the marginal propensity to consume—is approximated by

$\Delta cons / \Delta inc \approx \beta_1 + 2\beta_2\, inc$.
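Once estimates are in hand, evaluating this marginal effect at different income levels is a one-line computation. The coefficient values below are hypothetical estimates, chosen purely for illustration:

```python
# Marginal propensity to consume implied by the quadratic model (3.4):
# d(cons)/d(inc) is approximately beta1 + 2*beta2*inc.
# The estimates below are hypothetical, for illustration only.
b1, b2 = 0.9, -0.002   # assumed beta1-hat and beta2-hat

def mpc(inc):
    """Estimated marginal propensity to consume at income level inc."""
    return b1 + 2 * b2 * inc

for inc in (10, 50, 100):   # income levels, in the model's units
    print(inc, mpc(inc))
```

With a negative quadratic coefficient, the MPC declines as income rises, the usual pattern such a specification is meant to capture.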

See Appendix A for the calculus needed to derive this equation. In other words, the marginal effect of income on consumption depends on $\beta_2$ as well as on $\beta_1$ and the level of income. This example shows that, in any particular application, the definitions of the independent variables are crucial. But for the theoretical development of multiple regression, we can be vague about such details. We will study examples like this more completely in Chapter 6.

In the model with two independent variables, the key assumption about how u is related to $x_1$ and $x_2$ is

$E(u\,|\,x_1, x_2) = 0$.    (3.5)

The interpretation of condition (3.5) is similar to the interpretation of Assumption SLR.3 for simple regression analysis. It means that, for any values of $x_1$ and $x_2$ in the population, the average unobservable is equal to zero. As with simple regression, the important part of the assumption is that the expected value of u is the same for all combinations of $x_1$ and $x_2$; that this common value is zero is no assumption at all as long as the intercept $\beta_0$ is included in the model (see Section 2.1).

How can we interpret the zero conditional mean assumption in the previous examples? In equation (3.1), the assumption is $E(u\,|\,educ, exper) = 0$. This implies that other factors affecting wage are not related on average to educ and exper. Therefore, if we think innate ability is part of u, then we will need average ability levels to be the same across all combinations of education and experience in the working population. This may or may not be true, but, as we will see in Section 3.3, this is the question we need to ask in order to determine whether the method of ordinary least squares produces unbiased estimators.

The example measuring student performance [equation (3.2)] is similar to the wage equation. The zero conditional mean assumption is $E(u\,|\,expend, avginc) = 0$, which means that other factors affecting test scores—school or student characteristics—are, on average, unrelated to per student funding and average family income.

When applied to the quadratic consumption function in (3.4), the zero conditional mean assumption has a slightly different interpretation. Written literally, equation (3.5) becomes $E(u\,|\,inc, inc^2) = 0$. Since $inc^2$ is known when inc is known, including $inc^2$ in the expectation is redundant: $E(u\,|\,inc, inc^2) = 0$ is the same as $E(u\,|\,inc) = 0$. Nothing is wrong with putting $inc^2$ along with inc in the expectation when stating the assumption, but $E(u\,|\,inc) = 0$ is more concise.

QUESTION 3.1
A simple model to explain city murder rates (murdrate) in terms of the probability of conviction (prbconv) and average sentence length (avgsen) is

$murdrate = \beta_0 + \beta_1 prbconv + \beta_2 avgsen + u$.

What are some factors contained in u? Do you think the key assumption (3.5) is likely to hold?

The Model with k Independent Variables
Once we are in the context of multiple regression, there is no need to stop with two independent variables. Multiple regression analysis allows many observed factors to affect y. In the wage example, we might also include amount of job training, years of tenure with the current employer, measures of ability, and even demographic variables like number of siblings or mother's education. In the school funding example, additional variables might include measures of teacher quality and school size.

The general multiple linear regression model (also called the multiple regression model) can be written in the population as

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + u$,    (3.6)

where $\beta_0$ is the intercept, $\beta_1$ is the parameter associated with $x_1$, $\beta_2$ is the parameter associated with $x_2$, and so on. Since there are k independent variables and an intercept, equation (3.6) contains $k+1$ (unknown) population parameters. For shorthand purposes, we will sometimes refer to the parameters other than the intercept as slope parameters, even though this is not always literally what they are. [See equation (3.4), where neither $\beta_1$ nor $\beta_2$ is itself a slope, but together they determine the slope of the relationship between consumption and income.]

The terminology for multiple regression is similar to that for simple regression and is given in Table 3.1. Just as in simple regression, the variable u is the error term or disturbance. It contains factors other than $x_1, x_2, \ldots, x_k$ that affect y. No matter how many explanatory variables we include in our model, there will always be factors we cannot include, and these are collectively contained in u.

When applying the general multiple regression model, we must know how to interpret the parameters. We will get plenty of practice now and in subsequent chapters, but

Table 3.1 Terminology for Multiple Regression

y: Dependent Variable; Explained Variable; Response Variable; Predicted Variable; Regressand

$x_1, x_2, \ldots, x_k$: Independent Variables; Explanatory Variables; Control Variables; Predictor Variables; Regressors

it is useful at this point to be reminded of some things we already know. Suppose that CEO salary (salary) is related to firm sales and CEO tenure with the firm by

$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 ceoten + \beta_3 ceoten^2 + u$.    (3.7)

This fits into the multiple regression model (with k = 3) by defining $y = \log(salary)$, $x_1 = \log(sales)$, $x_2 = ceoten$, and $x_3 = ceoten^2$. As we know from Chapter 2, the parameter $\beta_1$ is the (ceteris paribus) elasticity of salary with respect to sales. If $\beta_3 = 0$, then $100\beta_2$ is approximately the ceteris paribus percentage increase in salary when ceoten increases by one year. When $\beta_3 \neq 0$, the effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general models with quadratics until Chapter 6.

Equation (3.7) provides an important reminder about multiple regression analysis. The term "linear" in multiple linear regression model means that equation (3.6) is linear in the parameters, $\beta_j$. Equation (3.7) is an example of a multiple regression model that, while linear in the $\beta_j$, is a nonlinear relationship between salary and the variables sales and ceoten. Many applications of multiple linear regression involve nonlinear relationships among the underlying variables.

The key assumption for the general multiple regression model is easy to state in terms of a conditional expectation:

$E(u\,|\,x_1, x_2, \ldots, x_k) = 0$.    (3.8)

At a minimum, equation (3.8) requires that all factors in the unobserved error term be uncorrelated with the explanatory variables. It also means that we have correctly accounted for the functional relationships between the explained and explanatory variables. Any problem that allows u to be correlated with any of the independent variables causes (3.8) to fail. In Section 3.3, we will show that assumption (3.8) implies that OLS is unbiased and will derive the bias that arises when a key variable has been omitted

Chapter 3

Multiple Regression Analysis: Estimation

from the equation. In Chapters 15 and 16, we will study other reasons that might cause (3.8) to fail and show what can be done in cases where it does fail.

3.2 MECHANICS AND INTERPRETATION OF ORDINARY LEAST SQUARES
We now summarize some computational and algebraic features of the method of ordinary least squares as it applies to a particular set of data. We also discuss how to interpret the estimated equation.

Obtaining the OLS Estimates
We first consider estimating the model with two independent variables. The estimated OLS equation is written in a form similar to the simple regression case:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2, \quad (3.9)$$

where $\hat{\beta}_0$ is the estimate of $\beta_0$, $\hat{\beta}_1$ is the estimate of $\beta_1$, and $\hat{\beta}_2$ is the estimate of $\beta_2$. But how do we obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$? The method of ordinary least squares chooses the estimates to minimize the sum of squared residuals. That is, given $n$ observations on $y$, $x_1$, and $x_2$, $\{(x_{i1}, x_{i2}, y_i): i = 1, 2, \ldots, n\}$, the estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ are chosen simultaneously to make

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2})^2 \quad (3.10)$$

as small as possible. In order to understand what OLS is doing, it is important to master the meaning of the indexing of the independent variables in (3.10). The independent variables have two subscripts here, $i$ followed by either 1 or 2. The $i$ subscript refers to the observation number. Thus, the sum in (3.10) is over all $i = 1$ to $n$ observations. The second index is simply a method of distinguishing between different independent variables. In the example relating wage to educ and exper, $x_{i1} = educ_i$ is education for person $i$ in the sample, and $x_{i2} = exper_i$ is experience for person $i$. The sum of squared residuals in equation (3.10) is $\sum_{i=1}^{n} (wage_i - \hat{\beta}_0 - \hat{\beta}_1 educ_i - \hat{\beta}_2 exper_i)^2$. In what follows, the $i$ subscript is reserved for indexing the observation number. If we write $x_{ij}$, then this means the $i$th observation on the $j$th independent variable. (Some authors prefer to switch the order of the observation number and the variable number, so that $x_{1i}$ is observation $i$ on variable one. But this is just a matter of notational taste.)

In the general case with $k$ independent variables, we seek estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ in the equation

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k. \quad (3.11)$$

The OLS estimates, $k+1$ of them, are chosen to minimize the sum of squared residuals:


$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik})^2. \quad (3.12)$$

This minimization problem can be solved using multivariable calculus (see Appendix 3A). This leads to $k+1$ linear equations in $k+1$ unknowns $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$:

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$$
$$\sum_{i=1}^{n} x_{i1} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$$
$$\sum_{i=1}^{n} x_{i2} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0 \quad (3.13)$$
$$\vdots$$
$$\sum_{i=1}^{n} x_{ik} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0.$$
These are often called the OLS first order conditions. As with the simple regression model in Section 2.2, the OLS first order conditions can be motivated by the method of moments: under assumption (3.8), $E(u) = 0$ and $E(x_j u) = 0$, where $j = 1, 2, \ldots, k$. The equations in (3.13) are the sample counterparts of these population moments. For even moderately sized $n$ and $k$, solving the equations in (3.13) by hand is tedious. Nevertheless, modern computers running standard statistics and econometrics software can solve these equations with large $n$ and $k$ very quickly. There is only one slight caveat: we must assume that the equations in (3.13) can be solved uniquely for the $\hat{\beta}_j$. For now, we just assume this, as it is usually the case in well-specified models. In Section 3.3, we state the assumption needed for unique OLS estimates to exist (see Assumption MLR.4).

As in simple regression analysis, equation (3.11) is called the OLS regression line, or the sample regression function (SRF). We will call $\hat{\beta}_0$ the OLS intercept estimate and $\hat{\beta}_1, \ldots, \hat{\beta}_k$ the OLS slope estimates (corresponding to the independent variables $x_1, x_2, \ldots, x_k$). In order to indicate that an OLS regression has been run, we will either write out equation (3.11) with $y$ and $x_1, \ldots, x_k$ replaced by their variable names (such as wage, educ, and exper), or we will say that "we ran an OLS regression of $y$ on $x_1, x_2, \ldots, x_k$" or that "we regressed $y$ on $x_1, x_2, \ldots, x_k$." These are shorthand for saying that the method of ordinary least squares was used to obtain the OLS equation (3.11). Unless explicitly stated otherwise, we always estimate an intercept along with the slopes.
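As a concrete illustration, the system (3.13) can be solved with a few lines of linear algebra: in matrix form, the first order conditions are $X'X b = X'y$. The sketch below uses simulated data (all names and numbers are invented for illustration, not taken from the text's data sets) and verifies that the conditions hold at the solution.

```python
import numpy as np

# Minimal sketch with simulated data: solve the OLS first order conditions
# (3.13), which in matrix form are X'X b = X'y.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # column of ones gives the intercept
b = np.linalg.solve(X.T @ X, X.T @ y)      # unique when X'X is invertible (MLR.4)

# At the solution, the first order conditions X'(y - Xb) = 0 hold.
foc = X.T @ (y - X @ b)
print(b)                     # estimates of (beta0, beta1, beta2)
print(np.max(np.abs(foc)))   # numerically zero
```

Statistical packages do essentially this, usually via a more numerically stable decomposition than forming $X'X$ directly.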

Interpreting the OLS Regression Equation
More important than the details underlying the computation of the $\hat{\beta}_j$ is the interpretation of the estimated equation. We begin with the case of two independent variables:


$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2. \quad (3.14)$$

The intercept $\hat{\beta}_0$ in equation (3.14) is the predicted value of $y$ when $x_1 = 0$ and $x_2 = 0$. Sometimes setting $x_1$ and $x_2$ both equal to zero is an interesting scenario, but in other cases it will not make sense. Nevertheless, the intercept is always needed to obtain a prediction of $y$ from the OLS regression line, as (3.14) makes clear.

The estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ have partial effect, or ceteris paribus, interpretations. From equation (3.14), we have

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1 + \hat{\beta}_2 \Delta x_2,$$

so we can obtain the predicted change in $y$ given the changes in $x_1$ and $x_2$. (Note how the intercept has nothing to do with the changes in $y$.) In particular, when $x_2$ is held fixed, so that $\Delta x_2 = 0$, then

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1,$$

holding $x_2$ fixed. The key point is that, by including $x_2$ in our model, we obtain a coefficient on $x_1$ with a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly, $\Delta\hat{y} = \hat{\beta}_2 \Delta x_2$, holding $x_1$ fixed.

EXAMPLE 3.1 (Determinants of College GPA)

The variables in GPA1.RAW include college grade point average (colGPA), high school GPA (hsGPA), and achievement test score (ACT ) for a sample of 141 students from a large university; both college and high school GPAs are on a four-point scale. We obtain the following OLS regression line to predict college GPA from high school GPA and achievement test score:

$$\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT. \quad (3.15)$$

How do we interpret this equation? First, the intercept 1.29 is the predicted college GPA if hsGPA and ACT are both set to zero. Since no one who attends college has either a zero high school GPA or a zero on the achievement test, the intercept in this equation is not, by itself, meaningful.

More interesting estimates are the slope coefficients on hsGPA and ACT. As expected, there is a positive partial relationship between colGPA and hsGPA: holding ACT fixed, another point on hsGPA is associated with .453 of a point on the college GPA, or almost half a point. In other words, if we choose two students, A and B, and these students have the same ACT score, but the high school GPA of Student A is one point higher than the high school GPA of Student B, then we predict Student A to have a college GPA .453 higher than that of Student B. [This says nothing about any two actual people, but it is our best prediction.]


The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points—a very large change, since the average score in the sample is about 24 with a standard deviation less than three—affects colGPA by less than one-tenth of a point. This is a small effect, and it suggests that, once high school GPA is accounted for, the ACT score is not a strong predictor of college GPA. (Naturally, there are many other factors that contribute to GPA, but here we focus on statistics available for high school students.) Later, after we discuss statistical inference, we will show that not only is the coefficient on ACT practically small, it is also statistically insignificant. If we focus on a simple regression analysis relating colGPA to ACT only, we obtain

$$\widehat{colGPA} = 2.40 + .0271\, ACT;$$

thus, the coefficient on ACT is almost three times as large as the estimate in (3.15). But this equation does not allow us to compare two people with the same high school GPA; it corresponds to a different experiment. We say more about the differences between multiple and simple regression later.

The case with more than two independent variables is similar. The OLS regression line is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k. \quad (3.16)$$

Written in terms of changes,

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1 + \hat{\beta}_2 \Delta x_2 + \cdots + \hat{\beta}_k \Delta x_k. \quad (3.17)$$

The coefficient on $x_1$ measures the change in $\hat{y}$ due to a one-unit increase in $x_1$, holding all other independent variables fixed. That is,

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1, \quad (3.18)$$

holding $x_2, x_3, \ldots, x_k$ fixed. Thus, we have controlled for the variables $x_2, x_3, \ldots, x_k$ when estimating the effect of $x_1$ on $y$. The other coefficients have a similar interpretation. The following is an example with three independent variables.
EXAMPLE 3.2 (Hourly Wage Equation)

Using the 526 observations on workers in WAGE1.RAW, we include educ (years of education), exper (years of labor market experience), and tenure (years with the current employer) in an equation explaining log(wage). The estimated equation is

$$\widehat{\log(wage)} = .284 + .092\, educ + .0041\, exper + .022\, tenure. \quad (3.19)$$

As in the simple regression case, the coefficients have a percentage interpretation. The only difference here is that they also have a ceteris paribus interpretation. The coefficient .092


means that, holding exper and tenure fixed, another year of education is predicted to increase log(wage) by .092, which translates into an approximate 9.2 percent [100(.092)] increase in wage. Alternatively, if we take two people with the same levels of experience and job tenure, the coefficient on educ is the proportionate difference in predicted wage when their education levels differ by one year. This measure of the return to education at least keeps two important productivity factors fixed; whether it is a good estimate of the ceteris paribus return to another year of education requires us to study the statistical properties of OLS (see Section 3.3).
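The 9.2 percent figure uses the standard approximation that 100 times a change in log(wage) is the percentage change in wage. A quick sketch of the exact conversion (only the coefficient .092 comes from equation (3.19); the rest is arithmetic) shows how close the approximation is:

```python
import numpy as np

# The log-level coefficient on educ from (3.19); the approximate percent
# effect is 100*b, while the exact proportionate change is exp(b) - 1.
b_educ = 0.092
approx_pct = 100 * b_educ
exact_pct = 100 * (np.exp(b_educ) - 1)
print(round(approx_pct, 1), round(exact_pct, 2))  # 9.2 versus about 9.64
```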

On the Meaning of “Holding Other Factors Fixed” in Multiple Regression
The partial effect interpretation of slope coefficients in multiple regression analysis can cause some confusion, so we attempt to prevent that problem now. In Example 3.1, we observed that the coefficient on ACT measures the predicted difference in colGPA, holding hsGPA fixed. The power of multiple regression analysis is that it provides this ceteris paribus interpretation even though the data have not been collected in a ceteris paribus fashion. In giving the coefficient on ACT a partial effect interpretation, it may seem that we actually went out and sampled people with the same high school GPA but possibly with different ACT scores. This is not the case. The data are a random sample from a large university: there were no restrictions placed on the sample values of hsGPA or ACT in obtaining the data. Rarely do we have the luxury of holding certain variables fixed in obtaining our sample. If we could collect a sample of individuals with the same high school GPA, then we could perform a simple regression analysis relating colGPA to ACT. Multiple regression effectively allows us to mimic this situation without restricting the values of any independent variables. The power of multiple regression analysis is that it allows us to do in nonexperimental environments what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.

Changing More than One Independent Variable Simultaneously
Sometimes we want to change more than one independent variable at the same time to find the resulting effect on the dependent variable. This is easily done using equation (3.17). For example, in equation (3.19), we can obtain the estimated effect on wage when an individual stays at the same firm for another year: exper (general workforce experience) and tenure both increase by one year. The total effect (holding educ fixed) is

$$\Delta\widehat{\log(wage)} = .0041\, \Delta exper + .022\, \Delta tenure = .0041 + .022 = .0261,$$

or about 2.6 percent. Since exper and tenure each increase by one year, we just add the coefficients on exper and tenure and multiply by 100 to turn the effect into a percent.
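The computation above is easy to check directly; both coefficients come from equation (3.19):

```python
# Combined effect of one more year at the same firm: exper and tenure
# each rise by one, so the coefficients from (3.19) simply add.
b_exper, b_tenure = 0.0041, 0.022
total = b_exper + b_tenure
print(round(100 * total, 2))  # about 2.61 percent
```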

OLS Fitted Values and Residuals
After obtaining the OLS regression line (3.11), we can obtain a fitted or predicted value for each observation. For observation i, the fitted value is simply


$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}, \quad (3.20)$$

which is just the predicted value obtained by plugging the values of the independent variables for observation $i$ into equation (3.11). We should not forget about the intercept in obtaining the fitted values; otherwise, the answer can be very misleading. As an example, if in (3.15), $hsGPA_i = 3.5$ and $ACT_i = 24$, then $\widehat{colGPA}_i = 1.29 + .453(3.5) + .0094(24) = 3.101$ (rounded to three places after the decimal).

QUESTION 3.2
In Example 3.1, the OLS fitted line explaining college GPA in terms of high school GPA and ACT score is $\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT$. If the average high school GPA is about 3.4 and the average ACT score is about 24.2, what is the average college GPA in the sample?

Normally, the actual value $y_i$ for any observation $i$ will not equal the predicted value, $\hat{y}_i$: OLS minimizes the average squared prediction error, which says nothing about the prediction error for any particular observation. The residual for observation $i$ is defined just as in the simple regression case,

$$\hat{u}_i = y_i - \hat{y}_i. \quad (3.21)$$

There is a residual for each observation. If $\hat{u}_i > 0$, then $\hat{y}_i$ is below $y_i$, which means that, for this observation, $y_i$ is underpredicted. If $\hat{u}_i < 0$, then $y_i < \hat{y}_i$, and $y_i$ is overpredicted.

The OLS fitted values and residuals have some important properties that are immediate extensions from the single variable case:

1. The sample average of the residuals is zero.
2. The sample covariance between each independent variable and the OLS residuals is zero. Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
3. The point $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k, \bar{y})$ is always on the OLS regression line: $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1 + \hat{\beta}_2 \bar{x}_2 + \cdots + \hat{\beta}_k \bar{x}_k$.

The first two properties are immediate consequences of the set of equations used to obtain the OLS estimates. The first equation in (3.13) says that the sum of the residuals is zero. The remaining equations are of the form $\sum_{i=1}^{n} x_{ij}\hat{u}_i = 0$, which implies that each independent variable has zero sample covariance with $\hat{u}_i$. Property 3 follows immediately from Property 1.
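These properties are easy to confirm numerically. The sketch below fits OLS to simulated data (all numbers invented for illustration) and checks each one:

```python
import numpy as np

# Minimal sketch with simulated data: verify the three algebraic properties
# of the OLS fitted values and residuals.
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
uhat = y - yhat

print(uhat.mean())                    # 1. residuals average to zero
print(np.cov(X[:, 1], uhat)[0, 1])    # 2. zero covariance with each regressor...
print(np.cov(yhat, uhat)[0, 1])       #    ...and hence with the fitted values
print(y.mean() - yhat.mean())         # 3. means agree, so the point of means
                                      #    lies on the regression line
```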

A “Partialling Out” Interpretation of Multiple Regression
When applying OLS, we do not need to know explicit formulas for the $\hat{\beta}_j$ that solve the system of equations (3.13). Nevertheless, for certain derivations, we do need explicit formulas for the $\hat{\beta}_j$. These formulas also shed further light on the workings of OLS.

Consider again the case with $k = 2$ independent variables, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. For concreteness, we focus on $\hat{\beta}_1$. One way to express $\hat{\beta}_1$ is

$$\hat{\beta}_1 = \left( \sum_{i=1}^{n} \hat{r}_{i1} y_i \right) \bigg/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right), \quad (3.22)$$

where the $\hat{r}_{i1}$ are the OLS residuals from a simple regression of $x_1$ on $x_2$, using the sample at hand. We regress our first independent variable, $x_1$, on our second independent variable, $x_2$, and then obtain the residuals ($y$ plays no role here). Equation (3.22) shows that we can then do a simple regression of $y$ on $\hat{r}_1$ to obtain $\hat{\beta}_1$. (Note that the residuals $\hat{r}_{i1}$ have a zero sample average, and so $\hat{\beta}_1$ is the usual slope estimate from simple regression.)

The representation in equation (3.22) gives another demonstration of $\hat{\beta}_1$'s partial effect interpretation. The residuals $\hat{r}_{i1}$ are the part of $x_{i1}$ that is uncorrelated with $x_{i2}$. Another way of saying this is that $\hat{r}_{i1}$ is $x_{i1}$ after the effects of $x_{i2}$ have been partialled out, or netted out. Thus, $\hat{\beta}_1$ measures the sample relationship between $y$ and $x_1$ after $x_2$ has been partialled out. In simple regression analysis, there is no partialling out of other variables because no other variables are included in the regression. Problem 3.17 steps you through the partialling out process using the wage data from Example 3.2. For practical purposes, the important thing is that $\hat{\beta}_1$ in the equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ measures the change in $y$ given a one-unit increase in $x_1$, holding $x_2$ fixed.

In the general model with $k$ explanatory variables, $\hat{\beta}_1$ can still be written as in equation (3.22), but the residuals $\hat{r}_{i1}$ come from the regression of $x_1$ on $x_2, \ldots, x_k$. Thus, $\hat{\beta}_1$ measures the effect of $x_1$ on $y$ after $x_2, \ldots, x_k$ have been partialled or netted out.
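The two-step recipe can be sketched directly. With simulated data (invented for illustration), the code below runs the full multiple regression and then reproduces the coefficient on $x_1$ from equation (3.22):

```python
import numpy as np

# Minimal sketch of "partialling out" (the Frisch-Waugh result) on simulated data.
rng = np.random.default_rng(2)
n = 500
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)          # x1 and x2 are correlated
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Full multiple regression of y on (1, x1, x2).
X = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 1: residuals r1 from regressing x1 on (1, x2).
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
r1 = x1 - Z @ g

# Step 2: equation (3.22): beta1_hat = sum(r1*y) / sum(r1^2).
b1_partial = (r1 @ y) / (r1 @ r1)
print(b_full[1], b1_partial)  # identical up to rounding error
```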

Comparison of Simple and Multiple Regression Estimates
Two special cases exist in which the simple regression of $y$ on $x_1$ will produce the same OLS estimate on $x_1$ as the regression of $y$ on $x_1$ and $x_2$. To be more precise, write the simple regression of $y$ on $x_1$ as $\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$ and write the multiple regression as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. We know that the simple regression coefficient $\tilde{\beta}_1$ does not usually equal the multiple regression coefficient $\hat{\beta}_1$. There are two distinct cases where $\tilde{\beta}_1$ and $\hat{\beta}_1$ are identical:

1. The partial effect of $x_2$ on $y$ is zero in the sample. That is, $\hat{\beta}_2 = 0$.
2. $x_1$ and $x_2$ are uncorrelated in the sample.

The first assertion can be proven by looking at two of the equations used to determine $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$: $\sum_{i=1}^{n} x_{i1}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2}) = 0$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2$. Setting $\hat{\beta}_2 = 0$ gives the same intercept and slope as does the regression of $y$ on $x_1$.

The second assertion follows from equation (3.22). If $x_1$ and $x_2$ are uncorrelated in the sample, then regressing $x_1$ on $x_2$ results in no partialling out, and so the simple regression of $y$ on $x_1$ and the multiple regression of $y$ on $x_1$ and $x_2$ produce identical estimates on $x_1$.

Even though simple and multiple regression estimates are almost never identical, we can use the previous characterizations to explain why they might be either very different or quite similar. For example, if $\hat{\beta}_2$ is small, we might expect the simple and multiple regression estimates of $\beta_1$ to be similar. In Example 3.1, the sample correlation between hsGPA and ACT is about 0.346, which is a nontrivial correlation. But the coefficient on ACT is fairly small. It is not surprising to find that the simple regression of colGPA on hsGPA produces a slope estimate of .482, which is not much different from the estimate .453 in (3.15).
EXAMPLE 3.3 (Participation in 401(k) Pension Plans)

We use the data in 401K.RAW to estimate the effect of a plan's match rate (mrate) on the participation rate (prate) in its 401(k) pension plan. The match rate is the amount the firm contributes to a worker's fund for each dollar the worker contributes (up to some limit); thus, mrate = .75 means that the firm contributes 75 cents for each dollar contributed by the worker. The participation rate is the percentage of eligible workers having a 401(k) account. The variable age is the age of the 401(k) plan. There are 1,534 plans in the data set, the average prate is 87.36, the average mrate is .732, and the average age is 13.2. Regressing prate on mrate and age gives

$$\widehat{prate} = 80.12 + 5.52\, mrate + .243\, age. \quad (3.23)$$

Thus, both mrate and age have the expected effects. What happens if we do not control for age? The estimated effect of age is not trivial, and so we might expect a large change in the estimated effect of mrate if age is dropped from the regression. However, the simple regression of prate on mrate yields $\widehat{prate} = 83.08 + 5.86\, mrate$. The simple regression estimate of the effect of mrate on prate is clearly different from the multiple regression estimate, but the difference is not very big. (The simple regression estimate is only about 6.2 percent larger than the multiple regression estimate.) This can be explained by the fact that the sample correlation between mrate and age is only .12.

In the case with $k$ independent variables, the simple regression of $y$ on $x_1$ and the multiple regression of $y$ on $x_1, x_2, \ldots, x_k$ produce an identical estimate of the coefficient on $x_1$ only if (1) the OLS coefficients on $x_2$ through $x_k$ are all zero or (2) $x_1$ is uncorrelated with each of $x_2, \ldots, x_k$. Neither of these is very likely in practice. But if the coefficients on $x_2$ through $x_k$ are small, or the sample correlations between $x_1$ and the other independent variables are insubstantial, then the simple and multiple regression estimates of the effect of $x_1$ on $y$ can be similar.
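The second special case is easy to demonstrate. In the sketch below (simulated data, with $x_2$ constructed so its sample correlation with $x_1$ is exactly zero), the simple and multiple regression coefficients on $x_1$ coincide:

```python
import numpy as np

# Minimal sketch: with zero sample correlation between x1 and x2, simple and
# multiple regression give the same coefficient on x1.
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
z = rng.normal(size=n)
Z1 = np.column_stack([np.ones(n), x1])
g, *_ = np.linalg.lstsq(Z1, z, rcond=None)
x2 = z - Z1 @ g                  # residuals: sample cov(x1, x2) = 0 exactly
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

b_simple, *_ = np.linalg.lstsq(Z1, y, rcond=None)
X = np.column_stack([np.ones(n), x1, x2])
b_mult, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_simple[1], b_mult[1])    # identical slopes on x1
```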

Goodness-of-Fit
As with simple regression, we can define the total sum of squares (SST), the explained sum of squares (SSE), and the residual sum of squares or sum of squared residuals (SSR), as

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad (3.24)$$

$$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \quad (3.25)$$

$$SSR = \sum_{i=1}^{n} \hat{u}_i^2. \quad (3.26)$$

Using the same argument as in the simple regression case, we can show that

$$SST = SSE + SSR. \quad (3.27)$$

In other words, the total variation in $\{y_i\}$ is the sum of the total variations in $\{\hat{y}_i\}$ and in $\{\hat{u}_i\}$. Assuming that the total variation in $y$ is nonzero, as is the case unless $y_i$ is constant in the sample, we can divide (3.27) by SST to get

$$SSE/SST + SSR/SST = 1.$$

Just as in the simple regression case, the R-squared is defined to be

$$R^2 = SSE/SST = 1 - SSR/SST, \quad (3.28)$$

and it is interpreted as the proportion of the sample variation in $y_i$ that is explained by the OLS regression line. By definition, $R^2$ is a number between zero and one.

$R^2$ can also be shown to equal the squared correlation coefficient between the actual $y_i$ and the fitted values $\hat{y}_i$. That is,

$$R^2 = \frac{\left( \sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}}) \right)^2}{\left( \sum_{i=1}^{n} (y_i - \bar{y})^2 \right) \left( \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 \right)}. \quad (3.29)$$

(We have put the average of the $\hat{y}_i$ in (3.29) to be true to the formula for a correlation coefficient; we know that this average equals $\bar{y}$ because the sample average of the residuals is zero and $y_i = \hat{y}_i + \hat{u}_i$.)

An important fact about $R^2$ is that it never decreases, and it usually increases when another independent variable is added to a regression. This algebraic fact follows because, by definition, the sum of squared residuals never increases when additional regressors are added to the model.

The fact that $R^2$ never decreases when any variable is added to a regression makes it a poor tool for deciding whether one variable or several variables should be added to a model. The factor that should determine whether an explanatory variable belongs in a model is whether the explanatory variable has a nonzero partial effect on $y$ in the population. We will show how to test this hypothesis in Chapter 4 when we cover statistical inference. We will also see that, when used properly, $R^2$ allows us to test a group of variables to see if it is important for explaining $y$. For now, we use it as a goodness-of-fit measure for a given model.
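The equivalence of the definitions in (3.28) and (3.29) can be checked numerically. The sketch below (simulated data, invented for illustration) computes R-squared three ways and gets the same number:

```python
import numpy as np

# Minimal sketch: R-squared as SSE/SST, as 1 - SSR/SST, and as the squared
# correlation between y and the fitted values.
rng = np.random.default_rng(4)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.5, -0.8]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
uhat = y - yhat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((yhat - yhat.mean()) ** 2)
ssr = np.sum(uhat ** 2)

r2_a = sse / sst
r2_b = 1 - ssr / sst
r2_c = np.corrcoef(y, yhat)[0, 1] ** 2
print(r2_a, r2_b, r2_c)  # all three agree
```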


EXAMPLE 3.4 (Determinants of College GPA)

From the grade point average regression that we did earlier, the equation with $R^2$ is

$$\widehat{colGPA} = 1.29 + .453\, hsGPA + .0094\, ACT$$
$$n = 141, \quad R^2 = .176.$$

This means that hsGPA and ACT together explain about 17.6 percent of the variation in college GPA for this sample of students. This may not seem like a high percentage, but we must remember that there are many other factors—including family background, personality, quality of high school education, affinity for college—that contribute to a student’s college performance. If hsGPA and ACT explained almost all of the variation in colGPA, then performance in college would be preordained by high school performance!
EXAMPLE 3.5 (Explaining Arrest Records)

CRIME1.RAW contains data on arrests during the year 1986 and other information on 2,725 men born in either 1960 or 1961 in California. Each man in the sample was arrested at least once prior to 1986. The variable narr86 is the number of times the man was arrested during 1986; it is zero for most men in the sample (72.29 percent), and it varies from 0 to 12. (The percentage of the men arrested once during 1986 was 20.51.) The variable pcnv is the proportion (not percentage) of arrests prior to 1986 that led to conviction, avgsen is average sentence length served for prior convictions (zero for most people), ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters during which the man was employed in 1986 (from zero to four). A linear model explaining arrests is

$$narr86 = \beta_0 + \beta_1\, pcnv + \beta_2\, avgsen + \beta_3\, ptime86 + \beta_4\, qemp86 + u,$$

where pcnv is a proxy for the likelihood of being convicted of a crime and avgsen is a measure of expected severity of punishment, if convicted. The variable ptime86 captures the incarcerative effects of crime: if an individual is in prison, he cannot be arrested for a crime outside of prison. Labor market opportunities are crudely captured by qemp86. First, we estimate the model without the variable avgsen. We obtain

$$\widehat{narr86} = .712 - .150\, pcnv - .034\, ptime86 - .104\, qemp86$$
$$n = 2{,}725, \quad R^2 = .0413.$$

This equation says that, as a group, the three variables pcnv, ptime86, and qemp86 explain about 4.1 percent of the variation in narr86. Each of the OLS slope coefficients has the anticipated sign. An increase in the proportion of convictions lowers the predicted number of arrests. If we increase pcnv by .50 (a large increase in the probability of conviction), then, holding the other factors fixed, $\Delta\widehat{narr86} = -.150(.5) = -.075$. This may seem unusual because an arrest cannot change by a fraction. But we can use this value to obtain the predicted change in expected arrests for a large group of men. For example, among 100 men, the predicted fall in arrests when pcnv increases by .5 is 7.5.


Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86 increases from 0 to 12, predicted arrests for a particular man fall by $.034(12) = .408$. Another quarter in which legal employment is reported lowers predicted arrests by .104, which would be 10.4 arrests among 100 men. If avgsen is added to the model, we know that $R^2$ will increase. The estimated equation is

$$\widehat{narr86} = .707 - .151\, pcnv + .0074\, avgsen - .037\, ptime86 - .103\, qemp86$$
$$n = 2{,}725, \quad R^2 = .0422.$$

Thus, adding the average sentence variable increases R2 from .0413 to .0422, a practically small effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer average sentence length increases criminal activity.

Example 3.5 deserves a final word of caution. The fact that the four explanatory variables included in the second regression explain only about 4.2 percent of the variation in narr86 does not necessarily mean that the equation is useless. Even though these variables collectively do not explain much of the variation in arrests, it is still possible that the OLS estimates are reliable estimates of the ceteris paribus effects of each independent variable on narr86. As we will see, whether this is the case does not directly depend on the size of R2. Generally, a low R2 indicates that it is hard to predict individual outcomes on y with much accuracy, something we study in more detail in Chapter 6. In the arrest example, the small R2 reflects what we already suspect in the social sciences: it is generally very difficult to predict individual behavior.

Regression Through the Origin
Sometimes, an economic theory or common sense suggests that $\beta_0$ should be zero, and so we should briefly mention OLS estimation when the intercept is zero. Specifically, we now seek an equation of the form

$$\tilde{y} = \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2 + \cdots + \tilde{\beta}_k x_k, \quad (3.30)$$

where the symbol "~" over the estimates is used to distinguish them from the OLS estimates obtained along with the intercept [as in (3.11)]. In (3.30), when $x_1 = 0, x_2 = 0, \ldots, x_k = 0$, the predicted value is zero. In this case, $\tilde{\beta}_1, \ldots, \tilde{\beta}_k$ are said to be the OLS estimates from the regression of $y$ on $x_1, x_2, \ldots, x_k$ through the origin. The OLS estimates in (3.30), as always, minimize the sum of squared residuals, but with the intercept set at zero.

You should be warned that the properties of OLS that we derived earlier no longer hold for regression through the origin. In particular, the OLS residuals no longer have a zero sample average. Further, if $R^2$ is defined as $1 - SSR/SST$, where SST is given in (3.24) and SSR is now $\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_{i1} - \cdots - \tilde{\beta}_k x_{ik})^2$, then $R^2$ can actually be negative. This means that the sample average, $\bar{y}$, "explains" more of the variation in the $y_i$ than the explanatory variables. Either we should include an intercept in the regression or conclude that the explanatory variables poorly explain $y$. In order to always have a nonnegative R-squared, some economists prefer to calculate $R^2$ as the squared correlation coefficient between the actual and fitted values of $y$, as in (3.29). (In this case, the average fitted value must be computed directly since it no longer equals $\bar{y}$.) However, there is no set rule on computing R-squared for regression through the origin.

One serious drawback with regression through the origin is that, if the intercept $\beta_0$ in the population model is different from zero, then the OLS estimators of the slope parameters will be biased. The bias can be severe in some cases. The cost of estimating an intercept when $\beta_0$ is truly zero is that the variances of the OLS slope estimators are larger.
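A negative R-squared is easy to produce. In the sketch below (simulated data with a large population intercept, all numbers invented), forcing the fitted line through the origin leaves the level of $y$ unexplained, and $1 - SSR/SST$ falls below zero:

```python
import numpy as np

# Minimal sketch: regression through the origin when the true intercept is
# large, so R^2 = 1 - SSR/SST turns negative.
rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 10.0 + 0.2 * x + rng.normal(size=n)  # big intercept, weak slope

b1 = (x @ y) / (x @ x)                   # through-origin slope estimate
uhat = y - b1 * x
r2_origin = 1 - np.sum(uhat ** 2) / np.sum((y - y.mean()) ** 2)
print(r2_origin)  # negative: ybar "explains" more than b1*x does
```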

3.3 THE EXPECTED VALUE OF THE OLS ESTIMATORS
We now turn to the statistical properties of OLS for estimating the parameters in an underlying population model. In this section, we derive the expected value of the OLS estimators. In particular, we state and discuss four assumptions, which are direct extensions of the simple regression model assumptions, under which the OLS estimators are unbiased for the population parameters. We also explicitly obtain the bias in OLS when an important variable has been omitted from the regression. You should remember that statistical properties have nothing to do with a particular sample, but rather with the property of estimators when random sampling is done repeatedly. Thus, Sections 3.3, 3.4, and 3.5 are somewhat abstract. While we give examples of deriving bias for particular models, it is not meaningful to talk about the statistical properties of a set of estimates obtained from a single sample. The first assumption we make simply defines the multiple linear regression (MLR) model.
ASSUMPTION MLR.1 (LINEAR IN PARAMETERS)

The model in the population can be written as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \quad (3.31)$$

where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest, and $u$ is an unobservable random error or random disturbance term.

Equation (3.31) formally states the population model, sometimes called the true model, to allow for the possibility that we might estimate a model that differs from (3.31). The key feature is that the model is linear in the parameters $\beta_0, \beta_1, \ldots, \beta_k$. As we know, (3.31) is quite flexible because $y$ and the independent variables can be arbitrary functions of the underlying variables of interest, such as natural logarithms and squares [see, for example, equation (3.7)].
ASSUMPTION MLR.2 (RANDOM SAMPLING)

We have a random sample of $n$ observations, $\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, 2, \ldots, n\}$, from the population model described by (3.31).


Sometimes we need to write the equation for a particular observation i: for a randomly drawn observation from the population, we have

yi = β0 + β1xi1 + β2xi2 + … + βkxik + ui.   (3.32)

Remember that i refers to the observation, and the second subscript on x is the variable number. For example, we can write a CEO salary equation for a particular CEO i as

log(salaryi) = β0 + β1log(salesi) + β2ceoteni + β3ceoten²i + ui.   (3.33)

The term ui contains the unobserved factors for CEO i that affect his or her salary. For applications, it is usually easiest to write the model in population form, as in (3.31). It contains less clutter and emphasizes the fact that we are interested in estimating a population relationship. In light of model (3.31), the OLS estimators β̂0, β̂1, β̂2, …, β̂k from the regression of y on x1, …, xk are now considered to be estimators of β0, β1, …, βk. We saw, in Section 3.2, that OLS chooses the estimates for a particular sample so that the residuals average out to zero and the sample correlation between each independent variable and the residuals is zero. For OLS to be unbiased, we need the population version of this condition to be true.
ASSUMPTION MLR.3 (ZERO CONDITIONAL MEAN)

The error u has an expected value of zero, given any values of the independent variables. In other words,

E(u | x1, x2, …, xk) = 0.   (3.34)

One way that Assumption MLR.3 can fail is if the functional relationship between the explained and explanatory variables is misspecified in equation (3.31): for example, if we forget to include the quadratic term inc² in the consumption function cons = β0 + β1inc + β2inc² + u when we estimate the model. Another functional form misspecification occurs when we use the level of a variable when the log of the variable is what actually shows up in the population model, or vice versa. For example, if the true model has log(wage) as the dependent variable but we use wage as the dependent variable in our regression analysis, then the estimators will be biased. Intuitively, this should be pretty clear. We will discuss ways of detecting functional form misspecification in Chapter 9. Omitting an important factor that is correlated with any of x1, x2, …, xk also causes Assumption MLR.3 to fail. With multiple regression analysis, we are able to include many factors among the explanatory variables, so omitted variables are less likely to be a problem in multiple regression analysis than in simple regression analysis. Nevertheless, in any application there are always factors that, due to data limitations or ignorance, we will not be able to include. If we think these factors should be controlled for and they are correlated with one or more of the independent variables, then Assumption MLR.3 will be violated. We will derive this bias in some simple models later.

Part 1

Regression Analysis with Cross-Sectional Data

There are other ways that u can be correlated with an explanatory variable. In Chapter 15, we will discuss the problem of measurement error in an explanatory variable. In Chapter 16, we cover the conceptually more difficult problem in which one or more of the explanatory variables is determined jointly with y. We must postpone our study of these problems until we have a firm grasp of multiple regression analysis under an ideal set of assumptions. When Assumption MLR.3 holds, we often say we have exogenous explanatory variables. If xj is correlated with u for any reason, then xj is said to be an endogenous explanatory variable. The terms “exogenous” and “endogenous” originated in simultaneous equations analysis (see Chapter 16), but the term “endogenous explanatory variable” has evolved to cover any case where an explanatory variable may be correlated with the error term. The final assumption we need to show that OLS is unbiased ensures that the OLS estimators are actually well-defined. For simple regression, we needed to assume that the single independent variable was not constant in the sample. The corresponding assumption for multiple regression analysis is more complicated.
ASSUMPTION MLR.4 (NO PERFECT COLLINEARITY)

In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.

The no perfect collinearity assumption concerns only the independent variables. Beginning students of econometrics tend to confuse Assumptions MLR.4 and MLR.3, so we emphasize here that MLR.4 says nothing about the relationship between u and the explanatory variables. Assumption MLR.4 is more complicated than its counterpart for simple regression because we must now look at relationships between all independent variables. If an independent variable in (3.31) is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity, and it cannot be estimated by OLS. It is important to note that Assumption MLR.4 does allow the independent variables to be correlated; they just cannot be perfectly correlated. If we did not allow for any correlation among the independent variables, then multiple regression would not be very useful for econometric analysis. For example, in the model relating test scores to educational expenditures and average family income,

avgscore = β0 + β1expend + β2avginc + u,
we fully expect expend and avginc to be correlated: school districts with high average family incomes tend to spend more per student on education. In fact, the primary motivation for including avginc in the equation is that we suspect it is correlated with expend, and so we would like to hold it fixed in the analysis. Assumption MLR.4 only rules out perfect correlation between expend and avginc in our sample. We would be very unlucky to obtain a sample where per student expenditures are perfectly correlated with average family income. But some correlation, perhaps a substantial amount, is expected and certainly allowed.


The simplest way that two independent variables can be perfectly correlated is when one variable is a constant multiple of another. This can happen when a researcher inadvertently puts the same variable measured in different units into a regression equation. For example, in estimating a relationship between consumption and income, it makes no sense to include as independent variables income measured in dollars as well as income measured in thousands of dollars. One of these is redundant. What sense would it make to hold income measured in dollars fixed while changing income measured in thousands of dollars? We already know that different nonlinear functions of the same variable can appear among the regressors. For example, the model cons = β0 + β1inc + β2inc² + u does not violate Assumption MLR.4: even though x2 = inc² is an exact function of x1 = inc, inc² is not an exact linear function of inc. Including inc² in the model is a useful way to generalize functional form, unlike including income measured in dollars and in thousands of dollars. Common sense tells us not to include the same explanatory variable measured in different units in the same regression equation. There are also more subtle ways that one independent variable can be a multiple of another. Suppose we would like to estimate an extension of a constant elasticity consumption function. It might seem natural to specify a model such as

log(cons) = β0 + β1log(inc) + β2log(inc²) + u,   (3.35)

where x1 = log(inc) and x2 = log(inc²). Using the basic properties of the natural log (see Appendix A), log(inc²) = 2·log(inc). That is, x2 = 2x1, and naturally this holds for all observations in the sample. This violates Assumption MLR.4. What we should do instead is include [log(inc)]², not log(inc²), along with log(inc). This is a sensible extension of the constant elasticity model, and we will see how to interpret such models in Chapter 6. Another way that independent variables can be perfectly collinear is when one independent variable can be expressed as an exact linear function of two or more of the other independent variables. For example, suppose we want to estimate the effect of campaign spending on campaign outcomes. For simplicity, assume that each election has two candidates. Let voteA be the percent of the vote for Candidate A, let expendA be campaign expenditures by Candidate A, let expendB be campaign expenditures by Candidate B, and let totexpend be total campaign expenditures; the latter three variables are all measured in dollars. It may seem natural to specify the model as

voteA = β0 + β1expendA + β2expendB + β3totexpend + u,   (3.36)

in order to isolate the effects of spending by each candidate and the total amount of spending. But this model violates Assumption MLR.4 because x3 = x1 + x2 by definition. Trying to interpret this equation in a ceteris paribus fashion reveals the problem. The parameter β1 in equation (3.36) is supposed to measure the effect of increasing expenditures by Candidate A by one dollar on Candidate A's vote, holding Candidate B's spending and total spending fixed. This is nonsense, because if expendB and totexpend are held fixed, then we cannot increase expendA.
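The failure in (3.36) can be seen numerically: when totexpend is constructed as expendA + expendB, the design matrix loses full column rank, so the OLS normal equations have no unique solution. A minimal sketch using simulated (hypothetical) spending data:

```python
import numpy as np

# Sketch of the voteA example with hypothetical simulated spending data:
# because totexpend = expendA + expendB exactly, the design matrix loses
# full column rank and OLS cannot separate the three coefficients.
rng = np.random.default_rng(0)
n = 100
expendA = rng.uniform(1e4, 1e6, size=n)
expendB = rng.uniform(1e4, 1e6, size=n)
totexpend = expendA + expendB           # exact linear function of the others

X = np.column_stack([np.ones(n), expendA, expendB, totexpend])
print(np.linalg.matrix_rank(X))         # 3, not 4: perfect collinearity

# Dropping any one of the three spending variables restores full rank
X_ok = np.column_stack([np.ones(n), expendA, expendB])
print(np.linalg.matrix_rank(X_ok))      # 3 = number of columns
```

This mirrors the fix suggested below: drop one of the three perfectly collinear variables and the remaining coefficients are estimable.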


The solution to the perfect collinearity in (3.36) is simple: drop any one of the three variables from the model. We would probably drop totexpend, and then the coefficient on expendA would measure the effect of increasing expenditures by A on the percentage of the vote received by A, holding the spending by B fixed. The prior examples show that Assumption MLR.4 can fail if we are not careful in specifying our model. Assumption MLR.4 also fails if the sample size, n, is too small in relation to the number of parameters being estimated. In the general regression model in equation (3.31), there are k + 1 parameters, and MLR.4 fails if n < k + 1. Intuitively, this makes sense: to estimate k + 1 parameters, we need at least k + 1 observations. Not surprisingly, it is better to have as many observations as possible, something we will see with our variance calculations in Section 3.4. If the model is carefully specified and n ≥ k + 1, Assumption MLR.4 can fail in rare cases due to bad luck in collecting the sample. For example, in a wage equation with education and experience as variables, it is possible that we could obtain a random sample where each individual has exactly twice as much education as years of experience. This scenario would cause Assumption MLR.4 to fail, but it can be considered very unlikely unless we have an extremely small sample size. We are now ready to show that, under these four multiple regression assumptions, the OLS estimators are unbiased. As in the simple regression case, the expectations are conditional on the values of the independent variables in the sample, but we do not show this conditioning explicitly.

QUESTION 3.3: In the previous example, if we use as explanatory variables expendA, expendB, and shareA, where shareA = 100·(expendA/totexpend) is the percentage share of total campaign expenditures made by Candidate A, does this violate Assumption MLR.4?
THEOREM 3.1 (UNBIASEDNESS OF OLS)

Under Assumptions MLR.1 through MLR.4,

E(β̂j) = βj,   j = 0, 1, …, k,   (3.37)

for any values of the population parameter βj. In other words, the OLS estimators are unbiased estimators of the population parameters.

In our previous empirical examples, Assumption MLR.4 has been satisfied (since we have been able to compute the OLS estimates). Furthermore, for the most part, the samples are randomly chosen from a well-defined population. If we believe that the specified models are correct under the key Assumption MLR.3, then we can conclude that OLS is unbiased in these examples. Since we are approaching the point where we can use multiple regression in serious empirical work, it is useful to remember the meaning of unbiasedness. It is tempting, in examples such as the wage equation in equation (3.19), to say something like "9.2 percent is an unbiased estimate of the return to education." As we know, an estimate cannot be unbiased: an estimate is a fixed number, obtained from a particular sample, which usually is not equal to the population parameter. When we say that OLS is unbiased under Assumptions MLR.1 through MLR.4, we mean that the procedure by which the OLS estimates are obtained is unbiased when we view the procedure as being applied across all possible random samples. We hope that we have obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured.
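The repeated-sampling meaning of unbiasedness can be illustrated by simulation. In this sketch the population parameters are hypothetical choices, not values from the text; across many random samples, the OLS estimates average out to the truth even though each individual estimate misses it:

```python
import numpy as np

# Simulation sketch of Theorem 3.1 with hypothetical population parameters:
# the OLS procedure, applied across many random samples, centers on the truth.
rng = np.random.default_rng(42)
beta = np.array([1.0, 0.5, -2.0])       # hypothetical beta0, beta1, beta2
n, reps = 200, 5000

estimates = np.empty((reps, 3))
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.4 * x1 + rng.normal(size=n)  # correlated regressors are allowed
    u = rng.normal(size=n)              # MLR.3: E(u | x1, x2) = 0
    y = beta[0] + beta[1] * x1 + beta[2] * x2 + u
    X = np.column_stack([np.ones(n), x1, x2])
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

# Any single row of `estimates` misses, but the average is on target
print(estimates.mean(axis=0))  # close to [1.0, 0.5, -2.0]
```

Note that x1 and x2 are deliberately correlated here: MLR.4 only rules out perfect correlation, and unbiasedness survives ordinary correlation among regressors.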

Including Irrelevant Variables in a Regression Model
One issue that we can dispense with fairly quickly is the inclusion of an irrelevant variable, or overspecifying the model, in multiple regression analysis. This means that one (or more) of the independent variables is included in the model even though it has no partial effect on y in the population. (That is, its population coefficient is zero.) To illustrate the issue, suppose we specify the model as

y = β0 + β1x1 + β2x2 + β3x3 + u,   (3.38)

and this model satisfies Assumptions MLR.1 through MLR.4. However, x3 has no effect on y after x1 and x2 have been controlled for, which means that β3 = 0. The variable x3 may or may not be correlated with x1 or x2; all that matters is that, once x1 and x2 are controlled for, x3 has no effect on y. In terms of conditional expectations, E(y | x1, x2, x3) = E(y | x1, x2) = β0 + β1x1 + β2x2. Because we do not know that β3 = 0, we are inclined to estimate the equation including x3:

ŷ = β̂0 + β̂1x1 + β̂2x2 + β̂3x3.   (3.39)

We have included the irrelevant variable, x3, in our regression. What is the effect of including x3 in (3.39) when its coefficient in the population model (3.38) is zero? In terms of the unbiasedness of β̂1 and β̂2, there is no effect. This conclusion requires no special derivation, as it follows immediately from Theorem 3.1. Remember, unbiasedness means E(β̂j) = βj for any value of βj, including βj = 0. Thus, we can conclude that E(β̂0) = β0, E(β̂1) = β1, E(β̂2) = β2, and E(β̂3) = 0 (for any values of β0, β1, and β2). Even though β̂3 itself will never be exactly zero, its average value across many random samples will be zero. The conclusion of the preceding example is much more general: including one or more irrelevant variables in a multiple regression model, or overspecifying the model, does not affect the unbiasedness of the OLS estimators. Does this mean it is harmless to include irrelevant variables? No. As we will see in Section 3.4, including irrelevant variables can have undesirable effects on the variances of the OLS estimators.
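A small simulation can illustrate both halves of this conclusion. In this sketch the population is hypothetical, with β3 = 0: the estimator of β1 stays centered on the truth whether or not x3 is included, but including the irrelevant (and correlated) x3 spreads the estimates out more:

```python
import numpy as np

# Sketch with a hypothetical population in which beta3 = 0: including x3
# leaves beta1_hat unbiased but inflates its sampling variance.
rng = np.random.default_rng(1)
n, reps = 100, 4000
b1_short = np.empty(reps)
b1_long = np.empty(reps)

for r in range(reps):
    x1 = rng.normal(size=n)
    x3 = 0.8 * x1 + rng.normal(size=n)        # irrelevant, correlated with x1
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # beta3 = 0 in the population

    X_short = np.column_stack([np.ones(n), x1])
    X_long = np.column_stack([np.ones(n), x1, x3])
    b1_short[r] = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
    b1_long[r] = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

print(b1_short.mean(), b1_long.mean())  # both close to 2.0: no bias
print(b1_short.var() < b1_long.var())   # True: overspecifying costs precision
```

The variance cost of overspecification previewed here is exactly the point taken up in Section 3.4.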

Omitted Variable Bias: The Simple Case
Now suppose that, rather than including an irrelevant variable, we omit a variable that actually belongs in the true (or population) model. This is often called the problem of excluding a relevant variable or underspecifying the model. We claimed in Chapter 2 and earlier in this chapter that this problem generally causes the OLS estimators to be biased. It is time to show this explicitly and, just as importantly, to derive the direction and size of the bias.


Deriving the bias caused by omitting an important variable is an example of misspecification analysis. We begin with the case where the true population model has two explanatory variables and an error term:

y = β0 + β1x1 + β2x2 + u,   (3.40)

and we assume that this model satisfies Assumptions MLR.1 through MLR.4. Suppose that our primary interest is in β1, the partial effect of x1 on y. For example, y is hourly wage (or log of hourly wage), x1 is education, and x2 is a measure of innate ability. In order to get an unbiased estimator of β1, we should run a regression of y on x1 and x2 (which gives unbiased estimators of β0, β1, and β2). However, due to our ignorance or data unavailability, we estimate the model by excluding x2. In other words, we perform a simple regression of y on x1 only, obtaining the equation

ỹ = β̃0 + β̃1x1.   (3.41)

We use the symbol "~" rather than "^" to emphasize that β̃1 comes from an underspecified model. When first learning about the omitted variables problem, it can be difficult for the student to distinguish between the underlying true model, (3.40) in this case, and the model that we actually estimate, which is captured by the regression in (3.41). It may seem silly to omit the variable x2 if it belongs in the model, but often we have no choice. For example, suppose that wage is determined by

wage = β0 + β1educ + β2abil + u.   (3.42)

Since ability is not observed, we instead estimate the model

wage = β0 + β1educ + v,

where v = β2abil + u. The estimator of β1 from the simple regression of wage on educ is what we are calling β̃1. We derive the expected value of β̃1 conditional on the sample values of x1 and x2. Deriving this expectation is not difficult because β̃1 is just the OLS slope estimator from a simple regression, and we have already studied this estimator extensively in Chapter 2. The difference here is that we must analyze its properties when the simple regression model is misspecified due to an omitted variable. From equation (2.49), we can express β̃1 as

β̃1 = ∑i (xi1 − x̄1)yi / ∑i (xi1 − x̄1)²,   (3.43)

where ∑i denotes summation over i = 1, …, n.

The next step is the most important one. Since (3.40) is the true model, we write y for each observation i as

yi = β0 + β1xi1 + β2xi2 + ui   (3.44)

(not yi = β0 + β1xi1 + ui, because the true model contains x2). Let SST1 be the denominator in (3.43). If we plug (3.44) in for yi in (3.43), the numerator in (3.43) becomes

∑i (xi1 − x̄1)(β0 + β1xi1 + β2xi2 + ui)
  = β1 ∑i (xi1 − x̄1)² + β2 ∑i (xi1 − x̄1)xi2 + ∑i (xi1 − x̄1)ui
  = β1SST1 + β2 ∑i (xi1 − x̄1)xi2 + ∑i (xi1 − x̄1)ui.   (3.45)

If we divide (3.45) by SST1, take the expectation conditional on the values of the independent variables, and use E(ui) = 0, we obtain

E(β̃1) = β1 + β2 [∑i (xi1 − x̄1)xi2] / [∑i (xi1 − x̄1)²].   (3.46)
Thus, E(β̃1) does not generally equal β1: β̃1 is biased for β1. The ratio multiplying β2 in (3.46) has a simple interpretation: it is just the slope coefficient from the regression of x2 on x1, using our sample on the independent variables, which we can write as

x̃2 = δ̃0 + δ̃1x1.   (3.47)

Because we are conditioning on the sample values of both independent variables, δ̃1 is not random here. Therefore, we can write (3.46) as

E(β̃1) = β1 + β2δ̃1,   (3.48)

which implies that the bias in β̃1 is E(β̃1) − β1 = β2δ̃1. This is often called the omitted variable bias. From equation (3.48), we see that there are two cases where β̃1 is unbiased. The first is pretty obvious: if β2 = 0, so that x2 does not appear in the true model (3.40), then β̃1 is unbiased. We already know this from the simple regression analysis in Chapter 2. The second case is more interesting. If δ̃1 = 0, then β̃1 is unbiased for β1, even if β2 ≠ 0. Since δ̃1 is the sample covariance between x1 and x2 over the sample variance of x1, δ̃1 = 0 if, and only if, x1 and x2 are uncorrelated in the sample. Thus, we have the important conclusion that, if x1 and x2 are uncorrelated in the sample, then β̃1 is unbiased. This is not surprising: in Section 3.2, we showed that the simple regression estimator β̃1 and the multiple regression estimator β̂1 are the same when x1 and x2 are uncorrelated in the sample. [We can also show that β̃1 is unbiased without conditioning on the xi2 if


Table 3.2: Summary of Bias in β̃1 When x2 Is Omitted in Estimating Equation (3.40)

          Corr(x1, x2) > 0    Corr(x1, x2) < 0
β2 > 0    positive bias       negative bias
β2 < 0    negative bias       positive bias
E(x2 | x1) = E(x2); then, for estimating β1, leaving x2 in the error term does not violate the zero conditional mean assumption for the error, once we adjust the intercept.] When x1 and x2 are correlated, δ̃1 has the same sign as the correlation between x1 and x2: δ̃1 > 0 if x1 and x2 are positively correlated, and δ̃1 < 0 if x1 and x2 are negatively correlated. The sign of the bias in β̃1 depends on the signs of both β2 and δ̃1 and is summarized in Table 3.2 for the four possible cases when there is bias. Table 3.2 warrants careful study. For example, the bias in β̃1 is positive if β2 > 0 (x2 has a positive effect on y) and x1 and x2 are positively correlated. The bias is negative if β2 > 0 and x1 and x2 are negatively correlated. And so on. Table 3.2 summarizes the direction of the bias, but the size of the bias is also very important. A small bias of either sign need not be a cause for concern. For example, if the return to education in the population is 8.6 percent and the bias in the OLS estimator is 0.1 percent (a tenth of one percentage point), then we would not be very concerned. On the other hand, a bias on the order of three percentage points would be much more serious. The size of the bias is determined by the sizes of β2 and δ̃1. In practice, since β2 is an unknown population parameter, we cannot be certain whether β2 is positive or negative. Nevertheless, we usually have a pretty good idea about the direction of the partial effect of x2 on y. Further, even though the sign of the correlation between x1 and x2 cannot be known if x2 is not observed, in many cases we can make an educated guess about whether x1 and x2 are positively or negatively correlated. In the wage equation (3.42), by definition more ability leads to higher productivity and therefore higher wages: β2 > 0. Also, there are reasons to believe that educ and abil are positively correlated: on average, individuals with more innate ability choose higher levels of education. Thus, the OLS estimates from the simple regression equation wage = β0 + β1educ + v are on average too large. This does not mean that the estimate obtained from our sample is too big. We can only say that if we collect many random samples and obtain the simple regression estimates each time, then the average of these estimates will be greater than β1.
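Equation (3.48) can be checked by simulation. In this sketch the population values β1 = 0.6, β2 = 1.5, and δ1 = Cov(x1, x2)/Var(x1) = 0.5 are hypothetical choices, so the short regression should center on 0.6 + 1.5(0.5) = 1.35 rather than on β1:

```python
import numpy as np

# Simulation sketch of the omitted variable bias formula (3.48) under
# hypothetical population values: the simple-regression slope that omits x2
# should average beta1 + beta2*delta1 = 0.6 + 1.5*0.5 = 1.35 across samples.
rng = np.random.default_rng(7)
beta1, beta2 = 0.6, 1.5
n, reps = 500, 4000

slopes = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)  # population delta1 = 0.5
    y = 2.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    # underspecified simple regression of y on x1 only
    slopes[r] = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

print(slopes.mean())  # close to 1.35, not 0.6: upward omitted variable bias
```

Here β2 > 0 and Corr(x1, x2) > 0, so Table 3.2 predicts a positive bias, which is what the simulation exhibits.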
EXAMPLE 3.6 (Hourly Wage Equation)

Suppose the model log(wage) = β0 + β1educ + β2abil + u satisfies Assumptions MLR.1 through MLR.4. The data set in WAGE1.RAW does not contain data on ability, so we estimate β1 from the simple regression


log(wage)^ = .584 + .083 educ,   n = 526, R² = .186.

This is only the result from a single sample, so we cannot say that .083 is greater than β1; the true return to education could be lower or higher than 8.3 percent (and we will never know for sure). Nevertheless, we know that the average of the estimates across all random samples would be too large.

As a second example, suppose that, at the elementary school level, the average score for students on a standardized exam is determined by

avgscore = β0 + β1expend + β2povrate + u,

where expend is expenditure per student and povrate is the poverty rate of the children in the school. Using school district data, we only have observations on the percentage of students with a passing grade and per-student expenditures; we do not have information on poverty rates. Thus, we estimate β1 from the simple regression of avgscore on expend. We can again obtain the likely bias in β̃1. First, β2 is probably negative: there is ample evidence that children living in poverty score lower, on average, on standardized tests. Second, the average expenditure per student is probably negatively correlated with the poverty rate: the higher the poverty rate, the lower the average per-student spending, so that Corr(x1, x2) < 0. From Table 3.2, β̃1 will have a positive bias. This observation has important implications. It could be that the true effect of spending is zero; that is, β1 = 0. However, the simple regression estimate of β1 will usually be greater than zero, and this could lead us to conclude that expenditures are important when they are not. When reading and performing empirical work in economics, it is important to master the terminology associated with biased estimators. In the context of omitting a variable from model (3.40), if E(β̃1) > β1, then we say that β̃1 has an upward bias. When E(β̃1) < β1, β̃1 has a downward bias. These definitions are the same whether β1 is positive or negative. The phrase biased towards zero refers to cases where E(β̃1) is closer to zero than β1. Therefore, if β1 is positive, then β̃1 is biased towards zero if it has a downward bias. On the other hand, if β1 < 0, then β̃1 is biased towards zero if it has an upward bias.

Omitted Variable Bias: More General Cases
Deriving the sign of omitted variable bias when there are multiple regressors in the estimated model is more difficult. We must remember that correlation between a single explanatory variable and the error generally results in all OLS estimators being biased. For example, suppose the population model

y = β0 + β1x1 + β2x2 + β3x3 + u   (3.49)

satisfies Assumptions MLR.1 through MLR.4. But we omit x3 and estimate the model as


ỹ = β̃0 + β̃1x1 + β̃2x2.   (3.50)

Now, suppose that x2 and x3 are uncorrelated, but that x1 is correlated with x3. In other words, x1 is correlated with the omitted variable, but x2 is not. It is tempting to think that, while β̃1 is probably biased based on the derivation in the previous subsection, β̃2 is unbiased because x2 is uncorrelated with x3. Unfortunately, this is not generally the case: both β̃1 and β̃2 will normally be biased. The only exception to this is when x1 and x2 are also uncorrelated. Even in the fairly simple model above, it is difficult to obtain the direction of the bias in β̃1 and β̃2. This is because x1, x2, and x3 can all be pairwise correlated. Nevertheless, an approximation is often practically useful. If we assume that x1 and x2 are uncorrelated, then we can study the bias in β̃1 as if x2 were absent from both the population and the estimated models. In fact, when x1 and x2 are uncorrelated, it can be shown that

E(β̃1) = β1 + β3 [∑i (xi1 − x̄1)xi3] / [∑i (xi1 − x̄1)²].

This is just like equation (3.46), but β3 replaces β2 and x3 replaces x2. Therefore, the bias in β̃1 is obtained by replacing β2 with β3 and x2 with x3 in Table 3.2. If β3 > 0 and Corr(x1, x3) > 0, the bias in β̃1 is positive. And so on. As an example, suppose we add exper to the wage model:

wage = β0 + β1educ + β2exper + β3abil + u.
If abil is omitted from the model, the estimators of both β1 and β2 are biased, even if we assume exper is uncorrelated with abil. We are mostly interested in the return to education, so it would be nice if we could conclude that β̃1 has an upward or downward bias due to omitted ability. This conclusion is not possible without further assumptions. As an approximation, let us suppose that, in addition to exper and abil being uncorrelated, educ and exper are also uncorrelated. (In reality, they are somewhat negatively correlated.) Since β3 > 0 and educ and abil are positively correlated, β̃1 would have an upward bias, just as if exper were not in the model. The reasoning used in the previous example is often followed as a rough guide for obtaining the likely bias in estimators in more complicated models. Usually, the focus is on the relationship between a particular explanatory variable, say x1, and the key omitted factor. Strictly speaking, ignoring all other explanatory variables is a valid practice only when each one is uncorrelated with x1, but it is still a useful guide.
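The general-case claim can also be illustrated by simulation. In this sketch x2 and x3 are built to be exactly uncorrelated in the population while both are correlated with x1; all parameter values are hypothetical. Omitting x3 still biases both short-regression slopes:

```python
import numpy as np

# Sketch: omit x3 from y = b1*x1 + b2*x2 + b3*x3 + u. By construction,
# Cov(x2, x3) = 0, but x1 is correlated with both x2 and x3, so BOTH
# short-regression slopes end up biased. All values are hypothetical.
rng = np.random.default_rng(3)
n, reps = 500, 4000
b1, b2, b3 = 1.0, 1.0, 2.0

est = np.empty((reps, 2))
for r in range(reps):
    z1, z2, z4 = rng.normal(size=(3, n))
    x2 = z1 + z2
    x3 = z1 - z2                    # Cov(x2, x3) = 0 by construction
    x1 = z1 + z4                    # x1 correlated with both x2 and x3
    y = b1 * x1 + b2 * x2 + b3 * x3 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])   # x3 omitted
    est[r] = np.linalg.lstsq(X, y, rcond=None)[0][1:]

# Under these population values the short-regression slopes center on
# roughly (2.33, 0.33) instead of (1.0, 1.0): both are biased.
print(est.mean(axis=0))
```

The bias in β̃2 arises even though x2 is uncorrelated with the omitted x3, because x2 is correlated with x1, which in turn carries the omitted variable's influence.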

3.4 THE VARIANCE OF THE OLS ESTIMATORS
We now obtain the variance of the OLS estimators so that, in addition to knowing the central tendencies of the β̂j, we also have a measure of the spread in their sampling distributions. Before finding the variances, we add a homoskedasticity assumption, as in Chapter 2. We do this for two reasons. First, the formulas are simplified by imposing the constant error variance assumption. Second, in Section 3.5, we will see that OLS has an important efficiency property if we add the homoskedasticity assumption. In the multiple regression framework, homoskedasticity is stated as follows:
ASSUMPTION MLR.5 (HOMOSKEDASTICITY)

Var(u | x1, …, xk) = σ².

Assumption MLR.5 means that the variance in the error term, u, conditional on the explanatory variables, is the same for all combinations of outcomes of the explanatory variables. If this assumption fails, then the model exhibits heteroskedasticity, just as in the two-variable case. In the equation

wage = β0 + β1educ + β2exper + β3tenure + u,

homoskedasticity requires that the variance of the unobserved error u does not depend on the levels of education, experience, or tenure. That is,

Var(u | educ, exper, tenure) = σ².

If this variance changes with any of the three explanatory variables, then heteroskedasticity is present. Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov assumptions (for cross-sectional regression). So far, our statements of the assumptions are suitable only when applied to cross-sectional analysis with random sampling. As we will see, the Gauss-Markov assumptions for time series analysis, and for other situations such as panel data analysis, are more difficult to state, although there are many similarities. In the discussion that follows, we will use the symbol x to denote the set of all independent variables, (x1, …, xk). Thus, in the wage regression with educ, exper, and tenure as independent variables, x = (educ, exper, tenure). Now we can write Assumption MLR.3 as

E(y | x) = β0 + β1x1 + β2x2 + … + βkxk,

and Assumption MLR.5 is the same as Var(y | x) = σ². Stating the two assumptions in this way clearly illustrates how Assumption MLR.5 differs greatly from Assumption MLR.3. Assumption MLR.3 says that the expected value of y, given x, is linear in the parameters, but it certainly depends on x1, x2, …, xk. Assumption MLR.5 says that the variance of y, given x, does not depend on the values of the independent variables. We can now obtain the variances of the β̂j, where we again condition on the sample values of the independent variables. The proof is in the appendix to this chapter.

THEOREM 3.2 (SAMPLING VARIANCES OF THE OLS SLOPE ESTIMATORS)

Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent variables,


Var(β̂j) = σ² / [SSTj(1 − Rj²)],   (3.51)

for j = 1, 2, …, k, where SSTj = ∑i (xij − x̄j)² is the total sample variation in xj, and Rj² is the R-squared from regressing xj on all other independent variables (and including an intercept).

Before we study equation (3.51) in more detail, it is important to know that all of the Gauss-Markov assumptions are used in obtaining this formula. While we did not need the homoskedasticity assumption to conclude that OLS is unbiased, we do need it to validate equation (3.51). The size of Var(β̂j) is practically important. A larger variance means a less precise estimator, and this translates into larger confidence intervals and less accurate hypothesis tests (as we will see in Chapter 4). In the next subsection, we discuss the elements comprising (3.51).
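Equation (3.51) can be verified numerically in a single simulated sample: the scalar formula σ²/[SSTj(1 − Rj²)] reproduces the corresponding diagonal entry of σ²(X′X)⁻¹, the matrix form of the OLS conditional variance. The data and σ² below are hypothetical:

```python
import numpy as np

# Numerical check of equation (3.51) in one simulated sample: the scalar
# formula sigma^2 / [SST1 * (1 - R_1^2)] equals the [1, 1] diagonal entry of
# sigma^2 * (X'X)^{-1}, the variance of beta1_hat. Data and sigma^2 are
# hypothetical.
rng = np.random.default_rng(5)
n, sigma2 = 60, 4.0
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Matrix form: Var(beta_hat) = sigma^2 * (X'X)^{-1}
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

# Equation (3.51): regress x1 on the other regressors (here just x2) for R_1^2
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
sst1 = np.sum((x1 - x1.mean()) ** 2)
r1_sq = 1.0 - np.sum(resid ** 2) / sst1
var_formula = sigma2 / (sst1 * (1.0 - r1_sq))

print(np.isclose(var_matrix, var_formula))  # True
```

The agreement is exact (up to floating-point error), since SSTj(1 − Rj²) is precisely the residual sum of squares from regressing xj on the other regressors.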

The Components of the OLS Variances: Multicollinearity
Equation (3.51) shows that the variance of β̂_j depends on three factors: σ², SST_j, and R_j². Remember that the index j simply denotes any one of the independent variables (such as education or poverty rate). We now consider each of these factors in turn.

THE ERROR VARIANCE, σ². From equation (3.51), a larger σ² means larger variances for the OLS estimators. This is not at all surprising: more "noise" in the equation (a larger σ²) makes it more difficult to estimate the partial effect of any of the independent variables on y, and this is reflected in higher variances for the OLS slope estimators. Because σ² is a feature of the population, it has nothing to do with the sample size. It is the one component of (3.51) that is unknown; we will see later how to obtain an unbiased estimator of σ². For a given dependent variable y, there is really only one way to reduce the error variance: add more explanatory variables to the equation (take some factors out of the error term). This is not always possible, nor is it always desirable, for reasons discussed later in the chapter.

THE TOTAL SAMPLE VARIATION IN x_j, SST_j. From equation (3.51), the larger the total variation in x_j, the smaller is Var(β̂_j). Thus, everything else being equal, for estimating β_j we prefer as much sample variation in x_j as possible. We already discovered this in the simple regression case in Chapter 2. While it is rarely possible for us to choose the sample values of the independent variables, there is a way to increase the sample variation in each of them: increase the sample size. In fact, when sampling randomly from a population, SST_j increases without bound as the sample size gets larger and larger. This is the component of the variance that systematically depends on the sample size.
Chapter 3

Multiple Regression Analysis: Estimation

When SST_j is small, Var(β̂_j) can get very large, but a small SST_j is not a violation of Assumption MLR.4. Technically, as SST_j goes to zero, Var(β̂_j) approaches infinity. The extreme case of no sample variation in x_j, SST_j = 0, is not allowed by Assumption MLR.4.
THE LINEAR RELATIONSHIPS AMONG THE INDEPENDENT VARIABLES, R_j². The term R_j² in equation (3.51) is the most difficult of the three components to understand. This term does not appear in simple regression analysis because there is only one independent variable in such cases. It is important to see that this R-squared is distinct from the R-squared in the regression of y on x₁, x₂, …, x_k: R_j² is obtained from a regression involving only the independent variables in the original model, where x_j plays the role of a dependent variable.

Consider first the k = 2 case: y = β₀ + β₁x₁ + β₂x₂ + u. Then Var(β̂₁) = σ²/[SST₁(1 − R₁²)], where R₁² is the R-squared from the simple regression of x₁ on x₂ (and an intercept, as always). Since the R-squared measures goodness-of-fit, a value of R₁² close to one indicates that x₂ explains much of the variation in x₁ in the sample. This means that x₁ and x₂ are highly correlated. As R₁² increases to one, Var(β̂₁) gets larger and larger. Thus, a high degree of linear relationship between x₁ and x₂ can lead to large variances for the OLS slope estimators. (A similar argument applies to β̂₂.) See Figure 3.1 for the relationship between Var(β̂₁) and the R-squared from the regression of x₁ on x₂.

In the general case, R_j² is the proportion of the total variation in x_j that can be explained by the other independent variables appearing in the equation. For a given σ² and SST_j, the smallest Var(β̂_j) is obtained when R_j² = 0, which happens if, and only if, x_j has zero sample correlation with every other independent variable. This is the best case for estimating β_j, but it is rarely encountered.

The other extreme case, R_j² = 1, is ruled out by Assumption MLR.4, because R_j² = 1 means that, in the sample, x_j is a perfect linear combination of some of the other independent variables in the regression. A more relevant case is when R_j² is "close" to one. From equation (3.51) and Figure 3.1, we see that this can cause Var(β̂_j) to be large: Var(β̂_j) → ∞ as R_j² → 1.
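How quickly Var(β̂₁) grows as R₁² approaches one is easy to see in a short Monte Carlo sketch. This example is ours, not the text's; the sample size, coefficient values, and use of numpy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 500, 1.0

def slope_variance(rho, reps=500):
    """Monte Carlo variance of the OLS slope on x1 when corr(x1, x2) = rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.standard_normal(n)
        # Build x2 with population correlation rho with x1.
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
        y = 1.0 + 0.5 * x1 + 0.5 * x2 + sigma * rng.standard_normal(n)
        X = np.column_stack([np.ones(n), x1, x2])
        # OLS slope on x1 (coefficient index 1, after the intercept).
        estimates.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    return float(np.var(estimates))

for rho in (0.0, 0.5, 0.9, 0.99):
    print(f"rho = {rho}: simulated Var(beta1_hat) = {slope_variance(rho):.5f}")
```

Since R₁² ≈ rho² in this design, the rho = .99 case has 1 − R₁² ≈ .02, and the simulated variance is roughly fifty times the uncorrelated case, matching the 1/(1 − R₁²) factor in (3.51).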
High (but not perfect) correlation between two or more of the independent variables is called multicollinearity. Before we discuss the multicollinearity issue further, it is important to be very clear on one thing: a case where R_j² is close to one is not a violation of Assumption MLR.4.

Since multicollinearity violates none of our assumptions, the "problem" of multicollinearity is not really well defined. When we say that multicollinearity arises for estimating β_j when R_j² is "close" to one, we put "close" in quotation marks because there is no absolute number that we can cite to conclude that multicollinearity is a problem. For example, R_j² = .9 means that 90 percent of the sample variation in x_j can be explained by the other independent variables in the regression model. Unquestionably, this means that x_j has a strong linear relationship to the other independent variables. But whether this translates into a Var(β̂_j) that is too large to be useful depends on the sizes of σ² and SST_j. As we will see in Chapter 4, for statistical inference, what ultimately matters is how big β̂_j is in relation to its standard deviation.

Just as a large value of R_j² can cause a large Var(β̂_j), so can a small value of SST_j. Therefore, a small sample size can lead to large sampling variances, too.

[Figure 3.1: Var(β̂₁) as a function of R₁²; the curve increases without bound as R₁² approaches 1.]

Worrying about high degrees of correlation among the independent variables in the sample is really no different from worrying about a small sample size: both work to increase Var(β̂_j). The famous University of Wisconsin econometrician Arthur Goldberger, reacting to econometricians' obsession with multicollinearity, has (tongue in cheek) coined the term micronumerosity, which he defines as the "problem of small sample size." [For an engaging discussion of multicollinearity and micronumerosity, see Goldberger (1991).]

Although the problem of multicollinearity cannot be clearly defined, one thing is clear: everything else being equal, for estimating β_j it is better to have less correlation between x_j and the other independent variables. This observation often leads to a discussion of how to "solve" the multicollinearity problem. In the social sciences, where we are usually passive collectors of data, there is no good way to reduce the variances of unbiased estimators other than to collect more data. For a given data set, we can try dropping other independent variables from the model in an effort to reduce multicollinearity. Unfortunately, dropping a variable that belongs in the population model can lead to bias, as we saw in Section 3.3.

Perhaps an example at this point will help clarify some of the issues raised concerning multicollinearity. Suppose we are interested in estimating the effect of various
school expenditure categories on student performance. It is likely that expenditures on teacher salaries, instructional materials, athletics, and so on, are highly correlated: wealthier schools tend to spend more on everything, and poorer schools spend less on everything. Not surprisingly, it can be difficult to estimate the effect of any particular expenditure category on student performance when there is little variation in one category that cannot largely be explained by variations in the other expenditure categories (this leads to a high R_j² for each of the expenditure variables). Such multicollinearity problems can be mitigated by collecting more data, but in a sense we have imposed the problem on ourselves: we are asking questions that may be too subtle for the available data to answer with any precision. We can probably do much better by changing the scope of the analysis and lumping all expenditure categories together, since we would no longer be trying to estimate the partial effect of each separate category.

Another important point is that a high degree of correlation between certain independent variables can be irrelevant as to how well we can estimate other parameters in the model. For example, consider a model with three independent variables:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + u,

where x₂ and x₃ are highly correlated. Then Var(β̂₂) and Var(β̂₃) may be large. But the amount of correlation between x₂ and x₃ has no direct effect on Var(β̂₁). In fact, if x₁ is uncorrelated with x₂ and x₃, then R₁² = 0 and Var(β̂₁) = σ²/SST₁, regardless of how much correlation there is between x₂ and x₃. If β₁ is the parameter of interest, we do not really care about the amount of correlation between x₂ and x₃.

The previous observation is important because economists often include many controls in order to isolate the causal effect of a particular variable. For example, in looking at the relationship between loan approval rates and percent of minorities in a neighborhood, we might include variables like average income, average housing value, measures of creditworthiness, and so on, because these factors need to be accounted for in order to draw causal conclusions about discrimination. Income, housing prices, and creditworthiness are generally highly correlated with each other. But high correlations among these variables do not make it more difficult to determine the effects of discrimination.

Q U E S T I O N 3 . 4
Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the dependent variable is final exam score, and the key explanatory variable is number of classes attended. To control for student abilities and efforts outside the classroom, you include among the explanatory variables cumulative GPA, SAT score, and measures of high school performance. Someone says, "You cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school performance are likely to be highly collinear." What should be your response?
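A quick simulation illustrates the point (our own sketch; the variable names and values are illustrative): when x₁ is uncorrelated with the highly correlated pair (x₂, x₃), R₁² stays near zero, so formula (3.51) gives essentially σ²/SST₁ no matter how strong the x₂–x₃ correlation is.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

def var_beta1_formula(rho23):
    """Var(beta1_hat) = sigma^2 / [SST_1 (1 - R_1^2)] for one simulated sample
    in which x1 is independent of the correlated pair (x2, x3)."""
    x1 = rng.standard_normal(n)
    x2 = rng.standard_normal(n)
    # x3 has population correlation rho23 with x2; both are independent of x1.
    x3 = rho23 * x2 + np.sqrt(1.0 - rho23**2) * rng.standard_normal(n)
    sst1 = np.sum((x1 - x1.mean())**2)
    # R_1^2: regress x1 on an intercept, x2, and x3.
    Z = np.column_stack([np.ones(n), x2, x3])
    fitted = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
    r1sq = 1.0 - np.sum((x1 - fitted)**2) / sst1
    sigma2 = 1.0  # assumed error variance for the illustration
    return sigma2 / (sst1 * (1.0 - r1sq))

print(var_beta1_formula(0.0), var_beta1_formula(0.99))
```

The two printed values are essentially the same size (both near σ²/SST₁ ≈ 1/n), even though the second sample has x₂ and x₃ almost perfectly collinear.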

Variances in Misspecified Models
The choice of whether or not to include a particular variable in a regression model can be made by analyzing the tradeoff between bias and variance. In Section 3.3, we derived the bias induced by leaving out a relevant variable when the true model contains two explanatory variables. We continue the analysis of this model by comparing the variances of the OLS estimators. Write the true population model, which satisfies the Gauss-Markov assumptions, as

y = β₀ + β₁x₁ + β₂x₂ + u.

We consider two estimators of β₁. The estimator β̂₁ comes from the multiple regression

ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂.   (3.52)

In other words, we include x₂, along with x₁, in the regression model. The estimator β̃₁ is obtained by omitting x₂ from the model and running a simple regression of y on x₁:

ỹ = β̃₀ + β̃₁x₁.   (3.53)

When β₂ ≠ 0, equation (3.53) excludes a relevant variable from the model and, as we saw in Section 3.3, this induces a bias in β̃₁ unless x₁ and x₂ are uncorrelated. On the other hand, β̂₁ is unbiased for β₁ for any value of β₂, including β₂ = 0. It follows that, if bias is used as the only criterion, β̂₁ is preferred to β̃₁.

The conclusion that β̂₁ is always preferred to β̃₁ does not carry over when we bring variance into the picture. Conditioning on the values of x₁ and x₂ in the sample, we have, from (3.51),

Var(β̂₁) = σ²/[SST₁(1 − R₁²)],   (3.54)

where SST₁ is the total variation in x₁, and R₁² is the R-squared from the regression of x₁ on x₂. Further, a simple modification of the proof in Chapter 2 for two-variable regression shows that

Var(β̃₁) = σ²/SST₁.   (3.55)

Comparing (3.55) to (3.54) shows that Var(β̃₁) is always smaller than Var(β̂₁), unless x₁ and x₂ are uncorrelated in the sample, in which case the two estimators β̃₁ and β̂₁ are the same. Assuming that x₁ and x₂ are not uncorrelated, we can draw the following conclusions:

1. When β₂ ≠ 0, β̃₁ is biased, β̂₁ is unbiased, and Var(β̃₁) < Var(β̂₁).
2. When β₂ = 0, β̃₁ and β̂₁ are both unbiased, and Var(β̃₁) < Var(β̂₁).

From the second conclusion, it is clear that β̃₁ is preferred if β₂ = 0. Intuitively, if x₂ does not have a partial effect on y, then including it in the model can only exacerbate the multicollinearity problem, which leads to a less efficient estimator of β₁. A higher variance for the estimator of β₁ is the cost of including an irrelevant variable in a model.

The case where β₂ ≠ 0 is more difficult. Leaving x₂ out of the model results in a biased estimator of β₁. Traditionally, econometricians have suggested comparing the likely size of the bias due to omitting x₂ with the reduction in the variance, summarized in the size of R₁², to decide whether x₂ should be included. However, when β₂ ≠ 0, there are two favorable reasons for including x₂ in the model. The most important of these is that any bias in β̃₁ does not shrink as the sample size grows; in fact, the bias does not necessarily follow any pattern. Therefore, we can usefully think of the bias as being roughly the same for any sample size. On the other hand, Var(β̃₁) and Var(β̂₁) both shrink to zero as n gets large, which means that the multicollinearity induced by adding x₂ becomes less important as the sample size grows. In large samples, we would prefer β̂₁.

The other reason for favoring β̂₁ is more subtle. The variance formula in (3.55) is conditional on the values of x_{i1} and x_{i2} in the sample, which provides the best-case scenario for β̃₁. When β₂ ≠ 0, the variance of β̃₁ conditional only on x₁ is larger than that presented in (3.55). Intuitively, when β₂ ≠ 0 and x₂ is excluded from the model, the error variance increases because the error effectively contains part of x₂. But formula (3.55) ignores the error variance increase because it treats both regressors as nonrandom. A full discussion of which independent variables to condition on would lead us too far astray. It is sufficient to say that (3.55) is too generous when it comes to measuring the precision in β̃₁.
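The two numbered conclusions above can be checked by simulation. The following sketch is ours (the coefficient values and correlation structure are illustrative): it draws many samples, estimates β₁ by both the short regression (omitting x₂) and the long regression (including it), and compares the Monte Carlo means and variances.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 3000
beta1, beta2 = 1.0, 0.5  # beta2 != 0, so x2 is a relevant variable

short, long_ = [], []
for _ in range(reps):
    x1 = rng.standard_normal(n)
    x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)   # x1 and x2 correlated
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)
    # Long regression: y on (1, x1, x2) -> beta1_hat (unbiased, larger variance).
    X = np.column_stack([np.ones(n), x1, x2])
    long_.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    # Short regression: y on (1, x1) -> beta1_tilde (biased, smaller variance).
    W = np.column_stack([np.ones(n), x1])
    short.append(np.linalg.lstsq(W, y, rcond=None)[0][1])

short, long_ = np.array(short), np.array(long_)
print("beta1_tilde: mean", short.mean(), "variance", short.var())
print("beta1_hat:   mean", long_.mean(), "variance", long_.var())
```

The short estimator centers well away from β₁ = 1 (its bias is β₂ times the slope of x₂ on x₁), while the long estimator centers on 1 but with a visibly larger variance, exactly the tradeoff described in the text.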

Estimating σ²: Standard Errors of the OLS Estimators

We now show how to choose an unbiased estimator of σ², which then allows us to obtain unbiased estimators of Var(β̂_j).

Because σ² = E(u²), an unbiased "estimator" of σ² is the sample average of the squared errors: n⁻¹ Σ_{i=1}^{n} u_i². Unfortunately, this is not a true estimator because we do not observe the u_i. Nevertheless, recall that the errors can be written as u_i = y_i − β₀ − β₁x_{i1} − β₂x_{i2} − … − β_k x_{ik}, and so the reason we do not observe the u_i is that we do not know the β_j. When we replace each β_j with its OLS estimator, we get the OLS residuals:

û_i = y_i − β̂₀ − β̂₁x_{i1} − β̂₂x_{i2} − … − β̂_k x_{ik}.

It seems natural to estimate σ² by replacing u_i with û_i. In the simple regression case, we saw that this leads to a biased estimator. The unbiased estimator of σ² in the general multiple regression case is

σ̂² = (Σ_{i=1}^{n} û_i²) / (n − k − 1) = SSR/(n − k − 1).   (3.56)

We already encountered this estimator in the k = 1 case in simple regression.

The term n − k − 1 in (3.56) is the degrees of freedom (df) for the general OLS problem with n observations and k independent variables. Since there are k + 1 parameters in a regression model with k independent variables and an intercept, we can write

df = n − (k + 1) = (number of observations) − (number of estimated parameters).   (3.57)

This is the easiest way to compute the degrees of freedom in a particular application: count the number of parameters, including the intercept, and subtract this amount from the number of observations. (In the rare case that an intercept is not estimated, the number of parameters decreases by one.)

Technically, the division by n − k − 1 in (3.56) comes from the fact that the expected value of the sum of squared residuals is E(SSR) = (n − k − 1)σ². Intuitively, we can figure out why the degrees of freedom adjustment is necessary by returning to the first order conditions for the OLS estimators. These can be written as

Σ_{i=1}^{n} û_i = 0  and  Σ_{i=1}^{n} x_{ij} û_i = 0, where j = 1, 2, …, k.

Thus, in obtaining the OLS estimates, k + 1 restrictions are imposed on the OLS residuals. This means that, given n − (k + 1) of the residuals, the remaining k + 1 residuals are known: there are only n − (k + 1) degrees of freedom in the residuals. (This can be contrasted with the errors u_i, which have n degrees of freedom in the sample.)

For reference, we summarize this discussion with Theorem 3.3. We proved this theorem for the case of simple regression analysis in Chapter 2 (see Theorem 2.3). (A general proof that requires matrix algebra is provided in Appendix E.)
T H E O R E M 3 . 3 ( U N B I A S E D E S T I M A T I O N O F σ ² )

Under the Gauss-Markov Assumptions MLR.1 through MLR.5, E(σ̂²) = σ².
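The k + 1 residual restrictions and the degrees-of-freedom count above are easy to verify numerically. The sketch below is ours, with arbitrary illustrative values; it fits a regression with k = 3 regressors and an intercept.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
# Design matrix: intercept column plus k random regressors.
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])
y = X @ np.array([1.0, 0.5, -0.3, 2.0]) + rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
u_hat = y - X @ beta_hat

# The k + 1 first order conditions: every column of X (including the
# constant) is orthogonal to the residuals, so X'u_hat is numerically zero.
print(np.max(np.abs(X.T @ u_hat)))
print("df =", n - k - 1)  # 50 observations, 4 estimated parameters -> 46
```

Only n − (k + 1) residuals are free to vary; the other k + 1 are pinned down by these orthogonality conditions, which is exactly why (3.56) divides by n − k − 1.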

The positive square root of σ̂², denoted σ̂, is called the standard error of the regression, or SER. The SER is an estimator of the standard deviation of the error term. This estimate is usually reported by regression packages, although different packages call it different things. (In addition to SER, σ̂ is also called the standard error of the estimate and the root mean squared error.) Note that σ̂ can either decrease or increase when another independent variable is added to a regression (for a given sample). This is because, while SSR must fall when another explanatory variable is added, the degrees of freedom also falls by one. Because SSR is in the numerator and df is in the denominator, we cannot tell beforehand which effect will dominate.

For constructing confidence intervals and conducting tests in Chapter 4, we need to estimate the standard deviation of β̂_j, which is just the square root of the variance:

sd(β̂_j) = σ/[SST_j(1 − R_j²)]^{1/2}.

Since σ is unknown, we replace it with its estimator, σ̂. This gives us the standard error of β̂_j:

se(β̂_j) = σ̂/[SST_j(1 − R_j²)]^{1/2}.   (3.58)

Just as the OLS estimates can be obtained for any given sample, so can the standard errors. Since se(β̂_j) depends on σ̂, the standard error has a sampling distribution, which will play a role in Chapter 4.

We should emphasize one thing about standard errors. Because (3.58) is obtained directly from the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, it follows that the standard error formula in (3.58) is not a valid estimator of sd(β̂_j) if the errors exhibit heteroskedasticity. Thus, while the presence of heteroskedasticity does not cause bias in the β̂_j, it does lead to bias in the usual formula for Var(β̂_j), which then invalidates the standard errors. This is important because any regression package computes (3.58) as the default standard error for each coefficient (with a somewhat different representation for the intercept). If we suspect heteroskedasticity, then the "usual" OLS standard errors are invalid and some corrective action should be taken. We will see in Chapter 8 what methods are available for dealing with heteroskedasticity.
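Formula (3.58) can be computed directly from a sample. The sketch below is ours (illustrative data and names); it also cross-checks the result against the equivalent matrix expression σ̂²(X'X)⁻¹, whose diagonal entries equal σ̂²/[SST_j(1 − R_j²)] by the partialling-out argument behind (3.51).

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 200, 2
x1 = rng.standard_normal(n)
x2 = 0.5 * x1 + rng.standard_normal(n)
y = 1.0 + 0.8 * x1 - 0.4 * x2 + rng.standard_normal(n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
ssr = np.sum((y - X @ beta_hat)**2)
sigma2_hat = ssr / (n - k - 1)                       # equation (3.56)

# se(beta1_hat) via equation (3.58): sigma_hat / sqrt(SST_1 (1 - R_1^2)).
sst1 = np.sum((x1 - x1.mean())**2)
Z = np.column_stack([np.ones(n), x2])                # regress x1 on the others
r1sq = 1.0 - np.sum((x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0])**2) / sst1
se_beta1 = np.sqrt(sigma2_hat / (sst1 * (1.0 - r1sq)))

# Cross-check: the (1,1) diagonal entry of sigma2_hat * (X'X)^{-1}.
se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])
print(se_beta1, se_matrix)
```

The two numbers agree, which is a useful sanity check when implementing (3.58) by hand; regression packages report the matrix version as the default (homoskedasticity-based) standard error.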

3.5 EFFICIENCY OF OLS: THE GAUSS-MARKOV THEOREM
In this section, we state and discuss the important Gauss-Markov Theorem, which justifies the use of the OLS method rather than a variety of competing estimators. We know one justification for OLS already: under Assumptions MLR.1 through MLR.4, OLS is unbiased. However, there are many unbiased estimators of the β_j under these assumptions (for example, see Problem 3.12). Might there be other unbiased estimators with variances smaller than the OLS estimators?

If we limit the class of competing estimators appropriately, then we can show that OLS is best within this class. Specifically, we will argue that, under Assumptions MLR.1 through MLR.5, the OLS estimator β̂_j for β_j is the best linear unbiased estimator (BLUE). In order to state the theorem, we need to understand each component of the acronym "BLUE." First, we know what an estimator is: it is a rule that can be applied to any sample of data to produce an estimate. We also know what an unbiased estimator is: in the current context, an estimator β̃_j of β_j is unbiased if E(β̃_j) = β_j for any β₀, β₁, …, β_k. What about the meaning of the term "linear"? In the current context, an estimator β̃_j of β_j is linear if, and only if, it can be expressed as a linear function of the data on the dependent variable:

β̃_j = Σ_{i=1}^{n} w_{ij} y_i,   (3.59)

where each w_{ij} can be a function of the sample values of all the independent variables. The OLS estimators are linear, as can be seen from equation (3.22). Finally, how do we define "best"? For the current theorem, best is defined as smallest variance. Given two unbiased estimators, it is logical to prefer the one with the smallest variance (see Appendix C).

Now, let β̂₀, β̂₁, …, β̂_k denote the OLS estimators in model (3.31) under Assumptions MLR.1 through MLR.5. The Gauss-Markov theorem says that, for any estimator β̃_j that is linear and unbiased, Var(β̂_j) ≤ Var(β̃_j), and the inequality is usually strict. In other words, in the class of linear unbiased estimators, OLS has the smallest variance (under the five Gauss-Markov assumptions). Actually, the theorem says more than this. If we want to estimate any linear function of the β_j, then the corresponding linear combination of the OLS estimators achieves the smallest variance among all linear unbiased estimators. We conclude with a theorem, which is proven in Appendix 3A.
T H E O R E M 3 . 4 ( G A U S S - M A R K O V T H E O R E M )

Under Assumptions MLR.1 through MLR.5, β̂₀, β̂₁, …, β̂_k are the best linear unbiased estimators (BLUEs) of β₀, β₁, …, β_k, respectively.

It is because of this theorem that Assumptions MLR.1 through MLR.5 are known as the Gauss-Markov assumptions (for cross-sectional analysis).
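The "linear" part of BLUE can be made concrete. In matrix form, the OLS estimator is β̂ = (X'X)⁻¹X'y, so each β̂_j is a linear combination Σᵢ w_{ij} y_i whose weights depend only on the independent variables, exactly as (3.59) requires. A small sketch (ours; the explicit matrix inverse is used for clarity, not numerical robustness):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

# Each row j of W holds the weights w_ij: beta_hat_j = sum_i w_ij * y_i.
# W depends only on X, so the OLS estimators are linear in y.
W = np.linalg.inv(X.T @ X) @ X.T
beta_hat = W @ y

# Same numbers as a least-squares solver:
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```

The Gauss-Markov theorem says that among all estimators of this weighted-sum form that are unbiased, the OLS choice of weights yields the smallest variance.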

The importance of the Gauss-Markov theorem is that, when the standard set of assumptions holds, we need not look for alternative unbiased estimators of the form (3.59): none will be better than OLS. Equivalently, if we are presented with an estimator that is both linear and unbiased, then we know that the variance of this estimator is at least as large as the OLS variance; no additional calculation is needed to show this.

For our purposes, Theorem 3.4 justifies the use of OLS to estimate multiple regression models. If any of the Gauss-Markov assumptions fail, then this theorem no longer holds. We already know that failure of the zero conditional mean assumption (Assumption MLR.3) causes OLS to be biased, so Theorem 3.4 also fails. We also know that heteroskedasticity (failure of Assumption MLR.5) does not cause OLS to be biased. However, OLS no longer has the smallest variance among linear unbiased estimators in the presence of heteroskedasticity. In Chapter 8, we analyze an estimator that improves upon OLS when we know the form of the heteroskedasticity.

SUMMARY
1. The multiple regression model allows us to effectively hold other factors fixed while examining the effects of a particular independent variable on the dependent variable. It explicitly allows the independent variables to be correlated.

2. Although the model is linear in its parameters, it can be used to model nonlinear relationships by appropriately choosing the dependent and independent variables.

3. The method of ordinary least squares is easily applied to the multiple regression model. Each slope estimate measures the partial effect of the corresponding independent variable on the dependent variable, holding all other independent variables fixed.

4. R² is the proportion of the sample variation in the dependent variable explained by the independent variables, and it serves as a goodness-of-fit measure. It is important not to put too much weight on the value of R² when evaluating econometric models.

5. Under the first four Gauss-Markov assumptions (MLR.1 through MLR.4), the OLS estimators are unbiased. This implies that including an irrelevant variable in a model has no effect on the unbiasedness of the intercept and other slope estimators. On the other hand, omitting a relevant variable causes OLS to be biased. In many circumstances, the direction of the bias can be determined.

6. Under the five Gauss-Markov assumptions, the variance of an OLS slope estimator is given by Var(β̂_j) = σ²/[SST_j(1 − R_j²)]. As the error variance σ² increases, so does Var(β̂_j), while Var(β̂_j) decreases as the sample variation in x_j, SST_j, increases. The term R_j² measures the amount of collinearity between x_j and the other explanatory variables. As R_j² approaches one, Var(β̂_j) is unbounded.

7. Adding an irrelevant variable to an equation generally increases the variances of the remaining OLS estimators because of multicollinearity.

8. Under the Gauss-Markov assumptions (MLR.1 through MLR.5), the OLS estimators are best linear unbiased estimators (BLUE).

KEY TERMS
Best Linear Unbiased Estimator (BLUE); Biased Towards Zero; Ceteris Paribus; Degrees of Freedom (df); Disturbance; Downward Bias; Endogenous Explanatory Variable; Error Term; Excluding a Relevant Variable; Exogenous Explanatory Variables; Explained Sum of Squares (SSE); First Order Conditions; Gauss-Markov Assumptions; Gauss-Markov Theorem; Inclusion of an Irrelevant Variable; Intercept; Micronumerosity; Misspecification Analysis; Multicollinearity; Multiple Linear Regression Model; Multiple Regression Analysis; Omitted Variable Bias; OLS Intercept Estimate; OLS Regression Line; OLS Slope Estimate; Ordinary Least Squares; Overspecifying the Model; Partial Effect; Perfect Collinearity; Population Model; Residual; Residual Sum of Squares; Sample Regression Function (SRF); Slope Parameters; Standard Deviation of β̂_j; Standard Error of β̂_j; Standard Error of the Regression (SER); Sum of Squared Residuals (SSR); Total Sum of Squares (SST); True Model; Underspecifying the Model; Upward Bias

PROBLEMS
3.1 Using the data in GPA2.RAW on 4,137 college students, the following equation was estimated by OLS:

\widehat{colgpa} = 1.392 − .0135 hsperc + .00148 sat
n = 4,137, R² = .273,

where colgpa is measured on a four-point scale, hsperc is the percentile in the high school graduating class (defined so that, for example, hsperc = 5 means the top five percent of the class), and sat is the combined math and verbal scores on the student achievement test.
(i) Why does it make sense for the coefficient on hsperc to be negative?
(ii) What is the predicted college GPA when hsperc = 20 and sat = 1050?
(iii) Suppose that two high school graduates, A and B, graduated in the same percentile from high school, but Student A's SAT score was 140 points higher (about one standard deviation in the sample). What is the predicted difference in college GPA for these two students? Is the difference large?
(iv) Holding hsperc fixed, what difference in SAT scores leads to a predicted colgpa difference of .50, or one-half of a grade point? Comment on your answer.

3.2 The data in WAGE2.RAW on working men was used to estimate the following equation:

\widehat{educ} = 10.36 − .094 sibs + .131 meduc + .210 feduc
n = 722, R² = .214,

where educ is years of schooling, sibs is number of siblings, meduc is mother's years of schooling, and feduc is father's years of schooling.
(i) Does sibs have the expected effect? Explain. Holding meduc and feduc fixed, by how much does sibs have to increase to reduce predicted years of education by one year? (A noninteger answer is acceptable here.)
(ii) Discuss the interpretation of the coefficient on meduc.
(iii) Suppose that Man A has no siblings, and his mother and father each have 12 years of education. Man B has no siblings, and his mother and father each have 16 years of education. What is the predicted difference in years of education between B and A?

3.3 The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep:

sleep = β₀ + β₁totwrk + β₂educ + β₃age + u,

where sleep and totwrk (total work) are measured in minutes per week and educ and age are measured in years. (See also Problem 2.12.)
(i) If adults trade off sleep for work, what is the sign of β₁?
(ii) What signs do you think β₂ and β₃ will have?
(iii) Using the data in SLEEP75.RAW, the estimated equation is

\widehat{sleep} = 3638.25 − .148 totwrk − 11.13 educ + 2.20 age
n = 706, R² = .113.

If someone works five more hours per week, by how many minutes is sleep predicted to fall? Is this a large tradeoff?
(iv) Discuss the sign and magnitude of the estimated coefficient on educ.
(v) Would you say totwrk, educ, and age explain much of the variation in sleep? What other factors might affect the time spent sleeping? Are these likely to be correlated with totwrk?

3.4 The median starting salary for new law school graduates is determined by

log(salary) = β₀ + β₁LSAT + β₂GPA + β₃log(libvol) + β₄log(cost) + β₅rank + u,

where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law school, and rank is a law school ranking (with rank = 1 being the best).
(i) Explain why we expect β₅ ≤ 0.

(ii) What signs do you expect for the other slope parameters? Justify your answers.
(iii) Using the data in LAWSCH85.RAW, the estimated equation is

\widehat{log(salary)} = 8.34 + .0047 LSAT + .248 GPA + .095 log(libvol) + .038 log(cost) − .0033 rank
n = 136, R² = .842.

What is the predicted ceteris paribus difference in salary for schools with a median GPA different by one point? (Report your answer as a percent.)
(iv) Interpret the coefficient on the variable log(libvol).
(v) Would you say it is better to attend a higher ranked law school? How much is a difference in ranking of 20 worth in terms of predicted starting salary?

3.5 In a study relating college grade point average to time spent in various activities, you distribute a survey to several students. The students are asked how many hours they spend each week in four activities: studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that for each student the sum of hours in the four activities must be 168.
(i) In the model

GPA = β₀ + β₁study + β₂sleep + β₃work + β₄leisure + u,

does it make sense to hold sleep, work, and leisure fixed, while changing study?
(ii) Explain why this model violates Assumption MLR.4.
(iii) How could you reformulate the model so that its parameters have a useful interpretation and it satisfies Assumption MLR.4?

3.6 Consider the multiple regression model containing three independent variables, under Assumptions MLR.1 through MLR.4:

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + u.

You are interested in estimating the sum of the parameters on x₁ and x₂; call this θ₁ = β₁ + β₂. Show that θ̂₁ = β̂₁ + β̂₂ is an unbiased estimator of θ₁.

3.7 Which of the following can cause OLS estimators to be biased?
(i) Heteroskedasticity.
(ii) Omitting an important variable.
(iii) A sample correlation coefficient of .95 between two independent variables both included in the model.

3.8 Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, average hours of training (avgtrain) and average worker ability (avgabil):

avgprod = β₀ + β₁avgtrain + β₂avgabil + u.

Assume that this equation satisfies the Gauss-Markov assumptions. If grants have been given to firms whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, what is the likely bias in ˜1 obtained from the simple regression of avgprod on avgtrain? 3.9 The following equation describes the median housing price in a community in terms of amount of pollution (nox for nitrous oxide) and the average number of rooms in houses in the community (rooms): log(price) (i)
0 1

log(nox)

2

rooms

u.

What are the probable signs of 1 and 2? What is the interpretation of 1? Explain. (ii) Why might nox [more precisely, log(nox)] and rooms be negatively correlated? If this is the case, does the simple regression of log(price) on log(nox) produce an upward or downward biased estimator of 1? (iii) Using the data in HPRICE2.RAW, the following equations were estimated: log(pr ˆice) log(pr ˆice) 9.23 11.71 1.043 log(nox), n 506, R2 .264. .514.

.718 log(nox)

.306 rooms, n

506, R2

Is the relationship between the simple and multiple regression estimates of the elasticity of price with respect to nox what you would have predicted, given your answer in part (ii)? Does this mean that .718 is definitely closer to the true elasticity than 1.043? 3.10 Suppose that the population model determining y is y
0 1 1

x

2 2

x

3 3

x

u,

and this model satisfies the Gauss-Markov assumptions. However, we estimate the model that omits x3. Let β̃0, β̃1, and β̃2 be the OLS estimators from the regression of y on x1 and x2. Show that the expected value of β̃1 (given the values of the independent variables in the sample) is

E(β̃1) = β1 + β3 [∑ᵢ r̂i1xi3] / [∑ᵢ r̂i1²],
where the r̂i1 are the OLS residuals from the regression of x1 on x2. [Hint: The formula for β̃1 comes from equation (3.22). Plug yi = β0 + β1xi1 + β2xi2 + β3xi3 + ui into this equation. After some algebra, take the expectation treating xi3 and r̂i1 as nonrandom.]

3.11 The following equation represents the effects of tax revenue mix on subsequent employment growth for the population of counties in the United States:

growth = β0 + β1shareP + β2shareI + β3shareS + other factors,

where growth is the percentage change in employment from 1980 to 1990, shareP is the share of property taxes in total tax revenue, shareI is the share of income tax revenues,

Chapter 3

Multiple Regression Analysis: Estimation

and shareS is the share of sales tax revenues. All of these variables are measured in 1980. The omitted share, shareF, includes fees and miscellaneous taxes. By definition, the four shares add up to one. Other factors would include expenditures on education, infrastructure, and so on (all measured in 1980).
(i) Why must we omit one of the tax share variables from the equation?
(ii) Give a careful interpretation of β1.

3.12 (i) Consider the simple regression model y = β0 + β1x + u under the first four Gauss-Markov assumptions. For some function g(x), for example g(x) = x² or g(x) = log(1 + x²), define zi = g(xi). Define a slope estimator as

β̃1 = [∑ᵢ (zi − z̄)yi] / [∑ᵢ (zi − z̄)xi].

Show that β̃1 is linear and unbiased. Remember, because E(u|x) = 0, you can treat both xi and zi as nonrandom in your derivation.
(ii) Add the homoskedasticity assumption, MLR.5. Show that

Var(β̃1) = σ² [∑ᵢ (zi − z̄)²] / [∑ᵢ (zi − z̄)xi]².

(iii) Show directly that, under the Gauss-Markov assumptions, Var(β̂1) ≤ Var(β̃1), where β̂1 is the OLS estimator. [Hint: The Cauchy-Schwarz inequality in Appendix B implies that

[n⁻¹ ∑ᵢ (zi − z̄)(xi − x̄)]² ≤ [n⁻¹ ∑ᵢ (zi − z̄)²][n⁻¹ ∑ᵢ (xi − x̄)²];

notice that we can drop x̄ from the sample covariance.]
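Problem 3.12 lends itself to a quick numerical check. The sketch below (plain Python, all parameter values made up for illustration) computes the exact conditional variance of the z-weighted estimator from part (ii), confirms the Cauchy-Schwarz ranking of part (iii), and verifies unbiasedness by simulation:

```python
import random

random.seed(2)
b0, b1, sigma = 1.0, 2.0, 1.0          # illustrative true values
n, reps = 100, 1000

x = [random.uniform(1, 3) for _ in range(n)]    # regressor values, held fixed
z = [xi ** 2 for xi in x]                        # z = g(x) = x^2
xbar, zbar = sum(x) / n, sum(z) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
szz = sum((zi - zbar) ** 2 for zi in z)
szx = sum((zi - zbar) * xi for zi, xi in zip(z, x))

# Exact conditional variances: OLS from (3.51) specialized to one regressor,
# and the z-based formula from part (ii).
var_ols = sigma ** 2 / sxx
var_alt = sigma ** 2 * szz / szx ** 2
print(var_alt >= var_ols)        # guaranteed by the Cauchy-Schwarz inequality

# Part (i): the z-based slope is unbiased, so it averages out to b1.
est = []
for _ in range(reps):
    y = [b0 + b1 * xi + random.gauss(0, sigma) for xi in x]
    est.append(sum((zi - zbar) * yi for zi, yi in zip(z, y)) / szx)
print(round(sum(est) / reps, 1))  # close to b1 = 2.0
```

The variance comparison uses the exact conditional variances, so the inequality holds in every sample, not just on average.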

COMPUTER EXERCISES
3.13 A problem of interest to health officials (and others) is to determine the effects of smoking during pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking that affect birth weight are likely to be correlated with smoking, we should take those factors into account. For example, higher income generally results in access to better prenatal care, as well as better nutrition for the mother. An equation that recognizes this is

bwght = β0 + β1cigs + β2faminc + u.

(i) What is the most likely sign for β2?
(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be positive or negative.
(iii) Now estimate the equation with and without faminc, using the data in BWGHT.RAW. Report the results in equation form, including the sample size and R-squared. Discuss your results, focusing on whether

adding faminc substantially changes the estimated effect of cigs on bwght.

3.14 Use the data in HPRICE1.RAW to estimate the model

price = β0 + β1sqrft + β2bdrms + u,

where price is the house price measured in thousands of dollars.
(i) Write out the results in equation form.
(ii) What is the estimated increase in price for a house with one more bedroom, holding square footage constant?
(iii) What is the estimated increase in price for a house with an additional bedroom that is 140 square feet in size? Compare this to your answer in part (ii).
(iv) What percentage of the variation in price is explained by square footage and number of bedrooms?
(v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price for this house from the OLS regression line.
(vi) The actual selling price of the first house in the sample was $300,000 (so price = 300). Find the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house?

3.15 The file CEOSAL2.RAW contains data on 177 chief executive officers, which can be used to examine the effects of firm performance on CEO salary.
(i) Estimate a model relating annual salary to firm sales and market value. Make the model of the constant elasticity variety for both independent variables. Write the results out in equation form.
(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic form? Would you say that these firm performance variables explain most of the variation in CEO salaries?
(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for another year of CEO tenure, holding other factors fixed?
(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these variables highly correlated? What does this say about the OLS estimators?

3.16 Use the data in ATTEND.RAW for this exercise.
(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT.
(ii) Estimate the model

atndrte = β0 + β1priGPA + β2ACT + u

and write the results in equation form. Interpret the intercept. Does it have a useful meaning?
(iii) Discuss the estimated slope coefficients. Are there any surprises?
(iv) What is the predicted atndrte, if priGPA = 3.65 and ACT = 20? What do you make of this result? Are there any students in the sample with these values of the explanatory variables?

(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and ACT = 26, what is the predicted difference in their attendance rates?

3.17 Confirm the partialling out interpretation of the OLS estimates by explicitly doing the partialling out for Example 3.2. This first requires regressing educ on exper and tenure, and saving the residuals, r̂1. Then, regress log(wage) on r̂1. Compare the coefficient on r̂1 with the coefficient on educ in the regression of log(wage) on educ, exper, and tenure.
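The partialling out result in Problem 3.17 does not depend on the wage data; it is an algebraic identity, so it can be demonstrated with any simulated data set (the variables below are made-up stand-ins, not the WAGE1.RAW data):

```python
import random

random.seed(3)
n = 300
# Simulated stand-ins for two correlated regressors (names are illustrative)
x1 = [random.gauss(12, 2) for _ in range(n)]
x2 = [0.5 * v + random.gauss(0, 3) for v in x1]
y = [1.0 + 0.09 * a + 0.01 * b + random.gauss(0, 0.4) for a, b in zip(x1, x2)]

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

d1, d2, dy = demean(x1), demean(x2), demean(y)
s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
s1y = sum(a * c for a, c in zip(d1, dy))
s2y = sum(b * c for b, c in zip(d2, dy))

# Multiple regression coefficient on x1 via the 2x2 normal equations:
b1_mult = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

# Partialling out: residuals from regressing x1 on x2, then y on the residuals.
g = s12 / s22                              # slope of x1 on x2
r1 = [a - g * b for a, b in zip(d1, d2)]   # the residuals r1-hat
b1_partial = sum(r * c for r, c in zip(r1, dy)) / sum(r * r for r in r1)

print(abs(b1_mult - b1_partial) < 1e-10)   # the two routes agree exactly
```

The two estimates agree to floating-point precision, which is the content of equation (3.22).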

APPENDIX 3A

3A.1 Derivation of the First Order Conditions, Equations (3.13)

The analysis is very similar to the simple regression case. We must characterize the solutions to the problem

min over b0, b1, …, bk of ∑ᵢ (yi − b0 − b1xi1 − … − bkxik)².

Taking the partial derivatives with respect to each of the bj (see Appendix A), evaluating them at the solutions, and setting them equal to zero gives

−2 ∑ᵢ (yi − β̂0 − β̂1xi1 − … − β̂kxik) = 0
−2 ∑ᵢ xij(yi − β̂0 − β̂1xi1 − … − β̂kxik) = 0, j = 1, …, k.

Cancelling the −2 gives the first order conditions in (3.13).
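The first order conditions are easy to verify numerically. A minimal sketch for the simple regression case (simulated data, illustrative values): the OLS residuals sum to zero and are orthogonal to the regressor, which is exactly what the conditions in (3.13) require.

```python
import random

random.seed(4)
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 + 1.5 * xi + random.gauss(0, 2) for xi in x]

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
u = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]   # OLS residuals

# The two first order conditions: residuals sum to zero and are
# orthogonal to the regressor (up to floating-point error).
print(abs(sum(u)) < 1e-9)
print(abs(sum(xi * ui for xi, ui in zip(x, u))) < 1e-9)
```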

3A.2 Derivation of Equation (3.22)

To derive (3.22), write xi1 in terms of its fitted value and its residual from the regression of x1 on x2, …, xk: xi1 = x̂i1 + r̂i1, for i = 1, …, n. Now, plug this into the second equation in (3.13):

∑ᵢ (x̂i1 + r̂i1)(yi − β̂0 − β̂1xi1 − … − β̂kxik) = 0.  (3.60)

By the definition of the OLS residual ûi, since x̂i1 is just a linear function of the explanatory variables xi2, …, xik, it follows that ∑ᵢ x̂i1ûi = 0. Therefore, (3.60) can be expressed as

∑ᵢ r̂i1(yi − β̂0 − β̂1xi1 − … − β̂kxik) = 0.  (3.61)


n

Since the ri1 are the residuals from regressing x1 onto x2, …, xk , ˆ n xijri1 ˆ i 1

0 for j

2,

…, k. Therefore, (3.61) is equivalent to n i 1

ri1(yi ˆ

ˆ1xi1)

0. Finally, we use the fact

that i 1

xi1ri1 ˆ ˆ

0, which means that ˆ1 solves n ri1(yi ˆ i 1

ˆ1ri1) ˆ

0. n Now straightforward algebra gives (3.22), provided, of course, that i 1

r i21 ˆ

0; this is

ensured by Assumption MLR.4.

3A.3 Proof of Theorem 3.1

We prove Theorem 3.1 for β̂1; the proof for the other slope parameters is virtually identical. (See Appendix E for a more succinct proof using matrices.) Under Assumption MLR.4, the OLS estimators exist, and we can write β̂1 as in (3.22). Under Assumption MLR.1, we can write yi as in (3.32); substitute this for yi in (3.22). Then, using ∑ᵢ r̂i1 = 0, ∑ᵢ xij r̂i1 = 0 for all j = 2, …, k, and ∑ᵢ xi1r̂i1 = ∑ᵢ r̂i1², we have

β̂1 = β1 + [∑ᵢ r̂i1ui] / [∑ᵢ r̂i1²].  (3.62)

Now, under Assumptions MLR.2 and MLR.3, the expected value of each ui, given all independent variables in the sample, is zero. Since the r̂i1 are just functions of the sample independent variables, it follows that

E(β̂1|X) = β1 + [∑ᵢ r̂i1E(ui|X)] / [∑ᵢ r̂i1²] = β1 + [∑ᵢ r̂i1 · 0] / [∑ᵢ r̂i1²] = β1,

where X denotes the data on all independent variables and E(β̂1|X) is the expected value of β̂1, given xi1, …, xik, for all i = 1, …, n. This completes the proof.
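A brief simulation (parameter values made up for illustration) shows what the conditioning argument in this proof delivers, and what fails when a correlated regressor is omitted: the short-regression slopes center not on β1 but on β1 + β2δ̃1, where δ̃1 is the slope from regressing x2 on x1.

```python
import random

random.seed(8)
b0, b1, b2 = 1.0, 0.5, 0.8   # true population parameters (made up)
n, reps = 200, 500

def slope(x, y):
    """Simple regression slope of y on x."""
    xbar = sum(x) / len(x)
    return (sum((xi - xbar) * yi for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

# Fixed regressors; x2 is negatively correlated with x1.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [-0.6 * v + random.gauss(0, 1) for v in x1]
delta1 = slope(x1, x2)       # slope from regressing x2 on x1

short = []                   # slopes from the misspecified regression on x1 alone
for _ in range(reps):
    y = [b0 + b1 * a + b2 * c + random.gauss(0, 1) for a, c in zip(x1, x2)]
    short.append(slope(x1, y))

avg = sum(short) / reps
# Conditional on X, the short regression centers on b1 + b2*delta1.
print(abs(avg - (b1 + b2 * delta1)) < 0.05)
```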

3A.4 Proof of Theorem 3.2

Again, we prove this for j = 1. Write β̂1 as in equation (3.62). Now, under MLR.5, Var(ui|X) = σ² for all i = 1, …, n. Under random sampling, the ui are independent, even conditional on X, and the r̂i1 are nonrandom conditional on X. Therefore,

Var(β̂1|X) = [∑ᵢ r̂i1² Var(ui|X)] / [∑ᵢ r̂i1²]² = [∑ᵢ r̂i1² σ²] / [∑ᵢ r̂i1²]² = σ² / [∑ᵢ r̂i1²].


Now, since ∑ᵢ r̂i1² is the sum of squared residuals from regressing x1 on x2, …, xk, we have ∑ᵢ r̂i1² = SST1(1 − R1²). This completes the proof.
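The final identity, ∑ᵢ r̂i1² = SST1(1 − R1²), can be confirmed numerically; a sketch with two made-up correlated regressors:

```python
import random

random.seed(5)
n = 200
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.7 * v + random.gauss(0, 1) for v in x2]   # x1 correlated with x2

m1, m2 = sum(x1) / n, sum(x2) / n
sst1 = sum((a - m1) ** 2 for a in x1)             # total variation in x1
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
s22 = sum((b - m2) ** 2 for b in x2)

g = s12 / s22                                     # slope from regressing x1 on x2
r1 = [(a - m1) - g * (b - m2) for a, b in zip(x1, x2)]
ssr1 = sum(r * r for r in r1)                     # sum of squared residuals r1-hat

r2_1 = g * g * s22 / sst1                         # R-squared from x1 on x2
print(abs(ssr1 - sst1 * (1 - r2_1)) < 1e-9)       # identity holds numerically
```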

3A.5 Proof of Theorem 3.4

We show that, for any other linear unbiased estimator β̃1 of β1, Var(β̃1) ≥ Var(β̂1), where β̂1 is the OLS estimator. The focus on j = 1 is without loss of generality. For β̃1 as in equation (3.59), we can plug in for yi to obtain

β̃1 = β0 ∑ᵢ wi1 + β1 ∑ᵢ wi1xi1 + β2 ∑ᵢ wi1xi2 + … + βk ∑ᵢ wi1xik + ∑ᵢ wi1ui.

Now, since the wi1 are functions of the xij,

E(β̃1|X) = β0 ∑ᵢ wi1 + β1 ∑ᵢ wi1xi1 + β2 ∑ᵢ wi1xi2 + … + βk ∑ᵢ wi1xik + ∑ᵢ wi1E(ui|X)
= β0 ∑ᵢ wi1 + β1 ∑ᵢ wi1xi1 + β2 ∑ᵢ wi1xi2 + … + βk ∑ᵢ wi1xik

because E(ui|X) = 0, for all i = 1, …, n, under MLR.3. Therefore, for E(β̃1|X) to equal β1 for any values of the parameters, we must have

∑ᵢ wi1 = 0, ∑ᵢ wi1xi1 = 1, ∑ᵢ wi1xij = 0, j = 2, …, k.  (3.63)

Now, let r̂i1 be the residuals from the regression of xi1 on xi2, …, xik. Then, from (3.63), it follows that

∑ᵢ wi1r̂i1 = 1.  (3.64)

Now, consider the difference between Var(β̃1|X) and Var(β̂1|X) under MLR.1 through MLR.5:

σ² ∑ᵢ wi1² − σ² / [∑ᵢ r̂i1²].  (3.65)

Because of (3.64), we can write the difference in (3.65), without σ², as

∑ᵢ wi1² − [∑ᵢ wi1r̂i1]² / [∑ᵢ r̂i1²].  (3.66)

But (3.66) is simply

∑ᵢ (wi1 − γ̂1r̂i1)²,  (3.67)


where γ̂1 = [∑ᵢ wi1r̂i1] / [∑ᵢ r̂i1²], as can be seen by squaring each term in (3.67), summing, and then cancelling terms. Because (3.67) is just the sum of squared residuals from the simple regression of wi1 on r̂i1 (remember that the sample average of r̂i1 is zero), (3.67) must be nonnegative. This completes the proof.


Chapter Four

Multiple Regression Analysis: Inference

This chapter continues our treatment of multiple regression analysis. We now turn to the problem of testing hypotheses about the parameters in the population regression model. We begin by finding the distributions of the OLS estimators under the added assumption that the population error is normally distributed. Sections 4.2 and 4.3 cover hypothesis testing about individual parameters, while Section 4.4 discusses how to test a single hypothesis involving more than one parameter. We focus on testing multiple restrictions in Section 4.5 and pay particular attention to determining whether a group of independent variables can be omitted from a model.

4.1 SAMPLING DISTRIBUTIONS OF THE OLS ESTIMATORS
Up to this point, we have formed a set of assumptions under which OLS is unbiased, and we have also derived and discussed the bias caused by omitted variables. In Section 3.4, we obtained the variances of the OLS estimators under the Gauss-Markov assumptions. In Section 3.5, we showed that this variance is smallest among linear unbiased estimators. Knowing the expected value and variance of the OLS estimators is useful for describing the precision of the OLS estimators. However, in order to perform statistical inference, we need to know more than just the first two moments of the β̂j; we need to know the full sampling distribution of the β̂j. Even under the Gauss-Markov assumptions, the distribution of β̂j can have virtually any shape. When we condition on the values of the independent variables in our sample, it is clear that the sampling distributions of the OLS estimators depend on the underlying distribution of the errors. To make the sampling distributions of the β̂j tractable, we now assume that the unobserved error is normally distributed in the population. We call this the normality assumption.
ASSUMPTION MLR.6 (NORMALITY)

The population error u is independent of the explanatory variables x1, x2, …, xk and is normally distributed with zero mean and variance σ²: u ~ Normal(0, σ²).

Assumption MLR.6 is much stronger than any of our previous assumptions. In fact, since u is independent of the xj under MLR.6, E(u|x1, …, xk) = E(u) = 0, and Var(u|x1, …, xk) = Var(u) = σ². Thus, if we make Assumption MLR.6, then we are necessarily assuming MLR.3 and MLR.5. To emphasize that we are assuming more than before, we will refer to the full set of assumptions MLR.1 through MLR.6. For cross-sectional regression applications, the six assumptions MLR.1 through MLR.6 are called the classical linear model (CLM) assumptions. Thus, we will refer to the model under these six assumptions as the classical linear model. It is best to think of the CLM assumptions as containing all of the Gauss-Markov assumptions plus the assumption of a normally distributed error term. Under the CLM assumptions, the OLS estimators β̂0, β̂1, …, β̂k have a stronger efficiency property than they would under the Gauss-Markov assumptions. It can be shown that the OLS estimators are the minimum variance unbiased estimators, which means that OLS has the smallest variance among unbiased estimators; we no longer have to restrict our comparison to estimators that are linear in the yi. This property of OLS under the CLM assumptions is discussed further in Appendix E. A succinct way to summarize the population assumptions of the CLM is

y|x ~ Normal(β0 + β1x1 + β2x2 + … + βkxk, σ²),

where x is again shorthand for (x1, …, xk ). Thus, conditional on x, y has a normal distribution with mean linear in x1, …, xk and a constant variance. For a single independent variable x, this situation is shown in Figure 4.1. The argument justifying the normal distribution for the errors usually runs something like this: Because u is the sum of many different unobserved factors affecting y, we can invoke the central limit theorem (see Appendix C) to conclude that u has an approximate normal distribution. This argument has some merit, but it is not without weaknesses. First, the factors in u can have very different distributions in the population (for example, ability and quality of schooling in the error in a wage equation). While the central limit theorem (CLT) can still hold in such cases, the normal approximation can be poor depending on how many factors appear in u and how different are their distributions. A more serious problem with the CLT argument is that it assumes that all unobserved factors affect y in a separate, additive fashion. Nothing guarantees that this is so. If u is a complicated function of the unobserved factors, then the CLT argument does not really apply. In any application, whether normality of u can be assumed is really an empirical matter. For example, there is no theorem that says wage conditional on educ, exper, and tenure is normally distributed. If anything, simple reasoning suggests that the opposite is true: since wage can never be less than zero, it cannot, strictly speaking, have a normal distribution. Further, since there are minimum wage laws, some fraction of the population earns exactly the minimum wage, which also violates the normality assumption. Nevertheless, as a practical matter we can ask whether the conditional wage distribution is “close” to being normal. Past empirical evidence suggests that normality is not a good assumption for wages. 
Often, using a transformation, especially taking the log, yields a distribution that is closer to normal. For example, something like log(price) tends to have a distribution
that looks more normal than the distribution of price. Again, this is an empirical issue, which we will discuss further in Chapter 5.

[Figure 4.1: The homoskedastic normal distribution with a single explanatory variable. The densities f(y|x) at x1, x2, and x3 are identical normal curves centered on the line E(y|x) = β0 + β1x.]

There are some examples where MLR.6 is clearly false. Whenever y takes on just a few values, it cannot have anything close to a normal distribution. The dependent variable in Example 3.5 provides a good example. The variable narr86, the number of times a young man was arrested in 1986, takes on a small range of integer values and is zero for most men. Thus, narr86 is far from being normally distributed. What can be done in these cases? As we will see in Chapter 5 (and this is important), nonnormality of the errors is not a serious problem with large sample sizes. For now, we just make the normality assumption. Normality of the error term translates into normal sampling distributions of the OLS estimators:
THEOREM 4.1 (NORMAL SAMPLING DISTRIBUTIONS)

Under the CLM assumptions MLR.1 through MLR.6, conditional on the sample values of the independent variables,

β̂j ~ Normal[βj, Var(β̂j)],  (4.1)

where Var(β̂j) was given in Chapter 3 [equation (3.51)]. Therefore,

(β̂j − βj)/sd(β̂j) ~ Normal(0,1).

The proof of (4.1) is not that difficult, given the properties of normally distributed random variables in Appendix B. Each β̂j can be written as β̂j = βj + ∑ᵢ wijui, where wij = r̂ij/SSRj, r̂ij is the ith residual from the regression of xj on all the other independent variables, and SSRj is the sum of squared residuals from this regression [see equation (3.62)]. Since the wij depend only on the independent variables, they can be treated as nonrandom. Thus, β̂j is just a linear combination of the errors in the sample, {ui: i = 1, 2, …, n}. Under Assumption MLR.6 (and the random sampling Assumption MLR.2), the errors are independent, identically distributed Normal(0, σ²) random variables. An important fact about independent normal random variables is that a linear combination of such random variables is normally distributed (see Appendix B). This basically completes the proof. In Section 3.3, we showed that E(β̂j) = βj, and we derived Var(β̂j) in Section 3.4; there is no need to re-derive these facts. The second part of this theorem follows immediately from the fact that when we standardize a normal random variable by dividing it by its standard deviation, we end up with a standard normal random variable.

QUESTION 4.1
Suppose that u is independent of the explanatory variables, and it takes on the values −2, −1, 0, 1, and 2 with equal probability of 1/5. Does this violate the Gauss-Markov assumptions? Does this violate the CLM assumptions?

The conclusions of Theorem 4.1 can be strengthened. In addition to (4.1), any linear combination of the β̂0, β̂1, …, β̂k is also normally distributed, and any subset of the β̂j has a joint normal distribution. These facts underlie the testing results in the remainder of this chapter. In Chapter 5, we will show that the normality of the OLS estimators is still approximately true in large samples even without normality of the errors.
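Theorem 4.1 can be seen in a small Monte Carlo sketch (all values illustrative): holding the regressor values fixed and drawing normal errors, the slope estimates, standardized by their exact standard deviation, behave like a standard normal random variable.

```python
import random

random.seed(6)
b0, b1, sigma = 1.0, 0.5, 2.0
n, reps = 40, 4000

x = [random.uniform(0, 10) for _ in range(n)]   # regressors held fixed
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sd_b1 = sigma / sxx ** 0.5                       # exact sd of the OLS slope

zs = []
for _ in range(reps):
    y = [b0 + b1 * xi + random.gauss(0, sigma) for xi in x]
    ybar = sum(y) / n
    b1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    zs.append((b1_hat - b1) / sd_b1)

mean = sum(zs) / reps
sd = (sum((z - mean) ** 2 for z in zs) / (reps - 1)) ** 0.5
print(round(mean, 1), round(sd, 1))   # should be near 0.0 and 1.0
```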

4.2 TESTING HYPOTHESES ABOUT A SINGLE POPULATION PARAMETER: THE t TEST
This section covers the very important topic of testing hypotheses about any single parameter in the population regression function. The population model can be written as

y = β0 + β1x1 + … + βkxk + u,  (4.2)

and we assume that it satisfies the CLM assumptions. We know that OLS produces unbiased estimators of the j. In this section, we study how to test hypotheses about a particular j. For a full understanding of hypothesis testing, one must remember that the j are unknown features of the population, and we will never know them with certainty. Nevertheless, we can hypothesize about the value of j and then use statistical inference to test our hypothesis. In order to construct hypotheses tests, we need the following result:

THEOREM 4.2 (t DISTRIBUTION FOR THE STANDARDIZED ESTIMATORS)

Under the CLM assumptions MLR.1 through MLR.6,

(β̂j − βj)/se(β̂j) ~ tn−k−1,  (4.3)

where k + 1 is the number of unknown parameters in the population model y = β0 + β1x1 + … + βkxk + u (k slope parameters and the intercept β0).

This result differs from Theorem 4.1 in some notable respects. Theorem 4.1 showed that, under the CLM assumptions, (β̂j − βj)/sd(β̂j) ~ Normal(0,1). The t distribution in (4.3) comes from the fact that the constant σ in sd(β̂j) has been replaced with the random variable σ̂. The proof that this leads to a t distribution with n − k − 1 degrees of freedom is not especially insightful. Essentially, the proof shows that (4.3) can be written as the ratio of the standard normal random variable (β̂j − βj)/sd(β̂j) over the square root of σ̂²/σ². These random variables can be shown to be independent, and (n − k − 1)σ̂²/σ² ~ χ²n−k−1. The result then follows from the definition of a t random variable (see Section B.5). Theorem 4.2 is important in that it allows us to test hypotheses involving the βj. In most applications, our primary interest lies in testing the null hypothesis

H0: βj = 0,  (4.4)

where j corresponds to any of the k independent variables. It is important to understand what (4.4) means and to be able to describe this hypothesis in simple language for a particular application. Since βj measures the partial effect of xj on (the expected value of) y, after controlling for all other independent variables, (4.4) means that, once x1, x2, …, xj−1, xj+1, …, xk have been accounted for, xj has no effect on the expected value of y. We cannot state the null hypothesis as "xj does have a partial effect on y" because this is true for any value of βj other than zero. Classical testing is suited for testing simple hypotheses like (4.4). As an example, consider the wage equation

log(wage) = β0 + β1educ + β2exper + β3tenure + u.

The null hypothesis H0: β2 = 0 means that, once education and tenure have been accounted for, the number of years in the work force (exper) has no effect on hourly wage. This is an economically interesting hypothesis. If it is true, it implies that a person's work history prior to the current employment does not affect wage. If β2 > 0, then prior work experience contributes to productivity, and hence to wage. You probably remember from your statistics course the rudiments of hypothesis testing for the mean from a normal population. (This is reviewed in Appendix C.) The mechanics of testing (4.4) in the multiple regression context are very similar. The hard part is obtaining the coefficient estimates, the standard errors, and the critical values, but most of this work is done automatically by econometrics software. Our job is to learn how regression output can be used to test hypotheses of interest. The statistic we use to test (4.4) (against any alternative) is called "the" t statistic or "the" t ratio of β̂j and is defined as

tβ̂j ≡ β̂j/se(β̂j).  (4.5)

We have put "the" in quotation marks because, as we will see shortly, a more general form of the t statistic is needed for testing other hypotheses about βj. For now, it is important to know that (4.5) is suitable only for testing (4.4). When it causes no confusion, we will sometimes write t in place of tβ̂j. The t statistic for β̂j is simple to compute given β̂j and its standard error. In fact, most regression packages do the division for you and report the t statistic along with each coefficient and its standard error. Before discussing how to use (4.5) formally to test H0: βj = 0, it is useful to see why tβ̂j has features that make it reasonable as a test statistic to detect βj ≠ 0. First, since se(β̂j) is always positive, tβ̂j has the same sign as β̂j: if β̂j is positive, then so is tβ̂j, and if β̂j is negative, so is tβ̂j. Second, for a given value of se(β̂j), a larger value of β̂j leads to larger values of tβ̂j. If β̂j becomes more negative, so does tβ̂j. Since we are testing H0: βj = 0, it is only natural to look at our unbiased estimator of βj, β̂j, for guidance. In any interesting application, the point estimate β̂j will never exactly be zero, whether or not H0 is true. The question is: How far is β̂j from zero? A sample value of β̂j very far from zero provides evidence against H0: βj = 0. However, we must recognize that there is a sampling error in our estimate β̂j, so the size of β̂j must be weighed against its sampling error. Since the standard error of β̂j is an estimate of the standard deviation of β̂j, tβ̂j measures how many estimated standard deviations β̂j is away from zero. This is precisely what we do in testing whether the mean of a population is zero, using the standard t statistic from introductory statistics. Values of tβ̂j sufficiently far from zero will result in a rejection of H0. The precise rejection rule depends on the alternative hypothesis and the chosen significance level of the test.
Determining a rule for rejecting (4.4) at a given significance level (that is, the probability of rejecting H0 when it is true) requires knowing the sampling distribution of tβ̂j when H0 is true. From Theorem 4.2, we know this to be tn−k−1. This is the key theoretical result needed for testing (4.4). Before proceeding, it is important to remember that we are testing hypotheses about the population parameters. We are not testing hypotheses about the estimates from a particular sample. Thus, it never makes sense to state a null hypothesis as "H0: β̂1 = 0" or, even worse, as "H0: .237 = 0" when the estimate of a parameter is .237 in the sample. We are testing whether the unknown population value, β1, is zero. Some treatments of regression analysis define the t statistic as the absolute value of (4.5), so that the t statistic is always positive. This practice has the drawback of making testing against one-sided alternatives clumsy. Throughout this text, the t statistic always has the same sign as the corresponding OLS coefficient estimate.
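Why Theorem 4.2 delivers a t rather than a normal distribution can also be seen by simulation (a sketch with made-up numbers): once σ̂ replaces σ in the standardization, the ratios spread out more than a standard normal, noticeably so when the degrees of freedom are small.

```python
import random

random.seed(7)
b0, b1, sigma = 0.0, 1.0, 1.0
n, reps = 8, 5000                     # small n makes the t effect visible

x = [float(i) for i in range(n)]      # fixed regressor values 0..7
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

ts = []
for _ in range(reps):
    y = [b0 + b1 * xi + random.gauss(0, sigma) for xi in x]
    ybar = sum(y) / n
    b1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0_hat = ybar - b1_hat * xbar
    ssr = sum((yi - b0_hat - b1_hat * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2)) ** 0.5 / sxx ** 0.5      # sigma-hat based standard error
    ts.append((b1_hat - b1) / se)

m = sum(ts) / reps
sd = (sum((t - m) ** 2 for t in ts) / (reps - 1)) ** 0.5
# A t distribution with n - 2 = 6 df has standard deviation sqrt(6/4),
# about 1.22, so the sample spread exceeds the standard normal's 1.
print(sd > 1.1)
```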

Testing Against One-Sided Alternatives
In order to determine a rule for rejecting H0, we need to decide on the relevant alternative hypothesis. First consider a one-sided alternative of the form

H1: βj > 0.  (4.6)


This means that we do not care about alternatives to H0 of the form H1: βj < 0; for some reason, perhaps on the basis of introspection or economic theory, we are ruling out population values of βj less than zero. (Another way to think about this is that the null hypothesis is actually H0: βj ≤ 0; in either case, the statistic tβ̂j is used as the test statistic.) How should we choose a rejection rule? We must first decide on a significance level or the probability of rejecting H0 when it is in fact true. For concreteness, suppose we have decided on a 5% significance level, as this is the most popular choice. Thus, we are willing to mistakenly reject H0 when it is true 5% of the time. Now, while tβ̂j has a t distribution under H0 (so that it has zero mean), under the alternative βj > 0, the expected value of tβ̂j is positive. Thus, we are looking for a "sufficiently large" positive value of tβ̂j in order to reject H0: βj = 0 in favor of H1: βj > 0. Negative values of tβ̂j provide no evidence in favor of H1. The definition of "sufficiently large," with a 5% significance level, is the 95th percentile in a t distribution with n − k − 1 degrees of freedom; denote this by c. In other words, the rejection rule is that H0 is rejected in favor of H1 at the 5% significance level if

tβ̂j > c.  (4.7)

[Figure 4.2: 5% rejection rule for the alternative H1: βj > 0 with 28 df. The rejection region is the area of .05 in the right tail of the t distribution, beyond the critical value 1.701.]


By our choice of the critical value c, rejection of H0 will occur for 5% of all random samples when H0 is true. The rejection rule in (4.7) is an example of a one-tailed test. In order to obtain c, we only need the significance level and the degrees of freedom. For example, for a 5% level test and with n − k − 1 = 28 degrees of freedom, the critical value is c = 1.701. If tβ̂j ≤ 1.701, then we fail to reject H0 in favor of (4.6) at the 5% level. Note that a negative value for tβ̂j, no matter how large in absolute value, leads to a failure in rejecting H0 in favor of (4.6). (See Figure 4.2.) The same procedure can be used with other significance levels. For a 10% level test and if df = 21, the critical value is c = 1.323. For a 1% significance level and if df = 21, c = 2.518. All of these critical values are obtained directly from Table G.2. You should note a pattern in the critical values: as the significance level falls, the critical value increases, so that we require a larger and larger value of tβ̂j in order to reject H0. Thus, if H0 is rejected at, say, the 5% level, then it is automatically rejected at the 10% level as well. It makes no sense to reject the null hypothesis at, say, the 5% level and then to redo the test to determine the outcome at the 10% level. As the degrees of freedom in the t distribution get large, the t distribution approaches the standard normal distribution. For example, when n − k − 1 = 120, the 5% critical value for the one-sided alternative (4.7) is 1.658, compared with the standard normal value of 1.645. These are close enough for practical purposes; for degrees of freedom greater than 120, one can use the standard normal critical values.
EXAMPLE 4.1 (Hourly Wage Equation)

Using the data in WAGE1.RAW gives the estimated equation

log(wâge) = .284 + .092 educ + .0041 exper + .022 tenure
            (.104)  (.007)       (.0017)        (.003)
n = 526, R² = .316,

where standard errors appear in parentheses below the estimated coefficients. We will follow this convention throughout the text. This equation can be used to test whether the return to exper, controlling for educ and tenure, is zero in the population, against the alternative that it is positive. Write this as H0: βexper = 0 versus H1: βexper > 0. (In applications, indexing a parameter by its associated variable name is a nice way to label parameters, since the numerical indices that we use in the general model are arbitrary and can cause confusion.) Remember that βexper denotes the unknown population parameter. It is nonsense to write "H0: .0041 = 0" or "H0: β̂exper = 0." Since we have 522 degrees of freedom, we can use the standard normal critical values. The 5% critical value is 1.645, and the 1% critical value is 2.326. The t statistic for β̂exper is

tβ̂exper = .0041/.0017 ≈ 2.41,

and so β̂exper, or exper, is statistically significant even at the 1% level. We also say that "β̂exper is statistically greater than zero at the 1% significance level." The estimated return for another year of experience, holding tenure and education fixed, is not large. For example, adding three more years increases log(wage) by 3(.0041) = .0123, so wage is only about 1.2% higher. Nevertheless, we have persuasively shown that the partial effect of experience is positive in the population.
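The arithmetic of Example 4.1 is simple enough to script directly (the critical values below are the standard normal ones quoted in the text):

```python
coef, se = 0.0041, 0.0017        # estimate and standard error from Example 4.1
t_stat = coef / se
print(round(t_stat, 2))           # 2.41, matching the text

# One-sided critical values quoted in the text (df = 522, so standard
# normal values apply): 1.645 at the 5% level, 2.326 at the 1% level.
for level, c in [(0.05, 1.645), (0.01, 2.326)]:
    print(level, t_stat > c)      # H0 is rejected at both levels
```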

The one-sided alternative that the parameter is less than zero,

H1: βj < 0,  (4.8)
also arises in applications. The rejection rule for alternative (4.8) is just the mirror image of the previous case. Now, the critical value comes from the left tail of the t distribution. In practice, it is easiest to think of the rejection rule as
Q U E S T I O N 4 . 2

t ˆj

c,

(4.9)

Let community loan approval rates be determined by

where c is the critical value for the alternative H1: j 0. For simplicity, we always assume c is positive, since this is how critwhere percmin is the percent minority in the community, avginc is ical values are reported in t tables, and so average income, avgwlth is average wealth, and avgdebt is some the critical value c is a negative number. measure of average debt obligations. How do you state the null For example, if the significance level is hypothesis that there is no difference in loan rates across neighbor5% and the degrees of freedom is 18, then hoods due to racial and ethnic composition, when average income, c 1.734, and so H0: j 0 is rejected in average wealth, and average debt have been controlled for? How do you state the alternative that there is discrimination against favor of H1: j 0 at the 5% level if t ˆj minorities in loan approval rates? 1.734. It is important to remember that, to reject H0 against the negative alternative (4.8), we must get a negative t statistic. A positive t ratio, no matter how large, provides no evidence in favor of (4.8). The rejection rule is illustrated in Figure 4.3. apprate percmin 2 avginc u, 3 avgwlth 4 avgdebt
0 1
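The lower-tail rejection rule in (4.9) is easy to encode. A minimal sketch, with the 5% critical value for 18 degrees of freedom taken from a t table as in the text:

```python
def reject_lower(t_stat, c):
    """Reject H0: beta_j = 0 against H1: beta_j < 0 iff t < -c.

    c is the positive critical value as reported in t tables.
    """
    return t_stat < -c

c = 1.734  # 5% one-sided critical value with 18 df, from a t table

print(reject_lower(-2.0, c))  # True: negative and far enough from zero
print(reject_lower(-1.5, c))  # False: negative but not extreme enough
print(reject_lower(3.0, c))   # False: a positive t never rejects here
```

The last line makes the point in the text concrete: a large positive t ratio provides no evidence in favor of (4.8).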

EXAMPLE 4.2 (Student Performance and School Size)

There is much interest in the effect of school size on student performance. (See, for example, The New York Times Magazine, 5/28/95.) One claim is that, everything else being equal, students at smaller schools fare better than those at larger schools. This hypothesis is assumed to be true even after accounting for differences in class sizes across schools.
The file MEAP93.RAW contains data on 408 high schools in Michigan for the year 1993. We can use these data to test the null hypothesis that school size has no effect on standardized test scores, against the alternative that size has a negative effect. Performance is measured by the percentage of students receiving a passing score on the Michigan Educational Assessment Program (MEAP) standardized tenth grade math test (math10). School size is measured by student enrollment (enroll). The null hypothesis is H0: βenroll = 0, and the alternative is H1: βenroll < 0. For now, we will control for two other factors, average annual teacher compensation (totcomp) and the number of staff per one thousand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough measure of how much attention students receive.
Part 1. Regression Analysis with Cross-Sectional Data

[Figure 4.3: 5% rejection rule for the alternative H1: βj < 0 with 18 df. The rejection region is the left tail of the t distribution, with area .05, to the left of −1.734.]

The estimated equation, with standard errors in parentheses, is

math10^ = 2.274 + .00046 totcomp + .048 staff − .00020 enroll
         (6.113)  (.00010)          (.040)      (.00022)
n = 408, R² = .0541.

The coefficient on enroll, −.00020, is in accordance with the conjecture that larger schools hamper performance: higher enrollment leads to a lower percentage of students with a passing tenth grade math score. (The coefficients on totcomp and staff also have the signs we expect.) The fact that enroll has an estimated coefficient different from zero could just be due to sampling error; to be convinced of an effect, we need to conduct a t test. Since n − k − 1 = 408 − 4 = 404, we use the standard normal critical value. At the 5% level, the critical value is −1.65; the t statistic on enroll must be less than −1.65 to reject H0 at the 5% level. The t statistic on enroll is −.00020/.00022 ≈ −.91, which is larger than −1.65: we fail to reject H0 in favor of H1 at the 5% level. In fact, the 15% critical value is −1.04, and since −.91 > −1.04, we fail to reject H0 even at the 15% level. We conclude that enroll is not statistically significant at the 15% level.
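The enroll test above can be reproduced numerically. A sketch using the reported coefficient and standard error (not recomputed from MEAP93.RAW), with normal critical values as in the text since the degrees of freedom are large:

```python
from statistics import NormalDist

# Reported values from the level-level model in Example 4.2
b_enroll, se_enroll = -0.00020, 0.00022

t_enroll = b_enroll / se_enroll  # about -0.91

# 404 df, so the standard normal critical values are fine, as in the text
c05 = -NormalDist().inv_cdf(0.95)  # about -1.65
c15 = -NormalDist().inv_cdf(0.85)  # about -1.04

# We reject only if t falls below the (negative) critical value
print(round(t_enroll, 2), t_enroll < c05, t_enroll < c15)
```

Both comparisons come out False, so H0 is not rejected even at the 15% level, matching the conclusion in the text.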

The variable totcomp is statistically significant even at the 1% significance level because its t statistic is 4.6. On the other hand, the t statistic for staff is 1.2, and so we cannot reject H0: βstaff = 0 against H1: βstaff > 0 even at the 10% significance level. (The critical value is c = 1.28 from the standard normal distribution.)
To illustrate how changing the functional form can affect our conclusions, we also estimate the model with all independent variables in logarithmic form. This allows, for example, the school size effect to diminish as school size increases. The estimated equation is

math10^ = −207.66 + 21.16 log(totcomp) + 3.98 log(staff) − 1.29 log(enroll)
          (48.70)   (4.06)               (4.19)            (0.69)
n = 408, R² = .0654.

The t statistic on log(enroll) is about −1.87; since this is below the 5% critical value of −1.65, we reject H0: βlog(enroll) = 0 in favor of H1: βlog(enroll) < 0 at the 5% level.
In Chapter 2, we encountered a model where the dependent variable appeared in its original form (called level form), while the independent variable appeared in log form (called a level-log model). The interpretation of the parameters is the same in the multiple regression context, except, of course, that we can give the parameters a ceteris paribus interpretation. Holding totcomp and staff fixed, we have Δmath10^ = −1.29[Δlog(enroll)], so that

Δmath10^ = −(1.29/100)(%Δenroll) ≈ −.013(%Δenroll).

Once again, we have used the fact that the change in log(enroll), when multiplied by 100, is approximately the percentage change in enroll. Thus, if enrollment is 10% higher at a school, math10^ is predicted to be about 1.3 percentage points lower (math10 is measured as a percentage).
Which model do we prefer: the one using the level of enroll or the one using log(enroll)? In the level-level model, enrollment does not have a statistically significant effect, but in the level-log model it does. This translates into a higher R-squared for the level-log model, which means we explain more of the variation in math10 by using enroll in logarithmic form (6.5% versus 5.4%). The level-log model is preferred because it more closely captures the relationship between math10 and enroll. We will say more about using R-squared to choose functional form in Chapter 6.
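The division-by-100 shortcut used above is an approximation to the exact log change. A small sketch comparing the two for a 10% enrollment difference, using the reported level-log coefficient:

```python
import math

b = -1.29  # coefficient on log(enroll) in the level-log model

# Predicted change in math10 (percentage points) when enrollment is 10% higher
exact = b * math.log(1.10)  # exact change in log(enroll) is log(1.10)
approx = (b / 100) * 10     # the text's rule of thumb: -.013 per 1%

print(round(exact, 3), round(approx, 3))
```

The two answers differ only in the third decimal place, which is why the percentage approximation is harmless for changes of this size.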

Two-Sided Alternatives
In applications, it is common to test the null hypothesis H0: βj = 0 against a two-sided alternative, that is,

H1: βj ≠ 0.    (4.10)

Under this alternative, xj has a ceteris paribus effect on y without specifying whether the effect is positive or negative. This is the relevant alternative when the sign of βj is not well-determined by theory (or common sense). Even when we know whether βj is positive or negative under the alternative, a two-sided test is often prudent. At a minimum,

using a two-sided alternative prevents us from looking at the estimated equation and then basing the alternative on whether β̂j is positive or negative. Using the regression estimates to help us formulate the null or alternative hypotheses is not allowed, because classical statistical inference presumes that we state the null and alternative about the population before looking at the data. For example, we should not first estimate the equation relating math performance to enrollment, note that the estimated effect is negative, and then decide the relevant alternative is H1: βenroll < 0.
When the alternative is two-sided, we are interested in the absolute value of the t statistic. The rejection rule for H0: βj = 0 against (4.10) is

|t_β̂j| > c,    (4.11)

where |·| denotes absolute value and c is an appropriately chosen critical value. To find c, we again specify a significance level, say 5%. For a two-tailed test, c is chosen to make the area in each tail of the t distribution equal to 2.5%. In other words, c is the 97.5th percentile of the t distribution with n − k − 1 degrees of freedom. When n − k − 1 = 25, the 5% critical value for a two-sided test is c = 2.060. Figure 4.4 provides an illustration of this distribution.
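The two-sided rule (4.11) can be sketched alongside its large-sample counterpart. The 25-df critical value 2.060 is taken from the text; the standard normal value is computed, and the comparison shows why small-sample critical values matter:

```python
from statistics import NormalDist

def reject_two_sided(t_stat, c):
    """Reject H0: beta_j = 0 against H1: beta_j != 0 iff |t| > c."""
    return abs(t_stat) > c

c25 = 2.060                            # 97.5th percentile, t with 25 df (from the text)
c_inf = NormalDist().inv_cdf(0.975)    # large-sample (normal) value, about 1.96

# A borderline t statistic of 2.0 rejects under the normal approximation
# but not with only 25 degrees of freedom
print(round(c_inf, 2), reject_two_sided(2.0, c_inf), reject_two_sided(2.0, c25))
```

Because the comparison uses the absolute value, t = −2.5 and t = 2.5 are treated the same, exactly as in Figure 4.4.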

[Figure 4.4: 5% rejection rule for the alternative H1: βj ≠ 0 with 25 df. The rejection region consists of the two tails of the t distribution, each with area .025: t < −2.06 and t > 2.06.]


When a specific alternative is not stated, it is usually considered to be two-sided. In the remainder of this text, the default will be a two-sided alternative, and 5% will be the default significance level. When carrying out empirical econometric analysis, it is always a good idea to be explicit about the alternative and the significance level. If H0 is rejected in favor of (4.10) at the 5% level, we usually say that “xj is statistically significant, or statistically different from zero, at the 5% level.” If H0 is not rejected, we say that “xj is statistically insignificant at the 5% level.”
EXAMPLE 4.3 (Determinants of College GPA)

We use GPA1.RAW to estimate a model explaining college GPA (colGPA), with the average number of lectures missed per week (skipped) as an additional explanatory variable. The estimated model is

colGPA^ = 1.39 + .412 hsGPA + .015 ACT − .083 skipped
         (0.33)  (.094)       (.011)     (.026)
n = 141, R² = .234.

We can easily compute t statistics to see which variables are statistically significant, using a two-sided alternative in each case. The 5% critical value is about 1.96, since the degrees of freedom (141 − 4 = 137) are large enough to use the standard normal approximation. The 1% critical value is about 2.58.
The t statistic on hsGPA is 4.38, which is significant at very small significance levels. Thus, we say that "hsGPA is statistically significant at any conventional significance level." The t statistic on ACT is 1.36, which is not statistically significant at the 10% level against a two-sided alternative. The coefficient on ACT is also practically small: a 10-point increase in ACT, which is large, is predicted to increase colGPA by only .15 point. Thus, the variable ACT is practically, as well as statistically, insignificant.
The coefficient on skipped has a t statistic of −.083/.026 = −3.19, so skipped is statistically significant at the 1% significance level (3.19 > 2.58). This coefficient means that another lecture missed per week lowers predicted colGPA by about .083. Thus, holding hsGPA and ACT fixed, the predicted difference in colGPA between a student who misses no lectures per week and a student who misses five lectures per week is about .42. Remember that this says nothing about specific students; it pertains to average students across the population.
In this example, for each variable in the model, we could argue that a one-sided alternative is appropriate. The variables hsGPA and skipped are very significant using a two-tailed test and have the signs that we expect, so there is no reason to do a one-tailed test. On the other hand, against a one-sided alternative (β3 > 0), ACT is significant at the 10% level but not at the 5% level. This does not change the fact that the coefficient on ACT is pretty small.
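The three t statistics in Example 4.3 can be computed in one pass. A sketch using the reported coefficients and standard errors (not recomputed from GPA1.RAW), with the normal approximation for the 5% two-sided critical value:

```python
# Reported coefficients and standard errors from Example 4.3
coefs = {
    "hsGPA":   (0.412, 0.094),
    "ACT":     (0.015, 0.011),
    "skipped": (-0.083, 0.026),
}
c05 = 1.96  # two-sided 5% critical value (137 df, normal approximation)

# For each variable: (t statistic, significant at 5% two-sided?)
results = {name: (round(b / se, 2), abs(b / se) > c05)
           for name, (b, se) in coefs.items()}

print(results)
```

The output reproduces the discussion above: hsGPA and skipped are significant at the 5% level, while ACT is not.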

Testing Other Hypotheses About βj
Although H0: βj = 0 is the most common hypothesis, we sometimes want to test whether βj is equal to some other given constant. Two common examples are βj = 1 and βj = −1. Generally, if the null is stated as

H0: βj = aj,    (4.12)

where aj is our hypothesized value of βj, then the appropriate t statistic is

t = (β̂j − aj)/se(β̂j).

As before, t measures how many estimated standard deviations β̂j is from the hypothesized value of βj. The general t statistic is usefully written as

t = (estimate − hypothesized value)/standard error.    (4.13)

Under (4.12), this t statistic is distributed as t with n − k − 1 degrees of freedom, from Theorem 4.2. The usual t statistic is obtained when aj = 0. We can use the general t statistic to test against one-sided or two-sided alternatives. For example, if the null and alternative hypotheses are H0: βj = 1 and H1: βj > 1, then we find the critical value for a one-sided alternative exactly as before: the difference is in how we compute the t statistic, not in how we obtain the appropriate c. We reject H0 in favor of H1 if t > c. In this case, we would say that "β̂j is statistically greater than one" at the appropriate significance level.
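Equation (4.13) can be sketched as a one-line function. The numbers below are hypothetical, invented purely for illustration (they are not estimates from the text): an elasticity estimate of 1.25 with standard error 0.10, tested against H0: β1 = 1:

```python
def t_stat(estimate, hypothesized, se):
    """The general t statistic of equation (4.13)."""
    return (estimate - hypothesized) / se

# Hypothetical numbers, not from the text: estimated elasticity 1.25,
# standard error 0.10, testing H0: beta_1 = 1 against H1: beta_1 > 1
t = t_stat(1.25, 1.0, 0.10)
c05 = 1.645  # one-sided 5% critical value, large-sample approximation

print(round(t, 2), t > c05)
```

Note that the same function with `hypothesized=0` gives the usual t statistic; only the centering changes, not the rejection rule.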

EXAMPLE 4.4 (Campus Crime and Enrollment)

Consider a simple model relating the annual number of crimes on college campuses (crime) to student enrollment (enroll):

log(crime) = β0 + β1 log(enroll) + u.

This is a constant elasticity model, where β1 is the elasticity of crime with respect to enrollment. It is not much use to test H0: β1 = 0, as we expect the total number of crimes to increase as the size of the campus increases. A more interesting hypothesis to test is that the elasticity of crime with respect to enrollment is one: H0: β1 = 1. This means that a 1% increase in enrollment leads to, on average, a 1% increase in crime. A noteworthy alternative is H1: β1 > 1, which implies that a 1% increase in enrollment increases campus crime by more than 1%. If β1 > 1, then, in a relative sense, not just an absolute sense, crime is more of a problem on larger campuses. One way to see this is to take the exponential of the equation:

crime = exp(β0) · enroll^β1 · exp(u).

(See Appendix A for properties of the natural logarithm and exponential functions.) For β0 = 0 and u = 0, this equation is graphed in Figure 4.5 for β1 < 1, β1 = 1, and β1 > 1.
We test H0: β1 = 1 against H1: β1 > 1 using data on 97 colleges and universities in the United States for the year 1992. The data come from the FBI's Uniform Crime Reports; the average number of campus crimes in the sample is about 394, while the average enrollment is about 16,076. The estimated equation (with estimates and standard errors rounded to two decimal places) is

[Figure 4.5: Graph of crime = enroll^β1 for β1 < 1, β1 = 1, and β1 > 1.]