Assumptions of the Regression Model. The assumptions listed below enable us to calculate unbiased estimators of the population regression function coefficients and to use these coefficients in predicting values of Y given X. You should be aware that violation of one or more of these assumptions reduces the efficiency of the model, but a detailed discussion of this topic is beyond the purview of this text. Assume that all these assumptions have been met.

• For each value of X there is an array of possible Y values normally distributed about the

regression line.

• The mean of the distribution of possible Y values is on the regression line, i.e., the

expected value of the error term is zero.

• The standard deviation of the distribution of possible Y values is constant regardless of

the value of X (this is called homoscedasticity).

• The error terms are statistically independent of each other, i.e., there is no serial

correlation.

• The error term is statistically independent of X.

Note: These assumptions are very important, in that they enable us to construct prediction intervals around our point estimates.
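One property behind the second assumption can be demonstrated numerically: for any least-squares fit, the residuals (the estimated error terms) average exactly to zero. The sketch below uses hypothetical data, not the manufacturing overhead observations discussed later in this chapter:

```python
# For a least-squares line, the residuals always average to zero,
# matching the assumption that the expected value of the error term
# is zero. The (X, Y) pairs below are hypothetical illustration data.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(X)

# Least-squares slope (B) and intercept (A).
x_bar, y_bar = sum(X) / n, sum(Y) / n
B = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
A = y_bar - B * x_bar

residuals = [y - (A + B * x) for x, y in zip(X, Y)]
mean_residual = sum(residuals) / n  # ≈ 0 (up to floating-point error)
```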

Variation in the Regression Model. Recall that the purpose of regression analysis is to predict

the value of a dependent variable given the value of the independent variable. The LSBF

technique yields the best single line to fit the data, but you also want some method of

determining how good this estimating equation is. In order to do this, you must first partition the

variation.

• Total Variation. The sum of squares total (SST) is a measure of the total variation of Y.

SST is the sum of the squared differences between the observed values of Y and the mean

of Y.

SST = Σ(Yi – Ȳ)²

Where:

SST = Sum of squared differences

Yi = Observed value i

Ȳ = Mean value of Y

While the above formula provides a clear picture of the meaning of SST, you can use the following formula to speed SST calculation:

SST = ΣY² – (ΣY)²/n

Total variation can be partitioned into two categories of variation: explained and unexplained. This

can be expressed as

SST = SSR + SSE

• Explained Variation. The sum of squares regression (SSR) is a measure of the variation of Y that is explained by the regression equation. SSR is the sum of the squared differences between the calculated value of Y (Yc) and the mean of Y (Ȳ).

SSR = Σ(Yc – Ȳ)²

You can use the following formula to speed SSR calculation:

SSR = A(ΣY) + B(ΣXY) – (ΣY)²/n

• Unexplained Variation. The sum of squares error (SSE) is a measure of the variation of Y that is not explained by the regression equation. SSE is the sum of the squared differences between the observed values of Y and the calculated value of Y. This is the random variation of the observations around the regression line.

SSE = Σ(Yi – Yc)²

You can use the following formula to speed SSE calculation:

SSE = ΣY² – A(ΣY) – B(ΣXY)

Analysis of Variance. Variance is equal to variation divided by degrees of freedom (df). In

regression analysis, df is a statistical concept that is used to adjust for sample bias in estimating

the population mean.

• Mean Square Regression (MSR).

MSR = SSR/df

For 2-variable linear regression, the value of df for calculating MSR is always one (1). As a result, in 2-variable linear regression, you can simplify the equation for MSR to read:

MSR = SSR

• Mean Square Error (MSE).

MSE = SSE/df

In 2-variable linear regression, df for calculating MSE is always n – 2. As a result, in simple regression, you can simplify the equation for MSE to read:

MSE = SSE/(n – 2)

• Analysis of Variance Table. The terms used to analyze variation/variance in the

regression model are commonly summarized in an Analysis of Variance (ANOVA) table.

ANOVA Table

Source       Sum of Squares   df      Mean Square**
Regression   SSR              1       MSR
Error        SSE              n – 2   MSE
Total        SST              n – 1

**Mean Square = Sum of Squares/df

Constructing an ANOVA Table for the Manufacturing Overhead Example. Before you can calculate variance and variation, you must use the observations to calculate the statistics in the table below. Since we already calculated these statistics to develop the regression equation to estimate manufacturing overhead, we will begin our calculations with the values in the table below:

Statistic   Value
ΣX          144
ΣY          846
ΣXY         22,647
ΣX²         3,872
ΣY²         133,296
X̄           24
Ȳ           141
A           5.8272
B           5.6322
n           6

Step 1. Calculate SST.

SST = ΣY² – (ΣY)²/n = 133,296 – (846)²/6 = 133,296 – 119,286 = 14,010

Step 2. Calculate SSR.

SSR = A(ΣY) + B(ΣXY) – (ΣY)²/n = (5.8272)(846) + (5.6322)(22,647) – 119,286 ≈ 13,196

Step 3. Calculate SSE.

SSE = ΣY² – A(ΣY) – B(ΣXY) = 133,296 – (5.8272)(846) – (5.6322)(22,647) ≈ 814

Step 4. Calculate MSR.

MSR = SSR/1 = 13,196

Step 5. Calculate MSE.

MSE = SSE/(n – 2) = 814/4 = 203.5 ≈ 204

Step 6. Combine the calculated values into an ANOVA table.

ANOVA Table

Source       Sum of Squares   df   Mean Square**

Regression 13,196 1 13,196

Error 814 4 204

Total 14,010 5

**Mean Square = Sum of Squares/df

Step 7. Check SST. Assure that value for SST is equal to SSR plus SSE.

SST = SSR + SSE

14,010 = 13,196 + 814

14,010 = 14,010
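Steps 1 through 6 can be reproduced with a short script using the shortcut formulas and the summary statistics from the table above (a sketch; the variable names are ours, and small differences from the text's figures are due to rounding):

```python
# Sketch: reproducing ANOVA Steps 1-6 from the summary statistics in
# the manufacturing overhead example (values from the table above).
n = 6
sum_y = 846            # ΣY
sum_xy = 22_647        # ΣXY
sum_y2 = 133_296       # ΣY²
A, B = 5.8272, 5.6322  # intercept and slope of the regression line

SST = sum_y2 - sum_y ** 2 / n                  # Step 1: ≈ 14,010
SSR = A * sum_y + B * sum_xy - sum_y ** 2 / n  # Step 2: ≈ 13,196
SSE = sum_y2 - A * sum_y - B * sum_xy          # Step 3: ≈ 814
MSR = SSR / 1                                  # Step 4: df = 1
MSE = SSE / (n - 2)                            # Step 5: ≈ 203.4 (the text
                                               # rounds SSE to 814 first, giving 204)

# Step 7: the partition SST = SSR + SSE must hold.
assert abs(SST - (SSR + SSE)) < 1e-6
```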

5.4 – Measuring How Well The Regression Equation Fits The Data

Statistics Used to Measure Goodness of Fit. How well does the equation fit the data used in

developing the equation? Three statistics are commonly used to determine the “goodness of fit”

of the regression equation:

• Coefficient of determination;

• Standard error of the estimate; and

• T-test for significance of the regression equation.

Calculating the Coefficient of Determination. Most computer software designed to fit a line using regression analysis will also provide the coefficient of determination for that line. The coefficient of determination (r²) measures the strength of the association between the independent and dependent variables (X and Y).

The range of r² is between zero and one.

0 ≤ r² ≤ 1

An r² of zero indicates that there is no relationship between X and Y. An r² of one indicates that there is a perfect relationship between X and Y. The closer r² gets to 1, the better the regression line fits the data set. In fact, r² is the ratio of explained variation (SSR) to total variation (SST). An r² of .90 indicates that 90 percent of the variation in Y has been explained by its relationship with X; that is, 90 percent of the variation in Y has been explained by the regression line.

For the manufacturing overhead example:

r² = SSR/SST = 13,196/14,010 ≈ .9419

This means that approximately 94% of the variation in manufacturing overhead (Y) can be explained by its relationship with manufacturing direct labor hours (X).

Standard Error of the Estimate. The standard error of the estimate (SEE) is a measure of the accuracy of the estimating (regression) equation. The SEE indicates the variability of the observed (actual) points around the regression line (predicted points). That is, it measures the extent to which the observed values (Yi) differ from their calculated values (Yc). Given the first two assumptions required for use of the regression model (for each value of X there is an array of possible Y values which is normally distributed about the regression line, and the mean of this distribution (Yc) is on the regression line), the SEE is interpreted in a way similar to the standard deviation. That is, given a value for X, we would generally expect the following intervals (based on the Empirical Rule):

• Yc ± 1 SEE contains approximately 68 percent of the total observations (Yi)

• Yc ± 2 SEE contains approximately 95 percent of the total observations (Yi)

• Yc ± 3 SEE contains approximately 99 percent of the total observations (Yi)

The SEE is equal to the square root of the MSE. For the manufacturing overhead example:

SEE = √MSE = √203.5 ≈ 14.27

Steps for Conducting the T-test for the Significance of the Regression Equation. The regression line is derived from a sample.
Because of sampling error, it is possible to get a regression relationship with a rather high r² (e.g., greater than 80 percent) when there is no real relationship between X and Y; that is, when there is no statistical significance. This phenomenon will occur only when you have very small sample data sets. You can test the significance of the regression equation by applying the T-test. Applying the T-test is a 4-step process:

Step 1. Determine the significance level (α).

α = 1 – confidence level

The selection of the significance level is a management decision; that is, management decides the level of risk associated with an estimate which it will accept. In the absence of any other guidance, use a significance level of .10.

Step 2. Calculate T. Use the values of MSR and MSE from the ANOVA table:

T = √(MSR/MSE)

Step 3. Determine the table value of t. From a t table, select the t value for the appropriate degrees of freedom (df). In 2-variable linear regression:

df = n – 2

Step 4. Compare T to the t table value. Decision rules:

If T > t, use the regression equation for prediction purposes. It is likely that the relationship is

significant.

If T < t, do not use the regression equation for prediction purposes. It is likely that the

relationship is not significant.

If T = t, a highly unlikely situation, you are theoretically indifferent and may elect to use or not

use the regression equation for prediction purposes.

Conducting the T-test for the Significance of the Regression Equation for the Manufacturing

Overhead Example.

To demonstrate use of the T-test, we will apply the 4-step procedure to the manufacturing

overhead example:

Step 1. Determine the significance level (α). Assume that we have been told to use α = .05.

Step 2. Calculate T.

T = √(MSR/MSE) = √(13,196/204) = √64.69 ≈ 8.043

Step 3. Determine the table value of t. The partial table below is an excerpt of a t table.

df = n – 2 = 6 – 2 = 4

Partial t Table

df t

2 4.303

3 3.182

4 2.776

5 2.571

6 2.447

Reading from the table, the appropriate value is 2.776.
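Step 3 amounts to a simple table lookup. A minimal sketch, encoding the partial t table above:

```python
# Sketch: selecting the critical t value for df = n - 2, using the
# partial t table from the text.
t_table = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}

n = 6                     # sample size in the manufacturing overhead example
df = n - 2                # degrees of freedom for 2-variable regression
t_critical = t_table[df]  # 2.776, matching the value read from the table
```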

Step 4. Compare T to the t Table value. Since T (8.043) > t (2.776), use the regression

equation for prediction purposes. It is likely that the relationship is significant.

Note: There is not normally a conflict between the decision indicated by the T-test and the magnitude of r². If r² is high, T is normally > t. A conflict could occur only in a situation where there are very few data points. In those rare instances where there is a conflict, you should accept the decision indicated by the T-test. It is a better indicator than r² because it takes into account the sample size (n) through the degrees of freedom (df).
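The three goodness-of-fit statistics for the example can be computed together from the ANOVA values. A sketch (T is computed as √(MSR/MSE), the form used in this section; the small difference from the text's 8.043 reflects rounding MSE to 204):

```python
import math

# Sketch: goodness-of-fit statistics for the manufacturing overhead
# example, from the ANOVA table values in the text.
SSR, SSE, SST = 13_196, 814, 14_010
n = 6
MSR = SSR / 1        # df = 1
MSE = SSE / (n - 2)  # df = n - 2 = 4

r_squared = SSR / SST     # ≈ .9419: about 94% of variation explained
SEE = math.sqrt(MSE)      # ≈ 14.27: standard error of the estimate
T = math.sqrt(MSR / MSE)  # ≈ 8.05 (the text reports 8.043 using MSE rounded to 204)

t_critical = 2.776            # from the partial t table, df = 4
significant = T > t_critical  # True: use the equation for prediction
```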

5.5 – Calculating And Using A Prediction Interval

Formulating the Prediction Interval. You can develop a regression equation and use it to

calculate a point estimate for Y given any value of X. However, a point estimate alone does not

provide enough information for sound negotiations. You need to be able to establish a range of

values which you are confident contains the true value of the cost or price which you are trying

to predict. In regression analysis, this range is known as the prediction interval.

For a regression equation based on a small sample, you should develop a prediction interval, using the following equation:

Yc ± t(SEE)√(1 + 1/n + (X – X̄)²/(ΣX² – nX̄²))

where t is the t table value for n – 2 degrees of freedom at the desired significance level.

Constructing a Prediction Interval for the Manufacturing Overhead Example. Assume that we

want to construct a 95 percent prediction interval for the manufacturing overhead estimate at

2,100 manufacturing direct labor hours. Earlier in the chapter, we calculated Yc and the other

statistics in the following table:

Statistic      Value
Yc             124.1034
t (n – 2 df)   2.776
SEE            14.27
X̄              24
ΣX²            3,872

Using the table data, you would calculate the prediction interval as follows:

Yc ± t(SEE)√(1 + 1/n + (X – X̄)²/(ΣX² – nX̄²))
= 124.1034 ± (2.776)(14.27)√(1 + 1/6 + (21 – 24)²/(3,872 – 6(24)²))
= 124.1034 ± (2.776)(14.27)(1.0901)
= 124.1034 ± 43.1827

When X = 21, the prediction interval is: 80.9207 ≤ Y ≤ 167.2861.
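The interval arithmetic can be checked with a short script using the statistics from the table above (a sketch; small differences in the last decimal places reflect rounding):

```python
import math

# Sketch: the 95 percent prediction interval at X = 21 (i.e., 2,100
# direct labor hours; X is in hundreds of hours, Y in thousands of
# dollars), using the statistics from the table above.
Yc = 124.1034   # point estimate at X = 21
t = 2.776       # t table value for n - 2 = 4 df
SEE = 14.27     # standard error of the estimate
n, X, x_bar, sum_x2 = 6, 21, 24, 3_872

half_width = t * SEE * math.sqrt(
    1 + 1 / n + (X - x_bar) ** 2 / (sum_x2 - n * x_bar ** 2)
)
lower, upper = Yc - half_width, Yc + half_width  # ≈ 80.92 and ≈ 167.29
```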

Prediction Statement: We would be 95 percent confident that the actual manufacturing

overhead will be between $80,921 and $167,286 at 2,100 manufacturing direct labor hours.