Assumptions of the Regression Model. The assumptions listed below enable us to calculate unbiased estimators of the population regression function coefficients and to use these coefficients in predicting values of Y given X. You should be aware that violation of one or more of these assumptions reduces the efficiency of the model, but a detailed discussion of this topic is beyond the purview of this text. Assume that all these assumptions have been met.

• For each value of X there is an array of possible Y values normally distributed about the

regression line.

• The mean of the distribution of possible Y values is on the regression line, i.e., the

expected value of the error term is zero.

• The standard deviation of the distribution of possible Y values is constant regardless of

the value of X (this is called homoscedasticity).

• The error terms are statistically independent of each other, i.e., there is no serial

correlation.

• The error term is statistically independent of X.

Note: These assumptions are very important, in that they enable us to construct prediction intervals around our point estimates.
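One property behind the second assumption can be demonstrated numerically: for any least-squares fit, the residuals (the estimated error terms) average exactly to zero. The sketch below uses hypothetical data, not the manufacturing overhead observations discussed later in this chapter:

```python
# For a least-squares line, the residuals always average to zero,
# matching the assumption that the expected value of the error term
# is zero. The (X, Y) pairs below are hypothetical illustration data.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.0, 9.8]
n = len(X)

# Least-squares slope (B) and intercept (A).
x_bar, y_bar = sum(X) / n, sum(Y) / n
B = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
    sum((x - x_bar) ** 2 for x in X)
A = y_bar - B * x_bar

residuals = [y - (A + B * x) for x, y in zip(X, Y)]
mean_residual = sum(residuals) / n  # ≈ 0 (up to floating-point error)
```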

Variation in the Regression Model. Recall that the purpose of regression analysis is to predict

the value of a dependent variable given the value of the independent variable. The LSBF

technique yields the best single line to fit the data, but you also want some method of

determining how good this estimating equation is. In order to do this, you must first partition the

variation.

• Total Variation. The sum of squares total (SST) is a measure of the total variation of Y.

SST is the sum of the squared differences between the observed values of Y and the mean

of Y.

SST = Σ(Yi – Ȳ)²

Where:

SST = Sum of squared differences

Yi = Observed value i

Ȳ = Mean value of Y

While the above formula provides a clear picture of the meaning of SST, you can use the following formula to speed SST calculation:

SST = ΣY² – (ΣY)²/n

Total variation can be partitioned into two categories of variation: explained and unexplained. This

can be expressed as

SST = SSR + SSE

• Explained Variation. The sum of squares regression (SSR) is a measure of the variation of Y that is explained by the regression equation. SSR is the sum of the squared differences between the calculated value of Y (Yc) and the mean of Y (Ȳ).

SSR = Σ(Yc – Ȳ)²

You can use the following formula to speed SSR calculation:

SSR = A(ΣY) + B(ΣXY) – (ΣY)²/n

• Unexplained Variation. The sum of squares error (SSE) is a measure of the variation of Y that is not explained by the regression equation. SSE is the sum of the squared differences between the observed values of Y and the calculated value of Y. This is the random variation of the observations around the regression line.

SSE = Σ(Yi – Yc)²

You can use the following formula to speed SSE calculation:

SSE = ΣY² – A(ΣY) – B(ΣXY)

Analysis of Variance. Variance is equal to variation divided by degrees of freedom (df). In

regression analysis, df is a statistical concept that is used to adjust for sample bias in estimating

the population mean.

• Mean Square Regression (MSR).

MSR = SSR/df

For 2-variable linear regression, the value of df for calculating MSR is always one (1). As a result, in 2-variable linear regression, you can simplify the equation for MSR to read:

MSR = SSR

• Mean Square Error (MSE).

MSE = SSE/df

In 2-variable linear regression, df for calculating MSE is always n – 2. As a result, in simple regression, you can simplify the equation for MSE to read:

MSE = SSE/(n – 2)

• Analysis of Variance Table. The terms used to analyze variation/variance in the

regression model are commonly summarized in an Analysis of Variance (ANOVA) table.

ANOVA Table

Source       Sum of Squares   df      Mean Square**
Regression   SSR              1       MSR
Error        SSE              n – 2   MSE
Total        SST              n – 1

**Mean Square = Sum of Squares/df

Constructing an ANOVA Table for the Manufacturing Overhead Example. Before you can calculate variance and variation, you must use the observations to calculate the statistics in the table below. Since we already calculated these statistics to develop the regression equation to estimate manufacturing overhead, we will begin our calculations with the values in the table below:

Statistic   Value
ΣX          144
ΣY          846
ΣXY         22,647
ΣX²         3,872
ΣY²         133,296
X̄           24
Ȳ           141
A           5.8272
B           5.6322
n           6

Step 1. Calculate SST.

SST = ΣY² – (ΣY)²/n = 133,296 – (846)²/6 = 133,296 – 119,286 = 14,010

Step 2. Calculate SSR.

SSR = A(ΣY) + B(ΣXY) – (ΣY)²/n = (5.8272)(846) + (5.6322)(22,647) – 119,286 ≈ 13,196

Step 3. Calculate SSE.

SSE = ΣY² – A(ΣY) – B(ΣXY) = 133,296 – (5.8272)(846) – (5.6322)(22,647) ≈ 814

Step 4. Calculate MSR.

MSR = SSR/1 = 13,196

Step 5. Calculate MSE.

MSE = SSE/(n – 2) = 814/4 = 203.5 ≈ 204

Step 6. Combine the calculated values into an ANOVA table.

ANOVA Table

Source       Sum of Squares   df   Mean Square**

Regression 13,196 1 13,196

Error 814 4 204

Total 14,010 5

**Mean Square = Sum of Squares/df

Step 7. Check SST. Assure that value for SST is equal to SSR plus SSE.

SST = SSR + SSE

14,010 = 13,196 + 814

14,010 = 14,010
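Steps 1 through 6 can be reproduced with a short script using the shortcut formulas and the summary statistics from the table above (a sketch; the variable names are ours, and small differences from the text's figures are due to rounding):

```python
# Sketch: reproducing ANOVA Steps 1-6 from the summary statistics in
# the manufacturing overhead example (values from the table above).
n = 6
sum_y = 846            # ΣY
sum_xy = 22_647        # ΣXY
sum_y2 = 133_296       # ΣY²
A, B = 5.8272, 5.6322  # intercept and slope of the regression line

SST = sum_y2 - sum_y ** 2 / n                  # Step 1: ≈ 14,010
SSR = A * sum_y + B * sum_xy - sum_y ** 2 / n  # Step 2: ≈ 13,196
SSE = sum_y2 - A * sum_y - B * sum_xy          # Step 3: ≈ 814
MSR = SSR / 1                                  # Step 4: df = 1
MSE = SSE / (n - 2)                            # Step 5: ≈ 203.4 (the text
                                               # rounds SSE to 814 first, giving 204)

# Step 7: the partition SST = SSR + SSE must hold.
assert abs(SST - (SSR + SSE)) < 1e-6
```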

5.4 – Measuring How Well The Regression Equation Fits The Data

Statistics Used to Measure Goodness of Fit. How well does the equation fit the data used in

developing the equation? Three statistics are commonly used to determine the “goodness of fit”

of the regression equation:

• Coefficient of determination;

• Standard error of the estimate; and

• T-test for significance of the regression equation.

Calculating the Coefficient of Determination. Most computer software designed to fit a line using regression analysis will also provide the coefficient of determination for that line. The coefficient of determination (r²) measures the strength of the association between the independent and dependent variables (X and Y).

The range of r² is between zero and one.

0 ≤ r² ≤ 1

An r² of zero indicates that there is no relationship between X and Y. An r² of one indicates that there is a perfect relationship between X and Y. The closer r² gets to 1, the better the regression line fits the data set. In fact, r² is the ratio of explained variation (SSR) to total variation (SST). An r² of .90 indicates that 90 percent of the variation in Y has been explained by its relationship with X; that is, 90 percent of the variation in Y has been explained by the regression line.

For the manufacturing overhead example:

r² = SSR/SST = 13,196/14,010 ≈ .9419

This means that approximately 94% of the variation in manufacturing overhead (Y) can be explained by its relationship with manufacturing direct labor hours (X).

Standard Error of the Estimate. The standard error of the estimate (SEE) is a measure of the accuracy of the estimating (regression) equation. The SEE indicates the variability of the observed (actual) points around the regression line (predicted points). That is, it measures the extent to which the observed values (Yi) differ from their calculated values (Yc). Given the first two assumptions required for use of the regression model (for each value of X there is an array of possible Y values which is normally distributed about the regression line, and the mean of this distribution (Yc) is on the regression line), the SEE is interpreted in a way similar to the standard deviation. That is, given a value for X, we would generally expect the following intervals (based on the Empirical Rule):

• Yc ± 1 SEE contains approximately 68 percent of the total observations (Yi)

• Yc ± 2 SEE contains approximately 95 percent of the total observations (Yi)

• Yc ± 3 SEE contains approximately 99 percent of the total observations (Yi)

The SEE is equal to the square root of the MSE. For the manufacturing overhead example:

SEE = √MSE = √203.5 ≈ 14.27

Steps for Conducting the T-test for the Significance of the Regression Equation. The regression line is derived from a sample.
Because of sampling error, it is possible to get a regression relationship with a rather high r² (e.g., greater than 80 percent) when there is no real relationship between X and Y; that is, when there is no statistical significance. This phenomenon will occur only when you have very small sample data sets. You can test the significance of the regression equation by applying the T-test. Applying the T-test is a 4-step process:

Step 1. Determine the significance level (α).

α = 1 – confidence level

The selection of the significance level is a management decision; that is, management decides the level of risk associated with an estimate which it will accept. In the absence of any other guidance, use a significance level of .10.

Step 2. Calculate T. Use the values of MSR and MSE from the ANOVA table:

T = √(MSR/MSE)

Step 3. Determine the table value of t. From a t table, select the t value for the appropriate degrees of freedom (df). In 2-variable linear regression:

df = n – 2

Step 4. Compare T to the t table value. Decision rules:

If T > t, use the regression equation for prediction purposes. It is likely that the relationship is

significant.

If T < t, do not use the regression equation for prediction purposes. It is likely that the

relationship is not significant.

If T = t, a highly unlikely situation, you are theoretically indifferent and may elect to use or not

use the regression equation for prediction purposes.

Conducting the T-test for the Significance of the Regression Equation for the Manufacturing

Overhead Example.

To demonstrate use of the T-test, we will apply the 4-step procedure to the manufacturing

overhead example:

Step 1. Determine the significance level (α). Assume that we have been told to use α = .05.

Step 2. Calculate T.

T = √(MSR/MSE) = √(13,196/204) = √64.69 ≈ 8.043

Step 3. Determine the table value of t. The partial table below is an excerpt of a t table.

df = n – 2 = 6 – 2 = 4

Partial t Table

df t

2 4.303

3 3.182

4 2.776

5 2.571

6 2.447

Reading from the table, the appropriate value is 2.776.
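Step 3 amounts to a simple table lookup. A minimal sketch, encoding the partial t table above:

```python
# Sketch: selecting the critical t value for df = n - 2, using the
# partial t table from the text.
t_table = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447}

n = 6                     # sample size in the manufacturing overhead example
df = n - 2                # degrees of freedom for 2-variable regression
t_critical = t_table[df]  # 2.776, matching the value read from the table
```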

Step 4. Compare T to the t Table value. Since T (8.043) > t (2.776), use the regression

equation for prediction purposes. It is likely that the relationship is significant.

Note: There is not normally a conflict between the decision indicated by the T-test and the magnitude of r². If r² is high, T is normally > t. A conflict could occur only in a situation where there are very few data points. In those rare instances where there is a conflict, you should accept the decision indicated by the T-test. It is a better indicator than r² because it takes into account the sample size (n) through the degrees of freedom (df).
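The three goodness-of-fit statistics for the example can be computed together from the ANOVA values. A sketch (T is computed as √(MSR/MSE), the form used in this section; the small difference from the text's 8.043 reflects rounding MSE to 204):

```python
import math

# Sketch: goodness-of-fit statistics for the manufacturing overhead
# example, from the ANOVA table values in the text.
SSR, SSE, SST = 13_196, 814, 14_010
n = 6
MSR = SSR / 1        # df = 1
MSE = SSE / (n - 2)  # df = n - 2 = 4

r_squared = SSR / SST     # ≈ .9419: about 94% of variation explained
SEE = math.sqrt(MSE)      # ≈ 14.27: standard error of the estimate
T = math.sqrt(MSR / MSE)  # ≈ 8.05 (the text reports 8.043 using MSE rounded to 204)

t_critical = 2.776            # from the partial t table, df = 4
significant = T > t_critical  # True: use the equation for prediction
```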

5.5 – Calculating And Using A Prediction Interval

Formulating the Prediction Interval. You can develop a regression equation and use it to

calculate a point estimate for Y given any value of X. However, a point estimate alone does not

provide enough information for sound negotiations. You need to be able to establish a range of

values which you are confident contains the true value of the cost or price which you are trying

to predict. In regression analysis, this range is known as the prediction interval.

For a regression equation based on a small sample, you should develop a prediction interval, using the following equation:

Yc ± t(SEE)√(1 + 1/n + (X – X̄)²/(ΣX² – nX̄²))

where t is the t table value for n – 2 degrees of freedom at the desired significance level.

Constructing a Prediction Interval for the Manufacturing Overhead Example. Assume that we

want to construct a 95 percent prediction interval for the manufacturing overhead estimate at

2,100 manufacturing direct labor hours. Earlier in the chapter, we calculated Yc and the other

statistics in the following table:

Statistic      Value
Yc             124.1034
t (n – 2 df)   2.776
SEE            14.27
X̄              24
ΣX²            3,872

Using the table data, you would calculate the prediction interval as follows:

Yc ± t(SEE)√(1 + 1/n + (X – X̄)²/(ΣX² – nX̄²))
= 124.1034 ± (2.776)(14.27)√(1 + 1/6 + (21 – 24)²/(3,872 – 6(24)²))
= 124.1034 ± (2.776)(14.27)(1.0901)
= 124.1034 ± 43.1827

When X = 21, the prediction interval is: 80.9207 ≤ Y ≤ 167.2861.
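The interval arithmetic can be checked with a short script using the statistics from the table above (a sketch; small differences in the last decimal places reflect rounding):

```python
import math

# Sketch: the 95 percent prediction interval at X = 21 (i.e., 2,100
# direct labor hours; X is in hundreds of hours, Y in thousands of
# dollars), using the statistics from the table above.
Yc = 124.1034   # point estimate at X = 21
t = 2.776       # t table value for n - 2 = 4 df
SEE = 14.27     # standard error of the estimate
n, X, x_bar, sum_x2 = 6, 21, 24, 3_872

half_width = t * SEE * math.sqrt(
    1 + 1 / n + (X - x_bar) ** 2 / (sum_x2 - n * x_bar ** 2)
)
lower, upper = Yc - half_width, Yc + half_width  # ≈ 80.92 and ≈ 167.29
```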

Prediction Statement: We would be 95 percent confident that the actual manufacturing

overhead will be between $80,921 and $167,286 at 2,100 manufacturing direct labor hours.