Summary

  • test for significance of regressor, overall model
  • fitted regression equation
  • check assumptions using residual plots
  • identify outliers and influential points
  • interpret coefficients, R-squared
  • compare fit of models for same response using adjusted R-squared

Regression

A regression of the response variable $Y$ on the regressor $X$ is a mathematical relationship between the mean of $Y$ and different values of $X$.

Linear regression

The regression is linear, and of the form

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

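Linear refers to linearity in the parameters, not in the regressor. For example:

  • $Y = \beta_0 + \beta_1 X^2 + \varepsilon$ is still linear
  • $Y = \beta_0 + e^{\beta_1 X} + \varepsilon$ is not linear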

Simple refers to the use of only one regressor.

  • If there are multiple regressors, it is called a multiple linear regression.

Simple Linear Regression

Assumptions

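For $Y = \beta_0 + \beta_1 X + \varepsilon$:

  1. Data obtained by randomisation
  2. Linearity of the relationship between $Y$ and $X$
  3. Error term $\varepsilon$ is normally distributed
  4. Error term $\varepsilon$ has constant variance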

Note: the normality and constant variance assumptions can only be checked after the model is fitted, because they are assessed using the residuals.

Estimation

Ordinary Least Squares

  • consider all possible candidate lines
  • for each line, compute the sum of squared residuals $\sum_i (y_i - \hat{y}_i)^2$
  • the line minimising this quantity is the line of best fit
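The steps above can be sketched in R. This is a minimal illustration, assuming the built-in `cars` dataset as example data (it is not part of these notes); the helper `ssr()` and the names `b0`, `b1` are hypothetical. It uses the standard closed-form least squares estimates for one regressor and checks them against `lm()`.

```r
# Illustrative data only: the built-in `cars` dataset (speed, dist).
x <- cars$speed
y <- cars$dist

# Closed-form least squares estimates for a single regressor.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared residuals for a candidate line with intercept a and slope b.
ssr <- function(a, b) sum((y - (a + b * x))^2)

# lm() should give the same estimates as the closed form.
M1 <- lm(dist ~ speed, data = cars)
coef(M1)
c(b0, b1)

# Moving away from the fitted line increases the sum of squared residuals.
ssr(b0, b1) < ssr(b0 + 1, b1)
ssr(b0, b1) < ssr(b0, b1 + 0.1)
```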

Understanding R Output

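[Figure: annotated R regression output, with the following parts labelled]

  • Residuals: Min, 1Q, Median, 3Q, Max
  • Coefficients: rows (Intercept) for $\beta_0$ and x for $\beta_1$; columns Estimate, Std. Error, t value, Pr(>|t|)
  • Residual standard error
  • Multiple R-squared, Adjusted R-squared
  • F-statistic, p-value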

Interpreting this:

However, confidence intervals for $\beta_0$ and $\beta_1$ can be determined, which would be preferable to point estimates in some instances:

confint(M1, level = 0.95)  # M1 is the fitted model; level must be passed by name
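As a hedged sketch of where each part of the output lives, the snippet below fits a model to the built-in `cars` dataset (an assumption for illustration; the notes do not say which data `M1` was fitted to) and extracts the pieces by name. Note that the second positional argument of `confint()` is `parm`, so `level` should be given by name.

```r
# Illustrative fit; `cars` is example data, not the data used in the notes.
M1 <- lm(dist ~ speed, data = cars)
s <- summary(M1)

s$coefficients    # columns: Estimate, Std. Error, t value, Pr(>|t|)
s$sigma           # residual standard error
s$r.squared       # multiple R-squared
s$adj.r.squared   # adjusted R-squared
s$fstatistic      # F value with its numerator and denominator df

# 95% confidence intervals for both coefficients.
ci <- confint(M1, level = 0.95)
ci

# 99% interval for the slope only, selected through `parm`.
confint(M1, parm = "speed", level = 0.99)
```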

Hypothesis Testing

  • $t$-test: tests significance of one regressor
  • $F$-test: tests significance of whole model

Note: In a Simple Linear Regression model, the $F$-test is equivalent to the $t$-test for the slope, since $F = t^2$ and both give the same $p$-value.

t-test

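  1. Assumptions remain the same
  2. State null hypothesis (and alternative)
    • $H_0: \beta_1 = 0$, or regressor not significant ($H_1: \beta_1 \neq 0$)
  3. Find test-statistic $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$
  4. Derive $p$-value
  5. Conclude if slope is significantly different from $0$ at a prespecified significance level $\alpha$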
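The steps of the test can be reproduced by hand from the coefficient table. This is a sketch under assumptions: the built-in `cars` data stands in for the real data, and `alpha = 0.05` is an arbitrary illustrative level. In simple linear regression the statistic is compared with a $t$ distribution on $n - 2$ degrees of freedom.

```r
# Illustrative fit on example data.
M1 <- lm(dist ~ speed, data = cars)
tab <- coef(summary(M1))

# Step 3: test statistic = estimate / standard error.
est <- tab["speed", "Estimate"]
se <- tab["speed", "Std. Error"]
t_stat <- est / se

# Step 4: two-sided p-value on n - 2 residual degrees of freedom.
df <- df.residual(M1)
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Step 5: compare with a prespecified alpha (0.05 is only an example).
alpha <- 0.05
p_value < alpha

# Both values should match the "t value" and "Pr(>|t|)" columns.
tab["speed", c("t value", "Pr(>|t|)")]
```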

F-test

  1. Assumptions remain the same
  2. State null hypothesis (and alternative)
    • $H_0: \beta_1 = \dots = \beta_k = 0$, or all regressors not significant ($H_1$: at least one $\beta_j \neq 0$)
  3. Find test-statistic $F$
  4. Derive $p$-value
  5. Conclude if model is significant

If the $F$-test does not reject $H_0$, it suggests the model has no significant regressors, and suggests a new model without regressors, $Y = \beta_0 + \varepsilon$.

This is called the intercept-only model.
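A short sketch of the overall test, again assuming the built-in `cars` data for illustration. It reads the $F$ statistic from the model summary, recomputes its $p$-value, and compares the fitted model with the intercept-only model `M0` (a name chosen here, not from the notes). In the one-regressor case the $F$ statistic should equal the squared $t$ statistic of the slope.

```r
M1 <- lm(dist ~ speed, data = cars)   # model with one regressor
M0 <- lm(dist ~ 1, data = cars)       # intercept-only model

# F statistic with its degrees of freedom, and the p-value.
f <- summary(M1)$fstatistic           # named: value, numdf, dendf
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Comparing M0 with M1 gives the same F statistic.
cmp <- anova(M0, M1)
cmp

# With a single regressor, F equals t squared.
t_stat <- coef(summary(M1))["speed", "t value"]
all.equal(unname(f["value"]), t_stat^2)

# The intercept-only model predicts the sample mean of the response.
coef(M0)
mean(cars$dist)
```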

Diagnostics

  1. Randomisation - determined during data collection
  2. Linearity - checked using the scatterplot of response against regressor, and the residual plot
  3. Normality - checked using residuals of built model
  4. Constant variance - checked using residuals of built model

Scatterplot

[Figure: example scatterplots. Panels: Ideal; Linearity violated; Constant variance violated]

If linearity is violated

  • add higher order terms (e.g. $X^2$)

If constant variance is violated

  • transform the response (e.g. $\log Y$ or $\sqrt{Y}$); this changes the coefficients and their interpretation
Residuals

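The diagnostics use the standardised residuals $r_i$, that is, the raw residuals $e_i = y_i - \hat{y}_i$ divided by their estimated standard deviation.

Plotting the standardised residuals: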

  • Plot the standardised residuals against $\hat{Y}$ or $X$:
    • Expected: points scatter randomly about $0$, within the interval $(-3, 3)$.
    • If there is a funnel shape, constant variance is violated.
  • Plot $Y$ against $X$:
    • Expected: a linear trend.
    • If the trend is not linear (e.g. curved), linearity is violated.
  • QQ-plot of the standardised residuals:
    • Expected: points lie close to a straight line, as for a normal sample.
    • If not, normality is violated.
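A minimal sketch of these plots in base R, assuming the built-in `cars` data as a stand-in. `rstandard()` returns R's standardised residuals; the dashed reference lines at $\pm 3$ are drawn here as an illustrative band, not as a fixed rule.

```r
M1 <- lm(dist ~ speed, data = cars)
r <- rstandard(M1)   # standardised residuals

# Residuals against fitted values: look for random scatter about 0
# and for any funnel shape.
plot(fitted(M1), r, xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = c(-3, 0, 3), lty = c(2, 1, 2))

# Residuals against the regressor.
plot(cars$speed, r, xlab = "speed", ylab = "Standardised residuals")
abline(h = 0)

# Response against regressor: look for a linear trend.
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")

# Normal QQ-plot of the standardised residuals with a reference line.
qqnorm(r)
qqline(r)
```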

Outliers, Influential Points

Outlier

A point with standardised residual $> 3$ or $< -3$ (a commonly used cut-off).

Influential points

A point whose removal changes the parameter estimates greatly. (An outlier may or may not be influential.)

Measured using Cook's distance, which measures the effect of deleting a given observation on the fitted model. A Cook's distance greater than $1$ is commonly used as the threshold.
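Both checks are available in base R. The sketch below assumes the built-in `cars` data and uses cut-offs of 3 for the standardised residual and 1 for Cook's distance as illustrative, commonly quoted values. It also refits the model without the observation that has the largest Cook's distance, to show what "effect of deleting an observation" means in practice.

```r
M1 <- lm(dist ~ speed, data = cars)

r <- rstandard(M1)        # standardised residuals
d <- cooks.distance(M1)   # Cook's distance per observation

# Candidate outliers and influential points (illustrative cut-offs).
which(abs(r) > 3)
which(d > 1)

# Refit without the observation that has the largest Cook's distance
# and compare the coefficient estimates.
i <- which.max(d)
M1_drop <- lm(dist ~ speed, data = cars[-i, ])
rbind(full = coef(M1), dropped = coef(M1_drop))
```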

Statistic

R-squared statistic

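The proportion of total variation of the response (about the sample mean $\bar{y}$) explained by the model:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

In a simple model, $R^2 = r^2$, the square of the sample correlation between $X$ and $Y$.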
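As a quick check of this definition, the following sketch (assuming the built-in `cars` data; the names `sse` and `sst` are chosen here) computes the statistic from the residual and total sums of squares and compares it with the value reported by `summary()` and with the squared sample correlation.

```r
M1 <- lm(dist ~ speed, data = cars)
y <- cars$dist

sse <- sum(resid(M1)^2)        # variation left unexplained by the model
sst <- sum((y - mean(y))^2)    # total variation about the sample mean
r2 <- 1 - sse / sst

# Agrees with the reported Multiple R-squared.
all.equal(r2, summary(M1)$r.squared)

# In a simple model, also equals the squared correlation of x and y.
all.equal(r2, cor(cars$speed, cars$dist)^2)
```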

Multiple Linear Regression

MLR | SLR
Regression function is linear | Same
Check assumptions using residuals | Same
t-tests for individual coefficients | Same
F-test for overall regression | Same
Test significance of a categorical variable which has more than 2 categories | Different
Use adjusted R-squared to compare models | Use non-adjusted R-squared to compare models
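To illustrate why adjusted R-squared is used when models have different numbers of regressors, the sketch below fits two models for the same response on the built-in `mtcars` data (an assumption for illustration; the choice of `wt` and `hp` as regressors is arbitrary). It also recomputes adjusted R-squared from the standard formula $1 - (1 - R^2)(n - 1)/(n - k - 1)$, with $k$ regressors.

```r
A <- lm(mpg ~ wt, data = mtcars)        # one regressor
B <- lm(mpg ~ wt + hp, data = mtcars)   # two regressors

# Plain R-squared cannot decrease when a regressor is added.
summary(B)$r.squared >= summary(A)$r.squared

# Adjusted R-squared penalises the extra regressor, so it is the
# fairer comparison between A and B.
c(A = summary(A)$adj.r.squared, B = summary(B)$adj.r.squared)

# Manual adjusted R-squared for model B (k = 2 regressors).
n <- nrow(mtcars)
k <- 2
adj_B <- 1 - (1 - summary(B)$r.squared) * (n - 1) / (n - k - 1)
all.equal(adj_B, summary(B)$adj.r.squared)
```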

Indicator Variables, Interaction Terms

Indicator variable

Takes on value 1 if category observed, and 0 otherwise.
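R builds indicator variables automatically when a regressor is a factor. The sketch below assumes the built-in `mtcars` data, treating `cyl` (4, 6 or 8 cylinders) as a three-category variable purely for illustration. A variable with three categories needs two indicators, with the remaining category acting as the baseline, so its overall significance is assessed by comparing the models with and without it rather than by a single $t$-test.

```r
# Treat the number of cylinders as categorical (levels 4, 6, 8).
d <- transform(mtcars, cyl = factor(cyl))

# Design matrix: one indicator column per non-baseline category.
X <- model.matrix(~ cyl, data = d)
head(X)
colnames(X)

# Model with and without the categorical variable.
M_without <- lm(mpg ~ wt, data = d)
M_with <- lm(mpg ~ wt + cyl, data = d)

# F-test for both indicator coefficients at once.
anova(M_without, M_with)
```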

Interaction terms

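If the effect of one regressor on the response depends on the value of another regressor, the two variables interact, and the model includes an interaction term, the product of the two variables:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$$

Note that there is a new coefficient here, $\beta_3$, the coefficient of the interaction term. When dropping insignificant terms: if the interaction term is significant and is kept, all the main terms involved in the interaction must also be kept, even if they are individually insignificant.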
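In an R formula, an interaction model can be written with `*`, which expands to both main terms plus their product. The sketch below assumes the built-in `mtcars` data, with `wt` and the transmission indicator `am` chosen only as an example pair of regressors.

```r
# Treat transmission type as categorical (0 or 1).
d <- transform(mtcars, am = factor(am))

# wt * am expands to wt + am + wt:am, so the main terms are kept
# alongside the interaction term.
M_int <- lm(mpg ~ wt * am, data = d)
names(coef(M_int))

# Writing the terms out explicitly gives the same fit.
M_explicit <- lm(mpg ~ wt + am + wt:am, data = d)
all.equal(coef(M_int), coef(M_explicit))

# The row for wt:am1 holds the test for the interaction coefficient.
coef(summary(M_int))
```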