Summary
- test for significance of regressor, overall model
- fitted regression equation
- check assumptions using residual plots
- identify outliers and influential points
- interpret coefficients, R-squared
- compare fit of models for same response using adjusted R-squared
Regression
A regression of the response variable $Y$ on the regressor $X$ is a mathematical relationship between the mean of $Y$ and different values of $X$.
Linear regression
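The regression is linear, and of the form
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Linear refers to linearity in the parameters. For example:
- $Y = \beta_0 + \beta_1 X^2 + \varepsilon$ is still linear
- $Y = \beta_0 + e^{\beta_1 X} + \varepsilon$ is not linear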
Simple refers to the use of only one regressor.
- If there are multiple regressors, it is called a multiple linear regression.
Simple Linear Regression
Assumptions
For the model $Y = \beta_0 + \beta_1 X + \varepsilon$:
- Data obtained by randomisation
- Linearity of the relationship
- Error term $\varepsilon$ is normal
- Constant variance of $\varepsilon$
Note: assumptions can only be checked after model is fitted.
Estimation
Ordinary Least Squares
- considers all possible “best-fit lines”
- compute sum of squared residuals
- line minimising this quantity is the line of best-fit
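The least-squares idea above can be checked numerically. This is a minimal sketch, assuming the built-in `cars` data set (`speed`, `dist`) as a stand-in for the notes' data; the helper `ssr()` is hypothetical and only for illustration.

```r
# Assumption: built-in `cars` data stands in for the data used in the notes.
x <- cars$speed
y <- cars$dist

# Closed-form OLS estimates for a simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Hypothetical helper: sum of squared residuals for any candidate line
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

M1 <- lm(dist ~ speed, data = cars)
coef(M1)            # agrees with c(b0, b1)
ssr(b0, b1)         # SSR of the fitted line
ssr(b0, b1 + 0.5)   # a different slope gives a larger SSR
ssr(b0 + 1, b1)     # a different intercept gives a larger SSR
```

`lm()` does not literally search over candidate lines; it solves the minimisation directly, which is why its coefficients match the closed-form values.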
Understanding R Output
Interpreting this:
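However, confidence intervals for the coefficients are not part of the summary output. Obtain them with `confint`, passing the level by name (a second positional argument would be read as `parm`, not `level`):
confint(M1, level = 0.95)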
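As a rough illustration of where these numbers come from, the sketch below fits a model named `M1` (the name used in the notes) on the built-in `cars` data, which is an assumption and not the notes' own data set.

```r
# Assumption: `cars` stands in for the data behind M1 in the notes.
M1 <- lm(dist ~ speed, data = cars)

# Coefficient table: estimate, standard error, t value, p-value
summary(M1)

# 95% confidence intervals for the intercept and slope.
# `level` is passed by name; the second positional argument of confint() is `parm`.
ci <- confint(M1, level = 0.95)
ci
```

Each row of `ci` is the interval for one coefficient, and the point estimate from `coef(M1)` lies inside its interval.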
Hypothesis Testing
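- $t$-test: tests significance of one regressor
- $F$-test: tests significance of the whole model

Note: In a Simple Linear Regression model, the $t$-test for the slope and the $F$-test are equivalent ($F = t^2$, with the same p-value).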
t-test
- Assumptions remain the same
- State null hypothesis (and alternative): $H_0: \beta_1 = 0$ (regressor not significant) against $H_1: \beta_1 \neq 0$
- Find test-statistic
- Derive the p-value
- Conclude if slope $\beta_1$ is significantly different from $0$ at a prespecified significance level $\alpha$
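The steps of the t-test can be traced by hand from the coefficient table. This is a sketch on the built-in `cars` data (an assumption), with $\alpha = 0.05$ chosen purely for illustration.

```r
# Assumption: `cars` stands in for the notes' data; alpha = 0.05 is illustrative.
M1 <- lm(dist ~ speed, data = cars)
tab <- coef(summary(M1))

est <- tab["speed", "Estimate"]
se  <- tab["speed", "Std. Error"]

t_stat <- est / se                 # test statistic under H0: slope = 0
df <- df.residual(M1)              # n - 2 in a simple linear regression
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)   # two-sided p-value

alpha <- 0.05
p_value < alpha                    # TRUE means: reject H0 at level alpha
```

The hand-computed `t_stat` and `p_value` should match the `t value` and `Pr(>|t|)` columns that `summary(M1)` reports for `speed`.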
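F-test
- Assumptions remain the same
- State null hypothesis (and alternative): $H_0$: all slope coefficients are $0$ (all regressors not significant) against $H_1$: at least one slope coefficient is non-zero
- Find test-statistic
- Derive the p-value
- Conclude if model is significant

If the model is not significant, it reduces to $Y = \beta_0 + \varepsilon$.
This is called the intercept model.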
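The overall F-test can be read off the model summary, or seen as a comparison against the intercept model. A minimal sketch, again assuming the built-in `cars` data; the name `M0` for the intercept model is hypothetical.

```r
# Assumption: `cars` stands in for the notes' data; M0 is an illustrative name.
M1 <- lm(dist ~ speed, data = cars)
M0 <- lm(dist ~ 1, data = cars)          # intercept model: fitted value is mean(dist)

f <- summary(M1)$fstatistic              # named vector: value, numdf, dendf
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

anova(M0, M1)                            # the same F-test, as a model comparison

# With a single regressor, the F statistic equals the squared t statistic
t_stat <- coef(summary(M1))["speed", "t value"]
c(F = unname(f["value"]), t_squared = t_stat^2)
```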
Diagnostics
- Randomisation - determined during data collection
- Linearity - checked using the scatterplot of response $Y$ against regressor $X$, and the residual plot
- Normality - checked using residuals of built model
- Constant variance - checked using residuals of built model
Scatterplot
If linearity violated
- add higher order terms
If not constant variance
- transform response (e.g. $\log Y$ or $\sqrt{Y}$) (will change the coefficients and their interpretation)
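Both remedies are small changes to the model formula. The sketch below assumes the built-in `cars` data and picks a quadratic term and a square-root transform purely as examples; the notes do not say which term or transform to use.

```r
# Assumption: `cars` data; the quadratic term and sqrt() transform are illustrative choices.
M1 <- lm(dist ~ speed, data = cars)                 # original model
M2 <- lm(dist ~ speed + I(speed^2), data = cars)    # add a higher order term
M3 <- lm(sqrt(dist) ~ speed, data = cars)           # transform the response

coef(M1)
coef(M2)   # one extra coefficient for speed^2
coef(M3)   # slope is now on the sqrt(dist) scale, so it differs from M1
```

`I()` is needed so that `^` is treated as arithmetic and not as formula syntax. After refitting, the assumptions need to be rechecked on the new model's residuals.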
Residuals
Plotting the residuals:
- Plot standardised residuals against fitted values $\hat{Y}$ or regressor $X$:
  - Expected: points scatter randomly about $0$, within an interval such as $(-3, 3)$.
  - If there is a funnel shape, constant variance is violated.
- Plot $Y$ against $X$:
  - Expected: linearity
  - If not linear, linearity is violated.
- QQ-plot of standardised residuals:
  - Expected: normal (points lie close to a straight line)
  - If not, normality is violated.
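These diagnostic plots can be produced with base R graphics. A sketch under the assumption that the model is fitted on the built-in `cars` data; the $\pm 3$ reference lines are a common convention and not necessarily the interval used in the notes.

```r
# Assumption: `cars` data; the +/-3 band is a common rule of thumb.
M1 <- lm(dist ~ speed, data = cars)
r <- rstandard(M1)                       # standardised residuals

# Residuals against fitted values: look for random scatter about 0, no funnel
plot(fitted(M1), r, xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = c(-3, 0, 3), lty = 2)

# Response against regressor: look for a roughly linear trend
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(M1)

# Normal QQ-plot of the residuals: points should follow the reference line
qqnorm(r)
qqline(r)
```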
Outliers, Influential Points
Outlier
A point with a large standardised residual. A common rule of thumb is a standardised residual $> 3$ or $< -3$.
Influential points
Affects parameter estimates greatly. (An outlier may or may not be influential).
Measured using Cook's distance, which measures the effect of deleting a given observation on the fitted model. A common rule of thumb flags points with Cook's distance greater than $1$.
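Both checks are available from a fitted model in base R. In this sketch the cut-offs of 3 for standardised residuals and 1 for Cook's distance are assumed rules of thumb, and the built-in `cars` data is an assumed stand-in.

```r
# Assumptions: `cars` data; thresholds 3 and 1 are common rules of thumb.
M1 <- lm(dist ~ speed, data = cars)

r <- rstandard(M1)            # standardised residuals
d <- cooks.distance(M1)       # Cook's distance for each observation

outliers    <- which(abs(r) > 3)   # indices of possible outliers
influential <- which(d > 1)        # indices of possible influential points

outliers
influential
head(sort(d, decreasing = TRUE))   # the observations with the largest influence
```

A point can appear in one list without appearing in the other, which is the sense in which an outlier may or may not be influential.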
Statistic
R-squared statistic
The proportion of total variation of the response (about the sample mean $\bar{Y}$) explained by the model.
In a simple model, $R^2 = r^2$, where $r$ is the sample correlation between $X$ and $Y$.
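The definition can be verified by computing the sums of squares directly. A minimal sketch, assuming the built-in `cars` data; the variable names `sst` and `sse` are illustrative.

```r
# Assumption: `cars` stands in for the notes' data.
M1 <- lm(dist ~ speed, data = cars)
y <- cars$dist

sst <- sum((y - mean(y))^2)      # total variation about the sample mean
sse <- sum(resid(M1)^2)          # variation left unexplained by the model
r2  <- 1 - sse / sst             # proportion explained

c(manual      = r2,
  summary     = summary(M1)$r.squared,
  cor_squared = cor(cars$speed, y)^2)   # equal in a simple linear regression
```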
Multiple Linear Regression
MLR | SLR |
---|---|
Regression function is linear | Same |
Check assumptions using residuals | Same |
t-tests for individual coefficients | Same |
F-test for overall regression | Same |
Test significance of a categorical variable which has more than 2 categories | Different |
Using adjusted R-squared to compare models | Use non-adjusted R-squared to compare models. |
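The reason for preferring adjusted R-squared when comparing models of different size can be seen in a small example. The sketch assumes the built-in `mtcars` data and two hypothetical models `A` and `B` for the same response.

```r
# Assumption: `mtcars` data; models A and B are illustrative.
A <- lm(mpg ~ wt, data = mtcars)
B <- lm(mpg ~ wt + hp, data = mtcars)

# Plain R-squared never decreases when a regressor is added
c(A = summary(A)$r.squared, B = summary(B)$r.squared)

# Adjusted R-squared penalises the number of regressors p
n <- nrow(mtcars)
p <- 2                                    # regressors in model B
adj_manual <- 1 - (1 - summary(B)$r.squared) * (n - 1) / (n - p - 1)

c(A = summary(A)$adj.r.squared,
  B = summary(B)$adj.r.squared,
  B_manual = adj_manual)
```

The model with the larger adjusted R-squared is preferred, provided both models use the same response.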
Indicator Variables, Interaction Terms
Indicator variable
Takes on value 1 if category observed, and 0 otherwise.
Interaction terms
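If the effect of one variable on the response depends on the value of another variable, the model includes an interaction term, which is the product of the two variables. For example, with regressors $X_1$ and $X_2$:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$$
Note that there is a new coefficient here, $\beta_3$, the coefficient of the interaction term.
When dropping insignificant terms: if the interaction term is significant and is kept, all the main terms of the interaction must be kept as well, even if they are individually insignificant.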
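In R, an indicator variable is created automatically from a factor, and an interaction is requested in the formula. This sketch assumes the built-in `mtcars` data, with `am` recoded as a factor; the model name `M_int` is hypothetical.

```r
# Assumption: `mtcars` data; `am` (0 = automatic, 1 = manual) is the categorical variable.
d <- mtcars
d$am <- factor(d$am, labels = c("automatic", "manual"))

# wt * am expands to wt + am + wt:am, so the main terms stay in the model
M_int <- lm(mpg ~ wt * am, data = d)

names(coef(M_int))        # includes the indicator `ammanual` and the interaction `wt:ammanual`
coef(summary(M_int))      # t-tests for each term, including the interaction

# The indicator column is 1 for manual cars and 0 otherwise
head(model.matrix(M_int))
```

Writing the formula with `*` instead of `:` alone is one way to make sure the main terms are not dropped while the interaction is kept.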