Simple Linear Regression

Simple linear regression models one dependent (response) variable using a single independent (explanatory) variable. It assumes the process that generated the data is linear.

y = observed value, y(hat) = estimated value, y(bar) = average, e = errors, n = sample size.

Estimation methods (each minimises one of the quantities below)
 * Residuals
 * Least Squared Errors (least squares) ~> This is the most commonly used estimation method; when the regression assumptions hold it is the best linear unbiased estimator
 * Minimum Absolute Deviation
 * Cramer = minimum Chi^2
 * MAPE = Mean Absolute Percentage Error
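
As an illustration of the least squares method listed above, here is a minimal sketch of the closed-form fit for one explanatory variable. The function name fit_simple_ols and the data values are made up for the example; only the Python standard library is assumed.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy: co-variation of x and y about their means; Sxx: variation of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Illustrative data only (not from any real study)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_simple_ols(x, y)
print(f"y(hat) = {intercept:.3f} + {slope:.3f} * x")
```

The other criteria in the list (for example minimum absolute deviation) generally have no closed form like this and need an iterative solver.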

Assessment of Model
 * Check goodness of fit
 * R^2 and Adjusted R^2 ~> the measures of fit
 * R^2 tells us how well the model fits the data: the share of the variation in y that the model explains
 * Value from 0 to 1. The closer to 1, the better the fit.
 * R^2 = SSreg/SStot = 1 - SSres/SStot
 * As a rough rule of thumb, 0.7 is considered good for scientific data and 0.6 for social sciences
 * Adjusted R^2 corrects R^2 for the number of variables in the model, because adding a variable never lowers R^2 even when the variable explains nothing real
 * Standard Error
 * F statistic p value
 * t statistic p value
 * Check regression assumptions
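
The R^2 and Adjusted R^2 measures above can be sketched as follows. This is a minimal example, assuming fitted values from a hypothetical line y(hat) = 0.05 + 1.99x and illustrative data; the function name r_squared is not from the notes.

```python
def r_squared(y, y_hat, k=1):
    """Return (R^2, adjusted R^2) for a model with k explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total
    r2 = 1 - ss_res / ss_tot
    # Penalise R^2 for the number of explanatory variables used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2


# Illustrative data and an assumed fitted line (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [0.05 + 1.99 * xi for xi in x]
r2, adj_r2 = r_squared(y, y_hat)
print(f"R^2 = {r2:.4f}, adjusted R^2 = {adj_r2:.4f}")
```

With a single explanatory variable the adjustment is small; it matters more as variables are added.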

Standard Error = sample estimate of the standard deviation of the residuals = sqrt(SSres/(n - k - 1)), where k is the number of explanatory variables

ANOVA = Analysis of Variance
 * df = degrees of freedom = depends on the sample size and the number of parameters to be estimated
 * df(tot) = n - 1, df(reg) = k, df(res) = n - k - 1 = where k is the number of explanatory variables and n is the sample size
 * F = Regression MS/Residual MS = the ratio of the explained to the unexplained variance, where each MS = SS/df
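
To make the ANOVA quantities concrete, here is a small sketch that computes the degrees of freedom, the F ratio and the standard error from observed and fitted values. The function name anova_summary and the data are assumptions for illustration. The identity SSreg = SStot - SSres used below holds for a least squares fit with an intercept. The p value of F is left out because it needs the F distribution, which the standard library does not provide.

```python
import math


def anova_summary(y, y_hat, k=1):
    """Return df, F and standard error for a fit with k explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_reg = ss_tot - ss_res  # valid for least squares with an intercept
    df_reg, df_res, df_tot = k, n - k - 1, n - 1
    ms_reg = ss_reg / df_reg  # regression mean square
    ms_res = ss_res / df_res  # residual mean square
    return {
        "df": (df_reg, df_res, df_tot),
        "F": ms_reg / ms_res,
        "standard_error": math.sqrt(ms_res),
    }


# Illustrative data and an assumed least squares line (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [0.05 + 1.99 * xi for xi in x]
summary = anova_summary(y, y_hat)
print(summary)
```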

Assumptions of linear regression
 * Linear relationship
 * Homoskedasticity, constant variance - C
 * Errors are independent of each other and uncorrelated with x - I
 * Error has an Expected value of 0 - E
 * Error is Normally distributed (so y is Normally distributed for a given x) - N
 * Mnemonic for the last four: NICE
 * If assumptions are wrong

How to analyse residuals
 * Predicted vs Residual plot - check linearity and constant variance
 * Can be used to check if E(error) = 0
 * E(error) can be checked numerically too
 * Least squares with an intercept forces the residuals to average 0, so if the mean is not 0 there is a mistake in the calculation
 * Histogram of residuals - check normality/skew of errors
 * Q-Q or P-P plot - check normality/skew of errors
 * Compare the plotted residuals to the histogram of standardised residuals
 * bow shaped = skew
 * S shaped = kurtosis, too peaked or too flat
 * https://en.wikipedia.org/wiki/P%E2%80%93P_plot
 * Index plot - check independence of errors
 * If the residuals show a pattern or correlation along the index, the independence assumption does not hold
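
The plots above are the main tool, but two of the checks can also be done numerically. The sketch below computes the mean of the residuals and their lag-1 correlation in index order, as a rough numeric companion to the index plot. The function name residual_checks and the data are hypothetical, and the lag-1 correlation is only one simple way to look for dependence.

```python
def residual_checks(y, y_hat):
    """Return (mean residual, lag-1 correlation of residuals in index order)."""
    residuals = [yi - fi for yi, fi in zip(y, y_hat)]
    n = len(residuals)
    mean_resid = sum(residuals) / n
    centred = [r - mean_resid for r in residuals]
    denom = sum(c * c for c in centred)
    # Correlation between each residual and the next one in the index
    lag1 = sum(a * b for a, b in zip(centred[:-1], centred[1:])) / denom
    return mean_resid, lag1


# Illustrative data and an assumed least squares line (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [0.05 + 1.99 * xi for xi in x]
mean_resid, lag1 = residual_checks(y, y_hat)
print(f"mean residual = {mean_resid:.6f}, lag-1 correlation = {lag1:.3f}")
```

A lag-1 value far from 0 suggests neighbouring residuals move together (or alternate), although with only five points, as here, it says very little.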

Prediction
 * Two types of prediction: In sample and Out of sample.
 * In sample is predicting missing values within the range of the sample.
 * Out of sample is predicting values that go beyond the range of the sample.
 * Extrapolation is risky because we only know about the range of sample we collected
 * If we don't know the actual value of y, we cannot compute the prediction error (pe), but we can still estimate the standard error of the prediction
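
The risk of extrapolation shows up directly in the standard error of a prediction. The sketch below uses the usual formula for a new observation in simple linear regression, s * sqrt(1 + 1/n + (x_new - x(bar))^2 / Sxx), where s is the standard error of the residuals. The function name and data are assumptions for illustration.

```python
import math


def predict_with_se(x, y, x_new):
    """Return (predicted y, standard error of prediction) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))  # residual standard error, k = 1
    # The last term grows as x_new moves away from the sample mean of x
    se_pred = s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)
    return intercept + slope * x_new, se_pred


# Illustrative data only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
pred_in, se_in = predict_with_se(x, y, 3.0)     # inside the sampled range
pred_out, se_out = predict_with_se(x, y, 10.0)  # beyond the sampled range
print(f"x=3:  {pred_in:.2f} +/- {se_in:.3f}")
print(f"x=10: {pred_out:.2f} +/- {se_out:.3f}")
```

The standard error is smallest at the mean of x and widens with distance from it. The formula still assumes the linear relationship holds outside the sampled range, which the data cannot confirm.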

Outliers
 * Anomalous observations. Those that noticeably change the fitted model are called influential points
 * How to detect
 * Graph, frequency table, range of continuous variable, scatterplots, box-plots, plot standardised residuals against standardised fitted values
 * What to do?
 * Try to find out where they came from. Usually coding errors.
 * If they are errors, drop them.
 * If they are not errors but influence the results greatly, they can be dropped
 * But the decision to drop them has to be argued and reported
 * Run sensitivity analysis
 * Run two analyses, one with the outlier and one without, to see how strongly the outlier affects the results
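
A sensitivity analysis of this kind can be sketched as below: the same least squares fit is run with and without a suspect point and the coefficients are compared. The data are invented, with the last observation deliberately placed far from the others; the helper name fit_line is hypothetical.

```python
def fit_line(x, y):
    """Least squares fit for one explanatory variable: (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Invented data; the last point is a made-up outlier
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]

with_outlier = fit_line(x, y)
without_outlier = fit_line(x[:-1], y[:-1])  # drop the suspect point

print(f"with outlier:    intercept={with_outlier[0]:.2f}, slope={with_outlier[1]:.2f}")
print(f"without outlier: intercept={without_outlier[0]:.2f}, slope={without_outlier[1]:.2f}")
```

A large change in the slope between the two runs indicates the point is influential, which is the case that needs to be argued and reported.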