Omitted Relevant Variables, Multicollinearity, Dummy Variables and More

Omitted Relevant Variables, Outliers and High Leverage Points
 * Outlier - an observation whose y value deviates a lot from the overall trend
 * High leverage point (HLP) - an observation with an extreme value on one or more of the explanatory variables
 * Influential observation - an observation that substantially changes the fitted regression when included
 * Signs of outliers/high leverage points (see the sketch after this list):
 * Standardised residuals beyond +3/-3 ~> outlier
 * Leverage value (under distance) greater than 2p/n, with p the number of model parameters (including the intercept) and n the number of data points ~> HLP
 * Cook's distance (under distance) greater than 1 ~> influential observation
 * When variables relevant to the regression are omitted, the estimated coefficients are biased
 * To see whether the bias is positive or negative, look at how the omitted variable relates to the variables that remain in the model
 * If the true model is y = B0 + B1*x1 + B2*x2 + e and x2 is omitted: E(B1_hat) - B1 = B2 * cov(x1, x2) / var(x1)
 * Therefore:
 * Missing regressors with zero coefficients (B2 = 0) do not cause bias
 * Missing regressors uncorrelated with the included ones (cov(x1, x2) = 0) do not cause bias
 * Only the signs of B2 and cov(x1, x2) determine the sign of the bias; var(x1) is always positive, so it is irrelevant (a simulation after the next list illustrates this)
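
A minimal sketch of pulling these diagnostics out in Python with statsmodels (the data are simulated for illustration, with one planted outlier and one planted high leverage point):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2 + 3 * x + rng.normal(size=50)
    y[0] += 10   # outlier in y
    x[1] = 8     # extreme value in x -> high leverage

    X = sm.add_constant(x)
    influence = sm.OLS(y, X).fit().get_influence()

    n, p = X.shape   # p counts the intercept as a parameter
    std_resid = influence.resid_studentized_internal
    leverage = influence.hat_matrix_diag
    cooks_d = influence.cooks_distance[0]

    print("Outliers:      ", np.where(np.abs(std_resid) > 3)[0])
    print("High leverage: ", np.where(leverage > 2 * p / n)[0])
    print("Influential:   ", np.where(cooks_d > 1)[0])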


 * How to avoid this?
 * Think about which variables are important (intuition + theory)
 * Estimate the regression coefficients and check that they have plausible signs and magnitudes
 * Look at how researchers built their models in the literature/past models
 * Check R^2 and adjusted R^2
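
A small simulation (numbers are illustrative) showing the bias formula at work: the relevant variable x2 is omitted even though it matters (B2 = 3) and is correlated with x1, so the estimate of B1 is pushed away from its true value of 2 by roughly B2 * cov(x1, x2) / var(x1):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 100_000
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)             # x2 correlated with x1
    y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)   # true B1 = 2, B2 = 3

    # Fit the short regression that omits the relevant variable x2
    short = sm.OLS(y, sm.add_constant(x1)).fit()
    print(short.params[1])   # ~3.5: biased upwards

    # Bias predicted by the formula: B2 * cov(x1, x2) / var(x1) ~ 1.5
    print(3 * np.cov(x1, x2)[0, 1] / np.var(x1))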

Multicollinearity
 * Occurs when one explanatory variable is strongly correlated with another
 * More likely to happen when many variables are in the model
 * Sometimes the correlation becomes very strong
 * In that case it is problematic to include both of the highly correlated variables
 * Can be more subtle, e.g. a combination of x1 and x2 is correlated with a combination of x3, x4 and x5
 * Effects:
 * Unreliable regression coefficients (large standard errors, unstable signs)
 * Overall F-statistic is highly significant even though the individual t-statistics are not
 * Large difference between R^2 and adjusted R^2
 * Detecting Multicollinearity:
 * Correlation Matrix
 * http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
 * Correlation Scatterplot
 * A highly correlated pair shows up as one straight-ish line rather than an oblong donut of points
 * The Variance Inflation Factor (computed in the sketch after this list)
 * VIF_j = 1/(1 - Rj^2), where Rj^2 is the R^2 from regressing variable j on all the other explanatory variables
 * If VIF > 5, there is a problem (some texts use 10 as the cutoff)
 * Shows how much the variance of coefficient j is inflated because variable j is not independent of the other variables
 * Best to include just one of the correlated variables in the model in this case
 * Or create an index (e.g. an average) of the set of multicollinear variables
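
statsmodels has a helper for the VIF; a sketch with made-up variables, where x2 is built to be nearly collinear with x1:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    df = pd.DataFrame({"x1": rng.normal(size=200)})
    df["x2"] = df["x1"] + 0.1 * rng.normal(size=200)   # nearly collinear with x1
    df["x3"] = rng.normal(size=200)                    # independent

    X = sm.add_constant(df)   # include the intercept in the design matrix
    for i, name in enumerate(X.columns):
        if name != "const":
            print(name, variance_inflation_factor(X.values, i))
    # x1 and x2 get VIFs far above 5; x3 stays close to 1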

Autocorrelation
 * Autocorrelation is when successive error terms are correlated (usually occurs in time series)
 * How to detect:
 * Use the Durbin-Watson statistic: a value near 2 means no autocorrelation; values well below 2 indicate positive autocorrelation, values well above 2 indicate negative autocorrelation (see the sketch below)
 * If autocorrelation is present, appropriate time-series methods are needed, e.g. moving-average error models
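
A sketch of the Durbin-Watson check on regression residuals, using data simulated with AR(1) errors so the statistic comes out well below 2:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(3)
    n = 200
    x = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):   # AR(1) errors -> positive autocorrelation
        e[t] = 0.8 * e[t - 1] + rng.normal()
    y = 1 + 2 * x + e

    results = sm.OLS(y, sm.add_constant(x)).fit()
    print(durbin_watson(results.resid))   # well below 2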

Dummy/Indicator Variables
 * Variables that take only the value 0 or 1 in the model: 1 if the attribute exists/applies/is true and 0 if not
 * The baseline category is not represented by a dummy in the model. E.g. if a categorical variable has 5 categories, the model includes only 4 dummies; the omitted category is the baseline
 * Choose the baseline very carefully
 * Use the most meaningful category as the baseline
 * Don't use too many dummies
 * It helps to group categories
 * A single dummy must not have more than 2 settings; a category with more levels needs several 0/1 dummies (see the sketch below)
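
A sketch of dummy coding with pandas (column names are made up); drop_first=True drops one category, which becomes the baseline:

    import pandas as pd

    df = pd.DataFrame({"region": ["north", "south", "east", "west", "south"],
                       "sales": [10, 12, 9, 11, 13]})

    # 4 categories -> 3 dummies; "east" (first in sorted order) is the baseline
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
    print(pd.concat([df, dummies], axis=1))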

Polynomials
 * For data points with a curving trend, polynomials can be used to fit the regression to the data
 * Polynomial terms can be treated as additional variables, i.e. x1^2 can be treated as a new variable x2 (see the sketch below)
 * Higher-order terms (cubics, quartics) are rarely used and are difficult to explain/interpret
 * Drawbacks: risky extrapolation, multiple maxima and minima, inflexibility; all in all, higher-order polynomials are not a good choice
 * Issues with collinearity between the polynomial terms
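
A sketch of treating x^2 as just another column in the design matrix:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.uniform(-3, 3, size=200)
    y = 1 + 2 * x - 1.5 * x**2 + rng.normal(size=200)   # quadratic trend

    X = np.column_stack([x, x**2])   # the squared term is a second "variable"
    results = sm.OLS(y, sm.add_constant(X)).fit()
    print(results.params)   # roughly [1, 2, -1.5]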

Transformations
 * Reasons for transformations:
 * The data cannot be plotted usefully on the original scale
 * Theoretical reasons
 * Assumptions such as constant variance and normality do not hold according to residual analysis
 * Transformations include:
 * y ~> ln(y), x ~> x^2, x ~> sqrt(x), y ~> y^2, y ~> 1/y
 * The log transformation is used very often: it reduces skewness in the data, makes exponential relationships linear and reduces certain types of heteroskedasticity (see the sketch below)
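
For example, an exponential relationship y = a * e^(b*x) becomes linear after taking logs: ln(y) = ln(a) + b*x. A sketch with simulated data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 2, size=200)
    y = 0.5 * np.exp(1.5 * x) * np.exp(rng.normal(scale=0.1, size=200))

    # Regress ln(y) on x instead of y on x
    results = sm.OLS(np.log(y), sm.add_constant(x)).fit()
    print(results.params)   # roughly [ln(0.5), 1.5] = [-0.69, 1.5]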

Interactions
 * Sometimes two variables interact in a regression: one variable's effect on y depends on the value (or range of values) of another variable
 * This is not collinearity or multicollinearity
 * E.g. if variables x1 and x2 interact, we show that in the regression model through y = a*x1 + b*x2 + c*x1*x2 + e
 * You need strong reasons to include an interaction term without its corresponding main effects: if you put A*B in the model, also put in A and B
 * Be careful when dummy variables enter interactions: because they take only 0/1 values, the interaction term switches the other variable's slope between groups (see the sketch below)
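
A sketch using the statsmodels formula interface, where "x1 * group" expands to x1 + group + x1:group so the main effects are included automatically (data simulated for illustration):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    n = 300
    df = pd.DataFrame({"x1": rng.normal(size=n),
                       "group": rng.integers(0, 2, size=n)})   # 0/1 dummy
    # Slope of x1 is 2 for group 0 and 2 + 1.5 = 3.5 for group 1
    df["y"] = (1 + 2 * df["x1"] + 0.5 * df["group"]
               + 1.5 * df["x1"] * df["group"] + rng.normal(size=n))

    results = smf.ols("y ~ x1 * group", data=df).fit()
    print(results.params)   # intercept, x1, group, x1:group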

Partial F tests
 * Used to find out which independent variables are useful and which are not
 * Used for comparing a reduced model to a full model
 * The reduced model contains a subset of the full model's variables
 * Formula (q = number of variables dropped, k = number of variables in the full model, n = number of observations):
 * F = [(SSE_reduced - SSE_full) / q] / [SSE_full / (n - k - 1)]

http://stats.stackexchange.com/questions/157102/what-is-a-partial-f-statistic
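
statsmodels can run the comparison directly; a sketch with simulated data in which the third variable is irrelevant, so the partial F test should not reject the reduced model:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 200
    X = rng.normal(size=(n, 3))
    y = 1 + 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(size=n)   # x3 irrelevant

    full = sm.OLS(y, sm.add_constant(X)).fit()
    reduced = sm.OLS(y, sm.add_constant(X[:, :2])).fit()

    f_stat, p_value, df_diff = full.compare_f_test(reduced)
    print(f_stat, p_value)   # large p-value -> dropped variable adds little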