Housing Market Analysis
Supervised Learning: Linear Regression
Linear regression is a linear approach for modeling the relationship between a scalar response and one or more explanatory variables.  
The data set housesat.csv contains 1,000 observations and 21 columns (19 variables). The goal is to build a model that best predicts the house price. Before building a final model, one must check for major influential points (outliers), multicollinearity, normality of the residuals, non-independent errors, and non-constant error variance.
One uses the outlierTest function, a QQ plot, and other residual plots to identify and eliminate possible outliers. The first round of outlierTest eliminates observations 270, 657, 815, 301, 519, 313, 231, 22, 780, and 265. The 2nd round eliminates observations 153, 245, 279, 452, 550, 938, 321, and 422. The 3rd round eliminates observations 498, 69, 125, and 402. The 4th round eliminates observations 865, 475, and 213.
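To make the idea behind the outlier screening concrete, here is a minimal sketch of externally studentized residuals, the quantity that an outlier test of this kind ranks observations by. The analysis itself was run in R; this Python version is only an illustration under simplifying assumptions: a single predictor instead of the full model, made-up data, and no Bonferroni p-value. The function name studentized_residuals is hypothetical.

```python
import math

def studentized_residuals(x, y):
    """Externally studentized residuals for a one-predictor least-squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)

    result = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx      # leverage of this point
        sse_without = sse - e ** 2 / (1 - h)     # SSE of the fit with this point deleted
        s2_without = sse_without / (n - 3)       # n - 1 points, 2 fitted parameters
        result.append(e / math.sqrt(s2_without * (1 - h)))
    return result

# Made-up data (not from the housing data set): a noisy line with one planted outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.0, 0.05, -0.15, 0.1]
y = [2 * xi + 1 + ni for xi, ni in zip(x, noise)]
y[6] += 20  # planted outlier at index 6

t = studentized_residuals(x, y)
worst = max(range(len(t)), key=lambda i: abs(t[i]))
print("most extreme observation (0-based index):", worst)
```

Removing the flagged point and refitting changes every remaining residual, which is one reason the screening is repeated in rounds.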
Between every round of outlier elimination, a summary is run to check the overall p-value of the initial linear model lm(price~.-id-date, data.HouseData).
This QQ plot of residuals was produced prior to the 4th round of outlier elimination. One also checks the residuals for normality and draws an influence plot.
Observations 522 and 49 also show up as major influential points, but they are kept because retaining them decreased the final p-value.
One then checks for multicollinearity and eliminates a couple of variables using both forward and backward stepwise regression. One uses stepwise regression because the procedure adds or removes independent variables one at a time based on each variable's statistical significance.
The vif function shows that sqft_living has a variance inflation factor above 8, but it is kept; the stepwise procedure removes sqft_above from the model instead.
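For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below covers only the two-predictor special case, where that R² equals the squared Pearson correlation between the two variables. It is written in Python purely as an illustration (the analysis itself used R), the helper names are hypothetical, and the square-footage values are made up rather than taken from the data set.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(a, b):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r_squared = pearson_r(a, b) ** 2
    return 1.0 / (1.0 - r_squared)

# Made-up values for illustration only (not rows from the housing data set).
sqft_living = [1180, 2570, 770, 1960, 1680, 3100, 1430, 2200]
sqft_above = [1180, 2170, 770, 1050, 1680, 2900, 1430, 1700]

vif = vif_two_predictors(sqft_living, sqft_above)
print("VIF:", round(vif, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others, and the value grows without bound as the correlation approaches 1. With more than two predictors the R² has to come from a full multiple regression, which this sketch does not attempt.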
The final model has an R-squared of 0.7826 and contains 12 variables with p-values less than 0.05.
Judging by the variables' p-values alone, sqft_living, grade, and lat appear to be the most important variables contributing to the high R-squared value of the final model.
Other things to note are the results of the ncvTest for non-constant error variance and the Durbin-Watson test for non-independent errors. The p-value for the ncvTest is extremely low, which indicates that the error variance is not constant. The p-value for the Durbin-Watson test is high, so the null hypothesis of uncorrelated errors is not rejected and the test gives no evidence against independence.
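As a reminder of what the Durbin-Watson test measures, here is a minimal Python sketch of the statistic itself: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It computes only the statistic, not the p-value that the R test reports, and the residual sequences are made up for illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over SSE."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Made-up residual sequences for illustration.
alternating = [1, -1, 1, -1, 1, -1]   # sign flips every step (negative autocorrelation)
trending = [-3, -2, -1, 1, 2, 3]      # neighbors are similar (positive autocorrelation)

dw_alternating = durbin_watson(alternating)
dw_trending = durbin_watson(trending)
print(round(dw_alternating, 3), round(dw_trending, 3))
```

The statistic lies between 0 and 4. Values near 2 are consistent with uncorrelated errors, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.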
An interesting data visual, in one’s opinion, comes from using the date as the index, converting the data set into a time series object, and plotting the price of homes on a monthly timeline.
From this plot, other than the outlier in Feb 01, one can tell that prices peak during the summer months, June through August.
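The monthly view described above can be sketched as a simple group-by on the sale date. This Python version is an illustration only. It assumes the monthly value is the mean sale price, which the write-up does not specify; the function name monthly_mean_price is hypothetical; and the sample sales are made up.

```python
from collections import defaultdict
from datetime import date

def monthly_mean_price(sales):
    """Average sale price per (year, month), returned in chronological order."""
    totals = defaultdict(lambda: [0.0, 0])  # (year, month) -> [price sum, count]
    for sale_date, price in sales:
        key = (sale_date.year, sale_date.month)
        totals[key][0] += price
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}

# Made-up sales for illustration (not rows from the housing data set).
sales = [
    (date(2014, 6, 3), 450000.0),
    (date(2014, 6, 20), 550000.0),
    (date(2014, 7, 11), 520000.0),
    (date(2015, 1, 15), 400000.0),
]

monthly = monthly_mean_price(sales)
for (year, month), mean_price in monthly.items():
    print(f"{year}-{month:02d}: {mean_price:,.0f}")
```

Using the median instead of the mean would make the monthly series less sensitive to a single extreme sale, such as the outlier noted above.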