6  Linear Regression

Coverage: Linear Regression, General Linear Regression, Lasso, Ridge, Elastic Net Regression

6.1 Theoretical Knowledge

The multiple linear regression model is \(y = X \beta + \varepsilon\), fitted by minimizing the loss function \(Q = \left\| y - X \beta \right\|^2 = (y - X \beta) ^T (y - X \beta)\).

\[ \frac{\partial Q}{\partial \beta} = -2X^Ty + 2X^TX\beta = 0 \\ \to \hat{\beta} = {(X^TX)}^{-1}X^Ty \]

H is called Hat Matrix where: \(\hat{y} = X \hat{\beta} = X (X^TX)^{-1}X^Ty = Hy\)

Normal Equation: \((X^TX)\hat{\beta} = X^Ty \to X^T(y-X\hat{\beta}) = 0\)

\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]

\[ \text{Adjusted-}R^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)} \]

where \(k\) is the number of predictors and \(n\) is the number of observations.

Here SSE is the sum of squared errors and SSR is the regression sum of squares. Elsewhere you may see ESS (explained sum of squares), which equals SSR here, and RSS (residual sum of squares), which equals SSE.
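These identities are easy to verify in R (a quick sketch on the built-in mtcars data):

```r
fit <- lm(mpg ~ wt, data = mtcars)
sse <- sum(residuals(fit)^2)                     # SSE: residual sum of squares
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # SST: total sum of squares
ssr <- sst - sse                                 # SSR: regression sum of squares

all.equal(ssr / sst, summary(fit)$r.squared)     # both forms give R^2
all.equal(1 - sse / sst, summary(fit)$r.squared)
```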

In R, use lm() to fit a linear regression model.
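The closed-form estimate \(\hat{\beta} = (X^TX)^{-1}X^Ty\) can be checked against lm() (a minimal sketch on simulated data):

```r
# Verify the closed-form OLS solution against lm() on simulated data
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                             # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) b = X'y

fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # TRUE
```

Note that solve(crossprod(X), crossprod(X, y)) solves the normal equations directly, which is numerically preferable to forming the inverse explicitly.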

6.2 Example: diamonds dataset

library(tidyverse)
library(caret)
library(car)
library(glmnet)
data(diamonds)
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
sum(is.na(diamonds))
[1] 0
str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

6.2.1 Without preprocessing

fit1 <- lm(price~., diamonds)
summary(fit1)

Call:
lm(formula = price ~ ., data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-21376.0   -592.4   -183.5    376.4  10694.2 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5753.762    396.630   14.507  < 2e-16 ***
carat       11256.978     48.628  231.494  < 2e-16 ***
cut.L         584.457     22.478   26.001  < 2e-16 ***
cut.Q        -301.908     17.994  -16.778  < 2e-16 ***
cut.C         148.035     15.483    9.561  < 2e-16 ***
cut^4         -20.794     12.377   -1.680  0.09294 .  
color.L     -1952.160     17.342 -112.570  < 2e-16 ***
color.Q      -672.054     15.777  -42.597  < 2e-16 ***
color.C      -165.283     14.725  -11.225  < 2e-16 ***
color^4        38.195     13.527    2.824  0.00475 ** 
color^5       -95.793     12.776   -7.498 6.59e-14 ***
color^6       -48.466     11.614   -4.173 3.01e-05 ***
clarity.L    4097.431     30.259  135.414  < 2e-16 ***
clarity.Q   -1925.004     28.227  -68.197  < 2e-16 ***
clarity.C     982.205     24.152   40.668  < 2e-16 ***
clarity^4    -364.918     19.285  -18.922  < 2e-16 ***
clarity^5     233.563     15.752   14.828  < 2e-16 ***
clarity^6       6.883     13.715    0.502  0.61575    
clarity^7      90.640     12.103    7.489 7.06e-14 ***
depth         -63.806      4.535  -14.071  < 2e-16 ***
table         -26.474      2.912   -9.092  < 2e-16 ***
x           -1008.261     32.898  -30.648  < 2e-16 ***
y               9.609     19.333    0.497  0.61918    
z             -50.119     33.486   -1.497  0.13448    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1130 on 53916 degrees of freedom
Multiple R-squared:  0.9198,    Adjusted R-squared:  0.9198 
F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit1)

The Variance Inflation Factor (VIF) is useful for detecting multicollinearity: \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. Values above about 10 (or \(GVIF^{1/(2 \cdot Df)}\) above about \(\sqrt{10}\)) indicate problematic collinearity.

vif(fit1)
             GVIF Df GVIF^(1/(2*Df))
carat   22.439582  1        4.737044
cut      1.945645  4        1.086758
color    1.178326  6        1.013768
clarity  1.347474  7        1.021531
depth    1.782401  1        1.335066
table    1.787765  1        1.337073
x       57.518327  1        7.584084
y       20.592160  1        4.537859
z       23.585582  1        4.856499
fit1a <- lm(price~.-x-z-y, diamonds)
vif(fit1a)
            GVIF Df GVIF^(1/(2*Df))
carat   1.323098  1        1.150260
cut     1.926004  4        1.085381
color   1.169317  6        1.013120
clarity 1.303877  7        1.019134
depth   1.378257  1        1.173992
table   1.786714  1        1.336680
summary(fit1a)

Call:
lm(formula = price ~ . - x - z - y, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max 
-16828.8   -678.7   -199.4    464.6  10341.2 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  -969.661    360.432   -2.690  0.00714 ** 
carat        8895.194     12.079  736.390  < 2e-16 ***
cut.L         615.613     22.985   26.784  < 2e-16 ***
cut.Q        -326.638     18.390  -17.762  < 2e-16 ***
cut.C         156.333     15.814    9.886  < 2e-16 ***
cut^4         -15.975     12.648   -1.263  0.20657    
color.L     -1908.010     17.718 -107.689  < 2e-16 ***
color.Q      -626.087     16.112  -38.858  < 2e-16 ***
color.C      -172.056     15.063  -11.423  < 2e-16 ***
color^4        20.319     13.833    1.469  0.14187    
color^5       -85.245     13.068   -6.523 6.95e-11 ***
color^6       -50.085     11.881   -4.216 2.50e-05 ***
clarity.L    4206.854     30.867  136.290  < 2e-16 ***
clarity.Q   -1831.804     28.811  -63.580  < 2e-16 ***
clarity.C     919.725     24.672   37.278  < 2e-16 ***
clarity^4    -361.609     19.728  -18.330  < 2e-16 ***
clarity^5     213.910     16.108   13.280  < 2e-16 ***
clarity^6       2.986     14.030    0.213  0.83148    
clarity^7     110.147     12.375    8.901  < 2e-16 ***
depth         -21.024      4.079   -5.154 2.56e-07 ***
table         -24.803      2.978   -8.329  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1156 on 53919 degrees of freedom
Multiple R-squared:  0.9161,    Adjusted R-squared:  0.916 
F-statistic: 2.942e+04 on 20 and 53919 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit1a)

fit2 <- lm(log(price)~., diamonds)
summary(fit2)

Call:
lm(formula = log(price) ~ ., data = diamonds)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2544 -0.0911  0.0015  0.0902 10.2241 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -2.5530834  0.0616229  -41.431  < 2e-16 ***
carat       -0.6120540  0.0075551  -81.012  < 2e-16 ***
cut.L        0.1045411  0.0034923   29.934  < 2e-16 ***
cut.Q       -0.0370562  0.0027956  -13.255  < 2e-16 ***
cut.C        0.0371749  0.0024056   15.454  < 2e-16 ***
cut^4        0.0115784  0.0019229    6.021 1.74e-09 ***
color.L     -0.4526278  0.0026943 -167.993  < 2e-16 ***
color.Q     -0.1035542  0.0024512  -42.246  < 2e-16 ***
color.C     -0.0118522  0.0022878   -5.181 2.22e-07 ***
color^4      0.0188754  0.0021016    8.982  < 2e-16 ***
color^5     -0.0068063  0.0019850   -3.429 0.000606 ***
color^6      0.0021343  0.0018044    1.183 0.236892    
clarity.L    0.8907095  0.0047012  189.466  < 2e-16 ***
clarity.Q   -0.2599052  0.0043856  -59.264  < 2e-16 ***
clarity.C    0.1435463  0.0037523   38.255  < 2e-16 ***
clarity^4   -0.0662975  0.0029962  -22.127  < 2e-16 ***
clarity^5    0.0285591  0.0024473   11.670  < 2e-16 ***
clarity^6   -0.0039048  0.0021309   -1.833 0.066881 .  
clarity^7    0.0290474  0.0018805   15.447  < 2e-16 ***
depth        0.0521543  0.0007045   74.028  < 2e-16 ***
table        0.0089978  0.0004524   19.890  < 2e-16 ***
x            1.1646945  0.0051112  227.871  < 2e-16 ***
y            0.0325386  0.0030037   10.833  < 2e-16 ***
z            0.0427049  0.0052026    8.208 2.29e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1756 on 53916 degrees of freedom
Multiple R-squared:  0.9701,    Adjusted R-squared:  0.9701 
F-statistic: 7.597e+04 on 23 and 53916 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit2)

vif(fit2)
             GVIF Df GVIF^(1/(2*Df))
carat   22.439582  1        4.737044
cut      1.945645  4        1.086758
color    1.178326  6        1.013768
clarity  1.347474  7        1.021531
depth    1.782401  1        1.335066
table    1.787765  1        1.337073
x       57.518327  1        7.584084
y       20.592160  1        4.537859
z       23.585582  1        4.856499
fit2a <- lm(log(price)~.-x-z-y, diamonds)
vif(fit2a)
            GVIF Df GVIF^(1/(2*Df))
carat   1.323098  1        1.150260
cut     1.926004  4        1.085381
color   1.169317  6        1.013120
clarity 1.303877  7        1.019134
depth   1.378257  1        1.173992
table   1.786714  1        1.336680
summary(fit2a)

Call:
lm(formula = log(price) ~ . - x - z - y, data = diamonds)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9717 -0.2181  0.0572  0.2479  1.6083 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.5193818  0.1055868  52.273  < 2e-16 ***
carat        2.1946870  0.0035386 620.211  < 2e-16 ***
cut.L        0.0682284  0.0067332  10.133  < 2e-16 ***
cut.Q       -0.0090647  0.0053872  -1.683   0.0925 .  
cut.C        0.0291522  0.0046325   6.293 3.14e-10 ***
cut^4        0.0065684  0.0037051   1.773   0.0763 .  
color.L     -0.5051100  0.0051903 -97.318  < 2e-16 ***
color.Q     -0.1582296  0.0047199 -33.524  < 2e-16 ***
color.C     -0.0038111  0.0044125  -0.864   0.3877    
color^4      0.0400952  0.0040522   9.895  < 2e-16 ***
color^5     -0.0192706  0.0038283  -5.034 4.82e-07 ***
color^6      0.0041225  0.0034805   1.184   0.2362    
clarity.L    0.7615612  0.0090423  84.222  < 2e-16 ***
clarity.Q   -0.3710533  0.0084401 -43.963  < 2e-16 ***
clarity.C    0.2181958  0.0072276  30.189  < 2e-16 ***
clarity^4   -0.0704719  0.0057792 -12.194  < 2e-16 ***
clarity^5    0.0521882  0.0047187  11.060  < 2e-16 ***
clarity^6    0.0006772  0.0041100   0.165   0.8691    
clarity^7    0.0058294  0.0036253   1.608   0.1078    
depth        0.0000854  0.0011950   0.071   0.9430    
table        0.0068963  0.0008723   7.906 2.71e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3387 on 53919 degrees of freedom
Multiple R-squared:  0.8886,    Adjusted R-squared:  0.8886 
F-statistic: 2.151e+04 on 20 and 53919 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit2a)

The log-price model looks better, judging by the residual diagnostics and the adjusted \(R^2\).

6.2.2 With preprocessing

new_diamonds <- diamonds %>% 
    predict(dummyVars(price ~ ., data = ., sep = "_", levelsOnly = FALSE), newdata = .) %>%
    as.data.frame() %>%
    mutate(logprice = log(diamonds$price))
colnames(new_diamonds) <- gsub("\\^", "_", colnames(new_diamonds))

set.seed(42)
training.idx <- new_diamonds$logprice %>%
    createDataPartition(p = 0.75, list = F)
trainset <- new_diamonds[training.idx, ]
testset <- new_diamonds[-training.idx, ]

prepproc <- preProcess(trainset, method = c("center", "scale"))
trainset.scaled <- predict(prepproc, trainset)
testset.scaled <- predict(prepproc, testset)

mu <- prepproc$mean["logprice"]
sigma <- prepproc$std["logprice"]

train.x <- trainset.scaled %>% select(-logprice)
test.x <- testset.scaled %>% select(-logprice)
fit3 <- lm(logprice~.-x-z-y, trainset.scaled)
summary(fit3)

Call:
lm(formula = logprice ~ . - x - z - y, data = trainset.scaled)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7768 -0.2137  0.0560  0.2444  1.5795 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.959e-14  1.653e-03   0.000   1.0000    
carat        1.025e+00  1.899e-03 540.026  < 2e-16 ***
cut.L        2.457e-02  2.677e-03   9.179  < 2e-16 ***
cut.Q       -4.757e-03  2.791e-03  -1.705   0.0883 .  
cut.C        1.268e-02  2.237e-03   5.667 1.46e-08 ***
cut_4        2.137e-03  1.869e-03   1.143   0.2530    
color.L     -1.603e-01  1.888e-03 -84.918  < 2e-16 ***
color.Q     -5.285e-02  1.831e-03 -28.871  < 2e-16 ***
color.C      2.460e-05  1.810e-03   0.014   0.9892    
color_4      1.505e-02  1.782e-03   8.448  < 2e-16 ***
color_5     -6.681e-03  1.702e-03  -3.926 8.65e-05 ***
color_6      5.532e-04  1.681e-03   0.329   0.7420    
clarity.L    1.891e-01  2.585e-03  73.170  < 2e-16 ***
clarity.Q   -8.582e-02  2.271e-03 -37.790  < 2e-16 ***
clarity.C    6.390e-02  2.479e-03  25.779  < 2e-16 ***
clarity_4   -2.354e-02  2.295e-03 -10.257  < 2e-16 ***
clarity_5    1.698e-02  2.012e-03   8.440  < 2e-16 ***
clarity_6   -1.291e-03  1.878e-03  -0.687   0.4920    
clarity_7    1.763e-03  1.755e-03   1.005   0.3151    
depth       -3.771e-05  1.929e-03  -0.020   0.9844    
table        1.549e-02  2.215e-03   6.990 2.80e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3325 on 40436 degrees of freedom
Multiple R-squared:  0.8895,    Adjusted R-squared:  0.8895 
F-statistic: 1.628e+04 on 20 and 40436 DF,  p-value: < 2.2e-16
vif(fit3)
    carat     cut.L     cut.Q     cut.C     cut_4   color.L   color.Q   color.C 
 1.319348  2.621717  2.850233  1.832031  1.278920  1.304379  1.226508  1.199441 
  color_4   color_5   color_6 clarity.L clarity.Q clarity.C clarity_4 clarity_5 
 1.161859  1.059918  1.033804  2.445376  1.887271  2.248751  1.926817  1.481558 
clarity_6 clarity_7     depth     table 
 1.291236  1.126544  1.362239  1.796295 
par(mfrow = c(2, 2))
plot(fit3)

ytr.true <- exp(trainset$logprice)
yte.true <- exp(testset$logprice)
ytr.pred3 <- exp(predict(fit3, train.x) * sigma + mu)
yte.pred3 <- exp(predict(fit3, test.x) * sigma + mu)


pf.lm <- data.frame(
    model = "lm",
    train.mse = mean((ytr.pred3-ytr.true)^2),
    test.mse = mean((yte.pred3-yte.true)^2)
)
pf.lm
  model train.mse   test.mse
1    lm 229599179 4206268380
pf.linearRegression <- postResample(
    pred = yte.pred3, 
    obs = yte.true
)

6.3 Stepwise regression and best subset

There are two commonly used information criteria:

Akaike Information Criterion \(\text{AIC} = - 2 \log{L} + 2 k\)

Bayesian Information Criterion (aka Schwarz IC) \(\text{BIC} = - 2 \log{L} + k \log{N}\)

Here \(L\) denotes the maximum likelihood, \(k\) denotes number of parameters in the model, and \(N\) denotes the number of observations.

When the sample size \(N > e^2 \approx 7.389\) (so that \(\log N > 2\)), BIC penalizes each additional parameter more heavily than AIC and is therefore more likely to drop variables.

\[\text{Criterion}=\text{Penalty for Error}+\text{Penalty for Complexity}\]

The smaller, the better.

  • Stepwise Regression: start from the full model (backward) or the null model (forward), then drop or add one variable at a time using a greedy strategy.

  • Best Subset Regression: evaluate every subset of the predictors, then pick the best one.
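The two criteria can be checked against R's built-in AIC() and BIC() functions (a quick sketch on mtcars; note that for lm the parameter count \(k\) includes the residual variance):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- length(coef(fit)) + 1          # +1 for the residual variance parameter
logL <- as.numeric(logLik(fit))

c(manual = -2 * logL + 2 * k,      builtin = AIC(fit))  # these agree
c(manual = -2 * logL + k * log(n), builtin = BIC(fit))  # these agree
```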

6.3.1 Stepwise Regression

library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select

6.3.1.1 Backward (start from full model)

Set k = 2 for AIC, or k = log(nrow(trainset.scaled)) for BIC.

Setting trace = 1 prints every step.

fit_s1a <- lm(logprice~., trainset.scaled)
fit_step_backward <- stepAIC(fit_s1a, k = log(nrow(trainset.scaled)), trace = 0) # MASS pkg
summary(fit_step_backward)

Call:
lm(formula = logprice ~ carat + cut.L + cut.Q + cut.C + cut_4 + 
    color.L + color.Q + color.C + color_4 + clarity.L + clarity.Q + 
    clarity.C + clarity_4 + clarity_5 + clarity_7 + depth + table + 
    x + y + z, data = trainset.scaled)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2146 -0.0910  0.0021  0.0900  9.8143 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -1.795e-14  8.879e-04    0.000  1.00000    
carat       -2.533e-01  4.176e-03  -60.651  < 2e-16 ***
cut.L        3.546e-02  1.438e-03   24.663  < 2e-16 ***
cut.Q       -1.589e-02  1.500e-03  -10.593  < 2e-16 ***
cut.C        1.484e-02  1.203e-03   12.338  < 2e-16 ***
cut_4        5.114e-03  1.005e-03    5.090 3.60e-07 ***
color.L     -1.442e-01  1.015e-03 -142.104  < 2e-16 ***
color.Q     -3.530e-02  9.829e-04  -35.917  < 2e-16 ***
color.C     -3.310e-03  9.605e-04   -3.446  0.00057 ***
color_4      8.162e-03  9.336e-04    8.743  < 2e-16 ***
clarity.L    2.220e-01  1.386e-03  160.173  < 2e-16 ***
clarity.Q   -6.169e-02  1.210e-03  -50.991  < 2e-16 ***
clarity.C    4.353e-02  1.322e-03   32.916  < 2e-16 ***
clarity_4   -2.260e-02  1.190e-03  -18.988  < 2e-16 ***
clarity_5    1.105e-02  1.053e-03   10.498  < 2e-16 ***
clarity_7    1.260e-02  9.034e-04   13.949  < 2e-16 ***
depth        7.160e-02  1.159e-03   61.777  < 2e-16 ***
table        2.002e-02  1.190e-03   16.821  < 2e-16 ***
x            1.276e+00  6.303e-03  202.424  < 2e-16 ***
y            1.456e-02  3.555e-03    4.096 4.21e-05 ***
z            2.997e-02  3.932e-03    7.622 2.55e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1786 on 40436 degrees of freedom
Multiple R-squared:  0.9681,    Adjusted R-squared:  0.9681 
F-statistic: 6.14e+04 on 20 and 40436 DF,  p-value: < 2.2e-16
ytr.pred.sb <- exp(fit_step_backward %>% predict(train.x) * sigma + mu)
yte.pred.sb <- exp(fit_step_backward %>% predict(test.x) * sigma + mu)



pf.lm.sb <- data.frame(
    model = "Stepwise-Backward",
    train.mse = mean((ytr.pred.sb-ytr.true)^2),
    test.mse = mean((yte.pred.sb-yte.true)^2)
)
pf.lm.sb
              model train.mse test.mse
1 Stepwise-Backward   1200659  1243213
pf.Stepwise.Backward <- postResample(
    pred = yte.pred.sb, 
    obs = yte.true
)

6.3.1.2 Forward (start from null model)

fit_s2a <- lm(logprice~1, trainset.scaled)
fit_s2b <- lm(logprice~., trainset.scaled)
fit_step_forward <- stepAIC(fit_s2a, scope = list(lower = fit_s2a, upper = fit_s2b), 
                        direction = "forward", k = log(nrow(trainset.scaled)), trace = 0) # MASS pkg
summary(fit_step_forward)

Call:
lm(formula = logprice ~ x + clarity.L + color.L + carat + depth + 
    clarity.Q + color.Q + clarity.C + clarity_4 + cut.L + table + 
    clarity_7 + cut_4 + clarity_5 + color_4 + z + cut.C + cut.Q + 
    y + color.C, data = trainset.scaled)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2146 -0.0910  0.0021  0.0900  9.8143 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -1.795e-14  8.879e-04    0.000  1.00000    
x            1.276e+00  6.303e-03  202.424  < 2e-16 ***
clarity.L    2.220e-01  1.386e-03  160.173  < 2e-16 ***
color.L     -1.442e-01  1.015e-03 -142.104  < 2e-16 ***
carat       -2.533e-01  4.176e-03  -60.651  < 2e-16 ***
depth        7.160e-02  1.159e-03   61.777  < 2e-16 ***
clarity.Q   -6.169e-02  1.210e-03  -50.991  < 2e-16 ***
color.Q     -3.530e-02  9.829e-04  -35.917  < 2e-16 ***
clarity.C    4.353e-02  1.322e-03   32.916  < 2e-16 ***
clarity_4   -2.260e-02  1.190e-03  -18.988  < 2e-16 ***
cut.L        3.546e-02  1.438e-03   24.663  < 2e-16 ***
table        2.002e-02  1.190e-03   16.821  < 2e-16 ***
clarity_7    1.260e-02  9.034e-04   13.949  < 2e-16 ***
cut_4        5.114e-03  1.005e-03    5.090 3.60e-07 ***
clarity_5    1.105e-02  1.053e-03   10.498  < 2e-16 ***
color_4      8.162e-03  9.336e-04    8.743  < 2e-16 ***
z            2.997e-02  3.932e-03    7.622 2.55e-14 ***
cut.C        1.484e-02  1.203e-03   12.338  < 2e-16 ***
cut.Q       -1.589e-02  1.500e-03  -10.593  < 2e-16 ***
y            1.456e-02  3.555e-03    4.096 4.21e-05 ***
color.C     -3.310e-03  9.605e-04   -3.446  0.00057 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1786 on 40436 degrees of freedom
Multiple R-squared:  0.9681,    Adjusted R-squared:  0.9681 
F-statistic: 6.14e+04 on 20 and 40436 DF,  p-value: < 2.2e-16
ytr.pred.sf <- exp(fit_step_forward %>% predict(train.x) * sigma + mu)
yte.pred.sf <- exp(fit_step_forward %>% predict(test.x) * sigma + mu)



pf.lm.sf <- data.frame(
    model = "Stepwise-Forward",
    train.mse = mean((ytr.pred.sf-ytr.true)^2),
    test.mse = mean((yte.pred.sf-yte.true)^2)
)
pf.lm.sf
             model train.mse test.mse
1 Stepwise-Forward   1200659  1243213
pf.Stepwise.Forward <- postResample(
    pred = yte.pred.sf, 
    obs = yte.true
)

6.3.2 Best Subset Regression

library(leaps)
fit_bs <- regsubsets(logprice~., trainset.scaled, nvmax = 30)
# nvmax = 30: sets the maximum number of predictors to consider.
summary(fit_bs)
Subset selection object
Call: regsubsets.formula(logprice ~ ., trainset.scaled, nvmax = 30)
23 Variables  (and intercept)
          Forced in Forced out
carat         FALSE      FALSE
cut.L         FALSE      FALSE
cut.Q         FALSE      FALSE
cut.C         FALSE      FALSE
cut_4         FALSE      FALSE
color.L       FALSE      FALSE
color.Q       FALSE      FALSE
color.C       FALSE      FALSE
color_4       FALSE      FALSE
color_5       FALSE      FALSE
color_6       FALSE      FALSE
clarity.L     FALSE      FALSE
clarity.Q     FALSE      FALSE
clarity.C     FALSE      FALSE
clarity_4     FALSE      FALSE
clarity_5     FALSE      FALSE
clarity_6     FALSE      FALSE
clarity_7     FALSE      FALSE
depth         FALSE      FALSE
table         FALSE      FALSE
x             FALSE      FALSE
y             FALSE      FALSE
z             FALSE      FALSE
1 subsets of each size up to 23
Selection Algorithm: exhaustive
          carat cut.L cut.Q cut.C cut_4 color.L color.Q color.C color_4 color_5
1  ( 1 )  " "   " "   " "   " "   " "   " "     " "     " "     " "     " "    
2  ( 1 )  " "   " "   " "   " "   " "   " "     " "     " "     " "     " "    
3  ( 1 )  " "   " "   " "   " "   " "   "*"     " "     " "     " "     " "    
4  ( 1 )  "*"   " "   " "   " "   " "   "*"     " "     " "     " "     " "    
5  ( 1 )  "*"   " "   " "   " "   " "   "*"     " "     " "     " "     " "    
6  ( 1 )  "*"   " "   " "   " "   " "   "*"     " "     " "     " "     " "    
7  ( 1 )  "*"   " "   " "   " "   " "   "*"     "*"     " "     " "     " "    
8  ( 1 )  "*"   " "   " "   " "   " "   "*"     "*"     " "     " "     " "    
9  ( 1 )  "*"   " "   " "   " "   " "   "*"     "*"     " "     " "     " "    
10  ( 1 ) "*"   "*"   " "   " "   " "   "*"     "*"     " "     " "     " "    
11  ( 1 ) "*"   "*"   " "   " "   " "   "*"     "*"     " "     " "     " "    
12  ( 1 ) "*"   "*"   " "   " "   " "   "*"     "*"     " "     " "     " "    
13  ( 1 ) "*"   "*"   " "   " "   "*"   "*"     "*"     " "     " "     " "    
14  ( 1 ) "*"   "*"   "*"   "*"   " "   "*"     "*"     " "     " "     " "    
15  ( 1 ) "*"   "*"   "*"   "*"   " "   "*"     "*"     " "     " "     " "    
16  ( 1 ) "*"   "*"   "*"   "*"   " "   "*"     "*"     " "     "*"     " "    
17  ( 1 ) "*"   "*"   "*"   "*"   " "   "*"     "*"     " "     "*"     " "    
18  ( 1 ) "*"   "*"   "*"   "*"   "*"   "*"     "*"     " "     "*"     " "    
19  ( 1 ) "*"   "*"   "*"   "*"   "*"   "*"     "*"     " "     "*"     " "    
20  ( 1 ) "*"   "*"   "*"   "*"   "*"   "*"     "*"     "*"     "*"     " "    
21  ( 1 ) "*"   "*"   "*"   "*"   "*"   "*"     "*"     "*"     "*"     " "    
22  ( 1 ) "*"   "*"   "*"   "*"   "*"   "*"     "*"     "*"     "*"     "*"    
23  ( 1 ) "*"   "*"   "*"   "*"   "*"   "*"     "*"     "*"     "*"     "*"    
          color_6 clarity.L clarity.Q clarity.C clarity_4 clarity_5 clarity_6
1  ( 1 )  " "     " "       " "       " "       " "       " "       " "      
2  ( 1 )  " "     "*"       " "       " "       " "       " "       " "      
3  ( 1 )  " "     "*"       " "       " "       " "       " "       " "      
4  ( 1 )  " "     "*"       " "       " "       " "       " "       " "      
5  ( 1 )  " "     "*"       " "       " "       " "       " "       " "      
6  ( 1 )  " "     "*"       "*"       " "       " "       " "       " "      
7  ( 1 )  " "     "*"       "*"       " "       " "       " "       " "      
8  ( 1 )  " "     "*"       "*"       "*"       " "       " "       " "      
9  ( 1 )  " "     "*"       "*"       "*"       "*"       " "       " "      
10  ( 1 ) " "     "*"       "*"       "*"       "*"       " "       " "      
11  ( 1 ) " "     "*"       "*"       "*"       "*"       " "       " "      
12  ( 1 ) " "     "*"       "*"       "*"       "*"       " "       " "      
13  ( 1 ) " "     "*"       "*"       "*"       "*"       " "       " "      
14  ( 1 ) " "     "*"       "*"       "*"       "*"       " "       " "      
15  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       " "      
16  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       " "      
17  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       " "      
18  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       " "      
19  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       " "      
20  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       " "      
21  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       "*"      
22  ( 1 ) " "     "*"       "*"       "*"       "*"       "*"       "*"      
23  ( 1 ) "*"     "*"       "*"       "*"       "*"       "*"       "*"      
          clarity_7 depth table x   y   z  
1  ( 1 )  " "       " "   " "   "*" " " " "
2  ( 1 )  " "       " "   " "   "*" " " " "
3  ( 1 )  " "       " "   " "   "*" " " " "
4  ( 1 )  " "       " "   " "   "*" " " " "
5  ( 1 )  " "       "*"   " "   "*" " " " "
6  ( 1 )  " "       "*"   " "   "*" " " " "
7  ( 1 )  " "       "*"   " "   "*" " " " "
8  ( 1 )  " "       "*"   " "   "*" " " " "
9  ( 1 )  " "       "*"   " "   "*" " " " "
10  ( 1 ) " "       "*"   " "   "*" " " " "
11  ( 1 ) " "       "*"   "*"   "*" " " " "
12  ( 1 ) "*"       "*"   "*"   "*" " " " "
13  ( 1 ) "*"       "*"   "*"   "*" " " " "
14  ( 1 ) "*"       "*"   "*"   "*" " " " "
15  ( 1 ) "*"       "*"   "*"   "*" " " " "
16  ( 1 ) "*"       "*"   "*"   "*" " " " "
17  ( 1 ) "*"       "*"   "*"   "*" " " "*"
18  ( 1 ) "*"       "*"   "*"   "*" " " "*"
19  ( 1 ) "*"       "*"   "*"   "*" "*" "*"
20  ( 1 ) "*"       "*"   "*"   "*" "*" "*"
21  ( 1 ) "*"       "*"   "*"   "*" "*" "*"
22  ( 1 ) "*"       "*"   "*"   "*" "*" "*"
23  ( 1 ) "*"       "*"   "*"   "*" "*" "*"
result <- summary(fit_bs)
result$which[which.min(result$bic),]
(Intercept)       carat       cut.L       cut.Q       cut.C       cut_4 
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
    color.L     color.Q     color.C     color_4     color_5     color_6 
       TRUE        TRUE        TRUE        TRUE       FALSE       FALSE 
  clarity.L   clarity.Q   clarity.C   clarity_4   clarity_5   clarity_6 
       TRUE        TRUE        TRUE        TRUE        TRUE       FALSE 
  clarity_7       depth       table           x           y           z 
       TRUE        TRUE        TRUE        TRUE        TRUE        TRUE 
n <- nrow(trainset.scaled)   # Sample size
p <- 1:length(result$rss)    # Number of predictors in each model
aic_values <- n * log(result$rss / n) + 2 * p  # AIC up to an additive constant
which.min(aic_values)        # Model size with the lowest AIC
[1] 22
result$which[12,]
(Intercept)       carat       cut.L       cut.Q       cut.C       cut_4 
       TRUE        TRUE        TRUE       FALSE       FALSE       FALSE 
    color.L     color.Q     color.C     color_4     color_5     color_6 
       TRUE        TRUE       FALSE       FALSE       FALSE       FALSE 
  clarity.L   clarity.Q   clarity.C   clarity_4   clarity_5   clarity_6 
       TRUE        TRUE        TRUE        TRUE       FALSE       FALSE 
  clarity_7       depth       table           x           y           z 
       TRUE        TRUE        TRUE        TRUE       FALSE       FALSE 
which.max(result$adjr2)
[1] 23
result$which[1,]
(Intercept)       carat       cut.L       cut.Q       cut.C       cut_4 
       TRUE       FALSE       FALSE       FALSE       FALSE       FALSE 
    color.L     color.Q     color.C     color_4     color_5     color_6 
      FALSE       FALSE       FALSE       FALSE       FALSE       FALSE 
  clarity.L   clarity.Q   clarity.C   clarity_4   clarity_5   clarity_6 
      FALSE       FALSE       FALSE       FALSE       FALSE       FALSE 
  clarity_7       depth       table           x           y           z 
      FALSE       FALSE       FALSE        TRUE       FALSE       FALSE 

6.4 Lasso, Ridge and Elastic Net Regression

The general idea of penalized regression (i.e., Lasso, Ridge, and Elastic Net Regression) is to add a penalty on the coefficients to the loss function being minimized.

6.4.1 Lasso

Lasso is defined by minimizing the loss function \(Q = \left\| y - X \beta \right\|^2\) subject to \(\left\| \beta \right\|_1 < t\).

It can also be written:

\[\min_{\beta \in \mathbb{R}^p} \left( \frac{1}{N} \left\| y - X \beta \right\|^2 + \lambda \left\| \beta \right\|_1 \right)\]

Lasso can shrink some coefficients exactly to 0, which makes it useful for feature selection.
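For an orthonormal design, the lasso estimate is the soft-thresholded OLS estimate, \(\hat{\beta}_j = \operatorname{sign}(\hat{\beta}_j^{OLS})(|\hat{\beta}_j^{OLS}| - \lambda)_+\), which shows how coefficients reach exactly zero. A minimal sketch of the operator:

```r
# Soft-thresholding: the lasso coefficient update under an orthonormal design
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

# Small coefficients are zeroed out; large ones are shrunk toward zero
soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 0.5)
# [1] -1.5  0.0  0.0  1.0
```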

Set alpha = 1 in glmnet() for Lasso.

lambda <- 10^seq(1,-2,length = 100)

fit.lasso <- glmnet(train.x, trainset.scaled$logprice, alpha = 1, lambda = lambda)

plot(fit.lasso, xvar = "lambda", label = TRUE)

cv.lasso <- cv.glmnet(as.matrix(train.x), trainset.scaled$logprice, alpha = 1, lambda = lambda, nfolds = 10)

plot(cv.lasso)

cv.lasso

Call:  cv.glmnet(x = as.matrix(train.x), y = trainset.scaled$logprice,      lambda = lambda, nfolds = 10, alpha = 1) 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.01000   100 0.03992 0.003268      11
1se 0.01874    91 0.04297 0.003157       8
best_lam <- cv.lasso$lambda.min
lasso_coef <- predict(cv.lasso, s = best_lam, type = "coefficients")

import_coefs <- as.matrix(lasso_coef)
active_vars <- data.frame(
  Variable = rownames(import_coefs),
  Coefficient = as.numeric(import_coefs)
) %>% filter(Coefficient != 0 & Variable != "(Intercept)")


active_vars
    Variable  Coefficient
1      cut.L  0.006033271
2    color.L -0.130907441
3    color.Q -0.025134709
4    color_4  0.002078327
5  clarity.L  0.172731638
6  clarity.Q -0.035257206
7  clarity.C  0.007961533
8      depth  0.030957031
9          x  0.987147413
10         y  0.010949118
11         z  0.052448464
cat("Dropped Variables: ", rownames(import_coefs)[import_coefs[,1] == 0])
Dropped Variables:  carat cut.Q cut.C cut_4 color.C color_5 color_6 clarity_4 clarity_5 clarity_6 clarity_7 table
ytr.pred.lasso <- exp(predict(cv.lasso, s = best_lam, newx = as.matrix(train.x)) * sigma + mu)
yte.pred.lasso <- exp(predict(cv.lasso, s = best_lam, newx = as.matrix(test.x)) * sigma + mu)


pf.lasso <- data.frame(
    model = "lasso",
    train.mse = mean((ytr.pred.lasso-ytr.true)^2),
    test.mse = mean((yte.pred.lasso-yte.true)^2)
)
pf.lasso
  model train.mse test.mse
1 lasso   3193370  4703154
pf.lasso2 <- postResample(
    pred = yte.pred.lasso, 
    obs = yte.true
)

6.4.2 Ridge

Ridge minimizes the loss function \(Q = (y - X \beta) ^T (y - X \beta) + \lambda \beta^T \beta\).

Ridge works best when the OLS estimates have high variance, e.g., under multicollinearity.

\[ \frac{\partial Q}{\partial \beta} = -2X^Ty + 2X^TX\beta + 2 \lambda \beta= 0 \\ \to \hat{\beta} = {(X^TX + \lambda I)}^{-1}X^Ty \]
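The closed form \(\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty\) can be verified directly on simulated data (a minimal sketch using base R):

```r
# Verify the ridge closed form shrinks coefficients relative to OLS
set.seed(2)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, 0, 0, 1) + rnorm(n)

lambda <- 10
beta_ols   <- solve(crossprod(X), crossprod(X, y))               # (X'X)^{-1} X'y
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# The ridge estimate always has a smaller L2 norm than OLS
c(ols = sum(beta_ols^2), ridge = sum(beta_ridge^2))
```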

Set alpha = 0 in glmnet() for Ridge.

lambda <- 10^seq(1,-2,length = 100)

fit.ridge <- glmnet(train.x, trainset.scaled$logprice, alpha = 0, lambda = lambda)

plot(fit.ridge, xvar = "lambda", label = TRUE)

cv.ridge <- cv.glmnet(as.matrix(train.x), trainset.scaled$logprice, alpha = 0, lambda = lambda, nfolds = 10)

plot(cv.ridge)

cv.ridge

Call:  cv.glmnet(x = as.matrix(train.x), y = trainset.scaled$logprice,      lambda = lambda, nfolds = 10, alpha = 0) 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.01000   100 0.04562 0.008444      23
1se 0.08697    69 0.05379 0.006803      23
best_lam <- cv.ridge$lambda.min
ridge_coef <- predict(cv.ridge, s = best_lam, type = "coefficients")

ridge_coef
24 x 1 sparse Matrix of class "dgCMatrix"
                   s=0.01
(Intercept) -1.564928e-14
carat       -5.448247e-02
cut.L        3.267201e-02
cut.Q       -1.197818e-02
cut.C        1.219518e-02
cut_4        2.998444e-03
color.L     -1.446473e-01
color.Q     -3.744780e-02
color.C     -2.738590e-03
color_4      9.394073e-03
color_5     -3.019684e-03
color_6      7.682292e-04
clarity.L    2.097124e-01
clarity.Q   -6.202300e-02
clarity.C    4.186375e-02
clarity_4   -2.048792e-02
clarity_5    9.100290e-03
clarity_6   -1.519407e-03
clarity_7    1.037701e-02
depth        4.537965e-02
table        1.932575e-02
x            8.605183e-01
y            1.107601e-01
z            1.491461e-01
ytr.pred.ridge <- exp(predict(cv.ridge, s = best_lam, newx = as.matrix(train.x)) * sigma + mu)
yte.pred.ridge <- exp(predict(cv.ridge, s = best_lam, newx = as.matrix(test.x)) * sigma + mu)


pf.ridge <- data.frame(
    model = "ridge",
    train.mse = mean((ytr.pred.ridge-ytr.true)^2),
    test.mse = mean((yte.pred.ridge-yte.true)^2)
)
pf.ridge
  model train.mse test.mse
1 ridge 391270522  2773539
pf.ridge2 <- postResample(
    pred = yte.pred.ridge, 
    obs = yte.true
)

6.4.3 Elastic Net

The alpha argument of the glmnet function is described as:

\[ \frac{1-\alpha}{2} \left\|\beta \right\|_2^2 + \alpha \left\|\beta \right\|_1 \]

That is why setting \(\alpha = 1\) gives Lasso Regression, and \(\alpha = 0\) gives Ridge Regression.
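
Intermediate values of alpha blend the two penalties; a sketch on simulated data (assuming glmnet is installed; the data and alpha = 0.5 are illustrative):

```r
library(glmnet)

set.seed(2)
X <- matrix(rnorm(200 * 10), 200, 10)
y <- X[, 1] - 2 * X[, 2] + rnorm(200)   # only 2 of 10 predictors matter

# alpha = 0.5 mixes the ridge (L2) and lasso (L1) penalties equally
cv.enet <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)
coef(cv.enet, s = "lambda.min")
```

Like the lasso, the elastic net can zero out irrelevant coefficients, while the ridge component keeps correlated predictors grouped rather than arbitrarily picking one of them.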

6.5 Model Diagnosis
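
Standard diagnostics for a fitted lm model include the built-in residual plots and variance inflation factors; a minimal sketch on a toy fit (using mtcars for illustration; the car package is loaded above):

```r
library(car)

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# variance inflation factors: values well above 5-10 flag collinearity
vif(fit)
```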

6.6 Summary

rbind(
    pf.lm,
    pf.lm.sb,
    pf.lm.sf,
    pf.lasso,
    pf.ridge
)
              model train.mse   test.mse
1                lm 229599179 4206268380
2 Stepwise-Backward   1200659    1243213
3  Stepwise-Forward   1200659    1243213
4             lasso   3193370    4703154
5             ridge 391270522    2773539
rbind(
    pf.linearRegression,
    pf.Stepwise.Backward,
    pf.Stepwise.Forward,
    pf.lasso2,
    pf.ridge2
)
                          RMSE   Rsquared       MAE
pf.linearRegression  64855.751 0.01863255 2350.7410
pf.Stepwise.Backward  1114.994 0.93432609  486.5987
pf.Stepwise.Forward   1114.994 0.93432609  486.5987
pf.lasso2             2168.676 0.81577988  664.7580
pf.ridge2             1665.395 0.88162500  587.0151

6.7 Bad Example (Random Numbers)

x <- rnorm(50, 6, 3)
y <- 10 + 2*x - 3*x^3 - 6*x^5 + 0.6*x^7 + rnorm(50, 10, 8)

plot(x,y)

lm(y ~ x) %>%
    summary()

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-9551661 -7168900 -2518269  3203266 51954823 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -14210959    3541874  -4.012  0.00021 ***
x             2911495     486816   5.981 2.68e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11180000 on 48 degrees of freedom
Multiple R-squared:  0.427, Adjusted R-squared:  0.4151 
F-statistic: 35.77 on 1 and 48 DF,  p-value: 2.683e-07
lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7)) %>%
    summary()

Call:
lm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + 
    I(x^7))

Residuals:
     Min       1Q   Median       3Q      Max 
-14.0609  -5.1669  -0.8895   4.8241  16.1121 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  5.714e+01  2.726e+01    2.096   0.0421 *  
x           -7.012e+01  5.435e+01   -1.290   0.2040    
I(x^2)       4.399e+01  3.591e+01    1.225   0.2274    
I(x^3)      -1.544e+01  1.123e+01   -1.376   0.1762    
I(x^4)       1.855e+00  1.873e+00    0.990   0.3278    
I(x^5)      -6.150e+00  1.712e-01  -35.933   <2e-16 ***
I(x^6)       6.295e-03  8.064e-03    0.781   0.4394    
I(x^7)       5.999e-01  1.529e-04 3922.666   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.238 on 42 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.854e+13 on 7 and 42 DF,  p-value: < 2.2e-16
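
The degree-7 fit above can be written more compactly with poly(); a sketch reproducing the idea (the seed is illustrative, the coefficients match the example above):

```r
set.seed(42)
x <- rnorm(50, 6, 3)
y <- 10 + 2*x - 3*x^3 - 6*x^5 + 0.6*x^7 + rnorm(50, 10, 8)

# raw = TRUE keeps the plain powers x, x^2, ..., x^7
fit_poly <- lm(y ~ poly(x, 7, raw = TRUE))
summary(fit_poly)$r.squared
```

The contrast with the straight-line fit is the point of this section: when the true relationship is a high-degree polynomial, a simple linear model has enormous residuals, while including the correct polynomial terms drives \(R^2\) to essentially 1.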