8 Generalized Linear Regression

glm(formula, family = ?)
NOTE THAT GLM here stands for generalized linear model (generalized linear regression), which is different from the general linear model.
Wikipedia says: Not to be confused with multiple linear regression, the general linear model, or general linear methods.
A generalized linear model (GLM) extends linear regression to response variables whose distribution belongs to the exponential family. A GLM has three parts:
- A linear predictor \(X\beta\) built from the explanatory variables (which may be categorical, counts, etc.)
- A distribution for the response
- A link function (here, \(g\)) that connects the linear predictor to the mean of the distribution
\[\mathbb{E}[Y] = \mu = g^{-1}(X\beta)\]
Examples: logistic regression, probit regression, Poisson regression
For logistic regression, the link is the logit: \(X\beta = \log{\frac{\mu}{1-\mu}}\)
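In R, a family object carries both the link function and its inverse, which makes the relationship easy to check directly:

```r
# R's family objects store the link function and its inverse.
fam <- binomial("logit")
mu  <- 0.75
eta <- fam$linkfun(mu)   # log(0.75 / 0.25) = log(3)
fam$linkinv(eta)         # recovers 0.75
```

The same pattern works for other families, e.g. `poisson()$linkfun` for the log link.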
GLMs are fitted by maximum likelihood, and model fit is summarized by the deviance: a smaller deviance means a better model. It is defined as:
\[D = -2\log{\frac{\text{Likelihood of Current Model}}{\text{Likelihood of Saturated Model}}}\]
For large samples, Wilks' theorem gives the approximate distribution of the difference in deviance between two nested models:
\[D_{reduced} - D_{full} \sim \chi^2_{df}, \quad df=k_{full} - k_{reduced}\]
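This test can be computed by hand with `pchisq`. The numbers below are illustrative; they match the null and residual deviance of the hypertension fit later in this section, where 3 parameters separate the two models:

```r
# Deviance difference between a reduced (null) and a full model,
# referred to chi-squared with df = difference in parameter counts.
D.reduced <- 14.1259   # null deviance (illustrative; see the fit below)
D.full    <- 1.6184    # residual deviance of the full model
pchisq(D.reduced - D.full, df = 3, lower.tail = FALSE)   # p-value
```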
8.1 Logistic Regression Revisited
Use family = binomial for logistic regression.
no.yes <- c("No", "Yes")
smoking <- gl(2, 1, 8, no.yes)  # gl(n, k, length, labels): n levels repeated in blocks of k
obesity <- gl(2, 2, 8, no.yes)
snoring <- gl(2, 4, 8, no.yes)
n.tot <- c(60,17,8,2,187,85,51,23)
n.hyp <- c(5,2,1,0,35,13,15,8)
data.frame(smoking, obesity, snoring, n.tot, n.hyp)
  smoking obesity snoring n.tot n.hyp
1 No No No 60 5
2 Yes No No 17 2
3 No Yes No 8 1
4 Yes Yes No 2 0
5 No No Yes 187 35
6 Yes No Yes 85 13
7 No Yes Yes 51 15
8 Yes Yes Yes 23 8
hyp.tbl <- cbind(n.hyp, n.tot-n.hyp)
glm.hyp <- glm(hyp.tbl~smoking+obesity+snoring, family = binomial("logit"))
summary(glm.hyp)
Call:
glm(formula = hyp.tbl ~ smoking + obesity + snoring, family = binomial("logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766 0.38018 -6.254 4e-10 ***
smokingYes -0.06777 0.27812 -0.244 0.8075
obesityYes 0.69531 0.28509 2.439 0.0147 *
snoringYes 0.87194 0.39757 2.193 0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 14.1259 on 7 degrees of freedom
Residual deviance: 1.6184 on 4 degrees of freedom
AIC: 34.537
Number of Fisher Scoring iterations: 4
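The coefficients above are on the log-odds scale, so exponentiating them gives odds ratios. A short sketch (the data are rebuilt from above so the snippet runs on its own):

```r
# Refit the hypertension model, then express coefficients as odds ratios.
no.yes  <- c("No", "Yes")
smoking <- gl(2, 1, 8, no.yes)
obesity <- gl(2, 2, 8, no.yes)
snoring <- gl(2, 4, 8, no.yes)
n.tot   <- c(60, 17, 8, 2, 187, 85, 51, 23)
n.hyp   <- c(5, 2, 1, 0, 35, 13, 15, 8)
hyp.tbl <- cbind(n.hyp, n.tot - n.hyp)
glm.hyp <- glm(hyp.tbl ~ smoking + obesity + snoring,
               family = binomial("logit"))
exp(coef(glm.hyp))   # obesityYes and snoringYes roughly double the odds
```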
anova(glm.hyp, test = "Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: hyp.tbl
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 7 14.1259
smoking 1 0.0022 6 14.1237 0.962724
obesity 1 6.8274 5 7.2963 0.008977 **
snoring 1 5.6779 4 1.6184 0.017179 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
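Because the data here are grouped binomial counts, the residual deviance itself can be read as a goodness-of-fit statistic against the saturated model: a large p-value means no evidence of lack of fit. Using the numbers printed above:

```r
# Residual deviance 1.6184 on 4 df, referred to chi-squared(4).
pchisq(1.6184, df = 4, lower.tail = FALSE)   # large p-value: no lack of fit
```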
8.2 Poisson Regression
The response in Poisson regression is count data, for example:
- Aggregated counts of events over time
- Individual-level data with event indicators
The log link connects the mean count \(\lambda\) to the linear predictor:
\[ \log{\lambda} = X \beta\]
library(ISwR)
data(eba1977)
str(eba1977)
'data.frame': 24 obs. of 4 variables:
$ city : Factor w/ 4 levels "Fredericia","Horsens",..: 1 2 3 4 1 2 3 4 1 2 ...
$ age : Factor w/ 6 levels "40-54","55-59",..: 1 1 1 1 2 2 2 2 3 3 ...
$ pop : int 3059 2879 3142 2520 800 1083 1050 878 710 923 ...
$ cases: int 11 13 4 5 11 6 8 7 11 15 ...
fit.ps <- glm(cases~city+age, data = eba1977, family = poisson)
summary(fit.ps)
Call:
glm(formula = cases ~ city + age, family = poisson, data = eba1977)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.24374 0.20363 11.019 <2e-16 ***
cityHorsens -0.09844 0.18129 -0.543 0.587
cityKolding -0.22706 0.18770 -1.210 0.226
cityVejle -0.22706 0.18770 -1.210 0.226
age55-59 -0.03077 0.24810 -0.124 0.901
age60-64 0.26469 0.23143 1.144 0.253
age65-69 0.31015 0.22918 1.353 0.176
age70-74 0.19237 0.23517 0.818 0.413
age75+ -0.06252 0.25012 -0.250 0.803
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 27.704 on 23 degrees of freedom
Residual deviance: 20.673 on 15 degrees of freedom
AIC: 135.06
Number of Fisher Scoring iterations: 5
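Since eba1977 also records the population size pop, counts like these are usually modelled as rates by adding an offset(log(pop)) term, so that \(\log{\lambda} = \log(\text{pop}) + X\beta\). A minimal sketch on made-up data (the pop, cases, and expo values below are invented purely for illustration):

```r
# Rate model via an offset: counts adjusted for population at risk.
# Invented data: group "b" has exactly twice the event rate of group "a".
pop   <- c(1000, 2000, 1500, 3000)
cases <- c(10, 20, 30, 60)              # rates: 0.01, 0.01, 0.02, 0.02
expo  <- factor(c("a", "a", "b", "b"))
fit   <- glm(cases ~ expo + offset(log(pop)), family = poisson)
exp(coef(fit))   # (Intercept): baseline rate 0.01; expob: rate ratio 2
```

The same idea applied to the lung-cancer data would be glm(cases ~ city + age + offset(log(pop)), data = eba1977, family = poisson).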