
Statistics GR 5205 004 / GU 4205 005

Columbia University

Binary Response Variable

In many regression applications, the response Y has only two possible qualitative outcomes:

– Financial status of a firm: sound status/headed toward insolvency
– Coronary heart disease status: has the disease/does not have the disease
– In a study of labor force participation of married women: married woman in labor force/married woman not in labor force

The constraint that the mean responses belong to [0, 1] rules out a linear response function.
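A minimal R sketch of this point, using the heart disease data introduced on the next slides (the check is illustrative, not part of the lecture code): an ordinary least-squares fit to a 0/1 response has no mechanism to keep its fitted values inside [0, 1].

data(wcgs, package="faraway")
wcgs$y <- ifelse(wcgs$chd == "no", 0, 1)      # code the response as 0/1
linfit <- lm(y ~ height + cigs, data=wcgs)    # linear response function
range(fitted(linfit))                         # nothing forces this range into [0, 1]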


Heart Disease Data

Response: “chd”: indicates whether the person has heart disease or not

The men vary in height (in inches) and the number of cigarettes (cigs) smoked per day.

> data(wcgs, package="faraway")

> summary(wcgs[,c("chd","height","cigs")])

  chd          height           cigs
 no :2897   Min.   :60.00   Min.   : 0.0
 yes: 257   1st Qu.:68.00   1st Qu.: 0.0
            Median :70.00   Median : 0.0
            Mean   :69.78   Mean   :11.6
            3rd Qu.:72.00   3rd Qu.:20.0
            Max.   :78.00   Max.   :99.0


Heart Disease Data

> plot(height ~ chd, wcgs)

> wcgs$y <- ifelse(wcgs$chd == "no",0,1)

> plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",

+ ylab="Heart Disease", pch=".")

Figure: Plots of the presence/absence of heart disease according to height.


Heart Disease Data

Predict heart disease and explain the relationship between height, cigarette usage, and heart disease.

For the same height and cigs, both outcomes occur. So we model the probability of getting heart disease, P(Y = 1 | X), rather than Y itself.

Figure: Interleaved histograms of the distribution of heights and cigarette usage for men with and without heart disease.
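One way to produce interleaved histograms like those in the caption (a sketch using ggplot2; the lecture's own plotting code is not shown here):

library(ggplot2)
data(wcgs, package="faraway")
ggplot(wcgs, aes(x=height, fill=chd)) +
  geom_histogram(position="dodge", binwidth=1)   # heights, by disease status
ggplot(wcgs, aes(x=cigs, fill=chd)) +
  geom_histogram(position="dodge", binwidth=5)   # cigarette use, by disease status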


Logistic Regression

Logistic regression defines the probability mass function

P(Y = 1 | X) = exp(βX) / (1 + exp(βX)),

which implies that

P(Y = 0 | X) = 1 − P(Y = 1 | X) = 1 / (1 + exp(βX)),

where X is a (p + 1)-dimensional vector with X0 ≡ 1, and β0 is the intercept.


Logistic Regression

Figure: P(Y = 1 | X) and P(Y = 0 | X), plotted as functions of βX.


Logistic Regression

The logit function

logit(x) = log(x / (1 − x))

maps the unit interval (0, 1) to the entire real line (−∞, ∞).

The inverse logit function, or expit function,

expit(x) = logit⁻¹(x) = exp(x) / (1 + exp(x)),

maps the real line to the unit interval.

In logistic regression, the inverse logit function is used to map the linear predictor βX to a probability of Y = 1:

P(Y = 1 | X) = logit⁻¹(βX)
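A small numeric check of the logit/expit pair (hand-coded here; the faraway package used elsewhere in these notes provides equivalent logit() and ilogit() functions):

logit <- function(p) log(p / (1 - p))        # (0, 1) -> real line
expit <- function(x) exp(x) / (1 + exp(x))   # real line -> (0, 1)
p <- c(0.1, 0.5, 0.9)
expit(logit(p))                              # recovers 0.1, 0.5, 0.9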


Logistic Regression

Geometric interpretation: a logistic regression fit based on two predictors can be represented by an S-shaped surface in 3D space, as sketched below.
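A rough way to draw such a surface in R, using the heart disease fit from the following slides (the grid ranges are chosen arbitrarily for illustration):

library(faraway)                                   # for the data and ilogit()
data(wcgs, package="faraway")
lmod <- glm(chd ~ height + cigs, family=binomial, wcgs)
hgrid <- seq(60, 78, length.out=40)                # heights
cgrid <- seq(0, 60, length.out=40)                 # cigarettes per day
prob  <- outer(hgrid, cgrid,
               function(h, c) ilogit(coef(lmod)[1] + coef(lmod)[2]*h + coef(lmod)[3]*c))
persp(hgrid, cgrid, prob, theta=40, phi=25,
      xlab="height", ylab="cigs", zlab="P(chd)")   # S-shaped probability surface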


Logistic Regression

The linear predictor in logistic regression is the conditional log odds:

log[ P(Y = 1 | X) / P(Y = 0 | X) ] = βX = β0 + β1X1 + · · · + βpXp

Interpretation: a one-unit increase in Xj results in a change of βj in the (conditional) log odds.

Equivalently, a one-unit increase in Xj results in a multiplicative change of exp(βj) in the conditional odds.

exp(βj) is also called the odds ratio, as it is the ratio of the two odds corresponding to two scenarios where the values of Xj differ by one unit.


Heart Disease Example

> lmod <- glm(chd ~ height + cigs, family = binomial, wcgs)

> summary(lmod)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0041  -0.4425  -0.3630  -0.3499   2.4357

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.50161    1.84186  -2.444   0.0145 *
height       0.02521    0.02633   0.957   0.3383
cigs         0.02313    0.00404   5.724 1.04e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1781.2 on 3153 degrees of freedom
Residual deviance: 1749.0 on 3151 degrees of freedom
AIC: 1755

Number of Fisher Scoring iterations: 5
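The odds-ratio interpretation from the earlier slide can be read off this fit directly; a short follow-up:

exp(coef(lmod))                               # odds ratios for each predictor
# For cigs, exp(0.02313) is about 1.023: each additional cigarette per day is
# associated with roughly a 2.3% increase in the odds of heart disease,
# holding height fixed.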

Heart Disease Example

library(faraway)   # provides ilogit()
(beta <- coef(lmod))
plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",
     ylab="Heart Disease", pch=".")
curve(ilogit(beta[1] + beta[2]*x + beta[3]*0), add=TRUE)          # nonsmoker
curve(ilogit(beta[1] + beta[2]*x + beta[3]*20), add=TRUE, lty=2)  # pack-a-day smoker
plot(jitter(y,0.1) ~ jitter(cigs), wcgs, xlab="Cigarette Use",
     ylab="Heart Disease", pch=".")
curve(ilogit(beta[1] + beta[2]*60 + beta[3]*x), add=TRUE)         # 60-inch man
curve(ilogit(beta[1] + beta[2]*78 + beta[3]*x), add=TRUE, lty=2)  # 78-inch man
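Fitted probabilities for specific predictor values can also be obtained with predict(); the values below are illustrative, not from the slides:

predict(lmod, newdata=data.frame(height=70, cigs=20),
        type="response")                      # estimated P(chd) for a 70 in., pack-a-day man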


Heart Disease Example

Figure: Predicted probability of heart disease. Left: the solid line represents a nonsmoker, the dashed line a pack-a-day smoker. Right: the solid line represents a very short man (60 in.), the dashed line a very tall man (78 in.).


Latent variable model for logistic regression

It may make sense to view the binary outcome Y as a dichotomization of a latent continuous outcome Yc:

Y = I(Yc ≥ 0)

Suppose Yc | X follows a logistic distribution with CDF

F(yc | X) = exp(yc − βX) / (1 + exp(yc − βX))

In this case, Y | X follows the logistic regression model:

P(Y = 1 | X) = P(Yc ≥ 0 | X) = 1 − F(0 | X) = 1 − exp(−βX) / (1 + exp(−βX)) = exp(βX) / (1 + exp(βX))
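A small simulation illustrating this view (a sketch with arbitrary coefficient values): dichotomizing a latent outcome with logistic noise reproduces a logistic regression for the observed 0/1 response.

set.seed(1)
n  <- 5000
x  <- rnorm(n)
yc <- -1 + 2*x + rlogis(n)            # latent outcome with standard logistic noise
y  <- as.integer(yc >= 0)             # observed Y = I(Yc >= 0)
coef(glm(y ~ x, family=binomial))     # estimates should be near (-1, 2)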

Mean and variance relationship for logistic regression

Since Y | X follows Bernoulli(logit⁻¹(βX)), the mean is

E[Y | X] = P(Y = 1 | X) = exp(βX) / (1 + exp(βX))

and the variance is

Var[Y | X] = P(Y = 1 | X) · P(Y = 0 | X) = exp(βX) / (1 + exp(βX))²

Since the variance depends on X, logistic regression models are always heteroscedastic (unequal error variances).
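A quick look at this with the earlier fit (fitted() returns the estimated probabilities for a binomial GLM):

phat <- fitted(lmod)           # estimated P(Y = 1 | X) for each man
summary(phat * (1 - phat))     # fitted variances vary across observations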


Estimation in logistic regression

Assuming independent observations (x1, y1), . . . , (xn, yn), the log-likelihood for logistic regression is

L(β | Y, X) = log ∏i P(Y = yi | xi) = Σi [ yi·(βxi) − log(1 + exp(βxi)) ]
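A sketch of this estimation step done by hand: maximize the log-likelihood above numerically and compare with glm() (assumes the wcgs data and the 0/1 response y created earlier):

negll <- function(beta, X, y) {
  eta <- drop(X %*% beta)             # linear predictor for each observation
  -sum(y*eta - log(1 + exp(eta)))     # negative log-likelihood
}
X   <- model.matrix(~ height + cigs, wcgs)   # includes the intercept column
fit <- optim(rep(0, ncol(X)), negll, X=X, y=wcgs$y, method="BFGS")
cbind(optim=fit$par,
      glm=coef(glm(y ~ height + cigs, family=binomial, wcgs)))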

