Download dataset mort.RData. There are two dataframes mort.m
and mort.f
containing mortality data from the Czech Republic in the year 2013 for males and females. Each dataframe has three columns: i
is an age group (going from the age i years to i+1 years), Di
is the number of deaths in the age group in 2013, Yi
is the number of people alive when they reached the age i.
Your task is to calculate and plot Nelson-Aalen and Kaplan-Meier estimates for males and females. Do you feel comfortable to use the data in this way? Do we need any additional assumptions to make the results interpretable?
Applies when the data are grouped
Our goal is still to estimate the survival function, hazard, and density function, but this is complicated by the fact that we don’t know exactly when during each time interval an event occurs.
Notation:
We could apply the K-M formula directly. However, this approach is unsatisfactory for grouped data … it treats the problem as though it were in discrete time, with events happening only at 1 yr, 2 yr, etc. In fact, what we are trying to calculate here is the conditional probability of dying within the interval, given survival to the beginning of it.
We now make the assumption that the censoring process is such that the censored survival times occur uniformly throughout the j-th interval. So, we can assume that censorings occur on average halfway through the interval: r′j=rj−cj/2. I.e., average number of individuals who are at risk during this interval. The assumption yields the Actuarial Estimator. It is appropriate if censorings occur uniformly throughout the interval. This assumption is sometimes known as the actuarial assumption.
The event that an individual survives beyond [tj−1,tj) is equivalent to the event that an individual survives beyond [t0,t1) and … and that an individual survives beyond [tj−2,tj−1) and that an individual survives beyond [tj−1,tj). Let us define the following quantities:
Then by chain rule, Pj=p1…pj. The conditional probability of an death in [tj−1,tj) given that the individual survives beyond [tj−2,tj−1) is estimated by dj/r′j. Thus, pj is estimated by r′j−djr′j. So, the actuarial (lifetime) estimate of the survival function is ˆS(t)=k−1∏j=1r′j−djr′jfor tk−1≤t<tk.
Remarks:
R
is that ˆS(t0)=1.The form of the estimated survivor function obtained using this method is sensitive to the choice of the intervals used in its construction. On the other hand, the lifetable estimate is particularly well suited to situations in which the actual death times are unknown, and only available information is the number of deaths and the number of censored observations which occur in a series of consecutive time intervals.
R
with KMsurv
package requires three elements:
Incidentally, KM in KMsurv
stands for Klein and Moeschberger, not Kaplan-Meier!
library(KMsurv)
The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. ‘Treated’ nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.
Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.
Variables include:
los
- Length of stay of a resident (in days)age
- Age of a residentrx
- Nursing home assignment (1:bonuses, 0:no bonuses)gender
- Gender (1:male, 0:female)married
- (1:married, 0:not married)health
- Health status (2:second best, 5:worst)censor
- Censoring indicator (1:censored, 0:discharged)Suppose we wish to use the actuarial method, but the data do not come grouped.
Consider the treated nursing home patients, with length of stay (los
) grouped into 100 day intervals:
nurshome = read.table("http://www.karlin.mff.cuni.cz/~vavraj/cda/data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurshome$int = floor(nurshome$los/100)*100 # group in 100 day intervals
nurshome = nurshome[nurshome$rx==1,] # keep only treated homes
nurshome[1:3,] # check out a few observations
## los age rx gender married health censor int
## 1 37 86 1 0 0 2 0 0
## 2 61 77 1 0 0 4 0 0
## 3 1084 75 1 0 0 3 1 1000
tab = table(nurshome$int, nurshome$censor)
intEndpts = (0:11)*100
ntotal = dim(nurshome)[1] # nr patients
cens = tab[,2] # nr censored in each interval
released = tab[,1] # nr released in each interval
fitlt = lifetab(tis = intEndpts, ninit=ntotal, nlost=cens, nevent=released)
names(fitlt) # check out components of fitlt object
## [1] "nsubs" "nlost" "nrisk" "nevent" "surv" "pdf" "hazard" "se.surv" "se.pdf" "se.hazard"
round(fitlt, 5) # restrict output to 5 decimals
## nsubs nlost nrisk nevent surv pdf hazard se.surv se.pdf se.hazard
## 0-100 712 0 712.0 330 1.00000 0.00463 0.00603 0.00000 0.00019 0.00032
## 100-200 382 0 382.0 86 0.53652 0.00121 0.00254 0.01869 0.00012 0.00027
## 200-300 296 0 296.0 65 0.41573 0.00091 0.00247 0.01847 0.00011 0.00030
## 300-400 231 0 231.0 38 0.32444 0.00053 0.00179 0.01755 0.00008 0.00029
## 400-500 193 1 192.5 32 0.27107 0.00045 0.00181 0.01666 0.00008 0.00032
## 500-600 160 0 160.0 13 0.22601 0.00018 0.00085 0.01568 0.00005 0.00023
## 600-700 147 0 147.0 13 0.20764 0.00018 0.00093 0.01521 0.00005 0.00026
## 700-800 134 30 119.0 10 0.18928 0.00016 0.00088 0.01469 0.00005 0.00028
## 800-900 94 29 79.5 4 0.17337 0.00009 0.00052 0.01429 0.00004 0.00026
## 900-1000 61 30 46.0 4 0.16465 0.00014 0.00091 0.01422 0.00007 0.00045
## 1000-1100 27 27 13.5 0 0.15033 NA NA 0.01468 NA NA
Non-grouped approach
library(survival)
nurshome$delta <- 1-nurshome$censor
fit <- survfit(Surv(los,delta)~1, data = nurshome[nurshome$rx==1,])
Estimated Survival
x = rep(intEndpts, each = 2)[2:23]
y = rep(fitlt$surv, each = 2)
plot(fit, conf.int = F, col="red", xlab="Time Interval [days]", ylab="Survival probability")
lines(x, y, col="blue")
title(main = "Duration of stay in nursing homes", cex=.6)
legend("topright", c("Kaplan-Meier", "Lifetable estimator"), col = c("red", "blue"), lty = 1, bty = "n")
Estimated Hazard
You can use library(muhaz)
to get smoothed estimator of hazard function.
y = rep(fitlt$hazard, each = 2)
library(muhaz)
haz = muhaz(nurshome$los[nurshome$rx==1],nurshome$delta[nurshome$rx==1])
plot(haz, col = "red", xlab="Time Interval [days]", ylab="Hazard")
lines(x, y, type="l", col="blue")
title(main = "Duration of stay in nursing homes", cex=.6)
legend("topright", c("Smoothed estimator", "Lifetable estimator"), col = c("red", "blue"), lty = 1, bty = "n")
Go back to the Warm-up Task with mortality datasets mort.RData. Perform actuarial estimates of survival function and hazard function and compare it with the classical Kaplan-Meier approach from the start of the lecture.
Compare your results to the official mortality tables created by ČSÚ: here for the males, here for the females. Can you recognize the columns in their table? Are their results comparable to yours? What is the cause of that in your opinion?
Include the code, output and comments in your report.
Bonus task: Could you find suitable correction of our approach, so that results are more similar to ČSÚ output? Hint: look at the given data, forget about KM approach and just think about conditional probabilities in a simple way.