The Kaplan-Meier [KM] estimator of the survival function ˆS(t)=∏{i:ti≤t}[1−Δ¯N(ti)¯Y(ti)]=ˆS(t−)⋅[1−Δ¯N(t)¯Y(t)] is motivated by decomposition of the survival probability by chain rule. If the survival distribution is continuous the increment process Δ¯N attains only values 0 (no death at time t) or 1 (exactly one death at time t). In practice, however, time is not measured continuously or with necessary precision. Then ties begin to occur in the data and more death at the same time can be observed. If the ties come “accidentally”, for example, by rounding the true value, and its presence is rare, then [KM] works still fine. However, when the grouping of time (now rather categorical than continuous) comes from the design of the study itself, the resulting estimate could possibly result in unsatisfactory outputs.
Several methods for adjusting this fact have been proposed, we will introduce the Lifetable or Actuarial estimator.
Applies when the event times are grouped
Our goal is still to estimate the survival function, hazard, and density function, but this is complicated by the fact that we don’t know exactly when during each time interval an event occurs.
Notation:
We could apply the K-M formula directly. However, it treats the problem as though it were in discrete time, with events happening only at endpoints of intervals. But the true events event times lie somewhere within the preceding interval. In fact, what we are trying to calculate here is the conditional probability of dying within the interval, given survival to the beginning of it.
We now make the assumption that the censoring process is such that the censored survival times occur uniformly throughout the j-th interval. So, we can assume that censorings occur on average halfway through the interval: r′j=rj−cj/2. I.e., average number of individuals who are at risk during this interval. The assumption yields the Actuarial Estimator. It is appropriate if censorings occur uniformly throughout the interval. This assumption is sometimes known as the actuarial assumption.
The event that an individual survives beyond [tj−1,tj) is equivalent to the event that an individual survives beyond [t0,t1) and … and that an individual survives beyond [tj−2,tj−1) and that an individual survives beyond [tj−1,tj). Let us define the following quantities:
Then by chain rule, Pj=p1⋅⋯⋅pj. The conditional probability of an death in [tj−1,tj) given that the individual survives beyond [tj−2,tj−1) is estimated by dj/r′j. Thus, pj is estimated by r′j−djr′j. So, the actuarial (lifetime) estimate of the survival function is ˆS(t)=k−1∏j=1r′j−djr′jfor tk−1≤t<tk.
Remarks:
R
is that ˆS(t0)=1.The form of the estimated survival function obtained using this method is sensitive to the choice of the intervals used in its construction. On the other hand, the lifetable estimate is particularly well suited to situations in which the actual death times are unknown, and only available information is the number of deaths and the number of censored observations which occur in a series of consecutive time intervals.
Another way to deal with such type of data is to use Interval-censoring approaches that better fit this situation. Traditional non-parametric estimator of survival function is then Turnbull estimator (very similar to [KM]) which accounts for jumps not at a given time but during whole intervals. More can be found in Survival Analysis with Interval-Censored Data (2018) by Bogaerts, Komárek and Lesaffre.
In R
function lifetab
from KMsurv
package (KM stands for Klein and Moeschberger, not Kaplan-Meier!) requires four elements:
library(KMsurv)
lifetab(tis, #[K+1] a vector of end points of time intervals (including t_0)
ninit, #[1] the number of subjects initially entering the study
nlost, #[K] a vector of the number of individuals lost follow or withdrawn alive for whatever reason
nevent) #[K] a vector of the number of individuals who experienced the event
The output then contains:
nsubs
- the number of subject entering the intervals who have not experienced the event.nlost
- the number of individuals lost follow or withdrawn alive for whatever reason.nrisk
- the estimated number of individuals at risk of experiencing the event.nevent
- the number of individuals who experienced the event.surv
- the estimated survival function at the start of the intervals.pdf
- the estimated probability density function at the midpoint of the intervals.hazard
- the estimated hazard rate at the midpoint of the intervals.se.surv
- the estimated standard deviation of survival at the beginning of the intervals.se.pdf
- the estimated standard deviation of the prbability density function at the midpoint of the intervals.se.hazard
- the estimated standard deviation of the hazard function at the midpoint of the intervals The row.names
are the intervals.The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. ‘Treated’ nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.
Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.
Variables include:
los
- Length of stay of a resident (in days)age
- Age of a residentrx
- Nursing home assignment (1:bonuses, 0:no bonuses)gender
- Gender (1:male, 0:female)married
- (1:married, 0:not married)health
- Health status (2:second best, 5:worst)censor
- Censoring indicator (1:censored, 0:discharged) (Beware!)nurshome = read.table("http://www.karlin.mff.cuni.cz/~vavraj/cda/data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
In this dataset, length of stay (los
) can be considered continuous enough so that classical Kaplan-Meier estimator looks fairly smooth:
We wish to demonstrate the use of the actuarial method. To do so, consider the treated nursing home patients, with length of stay (los
) grouped into 100 day intervals:
nurshome$int = floor(nurshome$los/100)*100 # group in 100 day intervals
Comparison of different approaches of estimation:
We can see that for low values of time there is merely any difference since r′j and ¯Y(tj) are large values with almost no impact on ratios used for estimation. However, there seems to emerge some difference for high values of time, since the negligible differences have accumulated and the actuarial update of ratios begins to notably lower the estimating ratios.
This grouped data approach enables us to estimate the hazard functions using formulae in the section above. The resulting estimate can be found in hazard
output of lifetab
function.
We will not dive deep into the theory, but there is also a method for obtaining smoothed estimator of hazard function using library(muhaz)
. Check that these methods provide similar results:
library(muhaz)
haz = muhaz(nurshome$los, nurshome$censor==0)
plot(haz)
Based on other covariates in nurshome
dataset choose some interesting two subpopulations.
Advice: You can utilize plotting functions from previous exercise.
Download dataset mort.RData. There are three dataframes mort.m
, mort.f
and mort
containing mortality data from the Czech Republic in the year 2019 for males, females and both genders combined. Each dataframe has three columns:
i
is an age group (going from the age i years to i+1 years),Di
is the number of deaths in the i-th age group in 2019,Yi
is the number of people alive when they reached the age i (all at risk in i-th age group).Think properly about the structure of this dataset. Does it follow the same structure as supposed in previous Exercise and the one supposed above? What does belonging to a certain age group mean?
The ratio (Yi-Di)/Yi
estimates the probability of survival of another year in i-th age group. The resulting estimator is then:
Inspired by the actuarial approach we base the estimate on (Yi/2-Di)/(Yi/2)
and obtain the following estimate, which inappropriately shortens the life span: If we blindly use
lifetab
function we obtain:
Try to reproduce the plots above. Why does lifetab
function give such non-senses? Can you try to explain it?
Compare (using suitable plots) your results to the official mortality tables created by ČSÚ: here for the males, here for the females. Can you recognize the columns in their table? Are their results comparable to yours?
If interested, examine the official guidelines (in Czech) of ČSÚ and see how the estimates could be further improved.
Include the (important) code, output and comments in your report.