Disease Survival Burden Assessment

# ── Change these values to re-run for any patient or disease ─────────────────

pt_age      <- 50          # Patient age at diagnosis (integer, ≥ 50)
pt_sex      <- "M"         # "M" = Male, "F" = Female
dis_name    <- "Disease X" # Short label used in plots and tables
dis_median  <- 3           # Reported median overall survival (years)
dis_ci_low  <- 1           # Lower 95% CI bound on median (years)
dis_ci_high <- 10          # Upper 95% CI bound on median (years)

1. US SSA Period Life Table

The US Social Security Administration publishes annual Period Life Tables giving the conditional probability of dying within one year \(q(x)\) and remaining life expectancy \(e(x)\) at each exact age \(x\), separately for males and females. We use the survexp.us ratetable bundled with the survival package, which contains the SSA Period Life Table reformatted for survival analysis. Ages 50–99 are shown.

# survexp.us: SSA period life table as daily hazard rates (per survival package)
# Dimensions: age (one entry per year, stored in days), sex, calendar year
yr_use <- as.character(max(as.integer(dimnames(survexp.us)[[3]])))
n_ages <- dim(survexp.us)[1]

lt_all <- data.frame(
  age = seq_len(n_ages) - 1L,
  q_m = 1 - exp(-as.numeric(survexp.us[, "male",   yr_use]) * 365.25),
  q_f = 1 - exp(-as.numeric(survexp.us[, "female", yr_use]) * 365.25)
) |>
  mutate(
    l_m = round(100000 * c(1, cumprod(1 - head(q_m, -1)))),
    l_f = round(100000 * c(1, cumprod(1 - head(q_f, -1))))
  )

# Complete life expectancy ê(x) = Σ k_p_x + 0.5 (summed over remaining ages)
lt_ex_fn <- function(q) {
  n <- length(q)
  vapply(seq_len(n), function(i) round(sum(cumprod(1 - q[i:n])) + 0.5, 1), numeric(1L))
}

lt <- lt_all |>
  mutate(e_m = lt_ex_fn(q_m), e_f = lt_ex_fn(q_f)) |>
  filter(age >= 50, age <= 99) |>
  select(age, q_m, l_m, e_m, q_f, l_f, e_f)
kable(
  lt |>
    select(age, q_m, e_m, q_f, e_f) |>
    rename(
      Age           = age,
      `Male q(x)`   = q_m,
      `Male e(x)`   = e_m,
      `Female q(x)` = q_f,
      `Female e(x)` = e_f
    ),
  digits  = 4,
  caption = paste0("US SSA Period Life Table (", yr_use, ") — Ages 50–99")
)
US SSA Period Life Table (2020) — Ages 50–99
Age Male q(x) Male e(x) Female q(x) Female e(x)
50 0.0060 28.4 0.0035 32.4
51 0.0065 27.6 0.0038 31.5
52 0.0070 26.8 0.0041 30.6
53 0.0077 25.9 0.0045 29.7
54 0.0084 25.1 0.0049 28.8
55 0.0091 24.3 0.0053 28.0
56 0.0099 23.6 0.0057 27.1
57 0.0107 22.8 0.0062 26.3
58 0.0116 22.0 0.0067 25.4
59 0.0125 21.3 0.0073 24.6
60 0.0136 20.5 0.0079 23.8
61 0.0147 19.8 0.0086 23.0
62 0.0157 19.1 0.0092 22.2
63 0.0168 18.4 0.0098 21.4
64 0.0178 17.7 0.0105 20.6
65 0.0189 17.0 0.0111 19.8
66 0.0202 16.3 0.0119 19.0
67 0.0216 15.7 0.0129 18.2
68 0.0231 15.0 0.0140 17.5
69 0.0247 14.4 0.0153 16.7
70 0.0263 13.7 0.0167 15.9
71 0.0281 13.1 0.0183 15.2
72 0.0303 12.4 0.0200 14.5
73 0.0325 11.8 0.0217 13.8
74 0.0365 11.2 0.0245 13.1
75 0.0395 10.6 0.0269 12.4
76 0.0439 10.0 0.0299 11.7
77 0.0480 9.4 0.0330 11.1
78 0.0534 8.9 0.0371 10.4
79 0.0582 8.4 0.0412 9.8
80 0.0640 7.8 0.0459 9.2
81 0.0703 7.3 0.0511 8.6
82 0.0773 6.9 0.0571 8.0
83 0.0866 6.4 0.0642 7.5
84 0.0960 5.9 0.0724 7.0
85 0.1071 5.5 0.0815 6.5
86 0.1167 5.1 0.0900 6.0
87 0.1309 4.7 0.1019 5.6
88 0.1464 4.4 0.1150 5.1
89 0.1632 4.0 0.1294 4.7
90 0.1812 3.7 0.1452 4.4
91 0.2005 3.5 0.1623 4.0
92 0.2208 3.2 0.1807 3.7
93 0.2421 3.0 0.2004 3.4
94 0.2643 2.7 0.2212 3.2
95 0.2872 2.5 0.2431 2.9
96 0.3105 2.4 0.2658 2.7
97 0.3340 2.2 0.2892 2.5
98 0.3575 2.1 0.3131 2.3
99 0.3807 1.9 0.3372 2.1

2. Background Population Survival

From the life table we build the survival function \(S_{bg}(t)\) for a person of the specified age and sex. Starting from age \(a\), the probability of surviving at least \(t\) more years is:

\[S_{bg}(t) = \prod_{x=a}^{a+t-1}\bigl(1 - q(x)\bigr), \quad t = 0, 1, 2, \ldots\]

q_col <- if (pt_sex == "M") "q_m" else "q_f"
lt_pt <- lt |> filter(age >= pt_age)
q_x   <- lt_pt[[q_col]]

# Survival at each integer year from diagnosis
S_bg <- c(1, cumprod(1 - q_x))
t_bg <- 0L:(length(S_bg) - 1L)

# Median remaining years (first time S drops below 0.5)
bg_median <- t_bg[which(S_bg < 0.5)[1L]] - 1L

# Restricted mean survival time (RMST) over 30-year horizon
horizon <- min(30L, max(t_bg))
bg_rmst <- sum(S_bg[t_bg < horizon])   # ∫₀ᴴ S(t) dt for unit-step function

For a male aged 50, the SSA life table implies:

  • Median remaining survival: 30 years (expected death near age 80)
  • 30-year restricted mean survival (RMST): 24.7 years

3. Disease Survival Model

Given only a reported median and 95% CI, we fit a log-normal survival distribution, \(\log T \sim \mathcal{N}(\mu,\,\sigma^2)\).

The log-normal reproduces the stated median exactly. Its two parameters are derived as:

\[\mu = \ln(\text{median}), \qquad \sigma = \frac{\ln(\text{CI}_\text{high}) - \ln(\text{CI}_\text{low})}{2 \times 1.96}\]

The spread \(\sigma\) encodes how wide the survival distribution is — a large \(\sigma\) means high patient-to-patient variability in outcome.

mu_d    <- log(dis_median)
sigma_d <- (log(dis_ci_high) - log(dis_ci_low)) / (2 * 1.96)

# Theoretical mean of this log-normal
dis_mean_raw <- exp(mu_d + sigma_d^2 / 2)

# Disease RMST over the same horizon (fine-grid numerical integration)
t_fine     <- seq(0, horizon, by = 0.01)
S_dis_fine <- 1 - plnorm(t_fine, meanlog = mu_d, sdlog = sigma_d)
dis_rmst   <- sum(diff(t_fine) * S_dis_fine[-length(S_dis_fine)])
Parameter Value
\(\mu\) (location) 1.099
\(\sigma\) (shape) 0.587
Median (exact) 3 years
Theoretical mean 3.6 years
30-year RMST 3.6 years

Note. \(\sigma\) is derived from the CI on the median estimate, not from patient-level heterogeneity data. It serves as a scale parameter for the shape of the modelled survival curve.


4. Kaplan–Meier Survival Curves

We generate \(n = 10{,}000\) synthetic survival times from each model and apply the standard Kaplan–Meier estimator.

  • Background cohort: sampled via the inverse CDF of the SSA life table step function.
  • Disease cohort: sampled from the fitted log-normal distribution; times are capped at the life-table horizon.
set.seed(2024)
n_sim <- 10000L

# Background: inverse-CDF of the step survival function
bg_times <- sapply(runif(n_sim), function(u) {
  idx <- which(S_bg <= u)
  if (length(idx) == 0L) max(t_bg) else t_bg[idx[1L]]
})

# Disease: log-normal sample, capped at life-table horizon
dis_times <- pmin(
  rlnorm(n_sim, meanlog = mu_d, sdlog = sigma_d),
  max(t_bg)
)

km_bg  <- survfit(Surv(bg_times,  rep(1L, n_sim)) ~ 1)
km_dis <- survfit(Surv(dis_times, rep(1L, n_sim)) ~ 1)
tidy_km <- function(km, label) {
  data.frame(
    time  = c(0, km$time),
    surv  = c(1, km$surv),
    lower = c(1, km$lower),
    upper = c(1, km$upper),
    group = label
  )
}

bg_label <- paste0(
  "Background (", ifelse(pt_sex == "M", "male", "female"), ", age ", pt_age, ")"
)

bg_df <- tidy_km(km_bg, bg_label)

# Disease CI: parametric bounds from reported median 95% CI.
# With n = 10,000 from a continuous distribution the KM CI is negligible;
# the meaningful uncertainty is in the reported median itself.
t_dis <- c(0, km_dis$time)
dis_df <- data.frame(
  time  = t_dis,
  surv  = c(1, km_dis$surv),
  lower = 1 - plnorm(t_dis, log(dis_ci_high), sigma_d),
  upper = 1 - plnorm(t_dis, log(dis_ci_low),  sigma_d),
  group = dis_name
)

plot_df <- rbind(bg_df, dis_df)

curve_colors <- setNames(c("#2470b3", "#902000"), c(bg_label, dis_name))
curve_types  <- setNames(c("solid",   "dashed"),  c(bg_label, dis_name))

ggplot(plot_df, aes(x = time, y = surv * 100, color = group, linetype = group)) +
  geom_ribbon(aes(ymin = lower * 100, ymax = upper * 100, fill = group),
              alpha = 0.15, color = NA) +
  geom_step(linewidth = 0.85) +
  scale_color_manual(values = curve_colors) +
  scale_fill_manual(values = curve_colors) +
  scale_linetype_manual(values = curve_types) +
  geom_hline(yintercept = 50, color = "gray55", linetype = "dotted", linewidth = 0.4) +
  annotate("text", x = horizon * 0.97, y = 53, label = "50%",
           hjust = 1, size = 3.2, color = "gray50") +
  scale_x_continuous(limits = c(0, horizon), breaks = seq(0, horizon, by = 5)) +
  scale_y_continuous(limits = c(0, 101),     breaks = seq(0, 100,    by = 20)) +
  labs(
    title    = paste0(dis_name, "  —  Survival vs Background Population"),
    subtitle = paste0(
      ifelse(pt_sex == "M", "Male", "Female"), ", age at diagnosis ", pt_age,
      "   |   Disease median ", dis_median,
      " yr  (95% CI ", dis_ci_low, "–", dis_ci_high, " yr)"
    ),
    x       = "Years from diagnosis",
    y       = "Survival (%)",
    color   = NULL, linetype = NULL,
    caption = paste0(
      "Background 95% CI: Greenwood formula (KM, n = ", format(n_sim, big.mark = ","), ").  ",
      "Disease 95% CI: log-normal survival at reported median CI bounds (",
      dis_ci_low, "–", dis_ci_high, " yr)."
    )
  ) +
  theme_gray() +
  theme(legend.position = "bottom") +
  guides(fill = "none")


5. Survival Table

KM estimates at clinically relevant time points. The 95% CI uses Greenwood’s formula (log-log transformation).

t_pts <- c(1, 2, 3, 5, 10, 15, 20)
t_pts <- t_pts[t_pts <= horizon]

extract_km <- function(km, times) {
  s <- summary(km, times = times, extend = TRUE)
  data.frame(
    n_risk = s$n.risk,
    surv   = round(s$surv  * 100, 1),
    lower  = round(s$lower * 100, 1),
    upper  = round(s$upper * 100, 1)
  )
}

bg_tbl  <- extract_km(km_bg,  t_pts)
dis_tbl <- extract_km(km_dis, t_pts)

km_out <- data.frame(
  `Year`                            = t_pts,
  `Bg n at risk`                    = bg_tbl$n_risk,
  `Background survival % (95% CI)`  = paste0(bg_tbl$surv,
                                              " (", bg_tbl$lower, "–", bg_tbl$upper, ")"),
  `Dis n at risk`                   = dis_tbl$n_risk,
  `Disease survival % (95% CI)`     = paste0(dis_tbl$surv,
                                             " (", dis_tbl$lower, "–", dis_tbl$upper, ")"),
  `Difference (pp)`                 = bg_tbl$surv - dis_tbl$surv,
  check.names = FALSE
)

kable(km_out,
      caption = paste0("Kaplan–Meier survival estimates — Background vs ", dis_name),
      align   = c("r", "r", "l", "r", "l", "r"))
Kaplan–Meier survival estimates — Background vs Disease X
Year Bg n at risk Background survival % (95% CI) Dis n at risk Disease survival % (95% CI) Difference (pp)
1 10000 99.4 (99.2–99.5) 9672 96.7 (96.4–97.1) 2.7
2 9937 98.8 (98.6–99) 7517 75.2 (74.3–76) 23.6
3 9877 98.1 (97.8–98.3) 4963 49.6 (48.7–50.6) 48.5
5 9742 96.7 (96.4–97.1) 1909 19.1 (18.3–19.9) 77.6
10 9269 91.6 (91.1–92.2) 210 2.1 (1.8–2.4) 89.5
15 8629 84.6 (83.9–85.3) 37 0.4 (0.3–0.5) 84.2
20 7777 75.7 (74.9–76.6) 3 0 (0–0.1) 75.7

6. Disease Burden Assessment

bg_km_tbl     <- summary(km_bg)$table
bg_med_km     <- round(bg_km_tbl[["median"]],  1)
bg_med_lower  <- round(bg_km_tbl[["0.95LCL"]], 1)
bg_med_upper  <- round(bg_km_tbl[["0.95UCL"]], 1)

dis_km_tbl    <- summary(km_dis)$table
dis_med_km    <- round(dis_km_tbl[["median"]],  1)
dis_med_lower <- round(dis_km_tbl[["0.95LCL"]], 1)
dis_med_upper <- round(dis_km_tbl[["0.95UCL"]], 1)

yll_med  <- round(bg_med_km - dis_median, 1)
yll_rmst <- round(bg_rmst   - dis_rmst,   1)

burden_tbl <- data.frame(
  Metric = c(
    paste0("Median remaining survival — background (",
           ifelse(pt_sex == "M", "male", "female"), ", age ", pt_age, ")"),
    paste0("Median overall survival — ", dis_name),
    "Years of life lost  (median comparison)",
    paste0(horizon, "-yr RMST — background"),
    paste0(horizon, "-yr RMST — ", dis_name),
    paste0("Years of life lost  (", horizon, "-yr RMST comparison)")
  ),
  Estimate = c(
    paste0(bg_med_km,  " years  (95% CI ", bg_med_lower, "–", bg_med_upper,
           " yr;  to age ≈", pt_age + as.integer(bg_med_km), ")"),
    paste0(dis_median, " years  (95% CI ", dis_ci_low, "–", dis_ci_high, " yr)"),
    paste0(yll_med,  " years"),
    paste0(round(bg_rmst,  1), " years"),
    paste0(round(dis_rmst, 1), " years"),
    paste0(yll_rmst, " years")
  ),
  stringsAsFactors = FALSE
)

kable(burden_tbl,
      col.names = c("Metric", "Estimate"),
      caption   = "Disease burden summary")
Disease burden summary
Metric Estimate
Median remaining survival — background (male, age 50) 31 years (95% CI 30–31 yr; to age ≈81)
Median overall survival — Disease X 3 years (95% CI 1–10 yr)
Years of life lost (median comparison) 28 years
30-yr RMST — background 24.7 years
30-yr RMST — Disease X 3.6 years
Years of life lost (30-yr RMST comparison) 21.1 years

A male aged 50 without the disease has an expected median remaining survival of 31 years (95% CI 30–31 yr; to near age 81), per the US SSA Period Life Table.

With Disease X (median OS 3 yr, 95% CI 1–10 yr), the estimated years of life lost are:

  • 28 years comparing medians
  • 21.1 years comparing 30-year restricted mean survival times

Interpretation. The disease survival curve represents overall survival as typically reported from clinical trials, incorporating background mortality. The background curve is the expected survival of a matched general-population individual. The gap between the two curves is the attributable disease burden. To update this analysis, change the six parameters in the first code chunk (params) and re-knit.