## Abstract

The knowledge we have accumulated over the past few years in the field of cancer immunotherapy has prompted the research community to challenge the status quo of trial design and endpoint selection across all drug development phases. For the design of randomized phase III studies using overall survival (OS) as the primary endpoint in particular, the paradigm has shifted from the conventional approach based on a proportional hazards model to those that account for the unique survival kinetics observed in immuno-oncology trials, such as long-term survival and delayed clinical effect. These new approaches usually require complex modeling or simulations, as well as assumptions about the length of delay in clinical effect and the long-term survival rate, making the process of implementing these new designs challenging. Here, a late-stage randomized clinical trial design is proposed based on milestone survival to simplify the process of sample size determination while keeping OS as the primary endpoint. The new design also allows assessment in milestone survival and is unaffected by the uncertainty of the survival kinetics demonstrated by cancer immunotherapies. *Cancer Immunol Res; 6(3); 250–4. ©2018 AACR*.

## Introduction

Progress in the field of cancer immunotherapy had been slow for decades, but several immunotherapeutic agents, particularly immune checkpoint inhibitors, have demonstrated a significant improvement in overall survival (OS) across different tumor types, including melanoma, lung cancer, renal cell carcinoma, head and neck cancer, and urothelial cancer (1–13). Two important lessons have been learned from late-stage trials of immune checkpoint inhibitors. First, these agents demonstrate unique response and survival kinetics, including a delayed clinical effect and long-term survival in a proportion of patients (14). Second, the primary focus of future randomized clinical trials with immune checkpoint inhibitors needs to shift from improvement in OS to raising the tail or plateau of the survival curve (15).

It is now recognized that clinical trial design needs to be tailored to the unique features shown by immune checkpoint inhibitors (14, 16). Trial size may need to increase to address the long-term survival effect and take into account the reduced number of patients at risk of death. In order to capture the long-term effect of treatment in a trial based on group sequential design (i.e., a trial with predefined stopping criteria for rejecting or accepting the null hypothesis at each interim analysis; refs. 17, 18), early-randomized patients could be evaluated at the interim analysis after a predefined minimum length of follow-up by using the conventional log-rank test (19) or Cox regression model (20). Substantial effort has gone into identifying the populations in which treatment effect may be maximized and, more importantly, novel endpoints to replace OS (21, 22).

Overlapping time-to-event curves during early stages of the study (or late separation of the curves) have been observed with different endpoints, including OS or progression-free survival (PFS), in several tumor types treated with immune checkpoint inhibitors (1, 8, 9). The statistical power may improve or worsen relative to that defined in the study design, depending on the magnitude of the observed treatment effect after separation of the curves (14).

Timing of any interim analysis requires careful consideration, because early analysis could lead to a false-negative result and waste resources (14). In addition, the conventional group sequential design with an interim analysis after 50% of events have occurred (50% information fraction) may be inadequate; by design, some clinical studies with nivolumab, a programmed death-1 immune checkpoint inhibitor, collect interim data at an information fraction of more than 70% (3, 4, 5, 8).

Study designs with time-to-event endpoints typically assume an exponential distribution with a constant hazard rate during the study period, and that all individuals will eventually experience the event of interest. The exponential distribution implies that the relative treatment effect of the experimental group over the control group, quantified by the ratio of the hazard rates [i.e., hazard ratio (HR)], remains constant throughout the study. These assumptions are reasonable in chemotherapy studies, when early antitumor effects and separation of the time-to-event curves from the study start are anticipated. However, for cancer immunotherapies, there has been a paradigm shift from this conventional approach, based on a proportional hazards model, to those that account for the aforementioned unique survival kinetics.

Despite the availability of statistical methodology that accommodates nonproportional hazards or long-term survival (23, 24), the uncertainty of the survival kinetics with immuno-oncology agents continues to be a challenge. Even if these novel survival attributes are incorporated into the study design, any deviation from the assumptions used may significantly prolong the study or cause loss of statistical power (14). This article describes a simplified approach to study design that does not require intensive modeling of these unpredictable survival kinetics, while retaining OS as the primary endpoint and incorporating the tail of the survival curve as part of both design and analysis.

## Clinical Trial Design and Analysis

One of the fundamental components of clinical trial planning is sample size determination, which requires at least three parameters: (i) the relative effect size of a prespecified endpoint; (ii) the false-positive rate (FPR) or type I error rate (i.e., the probability of incorrectly rejecting the null hypothesis); and (iii) statistical power (i.e., the probability of correctly rejecting the null hypothesis). When the study has a time-to-event endpoint, accrual rate also plays a key role, because it has a direct impact on the speed of event accumulation.

Oncology studies are frequently designed with coprimary endpoints, defined herein as endpoints that require an adjustment of the FPR to ensure that the overall (or experiment-wise) rate is maintained at a desired level. Failure to account for multiplicity when more than one primary endpoint is evaluated in a study can lead to false conclusions about the efficacy or safety of a given drug. Coprimary endpoints serve multiple purposes, including a comprehensive and collective assessment of a treatment effect, as well as providing potential early access to an effective treatment for patients; any one of the coprimary endpoints meeting the prespecified FPR may be used to gain regulatory approval. Although combinations of endpoints vary, mortality is almost always included and plays the dominant role. The most common combinations are OS with overall response rate, disease control rate, or PFS.

The conventional approach of designing late-stage randomized clinical trials with coprimary endpoints was usually driven by OS; the number of events and the study duration determined based on OS dictated the operating statistical characteristics of the other endpoint(s). Thus, study conduct was driven by the time in which the prespecified number of events (e.g., deaths) was reached. In order to address the issue of unanticipated prolongation of study duration, a time-driven study design is proposed, with coprimary endpoints of OS and milestone survival (see below), in which the sample size is determined based on the milestone survival with partially allocated FPR.

### Milestone survival

Milestone survival represents the Kaplan–Meier survival probability at a predefined time point. A hypothetical example of milestone survival at 2 years is shown in Fig. 1. Milestone survival has previously been proposed as an intermediate endpoint (15). A cohort of early randomized patients, which will eventually have a longer follow-up duration, is included in the interim analysis with an intermediate endpoint of milestone survival, while the entire study continues with a primary endpoint of OS. At the interim analysis, milestone survival in this group of patients provides a first glimpse of the long-term clinical benefit. Although milestone survival possesses some advantages, such as a predictable analysis time and early access to the direct measure of long-term survival effect, it is a cross-sectional analysis that does not take into account all of the time-to-event data. Any increasing or decreasing relative treatment effect after the selected milestone could lead to a decreased or increased FPR, respectively. However, if our interest is raising the tail of the survival curve, then milestone survival needs to be incorporated into trial design.
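As a minimal sketch of the definition above, the following computes a Kaplan–Meier survival probability at a milestone time together with its Greenwood variance. The function name `km_milestone` and the toy data are illustrative assumptions, not part of the proposed design.

```python
def km_milestone(times, events, milestone):
    """Kaplan-Meier survival probability at `milestone` with Greenwood variance.

    times  : follow-up time per patient (same unit as `milestone`)
    events : 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    greenwood_sum = 0.0
    i = 0
    while i < len(data) and data[i][0] <= milestone:
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            if at_risk > deaths:
                greenwood_sum += deaths / (at_risk * (at_risk - deaths))
        at_risk -= leaving
    return surv, surv ** 2 * greenwood_sum


# Hypothetical data: three deaths before month 24, two patients followed to month 30.
surv_24, var_24 = km_milestone([5, 10, 15, 30, 30], [1, 1, 1, 0, 0], milestone=24)
```

With no censoring before the milestone, as in this toy data set, the estimate reduces to the simple proportion of patients alive at the milestone, and the Greenwood variance reduces to the binomial variance.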

### False-positive rate

FPR, or type I error rate, is the probability of incorrectly rejecting the null hypothesis. In other words, it is the chance of erroneously declaring that one treatment is superior to the other in a randomized clinical trial setting. Conventionally, this probability is chosen to be 0.05. In designing studies using two coprimary endpoints, the most conservative approach is usually taken by assuming that these endpoints are uncorrelated. In this case, the sum of the FPRs between two endpoints will be equal to the experiment-wise FPR (0.05), which may be either equally split between the endpoints (0.025 each), or 80% allocated to the more dominant endpoint (0.04 vs. 0.01; refs. 25–27).

Although milestone survival is measured at a specific time point, as a survival endpoint, it should be correlated with OS. A meta-analysis of 25 trials that included 20,013 patients with metastatic non–small cell lung cancer showed a moderate correlation of 0.80 between OS HR and 12-month milestone survival endpoints (28). For a moderate-to-high correlation (0.5–0.8), the FPR allocated to the more prominent endpoint, such as OS, falls in the range of 0.04 to 0.045 when the allocation ratio (OS to milestone survival) is 3:1 with an experiment-wise FPR of 0.05.
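One way to see how correlation enters the allocation is sketched below. It assumes that the two test statistics are jointly standard bivariate normal under the null hypothesis and that both tests are two-sided. The text does not state how its allocations were derived, so this is an illustration under those assumptions, and the names `familywise_error` and `split_alpha` are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def familywise_error(alpha_os, alpha_ms, rho, steps=400):
    """P(reject at least one null) for two two-sided z-tests whose statistics
    are standard bivariate normal with correlation rho (assumed model, |rho| < 1)."""
    c_os = STD_NORMAL.inv_cdf(1 - alpha_os / 2)
    c_ms = STD_NORMAL.inv_cdf(1 - alpha_ms / 2)
    s = sqrt(1 - rho ** 2)

    def integrand(x):
        # Density of Z_os at x times P(|Z_ms| <= c_ms given Z_os = x).
        density = exp(-0.5 * x * x) / sqrt(2 * pi)
        upper = STD_NORMAL.cdf((c_ms - rho * x) / s)
        lower = STD_NORMAL.cdf((-c_ms - rho * x) / s)
        return density * (upper - lower)

    # Simpson's rule over [-c_os, c_os] gives P(|Z_os| <= c_os, |Z_ms| <= c_ms).
    h = 2 * c_os / steps
    total = integrand(-c_os) + integrand(c_os)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(-c_os + k * h)
    return 1 - total * h / 3


def split_alpha(rho, ratio=3.0, overall=0.05):
    """Bisection for (alpha_os, alpha_ms) with alpha_os = ratio * alpha_ms
    and family-wise error equal to `overall`."""
    lo = overall / (ratio + 1)  # Bonferroni split: never exceeds `overall`
    hi = overall / ratio        # OS alone already spends `overall`
    for _ in range(60):
        mid = (lo + hi) / 2
        if familywise_error(ratio * mid, mid, rho) > overall:
            hi = mid
        else:
            lo = mid
    return ratio * lo, lo


alpha_os, alpha_ms = split_alpha(rho=0.7)
```

Under these assumptions, a higher correlation lets each endpoint keep a larger share of the experiment-wise FPR than an independence-based split would allow.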

### Sample size determination

Sample size can be estimated once the relative treatment effect based on milestone survival, FPR, and desired statistical power are defined. The milestone survival is equivalent to the proportion of patients who are alive (i.e., the number of survivors divided by the total number of patients) when there is no censoring before the milestone time point. In this case, the sample size calculation for milestone survival becomes trivial and can be simplified to that of the binomial proportion, with appropriate methodology available (29). However, it is not uncommon to have observations censored because of patients lost to follow-up. Censored observations before the milestone time point lead to loss of statistical power due to inflation of the milestone survival variance (i.e., variability of the milestone survival estimate); efforts should therefore be made to reduce the number of patients lost to follow-up. To minimize the impact of censoring, both event and censoring distributions could be incorporated into the sample size estimation. In addition, different transformations can be applied to the milestone survival and its corresponding variance (e.g., the natural log-transformed cumulative hazard function) to improve the precision of the normal approximation (30).
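For orientation, the quantities mentioned above can be written out as follows. The notation is assumed here rather than taken from the cited references: $d_j$ deaths and $n_j$ patients at risk at each distinct death time $t_j$, milestone time $\tau$, and subscripts $E$ and $C$ for the experimental and control arms. The Kaplan–Meier estimate and its Greenwood variance are

$$
\hat S(\tau) = \prod_{t_j \le \tau}\left(1 - \frac{d_j}{n_j}\right), \qquad
\widehat{\mathrm{Var}}\big[\hat S(\tau)\big] = \hat S(\tau)^2 \sum_{t_j \le \tau} \frac{d_j}{n_j\,(n_j - d_j)} .
$$

Applying the delta method to the log-transformed cumulative hazard, $g(S) = \log(-\log S)$, gives

$$
\hat\sigma^2 = \widehat{\mathrm{Var}}\big[g(\hat S(\tau))\big] \approx \frac{\widehat{\mathrm{Var}}\big[\hat S(\tau)\big]}{\big[\hat S(\tau)\,\log \hat S(\tau)\big]^2},
$$

and one possible two-arm comparison on the transformed scale is

$$
Z = \frac{g\big(\hat S_E(\tau)\big) - g\big(\hat S_C(\tau)\big)}{\sqrt{\hat\sigma_E^2 + \hat\sigma_C^2}},
$$

which is referred to the standard normal distribution. This is a sketch of one common transformation; other choices may be preferable in a given setting.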

Sample size estimation yields the total number of randomized patients required to meet statistical power and FPR criteria allocated to the milestone survival endpoint. The final milestone survival analysis is performed when all randomized patients have been followed for at least the milestone duration, and thus, the analysis time is determined when the milestone survival endpoint is defined at the study design stage. The OS analysis will be performed at the same time as that of the final milestone survival, and therefore, is also driven by time, rather than by the number of events.

## An Illustrative Example

Consider a hypothetical example of a randomized clinical trial with two coprimary endpoints: 2-year milestone survival, defined as the Kaplan–Meier survival probability at 2 years, and OS, defined as the time from randomization to death due to any cause. If we assume a moderate correlation of 0.7 between the two endpoints, the two-sided type I error rates assigned to the milestone and OS (using a 1:3 allocation ratio) would be 0.014 and 0.042, respectively, to guarantee an experiment-wise FPR of 0.05. Assuming a 2-year milestone survival of 31% in the control arm, 342 randomized patients would provide 90% power to show a statistically significant difference at a two-sided type I error rate of 0.014, if the 2-year milestone survival rate is 51% in the experimental arm (29). This calculation also assumes that all patients are followed for at least 2 years before performing the analysis, with no censored observations before the 2-year milestone. Assuming that accrual takes 10 months, the entire study duration will be 34 months.
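The arithmetic behind such a calculation can be sketched with a textbook normal-approximation formula for comparing two binomial proportions. This is an assumption: it uses 1:1 randomization and no continuity correction, and it is not necessarily the method of reference 29, so the result should be in the same range as the 342 patients quoted above without being expected to match it exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_arm(p_control, p_experimental, alpha_two_sided, power):
    """Per-arm sample size for a two-sided test of two proportions
    (normal approximation, equal allocation, no continuity correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_experimental) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))
    alt_sd = sqrt(p_control * (1 - p_control)
                  + p_experimental * (1 - p_experimental))
    delta = abs(p_experimental - p_control)
    return ceil(((z_alpha * null_sd + z_beta * alt_sd) / delta) ** 2)


# Inputs from the illustrative example: 31% vs. 51% at 2 years,
# two-sided alpha of 0.014 for the milestone endpoint, 90% power.
per_arm = n_per_arm(0.31, 0.51, alpha_two_sided=0.014, power=0.90)
total_patients = 2 * per_arm

# Time-driven analysis: accrual period plus the milestone follow-up.
accrual_months, milestone_months = 10, 24
study_duration_months = accrual_months + milestone_months
```

Because the final analysis is tied to the milestone, the study duration follows directly from the accrual period and the milestone time, whatever the observed event rate.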

In order to assess the statistical power for the OS endpoint, an assumption needs to be made about the shape of the survival curves to give a 2-year milestone survival of 31% and 51% for the control and experimental arms, respectively. Let us assume the control arm follows a mixture cure rate model (i.e., the population comprises patients at risk of death and cured patients), with exponential distribution satisfying the conditions of 2-year milestone survival of 31%, a median survival time of 12 months, and a cure rate of 20% (Fig. 1B; black curve). For the experimental arm, a mixture cure rate model is considered with a piecewise exponential distribution and a delayed clinical effect ranging from 0 to 8 months in 2-month increments. Five out of many potential survival curves crossing the targeted 2-year milestone survival of 51% are shown in Fig. 1B, with various combinations of long-term survival proportions and delay durations. This translates into post-separation HRs of 0.57, 0.52, 0.46, 0.38, and 0.27, respectively. Based on these assumptions, the estimated number of events was approximately 193 when the milestone survival analysis was completed 2 years after the last patient was randomized without censoring. When the delay in clinical effect is 8 months, the statistical power of the OS endpoint based on a log-rank test is approximately 90%, with a study-wise power of 96%. These assumptions were reasonable based on the data accumulated thus far for immune checkpoint inhibitors. For future cancer immunotherapies, different model and distribution assumptions may be required to assess the operating characteristics of the proposed method.
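The control-arm assumptions can be checked numerically with a short sketch. The first function encodes the mixture cure model described above (20% cured, exponential survival for the remainder, overall median of 12 months). The second is a simplifying assumption that is not stated in the text: after the delay, the experimental arm is taken to follow proportional hazards relative to the overall control survival curve, and the HR needed to reach 51% at 2 years is solved for. Function names are hypothetical.

```python
from math import exp, log


def control_survival(t, cure_rate=0.20, median=12.0):
    """Mixture cure model: cured fraction plus exponential survival for the rest.
    The exponential rate is chosen so that overall survival is 50% at `median`."""
    lam = -log((0.5 - cure_rate) / (1 - cure_rate)) / median
    return cure_rate + (1 - cure_rate) * exp(-lam * t)


def implied_post_delay_hr(delay, target=0.51, milestone=24.0):
    """HR after `delay` months such that experimental survival equals `target`
    at `milestone`, assuming identical curves before the delay and
    S_E(t) = S_C(delay) * (S_C(t) / S_C(delay)) ** HR afterwards."""
    s_delay = control_survival(delay)
    s_milestone = control_survival(milestone)
    return log(target / s_delay) / log(s_milestone / s_delay)


control_at_24 = control_survival(24.0)
hrs = {delay: implied_post_delay_hr(delay) for delay in (0, 2, 4, 6, 8)}
```

Under this simplified model, the control curve passes close to 31% at 24 months and the implied post-separation HRs decrease as the delay lengthens, in line with the pattern of the values quoted above. Small differences are to be expected because the published curves may have been constructed differently.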

The sample size will increase in the presence of censoring for administrative reasons or due to loss to follow-up. Administrative censoring would lead to insufficient follow-up duration for some patients. For example, if the study was designed to conduct a 2-year milestone survival analysis 18 months after the last patient was randomized, there would be a group of patients who had not reached a minimum 2-year follow-up duration. In this scenario, a total of 395 patients would be needed to achieve a power of 90%. On the other hand, if patients were lost to follow-up, the sample size would also need to be adjusted, even if all patients reached a minimum of 2 years of follow-up duration. A loss to follow-up rate of 0.002, equivalent to 5% by 2 years, would lead to a sample size of 354 patients. When the delay in clinical effect is 8 months, the statistical power of the OS endpoint based on a log-rank test is 85% under administrative censoring and 90% under loss to follow-up, with a similar study-wise power of 96%.

## Discussion

The simple approach to study design proposed here is suitable for use with cancer immunotherapies that demonstrate unique survival kinetics, as it is less affected by the unpredictable survival kinetics, retains OS as the primary endpoint, and incorporates the tail of the survival curve in the analysis. The hypothetical, randomized clinical trial used to illustrate this approach had coprimary endpoints of milestone survival and OS, and the sample size was based on the milestone endpoint with a smaller allocated FPR. The benefits of this newly proposed study design include the following: (i) the design endpoint is a simple binomial outcome; (ii) the study design incorporates clinically meaningful benefits in long-term survival; (iii) OS remains the most prominent endpoint; (iv) as a decision-making endpoint, milestone survival requires stronger evidence (compared with OS) to declare study success; (v) the design is less affected by uncertainty in the extent of both long-term survival and delayed clinical effect; (vi) both endpoints measure survival; (vii) study duration is defined a priori and is therefore predictable; and (viii) sufficient follow-up duration is built into the study.

The proposed trial design implements a 3:1 allocation ratio to minimize the impact on the FPR for the OS endpoint. Despite the use of milestone survival, OS remains the gold-standard endpoint for cancer therapy evaluation. Although milestone survival is measured at a specific time point, it should be correlated with the coprimary endpoint of OS. The FPR depends on the correlation between the two endpoints. In cases of moderate-to-high correlation (0.5–0.8) between OS and milestone survival, the FPR allocated to the more prominent endpoint (OS) falls in the range of 0.04 to 0.045 when the allocation ratio is 3:1 with an experiment-wise FPR of 0.05. Specifically, it is recommended that an FPR of at least 0.04 (80%) be allocated to the OS endpoint in a design with an experiment-wise FPR of 0.05 (e.g., a 3:1 allocation ratio yields an FPR of 0.0422 for OS with a correlation of 0.7). If the correlation is unknown, a conservative assumption of no correlation would lead to an FPR partition of 0.04 versus 0.01 between OS and milestone survival.

The unequal FPR allocation offers several advantages. First, the loss of FPR is relatively small for the OS endpoint when a moderate-to-high correlation is assumed between these two endpoints. It is a small price to pay in terms of type I error rate, but a large gain in the reduction of the false-negative rate (i.e., increase in power). Second, it preserves enough power in the scenarios of prolonged duration and delayed clinical effect. Third, stronger evidence of efficacy is required for the milestone survival endpoint to declare study success, should the OS endpoint fail to meet its threshold for success.

The hypothetical randomized clinical trial example presented here illustrates the ease of using milestone survival as the endpoint of choice in study design. Not only does it alleviate the difficulty of arbitrarily defining the proportion of long-term survivors and extent of the delayed effect, it also provides a meaningful measurement of treatment improvement in clinical practice. Moreover, the follow-up duration is defined once the milestone is determined. Preferably, all patients should be followed for a minimum of the milestone duration before the final analysis (2 years in the aforementioned example); this will mitigate concerns about the robustness of the milestone survival estimate due to administrative censoring (i.e., censoring of patients with insufficient follow-up duration) before the milestone time point.

The most challenging aspect of this design is determining the milestone. It must be short enough to facilitate drug development, but long enough to capture any long-term survival benefit. It is important to note that the milestone itself does not necessarily represent long-term survival or cure. It reflects the time point beyond which the treatment effect is unlikely to change. Multiple phase II and phase III studies have demonstrated that OS among patients treated with immune checkpoint inhibitors begins to level off at least 2 years after the first day of treatment or random assignment (1, 2, 9, 13); consequently, the milestone survival time for these agents may be the 2-year time point. The proposed study design is constructed using the binomial distribution and assumes no censoring before the milestone. However, the analysis could alternatively be based on Kaplan–Meier survival probabilities that account for censoring, and it is also possible to build both censoring and event distributions into the sample size calculation.

In conclusion, the proposed late-stage randomized clinical trial design based on milestone survival simplifies the process of sample size determination and retains OS as the dominant primary endpoint, while also allowing assessment of milestone survival. Accumulating knowledge will ultimately lead to further refinements in study design to ensure that effective treatments can be made available to patients in need.

## Disclosure of Potential Conflicts of Interest

T.-T. Chen has ownership interest in Bristol-Myers Squibb.

## Acknowledgments

The content of this manuscript was originally presented by the author at the FDA–AACR Immuno-Oncology Drug Development Workshop held in Washington, DC, in 2016.

The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked *advertisement* in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

- Received August 24, 2017.
- Revision received November 17, 2017.
- Accepted January 17, 2018.

- ©2018 American Association for Cancer Research.