previous day
all days

View: session overviewtalk overview

13:00-14:30 Session IS7: INVITED : Selective inference after variable selection
Valid post-selection inference for cox regression parameters, with and without the proportional hazards assumption.

ABSTRACT. The problem of how to best select variables for confounding adjustment forms one of the key challenges in the evaluation of exposure or treatment effects in observational studies. A major drawback of common selection procedures is that there is no finite sample size at which they are guaranteed to deliver tests and confidence intervals with adequate performance. In this talk, I will consider this problem in the context of estimating a conditional causal hazard ratio in an observational study where all key confounders are measured. An added complication with time-to-event outcomes is that one must adjust not only for confounders, but also variables that render censoring non-informative.

Assuming first that the hazard ratio of interest is constant, I will describe a simple procedure for obtaining valid inference for the conditional hazard ratio after variable selection. This procedure involves three different selection steps, in order to best capture variables that account for confounding and informative censoring, and can be implemented using standard software for the Lasso.

Our proposed estimator (along with common alternatives) has the disadvantage that it may not converge to something easily interpretable when the proportional hazards assumption fails. The resulting tests and confidence intervals also typically lose their validity under misspecification. I will therefore outline an alternative proposal based on a nonparametric estimand that reduces to a Cox model parameter under the proportional hazards assumption, but which continues to capture the association of interest under arbitrary types of misspecification. I will then outline how to obtain valid inference for this novel estimand whilst incorporating variable selection.

The different proposals will be illustrated in the analysis of an observational study on the predictive relationship between the initial concentration of serum monoclonal protein and the progression to multiple myeloma or another plasma-cell cancer.

Selective inference for the Lasso in statistical practice

ABSTRACT. Nowadays multivariable clinical models are ubiquitous, facilitating personalized medicine and guiding therapy decisions. Such models are often developed by the use of variable selection, but the uncertainty introduced by selection is often ignored in the dissemination. Part of the reason is that classical methods for statistical inference to estimate e.g. confidence intervals are not applicable after selection. One way to facilitate proper post-selection inference is by means of the selective inference framework, which addresses inference for statistical hypotheses that are formulated and analysed using the same set of data. In recent years the methodology was developed for the widely used Lasso method, i.e. L1-penalized regression, but there are also approaches agnostic of the model selection procedure.

We will present some practical considerations when working with the selective inference framework for regression problems. We will discuss a systematic simulation study in linear and logistic Lasso regression. Our focus lies on the properties of selective confidence intervals derived from different approaches, in particular selective coverage, power to exclude zero and stability. We elaborate on the practical use and interpretation of selective inference using two real-data case-studies of typical applications. First, selective inference after the selection of anthropometric features to estimate body fat in men. Second, selective inference for the main exposure after confounder selection in a cardiology dataset.

We found selective inference to be challenging in terms of interpretation and computation. Development of corresponding user-friendly software is still in its infancy. Lasso confidence intervals tended to be very wide and quite variable, but could potentially improve model selection properties, in particular false positive findings. Simple selection agnostic methods showed unsatisfactory trade-offs between selection and inference accuracy, while modern approaches were much more conservative and computationally expensive, limiting their practical usability. In conclusion, selective inference using the Lasso is a promising tool for statistical modelling, but remains difficult to use in practice.

Biologically-informed development of treatment selection scores from high-dimensional omics data

ABSTRACT. Precision medicine therapeutic approaches rely on matching mechanism of action of a therapy to biological and molecular characteristics of a patient or the patient’s disease. Therapies that can correct for aberrant or missing gene products or compensate for a disrupted biological pathway hold promise for the treatment of the corresponding disease. In oncology, many predictors based on multivariable scores generated from high dimensional omics data have been developed for purposes of prognosis; some of those have been secondarily assessed for their value in informing choice between different therapy options. Repurposing a prognostic predictor may be suboptimal because there are fundamental differences between the goals of prognostication and therapy selection. The modified covariates method of Tian and colleagues (JASA 2014;109:2350-2358) is one approach that has been proposed specifically for development of a therapy selection predictor. Biologically informed enhancements of the modified covariates approach that use information about biological pathways significantly associated with outcome or that use pre-specified variable groupings are proposed in this talk. These biologically informed approaches are found to yield treatment selection predictors with improved performance relative to that of the original modified covariates method on some real omics data from patients with cancer. The discussion additionally highlights challenges in identification of data that are suitable for treatment selection predictor development as well as issues to consider in selection of metrics to evaluate performance of these predictors.

13:00-14:30 Session OC6A: causal inference
Incident and prevalent-user designs and the definition of study time origin in pharmacoepidemiology: a systematic review

ABSTRACT. Background: Guidelines for observational comparative effectiveness and drug safety research recommend implementing an incident-user design whenever possible, since it reduces the risk of selection bias in exposure effect estimation compared to a prevalent-user design. The uptake of these guidelines has not been studied extensively. Methods: We reviewed 89 observational effectiveness and safety cohort studies published in six pharmacoepidemiological journals in 2018 and 2019. We developed an extraction tool to assess how frequently incident-user and prevalent-user designs were implemented. Specifically, we extracted information about the extent to which start of follow-up, treatment initiation and moment of meeting eligibility criteria were aligned in the study design. Results: Of the 89 studies included, 40% implemented an incident-user design for both the active arm and the comparator arm, while 13% implemented a prevalent-user design in both arms. Of the 40 studies implementing a prevalent-user design in at least one treatment arm, 4 provided a rationale for including prevalent users. The start of follow-up, treatment initiation and moment of meeting eligibility criteria were aligned in both treatment arms in 22% of studies. We provided examples of studies that minimized the risk of introducing bias due to left truncation, immortal time, or a time lag. Conclusions: Almost half of the included studies followed the recommendation to implement an incident-user design. Rationales for implementing a prevalent-user design were scarcely reported. Information on the extent to which start of follow-up, treatment initiation and moment of meeting eligibility criteria were aligned by design was often incomplete. We recommend that the choice for a particular study time origin is explicitly motivated to enable assessment of validity of the study.

A framework for meta-analysis of studies with baseline exposure through standardized survival curves

ABSTRACT. Meta-analyses of survival studies assessing the effect of a well specified exposure at baseline, ideally combine measures of the ‘causal’ effect(s) seen in different studies into one overall meaningful measure. Standard forest plots of hazard ratios however mix apples and oranges into an ill interpretable cocktail. To enhance interpretation and transportability, one must confront between study heterogeneity in more than the usual dimensions. While randomized studies expect different covariate distributions between studies, observational studies additionally encounter baseline covariate heterogeneity between exposure groups within studies. There is the issue of study duration and hence maximum follow-up time. Also the nature of possibly informative or explainable censoring must be addressed. Last but not least there is the complicating role of the non-collapsible hazard ratio typically presented in forest plots as the study-specific and overall summary of survival contrasts between exposure groups. With many features ignored, implausible and hidden underlying assumptions may critically affect interpretation of the pooled effect measure. We propose several different forms of standardized survival curves over a fitting time horizon to derive estimates of distinct target estimands most relevant in specific settings. Corresponding forest plots then show standardized risk differences at well-chosen time points. Alternatively post hoc summaries may contrasts standardized survival curves by hazard ratios or restricted mean survival time in this two-step approach. We discuss advantages and disadvantages of direct standardization besides transportability of results for meta-analyses of randomized treatments, observed treatments as well as patient-specific exposure characteristics. Our case study analyzes individual patient data of six studies to evaluate how the adjusted effect of tumor marker p16 on overall survival of squamous anal cancer patients may help guide treatment policies.

Causal inference in practice: two case studies in nephrology

ABSTRACT. For certain medical research questions, randomization is unethical or infeasible and then causal effects are estimated from observational data. If confounding and exposure status are time-dependent, this requires sophisticated methodology such as target trial emulation involving longitudinal matching methods. Here we discuss issues in two examples from nephrology aiming to quantify survival benefits of (first/second) kidney transplantation compared to remaining on dialysis and never receiving an organ, across ages and across times since waitlisting.

We analyzed data from the Austrian Dialysis and Transplant Registry comprising patients on dialysis and waitlisted for a kidney transplant with repeated updates on patient characteristics and waitlisting status.

As often with registry data, a tricky task was data management, i.e. dealing with inconsistencies and incompleteness. Data availabilities also had to be taken into consideration when deciding on the most relevant causal effect that could be identified and estimated. We adapted the approaches of Gran et al. (2010) and Schaubel et al. (2006) by emulating a series of target trials, where each trial was initiated at the time of a transplantation (relative to time of first/second waitlisting). Transplanted patients contributed to the treatment group while patients with current active waitlisting status were classified to the control group. Controls were artificially censored if they were transplanted at a later time and their transplantation then initiated a further trial of the series. Estimation of inverse probability of treatment and censoring weights (IPTWs, IPCWs) to achieve exchangeability and to account for artificial censoring required many practical decisions. We applied pooled logistic regression adjusted for time-varying patient characteristics to estimate IPTWs and trial specific Cox proportional hazards models to compute yearly updated IPCWs.

The marginal effect and the effect conditional on age and duration of waitlisting expressed on different scales (hazard ratios, survival probabilities and restricted mean survival times) were obtained from Cox models weighted by the product of IPTWs and IPCWs fitted to the stacked data set of all trials. The bootstrap approach to compute confidence intervals additionally increased computation time. We critically discuss our proposed solution and possible alternatives and show results from our nephrological case studies.

Machine learning, G-computation and small sample sizes: a simulation study

ABSTRACT. Drawing causal inferences from cohorts consisting of some hundred patients is challenging, especially when the data-generating mechanism is unknown. While machine learning approaches are increasingly used in prediction, their applications for causal inference are more recent. However, several causal inference methods involve a prediction step where such data-adaptive approaches could be helpful. We propose an approach combining machine learning and G-computation1 to estimate the causal effect of a binary exposure on a binary outcome. We evaluated and compared, through a simulation study, the performances of penalized logistic regressions, neural network, support vector machine, boosted classification, regression trees, and an ensemble method called super learner2. We proposed six different scenarios, including various sample sizes and relationships between covariates, binary exposure, and binary outcome. In a cohort of 252 patients, we have also illustrated the application of these approaches used to estimate the effect of barbiturates prescribed during the first 24h of an episode of intracranial hypertension. We reported that, used in a G-computation approach to estimate the probabilities of the individual counterfactual outcomes, the super learner tended to outperform other approaches in terms of bias and variance, especially for small sample sizes. Support vector machine also resulted in performant properties, albeit the mean bias was slightly higher compared to the super learner. We show empirically that, contrary to a preconception, using a machine learning approach to draw causal inferences can be valid even for a sample constituted by one hundred subjects, as in the majority of medical studies. The G-computation with the super learner is available in the R package RISCA. References 1. Hernán M, Robins JM. Causal Inference: What if? Boca Raton: Chapman & Hall/CRC; 2020. 2. Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. European Journal of Epidemiology. 2018 May;33(5):459–464.

Valid Uncertainty Interval for the Average Causal Effect in a High-dimensional Settingv

ABSTRACT. During the last years, a great extent of work has been done on constructing confidence intervals for average causal effect parameters that are uniformly valid over a set of data generating processes even when high-dimensional nuisance models are estimated by post-model-selection or machine learning estimators. These developments assume that all the confounders are observed to ensure point identification. We contribute by showing that valid inference can be obtained in the presence of unobserved confounders and high-dimensional nuisance models. We thus propose uncertainty intervals, which allow for nonzero confounding bias. The later bias is specified and estimated and is function of the amount of unobserved confounding allowed for. We show that valid inference can ignore the finite sample bias and randomness in the estimated value of confounding bias by assuming that the amount of unobserved confounding is small relative to the sample size; the latter is formalized in terms of convergence rates. An interpretation is that more confounders are collected as the sample size grows. Simulation results are presented to illustrate finite sample properties, and the application of the results is illustrated with a study, where we fit the effect of regular food intake on health.

13:00-14:30 Session OC6B: Adaptive clinical trial design
Adaptive designs for three-arm gold-standard non-inferiority trials

ABSTRACT. A common criticism of non-inferiority trials comparing an experimental treatment to an active control is that they may lack assay sensitivity. This denotes the ability of a trial to distinguish an effective treatment from an ineffective one. The 'gold-standard' non-inferiority trial design circumvents this concern by comparing three groups in a hierarchical testing procedure. First, the experimental treatment is compared to a placebo group in an effort to show superiority. Only if this succeeds, the experimental treatment is tested for non-inferiority against an active control group. Ethical and practical considerations require sample sizes of clinical trials to be as large as necessary, but as small as possible. These considerations come especially pressing in the gold-standard design, as patients are exposed to placebo doses while the control treatment is already known to be effective. Group sequential trial designs are known to reduce the expected sample size under the alternative hypothesis. In their pioneer work, Schlömer and Brannath (2013) show that the gold-standard design is no exception to this rule. They calculate approximately optimal rejection boundaries for the group-sequential gold-standard design using sample size allocation ratios of the optimal single-stage design. We extend their work by relaxing the constraints put on the group allocation ratios and allowing for futility stops at interim. Using this new design, we will present methods to calculate power and type I error conditional on the observations at interim. These extensions tackle a pressing issue in the traditional gold-standard design: Disappointing results at interim might make a positive result in the final analysis very unlikely. In such scenarios, stopping a trial for futility or reassessing the required sample size whilst controlling the conditional type I error rate can minimize the risk of further patients being unnecessarily exposed to ineffective treatment. Besides the extended design options, we analyze different choices of optimality criteria. The above considerations suggest that the null hypothesis also plays a vital role in the assessment of the gold-standard design. Therefore, optimality criteria that incorporate the design performance under the alternative and the null hypothesis are introduced.

A comparison of estimation methods adjusting for selection bias in adaptive enrichment designs with time-to-event endpoints

ABSTRACT. Adaptive enrichment designs in clinical trials have been developed to enhance drug developments. They permit, at interim analyses during the trial, to select the sub-population that benefits the most from the treatment. This results in improvements both in terms of resources and ethics, by reducing the number of patients receiving non effective treatments. Because of this selection, the naive maximum-likelihood estimation of the treatment effect, commonly used in classical randomized controlled trials, is biased. In the literature, several methods have been proposed to obtain a better estimation of the treatments’ effects in such contexts. To date, most of the works have focused on normally endpoints, and some estimators have been proposed for time-to-event endpoints but they have not all been compared side-byside. In this work, we conduct an extensive simulation study, inspired by a real case-study in Heart Failure, to compare the maximum-likelihood estimator (MLE) with an unbiased estimator, shrinkage estimators and bias-adjusted estimators for the estimation of the treatment effect with time-to-event data. The performances of the estimators are evaluated in terms of bias, variance and mean squared error. Based on the results, we recommend using the unbiased estimator and the single-iteration bias-adjusted estimator: the former completely eradicates the bias, but is highly variable with respect to a naive estimation; the latter is less biased than the MLE estimator and only slightly more variable.

A stochastically curtailed two-arm randomised phase II trial design for binary outcomes

ABSTRACT. Randomised controlled trials are considered the best way of assessing a new treatment. However, in the early phases of assessing cancer treatments, where treatment outcome is binary, trials are often single-arm. Consequently, the results from the new trial then have to be compared to results from another source. This can make it more difficult to estimate how well the new treatment works, as the two sets of results may come from groups that are different in various ways.

Though a number of reasons exist for choosing a single-arm trial, the primary reason is that single-arm designs require fewer participants than their randomised equivalents. Therefore, the development of randomised trial designs that reduce sample size is of value to the trials community.

This talk introduces a randomised two-arm binary outcome trial design that greatly reduces expected sample size compared to typical randomised trial designs. Sample size is reduced in two ways: firstly, the design uses stochastic curtailment, which means allowing the possibility of stopping a trial before the final conclusions are known with certainty. Secondly, the proposed design uses randomised blocks, which allows investigators to control the number of interim analyses, the points in the trial at which early stopping may occur. This proposed design is compared with existing designs that also use early stopping, in terms of maximum and expected sample size, through the use of a loss function. Comparisons are also made using an example from a real trial. The comparisons show that in many cases, the proposed design is superior to existing designs. Further, the proposed design may be more practical than existing designs, by allowing a flexible number of interim analyses. One existing design produces superior design realisations when the anticipated response rate is low. However, when using this existing design, the probability of incorrectly concluding that a treatment works can be higher than expected if the response rates are different to what was anticipated. Therefore, when considering randomised designs in cancer trials or for any trial with a binary outcome, we recommend the proposed approach over other designs that allow early stopping.

Fast-tracking clinical trial innovations – a COVID-19 silver lining

ABSTRACT. Background: Platform trials can save time, resources, and speed decision-making by offering the flexibility to add new treatment options to an ongoing trial as they become available. But as a modern approach to research, their use has been limited to a concise number of trials mostly in oncology. COVID-19 has provided a ripe opportunity to change this. High profile platform trials RECOVERY and Solidarity have received considerable favorable attention for their efficient conduct and success in identifying improved treatments. However, the full extent of platform design adoption in COVID-19 has not yet been assessed. We present a rapid review to systematically explore how platform trials are operating in the fight against COVID-19. Methods: We conducted searches in PubMed,, and the Cytel COVID-19 Clinical Trials Tracker between October 21st and November 4th 2020. Platform trials were defined by their explicitly stated flexibility to add future arms. Selection of relevant trials was validated on a sample between two independent reviewers. Results: Forty-five platform trials in COVID-19 were registered in the first 10 months of 2020. Of these, 41 (91%) incorporate adaptive features and 13 (29%) clearly state a Bayesian approach. Multiple trials have tested the same therapies, and justification for this is not clear in all cases. REMAP-CAP has opened 15 treatment arms for COVID-19, while remaining trials have had between 2 and 9 active arms. RECOVERY has both opened and closed 3 arms to date. Twenty trials (44%) have publicly shared their protocols; only three provide full statistical analysis plans. Fifteen trials (33%) have committed to sharing individual patient data (IPD). Conclusions: The huge surge of platform design use in COVID-19 is promising for the future of clinical trial design and conduct. Future work may examine what unique challenges these trials face during conduct compared to more standard designs. COVID-19 is a global tragedy, but the experience gathered from the rapid and widespread implementation of platform trials will greatly accelerate our understanding and adoption of these innovative designs.

Adaptive clinical trials with selection of composite endpoints and sample size reassessment

ABSTRACT. For randomized clinical trials where a single, primary, binary endpoint would require unfeasibly large sample sizes, composite endpoints are widely chosen as the primary endpoint. For example, in cardiovascular trials, it is usual to consider the composite of death, fatal myocardial infarction (MI), and hospitalization as primary endpoint, and death or fatal-MI as main secondary endpoint. Despite being commonly used, composite endpoints entail challenges in designing and interpreting results. Given that the components may be of different relevance and have different effect sizes, the choice of components must be made carefully (EMA, 2017). Especially, sample size calculations for composite binary endpoints depend not only on the anticipated effect sizes and event rates of the composite components but also the correlation between them (Bofill & Gómez, 2019). However, information on the correlation between endpoints is usually not reported in the literature which can be an obstacle for sound trial design. We consider two-arm randomized controlled trials with a primary composite binary endpoint and an endpoint that consists only of the clinically more important component of the composite endpoint. We propose a trial design that allows an adaptive modification of the primary endpoint based on blinded information obtained at an interim analysis. Especially, we consider a decision rule to select between a composite endpoint and its most relevant component as primary endpoint. The decision rule chooses the endpoint with the lower estimated required sample size. Additionally, the sample size is reassessed using the estimated event rates and correlation, and the expected effect sizes of the composite components. We investigate the statistical power and significance level under the proposed design through simulations. We show that the adaptive design is equally or more powerful than designs without adaptive modification on the primary endpoint. Besides, the targeted power is achieved even if the correlation is misspecified at the planning stage while maintaining the type I error. All the computations are implemented in R and illustrated by means of a cardiovascular trial.

13:00-14:30 Session OC6C: Joint models &2-stage approach
Marginalized two-part joint modeling of longitudinal semi-continuous responses and survival data: with application to medical expenses

ABSTRACT. The distribution of medical cost data is generally right-skewed, with a substantial number of zero values. Data censorship due to incomplete follow-up and death of participants is quite common. It is important to consider the potential dependence of survival status and medical cost where this censorship is death-related. Dismissing cost data in survival analysis can lead to a biased estimation of the parameters. Despite the wide use of conventional two-part joint models (CTJMs) to capture zero inflation, they are limited to conditional interpretations of the regression coefficients in the model's continuous part. In the present study, we propose a marginalized two-part joint model (MTJM) to analyze semi-continuous longitudinal cost data and survival data. The aim of this study was to extend the joint modeling approach to handle marginal inferences about the covariate effects on the average expenditures. We conducted a simulation study to evaluate the performance of the proposed MTJM compared to the CTJM. To illustrate the properties of the MTJM, we applied the model to a set of real electronic health record data recently collected in Iran. We found that the MTJM yields a smaller standard error and root mean square error of estimates. Furthermore, we identified a significant positive correlation between cost and survival confirmed the simulation results. In summary, the MTJM is a better alternative to the CTJM for handling marginal inferences for death-related censoring.

Healthy life expectancy computation using the Item Response Theory combined with a joint modeling approach

ABSTRACT. Healthy life expectancy and more broadly life expectancy without specific conditions constitute key epidemiological quantities for public health policies, notably to estimate the associated needs and to organise and plan health and welfare systems. In some contexts, the target condition is a specific disease diagnosed at a certain time, and an illness-death multi-state approach can be employed to derive such quantities from prospective cohort data. In other contexts, such as the disability-free life expectancy, the conditions are defined as the occurrence of one or various limitations that are intermittently assessed at cohort visits using items from measurement scales. We propose an original approach to compute epidemiological quantities when the target conditions are defined according to the items of a measurement scale repeatedly measured over time. We rely on the Item Response Theory (IRT) that we extend to handle repeated item measures and association with a time-to-event. Specifically, (i) the underlying dimension measured by the repeated items is modeled over time using a mixed model, (ii) ordinal and continuous items are considered using appropriate measurement equations, and (iii) association with the time-to-event is modeled within the shared random effect framework. From the Maximum Likelihood Estimates of this IRT joint model (obtained within lcmm R package), we derive quantities such as life expectancy without certain items limitations or expected duration with intermediate levels of item limitations. The methodology is validated using simulations and contrasted with the multi-state modeling approach. Given the major challenge of aging and associated dependency, we apply this methodology to the analysis of the limitations in basic and instrumental activities of daily living (ADL and IADL). Using the long-term data of French population-based aging cohorts, we provide estimations of life expectancy under different scenarios of dependency, for instance life expectancy without severe limitation in any basic ADL or expected duration with mild to moderate dependency (i.e., IADL limitation but no ADL limitation). Compared to multi-state models, this methodology has the advantage of integrating all the available information of the measurement scale and its underlying structure, and naturally handles interval censoring.

Modelling the effect of longitudinal latent toxicity profiles on survival: an application to osteosarcoma

ABSTRACT. Background In many clinical applications involving longitudinal data, the interest lies in analysing the evolution of similar latent profiles related to subgroups of individuals rather than in studying their observed attributes. Since latent variables can be considered as the outcomes of a stochastic latent process which determines the evolution of observed responses, these characteristics may reflect patients’ quality-of-life, including valuable information related to patient’s health status and disease progression. In cancer trial, the analysis of longitudinal chemotherapy data is a complex task due to the presence of negative feedbacks between exposure to cytotoxic drugs and treatment toxicities. Models for time-to-event able to deal with the longitudinal nature of toxicity evolution during treatment are necessary, still not well developed.

Objectives The main purpose is to study the evolution of longitudinal latent toxicity progression and their association with time-to-event outcome. New methods to reconstruct the longitudinal latent profiles of toxicity evolution by means of Latent Markov Models (LMM) and Functional Data Analysis (FDA) are developed. Inclusion into survival models is further discussed.

Methods Data from MRC-BO06/EORTC-80931 randomised controlled trial for osteosarcoma treatment are used. Non-haematological chemotherapy-induced toxicity for nausea, mucositis, ototoxicity, infection, neurological and cardiac toxicity are registered at each cycle and graded according to the Common Terminology Criteria for Adverse Events. First, a LMM approach is applied to identify and reconstruct the latent profiles over chemotherapy cycles in terms of observed toxicity grades. FDA techniques to represent the latent characteristics in terms of dynamic functions are exploited. Through dimensionality reduction methods, the functional latent profiles are finally included into survival models.

Results & Conclusions The proposed methodology allows (i) to move from complex chemotherapy data to different subgroups of individuals with similar patterns of toxicity progression by identifying the latent states, (ii) to reconstruct the longitudinal latent toxicity profiles in a tailored way and (iii) to quantify the association with patient’s survival. This approach represents a novelty for osteosarcoma treatment, providing new insights for childhood cancer. Thanks to its flexibility, this procedure could be tailored to other fields according to the needs of the study.

Joint model versus linear mixed model to analyze longitudinal data of health-related quality of life in cancer clinical trials

ABSTRACT. Health-related quality of life (HRQoL) is increasingly used as an endpoint in cancer randomized clinical trials. In palliative settings, the observation of the longitudinal HRQoL outcome is often truncated by death. HRQoL and risk of death being hardly independent, unbiased HRQoL estimates cannot be obtained unless modeling the death process jointly with the HRQoL outcome, for instance with a joint model (JM). However, the linear mixed model (LMM) remains the usual strategy to analyze HRQoL longitudinal data. The aim of this work is to highlight the situations where the LMM fails, and to measure the size and direction of the biases and their consequences on the clinical conclusions. We have first compared the most frequently used LMM, namely the random intercept and slope model, and its corresponding JM on data of patients suffering from metastatic pancreatic cancer. HRQoL was assessed using the EORTC QLQ-C30 questionnaire and we focused on six dimensions of interest. From this application, we have also derived assertions on the situations where biases arise from the LMM, then confirmed and complemented through a simulation study. Death acting as informative dropout, larger the risk of death in a given arm, larger the bias of the slope governing the HRQoL trajectory in this arm. Thus, if death is associated with poor HRQoL, the LMM will be too optimistic. In the application, the JM found that (compared to the LMM) HRQoL in the control arm improved less (global health status/QoL, role functioning, fatigue, pain), deteriorated more (physical functioning) or deteriorated instead of improved (social functioning). If survival is similar between arms, the two slopes will be equally biased by the LMM, maintaining the right slope difference. However, if survival is better in the experimental arm because of the direct treatment effect and/or a better HRQoL (indirect treatment effect), the HRQoL slope will be less biased in this arm, and thus the arm-by-time interaction effect will be biased in disfavor of the experimental arm. Based on real and simulated data, this work reveals when, how and why the LMM produces biased estimates and shows the interest of using instead a JM.

Joint modelling of the temporal relationships between multivariate longitudinal markers of Alzheimer’s disease progression and clinical endpoints

ABSTRACT. Alzheimer’s disease and related dementias (ADRD) are a major public health issue with substantial economic and social costs. Many neuropathological processes have been identified in the preclinical progression of ADRD including protein accumulation in the brain, regional brain atrophies and cognitive dysfunction with repercussions on daily life. Yet, the underlying mechanisms linking these processes over time are still poorly understood. The wealth of biomarker data measured repeatedly over time now available in aging cohorts offers an unprecedented opportunity to better understand the complex pathways of ADRD, essential to improve prevention strategies and individual healthcare. However, current statistical methods are not designed to identify a structure of temporal dependence between repeated biomarkers. We propose an original joint modelling framework to describe the complex dynamics of the different processes involved in ADRD progression along with the ADRD diagnosis and death, and apprehend their temporal relationships. The longitudinal submodel is a dynamic causal model which combines the multivariate mixed model theory with difference equations to explain the future change over time of each dimension according to the history of the others (Bachirou et al., 2020). The longitudinal model accommodates various natures of biomarkers (continuous, ordinal). The association of the biomarkers with ADRD diagnosis and death are described via a shared random effect joint modelling approach. Inference is performed in the Maximum Likelihood framework within the CInLPN R package. The methodology is applied on a French epidemiological cohort on cognitive ageing, the 3-City study. In particular, we aim to disentangle the temporal relationships between three major drivers of ADRD natural history, cerebral atrophy, cognition and functional autonomy, in link with the two major clinical events in ADRD progression: diagnosis and death. This methodology and application envision to help understand the complex interplay between the biomarkers of ADRD, and identify relevant targets which may slow down the progression at each stage of the disease. Although primarily motivated by the ADRD study, this dynamic causal joint model constitutes a novel tool to investigate temporal associations between repeated markers with potential notably in longitudinal mediation analyses.

Bachirou et al., Biometrics, 2019,

13:00-14:30 Session OC6D: Time-to-event methods for non-proportional hazards
Methods for analyzing time-to-event endpoints in immuno-oncological trials with delayed treatment effects

ABSTRACT. In cancer drug research and development, immunotherapy plays an ever more important role. A common feature of immunotherapies is a delayed treatment effect which is quite challenging when dealing with time-to-event endpoints. In case of time-to-event endpoints, regulatory authorities often require a log-rank test, which is the standard statistical method. The log-rank test is known to be most powerful under proportional-hazards alternatives but suffers a substantial loss in power if this assumption is violated. Hence, a rather long follow-up period is required to detect a significant effect in immunooncology trials. For that reason, the question arises whether methods exist that are more susceptible to delayed treatment effects and that can be applied early on to generate evidence anticipating the final decision of the log-rank test to reduce the trial duration without inflation of the type I error. Such alternative methods include, for example, weighted log-rank statistics with weights that can either be fixed at the design stage of the trial, or chosen based on the observed data, tests based on the restricted mean survival time, survival proportions, accelerated failure time (AFT) models, or additive hazard models. We evaluate and compare these different methods systematically with regard to type I error control and power in the presence of delayed treatment effects. Our systematic simulation study for type I error includes aspects such as different censoring rates and types, different times of delay, and different failure time distributions. For a preliminary power evaluation, simulations based on published data from the CHECKMATE trials were performed. First results show that most methods achieve type I error rate control and that, by construction, the weighted log-rank tests which place more weight on late time points have a greater power to detect differences when the treatment effect is delayed. It is furthermore investigated whether and to what extent these methods can be applied at an early stage of the trial to predict the decision of the log-rank test later on.

Non-proportional hazards in immuno-oncology: Is an old perspective needed?

ABSTRACT. A fundamental concept in two‐arm non‐parametric survival analysis is the comparison of observed versus expected numbers of events on one of the treatment arms (the choice of which arm is arbitrary), where the expectation is taken assuming that the true survival curves in the two arms are identical. This concept is at the heart of the counting‐process theory that provides a rigorous basis for methods such as the log‐rank test. It is natural, therefore, to maintain this perspective when extending the log‐rank test to deal with non‐proportional hazards, for example, by considering a weighted sum of the “observed ‐ expected” terms, where larger weights are given to time periods where the hazard ratio is expected to favor the experimental treatment. In doing so, however, one may stumble across some rather subtle issues, related to difficulties in the interpretation of hazard ratios, that may lead to strange conclusions. An alternative approach is to view non‐parametric survival comparisons as permutation tests. With this perspective, one can easily improve on the efficiency of the log‐rank test, while thoroughly controlling the false positive rate. In particular, for the field of immuno‐oncology, where researchers often anticipate a delayed treatment effect, sample sizes could be substantially reduced without loss of power.

Weighted hazard ratio for time to event endpoints under non proportional hazards

ABSTRACT. Non-proportional hazards (NPH) have been observed in confirmatory clinical trials with time to event outcomes. Under NPH, the hazard ratio does not stay constant over time and the log-rank test is no longer the most powerful test. The weighted log-rank test (WLRT) has been introduced to deal with the presence of non-proportionality in clinical trial data. The WLRT allows for different weights to be assigned at time points where we expect a higher treatment effect in a particular time of the study. For example, in a delayed treatment effect setting, we may down-weight the early time points and hence emphasise the latter part of the survival curve. We will focus our attention on the WLRT and the complementary Cox model based on time-varying treatment effect proposed by Lin and Leon (Lin & Leon, 2017). The model incorporates treatment effect β weighted by the effect adjustment factor A(t) which is essentially the weight function w(t) scaled by max(w(t)). The estimate derived from the model provides a description of the treatment effect over time where β is interpreted as the maximal treatment effect. For this study, two scenarios representing the NPH pattern were simulated. The WLRT and the complementary Cox model were applied to the simulated data. In the diminishing treatment effect scenario, the NPH are simulated in a way such that the chosen weight function would be optimal from a testing perspective. However, the proposed model (Lin & Leon, 2017) gives biased estimates and overestimates the treatment effect. According to the (Lin & Leon, 2017) suggestion of estimation, the treatment effect appears to diminish more rapidly than the real time profile of the treatment effect. This leads to a large difference between the estimated treatment effect and the true treatment effect. We conclude that the proposed weighted log-rank test (Lin & Leon, 2017) and the complementary Cox model based on time-varying treatment effect cannot be recommended as the primary analysis tool in a randomised controlled trial.

References Lin, R. S. & Leon, L. F., 2017. Estimation of treatment effects in weighted log-rank tests. Contemporary Clinical trials Communications, Volume 8, pp. 147-155.

Comparison of statistical methods for estimating time-varying treatment effect on adverse event risk

ABSTRACT. Background: In pharmacoepidemiology, assessing the effect of drug exposure on the risk of an adverse event is challenging because exposure can vary over time and its effect can be complex. Cohort and nested case control (NCC) designs are widely used in this context. However, the complexity introduced into the exposure-response relationship by the need to account for the temporal variation of exposure makes it necessary to evaluate the properties of the resulting estimators.

Methods: We simulated 1000 prospective cohorts of 5000 individuals for fixed or time-varying exposure with a unique change (from "unexposed" to "exposed") during follow-up. We varied exposure prevalence, hazard ratios of event associated with exposure and proportions of subjects who experienced the event (cases). The resulting data for the entire cohort were analyzed using conventional Cox model with time-invariant or time-dependent covariates. For the NCC design, k controls (k up to 20) were selected for each case and analysis relied on conditional logistic regression.

Results: In all scenarios, the cohort design had small relative bias (<5%), was more precise and had greater power than the NCC design. The NCC design displayed negative bias that tended to increase with higher proportion of events for both types of exposure and higher hazard ratio, while it decreased with lower exposure prevalence for fixed exposure and lower hazard ratio. It also decreased with increasing number of controls per case (from more than 20% for one control to less than 8% for 5 controls or more), but remained substantial relative to the cohort design. These methods were applied to estimate the risk of breast cancer associated with menopausal hormone therapy (MHT) in 38,092 women of the E3N cohort, of whom 45.1% were considered exposed to MHT (ever users) at baseline and 5.9% were diagnosed with breast cancer during follow-up. Differences observed between the two designs were consistent with simulated data.

Conclusion: Although the NCC design may help refine cohort analyses, results from this design should be interpreted with caution given its potential limitations. More complex exposures (e.g., time-varying with multiple changes or decreasing hazard ratios) are under evaluation.

Evidence Synthesis of Time-To-Event Outcomes in the Presence of Non-Proportional Hazards

ABSTRACT. Introduction: In the world of evidence-based medicine systematic reviews and meta-analyses are often considered as providing the strongest and highest quality evidence for treatment recommendations. Time-to-event outcomes are often synthesised using effect measures from Cox proportional hazards models assuming a constant hazard ratio over time. However, where treatment effects vary over time an assumption of proportional hazards is not always valid. Any bias will be further compounded if studies vary markedly in duration of follow-up. Several methods have been proposed for synthesising time-to-event outcomes in the presence of non-proportional hazards. However, guidance on choosing between these methods and the implications for downstream decision making is lacking.

Methods: Through application to a network of individual patient data from melanoma trials reporting overall survival we compared five methods for estimating treatment effects from time-to-event outcomes which relax the proportional hazards assumption. We compared a two-stage network meta-analysis model synthesising restricted mean survival time with four one-stage approaches: piecewise exponential, fractional polynomial, Royston-Parmar and an accelerated failure time generalised gamma model. All models were fitted in the Bayesian setting. To assess the performance of these methods under varying amounts of non-proportionality we conducted a simulation study. Individual patient data was generated from a mixture Weibull distribution assuming a treatment-time interaction. We assessed performance using bias, mean-square error and coverage.

Results: All models fitted the melanoma data reasonably well with some differences in the survival curves and initial variation in the treatment rankings. However, from 18 months onwards, all models consistently ranked the same treatment as the most effective. The simulation study demonstrated the potential for different conclusions from different modelling approaches.

Conclusions: The generalised gamma, piecewise exponential, fractional polynomial, Royston-Parmar and two-stage restricted mean survival time models can all accommodate non-proportional hazards and differing lengths of trial follow-up within an evidence synthesis of time-to-event outcomes. Further work is needed in this area to extend the simulation study to the network meta-analysis setting and provide guidance on the key considerations for informing model choice for the purposes of clinical decision making.

13:00-14:30 Session OC6E: relative survival and net benefit estimation
Additional benefit method assessment for time-to-event endpoints – A comparison of ESMOs and IQWiGs approach

ABSTRACT. Background: New cancer treatments are often promoted as major advances after a significant phase III trial. Therefore, a clear and unbiased knowledge about the magnitude of the clinical benefit of newly approved treatments is important to assess the amount of reimbursement from public health insurance of new treatments. To perform these evaluations, two distinct “additional benefit assessment” methods are currently used in Europe. The European Society for Medical Oncology (ESMO) developed the Magnitude of Clinical Benefit Scale Version 1.1 classifying new treatments into 5 categories using a dual rule considering the relative and absolute benefit assessed by the lower limit of the 95% HR confidence interval or the observed absolute difference in median treatment outcomes, respectively[1]. As an alternative, the German IQWiG compares the upper limit of the 95% HR confidence interval to specific relative risk scaled thresholds classifying new treatments into 6 categories[2]. Until now, these methods have only been compared empirically and further research is required to understand the differences between the benefit assessment methods. Method(s): We evaluate and compare the two methods in a simulation study with focus on time-to-event outcomes. The simulation includes aspects such as different censoring rates and types, incorrect HRs assumed for sample size calculation, informative censoring, and different failure time distributions. Since no “placebo” method reflecting a true (deserved) maximal score is available, different thresholds of the simulated treatment effects are used as alternatives. The methods’ performance is assessed via ROC curves, sensitivity / specificity, and the methods’ percentage of achieved maximal scores. Results and Conclusion: The results of the first step of a comprehensive comparison between the methods indicate that IQWiG’s method is usually more conservative than ESMO’s dual rule. Moreover, in some scenarios such as quick disease progression or incorrect assumed HR, IQWiG’s method is too liberal compared to ESMO. Nevertheless, further research is required, e.g. investigation of the methods’ performance under non-proportional hazards. In addition, the American Society of Clinical Oncology (ASCO) has developed another method for the assessment of additional benefit of new treatments using the HR point estimate, which remains to be compared to IQWiG and ESMO.

Obtaining long-term stage-specific relative survival estimates in the presence of incomplete historical stage information

ABSTRACT. Long-term estimates of cancer survival which are both up-to-date and stage-specific are valuable to researchers, clinicians and patients alike. However, the completeness of recording for cancer stage is historically poor in some registries, making it challenging to provide long-term stage-specific survival estimates. It is well-known that stage-specific survival differences are driven by differences in prognosis shortly after diagnosis and that the impact of cancer on survival is also greatest in the short-term. Hence, estimated survival metrics using period analysis1 (analysing follow-up information for a recent time window to provide up-to-date survival estimates) are unlikely to be sensitive to the imputed values for historical stage data. Our aim was to use period analysis and multiple imputation together with flexible parametric models (FPMs) on the excess hazard scale2 to estimate up-to-date stage-specific long-term survival metrics, when historical stage information is largely incomplete. We further intended to show the lack of sensitivity to extreme assumptions about the historical stage distribution. We used data from the Surveillance, Epidemiology, and End Results (SEER) Program for three cancer sites with varying prognoses and high stage completeness: lung, colon and melanoma. The period window for analysis was set to 2015-2017, and we further defined a pre-window 2012-2015 to evaluate various scenarios. We artificially inflated the proportion of missing stage information prior to the period window (in some scenarios conditional on stage) to mimic other registry settings. Our standard imputation model included sex, age, grade, the event indicator and Nelson-Aalen cumulative hazard estimate2. Four separate scenarios were imposed for individuals diagnosed before the pre-window, with some aiming to emulate the most extreme imputed stage distributions possible. In each case, a FPM was fitted on the excess hazard scale and the differences in stage-specific survival metrics were assessed. Estimates were also obtained from non-parametric approaches for validation. There was little difference between the 10-year marginal stage-specific relative survival, regardless of the assumed stage distribution for the historical stage data. We show that when conducting a period analysis, multiple imputation can be used to obtain stage-specific long-term estimates of relative survival, even when the historical stage information is largely incomplete.

A latent class model for the estimation of the excess mortality hazard for correcting inaccurate background mortality

ABSTRACT. Context: Net survival is the survival that would be observed if only deaths from the studied disease were considered. In the absence of known cause of death, net survival is estimated through excess mortality by splitting the observed mortality into excess and background mortalities. The latter is obtained from general population lifetables, usually stratified on a limited number of covariates. Specifically, lifetables do not always include some covariates that may influence the background mortality, whether they are observable or not in the data. This can lead to biased estimates of their effects on the excess mortality. To address this issue, regression models have been proposed to estimate excess mortality [1-2]. However, they only consider the case of a single missing covariate observable in the data. Objective: To propose a latent class approach for modelling excess mortality by correcting inaccurate background mortality when one or more covariates are missing in the lifetable, whether they are observable or not in the data. Additionally, it allows to characterize unobserved (latent) subgroups of patients. Method: The latent class model includes two modelling components: a multinomial logistic regression model for the latent class membership, and an excess hazard model with multiplicative parameters according to the profiles of the latent classes. We assessed performance of this model through simulation studies and compared them to other models [1-2]. We considered plausible scenarios from an epidemiological standpoint based on one or more missing covariates in the lifetable, these covariates being potentially unobservable in the data. Results: Compared to both models developed in the presence of a single observable covariate in the data and missing from the lifetable, the proposed model showed comparable performance (biases close to 0, similar mean square errors). It remained robust in the other scenarios where more covariates are missing. Applied to population-based colon cancer registry data, the proposed model provided estimates of excess hazard ratio in each latent class. Conclusion: In case of insufficiently stratified lifetables, a latent class model performed well in estimating excess mortality. Moreover, it yields an a posteriori classification of patients allowing a better description of their epidemiological profiles.

Robust statistical inference for the matched net benefit and win ratio.

ABSTRACT. As alternatives to the time-to-first-event analysis of composite endpoints, the net benefit (NB) and the win ratio (WR) -- which assess treatment effects using prioritized component outcomes based on clinical importance -- have been proposed. However, statistical inference of NB and WR relies on large-sample assumptions, which can lead to an invalid test statistic and inadequate, unsatisfactory confidence intervals, especially when the sample size is small or the proportion of wins is near 0 or 1.

For this talk, we will show how to address these limitations in a paired-sample design. We first introduce a new test statistic under the null hypothesis of no treatment difference. Then, we present new ways to estimate the confidence intervals of NB and WR. The confidence interval estimations use the method of variance estimates recovery (MOVER). The MOVER combines two separate individual-proportion confidence intervals into a hybrid interval for each estimand of interest. We assess the performance of the proposed test statistic and MOVER confidence interval estimations through simulation studies.

The results show that the MOVER confidence intervals are as good as the large-sample confidence intervals when the sample is large and the proportions of wins is bounded away from 0 and 1. Moreover, the MOVER intervals outperform their competitors when the sample is small or the proportion of wins is near the boundaries 0 and 1. We illustrate the method (and its competitors) using three examples from randomized clinical studies.

A unifying framework for flexible excess hazard modeling with applications in cancer epidemiology

ABSTRACT. Excess hazard modeling has become the preferred tool in population-based cancer survival research as it overcomes drawbacks of both the traditional overall survival setting as well as of the so-called cause-specific setting. This is achieved by assuming an additive decomposition of the overall hazard into two components: the excess hazard, e.g. due to the cancer of interest, and the population hazard due to all other causes of death.

Despite its widespread use, a unifying framework supporting virtually any application fully implemented in a ready to use package is not, to the best of our knowledge, available in the literature. We thus propose a solution which allows for excess hazard modeling in addition to traditional overall survival analysis. The general nature of our approach extends to two further levels. The first is that the link-based additive model proposed allows for many types of covariate effects. In particular, any type of smoother and modeling spatial effects. Estimation is achieved using a carefully structured efficient and stable penalized likelihood algorithm.

The proposed model is then extensively evaluated through a battery of simulation studies in which it is also compared with the analogous state-of-the-art framework (Fauvernier et al., 2019, JRSSC, 68(5),1233–1257), implemented in the R package survPen. Two practical applications in breast, colon and lung cancer epidemiology are presented and highlight the benefits of using this flexible framework. The model which includes non-linear time-dependent effects and spatial effects is, in fact, found to be superior to that with only linear effects. Additional insight on socio-demographic inequality in cancer survival is also found by including spatial effects. Some theoretical properties are discussed. The proposed approach is readily available in the R package GJRM.

13:00-14:30 Session OC6F: Deep learning
Cluster analysis on emergency COVID-19 data: A result-based multiple imputation for missing data

ABSTRACT. Background and Objective In 2020, hospitals have been confronted with an influx of COVID-19 confirmed patients. Grouping patients based on clinical features could help clinicians to identify a structure of patients who needs more attention. This study considers cluster analysis to identify different clinical phenotypes with similar properties while accounting for the presence of missing data. Several frameworks exist for handling missing data in cluster analysis, in this study, a new perspective was introduced for multiple imputation in cluster analysis that focused on the result of clustering. Method To handle the uncertainty of missing values, m imputed datasets were generated. The model-based clustering strategy was applied on the imputed datasets. Based on BIC criterion, the best method and the best number of groups were defined for all imputed datasets. The most repetitive number of groups and types was fixed. In the next step, cluster analysis was re-applied on m imputed datasets by the fixed number of clusters and type. The results of the statistical analysis were reported for each of the groups in imputed datasets. According to Rubin’s rules, in the pooled step, the final results were combined by mean and the statistical inferences were applied by considering between and within variance. Results The performance of the proposed framework was compared in several scenarios. The proposed method with 20 clinical features was performed on 628 confirmed COVID-19 patients who presented at University Hospital of Liege from March to May 2020. Based on model-based clustering and BIC criterion for multiple imputation, the patients were classified into four clusters. The rate of hospitalization in Cluster2 with older patients was higher than those in Cluster1. The oldest patients were assigned to Cluster3 and Cluster 4. The rate of comorbidity was almost close to 100% in Cluster 4 and percentage of infectious disease in cluster3 was less than Cluster4; however, Cluster3 had a higher rate of hospitalization than Cluster4. Conclusions The proposed method handled cluster analysis on missing data by multiple imputations. Also, the present study identified four clusters of patients confirmed with COVID-19 and the corresponding rate of hospitalization based on clinical features.

Statistical Power for Single Cell Representations

ABSTRACT. One of the most common applications of single-cell RNA-sequencing (scRNA-seq) experiments is to discover groups of cells with a similar expression profile in an attempt to define cell identities. The similarity of these expression profiles is typically examined in a low-dimensional latent space, which can be learned by deep generative models such as variational autoencoders (VAEs). However, the quality of representations in VAEs varies greatly depending on the number of cells under study, which is also reflected in the assignment to specific cell identities. We propose a strategy to answer what number of cells is needed so that a pre-specified percentage of the cells in the latent space is well represented. We train scVI, a VAE that has been adapted to scRNA-seq data, on a varying number of cells and evaluate the quality of the learned representation by use of the estimated log-likelihood lower bound of each cell. The distribution arising from the values of the log-likelihoods are then compared to a permutation-based distribution of log-likelihoods. We generate the permutation-based distribution by randomly drawing a small subset of cells before training the VAE and permuting the expression values of each gene among these randomly drawn cells. By doing so, we ensure that the overall structure of the latent representation is preserved, and at the same time, we obtain a null distribution for the log-likelihoods. We then compare log-likelihood distributions for different numbers of cells. We also harness the properties of VAEs by artificially increasing the number of samples in small datasets by generating synthetic data and combining them with the original pilot datasets. We illustrate the performance of our approach with two application examples. First, we show an application for sample size planning on a subset of the Tabula Muris dataset. Second, we show how to estimate statistical power for latent representations when studying cell subtypes using the PBMC dataset from 10x Genomics. We show that such analyses can potentially improve the reliability of downstream analyses such as cell identity detection.

Effects of Interactions and Dataset Size in Neural Networks and Other Machine Learning Approaches

ABSTRACT. Machine Learning and Deep Learning are powerful models able to predict the presence of a disease in a patient. A good predictive performance of such a model depends on the size of the training dataset and the complexity of the relationships between the presence of the disease and the variables included in the model. Usually, this complexity is unknown. Several datasets were simulated to allow comparing the effects of the two factors in linear prediction models. Datasets were constructed to represent a situation similar to that of the Framingham study in which the purpose of the model was to predict the presence of coronary disease. Three dataset sizes and three complexity levels were considered. A few logistic regressions (penalized and non-penalized) and neural networks (with different numbers of hidden layers) were trained on the simulated datasets. All models were evaluated by cross-validation with accuracy, F1-score, and area under the ROC curve as criteria. Logistic regressions were trained with terms of various interaction orders. Comparisons showed that logistic regressions were less influenced by the size of the dataset but needed the right interaction terms to achieve good performance. On the contrary, neural networks were more sensitive to the dataset size but did not need interaction terms to make good predictions. In the most complex scenarios, neural networks performed better than logistic regressions without interaction terms. However, logistic regressions with the right interaction terms provided better results.

CamemBERT word-embedding for Information Extraction in a Biomedical Context

ABSTRACT. Deep Learning has totally changed the landscape of Natural Language Processing (NLP) over the last few years. Very large word-embedding models pre-trained on colossal amounts of data (such as BERT) are now used to perform NLP tasks. However, these models are rarely trained on French biomedical data because these data are difficult to obtain due to the European General Data Protection Regulation. Within this context, we wondered whether these word-embedding models are able to perform French NLP tasks in the biomedical field in which they are not yet trained. The French language model CamemBERT was used in two different approaches (feature-based and fine-tuning) and three recurrent networks (Recurrent Neural Network, Long Shot-Term Memory, and Bidirectional Long Short-Term Memory) in a biomedical Natural Language Understanding task. This task aims to extract directly socio-demographic and biomedical information from patient speech. For each approach-network combination, 50 bootstraps were performed to compare the average performance in terms of precision, recall, and F1 score. The study showed that extracting socio-demographic and biomedical information was much easier with CamemBERT than without it whatever the approach and the network. The fine-tuning approach had a slight advantage over the feature-based approach. Bidirectional Long Short-Term Memory performed better than Long Short-Term Memory and Recurrent Neural Network. Interestingly, CamemBERT performed satisfactorily without previous training in the biomedical field.

Interpretable effect estimates of semi-structured predictors in deep distributional regression models

ABSTRACT. Interpretable effect estimates and reliable predictions in applications that rely on unstructured data, such as images or electronic health records, require flexible modeling approaches. Deep-learning based models have proven exceptional in prediction tasks and seamlessly incorporate unstructured data, while lacking interpretability. Classical regression models provide readily interpretable effects, but are constrained to tabular data, which are typically obtained via feature extraction in the case of image data. In this work, we propose a flexible class of deep distributional regression models which trade off flexibility and interpretability of model parameters by additively decomposing the contribution of prognostic and predictive features on a predefined interpretational scale, such as the odds-scale. The backbone of these models are parametric transformation models, which estimate the entire conditional distribution of the outcome given the features by decomposing this conditional distribution into an a priori chosen, parameter-free target distribution and a parametric transformation function. The transformation function is parameterized via several neural networks, which are jointly trained by minimizing the negative log-likelihood induced by the transformation model. The benefits of deep transformation models in prediction tasks have been demonstrated before. However, interpretable effect estimates require concrete model architectures, which have received much less attention. For instance, co-linear features between tabular predictors and the image have to be circumvented by orthogonalizing image effects from tabular features. We discuss how to interpret the resulting image effects and regression coefficients in practice. We apply deep distributional regression to prediction of functional outcome after acute ischemic stroke, measured on an ordinal scale, and based on MRI data and routinely collected clinical parameters at baseline, such as age, blood pressure and time-to-first-image. Although we use observational data in this example, data from RCTs, for instance, allow estimation of predictive treatment effects using the very same models. The proposed models unite the benefits of deep learning and regression models by being able to capture non-linear effects of tabular features and the image as a whole, while still possessing interpretable model components.

A Bayesian nonparametric approach for modeling SF-6D health state utility scores

ABSTRACT. Background: Typically, models that were used for health state valuation data have been parametric. Recently, many researchers have explored the use of non-parametric Bayesian methods in this field. Objectives: In the present paper we report on the results from using a nonparametric model to predict a Bayesian SF-6D health state valuation algorithm along with estimating the effect of the individual characteristics on health state valuations. Methods: A sample of 126 Lebanese members from the American University of Beirut valued 49 SF-6D health states using the standard gamble technique. Results from applying the nonparametric model were reported and compared to those obtained using a standard parametric model. The covariates’ effect on health state valuations was also reported. Results: The nonparametric Bayesian model was found to perform better than the parametric model 1) at predicting health state values within the full estimation data and in an out-of-sample validation in terms of mean predictions, root mean squared error and the patterns of standardized residuals, and 2) at allowing for the covariates’ effect to vary by health state. The findings also suggest a potential age effect with some gender effect. Conclusion: The nonparametric model is theoretically more flexible and produces better utility predictions from the SF-6D than previously used classical parametric model. In addition, the Bayesian model is more appropriate to account the covariates’ effect. Further research is encouraged.

Small-sample accuracy of approximations of individual polynomial growth curves’ prediction error in linear mixed regression

ABSTRACT. Empirical Bayes estimates of random coefficients in mixed model regression allow for estimating individual growth curves. The expected average squared prediction error (ASPE) quantifies the prediction error of estimated growth curves. From the mean squared error of Bayes estimates based on the method of moments (Rao, 1975), two asymptotic expressions for the expected ASPE can be derived for empirical Bayes estimates based on maximum likelihood estimation. In an extensive Monte Carlo study, the small-sample accuracy of these expressions for the expected ASPE, when employing restricted maximum likelihood estimation, is examined for linear and quadratic trends across time. Varied are the number of subjects, the number of time points, relevant model parameters, as well as the time span of prediction. For the best approximation formula underestimation of the expected ASPE was smaller than 2% for a linear trend, and smaller than 5% for a quadratic trend. These results also held under various violations of normality for the random coefficients in mixed model regression. To obtain a safe upper bound for the prediction error when planning a study on individual growth curves, that is, when deciding upon the study length and the number of measurement occasions, these results suggest to multiply the prediction error calculated by the best approximation formula by a factor 1/0.98 for a linear trend and by a factor 1/0.95 for a quadratic trend.

Assessment of performance measures for external validation of multivariable prediction models: Simulations

ABSTRACT. Multivariable prediction models (MPMs) are developed for the diagnosis of diseases or the prediction of their progress. Before an MPM can be used in clinical practice, it has to be internally and externally validated. Commonly used performance measure for external validation are the C-statistic, the Brier score and the R-squared measure (Collins et al. 2014). We consider requirements and conditions for the assessment of these performance measures in a simulation framework, and we will present the prototype of such a simulation framework. A key feature of external validation is generalisability, an umbrella term for transportability and reproducibility of an MPM. If the development population and the external validation population are closely related, then we can assess the reproducibility of the model; if they are not closely related, then transportability will be the issue (Debray et al. 2015). When developing our simulation framework to assess the impact of relatedness between development and external validation population on performance measures, a crucial point was to establish a simple, reasonable and easy to communicate approach. Our simulation results show that the C-statistic, the Brier score and the R-squared performance measures work well in transportability settings. On the other hand, the results revealed unexpected behaviour of the Brier score in reproducibility settings. Furthermore, our simulation framework is simple, reasonable and easy to communicate.

References Collins et al.: External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Medical Research Methodology 2014 14:40. doi:10.1186/1471-2288-14-40 Debray et al.: A new framework to enhance the interpretation of external validation studies of clinical prediction models. Journal of Clinical Epidemiology 2015; 68:279-289. doi: 10.1016/j.jclinepi.2014.06.018

Mathematical proof of the equivalence between Post-test Predictive Probabilities and Predictive Values

ABSTRACT. In the last years, in order to evaluate of diagnostic performance, many Authors (see for example Crouser et al, 2019) use the so-called Post-Test Predictive Probability (Positive and Negative) in addiction to the usuals indices, such as Sensitivity, Specificity, Positive Predictive Value and Negative Predictive Value, Positive Likelihood Ratio and Negative Likelihood Ratio. Positive Post-Test Predictive Probability is formulated as a function of Pre-test Predictive Probability and Positive Likelihood Ratio, while Negative Post-Test Predictive Probability is formulated as a function of Pre-test Predictive Probability and Negative Likelihood Ratio. The Pre-test Predictive Probability corresponds to the Prevalence. Moreover, simply by starting from the definition of Positive Predictive Value and Negative Predictive Value based on the Bayes’ theorem, through a few mathematical passages (in detail, making explicit the Likelihood Ratios as functions of Sensitivity and Specificity), we demonstrate that they are equivalent to Positive Post-Test Predictive Probability and to the complement of Negative Post-Test Predictive Probability, respectively. In conclusion, we think that it is unnecessary, if not detrimental and confusing, to introduce other parameters in addition to Sensitivity, Specificity, Positive Likelihood Ratio, Negative Likelihood Ratio, Positive Predictive Value, and Negative Predictive Value in order to evaluate a diagnostic test performance, especially if they don’t get any additional information.

References: Crouser ED, Parrillo JE, Seymour CW, Angus DC et al. Monocyte Distribution Width: A Novel Indicator of Sepsis-2 and Sepsis-3 in High-Risk Emergency Department Patients. Crit Care Med 2019; 47:1018-1025

Development and validation of a clinical risk score to predict the risk of SARS-CoV-2 infection from administrative data

ABSTRACT. Background The novel coronavirus (SARS-CoV-2) pandemic spread rapidly worldwide, early increasing exponentially in Italy. To date, there is a lack of studies describing the clinical characteristics of the population most at risk of infection. Objectives Our aim was to identify clinical predictors of SARS-CoV-2 infection risk and to develop and validate a score predicting SARS-CoV-2 infection risk comparing it with unspecific surrogates. Methods A retrospective case/control study using administrative health-related database was carried out in Southern Italy (Campania region) among beneficiaries of Regional Health Service aged over 30 years. For each subject with Covid-19 infection confirmed diagnosis (case), up to five controls were randomly matched for gender, age and municipality of residence. Odds ratios and 90% confidence intervals for associations between candidate predictors and risk of infection were estimated by means of conditional logistic regression. SARS-CoV-2 Infection Score (SIS) was developed by generating a total aggregate score obtained from the assignment of a weight at each selected covariate, according to the least absolute shrinkage and selection operator (LASSO) method, using the coefficients estimated from the model. Finally, the score was categorized by assigning increasing values from 1 to 4. To evaluate the clinical utility of SIS for predicting infection and to compare the discriminatory ability of specific and unspecific predictors of SARS-CoV-2 infection, ROC curves and corresponding AUCs were used. Results Subjects suffering from diabetes, anaemias, Parkinson’s disease, mental disorders, cardiovascular and inflammatory bowel and kidney diseases showed an increased risk of SARS-CoV-2 infection. Similar estimates were recorded for men and women and younger and older than 65 years. Fifteen conditions significantly contributed to the SIS. As SIS value increases, risk progressively increases, being the odds of SARS-CoV-2 infection among people with the highest SIS value (SIS= 4), 1.74 times higher than those unaffected by any SIS contributing conditions (SIS=1). However, there was no evidence that specific scores had different discriminatory ability. Conclusions This study identified conditions and diseases making individuals more vulnerable to SARS-CoV-2 infection. Our results can be a tool supporting decision-makers for the identification of the population most vulnerable to Covid-19, allowing them to adopt preventive measures.

Subject-specific networks as features for predictive modelling - A scoping review of methods

ABSTRACT. Background: Recent advances in biotechnology (e.g. imaging modalities, microarrays and sequencing) enable the acquisition of high-dimensional data on individuals, posing challenges for prediction models which traditionally use covariates such as patient characteristics. Alternative forms of covariate representations for the features derived from these modern data modalities should be considered that can utilize their intrinsic interconnection. The connectivity information between these features can be represented as a network defined by a set of nodes and edges and specified for each individual in the dataset. Then, global or local graph-theoretical features may yield potential prognostic biomarkers instead of or in addition to traditional covariates and may replace the often unsuccessful search for individual biomarkers in a high-dimensional predictor space.

Methods: We conducted a scoping review to identify, summarize and critically appraise the state-of-art in the use of subject-specific networks for predictive modelling in biomedicine published during the time period 2000-2020 in the electronic databases PubMed, Scopus, Embase and arXiv.

Results: Our scoping review revealed that network approaches have been predominantly applied in neurological and pathopsychological studies, followed by genomics, cardiology and pathology (N=143). Data-driven network construction was mainly based on Pearson correlation coefficients of time series, but also alternative approaches (e.g. Gaussian graphical modelling, measures of distributional similarity) could be found. For independent variables measured only once for a subject, individual network construction was primarily based on leave-one-out resampling quantifying each individual’s contribution to the overall group-level structure. Graph-theoretical features were mainly assessed locally (nodes) and used for classification between outcome groups by supervised learning techniques (support vector machines, random forests). While the value of the subject-specific network framework was highlighted in most of the identified studies, we found methodological concerns regarding network sparsification approaches and graph-theoretical feature extraction and selection.

Conclusion: This study demonstrates the potential of subject-specific networks as a new powerful approach to predictive modelling in medicine, in particular for modern data modalities spanning a high-dimensional predictor space. Although the intersection of network science and predictive modelling is a promising area of research, more work is needed on adequate network sparsification and outcome-guided graph-theoretical feature extraction and selection.

Individual dynamic predictions in joint analysis of non-linear longitudinal model and parametric competing risk model: application to sepsis patients

ABSTRACT. Introduction Systemic inflammatory reaction to an infection, called sepsis, potentially leads to fatal organ dysfunctions1. The SOFA score2, widely used in Intensive Care Unit (ICU) context, allows to quantify these organ dysfunctions. The objective of the work is to propose an original joint modelling approach based on a longitudinal model for daily SOFA measurement, coupled with a competing risk survival model to provide dynamic predictions of the risk of death during ICU stay. Methods Training data contains 4050 patients admitted in ICU for sepsis randomly taken from the OUTCOME REA database. The joint model is composed of a non-linear mixed effects sub-model to model individual SOFA evolution, and a full parametric sub-distribution hazard model, adjusted on age, to model the risk of death in the presence of ICU discharge as a competing event. Parameters are estimated with SAEM algorithm implemented on Monolix. This modelling approach is original and allows more flexibility than previous reported joint models based on linear mixed effects models3. Performances are assessed on a validation set composed of 1996 other patients randomly taken from the same database. Predictive performances are based on a landmark analysis with four landmark times at 5, 7, 10 and 15 days after admission and a horizon time of 30 days after admission. For each landmark time we report time-dependent AUC and Brier score. Results Median age was 65.1 years (IQR : 52.7 – 76.2) and median admission SOFA was 6 (IQR : 4 – 9). The 30-day mortality was 19%. A 15-day follow-up allows to well predict the 30-day post admission mortality with an AUC of 0.90 and a Brier score of 0.10. Early landmark times also showed good properties to predict 30-day mortality (for landmark=5d: AUC = 0.83, Brier score = 0.11). Discussion The joint modelling approach allows to quantify the future risk of death associated with any individual SOFA history, while genuinely overcoming issues with time-dependant covariates in sub-distribution hazard models. Despite better predictive performances than previous studies, further investigations on the properties of such models (e.g. validity of inference), based on simulation, are needed.

Estimating the treatment selection ability of a marker: review, comparison, and improvement of existing approaches

ABSTRACT. Treatment-selection markers are essential for precision medicine; they help clinicians choosing between two (or more) treatments. Most of these markers were initially continuous; hence, the early phases of treatment-selection marker assessment required methods to measure the overall ability of the marker for treatment selection without threshold defined. Historically, marker assessments have been carried out through analyses of treatment-marker interactions. However, the conclusions of such analyses depend on the model chosen (additive or multiplicative), and these assessments do not reflect the effect of a marker-based treatment choice on patients’ health. More recently, several indexes have been proposed, some purely statistical and some grounded in the medical decision-making theory. Graphical approaches have also been suggested but without sum-up indexes. This work presents and compare some of the existing approaches.

First, the context of treatment-selection marker assessment is presented with associated questions. Second, some existing methods are described, insisting on the underlying links between them and on possible improvements. More precisely, the work focuses on the ability of comparing the indexes stemming from various studies. Then, a simulation study is performed to compare the statistical properties of the methods considered (i.e., mean relative bias, RMSE, power, coverage, and mean width of the confidence intervals) by varying the ability of the marker for treatment selection and the risk of event (e.g, treatment failure, disease progression, etc.) in each treatment arm.

Within the context of treatment selection, approaches to marker evaluation should reflect the consequences of the marker-based treatment choice on patients’ health. They should also provide indexes able to yield an interpretation that is independent of the risk of event in each treatment arm to enable comparisons between several studies.

Double cut-point identification of continuous diagnostic test variables for clinical use

ABSTRACT. The identification of classification rules on a continuous diagnostic test is often of interest in laboratory/clinical data applications such as the case ofvibration-controlled transient elastography (VCTE) (Fibroscan, Echosens, Paris), a simple and non-invasive realiable biomarker of liver fibrosis.

In order to confirm or exclude the presence of liver fibrosis using VCTE results (avoding performing more invasive tests i.e. liver biopsy) two cut-points could be identified with two different approaches. The first consists in giving a predicted risk equal to a sufficiently low/high fixed values, and the second in using fixed negative and positive predictive values. Both approaches include an intermediate risk classification.

However, the commonly used approaches identify double or single cut-points on the ground of “reversed probabilities” of the distribution of the diagnostic result in diseased/non diseased populations by: fixed values of sensitivity and specificity, maximum sensitivity*specificity, maximum Youden function, nearest to (0,1) point on the ROC curve, etc.

In our opinion, the use of these methods is due to three reasons: 1. the presence of case-control data where the predicted risk cannot be calculated; 2. the non monotonicity of the nonparametric estimate of predictive values as function of the diagnostic test result; 3. the advocated advantage of cut-points identified on measures that do not depend on the prevalence

Of note, the latter aspect is in contrast with the clinical use of the diagnostic test in a population with a given prevalence and a classification rule that should depend on the numerical predicted risk [1].

In this work we show how in case of cross sectional/cohort data, such as the motivating application [2], the identification of two cut-points can be carried out to achieve the desidered sufficiently low/high fixed values for the predicted risk, or predictive values, instead of relying on sensitivity and specificity. We prove that in the latter case the interval between the two cut-points is slightly reduced with respect to the former. We discuss how cut-points could be reasonably modified to be used in populations with difference prevalence.

Patient heterogeneity assessments via network-based ANOVA

ABSTRACT. Studying heterogeneity between individuals lies at the core of precision medicine. Often, information about individuals can be represented as networks characterized by individual-specific edge values [1]. In this project, we develop a procedure to identify those networks that can be aggregated in a cluster where the determination of the final clustering is based on notions of statistical significant differences between clusters. In particular, we first use an unsupervised hierarchical algorithm to identify latent classes of similar networks. Similarity between networks is computed via appropriate distance measures between graphs (e.g. shortest-path kernel [2], DeltaCon [3], GTOM [4]). To determine the optimal number of clusters, we recursively test for distances between two groups of networks, progressing from the root node to the end nodes of the clustering dendrogram. The test itself is based on computing pairwise distances between graphs and is inspired by distance-wise ANOVA algorithms, commonly used in ecology [5]. Permutations are used to assess significance. We show the merits and pitfalls of our approach via extensive simulations and an application to inflammatory diseases. The results of our strategy pave the way towards deciding which networks can sensibly be aggregated. This is not only relevant in stratified medicine or for molecular reclassification of disease, but also in areas such as genetic epistasis detection in which conclusions need to be drawn from multiple statistical epistasis networks across different analysis protocols.

[1] Kuijjer, M. L., Tung, M. G., Yuan, G., Quackenbush, J., & Glass, K. (2019). Estimating sample-specific regulatory networks. Iscience, 14, 226-240. [2] Borgwardt, K. M., & Kriegel, H. P. (2005, November). Shortest-path kernels on graphs. In Fifth IEEE international conference on data mining (ICDM'05) (pp. 8-pp). IEEE. [3] Koutra, D., Shah, N., Vogelstein, J. T., Gallagher, B., & Faloutsos, C. (2016). Deltacon: Principled massive-graph similarity function with attribution. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(3), 1-43. [4] Yip, A. M., & Horvath, S. (2006, June). The Generalized Topological Overlap Matrix for Detecting Modules in Gene Networks. In BIOCOMP (pp. 451-457). [5] Anderson, M. J. (2005). Permutational multivariate analysis of variance. Department of Statistics, University of Auckland, Auckland, 26, 32-46.

A comparison of prioritisation tools for CVD risk assessment in UK Biobank: Primary care records versus polygenic risk scores

ABSTRACT. Background: Public health guidelines in England recommend individuals at higher estimated risk of cardiovascular disease (CVD) be prioritised for formal CVD risk assessment. However, no prioritisation tool has been proposed. We propose and compare two strategies involving a prioritisation tool leveraging routinely collected data within primary care records or polygenic risk scores (PRS) followed by the recommended QRISK2 CVD risk assessment model.

Data: 108,685 participants in UK Biobank aged between 40-69 years, with primary care records, measured biomarkers and genetic data, and without prevalent CVD or statin use. Incident CVD events were defined using Hospital Episode Statistics and death registry records.

Methods: Strategy 1 uses a prioritisation tool using multivariate-linear-mixed models to leverage sporadically recorded longitudinal primary care records followed by the QRISK2 model. The prioritisation tool was derived using sex-specific Cox models with QRISK2 risk factors, using last observed values and mixed model estimates. Strategy 2 uses PRS for the prioritisation tool followed by the QRISK2 model. The prioritisation tool was derived using sex-specific Cox models with the risk factors: age, a PRS for coronary artery disease and a PRS for stroke. Prioritisation tools were applied to 40-54-year olds. Those in the top 50% of predicted risk, as well as all 55-69 year olds, underwent a formal risk assessment using the QRISK2 model based on UK Biobank baseline data. For each strategy we estimated the % reduction in (i) the number needed to screen (NNS) to prevent one CVD event and (ii) identified CVD events, in comparison to carrying out formal CVD assessment on all 40-69-year olds. We applied a recalibration approach to enable performance to reflect that in the general English population.

Results/conclusions: A prioritisation tool based on routinely available data outperforms a tool incorporating PRS. Strategy 1 led to reductions in NNS of 27.3% for women and 26.5% for men, whilst only reducing the percentage of identified CVD events by 0.05% for women and 1.5% for men. Comparatively, strategy 2 reduced the NNS by 26.9% for women and 25.1% for men, and reduced the percentage of identified CVD events by 0.67% for women and 3.3% for men.

Operational characteristics of generalized pairwise comparisons for hierarchically ordered endpoints

ABSTRACT. The method of generalized pairwise comparisons (GPC) is a multivariate extension of the well-known non-parametric Wilcoxon-Mann-Whitney test. It allows comparing two groups of observations based on multiple hierarchically ordered endpoints, regardless of the number or type of the latter. The summary measure, ``net benefit'', quantifies the difference between the probabilities that a random observation from one group is doing better than an observation from the opposite group. The method takes into account the correlations between the endpoints. We have performed a simulation study for the case of two hierarchical endpoints to evaluate the impact of their correlation on the type-I error probability and power of the test based on GPC. The simulations show that the power of the GPC test for the primary endpoint is modified if the secondary endpoint is included in the hierarchical GPC analysis. The change in power depends on the correlation between the endpoints. Interestingly, a decrease in power can occur, regardless of whether there is any marginal treatment effect on the secondary endpoint. It appears that the overall power of the hierarchical GPC procedure depends, in a complex manner, on the entire variance-covariance structure of the set of outcomes. Any additional factors (such as thresholds of clinical relevance, drop out, or censoring scheme) will also affect the power and will have to be taken into account when designing a trial based on the hierarchical GPC procedure.

Landmark Prediction of Survival for HIV-infected Patients by Considering AIDS as an Intermediate Event

ABSTRACT. Background: The overall survival of HIV-infected patients may be shortened by rapid progression to acquired immunodeficiency syndrome (AIDS). So, instead of the occurrence of death as a single event, the progression of disease should be modeled. Here, progression-to-AIDS is considered as an intermediate event. Landmarking prediction method focused on dynamic prediction of survival time given the information of the intermediate events at the moment of prediction. In this study landmark method is used to dynamic prediction of survival of HIV-infected patients, given the history of AIDS status. Methods: The information of 1530 HIV-infected patients, which was related to a registry-based retrospective cohort study were analyzed in this study. A method of landmarking was utilized for dynamic prediction of survival of the patients. Dynamic C-index and time-dependent area under the ROC curve (AUC) were used to evaluate the performance of the model. Results: About 34.4% of patients were diagnosed at the advanced stage of HIV and their CD4 cell count were less than 200 cells/mm3. The majority of HIV infected patients were male (76.5%) and about 47.7% of patients have received ART. The probability of dying within 5-years was lower in females (p=0.027) and patients who their CD4 cell count were lower at the time of diagnosis (p<0.001). The results also revealed that progression to AIDS can increase the hazard of death and shorten the survival time of HIV infected patients (p=0.040). Moreover, the probability of death within the next 5-years was lower in patients who received ART either their disease progressed to AIDS or not (p<0.001). The effect of ART on the survival time were related to the level of CD4 cell count, such that the impact of receiving ART on the hazard of death were lower in patients with lower level of CD4 cell count. The value of the dynamic C-index=0.79 and AUC=0.80 indicate a promising predictive accuracy. Conclusions: As the disease progression could be considered in the landmark model and its performance was promising in analyzing the survival processes of HIV-infected patients, it could be an adequate choice for analyzing such survival data in the future studies.

15:15-16:45 Session IS8: INVITED : Variance modelling for multilevel data and joint models
A general framework and implementation for variance modelling in joint model settings

ABSTRACT. The rise in availability of electronic health record data enables us to answer more detailed clinical questions; however, the associated increased complexity raises substantial statistical and computational challenges. Recent work in the area of joint models has introduced an extended mixed effects framework, encompassing multiple outcomes of any type, each of which could be repeatedly measured (longitudinal), with any number of levels, and with any number of random effects at each level (Crowther, 2020). This allows for sharing and linking between outcome models in an extremely flexibly way, either by linking random effects directly, or the expected value of one outcome (or function of it) within the linear predictor of another. Non-linear and time-dependent effects are also seamlessly incorporated to the linear predictor through the use of splines or fractional polynomials. In this talk, I’ll present an extension to the framework to further allow modelling of variance components directly, allowing each random effect or residual variance to have its own complex linear predictor, such as allowing for heteroskedasticity, which in turn provides new tools for joint modelling. Throughout my talk I will illustrate an accompanying user-friendly implementation in Stata, showing how to build and estimate a joint longitudinal-survival model with complex variance components, quantifying how between-subject variation in the level 1 variance structure of a continuous biomarker (e.g., blood pressure), can be associated with survival. Dynamic predictions from such a joint model will also be derived and presented. Due to the generality of the implementation, multiple outcomes, such as multiple biomarkers or competing risks, are also immediately available.

Crowther MJ. merlin - a unified framework for data analysis and methods development in Stata. Stata Journal 2020;20(4):763-784.

Dynamic structural equation modeling for intensive longitudinal data

ABSTRACT. Objective: Recent methodological innovations have opened up an exciting new horizon of research opportunities. Technological developments such as smartphones and other wearable devices have created new data collection methods such as ambulatory assessments, experience sampling, and ecological momentary assessments. Characteristic of the intensive longitudinal data that are obtained with these methods is that they consist of large numbers of repeated measurements of what individuals are doing, thinking and feeling while living their daily life. As such, these data provide us the opportunity to study individual processes as they unfold over time, and to investigate individual differences therein. But to fully realize this unique potential of intensive longitudinal data, we need new statistical techniques that adequately deal with the specifics of these data, and that can uncover the meaningful patterns hidden in them. This has led to diverse innovations, including the development of dynamic structural equation modeling (DSEM), a new toolbox in the software package Mplus.

Statistical Methods: DSEM forms a combination of: a) time series analysis to model the lagged relations between variables, thereby accounting for the autocorrelation structure within individuals; b) multilevel modeling to ensure the proper decomposition of variance into within-person and between-person components, and to allow for individual differences in means, slopes, and variances; and c) structural equation modeling such that multiple observed variables can be analyzed simultaneously, and can be combined using factor analysis and/or mediation analysis. Additionally, DSEM can account for unequal time-intervals between observations, and for trends and cycles over time. Application: In this talk, I will briefly sketch the general DSEM framework, and then discuss several specific ways in which DSEM can be used to analyze particular data structures and tackle specific research questions. These include data from a design in which a randomized controlled trial is combined with intensive longitudinal measurements before and after the intervention, and data for which there may be a need to account for patters due to cycles such as day-of-the-week patterns or circadian rhythms. These empirical examples illustrate the wide variety of modeling opportunities that are offered by DSEM.

Jointly modelling longitudinal heteroscedasticity and a time-to-event outcome

ABSTRACT. In the clinical literature it has been shown that individuals with higher variability in their blood pressure measurements have a greater risk of cardiovascular disease. This is typically explored by calculating a variability measure, e.g. the standard deviation, from a set of blood pressure measurements per individual, and including this as an explanatory variable in a regression model for the time to the first cardiovascular event. However, this leads to regression dilution bias in the estimated association parameter because the variability measure is subject to measurement error.

We will explore statistical models which allow within-individual variability, as well as the mean, to depend on covariates and/or random effects, e.g. mixed effects location scale models (Hedeker et al, 2008). We propose a joint model with mixed effects location scale and time-to-event sub-models for the longitudinal blood pressure measurements and time to first cardiovascular event respectively (Barrett et al, 2019). The time-to-event sub-model incorporates the random effect associated with the longitudinal within-individual variability, which allows direct estimation of the association between blood pressure variability and the risk of CVD events.

We use simulation studies and data from the Atherosclerosis Risk in Communities (ARIC) study to compare the joint model with the usual method used in the literature and a two-stage method. We demonstrate that substantial bias may be incurred by the usual method and slight to moderate bias with the two-stage method, especially when blood pressure measurements are taken concurrently with the time-to-event follow-up. From the analysis of ARIC study data, the estimated hazard ratio for the association between visit-to-visit systolic blood pressure variability and cardiovascular disease from a joint model with random intercept, slope and variability effects is 1.08 (95% CI 1.04, 1.09) per unit increase in systolic blood pressure standard deviation.

References 1. Hedeker et al (2008). An application of a mixed-effects location scale model for analysis of ecological momentary assessment (EMA) data. Biometrics 64: 627-634. 2. Barrett et al (2019). Estimating the association between blood pressure variability and cardiovascular disease: An application using the ARIC Study. Statistics in Medicine 38:1855-1868.

15:15-16:45 Session OC7A: Causal inference in survival analysis
Marginal structural Cox models in nested case-control studies with time-varying treatments

ABSTRACT. Marginal structural Cox proportional hazards models (Cox-MSMs) are frequently employed in survival analyses in both clinical and epidemiological studies to estimate the causal effects of time-varying treatments [1]. However, the fitting procedures of Cox-MSMs to nested case-control (NCC) sampling data have not been straightforward, which may be due to the difficulties in calculating the inverse probability weights. In this study, we propose a method to fit Cox-MSMs using inverse probability weighting under NCC sampling to obtain consistent estimators of the causal effects of time-varying treatments. Here, we consider an observational (full) cohort where subjects had received a particular treatment of interest or control treatment at each follow-up start date, but treatment changes can occur during follow-up. After NCC sampling, we additionally sample subjects whose treatment statuses had changed at any time point and subjects whose observations were censored before the administrative censoring time to calculate the inverse probability of treatment and censoring weights. We obtain the estimates of causal treatment effects by maximizing the weighted Cox partial likelihood. In our method, we must assemble both treatment and covariate histories only for NCC samples (i.e., cases and controls) and additional samples. The confidence interval of effect estimates is then calculated based on the robust variance estimator for the NCC sampling data. Our method can employ additional matching by baseline covariates and counter matching by treatment history [2]. A simulation study was conducted to evaluate the finite sample performance of the proposed method in the presence of treatment-confounder feedback compared with an ordinary analysis of an NCC study that employs Cox models using the information just before outcome occurrence (ordinary NCC analysis) and a Cox-MSM fitted to a full cohort (MSM-full). The results showed that our proposed method and MSM-full provided negligibly biased estimates of the causal effect of treatment while the ordinary NCC analysis produced biased estimates. When compared with simple random sampling, both additional matching and counter matching improved the efficiency of the proposed method. We also applied the proposed method to a pharmacoepidemiological study that examined the effect of antihyperlipidemics on hyperglycemia incidence using a Japanese medical claims database.

Benefit-based organ allocation in liver transplantation

ABSTRACT. The field of liver transplantation (LT) is dealing with an ever-increasing shortage of donor organs. Since donor livers are scarce, patients are placed on a waiting list (WL). Unfortunately, many cannot be treated on time. To deal with this scarcity, WL patients are prioritized by medical urgency. Although this strategy prevents WL deaths, it also denies livers to (less sick) patients that could have a larger gain in life-years. Thus, donated organs are not distributed optimally. Benefit-based organ allocation is proposed to resolve this ineffective use of deceased-donor livers. Benefit-based allocation not only considers the WL mortality, but it also accounts for the post-operative outcome, which is a critical addition to the current system. More formally, benefit is here estimated as the average difference in restricted mean survival time attributable to LT under a certain allocation strategy. Hence, to estimate the benefit, the following two quantities are contrasted: the post-transplant survival and the survival in a “counterfactual” world where transplantation never happens. Methods from causal inference, in particular inverse probability of censoring weighting, are used to deal with the dependent censoring caused by transplantation. The post-transplant survival model is estimated given the patients’ history until transplantation and the organ condition. The pre-treatment model is estimated via landmark analysis methods that are able to combine the two time axes of this problem: the time since first eligibility for transplantation and the calendar time. The first is relevant due to the time-dependent nature of MELD-score (i.e. one of the strongest predictors), while the latter is important because donor organs become available at specific calendar dates. The effects of the different allocation schemes is quantified by a Monte Carlo simulation. By simulating the influx of both patients on the waiting list and incoming donor livers, it is possible to compare the number of life-years saved for the current allocation scheme and for a benefit-based allocation scheme.

Extending multistate models with g-computation to evaluate the effect of treatment timing

ABSTRACT. The decision when to intervene rather than allow nature to take its course is a critically important question in medicine. To avoid unnecessary treatment, a ‘wait-and-see’ approach is commonly advised where a natural recovery of the patient is awaited before starting treatment. There is however a huge lack of information on the impact of the exact duration of such a delay period on the outcome of patients. In the absence of trials, large observational studies are the best source of data to study the effect of timing of treatment. In this work we develop causal methods that -under certain assumptions- allow evaluating treatment timing using such observational data sources.

We combine multistate modelling with g-computation to target the counterfactual cumulative proportion of recovered patients for different delay periods. The multistate model describes the speed at which patients transition between the disease, treatment and recovery states. It uses Cox proportional hazards models for each of the transitions, employing the clock-reset time scale. In contrast to common use of multistate models that describes actual observed speeds of transitions, we here use the model to evaluate counterfactual outcomes. In particular, we evaluate the effect of different delay periods by predicting the cumulative percentage of recovered patients had -counter to the fact- all unrecovered patients transitioned to treatment at the same wait time. This is done by using g-computation methodology on the transition model from treatment to recovery where the wait time at which patients entered the treatment state is one of the covariates. Uncertainty is quantified by bootstrapping.

We apply the developed methodology to study the effect of delaying the start of intrauterine insemination treatment in a cohort of 1896 couples with unexplained subfertility. We estimate the expected cumulative proportion of pregnant woman 1.5 years after diagnosis for treatment strategies where couples start treatment if not yet pregnant after 0, 3, 6 or 9 months. Our method allows contrasting the expected number of additional pregnancies with the expected number of extra treatments, allowing a well-informed evaluation of different delay strategies.

Causal assessment of surrogacy for time-to-event endpoints using meta-analytic data

ABSTRACT. With the ongoing development of treatment procedure and the increase of survival in oncology, clinical trials based on endpoint such as overall survival can require long follow-up times in order to observe enough events to ensure sufficient statistical power. This increase of the follow-up times can compromise the feasibility of the trial. The use of surrogate endpoints can thus be attractive for these trials. However, in order to yield valid conclusion of the treatment efficacy regarding the true endpoint based on the sole observation of the surrogate the latter must have been previously statistically validated as being a good surrogate. In this work we propose an alternative approach for surrogate validation when both the surrogate and the true endpoint are time-to-event. This approach is based on the causal framework of mediation analysis and is developped for meta-analytic data. It uses a joint regression model for the hazard functions of both endpoints. The mediation analysis enables one to study the decomposition of the total effect of the treatment on the true endpoint into a direct effect and an indirect effect through the surrogate. This decomposition is defined using a mediation formula for counterfactual outcomes. The surrogate can then be deemed validated if the ratio of indirect effect over total effect is sufficiently high. The meta-analytic nature of the data is taken into account in the model by using shared random effects at the individual and trial levels. The indirect effect of the treatment on the true endpoint through the surrogate is allowed as the composition of a direct effect of the treatment on the surrogate and a direct effect of the surrogate on the true endpoint. The estimation of the parameters of this model is carried out through likelihood maximization. The estimators are then used to compute the time-dependent ratio of the indirect over total effect of the treatment on the true endpoint. We designed a simulation study to evualuate the estimators. We applied this method for the assessment of the desease-free survival as a surrogate of the overall survival for adjuvant chemoterapy in the context of resectable gastric cancers.

Estimating treatment effects on survival in an entirely treated cohort: negative controls in longitudinal data

ABSTRACT. Treatments are sometimes introduced for all patients in a particular cohort. When an entire cohort of patients receives a treatment it is not straightforward to estimate its effect (the treatment effect in the treated) because there are no directly comparable untreated patients. Attempts can be made to find a suitable control group, (e.g. historical controls), but underlying differences between the treated and untreated can result in bias. The application that motivates this relates to a disease-modifying treatment in cystic fibrosis, ivacaftor, which has been available for everyone in the UK with a specific CF-causing genetic mutation since 2012. Its impact on several health outcomes has been demonstrated in randomized controlled trials, but it is of interest to understand the causal effect of ivacaftor on survival, which has not been possible to investigate in trials due to short term follow-up.

Observational data provide the opportunity to estimate causal effects of treatments. A strong assumption made in most such investigations is that of positivity, meaning individuals have a probability strictly less than one of receiving the treatment. When the entirely cohort of interest is treated the positivity assumption is violated, meaning that ‘standard’ causal inference methods cannot be applied.

We show how negative control outcomes combined with an extension of difference-in-differences analysis to survival outcomes can be used to assess bias in treatment effect estimates and obtain unbiased estimates under certain assumptions, making use of longitudinal data on treatment use and covariates. Causal diagrams and the potential outcomes framework are used to explain the methods and assumptions. We will discuss the use of different underlying analysis models, including Cox regression and Aalen’s additive hazards model.

The methods are applied to the motivating example using longitudinal data from the UK Cystic Fibrosis Registry. Negative control outcomes observed in patients who do not receive the treatment are used to enable estimation of the causal effect of ivacaftor on survival, including patients eligible for the new treatment but before its availability and patients with an ineligible genotype. The methods described provide a framework for robustly estimating the effects of new disease-modifying treatments on survival.

15:15-16:45 Session OC7B: Clinical trial design and sample size calculation
Analysis and sample size calculation for a survival model conditional on a binary surrogate endpoint

ABSTRACT. The primary endpoint in oncology is usually overall survival, where differences between therapies may only be observable after many years. To avoid withholding of a promising therapy, preliminary approval based on a surrogate endpoint is possible [1]. The approval can be confirmed later by assessing overall survival within the same study. Then, the correlation between surrogate endpoint and overall survival has to be taken into account for sample size calculation and analysis. This relation can be modeled by means of a conditional survival model which was proposed by Xia et al. [2]. They investigated the correlation and assessed power of the logrank test but did not develop methods for statistical testing, parameter estimation, and sample size calculation.

In this talk, a new statistical testing procedure based on the conditional model and Maximum Likelihood estimators for its parameters will be presented. An asymptotic test for survival difference will be given and an approximate sample size formula will be derived. Furthermore, an exact test for survival difference and an algorithm for exact sample size determination will be provided. Type I error rate, power, and required sample size for both newly developed tests will be determined exactly. These characteristics will be compared to those of the logrank test.

It will be shown that for small sample sizes the asymptotic parametric test and the logrank test exceed the significance level. For a given sample size, the power of the asymptotic and the exact parametric test is similar, whereas the power of the logrank test is considerably lower in some situations. The sample size needed to attain a prespecified power is comparable for the asymptotic and the exact parametric test, but considerably higher for the logrank test in some situations.

We conclude that the presented exact test performs well under the assumptions of the conditional model and is a better choice than the asymptotic parametric test or the logrank test, respectively. Furthermore, the talk will give some insights in performing exact calculations for parametric survival time models, thus facilitating the planning, conduct, and analysis of oncology trials with the option of accelerated approval.

Dangers of wrongly assuming linearity in a trial sample size calculation when treatment affects rate of change

ABSTRACT. For certain conditions, such as multiple sclerosis, treatments might aim to lessen deterioration over time. A trial outcome in this setting could be a continuous measure (e.g. Expanded Disability Severity Scale, EDSS) recorded on multiple occasions. Such outcomes could be analysed using a mixed-effects model with random slopes and intercepts for participants, with the treatment effect being the difference between the slopes over time in the treated and placebo groups. Thus, an effective treatment will slow the mean rate of deterioration compared to participants receiving control. A sample size can be obtained by estimating the necessary variances and covariances from a pre-existing dataset, e.g. an observational study conducted in a similar setting. These estimates can be used in standard sample-size formulae for mixed-effects models. A random slopes model assumes trajectories are linear, and therefore any treatment effect increases linearly over time, but this may not be the case. Our simulation study assesses what effect a non-linear trajectory and a treatment effect not proportional to time have on the proposed trial’s nominal power. Simulations were based on EDSS data from the MS-STAT trial[1]. We used four trajectories – steady decline (i.e. linear trajectory), early and late decline (based on exponential curves), and intermediate decline (inverse logit) – and simulated 5000 observational studies of 1000 people each, to calculate trial sample sizes. Trials of the appropriate size were then simulated, with each of the following treatment effects: none (to assess Type I error), proportional to time, non-proportional. Preliminary results suggest that, provided the treatment effect is proportional to time, powers are generally close to nominal regardless of whether trajectories are linear or not. However, when the treatment effect is not proportional to time, its estimate from a random slopes model can be badly biased, leading to powers far from nominal levels. These errors can be exacerbated when the length of the proposed trial differs from that of the observational study used to plan the sample size. In such cases, it might be more appropriate to use a model that allows trajectories to vary freely over time instead of the random slopes model.

Estimating sample size for biomarker-strategy designs with survival endpoints

ABSTRACT. In response to the rapid growth of precision medicine and the number of molecules entering the drug development pipeline, several study designs including the biomarker-strategy design (BSD) have been proposed. Contrary to traditional designs, the emphasis here is on the comparison of treatment strategies and not on the treatment molecules as such. Patients are assigned to either a biomarker-based strategy (BBS) arm where biomarker-positive patients receive an experimental treatment or a non-biomarker-based strategy (NBBS) arm where patients receive a treatment regardless of their biomarker status.

We examined several designs of BSD according to the biomarker assessment and the treatment received in NBBS arm and used frailty survival models to analyse them. Depending on the limits and specificity of each, we proposed statistical models that best described each design. We thus developed a partially clustered frailty model (PCFM) for the (standard) case where the biomarker status is only known in BBS arm. The PCFM allows us to account for the complex structure of BSD that may consider clustering only in one arm. In addition, we proposed an approach to calculate sample size for survival data relying on PCFM. We also proposed statistical tests to measure the overall strategy effect as well as the biomarker-by-strategy interaction effect.

We conducted extensive simulations to assess, for each design, the robustness and performances of the different statistical models. We also performed power analysis and sample size estimation to compare the performance of PCFM to more traditional frailty models, and provided guidelines on the use of BSD in survival analysis. We also performed power analyses to compare the performance of PCFM to more traditional frailty models.

Multidimensional Go/No-Go decision rules

ABSTRACT. Within clinical drug development, early phase trials are an integral component of the development plan. They constitute the generation of data that provide information on key elements such as proof of principle, proof of concept and dose selection prior to the commencement of confirmatory trials. Therefore, the focus of early phase trials should be on generating data that provides meaningful information regarding critical decision making in particular regarding whether to conduct the confirmatory phase. A strict control of the type I error in a confirmatory sense is usually not needed for this purpose. Instead go/no-go decision criteria are needed that ensure certain probabilities for correct decision making. A statistical go/no-go is a pre-defined, quantifiable criterion to allow for decision making given the observed results in the trial. In order to improve the probability for a correct conclusion, it is often desirable to base the decision on more than one endpoint.

Two possible methods for deriving a decision criterion when having multiple endpoints are the intersection and the union of the individual go/no-go areas for the separate endpoints. For the intersection area, a go is achieved by all endpoints passing their individual go-criterion while for the union area, it is sufficient if at least one endpoint has crossed the individual go-criterion. While both methods are easy to implement, they have also some disadvantages which may lead to non-intuitive decision making. We investigate different ways of defining the go-, no-go- and consider-region based on more than one endpoint. The different approaches are compared to each other regarding their properties. Results are illustrated using a real data example.

Online control of the False Discovery Rate in platform trials

ABSTRACT. When testing multiple hypotheses, the control of a suitable error rate is desirable even in exploratory trials. For such applications, e.g., the control of the False Discovery Rate (FDR) has been proposed. The FDR is defined as the expected proportion of false positive rejections among all rejections. Conventional methods to control the FDR, e.g., the Benjamini-Hochberg procedure, assume that all p-values are available at the time point of test decision. In perpetual platform trials however, treatment arms can enter and leave the trial at any time during its conduct. Therefore, the number of hypotheses is not fixed in advance and the hypotheses are not tested at once, but sequentially. Recently, the concept of online control of the FDR was introduced (Javanmard and Montanari, 2018), where hypothesis tests and test decisions can be performed in a sequential manner and there application to platform trials has been proposed (Robertson and Wason, 2018).

We investigate different procedures proposed by, e.g., Javanmard and Montanari, 2018, to control the online FDR in the setting of platform trials. This includes methods distributing the alpha level unequally among the hypotheses of interest and increasing the local significance level in case of recent discoveries. The results depend sensitively on prior distributions of effects sizes, e.g., whether true alternatives are uniformly distributed over time or not. We optimize design parameters depending on prior distributions to maximize operating characteristics such as the overall power, which is the proportion of rejected alternatives among all alternatives. Furthermore, we investigate the impact on error rates by including both concurrent and non-concurrent control data. By including the latter the power can be increased, but the control of the FDR may be negatively affected in case of time trends. Finally, we show how the procedures can be extended to allow for interim analyses with the option of early stopping for individual hypotheses.

Javanmard, A, and Montanari, A (2018). Online Rules for Control of False Discovery Rate and False Discovery Exceedance. Annals of Statistics, 46(2): 526-554.

Robertson, DS, and Wason, JMS (2018) Online control of the false discovery rate in biomedical research. arXiv preprint (

15:15-16:45 Session OC7C: frailty model and recurrent events
A non-mixture cure model with frailty correction for inaccurate background mortality to estimate time-to-cure.

ABSTRACT. Context: Cure models are used in population-based studies to estimate net survival and its asymptotic value, the cure fraction. Net survival, the survival that would be observed if the studied disease (e.g. cancer) were the only possible cause of death, is estimated by splitting the observed mortality into two forces: one due to cancer (excess mortality) and one due to other causes (expected mortality). Usually, the latter is drawn from the general population life tables but this assumption may not hold. Objectives: To propose a non-mixture cure model accounting for inaccuracy of general population life tables as other causes mortality tables. Methods: Boussari et al. previously proposed a non-mixture cure model which allows direct estimation of the time-to-cure as the time from which the excess mortality becomes null. In the present work, we allow the cancer patients’ expected mortality to be different from the general population observed mortality. To account for this non-comparability bias, we proposed two models assuming that the expected mortality equals the population mortality multiplied by either a constant parameter or a frailty term, with estimations based on maximum likelihood method.We assessed the performance of the three models in a simulation study designed to mimic real data. The simulation scenarii were built by varying the cure fraction (low, medium and high), the time-to-cure (early, late), the sample size, the censoring process as well as the magnitude of non-comparability effect (null, moderate, severe) and its variability (null, narrow, wide). We applied these three models to colon cancer data from French cancer registries (FRANCIM). Results: In the simulation study, the three models performed equally when the comparability assumption held. In presence of heterogeneous non-comparability effect, the frailty rescaled cure model outperformed the constant rescaled cure model while meaningful biases were observed with Boussari model. In the application, the frailty cure model provided the lowest AIC and BIC and the non-comparability was meaningful in the two new cure models, for the colon cancer data. Conclusions: We recommend the proposed models for the estimation of net survival, cure proportion and time-to-cure. An R package implementing these models will be available soon.

A family of discrete random effect distributions for modelling bivariate time-to-event data

ABSTRACT. Random effect models for time-to-event-data, also known as frailty models, provide a conceptually simple and appealing way of modelling individual heterogeneities resulting from factors which may be difficult or impossible to measure; an example is heterogeneity induced by genetics or through environmental exposure.

The frailty is usually assumed to have a continuous distribution. In some areas of application, however, the frailty distribution might have considerable discrete impacts, such as the (unobserved) number of sexual partners for sexually transmitted diseases. In others, the true frailty distribution might be multimodal and the most crucial differences in time-to-event data might be explained by the clusters' distance to the different high-density areas of the frailty distribution. In such scenarios, a discrete frailty model might be more appropriate for capturing important differences in individual heterogeneities.

Hence, we model a family of discrete frailty distributions, originally introduced by Farrington et al. (2012), and provide implementations and applications for both bivariate current-status and right-censored, possibly left-truncated data.We suggest an interpretation of the discrete frailties as being ordered latent risk categories. As we often encountered, only a few risk categories representing most of the probability mass, this facilitates the interpretation of the proposed model. Furthermore, our estimation algorithm is designed such that the frailty distribution and the distribution of the conditional survival time is chosen by the data among a set of distributions.

We analyze clustered time-to-event data of diabetic patients at 'high-risk' of retinopathy with one eye being randomly assigned to laser treatment. We assume that the frailty distribution is stratified by type I and type II diabetes. Preliminary data analysis suggests, that type II patients have higher risk of loss of vision than type I diabetic patients at each distinct risk-category at any time. This is balanced, however, by more type II diabetic patients being in the lowest risk-category.

References: Farrington C. P., Unkel S. and Anaya-Izquierdo K. (2012): The relative frailty variance and shared frailty models, Journal of the Royal Statistical Society Series B, Vol. 74, pp. 673-696.

A general approach to fit flexible hazard regression models with multiple random effects

ABSTRACT. Excess mortality hazard regression models are now widely used when the cause of death is unknown or unreliable, especially in the context of population-based cancer registry data. Given the hierarchical structure of such data (patients nested within geographical areas, hospitals, etc.), the assumption of independence between survival times, on which standard survival modelling procedures rely, might not be adequate. We recently proposed an R package, named mexhaz, to fit flexible excess mortality hazard models accounting for the unobserved between-cluster heterogeneity by including a normally distributed random intercept defined at the cluster level. However, it might sometimes be necessary to also allow for between-cluster variations in the effect of one or several covariates. The objective of this work is thus to present an extension of our model that allows the user to fit flexible (excess) hazard regression models with multiple Gaussian random effects. The logarithm of the baseline hazard is modelled by a B-spline of time, and non-linear and time-dependent effects of covariables can be included. Parameter estimation is based on a likelihood maximisation procedure implemented in the R software. Multivariate adaptive Gauss-Hermite quadrature is used to approximate the cluster-specific marginal likelihood. The performance of the approach is demonstrated through a simulation study assessing the impact of different scenarios (varying the number of clusters, the cluster size, the variances of the random effects and their correlation) on parameter estimation. We then illustrate the use of the appproach on population-based cancer registry data by analysing the between-registry variation of the effect of covariates such as gender and year of diagnosis on survival from cancer.

Handling recurrent events in the context of clinical trials

ABSTRACT. Introduction: In clinical trials, the effectiveness of a product can be assessed by the decrease of harmful events for the patient. While the primary endpoint is frequently limited to the first event, the literature shows that it is more precise to consider all events, either by counting them, or by modeling recurrent events with methods extending traditional survival approaches.

Methods: A review of methods to describe or model recurrent events in clinical research was carried out. The recommendations of regulatory health authorities and guidelines (ICH, EFSA, Consort) have also been revised regarding the analysis of adverse events. For this work, we have retained four generalized linear models for the analysis of the number of adverse events (Poisson regression, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial) and four models for the analysis of adverse events as recurrent events (Andersen-Gill, Prentice William Peterson, marginal and frailty models). These approaches were implemented with SAS software and evaluated on data from three clinical trials where adverse events corresponded to primary efficacy or safety objectives.

Results: The four generalized linear models studied for the analysis of the number of adverse events differed by handling overdispersion in the data. On our applications, the negative binomial distribution was the most appropriate distribution, correctly describing the heterogeneity of the data, but this could not be improved with zero-inflated models. In the literature, simulation studies showed that methods for recurrent events make the best use of all the data available and improve the precision of the analyzes compared to the methodology considering only the first event. On our applications, the Andersen-Gill and frailty models were the most relevant methods and gave similar results in terms of significance and goodness of fit.

Conclusion: The choice of the method for analyzing adverse events depends on the objectives of the study, the relationship between events over time and the heterogeneity between individuals. This literature review and analyses on real data have resulted in a decision tree that will be applied for the statistical analyzes of our next clinical trials with recurrent events.

Retro-prospective modelling of recurrent events

ABSTRACT. Risk factors analysis of recurrent events is a research topic frequently encountered in biomedical domains. When the event of interest occurs before the inclusion in the study, the information is generally not used in statistical analyses aimed at identifying risk factors for a recurrent event. Indeed, to avoid selection bias, only post-inclusion information is used, although the integration of pre-inclusion data could increase the statistical power when the explanatory variables are non-time-dependent. We here propose a weighted survival model allowing to study both prospective and retrospective events. This work is motivated by the analysis of venous thromboembolism (VTE) recurrence risk factors in the MARTHA cohort. The MARTHA study was initially designed to investigate VTE risk factors. From 1994 to 2010, 1,473 patients were recruited at La Timone Hospital in Marseille. Among them, 402 had already had several VTE. Between 2013 and 2018, 780 patients were contacted to gather information on post-inclusion VTE, which led to the identification of 152 recurrences. Our total sample consisted of 1,473 individuals including 554 recurrences. The studied variables were gender, VTE family history and age at first VTE (before or after 50). The association between these variables and risk of VTE recurrence was modelled through a weighted Cox model. The weights were calculated from the inverse of the survival probability up to the date of recurrence data collection using a delayed-entry Cox model applied to the available death event. In the prospective analysis of 780 subjects including 152 recurrences, male gender (HR=1.8±0.18,p=1.10-3) and presence of VTE family history (HR=1.6±0.17,p=8.10-3) were significantly associated with an increased risk of recurrence, but not age (HR=1.1±0.18,p=0.76). The retro-prospective analysis allowed to refine these observations. The regression coefficients obtained were consistent and the standard deviations almost divided by 2: HR=1.7±0.09 (p=2.10-9), HR=1.2±0.09 (p=0.04) and HR=1.3±0.10 (p=0.02), respectively. The proposed methodology enables to optimize the analysis of non-time-dependent risk factors of a recurrent event by integrating both pre- and post-inclusion data. This modelling has an immediate field of application in the context of genetic association studies on the risk of a disease’s recurrence where studied DNA polymorphisms are fixed at birth.

15:15-16:45 Session OC7D: functional data analysis
Trajectory clustering using mixed classification models

ABSTRACT. Trajectory classification has become frequent in clinical research to understand the heterogeneity of individual trajectories. The standard classification model for trajectories assumes no between-individual variance within groups. However, this assumption is often not appropriate, and so the error variance of the model may be overestimated, leading to a biased classification. Hence, two extensions of the standard classification model were developed through a mixed model. A first one considers an equal between-individual variance across groups, and a second one considers unequal between-individual variance. Simulations were performed to evaluate the impact of these considerations on the classification. The simulation results showed that the first extended model gives a lower misclassification percentage (with differences up to 50%) than the standard one in case of a true variance between individuals inside groups. The second model decreases the misclassification percentage compared with the first one (up to 11%) when the between-individual variance is unequal between groups. However, these two extensions require higher number of repeated measurements to be adjusted correctly. Using human chorionic gonadotropin trajectories after a curettage for hydatidiform mole, the standard classification model mainly classified trajectories according to their level whereas the two extended models classified them according to their pattern, leading to more clinically relevant groups. In conclusion, for studies with a non-negligible number of repeated measurements, the use, in first instance, of a classification model that considers equal between-individual variance across groups rather than a standard classification model, appears more appropriate. A model that considers unequal between-individual variance may find its place thereafter.

Bayesian concurrent functional regression for sensor data

ABSTRACT. Introduction. Functional data analysis (FDA) methods have recently been developed to analyse several variables measured repeatedly and concurrently over a domain such as time in a cohort of individuals. However, many FDA methods require data to be measured regularly, with data being collected at the same fixed times for all individuals. Often, with studies in humans, there tend to be missing data. Some studies focus on sparse but regular data and some focus on dense but irregular data. Of those who focus on both sparse and irregular data, only a few have readily available software to implement their methods and they use only complete case data for modelling, and hence some information is lost. We have developed a Bayesian model for function-on-function regression in the situation of sparse and irregular data which uses all the data for modelling and easily obtain inferences.

Methods A simulation study was performed to compare the Bayesian model with other methods available in software to determine which performs best when there are sparse and irregular data. Four functions for the parameter were considered (linear, exponential, fifth order polynomial and sinusoidal). Missingness was induced in four ways, 10%, 20% and 40% missing at random as well as missing chunk of data. Three sample sizes were considered, 50, 100 and 250. Number of observations per individual considered were, 50,100, 300 and 1000. Bayesian model was then applied to concurrently measured glucose (every 5 minutes for 1 week) and electrocardiogram (ECG) data (every 10 minutes for 1 week) in a cohort of n = 17 type 1 diabetics. All models were fitted using R v 3.5.


The Bayesian model is competitive with other models particularly in complex (fifth order polynomial and sinusoidal functions) and irregular data. Its performance drops in the linear and exponential functions. It was found the Bayesian model is robust to missingness compared to other models.


For irregular sensor data with missingness we recommend the use of this Bayesian functional regression model.

ERNEST: A Representation Learning-based Cross-Subject EEG Channel Selection Algorithm

ABSTRACT. BACKGROUND: EEG is a non-invasive powerful technology that finds applications in several research areas. Currently, most EEG systems require subjects to wear several electrodes (channels) on the scalp. However, this induces longer clinical trial set-up times, hinders EEG-based assistive technologies’ portability, and penalizes any Statistical or Machine Learning effort to decode EEG recordings by adding noisy and redundant information and rising computational complexity. One way to increase the signal-to-noise ratio and aid classification is combining Channel Selection (CS) with feature extraction. However, EEG signals’ strong inter-subject variability led most of the existing literature to focus on subject-specific CS.

AIMS: We propose ERNEST, EEG EmbeddeR aNd channEl SelecTor, a novel method for cross-subject CS.

METHODS: Irrespectively of the task, each statistical unit is represented by a trial (recording session) matrix, where each row is a high-dimensional signal from one of the channels, associated to a binary target variable. ERNEST is designed to learn channel-specific 1-Dimensional Convolutional Neural Networks (1D-CNN) to embed signals grouped by electrode in a latent space of small dimensionality that maximizes intra-class separability. Then, it builds a unique representation of each trial by concatenating the channels’ embedding into a trial vector from which it selects the most relevant channels to perform classification. For CS, ERNEST exploits a Deep Learning-based Feature Selection algorithm readapted from [1]. After training, ERNEST transfers the parametrized subgroup of selected channel-specific 1D-CNNs to embed new signals, obtaining trial vectors of small dimensionality and high predictive power that can be fed to any classifier.

EXPERIMENTS: Cross-subjectivity can be group-dependent (testing on new trials of the same group of subjects used for CS) or fully subject-agnostic (testing on new subjects). We tested ERNEST in both frameworks on an Event-Related Potential (ERP) experiment discriminating between signals recording reactions to visual stimuli, and a patient classification task separating alcoholics from controls.

RESULTS: Results are promising, especially compared to the limited literature on cross-subject CS. For instance, in fully subject-agnostic patients’ classification ERNEST yields 0.858±0.018 AUROC with 5 channels out of 64, whereas selecting 10 channels for ERP achieves 0.951±0.012 and 0.724±0.024 AUROC in group-dependent and subject-agnostic frameworks respectively.

Linear statistical models and ridge regression used in shape index calculation on human face

ABSTRACT. Spatial interpolation and smoothing is usually done for one surface. In our case, we have random samples of such surfaces represented by human faces captured by stereo-photogrammetry and characterised by about 150,000 points. These points are triangulated by about 300,000 triangles. The number of points is extremely high for the purpose of statistical analyses, therefore the 3D coordinates of (semi)landmarks on curves or surface patches sufficiently characterising the shape have to be automatically identified and this simplified model comprising about 1000 points is then used in further statistical modelling in functional data analysis setting. The identification of (semi)landmarks is a complex process during which B-splines, P-splines and thin-plate splines are used together with the measures of local surface topology, including principal curvatures and shape index.

Shape index is calculated in R using different linear regression models and ridge regression model (allowing more flexibility for regression coefficients) of z coordinates on x and y coordinates, i.e. quadratic with interaction without/with intercept, cubic with interaction of x and y without/with intercept (without/with other interactions), and similar models of higher order. The estimates of regression coefficients related to the quadratic terms and their interaction are elements of Weingarten matrix from which the principal curvatures are calculated. These models are applied on sufficiently large neighbourhood of all points in local 3D coordinate system.

Since the measures of local surface topology represent principal guide in estimating locations of ridge and valley curves across the face, we aim to compare different regression models used in shape index calculation on faces of patients with facial palsy and healthy controls. We suggest to use quadratic or cubic linear regression model or ridge regression model with interaction of the first order without intercept.

Acknowledgment: The work was supported (partly) by RVO:67985807 and MUNI/A/1615/2020.

References: Vittert L, AW Bowman, S Katina. A hierarchical curve-based approach to the analysis of manifold data. The Annals of Applied Statistics 13,4 (2019): 2539–2563. Katina S, et al. The definitions of three-dimensional landmarks on the human face: an interdisciplinary view. Journal of anatomy 228.3 (2016): 355–365.

Unsupervised classification of ECG signals via FDA to look for different patterns of variation among patients

ABSTRACT. Research on electrocardiogram records (ECG) often aims on supervised classification. Based on recorded signals, the main purpose is to create a classifier to recognize healthy people and new patients affected by specific heart disease. The reason is that the dataset on heart disease is often composed of records of heart signals over time and the labels of groups for each patient. Generally, the latter is a binary variable such as "Disease" and "Healthy": however, it's also possible to deal with more than two possible modalities of the grouping variable. Therefore, this kind of dataset pushes researchers to use this information and deal with supervised classification problems rather than unsupervised classification. Thus, rarely, scholars worry about finding specific patterns in ECG signals regardless of the known labels. This study's basic idea is to exploit functional data analysis (FDA) and unsupervised classification (i.e. clustering) to look for additional information in the data. The main aim of using these two approaches is to find patterns and different types of variability between the know groups. The starting point is to treat ECG signals using FDA. Because the former can be seen as a function in the time domain, the most intuitive approach is to treat these curves as functions of time as single objects. In the case of a binary outcome, having healthy and non-healthy people's curves, we can calculate the functional mean within each group. Nevertheless, this approach would not provide us with exhaustive information. Thus, we propose to use functional clustering to detect specific patterns within known groups. Clearly, in FDA, different metrics and approaches can be used for unsupervised classification. This study focuses on the functional k-means based on the functional principal components' semi-metric (FPCs). We present the results of this approach applied on a dataset composed of 200 patients recorded during one heartbeat. The dataset distinguishes patients with a regular heartbeat and people with a diagnosis of Myocardial Infarction. Our approach shows that within the “disease” group, we can highlight different patterns of non-healthy people and thus different type of Myocardial Infarction subclasses.

15:15-16:45 Session OC7E: Selection and validation of prediction models
To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets

ABSTRACT. For finite samples with binary outcomes penalized logistic regression such as ridge logistic regression (RR) has the potential of achieving smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that RR is sensitive to small or sparse data situations, yielding poor performance in individual datasets. In such low-dimensional settings Firth’s correction (FC) may be preferable. Motivated by an endometrial cancer study relating histological grading to three risk factors we demonstrate that the results provided by RR strongly depend on the choice of complexity parameter. However, estimating this parameter from the data by minimizing some measure of the out-of-sample prediction error or information criterion may result in arbitrary optimized values. We elaborate this issue further by performing a comprehensive simulation study, investigating the performance of RR in terms of coefficients and predictions and compare it to FC. In addition to RR where complexity parameter is estimated from the data, we also consider pre-specifying the degree of shrinkage according to some meaningful prior assumptions about true effects. Our benchmark is defined by oracle models that show what the best possible performance of RR could be if the true underlying data generating mechanism was known. We show that complexity parameter values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability which translates into large MSE of coefficients and large variability of calibration slopes. Therefore, applying tuned RR in such settings is problematic. In contrast, if the degree of shrinkage is pre-specified, accurate coefficients and predictions can be obtained even in non-ideal settings such as encountered in the context of rare outcomes or sparse predictors.

Propensity-based standardization to enhance the interpretation of prediction model discrimination

ABSTRACT. Interpreting model discrimination estimates in external validation studies is often challenging, because differences in discrimination may arise from invalid model coefficients and differences in (the distribution of) population characteristics. Hence, it may be unclear whether a developed prediction model may benefit from local revisions. This is particularly relevant in pooled data sets, where prediction models can be externally validated in multiple data sets.

We aimed to disentangle the effects of differences in case-mix and invalid regression coefficients, to allow for the identification of reproducibility of model predictions (and predictor effects) in a target population.

We propose propensity-weighted measures of the concordance statistic that are standardized for case-mix differences between the development and validation samples. The propensity scores are derived with (multinomial) membership models that predict the originating samples of observations.

We illustrate our methods in an example on the validation of eight diagnostic prediction models for the detection of deep vein thrombosis (DVT) that may aid in the diagnosis of patients suspected of DVT in 12 external validation data sets. We assess our methods in a simulation.

In the illustrative example, summary estimates of the meta-analysis of discrimination performance were not greatly affected by standardization. However, standardization substantially reduced the between-study heterogeneity of concordance, which indicated that variation in model concordance could partially be attributed to differences in case-mix, rather than invalid coefficients. In the simulation, only the propensity-score method estimated with splines produced unbiased estimates of concordance in the target population, whereas an unweighted approach and a propensity method with linear terms did not.

Propensity score-based standardization may facilitate the interpretation of (heterogeneity in) prediction model performance across external validation studies, thereby guiding model updating strategies. These propensity score models should allow for non-linear effects to capture differences in sample variation.

A novel score for prognostic index assessment with event-free survival outcome

ABSTRACT. In many clinical settings (especially, in cancer studies), a challenging goal consists in the definition of a prognostic index able to subdivide patients into groups with different event-free survival curve, on the basis of a subset of clinical variables. In order to be used in the practice, the identified classification must represent “clinically relevant” groups so that physicians may assign patients to different treatments or managements, depending on the level of prognosis. Thus, a good prognostic classification should satisfy the following properties: (1) the groups must correspond to “well separated” event-free survival curves, (2) the order of the prognostic levels must be retained in all cohort of the same clinical setting, (3) the groups must be reliable/robust in terms of size, (4) the classification should be characterized by a good survival prediction accuracy. Commonly applied scores for the assessment of a prognostic index (e.g. the Brier score [1] or the c-index [2]) actually evaluate only one or two of these characteristics. In order to have a more comprehensive evaluation of a prognostic index, we defined a new measure of separation (called Expected SEParation, ESEP) which represents, from the theoretical point of view, the expected difference between the survival times of any two patients, given that they belong to groups with “consecutive” levels of prognosis. From its definition, ESEP is able to evaluate the properties (1-3). Hence, for a complete assessment of a prognostic index, ESEP must be used together with an error measure of survival prediction (such as the Brier score). From the estimation point of view, ESEP relies on the estimation of the restricted mean survival time, whose usage is suggested even in case of non-proportional hazards assumption. Overall, ESEP outperformed many other scores proposed in the literature on a wide range of simulated data, being able to better identify wrong prognostic classifications, even in case of small sample size and/or high percentage of censored data. The same behaviour was also observed when comparing different prognostic classifications on public real data, such as the ones of the German Breast Cancer Study Group 2.

Minimum sample size for external validation of a clinical prediction model with a binary or time-to-event outcome

ABSTRACT. Background: In prediction model research, external validation is needed to examine an existing model’s performance using data independent to that for model development. Current external validation studies often suffer from small sample sizes and consequently imprecise predictive performance estimates.

Aims and methods: In this talk, we propose how to determine the minimum sample size needed for a new external validation study of a prediction model for a binary or time-to-event outcome. Our calculations aim to precisely estimate calibration (Observed/Expected and calibration slope), discrimination (C-statistic) and clinical utility (net benefit). For each measure, we propose closed-form and iterative solutions for calculating the minimum sample size required. These require specifying: (i) target standard errors (confidence interval widths) for each estimate of interest, (ii) the anticipated outcome event risk in the validation population, (iii) the prediction model’s anticipated (mis)calibration and distribution of linear predictor values in the validation population, and (iv) potential risk thresholds for clinical decision making. Software code is discussed. We show how to derive the anticipated linear predictor distribution based on the C statistic or D statistic reported in the model development study.

Results: We illustrate our proposal for external validation of a prediction model for mechanical heart valve failure with an expected outcome event risk of 0.018. Calculations suggest at least 9835 participants (177 events) are required to precisely estimate the calibration and discrimination measures, with this number driven by the calibration slope criterion, which we anticipate will often be the case. Also, 3554 participants (64 events) are required to precisely estimate net benefit at a risk threshold of 8%. Lastly, we discuss the importance of also plotting simulated calibration curves for the sample size identified, to visually assess whether confidence interval bands are suitable in key ranges of predicted risks relevant to clinical decision making.

Conclusions: Our approach allows researchers to gauge the sample size required when designing a study for validating a model’s predictive performance in new data. The calculations can also inform whether the sample size of an existing (already collected) dataset is adequate for external validation.

15:15-16:45 Session OC7F: flexible modelling and spatial data analysis
Conditional and unconditional logistic regression analysis of matched case-control studies of recurrent events

ABSTRACT. Case-control studies can estimate incidence rate ratio in fixed or dynamic populations by using incidence density sampling, which is also called concurrent sampling. This involves matching cases and controls according to the time of events. We review the arguments for and against the use of conditional and unconditional logistic regression models for the analysis of individually matched case-control studies. While unconditional logistic regression analysis that adjusts for case-control matched sets as indicator variables generates a bias, the method used in parsimonious formulations does not. While both conditional and properly formulated unconditional logistic regression models have similar properties in the studies of outcome events that are non-repeatable, currently there is no valid method for statistical inference based on conditional logistic regression in the studies of recurrent events such as episodes of pneumonia or hospital readmissions. In this context, we propose to apply unconditional logistic regression with adjustment for time in quintiles and residual times within each quintile (9 degrees of freedom), with a robust standard error to allow for clustering of observations within persons. The methods are evaluated in simulations in realistic scenarios that resemble the studies of pneumonia and malaria. The results show that in case-control studies of non-repeatable events or first events using incidence density sampling, the proposed method and conditional logistic regression analysis give highly comparable results in terms of relative bias, root mean square error and coverage probability. However, only the proposed method provides correct statistical inference in the studies of recurrent events. We use data from a study of pneumonia and a study of malaria to illustrate.

On the implications of influential points for the selection and reproducibility of MFP models

ABSTRACT. The number of covariates is often too large and a more parsimonious model is needed. The multivariable fractional polynomial (MFP) combines variable selection using backward elimination and function selection for continuous variables using fractional polynomial (FP) functions, thus a nominal significance level is required for each part ( A function selection procedure (FSP) compares the best fitting FP2 function in three steps with exclusion of the variable, the linear function and the best FP1 function. The latter tests are only conducted if earlier tests are significant.

Regression diagnostics, such as detection of influential points (IP) and plots of residuals may reveal lack of fit or other peculiarities of a selected model. Single observations may be responsible for the selection of a linear, a monotonic non-linear FP1 or even one of the non-monotonic FP2 functions.

Based on real data, Royston and Sauerbrei (2008) designed the ‘ART’ study, an artificial data set with six continuous and four categorical variables. They identified IPs using leave-one-out and illustrated that if the sample size is too small MFP would have low power to detect non-linear effects and the selected model may differ substantially from the true model.

Using ART, we extended the investigation of IPs and their effect on selected univariate functions and the multivariable model. We propose diagnostic plots for the combination of two or more points and approaches to identify influential points in multivariable analysis. Using non-overlapping parts from the dataset, we consider model reproducibility and investigate the effect of sample size on MFP models.

IPs can have a severe effect on FP functions, specifically in small datasets. Regression diagnostics can be used to identify IPs and their effect can be downweighed. Too small sample sizes often result in wrongly postulating linear effects with further implications for the selected multivariable model. If the sample size is large and regression diagnostics are carefully done, MFP is a suitable approach to identify variables with a stronger effect and suitable functional forms for continuous variables.

For better illustration we use a structured profile which provides an overview of all analyses conducted.

Worldwide age specific Human Papillomavirus prevalence patterns: a mixed effects binomial method to cluster trajectories

ABSTRACT. Introduction To cluster genital HPV age-prevalence trajectories from locations around the world, a clustering methodology taking into account heterogeneity is developed, compared by simulations with a methodology without heterogeneity and then applied to 29 locations worldwide. Material and method A clustering algorithm is created based on the binomial likelihood, with random effects in the age-prevalence trajectories. A clustering version of the EM algorithm is proposed to maximize the loglikelihood of the model. Applying different number of groups, the best model is selected according to classification criteria and clinical relevance. Simulations were performed to compare fixed and mixed effect classification models in terms of proportion of well classified trajectories, with different scenarios in terms of number of measurements at each time and variance of random effects. The models with (mixed) or without (fixed) heterogeneity are applied on genital HPV age-prevalence trajectories data for 29 locations worldwide. Results In the simulations, the proportions of mean trajectories well classified were systematically higher with the mixed effect model, from 30 to 50% compared with the fixed effect model. However, the validity of the classification obtained with the mixed effect model depends on the number of measurements at each time, decreasing with a decrease in measurements. For the application part, the fixed effect model selected a model with 4 groups that corresponded to translations of the 2 groups obtained with the mixed effect model. Features of the populations (sexual behaviors and demographic data) by groups are represented for both selected models. Discussion There is always heterogeneity in biological data and taking it into account can modify the clusters obtained. One limit of the mixed effect classification model is the high number of measurements required at each time. Fixed effect classification model classifies trajectories according to their level and shape whereas mixed effect model leads to classifications based mainly on the shape. The two models have different and complementary information, one may help identifying sexual behaviors linked with the overall intensity of HPV prevalence, and the other one factors linked with the evolution of HPV over age, allowing to target health actions on specific population subgroups.

Treating ordinal outcomes as continuous quantities: when, why and how.

ABSTRACT. Ordinal variables, such as quality of life scores and patient-reported outcomes, are common in studies of population health and patient care. Analysing such outcomes often utilizes the linear regression model to estimate the effect of an exposure or intervention of interest. The magnitude of the effect is quantified by the difference in mean ordinal scores of the two (or more) groups being compared, and this quantity is useful for the assessment of clinical significance. However, this approach may be inappropriate as it assumes the ordinal outcome is a proxy for the continuous scale but does not assess this assumption.

The cumulative link model, which is an appropriate model for assessing the effect of an exposure on an ordinal outcome, is less well known and not widely used. Here we propose a new procedure using this model to assess the proxy assumption and to estimate the difference in mean ordinal scores when appropriate. As an illustration, the procedure is applied to five subscales of fatigue measured using the Multidimensional Fatigue Inventory to investigate the effect of time since diagnosis on fatigue among breast cancer survivors in Singapore.

A statistically significant improvement over time since cancer diagnosis was found in the General Fatigue and Mental Fatigue scores, but only General Fatigue satisfied the proxy assumption. As such, we can only draw conclusions on the magnitude of change in General Fatigue score, which is expected to be 1-unit for every 6.5 additional years since diagnosis and clinical significance (i.e., a 2-unit difference) achieved at the 13-th year.

The procedure offers a seamless way to assess both statistical and clinical significance of an effect on ordinal outcomes when the proxy assumption is appropriate. Where the assumption is not appropriate, only the statistical significance should be reported.

Health map for HealthGap: Defining a geographical catchment to examine cardiovascular risk in Australia

ABSTRACT. Linked administrative data provide opportunities to understand how people navigate and receive care within the health care system. The HealthGap pilot is a population-based cohort study linking general practice data to three major tertiary hospitals in Melbourne, Australia. HealthGap aims to understand health inequities in cardiovascular risk and treatment between Indigenous and non-Indigenous Australians and to recalibrate cardiovascular risk models for Indigenous people. As available data covered a only subset of hospitals in which cardiovascular events could be observed, the first aim was to define a catchment around HealthGap hospitals to estimate the denominator population for risk modelling.

Geocoded hospital location data were overlayed on postcode shape files from the Australian Bureau of Statistics. The number of cardiovascular events by postcode was extracted for each HealthGap hospital overall and by Indigenous status. Catchments were defined using three approaches and compared to determine the optimal definition. The first two approaches utilised first- and second-order neighbours of hospital postcodes, while the third was based on the spatial distribution of observed events. In this latter approach, postcodes were grouped into deciles based on numbers of events, with those in the top decile classified as locations with high cardiovascular burden and thus the largest at-risk population. Each approach was applied separately to each study hospital.

The first-order neighbours, second-order neighbours and spatial event distribution approaches captured 27-58% (27-41% Indigenous), 55-79% (59-65% Indigenous) and 72-95% (76-88% Indigenous) of cardiovascular events presenting at HealthGap hospitals, respectively. Although none of the approaches was able to exclude proximal non-study hospitals at which patients may also have presented, the impact was minimised by restricting to major cardiovascular events.

Geographic catchments provide an alternative approach to defining at-risk populations when denominator data is not readily available. Of the three approaches trialled in the HealthGap study, the spatial event distribution catchment performed best and was deemed the optimal approach to defining the denominator population. These techniques should be applied with caution in order to minimise the impact of proximal non-study hospitals, and refined based on local knowledge of processes of care.

15:15-16:45 Session OC7G: Methods for clinical research
Bayesian extrapolation from pre-clinical data to human

ABSTRACT. Background: During the initial phases of drug development, the toxicity and mechanisms of action of a candidate drug are explored in preclinical studies, building candidate pharmacodynamic models. These models can be used to define a dose range that is likely to be effective and of low toxicity in human studies. Enriching these models with other sources of information (elicited expert data, data from previous experiments or data on similar molecules) can be particularly interesting.

Aim: We propose a Bayesian approach of extrapolation from pre-clinical data to human, taking into account all external sources of information (data, published articles, etc.).

Method: Our work is inspired by a real-life example in oncology, the inhibition of TGF-beta signaling to block tumor growth [1]. We consider several in vivo PK/PD experiments from literature. Bayesian models are used on the dose-outcomes (PD surrogates of efficacy or toxicity) relationship with mixture distributions constructed from elicited expert data, data from previous experiments and data from similar molecules, as prior distributions. Maximum tolerated doses determined during Phase I and II are assumed to be expert elicited data. Thereafter, each of the posterior distributions resulting from these analyses is used to transfer information via extrapolation to obtain a dose distribution in humans. The final recommended dose range for first-in-human study is then deduced from these distributions.

Results: We perform an extensive simulation study. Compared to the standard methods that use only the preclinical estimated dose in the extrapolation, our simulations suggest that using all the available external information leads to a better (in terms of efficacy and toxicity) dose-range choice in first in human clinical trials.

Conclusion: Using the proposed Bayesian approach, which incorporates all available external information, allows a better dose selection for human trial, possibly reducing the trials failure rate due to wrong dose panel selection. This work is part of the European FAIR project that has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement N° 847786.

References [1] Gueorguieva et al., British Journal of Clinical Pharmacology (May 2014); 77(5):796–807.

Analysing ordinal endpoints in two-arm randomized clinical trials

ABSTRACT. We compare the type 1 error and power of several methods for the analysis of ordinal endpoints in two-arm randomized clinical trials. The considered methods are the two-sample Wilcoxon rank sum test (which ignores that the endpoint is categorical), the chi-square test of independence and the corresponding exact test (which ignore the ordinal scale of the endpoint), the chi-square test for trend, the likelihood-ratio test in the proportional odds ordinal logistic regression model as well as different (exact and asymptotic) versions of an original approach based on the distribution of the maximally selected chi-square statistic. In this context, “maximal selection“ refers to the selection of a cutpoint considered to dichotomize the ordinal endpoint. The latter new approach yields an additional interpretable output (namely the optimal cutpoint) and is based on an intuitive principle that can be understood as a correction procedure for multiple testing. The results of our extensive simulation study suggest that all tests, except the exact version of the new approach, have an adequate type 1 error below but close to the nominal level. However, they show different performance patterns in terms of power depending on the sample size and the distribution of the ordinal endpoint in the two treatment arms. The considered methods are illustrated through an application to data from four randomized clinical trials investigating the effect of new therapies on the outcome of lymphoma patients.

Simulations to assess the impact of non-adherance due to COVID-19 in a cluster-randomized non-inferiority trial

ABSTRACT. BACKGROUND The COVID-19 pandemic has led to disruptions to clinical research globally, including the suspension of thousands of clinical trials. In rural Gambia, a cluster-randomised, non-inferiority trial is underway to compare two schedules of the pneumococcal conjugate vaccine: 3 doses (standard) versus 2 doses (intervention). In 2020, this trial was temporarily suspended due to the COVID-19 pandemic. This led to treatment non-adherence whereby children residing in the 2-dose clusters received the standard 3-dose schedule. The aim of the current study is to use simulations to assess the impact of treatment non-adherence on study results.

METHODS The simulations replicated the trial design: 35 clusters in the 2-dose arm, 33 clusters in the 3-dose arm, 60 individuals per cluster. The primary outcome was a binary variable, the contrast between groups expressed as a risk ratio, and the non-inferiority margin was 1.38. The outcome data were simulated using a multilevel logistic model, with simulation settings based on observed data: prevalence of the outcome in the standard group (13%, 15%), odds ratio comparing the two groups (1, 1.1, 1.2) and percentage of non-adherence in the intervention arm (3%, 4%, 6%, 8%, 10%). Risk ratios were estimated using generalised estimating equations. We estimated power, bias and coverage using 2000 replications for each scenario.

RESULTS The simulation results indicated that with small amounts of non-adherence (i.e. 3-10%) in one arm only, there was minimal bias across the scenarios. Estimates of coverage were within nominal levels. There was little impact of non-adherence on study power, but as expected for this non-inferiority study design, power did decrease as the assumed odds ratio increased.

CONCLUSION This study illustrated methods for assessing the potential impact of the COVID-19 pandemic on trial results, and indicated minimal impact of participant non-adherence on trial results. We are undertaking further simulations to assess bias with more non-adherence, and evaluating methods for handling the non-adherence (e.g. two-stage least squares regression using cluster-level summaries for estimating the local average treatment effect). This work will inform the analysis for the current trial and is relevant to other clinical studies that have been disrupted by the COVID-19 pandemic.

Analysis methods for personalized randomized controlled trial (PRACTical)

ABSTRACT. For some clinical areas, multiple interventions are available, but it is not known which works best, and some interventions are not suitable for some patients. A personalized randomized controlled trial (PRACTical) design has been proposed for this setting. In this, each patient is randomised between the interventions for which they are eligible. The design can be considered as a platform that investigates a set of interventions on multiple subgroups of patients, where subgroups have “personalized” randomization lists that reflect their eligible interventions. The aim is to produce rankings of interventions that can guide the choice of intervention for each individual, rather than to estimate relative intervention effects.

We explore three analysis methods for this novel design. First, various analysis approaches at the subgroup level are possible but unlikely to provide inference of adequate precision. Second, we can utilise all indirect evidence by combining evidence from all subgroups as in a standard network meta-analysis, where the evaluation of interventions on a subgroup is analogous to a standard trial. This method adjusts for the randomization lists in the model, which would fail to converge if the number of subgroups is large. Third, a method resolves this issue by estimating each pairwise comparison separately from the appropriate subset of data and combining the estimates for different comparisons. Approaches 2 and 3 assume the indirect evidence on intervention comparisons is consistent with the direct evidence.

By performing simulation studies with a binary outcome, we evaluate the performance of the three analysis methods in terms of the properties of the intervention effect estimates and some new performance measures that summarise the benefit of treating according to the intervention rankings and the chances of picking best or near-best interventions. We find that approaches 2 and 3 provide estimates with good properties when there are no qualitative or quantitative subgroup-by-intervention interactions, and are reasonably robust to plausible levels of interaction. The overall rankings produced by these methods are likely to be suitable for determining intervention policies and guiding individual clinical decisions.

MultiNet: a computational algorithm to analyze high-dimensional epigenetic correlation structures

ABSTRACT. Derived from the lack of efficient algorithms of big data analysis, and motivated by the importance of finding a structure of correlations in (epi)genomics, we present a computational algorithm of topological data analysis (TDA) called MultiNet [1] with the main aim of developing a novel analytical tool to analyze the correlation structure of a high-dimensional epigenetic dataset. Additionally, MultiNet detects epigenetic patterns associated with data characteristics as a cancer disease status. This fact makes MultiNet a very powerful diagnostic tool to identify disease-associated biomarkers to be used in medical practice. MultiNet is based on the idea of translating the data into the language of algebraic topology to extract their main characteristics (the “shape” of the data). MultiNet converts a big data cloud into a correlation network where each node is a cluster of variables and they are joined as per their Pearson correlation, extending the Mapper design [2] as it follows a “divide and conquer” strategy to analyze the correlation of the data by overlapping data windows and doing feature modeling. This network is a simpler object that allows to analyze the data cloud in a very complete and fast way. MultiNet follows the main machine learning algorithm steps of data collection (DNA methylation arrays), exploration (remove sites with higher known variability), model training (creating the network and applying statistical models like random forest to detect significantly differentiated markers among sample groups), evaluation (test the significant sites on new datasets and detect disease-related biomarkers), and learning (apply the algorithm to explore deeply regional patterns on chromosomes). Moreover, MultiNet has more functionalities implemented as network differentiation models or plots for biological interpretation. MultiNet identifies, for instance, several markers associated to prostate and colorectal cancer with a very good sample prediction rate. Also, it identifies important genes associated with the aging process. Overall, this work establishes a novel perspective of analysis and modulation of hidden correlation structures, specifically those of great dimension and complexity, contributing to the understanding of the epigenetic processes, and that is designed to be useful for non-biological fields too.

17:00-18:00 Session Plenary 2: KEYNOTE INVITED SPEAKER

ABSTRACT. The occurrence and interpretation of negative variance components in the context of linear mixed models is well understood at this point, even though the issue is surrounded by subtle issues for estimation and testing (Verbeke and Molenberghs 2003, Molenberghs and Verbeke 2007). Broadly, negative variance components often point to negative within-cluster correlation. It is even possible to give such linear mixed models a meaningful hierarchical interpretation (Molenberghs and Verbeke 2011). Matters are more complicated when the outcomes are non-Gaussian, either in the context of the generalized linear mixed model, or extensions thereof that allow for flexible modeling of both within-unit correlation as well as overdispersion (Molenberghs et al. 2010). An additional complication is that, in practice, not only negative variance components due to negative correlation, but also underdispersion instead of overdispersion can occur, sometimes even jointly. With focus on both continuous and count data, we describe how models can be made sufficiently flexible and, in a number of cases, interpreted hierarchically (Luyts et al. 2019).

Video-Assisted Non-Intubated Lobectomies for Lung Cancer: A Systematic Review and Meta-Analysis

ABSTRACT. Introduction. To assess safety, feasibility and oncological outcomes of video-assisted non-intubated lobectomies for non-small cell lung cancer (NSCLC). Methods. A comprehensive search performed in EMBASE (via Ovid), MEDLINE (via PubMed) and Cochrane CENTRAL from 2004 to 2020. Studies comparing non-intubated anaesthesia with intubated anaesthesia for video-assisted lobectomy for NSCLC were included. A systematic review and meta-analysis were performed by combining the individual studies' reported outcomes using a random effect model. For dichotomous outcomes, risk ratios (RR) was calculated, and for continuous outcomes, the mean difference (MD) was used. Results. Three retrospective cohort studies were included. The comparison between non-intubated and intubated patients undergoing video-assisted lobectomy showed no statistically significant differences in postoperative complication rate (RR = 0.65; 95% confidence interval (CI) = 0.36 – 1.16; p = 0.30; I2 = 17%), operating times (MD = -12.40; 95% CI = -28.57 – 3.77; p = 0.15; I2 = 48%), length of hospital stay (MD = -1.13; 95% CI = -2.32 – 0.05; p = 0.90; I2 = 0%) and number of dissected lymph nodes (RR = 0.92; 95% CI = 0.78 – 1.25; p = 0.46; I2 = 0%). Conclusion. Awake and intubated video-assisted lobectomies for resectable NSCLC have comparable perioperative and postoperative outcomes. The oncological implications of the non-intubated approach should be considered. Radical pulmonary resection and systematic lymph node dissection might prove more challenging and harder to achieve due to spontaneous ventilation maintenance. Further research should be focused on the safety and feasibility of anatomical lung resections under spontaneous breathing. The long-term benefits for patients with lung cancer still need to be carefully assessed.

Bias in evaluation of discrete surrogate outcomes, due to separation: a penalized likelihood solution

ABSTRACT. Background: The use of a surrogate in place of the outcome of true clinical interest is one strategy intended to improve the efficiency of clinical trials and hence drug development programmes. Rigorous evaluation of a potential surrogate outcome is required to ensure the validity of inference about the effect of treatment on the true clinical outcome that has been replaced by a surrogate. Various techniques are available to assess surrogates: a leading method is use of a meta-analytical information theory-based approach, which enables a trial-level measure of surrogacy, R2ht, to be assessed across a range of outcome types including binary and ordinal discrete outcomes.

Objectives: In the context of surrogacy evaluation for discrete outcomes, separation (a zero cell count in the cross-tabulation of the surrogate and true clinical outcome within a randomised treatment group) causes bias in estimating the strength of surrogacy. We investigated the penalized likelihood technique of Firth (1993) as a possible solution to this.

Methods: Penalized likelihood adds a systematic bias to the score function to prevent bias in the resulting parameter estimate. We simulated multiple clinical trials for various scenarios (varying the strength of surrogacy; and the number and size of trials) to assess the value of penalized likelihood when investigating a potential binary surrogate outcome for an ordinal true clinical outcome. Outcomes and simulation parameters were based on data from a large clinical trial of compression stockings for prevention of deep vein thrombosis in immobile stroke patients.

Results: Compared to a default approach of omitting from the meta-analysis trials in which separation is present, we found that the penalized likelihood approach reduces the bias when estimating the strength of surrogacy using R2ht. It also estimates R2ht more precisely, utilising all of the available data even in the presence of separation.

Conclusions: The adoption of the penalized likelihood approach into information theoretic surrogacy evaluation is a useful addition to address the issue of separation which arises frequently in the context of categorical outcomes.

Firth D. (1993) Bias reduction of maximum likelihood estimates. Biometrika 80:27-38.

Developing a Novel Interactive Multifaceted Graphic for Treatment Ranking within Network Meta-Analysis

ABSTRACT. Background For many diseases there are multiple interventions or treatments available, with varying advantages and disadvantages. Network meta-analysis (NMA) is an advantageous statistical methodology for synthesising study results looking at treatment efficacy. In contrast to standard pairwise meta-analysis, NMA has the ability to synthesis data on multiple treatments/interventions simultaneously, allowing comparison between all treatment pairs even where there is no direct evidence. Subsequently, a powerful feature of NMA is the ability to rank interventions.

Objective This project had two aims: (1) ascertain current methods, challenges, and visualisations regarding treatment ranking within an NMA framework; (2) develop a novel graphic within MetaInsight (an interactive free NMA web application) to aid clinicians and stake-holders when making decisions regarding the ‘best’ intervention(s) for their patient(s).

Methods Current literature regarding methodology and/or visualisation of treatment ranking over the last ten years was collated and studied. Based on the literature, a novel graphical visualisation was developed using RShiny and integrated within MetaInsight, which is currently hosted on [].

Results Bayesian analyses produce rank probabilities from which mean/median rank and surface under the cumulative ranking curve can be calculated. For frequentist analyses the P-score is available. The simpler methods may be easier to interpret but are often more unstable and don’t encompass the whole analysis (and vice versa). To aid interpretation, and facilitate sensitivity analysis, an interactive graphic was developed that presented rankings alongside treatment effect and study quality results.

Conclusions Treatment ranking is a beneficial tool for clinicians and stake-holders, but results should be interpreted cautiously and visualisation transparent and all-encompassing. A ‘living’ version of MetaInsight, with treatment ranking, would allow interested parties to follow the evidence base as it grows, empowering users to continuously know which treatment(s) is the ‘best’.

Propensity Score-Integrated Meta-Analytic-Predictive Priors

ABSTRACT. In clinical trials there may be ethical, financial or feasibility constraints that make it difficult to randomize patients to a control group. Historical control groups may address this problem by borrowing information from external patients and thus, reducing the size of the concurrent control group. A Bayesian approach can be used to incorporate this information in the form of a prior distribution for the parameter of interest. Two popular approaches for defining this prior are the power prior and the Meta-Analytic-Predictive (MAP) prior (Neuenschwander, Capkun-Niggli, Branson & Spiegelhalter, 2010) approaches. Additionally, propensity scores can be used to select a subset of the historical patients that is comparable to the patients in the current study. Bayesian inference and propensity score methods have already been integrated for the purpose of forming historical controls using a propensity score-integrated power prior approach (Wang et. al, 2019). In this research, several new methods were implemented for integrating propensity score stratification or weighting with the MAP prior approach. In the first method, the patients are stratified using propensity scores and then the MAP prior is estimated separately in each stratum. The final estimate is a weighted sum of the estimates in each stratum using either equal weights or weights based on the overlap coefficient of the propensity score distributions. In the second method, one MAP prior is derived by replacing the study level of the standard hierarchical model with the strata level. In other words, the MAP prior is derived from a hierarchical model for strata which ignores the study attribute. Finally, in the third method, the MAP prior is derived from a hierarchical model for studies. However, the weight of each study in the estimation of the posterior mean is adjusted based on a summary statistic of the propensity scores of the patients in each study. Initial simulation results suggest that the second propensity-score integrated MAP method using a random effect per strata outperforms the other methods in terms of bias and mean-squared error. This simulation study will be expanded to consider further scenarios. Additionally, the methods will be applied to data on Alzheimer’s disease.

Using threshold analysis to assess the robustness of public health intervention recommendations from network meta-analyses

ABSTRACT. In the appraisal of new clinical interventions, network meta-analysis (NMA) is commonly used to investigate the effectiveness of multiple interventions in a single analysis. The results from a NMA can be fed into a decision analytic model to assess the cost-effectiveness of the interventions [1]. However, in public health intervention appraisals, NMAs are not widely used. Public health interventions are often complex in nature, with multiple outcomes and multiple interventions containing multiple components. Most evidence in public health appraisals is summarised with a narrative review or pairwise meta-analysis [2].

The hesitancy to use NMA methods in public health intervention appraisals is usually stated to be due to high levels of heterogeneity between studies [2]. Heterogeneity can arise due to several reasons such as differing study populations, outcomes, and study designs. In public health, the studies are more likely to be prone to bias and often are of poor quality due to the nature of the interventions and the outcomes [2].

Threshold analysis is used to assess the robustness of intervention recommendations to bias adjustments in the effect estimates from NMA [1]. Threshold analysis quantifies how large a change in the effect estimates, from individual studies or intervention contrasts, would be needed to alter the intervention recommendations from the NMA and resulting conclusions.

The threshold analysis [1] was applied to published network meta-analyses in various areas of public health. We were able to assess and quantify the robustness of intervention recommendations from several networks by assessing the credibility of results from individual studies in the networks as well as the intervention contrasts.

We illustrate that threshold analysis allows researchers and policy makers to assess and quantify the credibility of their results to evidence that could be biased [1]. This highlights that the use of such methods should ease any hesitancy to use NMAs in public health intervention appraisals and increase the use of such methods.

Meta-analysis methods used in systematic reviews of interrupted time series studies

ABSTRACT. Background: Systematic reviews are used to inform decision making, evaluating the effects of organisational, policy change, public health interventions or exposures. Such reviews may include evidence from interrupted time series (ITS) studies. A core component of many systematic reviews is meta-analysis, which is the statistical synthesis of results across studies. To date there have been no reviews examining the statistical approaches used to meta-analyse effect estimates from ITS designs. Objectives: We aimed to examine: 1) whether reviewers reanalyse primary ITS studies included in reviews, and what reanalysis methods are used; 2) the meta-analysis methods used; 3) the effect measures reported; and 4) the tools used to assess the risks of bias or methodological quality of ITS studies. Methods: We searched eight electronic databases across several disciplines between 2000 and 2019 to identify reviews that meta-analyse at least two ITS studies. We describe, at the review level, the type of interruption and methods used to assess ITS study risk of bias; and at the meta-analytic level, the effect measures, meta-analytic methods, and any methods used to reanalyse the primary ITS studies. Results: Of 4213 identified records, 54 reviews were included. The common interruption evaluated was a public health policy intervention (32/54). The majority of reviews (34/54) meta-analysed results from separate studies (with 23/34 reanalysing the included ITS studies), and the remainder meta-analysed results from sites (e.g. states) within a study (20/54). Among reviews that analyse the ITS studies, the most common models were Poisson (8/41) and ARIMA (7/41), with 21 including both level and slope parameters. Most reviews used a two-stage meta-analysis (51/54), fitting a random effects model (35/51) with the DerSimonian and Laird variance estimator and Wald type confidence intervals (18/35). Several effect measures were meta-analysed, with the most common being an immediate level change (45/54). Of the reviews of studies, nearly all assessed the risk of bias of included ITS studies (33/34). Conclusion: This review informs work that aims to examine the performance of different meta-analysis approaches to combining results from ITS studies, and demonstrates the need for improved statistical analysis reporting at the ITS analysis and meta-analysis levels.

Effects of probiotics on mortality and morbidity in preterm infants: a Bayesian network meta-analysis of randomized and non-randomized studies

ABSTRACT. Background and statistical challenges Necrotizing enterocolitis and late-onset sepsis are major causes of mortality and morbidity in preterm infants. It is argued that only randomized controlled trials (RCTs) should be included in the meta-analysis. The experimental setting, stringent inclusion and exclusion criteria, small sample size, high cost, short follow-up time and ethical restrictions limit the abilities of RCTs to apply the findings in real-world settings. The integration of findings from non-randomized studies (NRSs) will complement the findings of RCTs and address some of the limitations of both study designs. Bayesian network meta-analysis is suitable to incorporate data from multiple sources. There is a lack of network meta-analysis that assesses the effects of probiotics on mortality and morbidity in preterm infants combing evidence from both RCTs and NRSs. The study will fill this gap. This study will also guide combining evidence from RCTs and NRSs in network meta-analysis in the future. Objective The study aims to assess the preventive effects of probiotics on all-cause mortality, severe necrotizing enterocolitis, and late-onset sepsis in infants of low birth weight. Methods and analysis We searched MEDLINE, Embase, CINAHL, CENTRAL and Scopus databases on August 4, 2020, without any date or language restriction. We assessed the risk of bias using the Cochrane risk of bias instrument for randomized trials and Newcastle-Ottawa Scale for non-randomized studies. We will use Bayesian network meta-analysis to compare the effects of single vs multi-strain probiotics combining data from randomized trials and non-randomized studies (e.g., cohort, case-control, cross-sectional). We will use the GRADE approach to assess the certainty in of evidence of each outcome. Expected results We included 104 articles (71 RCTs and 33 NRS) for our network meta-analysis. We hope to finish the data analysis for this project by March 2021. Our preliminary analysis indicates that probiotics have beneficial effects on mortality and morbidity in preterm infants. Conclusions Probiotics have positive effects on mortality and morbidity in preterm infants. The integration of real-world evidence from NRS with RCTs has the potential to increase the precision of the evidence and help in the decision-making process.