Paper
Workshop
Summary
Transcript
<aside> 📢
Really enjoy the environment and the atmosphere here; even if just for the workshop this year, it is still a wonderful experience. So today I want to speak about some constructions that are typically used for modeling non-exchangeable data, but I want to show how they can also be used in a setting where the data are actually homogeneous, or exchangeable. This is a project I have developed with David Sole, who has been a PhD student of Igor's and mine and is now a postdoc in Milano, and with Igor himself, who is a long-standing co-author of mine. Okay, so we already mentioned this today; in particular Edwin, in his talk, anticipated this problem of nonparametric regression, where the data correspond to responses, in this case Y, which depend on some covariate X. S Speaker 1 01:10 The presence of such a covariate makes the data in a sense heterogeneous, meaning that we cannot assume exchangeability across different covariate values. Probabilistically, this is summarized by these simple relationships, whereby we preserve the distribution whenever we swap two observations corresponding to the same covariate, in this case y1 and y2, whereas when we swap observations corresponding to different covariate values, the probabilistic invariance does not hold true anymore. Okay, so this is one way of formalizing in mathematical terms the notion of heterogeneity. In particular, when the covariate space is discrete, you may see this notion as a notion of partial exchangeability; I won't enter into this today. So what is a possible modeling approach in this situation? S Speaker 1 02:16 Well, typically you may assume that the data are conditionally independent from some density function that depends on a set of parameters beta and on a covariate X, and that these parameters are themselves independent from some family of random probability measures indexed by the covariate X.
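To make the invariance concrete, here is a sketch of partial exchangeability in symbols (the notation $Y_i(x)$ for a response observed at covariate value $x$ is mine, not from the slides):

```latex
% Invariance within a covariate value: swapping Y_1 and Y_2, both observed
% at the same x, leaves the joint law unchanged,
(Y_1(x),\, Y_2(x),\, Y_3(x')) \;\overset{d}{=}\; (Y_2(x),\, Y_1(x),\, Y_3(x'))
% whereas swapping observations at different covariate values x \neq x'
% need not preserve the law:
(Y_1(x),\, Y_2(x),\, Y_3(x')) \;\overset{d}{\neq}\; (Y_3(x'),\, Y_2(x),\, Y_1(x))
\quad \text{in general.}
```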
Well, now, a very popular construction in this general framework is due to Steve MacEachern, who is teaching in this summer school, and his seminal work at the end of the 90s really marked a change point in this literature. He introduced the dependent Dirichlet process, whereby you construct this family of dependent random probability measures through the stick-breaking construction, and you take the stick-breaking length variables and the atoms in the sequence of probability measures as being independent within each x. Now, this literature has flourished in the last 25 years. S Speaker 1 03:09 There are a huge number of papers, and for those of you who are interested in gaining some access to this literature, I recommend going through any of these three review papers: the first one is on random measures; the second one, in Statistical Science, is very recent and is on the dependent Dirichlet process; and applications of dependent measures are covered in this other paper by Wade and Inacio. So this is a very general framework, for which there exist alternatives in terms of random measures. If you take a discrete random measure, in this case I suppose the atoms are two-dimensional, so I have some jumps J_h with atoms (c_h, theta_h) that belong to some product space: c_h belongs to the covariate space and theta_h to some atom space. Using some kernels, you can construct dependent random probability measures which are discrete, as in Steve's construction. S Speaker 1 04:18 And you normalize through this ratio. Okay? Possible choices of kernels, for instance these ones, are kernels where you basically tend to inflate jumps corresponding to c values that are close to the actual covariate. Okay, so what you're doing here is something of this type: you start with a random measure having some jumps J_h at values c_h in the covariate space.
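As a rough illustration of the stick-breaking idea, here is a minimal sketch in Python of a truncated draw from a "single-weights" DDP, a common simplification in which the weights are shared across covariate values and only the atoms move (linearly, here) with x; the truncation level, the priors, and the linear atom processes are all illustrative choices of mine, not MacEachern's general construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_ddp_draw(xs, H=50, alpha=1.0):
    """Truncated draw from a 'single-weights' DDP sketch: common
    stick-breaking weights, atoms varying linearly in the covariate.
    Returns (weights, atoms) with atoms[h, j] the h-th atom at xs[j]."""
    v = rng.beta(1.0, alpha, size=H)      # stick-breaking fractions V_h
    v[-1] = 1.0                           # close off the truncation
    # w_h = V_h * prod_{l<h} (1 - V_l): the broken sticks
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    theta0 = rng.normal(0.0, 1.0, size=H)  # intercept of each atom process
    slope = rng.normal(0.0, 0.5, size=H)   # covariate effect on each atom
    atoms = theta0[:, None] + slope[:, None] * np.asarray(xs)[None, :]
    return w, atoms

w, atoms = linear_ddp_draw(xs=[0.0, 1.0, 2.0])
```

Because the weights are shared, the random measures at two covariate values have the same support sizes and differ only through where the atoms sit.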
And then you either inflate those jumps or you shrink them, according to the distance between the actual covariate x indexing that particular random probability measure and the atom c_h. This is one way, but you can also decide the other way around, to somehow amplify the jumps that are far apart and shrink the jumps that are close to the covariate. S Speaker 1 05:28 So the use of this kernel allows you to achieve several possible options. And even the literature on this alternative construction is quite extensive: it has been used for spatial modeling by Rao and Teh; Foti and Williamson consider this construction with a discrete covariate space and have proposed a conditional slice sampler; other authors considered a finite number of gamma random jumps at the different observed covariates, again using some kernel construction. A more recent construction, which is very nice, goes under the name of normalized random measures by Griffin and Leisen, and they basically assume a further hyperprior on the similarity kernel. Their construction is also very interesting, and with Claudio and Igor, in an ongoing work, we are trying to study the marginal properties of this process: so determining the distribution of the induced random partition. S Speaker 1 06:44 You see, these are discrete objects, so they generate a partition, okay, based on the ties you observe in a sample. And we try to identify the distribution of such a random partition, which is going to be covariate dependent, then to give a posterior characterization, and then to provide an assessment of the covariates' impact on the predictive distribution. I won't say much on this work, I'll give it just as a pointer, because it enters the quite traditional literature on the use of dependent nonparametric priors for modeling heterogeneous data. But I want to shift the focus of the talk to the case where we want to model homogeneous data.
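A minimal sketch of the kernel-tilting just described, with hypothetical gamma jumps and a Gaussian kernel chosen purely for illustration: jumps whose location c_h is close to the covariate value x are inflated, distant ones are shrunk, and normalization yields a dependent random probability measure at each x:

```python
import numpy as np

rng = np.random.default_rng(1)

H = 40
jumps = rng.gamma(shape=0.5, scale=1.0, size=H)   # unnormalised jumps J_h
c = rng.uniform(0.0, 1.0, size=H)                 # covariate-space locations c_h
theta = rng.normal(0.0, 1.0, size=H)              # atoms theta_h (shared across x)

def dependent_weights(x, bandwidth=0.1):
    """Kernel-tilted, normalised weights of the dependent random
    probability measure at covariate value x: jumps with c_h near x
    are inflated, distant ones shrunk."""
    k = np.exp(-0.5 * ((x - c) / bandwidth) ** 2)  # Gaussian kernel K(x, c_h)
    w = k * jumps
    return w / w.sum()

w_near = dependent_weights(0.3)
w_far = dependent_weights(0.9)
```

Measures at nearby covariate values share most of their tilted mass, which is exactly the dependence the kernel is meant to induce.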
That is, exchangeable data. And I consider a specific instance of application in survival analysis that has to do with competing risks. S Speaker 1 07:46 See, when you examine competing risks, you have a special type of observation, which is a failure time, and failure can be caused by different causes, which we identify by delta, belonging to a finite set from 1 to D. In this case one wants to understand what is the most likely cause of death, or, based on past observations, what is going to be the most likely cause of death for a subsequent observation: a predictive problem. This obviously is a very old problem, whose origins date back to the 18th century, in particular to this work by Daniel Bernoulli, where he tried to disentangle the risk of dying from smallpox from other causes. And actually this is considered one of the papers that initiated epidemiology. But also think of other examples. S Speaker 1 08:49 For instance, when you want to examine mortality after myocardial infarction, this can be due either to sudden cardiovascular disease related to the infarction you had, or to non-sudden cardiovascular disease, or to non-cardiovascular disease. Or, for instance, you may examine patients affected by melanoma who undergo some surgical treatment: they may die either because of the melanoma or due to other causes, for instance causes related to the specific surgery they have undergone. So you try to somehow couple failure times with possible causes of death. It's important to see here that the occurrence of one cause excludes the possibility of the other causes being observed: the causes are mutually exclusive. This is an important modeling aspect you have to bear in mind. S Speaker 1 09:54 So, what is the traditional approach to competing risks? It comes from the Scandinavian school, which is very productive in this area.
It's a multi-state model, whereby you start with a Markov process having an initial transient state, state zero, which corresponds to being alive, and you move to any of the absorbing states, where the d-th of these absorbing states is death from cause d. And you regulate the transition from being alive to failing from cause d through this intensity h_d, which is nothing but the cause-specific hazard rate in survival analysis: the instantaneous rate of failure at time t due to cause d, given survival up to time t. We refer to it as the cause-specific hazard. S Speaker 1 11:07 So you see, you move from state 0 to any of the states from 1 to D in a time interval between 0 and t with some rates, or intensities. And based on these intensities, you define the transition probabilities, which are P0d and P00. Of course, P00 means that I didn't experience any cause of death and I'm still alive at time t, which is obviously another possible state I can be in at time t. So how are the transition probabilities and the likelihood defined in this setting? Well, first of all, the survival function, the probability of remaining in state zero over the time interval up to t, is defined in terms of the cause-specific hazard rates, as you see on top there: you take the sum of the cumulative hazards in the exponential, and this defines the overall survival function. S Speaker 1 12:06 And then you have the transition probabilities P0d, which are defined as cumulative incidence functions, and involve the cause-specific hazards through this integral here. So the observed data, which are assumed to be exchangeable in this case, so we assume homogeneity of the survival times, are made of pairs of survival times and causes of failure.
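As a numerical sanity check of these relationships (with illustrative constant hazards, not anything from the talk), one can compute the overall survival function and the cumulative incidence functions from the cause-specific hazards:

```python
import numpy as np

# Illustrative constant cause-specific hazards for three competing causes.
lam = np.array([0.2, 0.5, 0.3])
t_grid = np.linspace(0.0, 10.0, 2001)
dt = t_grid[1] - t_grid[0]

# Overall survival S(t) = exp(-sum_d H_d(t)), with H_d(t) = lam_d * t here.
surv = np.exp(-lam.sum() * t_grid)

# Cumulative incidence functions F_d(t) = int_0^t h_d(s) S(s) ds,
# approximated by a left Riemann sum on the grid.
cif = np.array([np.cumsum(lam_d * surv) * dt for lam_d in lam])
```

With constant hazards the incidences converge to lam_d over the total rate, and at every t the incidences plus the survival probability sum to one (up to discretization error), matching the interpretation of the P0d and P00 as probabilities of the mutually exclusive states.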
In this case, the survival time may be either exact, if there actually is a failure because of one of the given causes, or not exact, because the subject is still alive at time t. And then you can determine, through standard calculus in survival analysis, the likelihood function, which is a function of the cause-specific hazards. Now, this is not the only approach you can use for handling competing risks data. S Speaker 1 13:12 There is also another popular approach, which uses latent variables: latent failure times, whereby you associate to each cause of death a failure time, y1, y2, ..., yD, and you define the observation as the minimum across these D latent failure times. And the corresponding cause of death is, obviously, the argmin over these latent variables. So you can define the joint survival function, and the survival function associated with the observed survival time T is easily determined, and the cause-specific hazards follow again from standard calculations in survival analysis. So this looks like a very appealing approach too because, you see, it's very easy: you define failure times, and once you define the joint probability distribution of the latent failure times, you can draw inferences on the observed survivals. Okay, but there are some caveats here. S Speaker 1 14:28 While being quite intuitive, it has some problems. First of all, it assumes that all event types eventually occur for each individual, whereas, as I said, the competing events are mutually exclusive. Moreover, and this is most important, with the data you observe, the competing risks data, you cannot identify, without further assumptions, the joint survival function and the marginal distributions of the latent failure times: different joint survival functions of these latent variables lead to the same likelihood.
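The latent-failure-times mechanism is easy to simulate, which is exactly why it is convenient for generating test data; a sketch with independent exponential latent times (one arbitrary choice among the many joint laws that, as noted above, would yield the same observed-data likelihood):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([0.2, 0.5, 0.3])   # rates of the latent exponential failure times
n = 20000

# One latent failure time per cause, per subject (independent for simplicity).
latent = rng.exponential(1.0 / lam, size=(n, 3))
t_obs = latent.min(axis=1)        # observed failure time: the minimum
delta = latent.argmin(axis=1)     # observed cause: index of the smallest time

freq = np.bincount(delta, minlength=3) / n
```

For independent exponentials, cause d is observed with probability lam_d over the total rate, and the observed time is exponential with the total rate, so the empirical frequencies and mean should match those values closely.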
And the only identifiable parameters are the cause-specific hazards. So, though being an intuitive way of approaching the problem of competing risks, this has some issues of identifiability. But you see, we're not going to use this approach; we will stick to the multi-state approach. S Speaker 1 15:31 Anyway, this latent-failures approach can be used for generating simulated data you can use for testing your model, and this is what we actually do. Now, the frequentist literature on these topics is quite extensive. Here are a few papers I'm listing about the identifiability issue for the latent variables approach. You also have very nice reviews, such as this paper that has recently appeared in the Annual Review of Statistics and Its Application, and then the textbooks by Kalbfleisch and Prentice, Lawless, and Crowder. S Speaker 2 16:19 [inaudible] S Speaker 1 16:22 There you find a lot of material on the frequentist approach to competing risks. The Bayesian nonparametric literature is a bit narrower. Here you have some earlier work; Xu et al. use a dependent Dirichlet process for estimating treatment effects, so a causal inference problem with semi-competing risks. Some more recent papers use a gamma process prior for the baseline cause-specific hazards in a nonparametric setting. There is also this paper by Sparapani, who uses a discrete-time approach based on BART. Okay, now, here we try to address this problem using dependent priors. And the goals we try to achieve with dependent priors are the following. S Speaker 1 17:18 We try to specify a Bayesian nonparametric model for the cause-specific hazards, for estimating the survival function, obviously, and for estimating the cause-specific incidence functions, which are the transition probabilities in this Markov model. And then, finally, what is really relevant for us is what we call the prediction curve.
The prediction curve is nothing but the probability that a new patient, at time t, dies because of cause d. So you try to predict, based on past data, what's going to be the most likely cause of death for a future patient. Here I'm drawing it just as a cartoon, referring to the melanoma example: the orange curve reflects the probability of death because of melanoma, and the green curve because of other causes. S Speaker 1 18:17 And you see that at the beginning, immediately after surgery, other causes of death are more likely than melanoma, and then, as time goes by, melanoma becomes the more likely cause of death compared to other causes. And you also have these credible intervals around the prediction curves. Okay, so I'll address now these three problems; let's see how much I can cover. First of all, what we assume here is dependence among the different cause-specific hazards. The idea is that different causes of death can borrow information from each other: there may be causes of death that you observe more frequently, and you want them to have an impact on the other causes of death. So in this case, you can think of a covariate space which is discrete and finite, corresponding to the different causes of death. S Speaker 1 19:04 We can also accommodate other covariates, as we do in the real-data example. Moreover, we want to achieve analytical tractability, since we want to draw predictions at the end of the day. And one tool, which is by now standard in Bayesian nonparametrics to achieve these goals, is a mixture representation using completely random measures. So again, you see, we have random measures with jumps and random locations, which are these betas here, and each cause of death has its own specific jumps and atoms.
But of course, you may want to add some dependence across the different causes of death, because you want to borrow information across them. So you try to build dependence among these measures, and one simple way to do it is by taking a hierarchical construction. S Speaker 1 19:55 But before getting into the hierarchical construction, let me briefly recall that I will be dealing with completely random measures, which are atomic measures that induce independent random variables when evaluated at pairwise disjoint sets. They are characterized by this representation of the Laplace transform. And we also assume that the intensity measure characterizing a completely random measure factorizes, so that we have a homogeneous measure. We also have two technical conditions: one is quite standard, this integrability condition; the other, this divergence, corresponds to having an infinite-activity measure, which is needed for our results. And the measure alpha can be either diffuse and deterministic, or it can be random itself, in this way, as we shall see. So in this case, the kind of dependence I shall assume is of hierarchical type. S Speaker 1 20:47 I'll assume that v1, ..., vD are conditionally independent and identically distributed completely random measures, and you see that the intensity is random: so we have Cox processes at the bottom level of the hierarchy, whereas at the top level of the hierarchy we have a standard completely random measure whose intensity is deterministic, depending on a diffuse measure lambda0 on the atom space. Okay, and this is what we label a hierarchical completely random measure. And so, for our competing risks data, we will make an exchangeability assumption: we have homogeneous data, as I said; they are conditionally i.i.d. given some random probability measure P, which we obtain using a transformation of this vector that accounts for the different causes of death.
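In symbols, the objects just recalled can be sketched as follows (a hedged reconstruction from standard CRM theory; the notation $v_d$, $\rho$, $\alpha$, $\lambda_0$ follows the talk only loosely):

```latex
% Laplace transform of a CRM v with factorised (homogeneous) Levy intensity
%   nu(ds, d\theta) = \rho(s)\,ds\,\alpha(d\theta):
\mathbb{E}\left[ e^{-\int f \, \mathrm{d}v} \right]
  = \exp\left\{ -\int_{\Theta}\!\int_0^{\infty}
      \bigl( 1 - e^{-s f(\theta)} \bigr)\, \rho(s)\, \mathrm{d}s \, \alpha(\mathrm{d}\theta) \right\}

% Infinite activity (the divergence condition mentioned above):
\int_0^{\infty} \rho(s)\, \mathrm{d}s = \infty

% Hierarchical CRM: conditionally i.i.d. CRMs at the bottom, whose random
% base measure is itself a CRM (hence Cox processes at the bottom level):
v_1, \dots, v_D \mid v_0 \ \overset{\text{iid}}{\sim}\ \mathrm{CRM}(\rho, v_0),
\qquad v_0 \sim \mathrm{CRM}(\rho_0, \lambda_0)
```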
So how do we do this? S Speaker 1 21:34 Well, we have to define a prior distribution for those transition probabilities, right? And we define these priors in this way. You see, the transition from 0 to d at time t is expressed like this: it's a mixture of this cause-specific random measure vd, and then you have an exponential term that accounts for the survival part. And you see the survival function also has a prior that is induced by this random function. You can also establish conditions under which this survival function is almost surely proper, because you want it to tend to zero almost surely as t tends to infinity; this condition is easily satisfied by most kernels K one may use, and I will provide some examples later. S Speaker 1 22:26 Marginalizing, if you consider the marginals of this, so you take the expectations of those transition probabilities, you have that the instantaneous transition from 0 to d is the same for any d: it doesn't depend on the cause of death. So there is an underlying assumption of uniformity across the different causes of death: a priori, I'm assuming that all causes of death are equally likely, which is quite a strong assumption. But you can easily remove this assumption, if you wish and if the specific problem tells you to do so, by either differentiating the kernels according to the different causes of death, or by taking the underlying random measures at the bottom of the hierarchy as being just independent: you remove the identical distribution, and still the whole framework works. S Speaker 1 23:12 You just need to do some adaptation of the results I will show. Now, a key formula we use here is a simple change of measure, okay?
For all the calculations we need to perform, you need to establish a sort of Radon-Nikodym derivative between this moment measure on the left and this expression we have on the right-hand side, depending on this exponential; you see, this is a Laplace transform we have on the right-hand side. And this holds true whenever alpha is diffuse. But we are going to handle discrete measures, because we have a Cox process at the bottom level of the hierarchy. And when alpha is discrete, one can deduce formulae similar to these ones, either using the Faa di Bruno formula for derivatives of composite functions, S Speaker 1 24:12 or by doing an augmentation of the dimension of the atom space through a map. I will give just a simple intuition of what we do here and then skip some material in the interest of time. So, what we do here: you see, the problem is the following. When you have a random measure here, what happens is that you may have atoms that coincide, this theta star, because alpha now is discrete. And whenever you have atoms that coincide, the corresponding jumps are gathered all together: there's no way you can disentangle the different jumps corresponding to atoms that coincide, because the random measure depends on those jumps only through their sum. But for Bayesian calculations, it is important to separate those jumps; this kind of aggregation doesn't work, and it produces some important combinatorial hurdles in the computations. S Speaker 1 25:10 So what you do is introduce an augmentation with marks Z which, being drawn from a nonatomic probability measure, are almost surely distinct. And so, once you pair the thetas with these Zs, the pairs are going to be distinct, which allows you to disentangle the different jumps at that point. And this is what we do. And this corresponds, basically, to a fragmentation process.
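A toy illustration of the aggregation problem and of the marking trick; all the distributions here are arbitrary illustrative choices, not the model's:

```python
import numpy as np

rng = np.random.default_rng(3)

# With a discrete base measure alpha, distinct jumps can land on the same atom:
atoms = rng.choice([0.1, 0.4, 0.7], size=8)   # thetas drawn from a discrete alpha
jumps = rng.gamma(1.0, 1.0, size=8)

# Aggregation problem: the random measure only sees the *summed* jump at
# each repeated atom, so the individual jumps cannot be disentangled.
aggregated = {}
for a, j in zip(atoms, jumps):
    aggregated[a] = aggregated.get(a, 0.0) + j

# Augmentation: pair each theta with a mark Z drawn from a nonatomic law
# (uniform here); the pairs (theta, Z) are almost surely all distinct,
# so each jump keeps its own identity.
marks = rng.uniform(size=8)
pairs = list(zip(atoms, marks))
```

The partition induced by the pairs refines the partition induced by the thetas alone, which is the fragmentation described next.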
Those of you who have studied coagulation and fragmentation processes in probability: this is what you're doing here. You're basically fragmenting a partition using this continuous marking, and so you have a nested partition, which starts from a coarse partition in the thetas and gets to a finer partition in the zetas, these continuous marks. And this achieves what we need for performing Bayesian calculations. S Speaker 1 26:15 So you see, you have nested partitions: the coarser partition in the thetas and the finer partition in the Zs. And using this, I will go quickly through this formula here, but you see, you can marginalize very easily, and in the formula of the marginal distributions you recognize the structure of these nested partitions; it is very transparent in this formula, which depends only, you see, on these Cs, which are the Laplace exponents, and on the cumulants of the underlying completely random measures. You can also determine the predictives, and again in the predictives you recognize this nested structure, whereby for each distinct theta you can either have an old Z or a new Z, because you are again fragmenting the theta partition into the finer Z partition. S Speaker 1 27:22 And this kind of intuitive structure, which you recover in all the predictives, allows you to determine the prediction curve, which was the ultimate goal we wished to achieve. You can also give a posterior characterization, which shows a structure of conditional conjugacy, meaning that if, a priori, you start with a hierarchical random measure, then also a posteriori, for the components not related to the jumps, for v0-star and the vd's, you still recognize a hierarchical construction.
And we can also prove a property of conditional conjugacy for generalized gamma hierarchical completely random measures, in the sense that if you start from generalized gammas here, you end up with a sort of extended generalized gamma a posteriori. So there is a sort of conjugacy property also in this competing risks setting. S Speaker 1 28:32 For estimation, we can use those results to estimate the survival function, the cause-specific incidence functions, and the prediction curve, using MCMC. And here, if I have time, I will show you maybe a couple of quick illustrations. Okay, I have some time, very quickly. So here we have, you see, generalized gammas, for which we proved the conjugacy property, with different choices of kernels: here you see the Dykstra-Laud kernel, or the rectangular kernel for some threshold c. And we use the latent variables framework for simulating three different causes of risk, with some prior specification for the generalized gamma process, and in this case we use the Dykstra-Laud kernel. You know, the Dykstra-Laud kernel corresponds to increasing hazards, which may not be the case in some applied settings. S Speaker 1 29:44 And you see, here I consider a maximum follow-up time T, because I need lambda0 to be finite. And doing so, you see, this is just the kind of output I want to show: you can estimate the cause-specific incidence functions, so the instantaneous rates of transition from 0 to d at time t, here for the three different causes, and we also have credible intervals around those estimates; the prediction curves corresponding to these causes, where you see the dashed lines, which are the true ones, and the continuous lines, which are the estimated ones; and then estimates of the cumulative incidence functions. There is also a comparison with independent hazard rates, which I skip. And then the application has to do with transplant data.
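A sketch of why the Dykstra-Laud kernel forces increasing hazards: with k(t, x) = 1{x <= t}, the kernel-mixture hazard h(t) = int k(t, x) mu(dx) simply accumulates jump mass as t grows. The crude finite approximation of the random measure below is illustrative only, not the generalized gamma process of the talk:

```python
import numpy as np

rng = np.random.default_rng(4)

# Crude discrete stand-in for a completely random measure on [0, T_max]:
# many small gamma-distributed jumps at uniform locations (illustration only).
T_max = 5.0
H = 500
jump_locs = rng.uniform(0.0, T_max, size=H)
jump_sizes = rng.gamma(shape=0.02, scale=1.0, size=H)

def hazard_dykstra_laud(t):
    """Kernel-mixture hazard h(t) = int k(t, x) mu(dx) with the
    Dykstra-Laud kernel k(t, x) = 1{x <= t}: nondecreasing in t by
    construction, since raising t can only add jump mass."""
    return jump_sizes[jump_locs <= t].sum()

ts = np.linspace(0.0, T_max, 50)
h = np.array([hazard_dykstra_laud(t) for t in ts])
```

This monotonicity is exactly the limitation mentioned above: in applications where hazards can decrease, a different kernel (e.g., a rectangular one) is needed.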
S Speaker 1 30:50 This is a real data set where we examine 400 patients who were diagnosed with acute myeloid leukemia and underwent bone marrow transplantation in different hospitals. And we consider only two competing events: dying from acute or chronic graft-versus-host disease, or death or relapse without going through graft-versus-host disease. And here, you see, we had several censored observations, to which our model can be easily adapted, with a maximum follow-up time of 7.3 years. And we also use a binary covariate, which can be introduced in our model, to identify the source for the transplantation, which could have been bone marrow or peripheral blood stem cells. And here's the result: you see, it's clear that through time the most likely cause of death is graft-versus-host disease versus other causes. S Speaker 1 31:45 And you also have the estimation of the sub-distribution functions, or transition probabilities. And I think, well, I'll skip the melanoma data, and I thank you for your attention. Any questions or comments? S Speaker 2 32:17 Thank you very much, Antonio. So, just a quick clarification question. S Speaker 1 32:22 If I understand, S Speaker 2 32:25 you use a hierarchical sort of construction to couple these different causes, of course using CRMs, and the mixing probability depends on the covariate, but the atoms may be shared. S Speaker 1 32:41 Yeah. S Speaker 2 32:42 Then, is there a particular consideration in that choice, for not allowing the atoms to vary with respect to the covariate? So the mixing seems to be dependent on the covariate, S Speaker 1 33:04 Yes. S Speaker 2 33:05 the mass, but not the atoms. S Speaker 1 33:10 No, the atoms, no. Because, I mean, with the atoms, what you want to achieve is a sort of sharing across the different causes of death. So it's like you have a shared-atoms model: the atoms are shared across the different causes of death.
So they cannot depend on the covariate; it's a latent quantity, right? We wanted to use the atoms to allow some borrowing of information across the different causes of death, and one way of doing this is not to make them depend on the covariate. I have a question: so, can you use the model to predict life expectancy? Yes, you can use the estimated survival function to do that. Right. S Speaker 1 34:15 Our focus was more on the prediction of the causes of death as a function of the time you survive, but of course you can address that problem by estimating other functionals of the underlying random probability measure, such as the survival function. Yes, but when you apply it to a real problem, I did a similar thing, but I had a problem with the data: there are several ways to collect data, and for each case a different strategy to calibrate the model. I will discuss it with you later. Okay. S Speaker 2 35:07 Thank you very much. So, I'm just curious: I see that there is some relation between the states, between state 0 and the states 1 to d, something like this, as if the states were dependent. S Speaker 1 35:26 Yes, I create dependence across the different states. S Speaker 2 35:30 Okay. But at a certain point you mentioned some pictures where the curves merge at a certain point. S Speaker 1 35:35 Yes. S Speaker 2 35:35 So, what's the meaning of this one? Because there are some... S Speaker 1 35:41 I mean, what you have is that the incidence functions may cross, obviously, right? Because, I mean, this is what typically happens: you don't have proportional hazard rates, for example. The cause-specific rates are not proportional across causes, and this creates this sort of behavior.
And actually with these models, I mean, I didn't show it, but in the simulated examples you can actually identify very well the point at which the different cumulative incidence functions cross. So dependence gives you information also about this aspect of the problem: it identifies the relationship between the different curves associated with the different states. An example I was going to show is where we actually compare the dependent case with the independent case, where you do assume that the different states are independent, and in that case there's a huge difference. S Speaker 2 36:46 What happens if the states are not dependent anymore? S Speaker 1 36:51 You don't have any... I mean, the atoms, going back to Long's question, would be specific to each cause of death. So it's as if you were estimating the different cause-specific hazards independently, without using information from the other cause-specific hazards. There may be a cause of death that is very rarely observed: you have very few data, but you can use data from the other causes of death for estimating the cause-specific hazard for that rarely observed cause. This is what we try to achieve. Does that make sense? Let's thank the speaker again.
</aside>