<aside> 📢
The presentation introduced the Predictive Bayesian approach, which flips traditional Bayesian inference by starting directly with the predictive density rather than the prior-likelihood-posterior sequence. This framework allows statisticians to work backwards from predictive densities to recover posterior distributions. The approach has historical roots in the work of Bruno de Finetti and Phil Dawid's "prequential approach," with a recent resurgence through papers by Fortini and Petrone.
The core algorithm uses sequential imputation to construct joint predictive densities. The presenter demonstrated the framework using several examples.
Advantages and Implementation
Research Challenges and Future Directions
</aside>
<aside> 📢
So, this is the usual framework for Bayes. You choose a prior and a likelihood, you get the posterior, and then you do everything afterwards. And finally, at the end, you get this posterior predictive density that you can use to predict new data.
Predictive Bayes, which is kind of the umbrella under which this martingale posterior framework lies, is actually flipping things around. So instead of starting off with the prior and likelihood, we're going to start directly with the predictive density. It's quite easy to construct these predictive densities. So the predictive Bayesian actually works backwards. We start with the predictive density, and we construct this machine for imputing the rest of the population, which is what I'm going to talk about.
And from that you can actually recover the posterior distribution. So it's kind of a data-driven version of inference. You just construct your model for the data, and from that you can get back uncertainty on the parameters of interest, which is what we are aiming for in statistics.
After I give a brief introduction for those who are new to this topic, we will move on to a more difficult topic, which is how to extend the framework to nonparametric and dependent settings.
I mentioned that it's relatively straightforward to elicit predictive densities. That's sort of a misnomer, actually. It can be pretty complicated when you put some constraints on the models. So if you're trying to elicit predictive densities where there's some structure, it's not super obvious what to do.
So we're going to basically talk about how to extend it to dependence. That's going to be the research part of the talk. For the details of the framework, you can look at the paper, but they will also be covered here.
Why are we going to be working with this predictive density? It's kind of weird to turn things around. Well, I think it's sort of timely to look at predictive densities, because they come up a lot in machine learning.
What's nice as well is that when you have statements on observations, so predictives, you can actually see the data in practice, so you can check how good your predictive is. Whereas statements on parameters, like your prior distribution, are usually pretty hard to check, because your parameter is never directly observed.
And this is mainly the reason why I'm interested in this topic: computation can be much more expedient. As you might have seen, Bayesian inference usually requires these quite expensive Markov chain Monte Carlo algorithms.
Within the predictive framework, it turns out, posterior sampling is actually quite easy, and sometimes it's way faster than MCMC. Not always, but sometimes it is. And if you're interested in the theory and the general framework for this, I really highly recommend looking at this paper.
This is a paper by Sandra Fortini and Sonia Petrone. It's called "Exchangeability, prediction and predictive modeling in Bayesian statistics." It covers the whole topic of the predictive approach to Bayes, starting from the fundamentals. If you're a student who is interested, I would highly recommend reading this as your first entry into the field. Now that we've got the motivation out of the way, we're going to start talking about the framework itself.
Before that, I'd like to pay homage to the people who have contributed to this field. This notion of working directly with a predictive distribution actually has a really long history. It goes all the way back to Bruno de Finetti. And there have been other well-known proponents of working directly with a predictive distribution, like Phil Dawid, who coined the term "prequential approach," which we might mention later.
As you can see, these are papers from quite a long time ago. There's been a big resurgence recently based on this paper, also by Sonia. This is a paper from 2020, where they looked at constructing Bayesian nonparametric predictive distributions directly.
I think it sheds a nice light on the source of statistical uncertainty, which is a little bit philosophical, so if this is not your jam, you can zone out for a bit. I think most statisticians, Bayesians and non-Bayesians, can agree that there's only uncertainty because you only ever see a finite number of data points.
So all you ever see is a finite data set. If you're interested in estimating something like the mean, the variance, or some other population parameter, and you had access to every single person in the population through some way of sampling, then you would actually know your parameter exactly, and there would be no point doing statistical inference anymore. So in this work, the idea is that because you've seen y_{1:n} but you haven't seen the rest, y_{n+1:∞}, you can look at Bayesian inference as working directly with this object.
The bootstrap uses that idea to construct confidence intervals. You can actually look at Bayesian inference as doing a very similar thing, where instead of drawing from the data with replacement, you're going to draw y_{n+1:∞} from this big joint predictive density.
So we're going to talk about how to do that later on. It looks a little bit crazy right now, but the idea is that every time you draw a new population, you can compute your parameter, like a bootstrap. And that θ∞ is going to have a distribution, a random distribution, because y_{n+1:∞} is random. And it's going to be conditioned on the data you've seen, because you're generating from the predictive.
And what happens is you actually get the posterior distribution back: you get a distribution on your parameter. There is some nice math showing that, in specific parametric Bayesian models, this equivalence is exact. So it's not just intuition; this is actually a very formal connection. You can actually do Bayesian inference using a bootstrap.
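To make the connection concrete, here is one way to write it down (a sketch in the notation above, using the population mean as the example functional):

```latex
% The parameter is a functional of the complete (infinite) data sequence,
% e.g. the population mean:
\theta_\infty = \theta(y_{1:\infty}) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} y_i

% All remaining uncertainty is about the unseen part of the sequence:
y_{n+1:\infty} \sim p(y_{n+1:\infty} \mid y_{1:n})

% The posterior is then the conditional law of the functional:
\pi(\theta \mid y_{1:n}) = \mathrm{law}\bigl(\theta_\infty \mid y_{1:n}\bigr)
```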
The hard part, of course, is constructing this kind of crazy joint predictive, this big p(y_{n+1:∞} | y_{1:n}), which is what we're going to talk about now.
So the most important part of actually constructing this, of course, is an algorithm.
So how would you actually do this very big imputation step? With a bootstrap it's very easy, but in this case it's a little bit more complex. An algorithm for doing this was actually introduced in work I was involved in, as well as in Fortini and Petrone's paper. The idea is to do something called sequential imputation: you work with smaller, bite-sized versions of the problem, these one-step-ahead predictive densities. So let's denote p_i as p(y_{i+1} | y_{1:i}).
That's the predictive density for a new datum, y_{i+1}, given the history. And what you can do is factorize your joint predictive density, this big object here, the first line of the algorithm, into a product of one-step-ahead predictives. This is just the chain rule from probability. Nothing special so far. What this translates to in terms of an algorithm is this inner loop here in red. So, if I want to generate a big long sample, y_{n+1:∞}, what I can actually do is draw a sample from my predictive density.
So I'm actually going to impute, basically make up fake data. I'm going to update my predictive density, treating that fake data as real data, and then keep going. That's equivalent to drawing from the big joint predictive. So the inner loop is one bootstrap sample, essentially, and the outer loop is the traditional bootstrap. And then you can compute any parameter of interest.
So we're going to talk about what these two steps look like (the two red lines) in a concrete parametric example shortly.
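The two nested loops can be sketched in a few lines. This is my illustration, not the talk's code, and it uses the simplest possible one-step-ahead predictive (draw the next point uniformly from everything seen so far, a Pólya-urn-style choice); the function names are mine:

```python
import random
import statistics

def predictive_resample(data, n_forward=2000, n_boot=500, seed=0):
    """Martingale-posterior sampling with a Polya-urn-style empirical
    predictive: the next point is drawn uniformly from everything seen
    so far (real data plus previously imputed points)."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(n_boot):                # outer loop: one bootstrap replicate
        y = list(data)
        for _ in range(n_forward):         # inner loop: sequential imputation
            y_new = rng.choice(y)          # draw from the current predictive
            y.append(y_new)                # treat the fake datum as real
        thetas.append(statistics.mean(y))  # functional of the imputed population
    return thetas                          # approximate posterior draws of the mean

samples = predictive_resample([1.2, -0.3, 0.5, 2.1, 0.0])
```

The spread of `samples` across replicates is exactly the statistical uncertainty the talk describes: each outer iteration imputes one possible "rest of the population."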
Okay, so I'm going to go through a parametric example, and this is kind of the most important thing in this talk.
What we can show is how to do this algorithm in practice, and then we're going to use this to generalize the framework. So, consider a very standard Bayesian parametric model. We're going to look at a Gaussian location model, where you have a normal distribution with an unknown mean θ and variance 1. We're going to put a conjugate prior, N(0, 1), and of course we know how to calculate the posterior.
So we actually know what the posterior is, and we can calculate the posterior predictive. The posterior predictive in this setting is really straightforward: it's basically centered at a regularized version of the sample mean, and its variance shrinks at roughly a 1/n rate.
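For reference, the standard conjugate computations being referred to are:

```latex
% Model and prior:
y_i \mid \theta \sim N(\theta, 1), \qquad \theta \sim N(0, 1)

% Posterior after n observations (precision 1 from the prior plus n from the data):
\theta \mid y_{1:n} \sim N\!\left( \frac{n \bar{y}_n}{n+1}, \; \frac{1}{n+1} \right)

% Posterior predictive (posterior uncertainty plus observation noise):
y_{n+1} \mid y_{1:n} \sim N\!\left( \frac{n \bar{y}_n}{n+1}, \; 1 + \frac{1}{n+1} \right)
```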
So instead of starting with the posterior, why don't we just go directly to the predictive density and carry out this algorithm that I just mentioned before. The first step is to draw a sample from this predictive density, which is very straightforward: I just draw from a Gaussian centered at the previous mean, with some variance.
The next step is going to be this weird step where you actually update your model with this simulated data. You treat this fake data as real data. So what we're doing here is plugging it back into the update: we compute the sample mean, where the first n data points are real and the (n+1)th data point is fake. That's going to give you a random number. And you can repeat this onwards until some big capital N and look at what happens to your imputed population.
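Putting the two steps together for this Gaussian example, here is a sketch (variable names are mine) that imputes forward and can be compared against the exact posterior N(nȳ/(n+1), 1/(n+1)):

```python
import math
import random

def gaussian_predictive_resample(data, n_forward=2000, n_boot=500, seed=1):
    """Predictive resampling for y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
    With running sum s over i points, the one-step-ahead predictive is
    N(s/(i+1), 1 + 1/(i+1)); we impute forward and record the limiting
    regularized mean as one draw of theta."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(n_boot):
        s, i = sum(data), len(data)
        for _ in range(n_forward):
            mu = s / (i + 1)                     # predictive mean
            sd = math.sqrt(1.0 + 1.0 / (i + 1))  # predictive standard deviation
            y_fake = rng.gauss(mu, sd)           # step 1: impute a fake datum
            s, i = s + y_fake, i + 1             # step 2: update as if it were real
        thetas.append(s / (i + 1))               # ~ theta_infinity for this replicate
    return thetas

data = [0.3, 1.1, -0.4, 0.9, 0.2, 1.5, -0.1, 0.7, 0.4, 1.0]
samples = gaussian_predictive_resample(data)
# exact posterior for comparison: N(sum(data) / 11, 1 / 11)
```

The running mean s/(i+1) is a martingale under this predictive, which is why the histogram of `samples` recovers the exact conjugate posterior.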
This algorithm is illustrated in the next plot. The figure is a little bit loaded. What we did is we generated 10 data points to start off with. On the left-hand side, I have a plot of these trajectories, showing what happens to this algorithm as I draw more and more samples. Each of the black lines corresponds to one of the trajectories.
What happens is you get this wandering of the trajectories: you get these curves that eventually collapse to a certain point. The reason they collapse is that you see more and more observations, so you start to estimate the mean more and more precisely.
And if you look at where these kind of parameters end up, the distribution at the end, you actually end up with a very nice posterior-looking plot. In fact, if you overlay it on the exact posterior density, which is known to be Gaussian, you can see that they're basically the same.
So all of this uncertainty that you get from Bayesian inference, you can actually obtain by starting directly with the predictive density, this Gaussian centered at the mean, instead of working with the prior and the posterior. You end up back with basically the same thing. So this algorithm works in practice, and you can also justify it theoretically.
So, we've looked at using the predictive density of a parametric model. Well, the whole point of coming up with this interesting algorithm is that we can make it more general. One way to do this is to consider a much more general class of predictive densities instead of this simple Gaussian location model.
So, let's just leave things abstract for now. Suppose that our model of interest, and I'm going to refer to the sequence of predictives as the model now, is p(y_i | y_{1:i-1}) for i = n+1 to infinity. When you construct a Bayesian model in this way, which I'm going to show you how to do shortly, you don't actually need a likelihood or prior anymore. You just need the predictive densities directly. So this is a way of doing Bayes without not just the prior, but also without the likelihood.
There is a very technical condition that's needed. The name of the framework we coined, the martingale posterior, arises because you have some pretty stringent conditions on the predictive. We're not going to talk about that today, but if you're interested, please take a look at the paper.
And then to actually do posterior inference, you just have the two steps that we mentioned before. We're going to do this bootstrap, where we draw samples, but instead of drawing them IID, we draw them sequentially from the predictives. And if you're interested in regression functions or probability densities, which is what we're going to look at shortly, then this also works. You just need to be a bit careful.
From this imputation scheme, you get a random θ∞. You're going to look at these trajectories and see where they end up, and plotting the ends of the trajectories, we're going to call this distribution the martingale posterior. Again, the name martingale comes from a very technical condition. It's a mathematical construct, but it's just a fancy way of saying that the predictive leads and the posterior follows.
Okay, so we've come up with this nice general scheme, and I've motivated it from the Bayesian point of view. The important question now is: are there sequences of predictive densities that are both intuitive and work nicely with this martingale framework?
And the answer is: sort of. We're going to give you some examples now, and we're going to talk about how one might extend this framework, as well as this kind of modeling of predictive densities, to settings where you actually have some form of dependence.
I'm going to now go through a very concrete algorithm. It's actually very easy to implement.
It's going to be closely connected to kernel density estimates, which you might have seen before. In this work and previous papers, what we actually tried to do was come up with predictive updates directly. The first one that we started off with was a Bayesian version of kernel density estimation. If you're familiar with Bayesian nonparametrics, and a lot of the speakers in this audience are world-leading experts, there's this model called the Dirichlet process mixture model, which is supposed to be the Bayesian version of the kernel density estimate.
Its predictive density is complicated, but what you can do is use that model for inspiration and construct this predictive update directly. In this predictive update, we have two components: we're basically taking a weighted sum.
Your new predictive density, p_{i+1}, is going to be a weighted sum of your old predictive density and this kernel term, which is centered at the new data point. What you can actually show is that if you're very careful with the construction of this kernel, it fits into this martingale framework. So, what does this kernel look like? The column on the left shows you.
You see this black dot; that's the real data point. For a regular kernel density estimate, the kernel would just be a Gaussian bump centered on it. In our setting, as we decrease the bandwidth of our copula kernel,
it actually moves a little bit away. It moves towards this dashed line, which is the old predictive density. There are a lot of technical details that I've swept under the rug; the main idea is that the copula kernel depends on the current estimate. So if you take a weighted sum of the predictive density, p_i, with a copula kernel, you end up with this plot on the right: if the old density is the dashed line, the new density has this extra lump at the new observation that you've seen.
And how spiky the lump is depends exactly on the bandwidth of your copula kernel. So if you look at this plot on the right, it basically just looks like a regular kernel density estimate. Nothing too special. What's interesting is that to get this construction, you don't need to talk about priors or likelihoods. You just talk about an update rule.
It takes a density and a new data point, and updates it to a new density. And if you actually use this algorithm for predictive resampling, which we show is valid and has a lot of nice properties, you can do something quite interesting: you can get posterior distributions over your density.
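A stylized version of the weighted-sum update on a grid, with an important caveat: I've replaced the copula kernel with a plain Gaussian bump (the real construction makes the kernel depend on the current predictive), and the weights α_i = 1/(i+1) are one common choice, assumed here:

```python
import math

def gauss_pdf(y, mu, sd):
    return math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def update_density(p_old, grid, y_new, i, bandwidth=0.4):
    """One recursive update: the new predictive is a weighted sum of the
    old predictive and a kernel bump at the new observation, with weight
    alpha_i = 1/(i + 1). Mixing two densities keeps the result a density."""
    alpha = 1.0 / (i + 1)
    return [(1 - alpha) * p + alpha * gauss_pdf(y, y_new, bandwidth)
            for p, y in zip(p_old, grid)]

# start from a standard normal initial guess and feed data in one at a time
grid = [-5.0 + 0.01 * k for k in range(1001)]
p = [gauss_pdf(y, 0.0, 1.0) for y in grid]
for i, y_obs in enumerate([0.5, -1.2, 0.8, 0.3], start=1):
    p = update_density(p, grid, y_obs, i)
```

Each observation adds a lump whose spikiness is set by `bandwidth`, exactly as in the plot described above; running the same rule on imputed data gives the predictive resampling step.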
The figure on the left shows what happens when you do this. The figure on the right is the Dirichlet process mixture model, where you have to use Gibbs sampling, a type of Markov chain Monte Carlo, to draw samples from the posterior density, and it was pretty slow. But you can see that if you compare the figures on the left and the right, they look pretty similar.
We have this black curve, which is the mean. We have these grey bands, which are the credible intervals representing the statistical uncertainty. And they're pretty comparable, right? They're not exactly the same; there's more uncertainty on the left-hand side in certain regions than on the right.
But all in all, they give a pretty accurate representation. What's neat is that on the left-hand side, I didn't have to use any Markov chain Monte Carlo, I didn't have to talk about priors, and I didn't have to talk about likelihoods. I just started off with this weird copula kernel algorithm.
Of course, there are still very big advantages to working with the Dirichlet process mixture model, but if you're interested in just the density, then maybe it's worthwhile using this kind of update. Okay, so if you look at this copula update here, it's basically just a kernel density estimate. A natural question to ask is what happens if I have covariates. So suppose that I don't just have y_{i+1}, I now have x_{i+1} as well, and I want to model the conditional density of y given x, your traditional regression setup.
The work that I'm looking at now is how to extend this to conditional density estimation. The framework extension is a little bit more complex, but in terms of coming up with a predictive rule, we're actually going to be using a very famous dependent version of the Dirichlet process, actually pioneered by one of the...
So you can model your conditional density with the usual mixture setup, where you have dependence on x in the mixture weights, and maybe also dependence on x in the component densities. And what you can show is that if you construct your prior in a specific way, where you kind of...
If you work out what the predictive rule is, you can use this dependent Dirichlet process mixture to come up with a covariate-dependent version of this previous algorithm here. So, the new conditional density involves a lot more notation, but it's actually just the same algorithm.
The new conditional density is going to be a weighted sum of the old one. The main difference is that the weights, alpha, which in the previous case were just equal to 1/(i+1), are now going to depend on the distance between the covariate you're interested in and the covariates you've observed.
This is what I'm trying to show now. For the copula kernel, you end up with something very similar; it's essentially the same update. But I've left things a little bit vague. The take-home story is that if I want to introduce covariates, I can look at the dependent Dirichlet process literature, look at the predictive rules that those models imply, and hope that it gives me something tractable.
And once you have this algorithm, we can look at more general structures for alpha. There's a very rich history on how to construct these w_j's in the dependent Dirichlet process mixture model. So what I'm looking at now is how I can use this rich history of constructing w's to inspire constructions of these alphas directly.
And you'll see there are actually going to be some really interesting connections to machine learning. But this is future work that I'm working on with one of my students. Now we're going to change gears a little bit and talk about a different kind of object.
On the previous slides, we talked about predictive densities. You can actually work with something that's a little bit more tractable in some cases, which is the quantile function. This is some new work that I did with one of my collaborators, where we introduced a framework for doing martingale posteriors using the quantile function instead of a predictive density.
For those who are not familiar, a quantile function is basically the inverse of a CDF. The best part about a quantile function is that you can use it to generate samples from your unknown distribution: you can just plug in a uniform random variable. So, instead of working directly with the density, we're going to work with the quantile function. The update actually looks very similar to the previous one: your new quantile function is your old quantile function plus this kind of gradient term.
I'm not going to go through the details here, but the main idea is that what you end up doing is smoothing a stochastic-approximation version of the quantile update that you would use for quantile estimation.
The figure on the left shows what this update rule looks like. When you see a new data point, you update the quantile function with this smooth blob, whose width depends on some bandwidth. And then in the figure on the right, you see what happens to your quantile function.
The dotted line is the old quantile function, and when you see this new data point, you basically add a little dimple there at the corresponding quantile level. It's very much akin to a kernel density estimate version of the quantile function, but what we show is that it satisfies the right martingale properties as well. So you can also use this algorithm for predictive resampling,
where you're not perturbing the predictive density anymore; you're perturbing this generalized version of the quantile function.
What this looks like in practice is really nice, because the quantile function can be used directly for sampling. If you remember, we need to draw samples from p(y_{n+1} | y_{1:n}). In this setting, you just need to draw some uniform random variables and plug them into this quantile function, or pseudo-quantile function, that you're estimating.
And you define that to be the distribution of y_{n+1}. So all I need to do for sampling is draw these uniforms, plug them into my quantile function, and then I get a new sample from the distribution that I want. You can actually show, and it's really nice, that it ends up just being a bunch of new weights.
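A stylized sketch of the two pieces, the smoothed stochastic-gradient quantile update and inverse-CDF sampling. The logistic smoothing, step sizes, and grid are my choices for illustration, not the paper's exact construction:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def update_quantile(Q, u_grid, y_new, step, h=0.3):
    """Smoothed stochastic-gradient step on the pinball loss: the hard
    indicator 1{y <= Q(u)} is replaced by a sigmoid, so each observation
    adds a smooth dimple to the quantile function rather than a jump."""
    return [q + step * (u - sigmoid((q - y_new) / h))
            for q, u in zip(Q, u_grid)]

def sample_from_quantile(Q, u_grid, rng):
    """Inverse-CDF sampling: draw U ~ Uniform(0, 1) and read off Q(U)
    at the nearest grid point."""
    u = rng.random()
    k = min(range(len(u_grid)), key=lambda j: abs(u_grid[j] - u))
    return Q[k]

rng = random.Random(2)
u_grid = [(k + 1) / 100.0 for k in range(99)]  # u = 0.01, ..., 0.99
Q = [0.0] * len(u_grid)                        # flat initial quantile function
for t in range(20000):                         # feed in N(0, 1) observations
    Q = update_quantile(Q, u_grid, rng.gauss(0.0, 1.0), step=2.0 / (t + 10))

y_draw = sample_from_quantile(Q, u_grid, rng)  # one draw from the fitted Q
```

Iterating `update_quantile` on draws produced by `sample_from_quantile` is exactly the predictive resampling loop, now in quantile space.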
And if you run this algorithm forward in time, then you get basically the same thing as I showed you earlier: posterior distributions on quantile functions. On the left-hand side, the blue line is the posterior mean, the traditional mean that you're familiar with. The blue bands are the credible intervals for the quantile function, and you can see that as I see more and more samples, the bands decrease in size. You can also use this to construct your posterior distribution over a functional like the mean.
In this case, the θ here will just be the average of your population. And you can see that, again, you get these very nice-looking, reasonable posteriors, which concentrate on the truth as you see more and more observations. But again, this algorithm is completely free of MCMC.
So we have this quantile update here; it's kind of like a stochastic gradient descent version. What's nice, and quite surprising, is that introducing conditioning or dependence structure with covariates is actually way easier with quantiles.
The reason is that quantiles live in a bigger space: these pseudo-quantile functions just need to be in L2. So it's a lot easier to incorporate dependence that way. In our paper, we showed that instead of working with gradient descent for the quantile estimation case, you can look at the quantile regression case, and you end up with this very similar-looking formula here.
Your new quantile regression coefficients are your old regression coefficients plus this kernel term, multiplied by the gradient of your quantile function with respect to the coefficients. If you're working with linear quantile regression, which is the most popular kind, this gradient term is just x. So you end up with basically the same algorithm as before, but multiplied by x.
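The same smoothed stochastic-gradient sketch from before carries over to linear quantile regression, where the update is just multiplied by x; again, the smoothing and step schedule are illustrative choices of mine:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_linear_quantile(xs, ys, u, epochs=40, h=0.3, seed=3):
    """Smoothed stochastic-gradient descent for linear quantile regression
    at level u. The update is the quantile-estimation step multiplied by
    the gradient of x.beta with respect to beta, which is just (1, x)."""
    rng = random.Random(seed)
    b0, b1 = 0.0, 0.0
    t = 0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for j in idx:
            t += 1
            step = 2.0 / (t + 50)
            g = u - sigmoid(((b0 + b1 * xs[j]) - ys[j]) / h)
            b0 += step * g           # intercept component of the gradient
            b1 += step * g * xs[j]   # slope component: multiplied by x
    return b0, b1

# synthetic data with symmetric noise, so the median line is y = 1 + 2x
rng = random.Random(0)
xs = [rng.uniform(0.0, 2.0) for _ in range(1500)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
b0, b1 = fit_linear_quantile(xs, ys, u=0.5)
```

Fitting several levels u with the same smooth update is what keeps the estimated quantile curves from crossing each other, the property discussed next.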
You can actually make this non-linear; it's not too difficult. And when you run this algorithm, you actually resolve one of the long-standing problems of quantile estimation, which is the issue of quantile crossing. It's a little bit technical, so I'm not going to talk about it here, but it's naturally resolved in this framework. And you can then use this algorithm to do Bayesian quantile regression. We compared it to this really nice work, also by Steve.
You can see I take a lot of inspiration from his work, where they looked at a Bayesian version of this kind of quantile regression. They put prior distributions, kind of like the dependent Dirichlet process, but now on quantiles instead. And they have a really nice implementation, which Steve's student was kind enough to share with me.
</aside>