<aside> 📢
This presentation introduced a generalized functional delta method with applications to Bayesian model selection. The research was conducted in collaboration with Dr. Xinguang and Jacob Westraub, a PhD student from the University of Queensland. The work extends traditional delta methods to accommodate random functions in statistical analysis.
The speaker established a probability space framework in which the random variable of interest lives in a metric space X equipped with its Borel sigma algebra and push-forward measure. Within that framework, the talk covered risk-based model selection, surveyed Bayesian information criteria (the BPIC, DIC, and WBIC), and presented the core contribution: a generalized delta method that accommodates random, n-dependent functions. The approach offers several advantages over existing model selection methods.
The work builds on previous qualitative results to establish quantitative convergence rates, enabling the construction of model selection criteria that are both consistent and parsimonious under natural regularity conditions.
</aside>
<aside> 📢
This is a generalized functional delta method with applications to Bayesian model selection. This work could not be done without my very esteemed colleagues who are here with us today.
Dr. Xinguang and Jacob Westraub, our PhD student at the University of Queensland. So let's get started. My favorite starting sentence for any text in statistics is: let the triple (Omega, F, P) be a probability space. You've got some element omega that you're endowing a probability to, you've got the expectation operator, and you have a random variable.
Many people say a random variable must be a real-valued object. I'm just going to assume that it lives in some metric space X endowed with its Borel sigma algebra B(X) and with its push-forward measure P_X. And from that, you're going to observe an IID sequence of data.
And we're going to define two objects that are very common when you're studying empirical processes. The first is this P_n(g) object, which takes the sample average of some bounded function g; the second is this P(g) object, which takes the expectation instead. This expectation might not be measurable, which is why there's a star (an outer expectation) there, but in all of our contexts it's a measurable object.
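The two objects just described can be sketched numerically. This is my own illustration, not code from the talk; the names `Pn` and `P` simply mirror the notation, with X standard normal and g(x) = x^2, so that P(g) = E[X^2] = 1 exactly.

```python
# A minimal numerical sketch of the two empirical-process objects:
# Pn(g) is the sample average of g over the data; P(g) is its expectation.
import numpy as np

rng = np.random.default_rng(0)


def Pn(g, sample):
    """Empirical measure applied to g: the sample average (1/n) sum_i g(X_i)."""
    return np.mean(g(sample))


def P(g, n_mc=1_000_000):
    """Expectation of g under the true law, approximated here by Monte Carlo."""
    return np.mean(g(rng.standard_normal(n_mc)))


g = lambda x: x ** 2
X = rng.standard_normal(5_000)  # the observed IID sequence

print(Pn(g, X))  # sample average, close to 1
print(P(g))      # expectation, close to 1
```

As the sample size grows, Pn(g) converges to P(g); that is exactly the law-of-large-numbers behaviour the empirical-process notation packages up.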
So the star doesn't matter too much. And what we care about is, well, these probability measures P_X: we're going to assume that they at least have a density function with respect to some reference measure M, right? And so we're going to declare some families of density functions f_k, where f_k(theta_k) is just a density function that depends on some extra parameter theta_k, which lives in the parameter space T_k.
Each of these f_k's lives in the set of all possible density functions on your set X with respect to your measure M, which is this curly F here. And we're not going to assume that the density of X lies within the union of all of these f_k families, so misspecification is allowed in this framework. To each of these f_k's we're going to endow some prior density function, little pi_k, that's going to be a positive prior on the entire space T_k, and then from that we're going to endow it further with a posterior measure pi_kn with temperature beta_n, which is what they call a power posterior: you take the product of the prior with the beta_n-th power of the likelihood.
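In symbols, the power posterior construction just described might be written as follows. This is my own rendering of the definition, with temperature beta_n; the talk's exact normalization may differ:

```latex
% Power posterior: prior times the beta_n-tempered likelihood, normalized.
\pi_{k,n}(\theta \mid X_{1:n})
  \;\propto\;
  \pi_k(\theta)\,\prod_{i=1}^{n} f_k(X_i;\theta)^{\beta_n},
  \qquad \theta \in \mathbb{T}_k,\ \beta_n \in (0,1].
```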
So we're going to talk about risk-based model selection, and I'm sorry, this is all very frequentist. In risk-based model selection, you start yourself off with some notion of the loss you incur from choosing the wrong kind of model. That loss is going to be your function L there, which takes some density function f and some data x, and maps them to a real number.
And so your risk is just going to be the expectation of your loss superposition. I'm using this diamond, which is non-standard notation: L superposed with f means you take your loss L and evaluate it in a way that takes into account your density and your data.
And so, from that, you then consider a posterior measure that is consistent. And when I say a posterior measure that's consistent, what I mean is that it converges to a point mass; that's the delta measure, delta_k0 there, a point mass at the location theta_k0.
And it converges to that weakly, which means that the expectation of any bounded continuous function with respect to the left-hand side converges to the corresponding expectation, the integral, on the right-hand side.
So using our two pieces together now, we can define a kind of limiting posterior risk. This limiting posterior risk is the integral of your expected loss object with respect to your limiting measure, right? And this is going to be your target: you want to pick the best model, the one with the smallest limiting posterior risk.
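Written out (in my notation, using the diamond superposition from earlier), the limiting posterior risk of model k is the integral of the expected loss against the limiting point mass, which collapses to the risk at the concentration point:

```latex
% Limiting posterior risk: integrating against the point-mass limit
% reduces it to the risk evaluated at theta_{k0}.
r_k \;=\; \int_{\mathbb{T}_k} P\!\left(L \diamond f_k(\theta)\right)
          \,\mathrm{d}\delta_{\theta_{k0}}(\theta)
      \;=\; P\!\left(L \diamond f_k(\theta_{k0})\right).
```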
I have k here as if k can be any integer, but really this situation reduces down to two: k equals 1 or 2. And so a reasonable interpretation of what a criterion should do is this: if your risk for model 1 is less than your risk for model 2, then you should want the model that's less risky, right? So you should want model 1. And if your risk for model 1 and your risk for model 2 are the same, but model 1 is in some sense simpler than model 2, that is, model 1 depends on fewer parameters than model 2, which is the d_1 < d_2 there, then again, you want the more parsimonious model. You want the simpler model. So again, you'd prefer model 1.
And if the two models have the same risk and the same complexity, then you really don't care; just pick one. So those are the desiderata for what a good model selection criterion should do. And that brings us to the notion of information criteria. Many of you who have done applied statistics have seen information criteria before; things like your AIC and your BIC are these kinds of information criteria. Our information criteria are just going to be these I_kn.
And you want to choose model 1 if the information criterion of model 1 is smaller than the information criterion of model 2. And so our desiderata really consider two properties of these information criteria. Firstly, they should be consistent.
That is, if model 1 actually has smaller risk than model 2, you should choose model 1 with probability approaching 1; the information criterion for model 1 should be less than that for model 2 with probability approaching 1. And you want it to also be parsimonious: if the risks are the same and d_1 < d_2, then again, you want to choose model 1 with probability approaching 1.
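As I read them, the two desiderata can be stated compactly (with r_k the limiting posterior risk and d_k the parameter dimension of model k):

```latex
% Consistency: the less risky model is selected with probability tending to 1.
r_1 < r_2 \;\Longrightarrow\; \Pr\!\left(I_{1,n} < I_{2,n}\right) \to 1.
% Parsimony: at equal risk, the lower-dimensional model is selected.
r_1 = r_2,\ d_1 < d_2 \;\Longrightarrow\; \Pr\!\left(I_{1,n} < I_{2,n}\right) \to 1.
```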
These desiderata are fairly obvious. And so, let's consider some objects that have previously been considered in this area for conducting model selection in the Bayesian framework. When I say Bayesian information criteria, I don't mean the BIC; I mean proper information criteria based on posterior quantities.
So you have this Bayesian predictive information criterion, where you take the integral of the average log-likelihood with respect to your posterior and then add a penalty term, and the ever-present deviance information criterion of Spiegelhalter, where you take minus two times the average log-likelihood with respect to your posterior, and then you add a correction term where you take the log-likelihood evaluated at the average of your posterior measure.
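For reference, the deviance information criterion of Spiegelhalter et al. in its standard form (up to an affine rescaling) matches this description: minus twice the posterior-averaged log-likelihood, plus a correction evaluated at the posterior mean. Expanding DIC = D-bar + p_D with D(theta) = -2 log L_n(theta) gives:

```latex
% DIC = \bar D + p_D, which expands to:
\mathrm{DIC}
  \;=\; -4\,\mathbb{E}_{\pi_{k,n}}\!\left[\log L_n(\theta)\right]
        \;+\; 2\log L_n\!\left(\bar\theta\right),
  \qquad \bar\theta = \mathbb{E}_{\pi_{k,n}}[\theta].
```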
And so in our previous work, what we did last year, we looked at just the almost sure pointwise convergence of these criteria.
We know that under some reasonable conditions, the integral over your average loss functionals with respect to posteriors that converge almost surely weakly, do end up converging to their limiting values. And so, this is not only true for these posterior objects, it's true for any classes of objects.
And this makes use of some well-known but really hidden results in the statistics literature, mostly from Serfozo's 1982 paper in Sankhyā, but results that have also been rediscovered since.
For the proof that we put together, the most general conditions that we assume are: that your parameter space is hemicompact, that is, locally compact and sigma-compact, so you've got nested compact sets that eventually cover your space; that your losses are Carathéodory, measurable in the argument that needs to be measurable and continuous in the argument that needs to be continuous; that uniform strong laws of large numbers hold; and what's called asymptotic uniform integrability, which is a natural asymptotic analogue if you've seen uniform integrability in the past.
These conditions are all pretty standard in the literature.
Using that result, we were able to show that under some pretty weak conditions, the WBIC and the Bayesian predictive information criterion converge to the limiting posterior risk; and if the norm of your parameter is asymptotically uniformly integrable with respect to your set of measures, then the DIC also converges to the same limit as well. So all of these criteria will be consistent, in my definition of consistent, for the risk based on the negative log-likelihood loss function.
So they will be consistent if that's what you're targeting with your model selection criterion. So then the question is: can we construct consistent criteria that are also parsimonious? And to construct parsimonious criteria...
We can't just rely on qualitative convergence; we also now need some quantitative rate results. And so Sin and White, in their Journal of Econometrics paper, gave conditions on a penalized loss object, a penalized risk object, that produce a parsimonious model selection criterion. And so, in my paper in ANZJS and in Jacob's work that was in ICML, we kind of expanded upon this observation of Sin and White.
Using this expanded observation from Sin and White in the context of these posterior average loss functionals, we end up with two criteria for determining consistent and parsimonious model selection. That is, you want some sequence tau_n such that, when scaled by tau_n, your integrated average loss object stays bounded in probability around its limit.
And if d_1 < d_2, then you want your scaling of penalty 2 minus penalty 1 to go to infinity. If you have these two things, you will end up with model selection consistency and parsimonious model selection. And it's a very simple argument; it's just tedious, but very simple.
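In symbols, the two conditions just described might be written as follows. This is my rendering, taking I_kn to be the posterior-averaged empirical loss plus a penalty p_kn:

```latex
% (i) Rate condition: the scaled, centered criterion is bounded in probability.
\tau_n\!\left( \int_{\mathbb{T}_k} P_n\!\left(L \diamond f_k(\theta)\right)
   \mathrm{d}\pi_{k,n}(\theta) - r_k \right) = O_P(1), \qquad k = 1, 2.
% (ii) Penalty condition: at equal risk, the penalty separates dimensions.
d_1 < d_2 \;\Longrightarrow\; \tau_n\!\left(p_{2,n} - p_{1,n}\right) \to \infty.
```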
So, to that end, we're going to need something that provides us with a rate of convergence. And the way we're going to get our rate of convergence is via the very classical Jain–Marcus central limit theorem. The Jain–Marcus central limit theorem, in our context, just says that if we take our spaces T_k to be compact sets, the losses have a variance at some location theta*, and the loss superposed with the density functions is Lipschitz in the sense of (0.3) there, with a suitably integrable Lipschitz constant,
then what you end up with from the Jain–Marcus central limit theorem is that root n times your average loss minus its expected value converges to some Gaussian process.
And it's going to be a continuous Gaussian process in this case as well. But that doesn't matter; the continuity doesn't matter.
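The conclusion, as I understand it, is a uniform (functional) central limit theorem over the compact parameter space:

```latex
% Functional CLT in C(T_k): the empirical loss process converges weakly
% to a continuous Gaussian process G_k at the root-n rate.
\sqrt{n}\left( P_n\!\left(L \diamond f_k(\theta)\right)
             - P\!\left(L \diamond f_k(\theta)\right) \right)
  \rightsquigarrow G_k
  \quad \text{in } \mathcal{C}(\mathbb{T}_k).
```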
The only thing that matters to us here is that it does, in fact, converge to some Gaussian process. And using that Gaussian process limit, we're going to turn it into the limit theorem that we want. So we're going to use a delta method. A delta method just says that if I have some sequence Z_n, right, and it converges to some mu in distribution at a rate of tau_n, and if I have a notion of a directional derivative d, then...
The derivative is the direction of movement: when you're standing at mu, you're looking in the direction of eta_0 (not mu_0).
Then the weak convergence passes through a functional g that has that directional derivative: you take g of Z_n minus g of mu, scale that by tau_n, and that now converges to the distribution determined by the transformation of our directional derivative, standing at mu, looking in the direction of Z.
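That is the classical statement: weak convergence at rate tau_n passes through a directionally differentiable map,

```latex
% Classical (Hadamard-type directional) delta method:
\tau_n\left(Z_n - \mu\right) \rightsquigarrow Z
\;\Longrightarrow\;
\tau_n\left(g(Z_n) - g(\mu)\right) \rightsquigarrow \mathrm{d}g_{\mu}(Z),
% where dg_mu denotes the directional derivative of g at mu.
```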
When we take these integrals of some function g with respect to our posterior measure pi_kn, that's a functional map for which there might be a notion of a derivative, right? And if there is, then maybe we can pass the weak convergence through it. And it wouldn't be much of a talk if the answer to that question were no. So, in fact, we do have a delta method that allows us to do this. And this is the generalized delta method that we proved, which is kind of the main point of this talk. Suppose you have a metric space D, and you have two metric topological vector spaces E and F; that is, vector spaces where you can add the vectors and scale them by scalars, but with a topology endowed by some metric.
It doesn't have to be a norm; it just has to be some metric. You take g_n and h_n to be maps from D × E to F, and you have some sequence tau_n that goes to infinity. And now you have some sequence gamma_n in D and some sequence eta_n in E, such that gamma_n converges to gamma_0 in the topology of D, and eta_n converges to eta_0 in the topology of E.
And you have some centering mu. Then you can define a kind of derivative object, d, located at mu in the direction of (gamma_0, eta_0), the limits of your sequences gamma_n and eta_n. It's defined just the way you would think it's defined: you evaluate g_n at gamma_n and at mu plus the direction eta_n divided by your tau_n, scale by tau_n, and then subtract the centering term h_n.
If this object has a limit, then you can now pass your weak convergence through this derivative object, in the same way that you pass your limit through any other delta method. This looks complicated, but if I were to lock you all in a room and tell you that you can't escape until you prove this result, you'd all be out in about five minutes. It's not that complicated a result to prove.
But for us, it's going to be a powerful result, and a useful result. So, why do I call this a generalized delta method? So far, I'm just inventing this notion of derivative. Why does this generalize the delta method? Well, it generalizes the delta method in two steps.
So, if we look at Jacob's work in 2024, he looked at the situation where you don't have this additional gamma object. And if g_n and h_n are just some fixed function g, well then you just get your standard delta method from, for example, Römisch, which is just your usual directional derivative delta method. And of course, if you let E and F be Euclidean spaces and you let g be properly differentiable,
then you just get back the usual delta method that you're all used to applying in applied work. So this is a generalization of the delta method. And what does it look like when you're actually using it in the context of this model selection? Just identify D with your space of probability measures on T_k, identify E with your continuous functions on T_k, and F is just the reals. Well then, you can now identify g_n and h_n
with the integral maps taking (q, eta) to the integral of eta dq. You identify your sequence gamma_n as the sequence of weakly converging measures q_n converging to some q_0; weak convergence has an endowed topology, it's fine, it has a metric on it, it fits within our framework. And eta_n converges to eta_0, which again has a topology, so you're allowed to do that.
At no additional cost of assumptions, this limit will just be equal to the integral of the limiting direction eta_0 with respect to your limiting measure. And that's going to end up being your notion of a derivative. And if you pass this derivative through the weak convergence that you have from the Jain–Marcus central limit theorem, you get this first object up there, which is not quite what you want.
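A toy numerical check of that derivative heuristic (entirely my own construction, not code from the talk): take measures q_n converging weakly to a point mass and directions eta_n converging uniformly to eta_0, and watch the integral functional converge to the integral of eta_0 against the limiting measure.

```python
# Measures q_n = N(0, 1/n) converge weakly to the point mass at 0, and
# directions eta_n(t) = t^2 + 1/n converge uniformly to eta_0(t) = t^2.
# The functional (q, eta) -> integral of eta dq then converges to
# eta_0(0) = 0, the integral against the limiting point mass.
import numpy as np

rng = np.random.default_rng(1)


def integral(eta, draws):
    """Monte Carlo approximation of int eta dq using draws from q."""
    return np.mean(eta(draws))


for n in (10, 1_000, 100_000):
    eta_n = lambda t, n=n: t ** 2 + 1.0 / n             # direction, tends to t^2
    draws = rng.normal(0.0, 1.0 / np.sqrt(n), 200_000)  # draws from q_n = N(0, 1/n)
    print(n, integral(eta_n, draws))  # tends to 0 as n grows
```

The point of the exercise is that both the measure and the integrand are allowed to move with n, which is exactly the situation the generalized delta method is built to handle.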
What you want is the second object, which is you want it to be the integral of both things, and then minus the thing with no longer anything to do with n. Your first object has something still to do with n. Your second object is what you want. It has nothing to do with n.
So you need the bridging object between the two, which is the second bit of the first object minus the second bit of the second object, and you need that to converge at the same rate. And so we're halfway there. Now we need to get something that gets us the third line, right? And it turns out that...
If you assume that your limiting expectation is Lipschitz, then what happens is, if you take pi_kn, just using the notation now as an expectation with respect to some function t, and we just let theta be the identity map, because I hate writing id for the identity map,
then it follows from your Lipschitzness that what you end up with is the bound there. The second term is the difference between your posterior expectation and the truth, and the first term is really the variance of your posterior distribution.
If you substitute those two into the middle inequality, it turns out that a really neat sufficient condition for all of this is to assume that your posterior measure has a kind of Bernstein–von Mises property; that is, it converges to a limit that is the normal density function. And it has a variance condition too: not just that the limit goes to your Gaussian density, but also that the posterior second moment of your parameter converges. Sorry, it shouldn't be O_P(1); it converges to 0, so that should be little-o_P(1), not big-O_P(1). Then it follows by some standard computation that we have the little-o_P(1) that we need. So what that means is, you now have a way of verifying whether or not your model selection criterion is both consistent and parsimonious. That is, you need weak convergence of your posterior measure, you need a Jain–Marcus central limit theorem,
you need Lipschitzness, you need this convergence of your posterior mean and the variance of your posterior distribution, and then you can construct your penalty in a way that satisfies (0.5). And what does that penalty look like? Take it to be of the form log n over square root of n, multiplied by the dimension of your parameter space T_k, and then you get your model selection consistency. So I call that the penalty term there; I think it works well, and it deserves to be more widely used in practice.
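Putting the pieces together, the criterion described here would, on my reading, look like:

```latex
% A consistent and parsimonious information criterion: posterior-averaged
% empirical loss plus a (log n / sqrt n)-scaled dimension penalty.
I_{k,n} \;=\; \int_{\mathbb{T}_k} P_n\!\left(L \diamond f_k(\theta)\right)
              \mathrm{d}\pi_{k,n}(\theta)
          \;+\; \frac{\log n}{\sqrt{n}}\, d_k.
```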
And so you might ask, okay, what have you done here? Have you really improved the situation? We already have information criteria. Some of them are even consistent, right? And so, for example, if you take the WBIC, well, then you follow Watanabe's proof of the WBIC, and it turns out that under really the same conditions, the WBIC is no different to the BIC, right?
And we know when the BIC is consistent. We know it's consistent from work by Nishii, who is now at Kyushu, actually. We know the BIC is parsimoniously consistent under the condition that, when the risks are the same, the difference between your sample average at the truth of model 1 and your sample average at the truth of model 2 is bounded in probability at a rate of n. And this condition is really difficult to verify; it's really only easily verifiable when your two classes of functions are nested within one another, and outside of that it's quite difficult. So we now have a condition that doesn't need this additional nestedness verification.
Of course, you can go beyond this. Your loss function doesn't have to just be about your density function. Your loss function can be about anything. It really doesn't require too much of a structure. And your measure doesn't necessarily have to be your posterior measure, it could be any measure that converges consistently to some point mass. So for example, you can play a game of doing fiducial model selection, for example.
So just to conclude, we now have a very general delta method that allows random functions that also depend on n to be passed through the derivative. We've used that to construct very general, consistent, and parsimonious model selection criteria that really only require very natural kinds of regularity conditions, that have their place in the setting of model selection, that offer something current model selection criteria in this setting don't offer, and that generalize beyond the likelihood setting.
I just want to point out some references. I guess the reference that I really want to point out is the Nguyen reference, which is where we developed the qualitative results that we've now extended to the quantitative results. And the Westerhout 2024 reference, which, among other things regarding the minimum expected empirical risk, also established an earlier version of the delta method that we proved here.
Thank you very much.
Thank you again for a wonderful talk.
Thank you for that great theoretical talk. So maybe the first question is just to double-check that you need this new concept of, sort of, a new type of derivative. I was guessing maybe you didn't need it, because the space is now the space of probability measures. Yeah, so the reason why we need it is because the integral depends on the probability measure, which is random. So we need it to incorporate the randomness.
And also, the centering depends on n as well. So we have to have randomness and n-dependence in the centering.
And the second one is maybe a bit of a technical question, but do we still have to worry about the derivative being, you know, technically linear?
No, because it's already a directional derivative; it doesn't need the usual linearity and continuity.
I just have a clarification question about your beta n argument. Is it a robustness argument? It's a sufficient decrease.
Do you have an example where...
It is very hard to prove directly the consistency of the model selection method.
And that's where maybe your theory is coming into play.
</aside>