Paper

Summary

<aside> 📢

Overview of Bayesian Online Learning

The presentation discussed a mathematical theory for Bayesian online learning in the context of Bayesian statistics. The approach focuses on standard statistical models parameterized by θ in Euclidean space, where data arrives sequentially in mini-batches (D₁, D₂, ..., Dₜ). Once a mini-batch is processed, it is discarded - a standard assumption in online learning frameworks. The Bayesian paradigm is philosophically well-suited for this sequential data processing, as posterior distributions naturally become priors for subsequent data.

Theoretical Framework

Key Challenges

Main Results

Limitations and Future Work

Technical Approach

Transcript

<aside> 📢

Thank you to everybody who organized this nice summer school. I'm always happy to give a talk at this summer school. Today, I'm going to talk about some mathematical theory for Bayesian online learning. Actually, I started studying Bayesian online learning during my Ph.D. program, and since then, I have wanted to solve this problem. However, at that time, I found that this problem is hard.

The main theorem I want to prove is simple and very intuitive in its statement, but the proof technique for handling these kinds of theorems is quite difficult, I think. So I wanted to prove this theorem during my PhD program, but at that time I didn't have enough technique.

For the last few years, my understanding of high-dimensional parametric models has drastically improved, so I was able to produce some nice theoretical results, which I will present in this talk. This is joint work with my PhD students, Jae-yong Lee and Jun-hyuk Choi. Actually, both Jae-yong and Jun-hyuk attended this summer school last year. I remember that Jae-yong gave a poster talk about this.

Actually, compared to last year, we have some technical advancements. So, I'm going to start. The setup is quite simple. I'm going to focus on some very standard statistical models parameterized by theta. This is a Euclidean parameter, p-dimensional.

Basically, in a very standard statistical framework, we have observations y1 to yn. Based on these observations, we want to make some inference about the parameter theta. Basically, we are going to assume that given theta, y1 to yn are i.i.d. with parametric density f_theta.

This is the basic Bayesian paradigm. Pi_0 will be the notation for our prior distribution on theta. We know that the prior distribution just represents the degree of belief on the unknown parameter before observing the data. After observing the data, we can update this belief.

We know that the posterior distribution represents the updated belief on the unknown parameter theta.

Every Bayesian inference is based on the posterior distribution. For example, we can use the posterior mean for point estimation, or for uncertainty quantification, we can utilize a credible set. But in most cases, except for some very simple conjugate models, the posterior distribution itself is not directly available because of the computational difficulty. So computation is a big issue in Bayesian statistics.

The two most important tools for computing the posterior distribution are MCMC and variational inference.

So, this was a very standard Bayesian setup. We know that the posterior distribution represents the degree of belief on theta after observing the data. So the posterior distribution itself can be used as a new prior for future data. This is the basic idea of the Bayesian sequential updating paradigm.

For example, our future data can come from the same population as our observations. In this case, it is very natural to use our posterior as a new prior. For another example, if the new future data comes from a slightly different population, then even in this case, if the new population shares some similarity with our training data, we can utilize our posterior distribution to construct some new prior.

Because it contains some information about our training data. So, in principle, once we have a prior distribution on theta, and we have new data, then our belief on theta can be updated based on the prior and the new data. This is the basic idea. In this talk, we are going to talk about an online learning setup. We will assume that the entire data D, which will be denoted as boldface D, is divided into several mini-batches, which will be denoted as D1, D2, ..., DT.

These mini-batches will be available sequentially. We do not observe the entire data at once. For mathematical simplicity, we will assume that each mini-batch has the same sample size, which will be lowercase n. The total sample size is capital N.

And capital T is the total number of mini-batches. We will assume that once a mini-batch has been used for inference, we cannot reuse it for analyzing the next mini-batch; just assume that it is discarded. This is a very standard assumption in many online learning frameworks. Actually, there are a lot of works on online learning, both in the frequentist literature and the Bayesian literature.

I think that philosophically, the Bayesian method is very well suited for online sequential data. If you look here, we usually start from an initial prior distribution denoted by pi_0. After observing the first mini-batch D1, we can update our initial prior into the first posterior, pi(. | D1). This will serve as a new prior for the second mini-batch. So when our second mini-batch D2 arrives, we can update the previous prior into a new posterior.

And this can be repeated recursively. If you look at the formula for the full posterior, it is the product of the prior density and the full likelihood. The first components here form the updated prior, and the last term plays the role of the new likelihood. So philosophically, I think the Bayesian paradigm is very nice for analyzing sequential data.
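For a conjugate model this factorization is exact, so sequential updating reproduces the full-data posterior with no approximation error. A minimal check in Python, using a hypothetical Beta-Bernoulli example (not a model from the talk):

```python
import random

# Beta(a, b) prior on a Bernoulli success probability: conjugacy makes the
# sequential update exact, so "posterior of one batch = prior of the next"
# reproduces the full-data posterior with no approximation error.
def update(a, b, batch):
    s = sum(batch)                       # number of successes in the batch
    return a + s, b + len(batch) - s

random.seed(0)
data = [random.random() < 0.3 for _ in range(100)]
batches = [data[i:i + 10] for i in range(0, 100, 10)]   # T = 10 mini-batches

a_seq, b_seq = 1, 1                      # initial prior Beta(1, 1)
for batch in batches:                    # posterior becomes the next prior
    a_seq, b_seq = update(a_seq, b_seq, batch)

a_full, b_full = update(1, 1, data)      # batch update on the entire data

assert (a_seq, b_seq) == (a_full, b_full)
```

Outside the conjugate case this identity breaks, which is exactly where the approximation schemes discussed next come in.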

However, the problem of Bayesian analysis is that we always have some problem with computation. The previous paradigm is quite elegant philosophically, but it is practically not implementable in many cases because the posterior computation is difficult.

For example, if we consider a non-conjugate prior, then the posterior distribution is not available in a closed form. So the paradigm, nice as it is philosophically, cannot be implemented exactly. Maybe in this case, we can employ some approximation techniques.

So in practical Bayesian online learning, we approximate the obtained posterior distribution in some way. Here, I'm going to use the notation Q for the approximating family. This is a family of tractable distributions, for example. We choose a suitable distance d. For example, the first posterior distribution can be projected onto this family.

So d can be any distance, such as the Kullback-Leibler divergence. If you want, you can choose another distance. So, once we have an approximate posterior distribution pi_1, which will be easy to handle computationally, this approximate posterior distribution can serve as a new prior in the next step.

In general, most Bayesian online learning algorithms can be understood in this framework. Of course, the most important thing in practice is to choose the distance d and the approximating family Q. Of course, there are some general frameworks. For the distance, the most popular choice is the Kullback-Leibler divergence, in both directions.

The two directions correspond to variational approximation and expectation propagation. The choice of family and the choice of optimization algorithm together determine the statistical and algorithmic efficiency of Bayesian online learning. In this talk, I'm going to focus on the variational approximation because it is one of the most popular computational tools.

So, now suppose that pi_{t-1} is our prior belief distribution at time t. And suppose that we observe the new mini-batch D_t. Then, using the Bayes rule, we will have the following updated posterior distribution. And this guy will now be approximated again using the variational approximation.

This inductive step can be repeated many times. This forms our online learning framework. So here is the illustration. We start from pi_0 and observe the first mini-batch D1. Then, by Bayes' rule, we have the posterior distribution pi tilde 1. And this guy will be approximated

using the variational approximation, so it will be denoted as pi_1. And this will be repeated many times. After this procedure, we obtain the sequence of posterior distributions pi_0, pi_1, ..., pi_T. Once our initial prior itself is contained in the family Q, then all of these, from pi_0 to pi_T, belong to the family Q.
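The recursive scheme above can be sketched in code. The talk does not pin down a concrete model here, so this sketch assumes a Bernoulli data stream in the logit parameterization with a N(0, 1) initial prior, and uses a Laplace step (Newton iterations to the MAP, curvature as the precision) as a computable stand-in for the Gaussian projection at each step:

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def gaussian_update(mu, prec, batch):
    """One step of the recursion: combine the Gaussian 'prior' N(mu, 1/prec)
    with a Bernoulli mini-batch likelihood (logit parameterization) and
    return a Gaussian approximation of the resulting posterior, obtained
    here by a Laplace step: Newton iterations to the MAP, then
    precision = curvature (negative Hessian) at the MAP."""
    s, n = sum(batch), len(batch)
    theta = mu
    for _ in range(50):                      # Newton iterations to the MAP
        p = sigmoid(theta)
        grad = s - n * p - prec * (theta - mu)
        hess = -n * p * (1.0 - p) - prec
        theta -= grad / hess
    p = sigmoid(theta)
    return theta, prec + n * p * (1.0 - p)

random.seed(1)
data = [random.random() < 0.7 for _ in range(1000)]  # stream with p = 0.7

mu, prec = 0.0, 1.0            # initial prior pi_0 = N(0, 1) on the logit
for i in range(0, len(data), 50):            # mini-batches of size n = 50
    mu, prec = gaussian_update(mu, prec, data[i:i + 50])

p_hat = sigmoid(mu)            # online estimate on the probability scale
```

With mini-batches of size 50, the final `p_hat` tracks the batch MLE (the sample mean) closely; the precision grows with each batch, mirroring the increasingly informative prior discussed later in the talk.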

This is our object for mathematical analysis.

Mathematically, what we want to know is whether the recursively updated and approximated posterior distribution, in particular the final distribution pi_T, will be sufficiently close to the full posterior distribution. We have a lot of theories about the full posterior distribution, so we know that the full posterior distribution is very good.

For point estimation and uncertainty quantification, there are a lot of nice theories justifying the full posterior distribution. So we hope that our approximated online posterior is sufficiently close to the full posterior distribution. However, the problem is that we are applying an approximation many times.

Whenever a new mini-batch arrives, we apply some approximation technique, so the approximation errors will accumulate. The key is to handle this accumulated approximation error. So far, there is no theory, so we don't know how large the accumulated error can be, or how rich the family Q should be, or how large the number of updates T can be. The goal of my theory is to provide some theoretical justification for these questions.

In principle, this looks simple, but mathematically handling it is not easy. Since there was no theory, we started from a very standard parametric setting. We are going to assume that our underlying model is a regular parametric model. That means that the log-likelihood is smooth in theta.

Everything behaves well in this kind of regular model. For students who are not familiar with regular parametric models, you can just imagine something like a GLM, such as the logistic regression or the Poisson regression model. For mathematical analysis, we are going to make frequentist assumptions. Suppose that there exists a true parameter generating the data. We are considering that all data come from the same population, so there is only a single true distribution.

We are going to assume that the model is well-specified. Under this situation, we first have to characterize the approximating family Q. There are a lot of candidates for Q. There is a famous theorem in the Bayesian community: if the model is regular and the parameter dimension p is fixed, then with any prior, under some very mild assumptions, the posterior distribution converges to a Gaussian distribution. This is the well-known Bernstein-von Mises theorem, or in machine learning it is often called the Bayesian central limit theorem.

Anyway, in summary, the posterior distribution converges to some Gaussian distribution; that is the Bernstein-von Mises theorem. Based on this theorem, we can guess that the Gaussian family will be a very nice approximating family. So we are going to take Q to be the class of all p-dimensional Gaussian distributions.

And for this Bernstein-von Mises theorem, in very recent theory, there are a lot of extensions, in particular for the divergent-p situation. This high-dimensional Bernstein-von Mises theorem roughly implies that if the mini-batch size is sufficiently large compared to the dimension of the parameter,

If the number of mini-batches is not too large, then intuitively the approximation error will not be large. So we will obtain that the full posterior is close to the approximated posterior. This is intuitively clear, but the rigorous justification is not simple, not straightforward even when T equals 1. If T is 1, it is just batch learning. Even in batch learning, for this argument, we have to prove that the variationally approximated posterior distribution is sufficiently close to the full posterior distribution in some sense.

The Kullback-Leibler divergence is particularly difficult to handle in theory because we have to consider the tail parts of the densities. So, I would like to mention that this justification is not simple even in the case of T = 1, that is, in batch learning. Anyway, our theoretical goal is to establish some sufficient conditions under which the previous approximation works well.

In particular, we would like to express the sufficient condition in terms of the mini-batch size n, the dimension p, the number of mini-batches T, and quantities describing the model regularity.

Actually, we have some nice results, though they are quite complicated. I don't know if our approach is the best way to attack this problem. But anyway, there are a lot of technical challenges. Here are some examples. For example, in online learning, the prior becomes increasingly informative over time.

Because it has to accumulate information from the previous data. If T is large enough, then the prior at the T-th step must be informative. That is one critical difference from the existing Bayesian theory. Most existing Bayesian theory assumes that the prior is highly non-informative.

So that's one of the key challenges. Another one is that since we are using the variational approximation, we utilize the Kullback-Leibler divergence, and the Kullback-Leibler divergence is technically difficult to handle

because of the tails of the densities. And finally, this is actually the most difficult part: we have to handle the accumulated approximation error. Technically, we would like to provide a very sharp theory. These are the key challenges in our problem.

I cannot explain all the details, but I'm going to show some of the main results. Here are some notations. Our approximated posterior distribution at the t-th step will be denoted as pi_t. Since we are considering Gaussian distributions, we only need to characterize the mean and variance.

The mean is mu_t, and the precision matrix is Omega_t. And l tilde_t is the posterior density at the t-th step: the first term is the log-likelihood, and the second term is the prior at the t-th step. Theta hat_t is the maximizer of the t-th posterior density, so theta hat_t is nothing but the MAP estimator, and theta star_t is something like a surrogate of the true parameter.

This is the Fisher information matrix of the log-likelihood, and this is the Fisher information plus the prior information. Now, some important techniques we have used. The key technique here is the Laplace approximation. I believe this is a Bayesian community, so everyone knows what the Laplace approximation is. Last year, I was chatting with a close friend of mine, who is a very nice statistician.

I was surprised when she told me that she had never heard about the Laplace approximation. I believe that some frequentists have no chance to study the Laplace approximation. But anyway, for Bayesians, the Laplace approximation is a key tool in both computation and theory.

Anyway, we are going to utilize the Laplace approximation. To understand the sharp theory of the Laplace approximation: in the Laplace approximation, the center is the MAP, so we have to know the behavior of the MAP estimator. This is based not on very standard theory, but on recent work.

Spokoiny uses very old notation, and sometimes I don't even know how to read his notation. But anyway, he has a very nice theory, particularly for high-dimensional parametric models. In his papers, an analysis of the likelihood ratio is given. So basically, in many of Spokoiny's papers, they provide a lot of very nice quadratic approximations

of the log-likelihood ratio: very sharp theory in terms of the dimension, the sample size, and the model regularity. Based on that sharp likelihood theory, we can prove that the MAP estimator converges to the true parameter in some sense. Typically, the right-hand side is the rate of convergence.

In classical theory, the rate depends on the sample size and the dimension, but Spokoiny represents the convergence rate in terms of the effective number of parameters, defined by this trace term and some operator norms of the Fisher information matrix. It incorporates the sample size, the dimension, and the model regularity. It is a very nice quantity, I think: it represents

the effective number of parameters. Based on this concentration theory for the MAP estimator, one key tool is the Laplace approximation. As I told you, Spokoiny is one leader in high-dimensional Laplace approximation, and another very nice line of work is by Anya Katsevich.

Whenever I read their papers, I see that they communicate with each other: when a new paper is published, they adopt each other's techniques and improve the conditions. Using these techniques, we could obtain some very nice high-dimensional Laplace approximations. This is a very simplified version of our main theory, but anyway, the error between the posterior distribution and its Laplace approximation can be bounded in this way.

The important thing in this Laplace approximation is that the prior information is incorporated here. So if we are using a more informative prior in the Laplace approximation, then the approximation becomes better. Furthermore, in the existing literature, the Laplace approximation is considered with respect to the total variation metric. However, as you can see here, our Laplace approximation theory is in terms of the Kullback-Leibler divergence.
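The qualitative claim here, that the Laplace approximation error shrinks as the posterior becomes more informative, can be checked numerically in a toy one-dimensional case. This is only an illustration, not the talk's theorem: it assumes a Bernoulli likelihood in the logit parameterization with a flat prior, and evaluates KL(posterior || Laplace) by grid integration:

```python
import math

def kl_posterior_vs_laplace(s, n, half_width=10.0, m=4001):
    """KL(posterior || Laplace) for a Bernoulli likelihood in the logit
    parameterization (s successes out of n, flat prior), evaluated by
    numerical integration on a grid of +/- half_width posterior sds
    around the MAP."""
    theta_map = math.log(s / (n - s))        # MAP = logit of s/n
    prec = n * (s / n) * (1.0 - s / n)       # observed information at the MAP
    sd = 1.0 / math.sqrt(prec)
    xs = [theta_map + sd * half_width * (2.0 * i / (m - 1) - 1.0)
          for i in range(m)]
    dx = xs[1] - xs[0]
    # normalized posterior density on the grid
    log_lik = [s * t - n * math.log1p(math.exp(t)) for t in xs]
    mx = max(log_lik)
    w = [math.exp(l - mx) for l in log_lik]
    z = sum(w) * dx
    post = [v / z for v in w]
    # Laplace approximation: N(theta_map, 1/prec)
    lap = [math.exp(-0.5 * prec * (t - theta_map) ** 2)
           / (sd * math.sqrt(2.0 * math.pi)) for t in xs]
    return sum(p * math.log(p / q) * dx for p, q in zip(post, lap) if p > 0.0)

kl_small = kl_posterior_vs_laplace(7, 10)      # n = 10
kl_large = kl_posterior_vs_laplace(700, 1000)  # n = 1000
```

Here `kl_large` comes out far smaller than `kl_small`, consistent with the point that a more concentrated (more informative) posterior makes the Gaussian approximation better, measured in KL rather than total variation.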

So, based on the previous Laplace approximation, we can prove that the variational approximation will be good. The variational approximation tries to find the best approximator in terms of the Kullback-Leibler divergence, and the previous Laplace approximation result says that there exists a very nice approximator in terms of the Kullback-Leibler divergence.

So the variational approximation is not technically very difficult once we have a nice Laplace approximation. Building upon the previous theory, this is a very simplified version of our main theorem. On the left-hand side is the total variation distance between the full posterior distribution and the online posterior distribution at the final stage.

This can be bounded by the quantity on the right-hand side, where p is something like the effective number of parameters. Anyway, we need a lot of regularity conditions. For example, in classical Laplace approximations, we only need the second derivatives. But in high-dimensional Laplace approximations, it is important to handle the remainder term in the Taylor expansion.

In particular, we need to control the ratio between the second derivative and the third and fourth derivatives. Some of these are strong assumptions. For example, we assume that the log-likelihood is four times continuously differentiable. These are typical in high-dimensional Laplace approximation. Anyway, many standard examples, such as logistic and Poisson regression, satisfy them.

This holds as long as the true parameter is not too large or too small. If the parameters are very large or very small, there is some singularity in the model, and that causes problems. Anyway, we need a lot of assumptions, and the assumptions depend on the model.

Actually, handling a non-Gaussian model depends on the model severely, so the assumptions depend on the model. Roughly, we need p cubed to be smaller than the mini-batch size n. This is the most important implication of our main theorem. Consider these two types of confidence or credible sets. The first one is a simple Wald-type confidence set based on the full MLE.

The second one is the credible set based on the online posterior distribution. Based on our main theory, these two sets will be asymptotically very similar. That means that the credible set based on the online posterior distribution achieves asymptotic efficiency and optimality in a nice classical sense.

Here we have a very simple numerical result. This is quite simple but interesting, I think. We generate the samples y_i from a very simple Bernoulli distribution. The total sample size, capital N, is 1000. This is our initial prior, and we consider different mini-batch sizes.

This is the result for the relative efficiency of point estimation. Relative efficiency is with respect to the batch MLE. The x-axis is the number of samples used, and the y-axis is the relative efficiency.

If our estimator is efficient, then the relative efficiency will be 1, and a large relative efficiency means that it performs poorly. Our theory says that once the mini-batch size is larger than a certain threshold, we can ignore the accumulated approximation error.

As you can see here, in this simple example, once the mini-batch size is larger than 4 or 6, the online estimator's performance is almost the same as the batch MLE's. If you look at the case of mini-batch size 1, then as the number of samples increases, that is, as the number of mini-batches increases, the accumulated error also increases, so the relative efficiency deteriorates.

Of course, this is based on the estimator introduced in the previous slide. There are some other nice ways to obtain an efficient estimator, even when the mini-batch size is just 1. But in this talk, we have considered only the variationally approximated posterior distribution.
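The experiment is only loosely specified in the talk, so here is a rough sketch under assumed settings (true p = 0.3, N = 1000, a N(0, 1) initial prior on the logit, and a Laplace step standing in for the variational update) that estimates the relative efficiency of the online estimator against the batch MLE:

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def online_estimate(data, batch_size, mu=0.0, prec=1.0):
    """Online Gaussian-approximated posterior mean of a Bernoulli parameter
    (logit scale), one Laplace/Newton update per mini-batch, as a stand-in
    for the talk's variational update."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        s, n = sum(batch), len(batch)
        theta = mu
        for _ in range(50):                  # Newton iterations to the MAP
            p = sigmoid(theta)
            theta -= ((s - n * p - prec * (theta - mu))
                      / (-n * p * (1.0 - p) - prec))
        p = sigmoid(theta)
        mu, prec = theta, prec + n * p * (1.0 - p)
    return sigmoid(mu)

def relative_efficiency(batch_size, reps=200, N=1000, p_true=0.3):
    """Monte Carlo estimate of MSE(online estimator) / MSE(batch MLE)."""
    random.seed(42)
    se_online = se_mle = 0.0
    for _ in range(reps):
        data = [random.random() < p_true for _ in range(N)]
        se_online += (online_estimate(data, batch_size) - p_true) ** 2
        se_mle += (sum(data) / N - p_true) ** 2
    return se_online / se_mle

re50 = relative_efficiency(50)   # should be close to 1 for large batches
```

With a mini-batch size of 50, the ratio comes out close to 1, in line with the talk's finding that above a modest threshold the online estimator is essentially as efficient as the batch MLE.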

Okay, so here is the result for the coverage. As you can see, once the mini-batch size is larger than a certain threshold, the coverage is almost 95%.

This is the result I wanted to present today. In summary, our results build on recent techniques for Gaussian approximation in high-dimensional parametric models. In our framework, we assume that the variational family is large enough, so that the true posterior distribution is sufficiently close to this variational family. This is possible only because of the setting we consider.

We are considering only parametric models. In a parametric model, we know that the posterior distribution is asymptotically normal, and the Gaussian family approximates the posterior distribution very well. But many applications of Bayesian online learning involve non-parametric models. In those cases, there is a certain mis-specification: there is some strictly positive distance between the variational family and the target posterior distribution. In those cases, a theory like ours does not make sense.

In those cases, it is nonsense to provide a theory saying that the online posterior distribution will be close to the full posterior distribution. But maybe we can investigate something like consistency, or maybe something like an optimal rate. That might be possible, I think. So this is our current direction for future work.

Okay, I'm going to stop here and thank you very much for your attention.

Thank you for that brilliant talk. It was super interesting. Actually, I haven't seen much about this high-dimensional approximation, so I wanted to ask a technical question.

Or things like sparse priors? Certainly, I'm sure that the Gaussian prior is mathematically very convenient, because its log-density is a quadratic function. But I don't know how to extend this to a non-Gaussian family. In some of Spokoiny's papers, he mentioned that it is possible to extend it to non-Gaussian priors.

But he doesn't provide any details, so I'm not sure whether that is a true claim or not. Thank you.

Thanks for the nice talk. Given a data set, can you say something about how to choose the mini-batch size? Is there an optimal way to do that so that you can achieve the best performance? That's a great question. Actually, that was our main motivation for this kind of theoretical analysis.

With our current theory, it is difficult to provide such guidelines. In practice, we usually rely on heuristic analysis at this moment. Maybe that is our ultimate goal.

Very nice. So I understand that you controlled the KL divergence, and it was tricky, but that's why... the KL divergence works very well with Gaussians.

</aside>