Often in machine learning, we start with the assumption that there exists a distribution that explains the observables. In learning settings, our goal is to learn the true distribution of the data samples we observe. Hence, we say that “what our model believes in ought to capture the data samples that came from the data distribution”.
Energy-based models such as Markov random field are one of the most beloved machine learning tools due to the expressiveness in its structure and formulation, and their relationship to statistical physics. Energy-based models have an associated network graph that delineates the configuration of the states. That is, the relationship of each variable to other variables is represented through weights on edges of the network graph. The probability of each configuration is inversely proportional to the (scalar) energy. Therefore, the lower the energy of a given state configuration, the higher the probability that network will be found in the state. As a result of its elegancy, energy-based models provide a unified and definite framework for machine learners to work with.
One popular way to learn an appropriate energy function is to maximize the likelihood of the data samples with respect to the parameters. Generally, we posit that making the model agree with the data samples will give a good estimate of the true data distribution. This maximum likelihood technique is a classical method of parameter estimation, with many convenient and useful asymptotic guarantees. There is one nasty problem that occurs when we do maximum likelihood learning with the energy-based model. There exist a normalizing constant term, also known as the partition function, which sums over all possible configuration of the states. Computing the partition function is intractable unless the number of all possible states is small enough, hence we try to approximate the partition function using Markov chain Monte Carlo (MCMC), variational inference, or some other tools. Regardless, here, we take the view of fitting the empirical distribution to a stationary system.
Alternatively, we can try to understand or look closely at the model's dynamical system. This seems more natural and fresh, at least to me, because we often do inference and learning together. What I mean by this is that we usually start from some distribution and it evolves to another distribution, more precisely to a stationary distribution. One can think of it as the states evolving towards more probable states. This particular idea is often used in physics and chemistry related fields, where they describe the time-evolution of a system that can be modelled with the countable number of states at anytime and probabilistically switches between states. Indeed, this phenomenon is described by a first order differential equation with respect to the time and this is called the master equation. Little more elaboration about the master equation is that one can think of it as a Markov process. This is because the states can jump from one to other states in a continuous time; this is a kinetic system. Since it is a Markov process, there has to be transition matrix as a function of time.
Minimum Probability Flow Learning
So wondering how these are related to machine learning or even related to energy based models described above? We can introduce an explicit dynamics over the model, yielding an equilibrium distribution as a function of dynamics. Let's think of a dynamic system with the initial states as the states that we observed and the initial states will evolve over time towards the states that the model believes in. Here, the states can be seen as a distribution. So this phantasm is written in terms of the master equation such that the rate of change is the difference between the probability flowing out from one state to other states and the probability flowing in from other states to this particular state. However, there is one requirement, which is that the system must settle down to some equilibrium state, i.e stationary distribution. This requirement can be satisfied by designing our transitional matrix that satisfies the fixed point equation.
Minimum probability flow (MPF) is an algorithm that is based on these ideas. The objective of MPF is to minimize the Kullback-Leibler (KL) divergence between the data distribution and the distribution after evolving an infinitesimal amount of time under the dynamics. More precisely, it minimizes the first order Taylor expansion with respect to the time of KL divergence. This leads to a simple and elegant objective formulation. The interpretation of MPF is that it tries to minimize the probability flow from data states to non-data states, which we hope the stationary distribution to be closer to the data distribution. The details of this algorithm can be found in . Another remarkably intriguing idea, at least to me, is that we can design our own dynamics of the model by choosing a transitional matrix that converges to a stationary distribution, and there are many ways to do so. You can choose the transition matrix based on some inductive rules or sampling the connectivities of the states .
Wait! There is one catch. We are only minimizing the initial probability flow which only has evolved for infinitesimal time, so it does not fully guarantee that it will minimize the overall flow towards the stationary distribution. As we posit that we are working under the Markov Process, we are also interested in rate of convergence. Because we have the choice of choosing a dynamic, the rate of convergence will be different depending on our choice. Hence, there is a relationship to minimizing initial flow and rate of convergence of dynamic system. As the rate of convergence is faster, minimizing the initial probability flow should be more effective than when the chain of convergence is longer. Thus, there is definitely a lot of potentials for studies in various kinds of dynamic systems.
It is worthwhile to share different perspective than just from likelihood learning. Notice that we did not have to deal with computing a partition function. In practice, many machine learning researchers use contrastive divergence or MCMC to estimate the parameters. But resorting to sampling seems less pleasing, especially when we deal with continuous space. In addition, contrastive divergence has many weird properties and we do not understand why it works so well in practice. The view of contrastive divergence in terms of dynamics can be read more from [1,2]. Lastly, the choice of designing a dynamic gives open area for machine learning researchers to study. All of this is explained more thoroughly in the two reference papers!