The RL Probabilist
A blog by Dibya Ghosh on topics in mathematics, reinforcement learning, and machine learning.
Feed: https://variationalbay.es/feed.xml (generated by Jekyll, updated 2019-01-15)

Trouble in High-Dimensional Land
Dibya Ghosh, 2018-12-31
https://variationalbay.es/probability/highdimensionalgeometry

Let's dive into the world of high-dimensional geometry! When considering high-dimensional spaces (4 dimensions or higher), we rely on mental models and intuitions from 2D or 3D objects which generalize poorly to high dimensions. This is especially true in machine learning, where estimators, decision boundaries, and pretty much everything else are defined in $d$-dimensional space (where $d$ is very high), and our low-dimensional insights often collapse. This post will attempt to highlight some peculiarities of high-dimensional spaces, and their implications for machine learning applications.

Volumes Concentrate on the Outside

In high-dimensional spaces, volume concentrates on the outside, exponentially more so as dimension increases. Let's first look at this fact through "hypercubes": when $d=1$, this is an interval, when $d=2$, a square, when $d=3$, a cube, and so on. Mathematically, a hypercube with edge-length $l$ centered at the origin corresponds to the set

$$\mathcal{A}_{d}(l) = \{x \in \mathbb{R}^d ~~\vert~~ \|x\|_\infty \leq \frac{l}{2}\}$$

Volumes in $\mathbb{R}^d$ are calculated exactly like they are in 2 or 3 dimensions: the volume of a hyper-rectangle is the product of all of the edge lengths. By these calculations, hypercubes $\mathcal{A}_d(l)$ will have volume $\prod_{k=1}^d l = l^d$. Now, volumes of different dimensional objects aren't directly comparable (it's like comparing apples and oranges), but what we can look at are relative volumes. Say we have two hypercubes, one of edge-length $l$ and another of edge-length $\frac{l}{3}$: what is the relative volume of the smaller cube to the larger cube?
How does this proportion change as the dimension increases? Let's first visualize in the dimensions where we can. Our visualizations indicate that as dimension increases, the relative volume of the smaller cube vanishes exponentially fast. We can confirm this mathematically as well with a simple calculation:

$$\text{Relative Volume} = \frac{\text{Volume}(\mathcal{A}_{d}(\frac{l}{3}))}{\text{Volume}(\mathcal{A}_{d}(l))} = \frac{(l/3)^d}{l^d} = \left(\frac{1}{3}\right)^d$$

This implies that most of the volume in a hypercube lies around the edges (near the surface), and that very little volume lies in the center of the cube.

Why is this an issue for machine learning? Most optimization problems in machine learning can be written in the form

$$\min_{x \in U_d} ~~~f(x)$$

where $U_d = \mathcal{A}_d(1)$ is a unit hypercube. In many applications (including reinforcement learning), the function $f$ is sufficiently complicated that we can only evaluate the value of the function at a point, with no access to gradients or higher-order information. A typical solution is exhaustive search: we test a grid of points in the space, and choose the point that has the best value. The number of points we need to test to get the same accuracy scales exponentially with dimension, by the exact same argument as for the volume. To get accuracy $\varepsilon$ (that is, $\left|f(\hat{x})-f(x^*)\right| < \varepsilon$ where $\hat{x}$ is our estimate and $x^*$ is the optimal point), the number of points we need to test is on the order of $\left(\frac{1}{\varepsilon}\right)^d$, which is exponential in dimension (a rigorous proof can be given assuming $f$ is Lipschitz continuous). This is often referred to as optimization's curse of dimensionality.
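To get a feel for how fast these quantities blow up, here is a small sketch that tabulates the relative volume $(1/3)^d$ and the size of a grid with spacing $\varepsilon$ per axis. The helper names `relative_volume` and `grid_points` are made up for this illustration, and the grid count assumes a plain regular grid on the unit cube.

```python
def relative_volume(d, ratio=1.0 / 3.0):
    # Volume of a cube with edge ratio * l divided by the volume of a cube with edge l
    return ratio ** d

def grid_points(d, eps):
    # Number of points in a regular grid with spacing eps along each of d axes
    per_axis = round(1.0 / eps)
    return per_axis ** d

for d in (1, 2, 3, 10, 100):
    print(d, relative_volume(d), grid_points(d, eps=0.1))
```

Already at $d=10$ the inner cube holds less than two hundred-thousandths of the volume, while the grid has ten billion points.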
A similar problem exists when computing expectations of functions: a naive way one might compute an expectation is by evaluating the function on a grid of points and averaging the values, as in a Riemann sum. Computing in this way would also take time exponential in dimension.

Spheres and their Equators

Instead of considering cubes, let's now think about spheres. In particular, we'll think about the unit sphere in $d$ dimensions, which we'll call the $(d-1)$-sphere $S^{(d-1)}$ (for $d=2$, a circle; for $d=3$, a sphere).

$$S^{(d-1)} = \{x \in \mathbb{R}^d~~\vert~~ \|x\|_2 = 1\}$$

A side note: calling it a $(d-1)$-sphere may seem odd, but is standard mathematical notation; feel free to mentally substitute $d-1$ with $d$ if it helps improve intuition (the reason it's called a $(d-1)$-sphere is because the sphere is a manifold of dimension $d-1$).

The primary question we'll concern ourselves with is the following:

What proportion of points are near the equator?

We'll approach the problem dually, by asking the question: how wide does a band around the equator need to be to capture $1-\varepsilon$ proportion of the points on the sphere? For the time being, we'll let $\varepsilon = \frac14$ (that is, we hope to capture 75% of points), and let's start by investigating $d=2$ (the unit circle).

For circles ($d=2$), a band that extends a height $h$ on either side of the equator covers $\frac{4\sin^{-1}(h)}{2\pi} = \frac{2}{\pi}\sin^{-1}(h)$ of the circumference (the picture above serves as a rough proof). To cover 75% of the space, we can solve to find that $h$ needs to be at least $0.92$.

Now let's consider spheres ($d=3$). For spheres, a band of half-width $h$ covers a proportion $h$ of the surface area (one can look at spherical caps to derive the formula). Then to cover 75% of the space, we need a band with half-width only $0.75$, which is significantly less than the $0.92$ required for a circle.
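Both numbers are easy to double-check. The sketch below inverts the two coverage formulas and then confirms them by Monte Carlo; the helper `band_fraction`, the sample size, and the seed are arbitrary choices for this illustration.

```python
import numpy as np

target = 0.75

# Circle (d=2): coverage is (2/pi) * arcsin(h), so h = sin(target * pi / 2)
h_circle = np.sin(target * np.pi / 2)

# Sphere (d=3): coverage is h itself
h_sphere = target

rng = np.random.default_rng(0)

def band_fraction(d, h, n=200_000):
    # Fraction of uniform samples on the (d-1)-sphere whose first coordinate lies in [-h, h]
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    return np.mean(np.abs(x[:, 0]) <= h)

print(h_circle, band_fraction(2, h_circle))
print(h_sphere, band_fraction(3, h_sphere))
```

Both empirical fractions should come out close to 0.75, up to sampling noise.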
This seems to indicate the following hypothesis, that we shall now investigate:

Hypothesis: As dimension increases, more of the points on the sphere reside closer to the equator.

Let's jump into $d$ dimensions. For low-dimensional folks like ourselves, analyzing volumes for a $(d-1)$-sphere is difficult, so we'll instead consider the problem probabilistically. What does it mean for a band to cover $1-\varepsilon$ proportion of the sphere? With probability, we can imagine it as saying:

If we sample a point uniformly at random from the $(d-1)$-sphere, the probability that it lands in the band is $1-\varepsilon$.

How can we sample a point uniformly at random from the $(d-1)$-sphere? If we recall the symmetry of the multivariate Gaussian distribution about the origin, we encounter an elegant way to sample points from the sphere: sample a standard Gaussian vector, and then normalize it to lie on the sphere.

```python
import numpy as np

def sample_sphere(d):
    # Sample a point uniformly from a (d-1)-sphere
    x = np.random.randn(d)
    return x / np.linalg.norm(x)
```

We can investigate this problem empirically by sampling many points from a $(d-1)$-sphere, plotting their "x"-coordinates, and finding a band that contains 75% of the points. Below, we show it for $d = 3$ (the sphere), 9, 27, and 81. Notice that as the dimension increases, the x-coordinates group up very close to the center, and a great majority of them can be captured by very small bands. This yields an interesting point that is not at all intuitive!

In high dimensions, almost all points lie very close to the equator.

We can also examine how quickly this clusters by plotting the required height to get 75% of the points as dimension varies: this is shown below.
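For readers who want to reproduce the numbers behind such a plot, here is a minimal sketch that estimates the required half-width empirically. It uses a batched variant of the sampler; `sample_sphere_batch`, `band_half_width`, the sample size, and the seed are illustrative choices, not the code used for the original figures.

```python
import numpy as np

def sample_sphere_batch(n, d, rng):
    # n points drawn uniformly from the (d-1)-sphere, one per row
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def band_half_width(d, coverage=0.75, n=50_000, seed=0):
    # Smallest h such that a fraction `coverage` of samples satisfy |x_1| <= h
    rng = np.random.default_rng(seed)
    first_coord = sample_sphere_batch(n, d, rng)[:, 0]
    return np.quantile(np.abs(first_coord), coverage)

widths = {d: band_half_width(d) for d in (3, 9, 27, 81)}
for d, h in widths.items():
    print(d, round(float(h), 3))
```

For $d=3$ the estimate should sit near the exact value $0.75$, and it shrinks steadily as $d$ grows.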
We can also prove how quickly points concentrate near the equator mathematically: we show that the squared deviation of a point from the equator is distributed according to a Beta($\frac{1}{2}, \frac{d-1}{2}$) distribution, which shows that points concentrate in measure around the equator - that is, the probability that points lie outside of a band of fixed width around the equator goes to $0$ as the dimension increases. See the proof below.

Proof: Consider sampling uniformly on the $(d-1)$-sphere: we can do so by sampling $(Z_1, \dots, Z_d) \sim \mathcal{N}(0, I_d)$, and then normalizing to get $(X_1, \dots, X_d) = \frac{1}{\sqrt{\sum Z_k^2}}(Z_1, \dots, Z_d)$. What is the distribution of $X_1$? First, let's consider what the distribution of $X_1^2$ is:

$$X_1^2 = \frac{Z_1^2}{\sum Z_k^2} = \frac{Z_1^2}{Z_1^2 + \sum_{k > 1} Z_k^2}$$

Now, recall that each $Z_k^2$ is Gamma($r=\frac12, \lambda=\frac12$), and so, since sums of independent Gamma variables with a common rate are again Gamma, $Z_1^2 \sim \text{Gamma}(r=\frac12, \lambda=\frac12)$ and $\sum_{k > 1} Z_k^2 \sim \text{Gamma}(r=\frac{d-1}{2},\lambda=\frac12)$. Gamma distributions possess the interesting property that if $X \sim \text{Gamma}(r_1, \lambda)$ and $Y \sim \text{Gamma}(r_2, \lambda)$ are independent, then $\frac{X}{X+Y} \sim \text{Beta}(r_1, r_2)$. Then we simply have that $X_1^2 \sim \text{Beta}(\frac{1}{2}, \frac{d-1}{2})$.

Now, this is a profound fact, and we can get a lot of insight from this formula, but for the time being, we'll use a simple Markov bound to show that as $d \to \infty$, $X_1$ converges in probability to $0$ (that is, points come very close to the equator). Since the mean of a Beta($\frac{1}{2}, \frac{d-1}{2}$) distribution is $\frac{1}{d}$, for an arbitrary $\varepsilon > 0$,

$$P(|X_1| > \varepsilon) = P(X_1^2 > \varepsilon^2) \leq \frac{E(X_1^2)}{\varepsilon^2} = \frac{1}{d\varepsilon^2}$$

This completes the proof.
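The Beta claim and the Markov bound can both be sanity-checked numerically. The following sketch compares quantiles of simulated $X_1^2$ against draws from Beta($\frac12, \frac{d-1}{2}$); the dimension $d=20$, the threshold $\varepsilon = 0.5$, and the sample size are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 200_000

z = rng.standard_normal((n, d))
x1_sq = z[:, 0] ** 2 / np.sum(z ** 2, axis=1)    # X_1^2 for uniform points on the sphere
beta = rng.beta(0.5, (d - 1) / 2, size=n)        # direct draws from Beta(1/2, (d-1)/2)

qs = [0.25, 0.5, 0.75, 0.95]
q_sphere = np.quantile(x1_sq, qs)
q_beta = np.quantile(beta, qs)
print(q_sphere)
print(q_beta)
print(x1_sq.mean(), 1 / d)                       # sample mean vs. the Beta mean 1/d

eps = 0.5
tail = np.mean(x1_sq > eps ** 2)                 # empirical P(|X_1| > eps)
markov_bound = 1 / (d * eps ** 2)
print(tail, markov_bound)
```

The two sets of quantiles should agree up to sampling noise, and the empirical tail probability should sit well below the (fairly loose) Markov bound.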
Summary and Perspective: Probability Distributions and the "Typical Set"

The core tool in statistical inference is the expectation operator: most operations, whether querying the posterior distribution for Bayesian inference, computing confidence intervals for estimators, or doing variational inference, reduce to computing expectations. The core problem is then to accurately estimate expectations of some function $g$ with respect to some probability distribution $\pi$, where $\pi$ and $g$ are defined on some high-dimensional space ($\mathbb{R}^d$).

$$\mathbb{E}_{X \sim \pi}[g(X)] = \int_{\mathbb{R}^d} g d\pi = \int_{\mathbb{R}^d} g(x) f_\pi(x) dx$$

In the first section, we briefly discussed one way to compute this expectation integral: evaluating the integrand at a grid of points, and averaging (as in a Riemann sum) to arrive at our estimate. However, in practice, we don't need to evaluate at all the points, only at the points that contribute meaningfully to the integral; that is, we want to only evaluate in regions of high probability (places where points concentrate).

The previous two sections have hinted at the following fact:

For probability distributions in high-dimensional spaces, most of the probability concentrates in small regions (not necessarily the full space).

For points sampled uniformly from inside a hypercube, with overwhelming probability, the point will be near the surface of the hypercube and not in the center.

For points sampled uniformly from the surface of a hypersphere, with overwhelming probability, the point will lie near the equator of the sphere.

This concept can be made rigorous with the typical set, a set $A_\epsilon$ such that $P_\pi(X \in A_{\epsilon}) > 1 - \epsilon$.
Then, if $g(x)$ is well-behaved enough, we can write

$$\mathbb{E}_{X \sim \pi}[g(X)] = \int_{\mathbb{R}^d} g d\pi = \int_{A_{\epsilon}} g d\pi + \int_{A_{\epsilon}^C} g d\pi \approx \int_{A_{\epsilon}} g d\pi$$

What will help us is that for most distributions, this typical set is actually rather small compared to the full high-dimensional space. In the next article, we'll consider how we can efficiently sample from the typical sets of probability distributions, which will introduce us to topics like Markov Chain Monte Carlo, Metropolis-Hastings, and Hamiltonian Monte Carlo.

KL Divergence for Machine Learning
Dibya Ghosh, 2018-08-07
https://variationalbay.es/probability/kldivergence

This post will talk about the Kullback-Leibler Divergence from a holistic perspective of reinforcement learning and machine learning. You've probably run into KL divergences before: especially if you've played with deep generative models like VAEs. Put simply, the KL divergence between two probability distributions measures how different the two distributions are.

I'll introduce the definition of the KL divergence and various interpretations of the KL divergence. Most importantly, I'll argue the following fact:

Both the problems of supervised learning and reinforcement learning are simply minimizing the KL divergence objective.

What's the KL Divergence?

The Kullback-Leibler divergence (hereafter written as KL divergence) is a measure of how a probability distribution differs from another probability distribution. Classically, in Bayesian theory, there is some true distribution $P(X)$, which we'd like to estimate with an approximate distribution $Q(X)$. In this context, the KL divergence measures the distance from the approximate distribution $Q$ to the true distribution $P$.

Mathematically, consider two probability distributions $P,Q$ on some space $\mathcal{X}$.
The Kullback-Leibler divergence from $Q$ to $P$ (written as $D_{KL}(P \| Q)$) is defined as

$$D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(X)}{Q(X)}\right]$$

Properties of KL Divergence

There are some immediate notes that are worth pointing out about this definition.

The KL Divergence is not symmetric: that is, in general $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$. As a result, it is also not a distance metric.

The KL Divergence can take on values in $[0,\infty]$. In particular, if $P$ and $Q$ are the exact same distribution ($P \stackrel{a.e.}{=} Q$), then $D_{KL}(P \| Q) = 0$, and by the same argument $D_{KL}(Q \| P) = 0$. In fact, with a little bit of math, a stronger statement can be proven: if $D_{KL}(P \| Q) = 0$, then $P \stackrel{a.e.}{=} Q$.

In order for the KL divergence to be finite, the support of $P$ needs to be contained in the support of $Q$. If a point $x$ exists with $Q(x) = 0$ but $P(x) > 0$, then $D_{KL}(P \| Q) = \infty$.

Rewriting the Objective

With some algebra, we can rewrite the definition of KL divergence in terms of other quantities. The most useful such manipulation is:

$$D_{KL}(P \| Q) = \mathbb{E}_{x \sim P}[-\log Q(X)] - \mathcal{H}(P(X))$$

Here, $\mathbb{E}_{x \sim P}[-\log Q(X)]$ is the cross entropy between $P$ and $Q$ (denoted $H(P,Q)$). The second term $\mathcal{H}(P(X))=\mathbb{E}_{x \sim P}[-\log P(X)]$ is the entropy of $P$.

Forward and Reverse KL

Let's place ourselves in the optimization setting. There is some true distribution $P(X)$ that we're trying to estimate with our approximate distribution $Q_\theta(X)$. I'm using $\theta$ as a parameter here to explicitly emphasize that $Q$ is the distribution that we get to control.

As we mentioned earlier, the KL divergence is not a symmetric measure (i.e. $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general). As a result, when trying to approximate $P$, we have a choice between two potential objectives to optimize.
Minimizing the forward KL: $\arg\min_{\theta} D_{KL}(P\|Q_\theta)$

Minimizing the reverse KL: $\arg\min_{\theta} D_{KL}(Q_\theta\|P)$

As it turns out, the two different objectives actually cause different types of approximations. We'll spend the next section discussing the qualitative behaviours of each approach. We'll investigate in the following setting: $P(X)$ is the bimodal distribution below. We'll try to approximate this with a normal distribution $Q(X) = \mathcal{N}(\mu, \sigma^2)$.

Forward KL: Mean-Seeking Behaviour

Let's consider optimizing the forward KL objective with respect to $Q_{\theta}$

\begin{align*}
\arg\min_{\theta}D_{KL}(P \| Q_\theta) &= \arg\min_{\theta} \mathbb{E}_{x \sim P}[-\log Q_\theta(X)] - \mathcal{H}(P(X))\\
&= \arg\min_{\theta} \mathbb{E}_{x \sim P}[-\log Q_\theta(X)]\\
&= \arg\max_{\theta} \mathbb{E}_{x \sim P}[\log Q_\theta(X)]
\end{align*}

Notice that this is identical to the maximum likelihood estimation objective. Translated into words, the objective above will sample points from $P(X)$ and try to maximize the probability of these points under $Q(X)$. A good approximation under the forward KL objective thus satisfies

Wherever $P(\cdot)$ has high probability, $Q(\cdot)$ must also have high probability.

We consider this mean-seeking behaviour, because the approximate distribution $Q$ must cover all the modes and regions of high probability in $P$. The optimal "approximate" distribution for our example is shown below. Notice that the approximate distribution centers itself between the two modes, so that it can have high coverage of both. The forward KL divergence does not penalize $Q$ for having high probability mass where $P$ does not.
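As a concrete sketch of this behaviour: for a Gaussian family, maximizing the expected log-likelihood under $P$ amounts to matching the mean and variance of $P$. The bimodal target below (a 0.4/0.6 mixture of unit-variance Gaussians at $-3$ and $3$) is a made-up stand-in, since the exact distribution from the figure is not specified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal target P: 0.4 * N(-3, 1) + 0.6 * N(3, 1)
means = np.array([-3.0, 3.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.4, 0.6])

n = 100_000
component = rng.choice(2, size=n, p=weights)
samples = rng.normal(means[component], stds[component])

# Maximum-likelihood Gaussian fit to samples from P (the forward KL minimizer
# within the Gaussian family): the sample mean and standard deviation
mu_hat = samples.mean()
sigma_hat = samples.std()
print(mu_hat, sigma_hat)
```

The fitted mean lands between the two modes (near $0.6$ here) and the fitted standard deviation is roughly $3.1$, much wider than either mode, which is the mean-seeking behaviour described above.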
Reverse KL: Mode-Seeking Behaviour

Now consider optimizing the reverse KL objective with respect to $Q_{\theta}$

\begin{align*}
\arg\min_{\theta}D_{KL}(Q_\theta \| P) &= \arg\min_{\theta} \mathbb{E}_{x \sim Q_\theta}[-\log P(X)] - \mathcal{H}(Q_\theta(X))\\
&= \arg\max_{\theta} \mathbb{E}_{x \sim Q_\theta}[\log P(X)] + \mathcal{H}(Q_{\theta}(X))
\end{align*}

Let's translate the objective above into words. The objective above will sample points from $Q(X)$ and try to maximize the probability of these points under $P(X)$. The entropy term encourages the approximate distribution to be as wide as possible. A good approximation under the reverse KL objective thus satisfies

Wherever $Q(\cdot)$ has high probability, $P(\cdot)$ must also have high probability.

We consider this mode-seeking behaviour, because any sample from the approximate distribution $Q$ must lie within a mode of $P$ (since it's required that samples from $Q$ have high probability under $P$). Notice that unlike the forward KL objective, there's nothing requiring the approximate distribution to try to cover all the modes. The entropy term prevents the approximate distribution from collapsing to a very narrow mode; typically, behaviour when optimizing this objective is to find a mode of $P$ with high probability and wide support, and mimic it exactly.

The optimal "approximate" distribution for our example is shown below. Notice that the approximate distribution essentially encompasses the right mode of $P$. The reverse KL divergence does not penalize $Q$ for not placing probability mass on the other mode of $P$.

Which one should I use?

In this toy example, because we knew the exact distribution of $P$, we were able to show the behaviour of minimizing the forward and reverse KL divergences. In practice, it's often not possible to do both, and you are limited by domain to only one.
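When the density of $P$ is known, as in the toy example, the reverse KL fit can be found by brute force. The sketch below evaluates $D_{KL}(Q \| P)$ by numerical integration over a grid of $(\mu, \sigma)$ values. The target is a hypothetical 0.4/0.6 mixture of unit-variance Gaussians at $-3$ and $3$ (the post's exact distribution is not specified), and the integration grid and search ranges are arbitrary choices.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Integration grid and hypothetical bimodal target density P
x = np.linspace(-25.0, 25.0, 5001)
dx = x[1] - x[0]
p = 0.4 * normal_pdf(x, -3.0, 1.0) + 0.6 * normal_pdf(x, 3.0, 1.0)

def reverse_kl(mu, sigma):
    # D_KL(Q || P) for Q = N(mu, sigma^2), as a Riemann sum over the grid
    q = normal_pdf(x, mu, sigma)
    mask = q > 1e-300          # where q underflows, the integrand q * log(q / p) is 0
    return np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))) * dx

mus = np.linspace(-5.0, 5.0, 81)
sigmas = np.linspace(0.4, 4.0, 73)
kl = np.array([[reverse_kl(m, s) for s in sigmas] for m in mus])

i, j = np.unravel_index(np.argmin(kl), kl.shape)
mu_star, sigma_star = mus[i], sigmas[j]
print(mu_star, sigma_star, kl[i, j])
```

The minimizer sits on the heavier right mode (roughly $\mu \approx 3$, $\sigma \approx 1$) and ignores the left one, in contrast to the wide, centered fit that the forward KL produces.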
Forward KL

Recall that the simplified objective for the forward KL objective was

$$\arg\max_{\theta} \mathbb{E}_{x \sim P}[\log Q_\theta(X)]$$

To be able to evaluate this objective, we need either a dataset of samples from the true model $P(X)$, or a mechanism for sampling from the true model.

Reverse KL

The simplified objective for the reverse KL objective was

$$\arg\max_{\theta} \mathbb{E}_{x \sim Q_\theta}[\log P(X)] + \mathcal{H}(Q_{\theta}(X))$$

To be able to evaluate this objective, we need to be able to evaluate probabilities of data-points under the true model $P(X)$.

Supervised Learning = Forward KL

Recall that in supervised learning (empirical risk minimization), we have a dataset of samples $\mathcal{D} = \{(x_i,y_i)\}$ from some ground-truth data distribution $P(x,y) = P(x)P(y|x)$. Our goal in supervised learning is to learn a model $f: \mathcal{X} \to \mathcal{Y}$ that minimizes the empirical risk of the model, which is measured by a loss function $L(f(x),y)$. In particular, we optimize over some family of models $f_\theta$

$$\arg\min_{\theta} \mathbb{E}_{(x,y) \sim \mathcal{D}}[L(f_\theta(x),y)]$$

We'll show that optimizing this objective is equivalent to minimizing the divergence from an approximate distribution $q_\theta(y|x)$ to the true data distribution $p(y|x)$. For reference, the forward KL divergence objective is

$$\arg\min_{\theta} \mathbb{E}_{x,y \sim \mathcal{D}}[-\log Q_\theta(y|x)]$$

Classification with Cross-Entropy Loss: Here, our approximate distribution $q_{\theta}(y|x)$ is a discrete distribution parametrized by a probability vector $p$ which is outputted by a neural network $f_{\theta}(x)$. By definition, the cross-entropy loss is exactly what the forward KL objective minimizes.

Regression with Mean-Squared Error Loss: Here, our approximate distribution $q_{\theta}(y|x)$ is distributed normally $\mathcal{N}(f_{\theta}(x), I)$, where the mean of the distribution is parametrized by a neural network.
The negative log-likelihood of the normal distribution is written below. Minimizing the NLL of this normal distribution is clearly equivalent to minimizing the mean-squared error loss.

$$-\log q(y|x) = \frac{1}{2}\|y - f_{\theta}(x)\|_2^2 + C$$

This concept can in fact be extended to many other losses (for example, absolute error corresponds to the Laplace distribution). In particular, the forward KL divergence loss corresponds exactly to the problem of maximum-likelihood estimation, which is the primary basis for many supervised learning problems.

Reinforcement Learning = Reverse KL

Viewing the problem of reinforcement learning as minimizing the reverse KL objective requires us to think about reinforcement learning from a probabilistic perspective. For a good intro on why we want to do that, and how exactly we formulate it, check out my control as inference guide.

We can imagine that there's a distribution of optimal trajectories, given by $P_{opt}(\tau)$. Our goal in reinforcement learning is to learn stochastic policies $\pi(a|s)$ that induce a distribution over trajectories: $q_{\pi}(\tau)$. Now, we can't sample directly from the distribution of optimal trajectories $P_{opt}(\tau)$, but we know that the probability of a trajectory under optimality is exponential in the sum of rewards received on the trajectory (up to a normalizing constant, which does not affect the optimization).

$$\log P(\tau) = \sum_{t=1}^T r(s_t,a_t)$$

Optimizing the reverse KL objective then is

\begin{align*}
&~\arg\max_{\pi} \mathbb{E}_{\tau \sim Q_\pi}[\log P(\tau)] + \mathcal{H}(Q_{\pi}(\tau))\\
&=\arg\max_{\pi}\mathbb{E}_{\tau \sim Q_\pi}[\sum_{t=1}^T r(s_t,a_t)] + \mathbb{E}_{\tau \sim Q_\pi}[\sum_{t=1}^T -\log \pi(a_t|s_t)]\\
&=\arg\max_{\pi}\mathbb{E}_{\tau \sim Q_\pi}\left[\sum_{t=1}^T \left(r(s_t,a_t) -\log\pi(a_t|s_t)\right)\right]\\
\end{align*}

This is exactly the maximum-entropy reinforcement learning objective!

Summary

KL divergences show up everywhere in machine learning, and a solid foundation in what the KL divergence measures is very useful.
If you're interested in learning more about applications of KL divergence in statistics, I'd recommend reading articles on Bayesian inference. KL divergence also has a very rich history in information theory: the following are great reads. If you love deep learning, two very important concepts in the field using KL divergences right now are VAEs and information bottlenecks.

As always, if you catch an error, shoot me an email at dibya @ berkeley.edu or comment below.

An Introduction to Control as Inference
Dibya Ghosh, 2018-06-12
https://variationalbay.es/rl/controlasinference

A recent paper of mine proposed an algorithm to do weakly-supervised inverse RL from goal states (check out the paper!). The algorithm is derived through an interesting framework called "control as inference", which casts (max-ent) reinforcement learning as inference in a graphical model. This framework has been gaining traction recently, and it's been used to justify many recent contributions in IRL (Finn et al, Fu et al), and some interesting RL algorithms like Soft Q-Learning (Haarnoja et al).

I personally think the framework is very cute, and it's an interesting paradigm which can explain some weird quirks that show up in RL. This document is a writeup which explains exactly what "control as inference" is. Once you've finished reading this, you may also enjoy this lecture in Sergey Levine's CS294-112 class, or his primer on control as inference as a more detailed reference.

The MDP

In this article, we'll focus on a finite-horizon MDP with horizon $T$: this is simply for convenience, and all the derivations and proofs can be extended to the infinite-horizon case easily. Recall that an MDP is $(\mathcal{S}, \mathcal{A}, T, \rho, R)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $T(\cdot \vert s,a)$ is the transition kernel, $\rho$ the initial state distribution, and $R$ the reward.
The Graphical Model

Trajectories in an MDP as detailed above can be modelled by the following graphical model. The graphical model has a state variable $S_t$ and an action variable $A_t$ for each timestep $t$. We'll define the distributions of the variables in this graphical model in a way such that the probability of a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots s_T)$ is equal to the probability of the trajectory under the MDP's dynamics.

We set the distribution of $S_0$ to be $\rho(s)$ (the initial state distribution of the MDP). For subsequent $S_{t}$, the distribution is defined using transition probabilities of the MDP.

$$P(S_{t+1} = s' \vert S_{t}=s, A_t = a) = T(s' \vert s,a)$$

The distribution for the action variables $A_t$ is uniform on the action space.

$$P(A_t = a) = C$$

It may seem odd that the actions are sampled uniformly, but don't worry! These are only prior probabilities, and we'll get interesting action distributions once we start conditioning (Hang tight!)

The probability of a trajectory $\tau = (s_0, a_0, s_1, a_1 , \dots s_T,a_T)$ in this model factorizes as

$$\begin{align*}P(\tau) &= P(S_0 = s_0) \prod_{t=0}^{T-1} P(A_t = a_t)P(S_{t+1} = s_{t+1} | S_t = s_t, A_t = a_t)\\
&= C^T \left(\rho(s_0) \prod_{t=0}^{T-1} T(s_{t+1} \vert s_t, a_t)\right)\\
&\propto \left(\rho(s_0)\prod_{t=0}^{T-1} T(s_{t+1} | s_t,a_t)\right) \end{align*}$$

The probability of a trajectory in our graphical model is thus directly proportional to the probability under the system dynamics. In the special case that dynamics are deterministic, then $P(\tau) \propto \mathbb{1} \{\text{Feasible}\}$ (that is, all feasible trajectories are equally likely).

Adding Rewards

So far, we have a general structure for describing the likelihood of trajectories in an MDP, but it's highly uninteresting since at the moment, all feasible trajectories are weighted only by the dynamics. To highlight interesting trajectories, we'll introduce the concept of optimality.
We'll say that an agent is optimal at timestep $t$ with some probability which depends on the current state and action: $P(\text{Optimal at } t) = f(s_t,a_t)$. We'll embed optimality into our graphical model with a binary random variable at every timestep $e_t$, where $P(e_t = 1 \vert S_t=s_t, A_t=a_t) = f(s_t,a_t)$.

While we're at it, let's define a function $r(s,a)$ to be $r(s_t,a_t) = \log f(s_t,a_t)$. The notation is very suggestive, and indeed we'll see very soon that this function $r(s,a)$ plays the role of a reward function.

The final graphical model, presented below, ends up looking much like one for a Hidden Markov Model.

For a trajectory $\tau$, the probability that it is optimal at all timesteps is exponential in the total reward received in the trajectory.

$$P(\text{All } e_t=1 | \tau) =\exp (\sum_{t=0}^T r(s_t,a_t))$$

Proof:

$$\begin{align*}P(\text{All } e_t=1 | \tau) &= \prod_{t=0}^T P(e_t = 1 \vert S_t=s_t, A_t=a_t) \\
&= \prod_{t=0}^T f(s_t,a_t) \\
&= \prod_{t=0}^T \exp{r(s_t,a_t)} \\
&= \exp (\sum_{t=0}^T r(s_t,a_t))\end{align*}$$

We'll describe the optimal trajectory distribution as the distribution when conditioned on being optimal at all time steps.

$$\pi_{\text{optimal}}(\tau) = P(\tau \vert \text{All } e_t =1) = P(\tau~\vert~e_{1:T} = 1)$$

Explicitly writing out this distribution, we have that

$$P(\tau ~\vert ~e_{1:T} = 1) \propto \exp(\sum_{t=0}^T r(s_t,a_t))P(\tau)$$

Proof:

$$\begin{align*}P(\tau ~\vert~ e_{1:T} =1) &= \frac{P(e_{1:T} =1 \vert \tau)P(\tau)}{P(e_{1:T} =1)} \\
&\propto P(e_{1:T} =1 \vert \tau)P(\tau) \\
&\propto \exp(\sum_{t=0}^T r(s_t,a_t))P(\tau)\end{align*}$$

Under deterministic dynamics, since $P(\tau) \propto \mathbb{1}\{\text{Feasible}\}$, the probability of any feasible trajectory is

$$P(\tau~\vert~ e_{1:T} =1) \propto \exp(\sum_{t=0}^T r(s_t,a_t))$$

This can be viewed as a special form of an energy-based model, where the energy of a trajectory is proportional to the reward.
Exact Inference in the Graphical Model

We now have a model for what the optimal trajectory distribution is, so the next appropriate step is to look at optimal action distributions. If I am at state $s$ on timestep $t$, what is the "optimal" distribution of actions?

Formally, this corresponds to finding

$$\pi_{t}(a \vert s) = P(A_t = a~\vert~S_t = s,e_{1:T} =1)$$

In our graphical model, once we condition on $S_t$, $A_t$ is independent of all optimality events before $t$ ($A_t \perp e_1, \dots, e_{t-1}$ given $S_t$). We can verify this mathematically, but the intuition is that the distribution of actions at a timestep shouldn't be impacted by what happened previously (the environment is Markovian). So,

$$\pi_{t}(a \vert s) = P(A_t = a \vert S_t = s, e_{t:T} =1)$$

Solving for these probabilities corresponds to doing exact inference in the graphical model above, which looks much like the forward-backward algorithm for HMMs. The procedure goes as follows:

Backward message passing: Compute probabilities $P(e_{t:T} = 1 ~\vert~ S_t =s)$ and $P(e_{t:T} = 1 ~\vert~ S_t =s, A_{t} = a)$

Forward message passing: Compute probabilities $P(A_t = a \vert S_t = s, e_{t:T} =1)$ using Bayes' rule and the backward messages.
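The two passes can be sketched for a small tabular MDP. Everything concrete below is a made-up example: the random transition kernel `T`, the non-positive rewards `r` (so that $f = e^{r}$ is a valid probability), and the sizes. The messages are kept in probability space and follow the standard recursion for this model: the state-action message is $e^{r(s,a)}$ times the expected next-state message, and the state message averages over the uniform action prior.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 3, 5        # timesteps t = 0, ..., horizon

# Hypothetical MDP: T[s, a, s'] is a transition kernel, r[s, a] <= 0
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = -rng.uniform(0.0, 2.0, size=(n_states, n_actions))
prior = np.full(n_actions, 1.0 / n_actions)   # uniform action prior P(A_t = a) = C

beta_next = np.ones(n_states)                 # no optimality events remain after the last step
policy = [None] * (horizon + 1)
for t in range(horizon, -1, -1):
    # Backward messages:
    # beta_sa[s, a] = P(e_{t:T} = 1 | S_t = s, A_t = a)
    # beta_s[s]     = P(e_{t:T} = 1 | S_t = s)
    beta_sa = np.exp(r) * (T @ beta_next)
    beta_s = beta_sa @ prior
    # Forward step via Bayes' rule: P(A_t = a | S_t = s, e_{t:T} = 1)
    policy[t] = beta_sa * prior / beta_s[:, None]
    beta_next = beta_s

print(policy[0])
```

Since the Bayes step at time $t$ only needs the messages for time $t$, it is computed inside the backward loop here. At the final timestep the policy reduces to a softmax over the rewards $r(s, \cdot)$.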
Backward Messages

We can compute these backward messages recursively, since

$P(e_{t:T} = 1\vert A_t =a, S_t=s)$ can be expressed in terms of $P(e_{t+1:T} = 1 \vert S_{t+1} = s')$

$P(e_{t:T} = 1\vert S_t=s)$ can be expressed in terms of $P(e_{t:T} = 1\vert S_t=s, A_t =a)$

Working through the math (see the proof for more details)

$$P(e_{t:T} = 1 \vert A_t=a, S_t=s) = e^{r(s,a)} \mathbb{E}_{s' \sim T(\cdot \vert s,a)}[P(e_{t+1:T}=1 \vert S_{t+1}=s')]$$

$$P(e_{t:T} = 1\vert S_t=s) = \mathbb{E}_{a}[P(e_{t:T} = 1 \vert A_t=a, S_t =s)]$$

Proof:

$$ \begin{align*} P(e_{t:T} = 1&\vert A_t =a, S_t=s)\\
&= \int_{\mathcal{S}} P(e_{t:T}=1, S_{t+1}=s' \vert S_t=s, A_t=a) ds'\\
&= \int_{\mathcal{S}} P(e_t = 1 | S_t=s, A_t=a)P(e_{t+1:T}=1, S_{t+1}=s' \vert S_t=s, A_t=a) ds'\\
&= P(e_t = 1 | S_t=s, A_t=a) \int_{\mathcal{S}} P(e_{t+1:T}=1 \vert S_{t+1}=s') P(S_{t+1} = s' \vert S_t=s, A_t=a) ds'\\
&= e^{r(s,a)} \mathbb{E}_{s' \sim T(\cdot \vert s,a)}[P(e_{t+1:T}=1 \vert S_{t+1}=s')]\\
P(e_{t:T} = 1&\vert S_t=s)\\
&= \int_{\mathcal{A}} P(e_{t:T} = 1, A_t=a \vert S_t=s) da\\
&= \int_{\mathcal{A}} P(e_{t:T} = 1 \vert A_t=a , S_t=s) P(A_t=a) da \\
&= \mathbb{E}_{a}[P(e_{t:T} = 1 \vert A_t=a, S_t =s)]\\
\end{align*} $$

That looks pretty ugly and uninterpretable, but if we view the expressions in log-probability space, there's rich meaning. Let's define

$$Q_t(s,a) = \log P(e_{t:T} = 1\vert A_t =a, S_t=s)$$

$$V_t(s) = \log P(e_{t:T} = 1 \vert S_t=s)$$

$Q$ and $V$ are very suggestively named for a good reason: we'll discover that they are the analogue of the $Q$ and $V$ functions in standard RL.
Rewriting the above expressions with $Q_t(\cdot, \cdot)$ and $V_t(\cdot)$:

$$Q_t(s,a) = r(s,a) + \log \mathbb{E}_{s' \sim T(\cdot \vert s,a)}[e^{V_{t+1}(s')}]$$

$$V_t(s) = \log \mathbb{E}_a [e^{Q_t(s,a)}]$$

Remember that the function $\log \mathbb{E}[\exp(f(X))] $ acts as a "soft" maximum operation: that is

$$\log \mathbb{E}[\exp(f(X))] = \text{soft} \max_X f(X) \approx \max_{X} f(X)$$

We'll denote it as $\text{soft} \max$ from now on - but don't get it confused with the actual softmax operator. With this notation:

$$Q_t(s,a) = r(s,a) + \text{soft} \max_{s'} V_{t+1}(s')$$

$$V_t(s) = \text{soft} \max_{a} Q_{t}(s,a)$$

These recursive equations look very much like the Bellman backup equations! These are the soft Bellman backup equations. They differ from the traditional Bellman backup in two ways:

The value function is a "soft" maximum over actions, not a hard maximum.

The Q-value function is a "soft" maximum over next states, not an expectation: this makes the Q-value "optimistic wrt the system dynamics" or "risk-seeking". It'll favor actions which have a low probability of going to a really good state over actions which have high probability of going to a somewhat good state. When dynamics are deterministic, then the Q-update is equivalent to the normal backup: $Q_t(s,a) = r(s,a) + V_{t+1}(s')$.

Passing backward messages corresponds to performing Bellman updates in an MDP, albeit with slightly different backup operations.

Forward Messages

Now that we know that the $Q$ and $V$ functions correspond to backward messages, let's now compute the optimal action distribution.
$$ \begin{align*} P(A_t =a \vert S_t=s, e_{t:T}=1) &= \frac{P(e_{t:T}=1 \vert A_t =a, S_t=s)P(A_t = a \vert S_t =s)}{P(e_{t:T}=1\vert S_t=s)}\\ &= \frac{e^{Q_t(s,a)}C}{e^{V_t(s)}}\\ &\propto \exp(Q_t(s,a) - V_t(s))\\ &\propto \exp(A_t(s,a)) \end{align*} $$

Here $C$ stands for the action prior $P(A_t = a \vert S_t = s)$, taken to be constant in $a$. If we define the advantage $A_t(s,a) = Q_t(s,a) - V_t(s)$, then we find that the optimal probability of picking an action is simply proportional to the exponentiated advantage!

Haarnoja et al. perform a derivation similar to this to find an algorithm called Soft Q-Learning. In their paper, they show that the soft Bellman backup update is a contraction, and so Q-learning with the soft backup equations has the same convergence guarantees that Q-learning has in the discrete case. Empirically, they show that this algorithm can learn complicated continuous control tasks with high sample efficiency. In follow-up works, they deploy the algorithms on robots and also present actor-critic methods in this framework.

Approximate Inference with variational methods¶

Let's try to look at inference in this graphical model in a different way. Instead of doing exact inference in the original model to get a policy distribution, we can attempt to learn a variational approximation to our intended distribution $q_{\theta}(\tau) \approx P(\tau \vert e_{1:T}=1)$. The motivation is the following: we want to learn a policy $\pi(a \vert s)$ such that sampling actions from $\pi$ causes the trajectory distribution to look as close to $P(\tau \vert e_{1:T} = 1)$ as possible. We'll define a variational distribution $q_{\theta}(\tau)$ as follows:

$$q_\theta(\tau) = P(S_0 = s_0) \prod_{t=0}^T q_{\theta}(a_t \vert s_t) P(S_{t+1} = s_{t+1} \vert S_{t} = s_t, A_t = a_t) = \left(\prod_{t=0}^T q_{\theta}(a_t \vert s_t)\right) P(\tau)$$

This variational distribution can change the distribution of actions, but fixes the system dynamics in place.
This is a form of structured variational inference, and we attempt to find the function $q_{\theta}(a \vert s)$ which minimizes the KL divergence with our target distribution.

$$\min_{\theta} D_{KL}(q_{\theta}(\tau) \| P(\tau \vert e_{1:T} = 1))$$

If we simplify the expressions, it turns out that

$$\arg\min_{\theta} D_{KL}(q_{\theta}(\tau) \| P(\tau \vert e_{1:T} = 1)) = \arg \max_{\theta} \mathbb{E}_{\tau \sim q}\left[ \sum_{t=0}^T \left( r(s_t,a_t) + \mathcal{H}(q_{\theta}(\cdot \vert s_t)) \right) \right]$$

Proof:

Remember from the first section that $P(\tau \vert \text{All } e_t=1) = P(\tau) \exp(\sum_{t=0}^T r(s_t,a_t))$ (up to a normalizing constant, which does not affect the $\arg\min$).

$$ \begin{align*} D_{KL}(q_{\theta}(\tau) \| P(\tau \vert \text{All } e_t = 1)) &= -\mathbb{E}_{\tau \sim q}\left[\log \frac{P(\tau \vert \text{All } e_t = 1)}{q_{\theta}(\tau)}\right]\\ &= -\mathbb{E}_{\tau \sim q}\left[\log \frac{P(\tau) \exp(\sum_{t=0}^T r(s_t,a_t))}{ P(\tau)\left(\prod_{t=0}^T q_{\theta}(a_t \vert s_t)\right)}\right]\\ &= -\mathbb{E}_{\tau \sim q}\left[\log \frac{\exp(\sum_{t=0}^T r(s_t,a_t))}{\prod_{t=0}^T q_{\theta}(a_t \vert s_t)}\right]\\ &= -\mathbb{E}_{\tau \sim q}\left[\log \frac{\exp(\sum_{t=0}^T r(s_t,a_t))}{\exp (\sum_{t=0}^T \log q_{\theta}(a_t \vert s_t))}\right]\\ &= -\mathbb{E}_{\tau \sim q}\left[ \sum_{t=0}^T \left( r(s_t,a_t) - \log q_{\theta}(a_t \vert s_t) \right) \right] \end{align*} $$

Recalling that $-\log q_{\theta}(a_t \vert s_t)$ is a single-sample estimate of the entropy $\mathcal{H}(q_{\theta}(\cdot \vert s_t))$, we get our result.

$$D_{KL}(q_{\theta}(\tau) \| P(\tau \vert e_{1:T} = 1)) = -\mathbb{E}_{\tau \sim q}\left[ \sum_{t=0}^T \left( r(s_t,a_t) + \mathcal{H}(q_{\theta}(\cdot \vert s_t)) \right) \right]$$

The best policy $q_{\theta}(a \vert s)$ is thus the one that maximizes expected reward with an entropy bonus. This is the objective for maximum entropy reinforcement learning. Performing structured variational inference with this particular family of distributions to minimize the KL divergence with the optimal trajectory distribution is equivalent to doing reinforcement learning in the max-ent setting!
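The identity above can be checked numerically on a toy problem by enumerating every trajectory. The sketch below is illustrative only: the random MDP, the stationary policy `q`, and the helper name `kl_and_objective` are all assumptions, and the target is left unnormalized as $P(\tau) \exp(\sum_t r(s_t, a_t))$, so the "KL" it computes is offset from the true KL divergence by a constant that does not depend on the policy. The final transition into $s_{T+1}$ is omitted because it sums to one and cancels.

```python
import itertools

import numpy as np


def kl_and_objective(r, p0, dyn, q, last_step):
    """Enumerate trajectories (s_0, a_0, ..., s_T, a_T) with T = last_step.

    r:   (S, A) rewards.          p0: (S,) initial state distribution.
    dyn: (S, A, S) dynamics.      q:  (S, A) policy q(a | s), all entries > 0.
    Returns (sum_tau q log(q / target), expected reward plus entropy bonus).
    """
    n_states, n_actions = r.shape
    steps = last_step + 1
    entropy = -(q * np.log(q)).sum(axis=1)  # H(q(. | s)) for each state
    kl, objective = 0.0, 0.0
    for states in itertools.product(range(n_states), repeat=steps):
        for actions in itertools.product(range(n_actions), repeat=steps):
            # Initial state and dynamics: shared by q(tau) and the target.
            p_dyn = p0[states[0]]
            for t in range(steps - 1):
                p_dyn *= dyn[states[t], actions[t], states[t + 1]]
            q_act = np.prod([q[s, a] for s, a in zip(states, actions)])
            ret = sum(r[s, a] for s, a in zip(states, actions))
            q_tau = p_dyn * q_act
            target = p_dyn * np.exp(ret)  # unnormalized P(tau | all e_t = 1)
            kl += q_tau * np.log(q_tau / target)
            objective += q_tau * (ret + sum(entropy[s] for s in states))
    return kl, objective


rng = np.random.default_rng(1)
r = rng.normal(size=(2, 2))                      # 2 states, 2 actions
p0 = rng.dirichlet(np.ones(2))
dyn = rng.dirichlet(np.ones(2), size=(2, 2))     # dyn[s, a] sums to 1
q = rng.dirichlet(np.ones(2), size=2)            # q[s] sums to 1

kl, objective = kl_and_objective(r, p0, dyn, q, last_step=2)
print(np.isclose(kl, -objective))
```

Enumeration grows exponentially with the horizon, so this is only a sanity check for tiny problems; in practice the expectation would be estimated from sampled trajectories.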
Inferring Reward with Maximum Entropy Inverse Reinforcement Learning¶

To Be Continued¶

This tutorial is a work in progress. Stay tuned for more updates!

Quick ML 2018-01-28T00:00:00+00:00 2018-01-28T00:00:00+00:00 https://variationalbay.es/ml/quickml Dibya Ghosh

A quick beginner's guide to coding the fundamental algorithms in ML has been on my todo-list for a while. Heavily inspired by Napkin ML.

Table of Contents¶

K-Nearest Neighbors
Linear Regression
Logistic Regression
More coming soon!

This post hides some of the boilerplate code: if you want to extend these tutorials or play around with them, check them out here.

TensorFlow: A Beginner's Guide 2017-01-05T00:00:00+00:00 2017-01-05T00:00:00+00:00 https://variationalbay.es/ml/tensorflow-guide Dibya Ghosh

What is TensorFlow?¶

If you've been following the machine learning community, in particular that of deep learning, over the last year, you've probably heard of TensorFlow. TensorFlow is a library for structuring and running numerical computations, developed in-house by the Google Brain team. One can imagine this library as an extension of NumPy that works on more scalable architectures and adds more detailed algorithms and methods that pertain specifically to machine learning. TensorFlow joins Theano and cuDNN as tools for building and designing neural networks.

This article delves into TensorFlow through case studies of neural network implementations. As such, it requires prior knowledge of neural networks (the subject is too expansive to cover in a single article). For those who are new (and for those who need a refresher), here are some good reference materials:

http://neuralnetworksanddeeplearning.com/ (Basic)
http://www.deeplearningbook.org/ (More Advanced)