

Neural Variational Inference: Variational Autoencoders and Helmholtz machines


So far there has been little that is "neural" in our VI methods. Now it's time to fix that, as we're going to consider Variational Autoencoders (VAE), a paper by D. Kingma and M. Welling which generated a lot of buzz in the ML community. It has two main contributions: a new approach (AEVB) to large-scale inference in non-conjugate models with continuous latent variables, and a probabilistic model of autoencoders as an example of this approach. We then discuss connections to Helmholtz machines, a predecessor of VAEs.

Auto-Encoding Variational Bayes

As noted in the introduction of the post, this approach, called Auto-Encoding Variational Bayes (AEVB), works only for some models with continuous latent variables. Recall from our discussion of Blackbox VI and Stochastic VI that we're interested in maximizing the ELBO $\mathcal{L}(\Theta, \Lambda)$:

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)}$$

It's not a problem to compute an estimate of the gradient of the ELBO w.r.t. the model parameters $\Theta$, but estimating the gradient w.r.t. the approximation parameters $\Lambda$ is tricky, as these parameters influence the distribution the expectation is taken over, and as we know from the post on Blackbox VI, the naive gradient estimator based on the score function exhibits high variance.

It turns out that for some distributions we can make a change of variables: $z \sim q(z \mid x, \Lambda)$ can be represented as a (differentiable) transformation $g(\varepsilon; \Lambda, x)$ of some auxiliary random variable $\varepsilon$ whose distribution does not depend on $\Lambda$. A well-known example of such a reparametrization is the Gaussian distribution: if $z \sim \mathcal{N}(\mu, \Sigma)$, then $z$ can be represented as $z = \mu + \Sigma^{1/2} \varepsilon$ for $\varepsilon \sim \mathcal{N}(0, I)$. This transformation is called the reparametrization trick. After the reparametrization the ELBO becomes

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)} \log \frac{p(x, g(\varepsilon; \Lambda, x) \mid \Theta)}{q(g(\varepsilon; \Lambda, x) \mid \Lambda, x)} \approx \frac{1}{L} \sum_{l=1}^{L} \log \frac{p(x, g(\varepsilon^{(l)}; \Lambda, x) \mid \Theta)}{q(g(\varepsilon^{(l)}; \Lambda, x) \mid \Lambda, x)}, \quad \text{where } \varepsilon^{(l)} \sim \mathcal{N}(0, I)$$

This objective is much better, as we no longer need to differentiate w.r.t. the expectation's distribution, essentially putting the variational parameters $\Lambda$ into the same regime as the model parameters $\Theta$. It's now sufficient to take gradients of the ELBO's estimate and run any optimization algorithm like Adam.
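To make this concrete, here is a minimal sketch (in PyTorch, assuming a diagonal-covariance Gaussian approximation) of the reparametrized sampling and a single-sample ($L = 1$) estimate of the ELBO; `log_joint` and `log_q` are hypothetical functions computing $\log p(x, z \mid \Theta)$ and $\log q(z \mid x, \Lambda)$, and the commented-out training step is likewise only illustrative:

```python
import torch

def reparametrize(mu, log_var):
    """Sample z ~ N(mu, diag(exp(log_var))) as a differentiable
    transformation of auxiliary noise eps ~ N(0, I)."""
    eps = torch.randn_like(mu)                   # eps does not depend on Lambda
    return mu + torch.exp(0.5 * log_var) * eps   # z = mu + Sigma^{1/2} * eps

def elbo_estimate(x, mu, log_var, log_joint, log_q):
    """Single-sample Monte Carlo estimate of the reparametrized ELBO."""
    z = reparametrize(mu, log_var)
    return log_joint(x, z) - log_q(z, mu, log_var)

# A hypothetical training step with Adam; params holds both Theta and Lambda.
# optimizer = torch.optim.Adam(params, lr=1e-3)
# loss = -elbo_estimate(x, mu, log_var, log_joint, log_q)   # maximize the ELBO
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```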

Oh, and if you wonder what Auto-Encoding in Auto-Encoding Variational Bayes means, there’s an interesting interpretation of the ELBO in terms of autoencoding:

$$\mathcal{L}(\Theta, \Lambda) = \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x, z \mid \Theta)}{q(z \mid x, \Lambda)} = \mathbb{E}_{q(z \mid x, \Lambda)} \log \frac{p(x \mid z, \Theta)\, p(z \mid \Theta)}{q(z \mid x, \Lambda)} = \mathbb{E}_{q(z \mid x, \Lambda)} \log p(x \mid z, \Theta) - D_{\mathrm{KL}}\big(q(z \mid x, \Lambda) \,\|\, p(z \mid \Theta)\big)$$

Here the first term can be treated as an expected reconstruction loss (reconstructing $x$ from the code $z$), while the second one is just a regularization term.

Variational Autoencoder

One particular application of the AEVB framework comes from using neural networks as the model $p(x \mid z, \Theta)$ and the approximation $q(z \mid x, \Lambda)$. The model has no special requirements, and $x$ can be discrete or continuous (or mixed). $z$, however, has to be continuous. Moreover, we need to be able to apply the reparametrization trick. Therefore, in many practical applications $q(z \mid x, \Lambda)$ is set to be a Gaussian distribution $q(z \mid x, \Lambda) = \mathcal{N}(z \mid \mu(x; \Lambda), \Sigma(x; \Lambda))$, where $\mu$ and $\Sigma$ are outputs of a neural network taking $x$ as input, and $\Lambda$ denotes the set of the neural network's weights, the parameters you optimize the ELBO with respect to (along with $\Theta$). To make the reparametrization trick practical, you'd like to be able to compute $\Sigma^{1/2}$ quickly. You don't want to actually compute this quantity, as it'd be too computationally expensive. Instead, you might want to predict $\Sigma^{1/2}$ by a neural network in the first place, or consider only diagonal covariance matrices (as is done in the paper).
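A minimal sketch of such an encoder network, assuming a diagonal covariance parametrized by its log-variance (so that $\Sigma^{1/2}$ is trivial to compute); the layer sizes, class name, and the single hidden layer are illustrative placeholders, not the architecture from the paper:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Parametrizes q(z | x, Lambda) = N(z | mu(x), diag(exp(log_var(x))))."""
    def __init__(self, x_dim, h_dim, z_dim):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)        # predicts the mean of q
        self.log_var = nn.Linear(h_dim, z_dim)   # predicts log of the diagonal of Sigma

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.log_var(h)
```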

In the case of a Gaussian approximation $q(z \mid x, \Lambda)$ and a Gaussian prior $p(z \mid \Theta)$ we can compute the KL divergence $D_{\mathrm{KL}}(q(z \mid x, \Lambda) \,\|\, p(z \mid \Theta))$ analytically, see the formula at stats.stackexchange. This reduces the variance of the gradient estimator, though one can still train a VAE by estimating the KL divergence using Monte Carlo, just like the other part of the ELBO.
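For the common special case of a diagonal Gaussian $q$ and a standard Normal prior $p(z) = \mathcal{N}(0, I)$, the closed-form KL term is particularly simple; a sketch under those assumptions:

```python
import torch

def kl_diag_gaussian_std_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var, dim=-1)
```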

We optimize both the model and the approximation by gradient ascent. This joint optimization pushes both the approximation towards the model and the model towards the approximation. This leads not only to efficient inference using the approximation, but also encourages the model to learn latent representations $z$ such that the true posterior $p(z \mid x, \Theta)$ is approximately factorial.

This model has generated a lot of buzz because it can be used as a generative model; essentially a VAE is an autoencoder with a natural sampling procedure. Suppose you've trained the model and now want to generate new samples similar to those in the training set. To do so you first sample $z$ from the prior $p(z)$, and then generate $x$ using the model $p(x \mid z, \Theta)$. Both operations are easy: the first one is sampling from some standard distribution (a Gaussian, for example), and the second one is just one feed-forward pass followed by another sampling from another standard distribution (a Bernoulli, for example, in case $x$ is a binary image).
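A sketch of that sampling procedure, assuming a standard Normal prior and a trained `decoder` network that maps $z$ to Bernoulli logits for a binary image (the function and argument names are hypothetical):

```python
import torch

def sample_from_vae(decoder, z_dim, n_samples=16):
    """Generate new x: sample z from the prior, then x from p(x | z, Theta)."""
    z = torch.randn(n_samples, z_dim)      # z ~ p(z) = N(0, I)
    probs = torch.sigmoid(decoder(z))      # one feed-forward pass: Bernoulli means
    return torch.bernoulli(probs)          # sample binary images x
```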

If you want to read more on Variational Auto-Encoders, I refer you to a great tutorial by Carl Doersch.

Helmholtz machines

In the end I'd like to add some historical perspective. The idea of two networks, one "encoding" an observation $x$ into some latent representation (code) $z$, and another "decoding" it back, is definitely not new. In fact, the whole idea is a special case of the Helmholtz machines introduced by Geoffrey Hinton and colleagues 20 years ago.

A Helmholtz machine can be thought of as a neural network with stochastic hidden layers. Namely, we now have $M$ stochastic hidden layers (latent variables) $h_1, \dots, h_M$ (with deterministic $h_0 = x$), where the layer $h_{k-1}$ is stochastically produced by the layer $h_k$, that is, it is sampled from some distribution $p(h_{k-1} \mid h_k)$, which, as you might have guessed already, is parametrized in the same way as in usual VAEs. Actually, the VAE is a special case of a Helmholtz machine with just one stochastic layer (but each stochastic layer can contain a neural network of arbitrarily many deterministic layers inside it).

This image shows an instance of a Helmholtz machine with 2 stochastic layers (blue cloudy nodes), each stochastic layer having 2 deterministic hidden layers (white rectangles).

The joint model distribution is

$$p(x, h_1, \dots, h_M \mid \Theta) = p(h_M \mid \Theta) \prod_{m=0}^{M-1} p(h_m \mid h_{m+1}, \Theta)$$

And the approximate posterior has the same structure, but in the reverse order:

$$q(h_1, \dots, h_M \mid x, \Lambda) = \prod_{m=1}^{M} q(h_m \mid h_{m-1}, \Lambda)$$

The distribution $p(x, h_1, \dots, h_{M-1} \mid h_M)$ is usually called a generative network (or model), as it allows one to generate samples from the latent representation(s). The approximate posterior $q(h_1, \dots, h_M \mid x, \Lambda)$ in this framework is called a recognition network (or model). Presumably, the name reflects the purpose of the network: to recognize the hidden structure of observations.
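To make the layered structure concrete, here is a minimal sketch of a generative and a recognition network with two stochastic layers of binary (Bernoulli) units, as in classic Helmholtz machines; the class name, layer sizes, and single-linear-layer conditionals are illustrative assumptions, not the architecture from the figure:

```python
import torch
import torch.nn as nn

class HelmholtzMachine(nn.Module):
    """Two stochastic Bernoulli layers h1, h2, with deterministic h0 = x."""
    def __init__(self, x_dim, h1_dim, h2_dim):
        super().__init__()
        # Generative network: p(h2) p(h1 | h2) p(x | h1)
        self.prior_logits = nn.Parameter(torch.zeros(h2_dim))
        self.gen_h1 = nn.Linear(h2_dim, h1_dim)
        self.gen_x = nn.Linear(h1_dim, x_dim)
        # Recognition network: q(h1 | x) q(h2 | h1)
        self.rec_h1 = nn.Linear(x_dim, h1_dim)
        self.rec_h2 = nn.Linear(h1_dim, h2_dim)

    def recognize(self, x):
        """Sample h1, h2 layer by layer from the recognition model q(h | x, Lambda)."""
        h1 = torch.bernoulli(torch.sigmoid(self.rec_h1(x)))
        h2 = torch.bernoulli(torch.sigmoid(self.rec_h2(h1)))
        return h1, h2

    def generate(self, n_samples):
        """Sample x top-down from the generative model p(x, h | Theta)."""
        top_probs = torch.sigmoid(self.prior_logits).repeat(n_samples, 1)
        h2 = torch.bernoulli(top_probs)
        h1 = torch.bernoulli(torch.sigmoid(self.gen_h1(h2)))
        return torch.bernoulli(torch.sigmoid(self.gen_x(h1)))
```

Note that sampling discrete units like this cannot be reparametrized, which is one reason Helmholtz machines were trained with a different procedure rather than by backpropagating through an ELBO estimate, as discussed next.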

So, if the VAE is a special case of the Helmholtz machine, what's new then? The standard algorithm for learning Helmholtz machines, the Wake-Sleep algorithm, turns out to optimize a different objective. Thus, one of the significant contributions of Kingma and Welling is the application of the reparametrization trick to make optimization of the ELBO w.r.t. $\Lambda$ tractable.
