Deep Dive | Information Flow Mechanisms in LSTMs and Their Comparison

htxu91 2015-11-25


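The previous post on character-level models (reply with code GH023) mentioned a new neural network called Highway Networks, HW-Net for short. Several LSTM variants are related to it. They share one motivation: LSTM was proposed to solve the vanishing gradient problem, but it cannot be said to solve it very well. If the input/forget gate design of LSTM is one mechanism for countering vanishing gradients, then the variants in the papers below propose further, more flexible mechanisms. In addition, the third paper compares, across these variants, the role and performance of their design units (such as gates and cells).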

The relevant papers are listed below (external links are not supported, so please search for the papers, code, and notes yourselves):

Training Very Deep Networks: arXiv preprint; paper, code
Grid Long Short-Term Memory: arXiv preprint; paper
LSTM: A Search Space Odyssey: arXiv preprint; paper, note

Training Very Deep Networks (Highway networks)

Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber

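This paper grew out of "Highway Networks", which was presented at an ICML workshop. The current version has not been published yet and is only available on arXiv. As mentioned above, the motivation is again the vanishing gradient problem, which here amounts to information failing to propagate effectively to the deeper layers, as if it were blocked. The authors wanted a way to let information flow freely again, like traffic on a highway, hence the name Highway Networks (HW-Nets).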
To overcome this, we take inspiration from Long Short Term Memory (LSTM) recurrent networks. We propose to modify the architecture of very deep feedforward networks such that information flow across layers becomes much easier. This is accomplished through an LSTM-inspired adaptive gating mechanism that allows for paths along which information can flow across many layers without attenuation. We call such paths information highways. They yield highway networks, as opposed to traditional ‘plain’ networks.

The emphasis belongs on "adaptive", which is the key point of this mechanism.

Adaptive Gating Mechanism

In the paper, equation (2) defines their mechanism.

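There are two gates: T, the transform gate, and C, the carry gate. The idea is that, for the current input at this layer, the gates decide how much of it goes through the nonlinear transform (the hidden layer) and how much keeps its original form and is passed straight on to the next layer, unobstructed.
With this design in place, the authors go one step further and make the processing in each layer blockwise rather than layerwise: for each part of the input (individual dimensions), the layer can choose separately whether to transform or to carry.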
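For reference, the gating described above can be written out as follows. This is a reconstruction in the notation of the highway networks paper, where H is the layer's nonlinear transform and W_H, W_T, W_C are its weights. The second line assumes the common simplification C = 1 - T, and the third line is the per-unit reading of the blockwise design.

```latex
% Equation (2): separate transform gate T and carry gate C
y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)

% Coupled gates, C = 1 - T
y = H(x, W_H) \cdot T(x, W_T) + x \cdot \bigl(1 - T(x, W_T)\bigr)

% Blockwise form: unit i has its own gate value T_i(x)
y_i = H_i(x) \cdot T_i(x) + x_i \cdot \bigl(1 - T_i(x)\bigr)
```

When T_i(x) is close to 0 the unit simply copies x_i, and when it is close to 1 the unit behaves like an ordinary hidden unit.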

Implementation Details

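For initialization, the authors found that it works better to set the transform gate bias to a negative value, so that the network starts out biased towards carrying.
Because the mechanism requires the input, the hidden layer, and the transform gate to have the same dimensionality, zero-padding can be used to make up any shortfall.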
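The two implementation details above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code: the helper names (`init_highway`, `highway_layer`, `zero_pad`), the tanh transform, the weight range, and the gate bias of -2.0 are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def zero_pad(x, size):
    """Pad x with zeros up to `size` so it matches the layer width."""
    return list(x) + [0.0] * (size - len(x))


def init_highway(size, gate_bias=-2.0, seed=0):
    """Hypothetical initializer: small random weights, negative gate bias."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(size)]
                for _ in range(size)]

    return {"W_H": mat(), "b_H": [0.0] * size,
            "W_T": mat(), "b_T": [gate_bias] * size}


def highway_layer(params, x):
    """y_i = H_i(x) * T_i(x) + x_i * (1 - T_i(x)), one gate value per unit."""
    h = [math.tanh(z) for z in affine(params["W_H"], params["b_H"], x)]
    t = [sigmoid(z) for z in affine(params["W_T"], params["b_T"], x)]
    y = [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]
    return y, t


x = zero_pad([0.5, -1.0, 0.25], 4)   # 3-dimensional input, 4-unit layer
params = init_highway(4)
y, t = highway_layer(params, x)
print(t)   # with a negative bias, every gate starts well below 0.5
print(y)   # so the output stays close to the (padded) input
```

With the negative bias, a freshly initialized layer mostly carries its input, which is the behavior the initialization is meant to encourage.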

Analysis and Conclusions

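The paper analyzes the experimental results to find out how much of the signal in a deep network is actually transformed (passed through the hidden layer's nonlinearity) and how much is carried through directly.

The results are in the last two columns of Figure 2, and the conclusion is that the vast majority is not transformed.
A second conclusion, quoted from the paper: The last column of Figure 2 displays the block outputs and visualizes the concept of “information highways”. Most of the outputs stay constant over many layers forming a pattern of stripes. Most of the change in outputs happens in the early layers (≈ 15 for MNIST and ≈ 40 for CIFAR-100).
The third thing they verify and analyze is whether the blockwise mechanism is necessary. The result, shown in Figure 3, comes from checking whether different parts of the input behave the same way across the layers. If they do not, that is, if they are not transformed or carried in the same places, then they should be treated differently, which means the blockwise design is justified.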
This data-dependent routing mechanism is further investigated in Figure 3. In each of the columns we show how the average over all samples of one specific class differs from the total average shown in the second column of Figure 2. For MNIST digits 0 and 7 substantial differences can be seen within the first 15 layers, while for CIFAR class numbers 0 and 1 the differences are sparser and spread out over all layers. In both cases it is clear that the mean activity pattern differs between classes. The gating system acts not just as a mechanism to ease training, but also as an important part of the computation in a trained network.
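The kind of comparison described for Figure 3 (per-class average gate activity minus the average over all samples) can be sketched as below. This is only an illustration of the bookkeeping, not the authors' analysis code: the function names are hypothetical and the gate values are made-up toy numbers.

```python
def mean_gate_activity(gates):
    """Average gate outputs over samples: gates[s][l][u] -> mean[l][u]."""
    n = len(gates)
    layers, units = len(gates[0]), len(gates[0][0])
    return [[sum(g[l][u] for g in gates) / n for u in range(units)]
            for l in range(layers)]


def class_deviation(gates, labels, cls):
    """Mean gate activity of one class minus the mean over all samples."""
    total = mean_gate_activity(gates)
    subset = [g for g, y in zip(gates, labels) if y == cls]
    per_class = mean_gate_activity(subset)
    return [[pc - tt for pc, tt in zip(prow, trow)]
            for prow, trow in zip(per_class, total)]


# Toy data: 4 samples, 2 layers, 2 units (values invented for illustration).
gates = [
    [[0.9, 0.1], [0.1, 0.1]],   # class 0
    [[0.7, 0.1], [0.1, 0.1]],   # class 0
    [[0.1, 0.9], [0.1, 0.1]],   # class 1
    [[0.3, 0.7], [0.1, 0.1]],   # class 1
]
labels = [0, 0, 1, 1]
dev0 = class_deviation(gates, labels, 0)
print(dev0)   # nonzero entries mark units whose gating depends on the class
```

In this toy example the first layer routes the two classes through different units while the second layer treats them identically, so only the first layer shows a deviation.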

Grid Long Short-Term Memory

Nal Kalchbrenner, Ivo Danihelka, Alex Graves

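Overall, the contribution of this paper is a framework that is more flexible and has greater computational capability.
To understand the paper, you probably first need three concepts: grid/block, stacked, and depth.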

Grid/Block, Stacked, Depth

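A grid/block is a component obtained by reworking the LSTM mechanism. The component can be multi-dimensional, and the number of dimensions determines in how many directions it propagates. Every dimension has its own memory cell and hidden cell. A 1-dimensional Grid LSTM closely resembles the Highway Networks described above.
Stacked means the same as in stacked LSTMs: outputs are connected to inputs. Stacking does not change the dimensionality of a Grid LSTM, however. A stacked 2D Grid LSTM is still 2D, not 3D. Visually, it just tiles the blocks across space (extending along every dimension).
Depth, by contrast, does add a dimension. Making a block deep means adding layers inside it; a block made up of several layers is a Grid LSTM that many layers deep.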

Multidimensional

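With only 1D or 2D, Grid LSTM does not show a particularly large advantage. Once it becomes multidimensional, though, it handles vanishing gradients better than the traditional multidimensional LSTM. The reason is that a traditional multidimensional LSTM computes the memory cell at each step by aggregating the gated information from every dimension into a single memory cell.
This is problematic: the sum over dimensions can make the memory values grow as the grid gets larger. Grid LSTM does not do this; it computes a memory cell separately for each dimension. Each grid block has N incoming memory cells and hidden cells, and N outgoing memory cells and hidden cells, where N is the number of dimensions. What the dimensions share is the large concatenated hidden vector H. This preserves both the interaction between dimensions and the information flow.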
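The contrast between the two memory updates can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: biases are omitted, the output uses the common form h = o * tanh(m), and `random_weights` is a hypothetical initializer with an arbitrary weight range.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def md_lstm_memory(forget_gates, memories, update_gate, candidate):
    """Multidimensional LSTM: a single new memory that sums the gated
    memories arriving from all N dimensions."""
    return [
        sum(f[k] * m[k] for f, m in zip(forget_gates, memories))
        + update_gate[k] * candidate[k]
        for k in range(len(candidate))
    ]


def lstm_transform(W, H, m):
    """One LSTM update driven by the shared vector H and a single memory m."""
    u = [sigmoid(z) for z in matvec(W["u"], H)]      # input (update) gate
    f = [sigmoid(z) for z in matvec(W["f"], H)]      # forget gate
    o = [sigmoid(z) for z in matvec(W["o"], H)]      # output gate
    c = [math.tanh(z) for z in matvec(W["c"], H)]    # candidate content
    m_new = [fk * mk + uk * ck for fk, mk, uk, ck in zip(f, m, u, c)]
    h_new = [ok * math.tanh(mk) for ok, mk in zip(o, m_new)]
    return h_new, m_new


def grid_lstm_block(weights, hiddens, memories):
    """Grid LSTM block: every dimension sees the same concatenated H but
    updates only its own memory with its own weights."""
    H = [v for h in hiddens for v in h]
    return [lstm_transform(W, H, m) for W, m in zip(weights, memories)]


def random_weights(n_dims, size, seed=0):
    """Hypothetical initializer: one set of gate matrices per dimension."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(n_dims * size)]
                for _ in range(size)]

    return [{name: mat() for name in "ufoc"} for _ in range(n_dims)]


# Summed memories can exceed either input even with no new content.
summed = md_lstm_memory([[1.0, 1.0], [1.0, 1.0]],
                        [[1.0, 1.0], [2.0, 2.0]],
                        [0.0, 0.0], [0.5, 0.5])
print(summed)   # [3.0, 3.0]

# A 2D block with 3 units per dimension: N memories in, N memories out.
hiddens = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1]]
memories = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
outputs = grid_lstm_block(random_weights(2, 3), hiddens, memories)
```

Each entry of `outputs` is the (hidden, memory) pair for one dimension, so the interaction between dimensions happens only through H while each memory keeps its own gated path.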

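Later in the paper there is also an interesting application: machine translation is cast as a 3D Grid LSTM problem, in which two of the dimensions read and write forwards and backwards like a bi-LSTM and the third dimension is depth. The results are quite good.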

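Perhaps the main value of this framework is that it makes LSTM variants a little more systematic. As for how much it actually improves performance, I remain rather skeptical.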

LSTM: A Search Space Odyssey

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber

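There is also someone else's note on this paper (please search for it yourselves; external links are not supported for now). When I first saw the title I assumed it would be a hard read, so I kept putting it off. Having finished it, I found that it reads very smoothly, and the points I took away are almost the same as those in that note.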

Overall it is quite clear.

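First, the eight variants of the vanilla LSTM covered in this paper are not all architectures that have already been proposed and used in practice. They were designed specifically for experimental comparison, changing one factor at a time. Apart from the vanilla LSTM, the most meaningful one is probably the GRU-like variant (the seventh variant).
Second, the paper's explanation of the vanilla LSTM is very clear. If I were asked to recommend an introductory explanation of the LSTM model, I would recommend this one too (the author of the note above felt the same way).
Third, the paper reports a huge number of experiments, amounting to about 15 years of CPU time, yet the results are presented and analyzed in a very well-organized way. It uses a method called functional Analysis of Variance (fANOVA) to analyze how much each hyperparameter contributes to the result. It looks like a very good method.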

Conclusions

The important conclusions can be read directly from the final Conclusions section, which I excerpt here:

The most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets and using any of eight possible modifications does not significantly improve the LSTM performance.
Certain modifications such as coupling the input and forget gates or removing peephole connections simplify LSTM without significantly hurting performance.
The forget gate and the output activation function are the critical components of the LSTM block. While the first is crucial for LSTM performance, the second is necessary whenever the cell state is unbounded.
Learning rate and network size are the most crucial tunable LSTM hyperparameters. Surprisingly, the use of momentum was found to be unimportant (in our setting of online gradient descent). Gaussian noise on the inputs was found to be moderately helpful for TIMIT, but harmful for other datasets.
The analysis of hyperparameter interactions revealed that even the highest measured interaction (between learning rate and network size) is quite small. This implies that the hyperparameters can be tuned independently. In particular, the learning rate can be calibrated first using a fairly small network, thus saving a lot of experimentation time.
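One more conclusion worth adding: although the variants bring no improvement overall, the results are task-specific.
In other words, since there is no overall improvement, the variants that reduce the number of parameters are well worth using (they do not hurt performance). So the GRU-style modification is quite reasonable.
Finally, the paper is a warning about mistakes made when tuning learning rates.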
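To make the point about the parameter-saving variants concrete, here is a minimal sketch of an LSTM step with and without coupled input and forget gates. It is an illustration under stated assumptions rather than the paper's setup: peephole connections are left out, and `init_lstm`, `lstm_step`, and `count_params` are hypothetical helpers with an arbitrary weight range.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bk for row, bk in zip(W, b)]


def init_lstm(n_in, n_hidden, coupled=False, seed=0):
    """Hypothetical initializer; the coupled variant has no forget-gate weights."""
    rng = random.Random(seed)
    names = ["z", "i", "o"] if coupled else ["z", "i", "f", "o"]
    params = {}
    for name in names:
        params["W_" + name] = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
            for _ in range(n_hidden)
        ]
        params["b_" + name] = [0.0] * n_hidden
    return params


def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step without peepholes; uses f = 1 - i if W_f is absent."""
    v = list(x) + list(h_prev)
    z = [math.tanh(a) for a in affine(params["W_z"], params["b_z"], v)]
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], v)]
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], v)]
    if "W_f" in params:
        f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], v)]
    else:
        f = [1.0 - ik for ik in i]          # coupled input and forget gates
    c = [fk * ck + ik * zk for fk, ck, ik, zk in zip(f, c_prev, i, z)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def count_params(params):
    """Count scalar weights and biases."""
    total = 0
    for name, value in params.items():
        if name.startswith("W_"):
            total += sum(len(row) for row in value)
        else:
            total += len(value)
    return total


separate = init_lstm(3, 4)
coupled = init_lstm(3, 4, coupled=True)
print(count_params(separate), count_params(coupled))   # 128 96

h, c = lstm_step(coupled, [0.5, -0.5, 1.0], [0.0] * 4, [1.0, -1.0, 0.5, 0.0])
```

Coupling the gates removes one of the four weight blocks, and the new cell state becomes a convex combination of the old cell state and the new block input.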