Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation(rnn_nmt_baidu)

Tags: nlp, natural language processing, nmt, lstm nmt, machine translation

2016-11-04

0. Abstract
1. Introduction
2. NMT
3. Deep Topology
4. Experiments
5. Conclusion

This paper was published at ACL 2016 (paper link).

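0. Abstract

The model is built on deep LSTM networks with an interleaved deep bi-directional LSTM encoder, and it introduces a new kind of linear connection called fast-forward connections. These connections play an essential role in propagating gradients and in building a deep topology with a depth of 16.

On the WMT'14 English-to-French task, a single attention model reaches a BLEU of 37.7, which is 6.2 BLEU points above the conventional single shallow NMT model. Without attention, the model reaches BLEU = 36.3. With special handling of unknown words plus model ensembling, the score rises to BLEU = 40.4.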

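1. Introduction

Conventional MT systems (statistical MT, SMT) consist of multiple separately tuned components. NMT instead encodes the source sequence into a continuous representation space and then generates the new sequence in an end-to-end fashion.

NMT generally uses one of two topologies: the encoder-decoder network (Sutskever et al., 2014) and the attention network (Bahdanau et al., 2015).

The encoder-decoder network represents the source sequence as a fixed-dimensional vector and generates the target sequence word by word.

The attention network uses the inputs from all time steps to build a detailed relationship between the target words and the input words.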
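To make the difference between the two topologies concrete, here is a minimal sketch of a single attention step. It is only an illustration: the dot-product scoring function, the names `attend`, `encoder_states`, and `decoder_state`, and the toy vectors are all assumptions made for this example and are not taken from the paper, whose attention model may score alignments differently.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one target step.

    Assumption: plain dot-product scoring, chosen only for brevity.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is a weighted sum over ALL source time steps,
    # rather than a single fixed vector for the whole sentence.
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Hypothetical encoder outputs for a 3-word source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

An encoder-decoder network without attention would instead hand the decoder one fixed vector (for example, the last encoder state) at every target step, so the weights above would not exist.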

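However, a single neural network still cannot match the best conventional SMT system: a 6-layer model reaches a BLEU of only 31.5, while the conventional approach reaches 37.0.

Over the past two years in computer vision, the top entries in the ImageNet competition have almost all been networks with tens or even more than a hundred layers. In NMT, by contrast, the deepest successful models have only 6 layers. The reason is that, compared with convolutional layers, an LSTM contains more nonlinear activation functions, and these activations significantly decrease the magnitude of the gradient in a deep topology, especially when the gradient propagates in recurrent form.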
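The shrinking gradient can be illustrated with a toy calculation. The sketch below multiplies the local derivative of tanh along a chain of layers; it ignores weight matrices and gating entirely, and the pre-activation value of 1.0 is a hypothetical number picked for the example, so it only shows the trend, not the behavior of a real LSTM.

```python
import math

def tanh_grad(x):
    """Derivative of tanh at x; it never exceeds 1."""
    return 1.0 - math.tanh(x) ** 2

def gradient_through_stack(pre_activations):
    """Multiply the local tanh derivatives along a chain of layers."""
    grad = 1.0
    for x in pre_activations:
        grad *= tanh_grad(x)
    return grad

# Hypothetical pre-activation value, identical in every layer for simplicity.
PRE_ACTIVATION = 1.0

shallow = gradient_through_stack([PRE_ACTIVATION] * 2)   # 2 nonlinear layers
deep = gradient_through_stack([PRE_ACTIVATION] * 16)     # 16 nonlinear layers

# A purely linear path with unit weights would leave the gradient unscaled.
linear_path = 1.0
```

Because every factor is below 1, `deep` is several orders of magnitude smaller than `shallow`, whereas a path without nonlinear activations does not shrink the gradient in this way.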

The paper introduces a new type of linear connection (fast-forward connections) for multi-layer recurrent networks. In addition, the encoder uses an interleaved bi-directional architecture to stack LSTM layers. This topology can be used in encoder-decoder networks as well as in attention networks.
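As a rough illustration of these two ideas, the sketch below stacks toy recurrent layers whose direction flips at every layer, and keeps a separate path `f` that is updated by a linear combination only. Everything here is an assumption made for the example: the scalar tanh recurrence stands in for an LSTM, and `recurrent_pass`, `deep_encoder`, and `ff_weight` are hypothetical names. The exact equations are given in the paper and are not reproduced in these notes.

```python
import math

def recurrent_pass(inputs, reverse):
    """Toy scalar recurrence h_t = tanh(x_t + 0.5 * h_prev), a stand-in for an LSTM."""
    n = len(inputs)
    order = reversed(range(n)) if reverse else range(n)
    outputs = [0.0] * n
    h = 0.0
    for t in order:
        h = math.tanh(inputs[t] + 0.5 * h)
        outputs[t] = h
    return outputs

def deep_encoder(inputs, depth, ff_weight=1.0):
    """Stack `depth` layers, flipping the reading direction at every layer.

    f carries the linear (fast-forward) path; h carries the recurrent output.
    """
    f = list(inputs)
    h = [0.0] * len(inputs)
    for layer in range(depth):
        # Linear combination only: no activation and no recurrence on this path.
        f = [ff_weight * f_t + h_t for f_t, h_t in zip(f, h)]
        # Even layers read left-to-right, odd layers read right-to-left.
        h = recurrent_pass(f, reverse=(layer % 2 == 1))
    return f, h

f, h = deep_encoder([0.5, -0.2, 0.1], depth=4)
forward = recurrent_pass([1.0, 0.0], reverse=False)
backward = recurrent_pass([1.0, 0.0], reverse=True)
```

In the `forward` pass the second position has already seen the first input, while in the `backward` pass it has not. Alternating directions across layers lets deeper layers combine context from both sides without running two separate directions in every layer.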

2. NMT

3. Deep Topology

3.1 Network

3.2 Train technique

3.3 Generation

4. Experiments

4.1 Data sets

4.2 Model settings

4.3 Optimization

4.4 Results

4.4.1 Single models

4.4.2 Post processing

4.5 Analysis

4.5.1 Length

4.5.2 Unknown words

4.5.3 Over-fitting

5. Conclusion

Original article; please credit the source when reposting.
Link to this article: http://daiwk.github.io/posts/nlp-rnn-nmt-baidu.html
