Hands-On: Building a Question Answering System with TensorFlow

學習雪雪 2017-11-29

Reprinted with permission from OReillyData (ID: OReillyData)

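A question answering (QA) system is a system designed to answer questions posed in natural language. Some QA systems draw information from a "source," such as text or an image, in order to answer a specific question. These sourced systems fall into two broad categories: open domain, in which the questions can be about virtually anything and are not limited to a particular field; and closed domain, in which the questions have concrete limitations because they relate to some predefined source, such as a provided context or a specific field (medicine, for example).

This article guides you through the task of creating and coding a question answering system with TensorFlow. We'll build a QA system that is based on a neural network and sourced from a closed domain. To do this, we'll use a simplified version of a model known as a dynamic memory network (DMN), introduced by Kumar et al. in their paper "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing."

Before you get started

In addition to installing Python 3 and TensorFlow 1.2, make sure you've installed each of the following: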

Jupyter
Numpy
Matplotlib

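Optionally, you can install TQDM to view training progress and get training speed metrics, but it's not required. The code and the Jupyter Notebook for this article are available on GitHub, and I encourage you to grab them and follow along. If this is your first time working with TensorFlow, I recommend that you first read Aaron Schumacher's "Hello, TensorFlow" for a quick overview of what TensorFlow is and how it works. If this is your first time using TensorFlow for natural language tasks, I would also encourage you to read "Textual Entailment with TensorFlow," as it introduces several concepts that will be used to help construct this network.

Let's start by importing all the relevant libraries: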

%matplotlib inline

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import urllib
import sys
import os
import zipfile
import tarfile
import json
import hashlib
import re
import itertools

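Exploring bAbI

For this project, we will be using the bAbI data set created by Facebook. Like all QA data sets, it contains questions. The questions in bAbI are very straightforward, although some are trickier than others. Every question in the data set has an associated context, which is a sequence of sentences guaranteed to contain the details needed to answer the question. In addition, the data set provides the correct answer to each question.

The questions in bAbI are partitioned into 20 different tasks, based on the skill required to answer them. Each task has its own set of questions for training and a separate set for testing. These tasks test a variety of standard natural language processing abilities, including time reasoning (task #14) and inductive logic (task #16). To get a better idea of this, let's consider a concrete example of a question that our QA system will be expected to answer, as shown in Figure 1.

Figure 1. An example from the bAbI data set. The context is in the blue box, the question in the gold box, and the answer in the green box. Source: Steven Hewitt

This task (#5) tests the network's understanding of actions that involve relationships between three objects. Grammatically speaking, it tests whether the system can distinguish between the subject, the direct object, and the indirect object. In this case, the question asks for the indirect object in the last sentence: the person who received the milk from Jeff. The network must distinguish between the fifth sentence, in which Bill is the subject and Jeff is the indirect object, and the sixth sentence, in which Jeff is the subject. Of course, our network doesn't receive any explicit training on what a subject or an object is; it has to extrapolate this understanding from the examples in the training data.

Another minor problem the system must solve is understanding the various synonyms used throughout the data set. Jeff "handed" the milk to Bill, but he could just as easily have "given" it or "passed" it to him. In this regard, though, the network doesn't have to start from scratch. It gets some assistance in the form of word vectorization, which can store information about the definitions of words and their relationships to other words. Similar words have similar vectorizations, which means the network can treat them as nearly the same word. For word vectorization, we'll use Stanford's Global Vectors for Word Representation (GloVe), which I've discussed in more detail in a previous article.

Many of the tasks have a restriction that forces the context to contain the exact word used for the answer. In our example above, the answer "Bill" can be found in the context. We will use this restriction to our advantage, as we can search the context for the word closest in meaning to our final result.

Note: It may take a few minutes to download and unpack all of this data, so run the next three code snippets to get that started as quickly as possible. They download bAbI and GloVe and unpack the files we need from those data sets so they can be used in our network.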

glove_zip_file = "glove.6B.zip"
glove_vectors_file = "glove.6B.50d.txt"

# 15 MB
data_set_zip = "tasks_1-20_v1-2.tar.gz"

# Select "task 5"
train_set_file = "qa5_three-arg-relations_train.txt"
test_set_file = "qa5_three-arg-relations_test.txt"

train_set_post_file = "tasks_1-20_v1-2/en/" + train_set_file
test_set_post_file = "tasks_1-20_v1-2/en/" + test_set_file

try:
    from urllib.request import urlretrieve, urlopen
except ImportError:
    from urllib import urlretrieve
    from urllib2 import urlopen

# large file - 862 MB
if (not os.path.isfile(glove_zip_file) and
        not os.path.isfile(glove_vectors_file)):
    urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip",
                glove_zip_file)

# The task files are extracted under tasks_1-20_v1-2/en/, so check those paths.
if (not os.path.isfile(data_set_zip) and
        not (os.path.isfile(train_set_post_file) and os.path.isfile(test_set_post_file))):
    urlretrieve("https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz",
                data_set_zip)

def unzip_single_file(zip_file_name, output_file_name):
    """
    If the output file is already created, don't recreate
    If the output file does not exist, create it from the zipFile
    """
    if not os.path.isfile(output_file_name):
        with open(output_file_name, 'wb') as out_file:
            with zipfile.ZipFile(zip_file_name) as zipped:
                for info in zipped.infolist():
                    if output_file_name in info.filename:
                        with zipped.open(info) as requested_file:
                            out_file.write(requested_file.read())
                            return

def targz_unzip_single_file(zip_file_name, output_file_name, interior_relative_path):
    # Skip extraction when the file already exists at its extracted location.
    if not os.path.isfile(interior_relative_path + output_file_name):
        with tarfile.open(zip_file_name) as un_zipped:
            un_zipped.extract(interior_relative_path + output_file_name)

unzip_single_file(glove_zip_file, glove_vectors_file)
targz_unzip_single_file(data_set_zip, train_set_file, "tasks_1-20_v1-2/en/")
targz_unzip_single_file(data_set_zip, test_set_file, "tasks_1-20_v1-2/en/")

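Parsing GloVe and handling unknown tokens

In "Textual Entailment with TensorFlow," I discuss sentence2sequence, a function that turns a string into a matrix based on the mapping defined by GloVe. This function splits the string into tokens, which are smaller strings roughly equivalent to punctuation, words, or parts of words. For example, "Bill traveled to the kitchen." contains six tokens: five that correspond to the words, and one for the period at the end. Each token is vectorized individually, resulting in a list of vectors corresponding to the sentence, as shown in Figure 2.

Figure 2. The process of turning a sentence into vectors. Source: Steven Hewitt

In some tasks in bAbI, the system will encounter words that are not in the GloVe word vectorization. For the network to be able to process these unknown words, we need to maintain a consistent vectorization of them. A common practice is to replace all unknown tokens with a single vector, but this isn't always effective. Instead, we use randomization to draw a new vectorization for each unique unknown token.

The first time we run across a new unknown token, we draw a new vectorization from the (approximately Gaussian) distribution of the original GloVe vectorizations, and add that vectorization back to the GloVe word map. To gather the distribution hyperparameters, NumPy has functions that automatically calculate variance and mean.

The fill_unk function below takes care of giving us a new word vectorization whenever we need one.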

# Deserialize GloVe vectors
glove_wordmap = {}
with open(glove_vectors_file, "r", encoding="utf8") as glove:
    for line in glove:
        name, vector = tuple(line.split(" ", 1))
        glove_wordmap[name] = np.fromstring(vector, sep=" ")

wvecs = []
for item in glove_wordmap.items():
    wvecs.append(item[1])
s = np.vstack(wvecs)

# Gather the distribution hyperparameters
v = np.var(s, 0)
m = np.mean(s, 0)
RS = np.random.RandomState()

def fill_unk(unk):
    global glove_wordmap
    glove_wordmap[unk] = RS.multivariate_normal(m, np.diag(v))
    return glove_wordmap[unk]
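To make the sampling idea concrete, here is a minimal sketch on a tiny made-up vocabulary instead of GloVe. The names toy_wordmap and vector_for and all the numbers are assumptions for illustration only. It draws each dimension independently with NumPy's normal, which should follow the same distribution as a multivariate normal with a diagonal covariance, and it caches the result so that a given unknown token always maps to the same vector.

```python
import numpy as np

# Toy stand-in for the GloVe map: three words, four dimensions (made-up values).
toy_wordmap = {
    "bill": np.array([0.1, 0.3, -0.2, 0.5]),
    "milk": np.array([0.4, -0.1, 0.0, 0.2]),
    "gave": np.array([-0.3, 0.2, 0.1, 0.1]),
}

# Per-dimension mean and variance of the known vectors.
stacked = np.vstack(list(toy_wordmap.values()))
mean = stacked.mean(axis=0)
var = stacked.var(axis=0)

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable

def vector_for(token):
    """Return the stored vector, sampling and caching one for unseen tokens."""
    if token not in toy_wordmap:
        # Independent Gaussian per dimension, i.e. a diagonal covariance.
        toy_wordmap[token] = rng.normal(mean, np.sqrt(var))
    return toy_wordmap[token]

first = vector_for("jeff")   # sampled on first use
second = vector_for("jeff")  # served from the cache
```

Because the sampled vector is written back into the map, every later occurrence of the same unknown token reuses it, which is what gives the network a consistent representation to learn from.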

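Known or unknown

The limited vocabulary of bAbI tasks means that the network can learn the relationships between words even without knowing what the words mean. However, to speed up learning, we should choose vectorizations that have inherent meaning when we can. To do this, we use a greedy search for words that exist in Stanford's GloVe word vectorization data set, and if the word does not exist, we fill in the entire word with an unknown, randomly created, new representation.

Under that model of word vectorization, we can define a new sentence2sequence: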

def sentence2sequence(sentence):
    """
    - Turns an input paragraph into an (n,d) matrix,
        where n is the number of tokens in the sentence
        and d is the number of dimensions each word vector has.

      TensorFlow doesn't need to be used here, as simply
      turning the sentence into a sequence based off our
      mapping does not need the computational power that
      TensorFlow provides. Normal Python suffices for this task.
    """
    tokens = sentence.strip('"(),-').lower().split(" ")
    rows = []
    words = []
    # Greedy search for tokens
    for token in tokens:
        i = len(token)
        while len(token) > 0:
            word = token[:i]
            if word in glove_wordmap:
                rows.append(glove_wordmap[word])
                words.append(word)
                token = token[i:]
                i = len(token)
                continue
            else:
                i = i-1
            if i == 0:
                # word OOV
                # https://arxiv.org/pdf/1611.01436.pdf
                rows.append(fill_unk(token))
                words.append(token)
                break
    return np.array(rows), words
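The greedy search is easier to follow without the vector bookkeeping. The sketch below is a simplified illustration, not the article's function: greedy_split and the three-entry vocab are hypothetical names and data. It repeatedly takes the longest known prefix of a token and, when no prefix is known, keeps the remainder as a single unknown piece.

```python
def greedy_split(token, vocab):
    """Split a token into the longest known prefixes, left to right."""
    pieces = []
    i = len(token)
    while token:
        prefix = token[:i]
        if prefix in vocab:
            # Longest known prefix found: keep it and continue with the rest.
            pieces.append(prefix)
            token = token[i:]
            i = len(token)
        else:
            i -= 1
            if i == 0:
                # No prefix is known: treat the whole remainder as unknown.
                pieces.append(token)
                break
    return pieces

vocab = {"kitchen", ".", "bill"}  # hypothetical toy vocabulary

known = greedy_split("kitchen.", vocab)
unknown = greedy_split("zzz", vocab)
mixed = greedy_split("billx", vocab)
```

Here "kitchen." splits into the word and the trailing period, which is how the sentence in the earlier example ends up with six tokens.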

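Now we can package together all the data needed for each question, including the vectorization of the contexts, questions, and answers. In bAbI, contexts are defined by a numbered sequence of sentences, which contextualize deserializes into a list of sentences associated with one context. Questions and answers are on the same line, separated by tabs, so we can use tabs as a marker of whether a specific line refers to a question or not. When the numbering resets, future questions will refer to the new context (note that there are often multiple questions for a single context). Answers also contain one other piece of information that we keep but don't need to use: the numbers of the sentences that support the answer. In our system, the network will teach itself which sentences are needed to answer the question.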

def contextualize(set_file):
    """
    Read in the dataset of questions and build question+answer -> context sets.
    Output is a list of data points, each of which is a 7-element tuple containing:
        The sentences in the context in vectorized form.
        The sentences in the context as a list of string tokens.
        The question in vectorized form.
        The question as a list of string tokens.
        The answer in vectorized form.
        The answer as a list of string tokens.
        A list of numbers for supporting statements, which is currently unused.
    """
    data = []
    context = []
    with open(set_file, "r", encoding="utf8") as train:
        for line in train:
            # Split the line numbers from the sentences they refer to.
            l, ine = tuple(line.split(" ", 1))
            # Compare by value; "is" tests identity and is unreliable for strings.
            if l == "1":
                # New contexts always start with 1,
                # so this is a signal to reset the context.
                context = []
            if "\t" in ine:
                # Tabs are the separator between questions and answers,
                # and are not present in context statements.
                question, answer, support = tuple(ine.split("\t"))
                data.append((tuple(zip(*context))+
                             sentence2sequence(question)+
                             sentence2sequence(answer)+
                             ([int(s) for s in support.split()],)))
                # Multiple questions may refer to the same context, so we don't reset it.
            else:
                # Context sentence.
                context.append(sentence2sequence(ine[:-1]))
    return data

train_data = contextualize(train_set_post_file)
test_data = contextualize(test_set_post_file)
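As a rough illustration of the file layout described above, the following sketch parses a few made-up lines written in the bAbI style. The sample text and the parse_babi helper are assumptions for illustration and keep plain strings instead of word vectors. It shows the three rules at work: the line number "1" resets the context, a tab marks a question line, and the last tab-separated field lists the supporting sentence numbers.

```python
# Made-up lines in the bAbI layout: "<number> <sentence>" for context,
# and "<number> <question>\t<answer>\t<supporting numbers>" for questions.
sample = (
    "1 Bill travelled to the office.\n"
    "2 Bill picked up the football there.\n"
    "3 Bill went to the bedroom.\n"
    "4 Bill gave the football to Fred.\n"
    "5 What did Bill give to Fred?\tfootball\t4\n"
    "1 Mary went to the kitchen.\n"
    "2 Mary handed the milk to Jeff.\n"
    "3 Who received the milk?\tJeff\t2\n"
)

def parse_babi(text):
    """Group each question with the context sentences that precede it."""
    examples = []
    context = []
    for line in text.splitlines():
        number, rest = line.split(" ", 1)
        if number == "1":
            context = []  # numbering restarted, so a new context begins
        if "\t" in rest:
            question, answer, support = rest.split("\t")
            examples.append({
                "context": list(context),  # copy: later lines may extend it
                "question": question,
                "answer": answer,
                "support": [int(s) for s in support.split()],
            })
        else:
            context.append(rest)
    return examples

examples = parse_babi(sample)
```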

final_train_data = []

def finalize(data):
    """
    Prepares data generated by contextualize() for use in the network.
    """
    final_data = []
    # Iterate over the argument, not train_data, so the test set is processed correctly.
    for cqas in data:
        contextvs, contextws, qvs, qws, avs, aws, spt = cqas
        lengths = itertools.accumulate(len(cvec) for cvec in contextvs)
        context_vec = np.concatenate(contextvs)
        context_words = sum(contextws,[])
        # Location markers for the beginnings of new sentences.
        sentence_ends = np.array(list(lengths))
        final_data.append((context_vec, sentence_ends, qvs, spt, context_words, cqas, avs, aws))
    return np.array(final_data)

final_train_data = finalize(train_data)
final_test_data = finalize(test_data)

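Defining hyperparameters

At this point, we have fully prepared our training data and our testing data. The next task is to construct the network we'll use to understand the data. Let's start by clearing out the TensorFlow default graph so we always have the option to run the network again if we want to change something.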

tf.reset_default_graph()

Since this is the beginning of the actual network, let's also define all the constants we'll need. We call these "hyperparameters," as they define how the network is structured and how it trains.

# Hyperparameters

# The number of dimensions used to store data passed between recurrent layers in the network.
recurrent_cell_size = 128

# The number of dimensions in our word vectorizations.
D = 50

# How quickly the network learns. Too high, and we may run into numeric instability
# or other issues.
learning_rate = 0.005

# Dropout probabilities. For a description of dropout and what these probabilities are,
# see Entailment with TensorFlow.
input_p, output_p = 0.5, 0.5

# How many questions we train on at a time.
batch_size = 128

# Number of passes in episodic memory. We'll get to this later.
passes = 4

# Feed Forward layer sizes: the number of dimensions used to store data passed from feed-forward layers.
ff_hidden_size = 256

# The strength of our regularization. Increase to encourage sparsity in episodic memory,
# but makes training slower. Don't make this larger than learning_rate.
weight_decay = 0.00000001

# How many questions the network trains on each time it is trained.
# Some questions are counted multiple times.
training_iterations_count = 400000

# How many iterations of training occur before each validation check.
display_step = 100

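Network structure

With the hyperparameters out of the way, let's describe the network structure. It is split loosely into four modules, which are described in "Ask Me Anything: Dynamic Memory Networks for Natural Language Processing."

The network is designed around a recurrent layer whose memory is set dynamically, based on other information in the text, hence the name dynamic memory network (DMN). DMNs are loosely based on an understanding of how a human tries to answer a reading-comprehension question. The person first reads the context and creates memories of the facts inside. With those facts in mind, they then read the question and re-examine the context, looking specifically for answers related to that question and comparing the question to each of the facts.

Sometimes, one fact guides us to another. In the bAbI data set, the network might want to find the location of a football. It might search for sentences about the football and find that John was the last person to touch it. It then searches for sentences about John and finds that John had been in both the bedroom and the hallway. Once it realizes that John was last in the hallway, it can answer the question and confidently say that the football is in the hallway.

Figure 3. The four modules of the neural network, working together to answer a bAbI question. In each episode, new facts are attended to, which helps find the answer. Kumar notes that the network incorrectly puts some weight on sentence 2, which makes sense since John was there, even though he did not have the football at that time. Source: Ankit Kumar et al., used with permission

Input module

The input module is the first of the four modules that a dynamic memory network uses to come up with its answer. It consists of a simple pass over the input with a gated recurrent unit, or GRU (tf.contrib.rnn.GRUCell in TensorFlow), to gather pieces of evidence. Each piece of evidence, or fact, corresponds to a single sentence in the context and is represented by the output at that timestep. This requires a bit of non-TensorFlow preprocessing, so that we can gather the locations of the ends of sentences and pass them to TensorFlow for use in later modules.

We'll take care of that extra processing later, when we get to training. We can use the processed data with TensorFlow's gather_nd to select the corresponding outputs. The gather_nd function is an extraordinarily useful tool, and I'd suggest you review the API documentation to learn how it works.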

# Input Module

# Context: A [batch_size, maximum_context_length, word_vectorization_dimensions] tensor
# that contains all the context information.
context = tf.placeholder(tf.float32, [None, None, D], "context")
context_placeholder = context # I use context as a variable name later on

# input_sentence_endings: A [batch_size, maximum_sentence_count, 2] tensor that
# contains the locations of the ends of sentences.
input_sentence_endings = tf.placeholder(tf.int32, [None, None, 2], "sentence")

# recurrent_cell_size: the number of hidden units in recurrent layers.
input_gru = tf.contrib.rnn.GRUCell(recurrent_cell_size)

# input_p: The probability of maintaining a specific hidden input unit.
# Likewise, output_p is the probability of maintaining a specific hidden output unit.
gru_drop = tf.contrib.rnn.DropoutWrapper(input_gru, input_p, output_p)

# dynamic_rnn also returns the final internal state. We don't need that, and can
# ignore the corresponding output (_).
input_module_outputs, _ = tf.nn.dynamic_rnn(gru_drop, context, dtype=tf.float32, scope = "input_module")

# cs: the facts gathered from the context.
cs = tf.gather_nd(input_module_outputs, input_sentence_endings)
# to use every word as a fact, useful for tasks with one-sentence contexts
s = input_module_outputs
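To get a feel for what gather_nd does with this index layout, here is a small NumPy analogue. It is a sketch under the assumption that each index pair is (batch index, time step), as in the input_sentence_endings placeholder; the array names and sizes are made up for illustration, and NumPy integer-array indexing stands in for the TensorFlow call.

```python
import numpy as np

batch, steps, hidden = 2, 5, 3

# Stand-in for the GRU outputs: one hidden vector per (context, time step).
outputs = np.arange(batch * steps * hidden, dtype=np.float32).reshape(batch, steps, hidden)

# Each entry is (batch index, time step of a sentence's last token).
endings = np.array([
    [[0, 1], [0, 4]],   # context 0: sentences end at steps 1 and 4
    [[1, 2], [1, 3]],   # context 1: sentences end at steps 2 and 3
])

# Pick one output vector per sentence ending.
# Result shape: [batch, sentence_count, hidden].
facts = outputs[endings[..., 0], endings[..., 1]]
```

Each selected row is the recurrent output at the last token of a sentence, which is what the article treats as a "fact."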

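Question module

The question module is the second module, and arguably the simplest. It consists of another GRU pass, this time over the text of the question. Instead of gathering pieces of evidence, we can simply pass forward the end state, as the data set guarantees that the question is one sentence long.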

# Question Module

# query: A [batch_size, maximum_question_length, word_vectorization_dimensions] tensor
#  that contains all of the questions.
query = tf.placeholder(tf.float32, [None, None, D], "query")

# input_query_lengths: A [batch_size, 2] tensor that contains question length information.
# input_query_lengths[:,1] has the actual lengths; input_query_lengths[:,0] is a simple range()
# so that it plays nice with gather_nd.
input_query_lengths = tf.placeholder(tf.int32, [None, 2], "query_lengths")

question_module_outputs, _ = tf.nn.dynamic_rnn(gru_drop, query, dtype=tf.float32,
                                               scope = tf.VariableScope(True, "input_module"))

# q: the question states. A [batch_size, recurrent_cell_size] tensor.
q = tf.gather_nd(question_module_outputs, input_query_lengths)

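Episodic memory module

Our third module, the episodic memory module, is where things begin to get interesting. It uses attention to make multiple passes over the data, each pass consisting of GRUs iterating over the input. Each iteration inside a pass applies a weighted update to the current memory, based on how much attention is being paid to the corresponding fact at that time.

Attention

Attention in neural networks was originally designed for image analysis, especially for cases where parts of the image are far more relevant than others. Networks use attention to decide which parts of an image deserve further analysis in tasks such as finding objects in images, tracking objects that move between images, facial recognition, or other tasks that benefit from finding the most pertinent information within the image.

The main problem is that attention, or at least hard attention (which attends to exactly one input location), is not easily optimizable. As with most other neural networks, our optimization scheme is to compute the derivative of a loss function with respect to our inputs and weights, and hard attention is simply not differentiable, thanks to its binary nature. Instead, we are forced to use the real-valued version known as "soft attention," which weights every input location by how much attention it receives. Thankfully, this real-valued weighting is fully differentiable and can be trained normally. While it is possible to learn hard attention, it is much more difficult and sometimes performs worse than soft attention. Thus, we'll stick with soft attention for this model. Don't worry about coding the derivative, either; TensorFlow's optimization schemes do it for us.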
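The difference between the two kinds of attention can be shown in a few lines of NumPy. This is a toy sketch with made-up facts and scores, not the model's attention function: hard attention picks a single fact with argmax, which has no useful gradient, while soft attention turns the same scores into positive weights that sum to one and takes a weighted average.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

# Three toy "facts" (rows) and one relevance score per fact (made-up values).
facts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = np.array([2.0, 0.5, 1.0])

# Hard attention: all weight on the single best-scoring fact.
hard_weights = np.zeros_like(scores)
hard_weights[np.argmax(scores)] = 1.0

# Soft attention: every fact gets some weight, and the weights vary
# smoothly with the scores, so gradients can flow through them.
soft_weights = softmax(scores)

hard_summary = hard_weights @ facts
soft_summary = soft_weights @ facts
```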

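We calculate attention in this model by constructing similarity measures between each fact, our current memory, and the original question. (Note that this differs from normal attention, which only constructs similarity measures between facts and current memory.) We pass the results through a two-layer feed-forward network to get an attention constant for each fact. We then modify the memory by doing a weighted pass with a GRU over the input facts, weighted by the corresponding attention constant. To avoid adding incorrect information into memory when the context is shorter than the full length of the matrix, we create a mask for which facts exist and don't attend at all (that is, we retain the same memory) when the fact does not exist.

Another notable aspect is that the attention mask is nearly always wrapped around a representation used by a layer. For images, that wrapping is most likely to happen around a convolutional layer (probably one with a direct mapping to locations in the image), and for natural language, it is most likely to happen around a recurrent layer. Wrapping attention around a feed-forward layer, while technically possible, is usually not useful, at least not in ways that can't be more easily simulated by further feed-forward layers.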
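The similarity measures and the mask can be sketched in NumPy as follows. The sizes, the random values, and the scores array are placeholders chosen for illustration; in the model, the scores come from the two-layer feed-forward network. The sketch shows why the first weight matrix has recurrent_cell_size*7 inputs (seven blocks are concatenated per fact) and how multiplying by an existence mask before normalizing gives padded facts exactly zero attention.

```python
import numpy as np

d = 4            # toy hidden size (the article uses 128)
num_facts = 3
rng = np.random.RandomState(0)

c = rng.randn(num_facts, d)                  # one row per fact
m = np.tile(rng.randn(d), (num_facts, 1))    # current memory, repeated per fact
q = np.tile(rng.randn(d), (num_facts, 1))    # question state, repeated per fact

# Seven blocks of width d per fact: raw vectors, products, squared differences.
features = np.concatenate(
    [c, m, q, c * q, c * m, (c - q) ** 2, (c - m) ** 2], axis=1)

# Stand-in for the feed-forward network's output: one score per fact.
scores = rng.randn(num_facts)

# The third "fact" is padding and must receive no attention.
exists = np.array([1.0, 1.0, 0.0])
exp = np.exp(scores - scores.max()) * exists
weights = exp / exp.sum()
```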

# Episodic Memory

# make sure the current memory (i.e. the question vector) is broadcasted along the facts dimension
size = tf.stack([tf.constant(1),tf.shape(cs)[1], tf.constant(1)])
re_q = tf.tile(tf.reshape(q,[-1,1,recurrent_cell_size]),size)

# Final output for attention, needs to be 1 in order to create a mask
output_size = 1

# Weights and biases
attend_init = tf.random_normal_initializer(stddev=0.1)
w_1 = tf.get_variable("attend_w1", [1,recurrent_cell_size*7, recurrent_cell_size],
                      tf.float32, initializer = attend_init)
w_2 = tf.get_variable("attend_w2", [1,recurrent_cell_size, output_size],
                      tf.float32, initializer = attend_init)

b_1 = tf.get_variable("attend_b1", [1, recurrent_cell_size],
                      tf.float32, initializer = attend_init)
b_2 = tf.get_variable("attend_b2", [1, output_size],
                      tf.float32, initializer = attend_init)

# Regulate all the weights and biases
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, tf.nn.l2_loss(w_1))
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, tf.nn.l2_loss(b_1))
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, tf.nn.l2_loss(w_2))
tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES, tf.nn.l2_loss(b_2))

def attention(c, mem, existing_facts):
    """
    Custom attention mechanism.
    c: A [batch_size, maximum_sentence_count, recurrent_cell_size] tensor
        that contains all the facts from the contexts.
    mem: A [batch_size, maximum_sentence_count, recurrent_cell_size] tensor that
        contains the current memory. It should be the same memory for all facts for accurate results.
    existing_facts: A [batch_size, maximum_sentence_count, 1] tensor that
        acts as a binary mask for which facts exist and which do not.
    """
    with tf.variable_scope("attending") as scope:
        # attending: The metrics by which we decide what to attend to.
        attending = tf.concat([c, mem, re_q, c * re_q,  c * mem, (c-re_q)**2, (c-mem)**2], 2)

        # m1: First layer of multiplied weights for the feed-forward network.
        #     We tile the weights in order to manually broadcast, since tf.matmul does not
        #     automatically broadcast batch matrix multiplication as of TensorFlow 1.2.
        m1 = tf.matmul(attending * existing_facts,
                       tf.tile(w_1, tf.stack([tf.shape(attending)[0],1,1]))) * existing_facts
        # bias_1: A masked version of the first feed-forward layer's bias
        #     over only existing facts.
        bias_1 = b_1 * existing_facts

        # tnhan: First nonlinearity. In the original paper, this is a tanh nonlinearity;
        #        choosing relu was a design choice intended to avoid issues with
        #        low gradient magnitude when the tanh returned values close to 1 or -1.
        tnhan = tf.nn.relu(m1 + bias_1)

        # m2: Second layer of multiplied weights for the feed-forward network.
        #     Still tiling weights for the same reason described in m1's comments.
        m2 = tf.matmul(tnhan, tf.tile(w_2, tf.stack([tf.shape(attending)[0],1,1])))

        # bias_2: A masked version of the second feed-forward layer's bias.
        bias_2 = b_2 * existing_facts

        # norm_m2: A normalized version of the second layer of weights, which is used
        #     to help make sure the softmax nonlinearity doesn't saturate.
        norm_m2 = tf.nn.l2_normalize(m2 + bias_2, -1)

        # softmaxable: A hack in order to use sparse_softmax on an otherwise dense tensor.
        #     We make norm_m2 a sparse tensor, then make it dense again after the operation.
        softmax_idx = tf.where(tf.not_equal(norm_m2, 0))[:,:-1]
        softmax_gather = tf.gather_nd(norm_m2[...,0], softmax_idx)
        softmax_shape = tf.shape(norm_m2, out_type=tf.int64)[:-1]
        softmaxable = tf.SparseTensor(softmax_idx, softmax_gather, softmax_shape)
        return tf.expand_dims(tf.sparse_tensor_to_dense(tf.sparse_softmax(softmaxable)),-1)

# facts_0s: a [batch_size, max_facts_length, 1] tensor
#     whose values are 1 if the corresponding fact exists and 0 if not.
facts_0s = tf.cast(tf.count_nonzero(input_sentence_endings[:,:,-1:],-1,keep_dims=True),tf.float32)

with tf.variable_scope("Episodes") as scope:
    attention_gru = tf.contrib.rnn.GRUCell(recurrent_cell_size)

    # memory: A list of all tensors that are the (current or past) memory state
    #   of the attention mechanism.
    memory = [q]

    # attends: A list of all tensors that represent what the network attends to.
    attends = []
    for a in range(passes):
        # attention mask
        attend_to = attention(cs, tf.tile(tf.reshape(memory[-1],[-1,1,recurrent_cell_size]),size),
                              facts_0s)

        # Inverse attention mask, for what's retained in the state.
        retain = 1-attend_to

        # GRU pass over the facts, according to the attention mask.
        # Keep looping while the index is within the number of facts.
        while_valid_index = (lambda state, index: index < tf.shape(cs)[1])
        update_state = (lambda state, index: (attend_to[:,index,:] *
                                              attention_gru(cs[:,index,:], state)[0] +
                                              retain[:,index,:] * state))
        # start loop with most recent memory and at the first index
        memory.append(tuple(tf.while_loop(while_valid_index,
                      (lambda state, index: (update_state(state,index),index+1)),
                       loop_vars = [memory[-1], 0]))[0])

        attends.append(attend_to)

        # Reuse variables so the GRU pass uses the same variables every pass.
        scope.reuse_variables()
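The update inside each pass follows the pattern new_state = g * cell(fact, state) + (1 - g) * state, where g is the attention paid to that fact. The NumPy sketch below illustrates only this gating pattern. toy_cell is a deliberately simple stand-in and is not a GRU, and the facts, gates, and starting state are made-up values.

```python
import numpy as np

def toy_cell(fact, state):
    # Stand-in for the recurrent cell (NOT a GRU): any function of the
    # current fact and the previous state works for showing the gating.
    return np.tanh(0.5 * fact + 0.5 * state)

def episode(facts, gates, state):
    """One pass over the facts; each gate blends the new and old state."""
    for fact, g in zip(facts, gates):
        state = g * toy_cell(fact, state) + (1.0 - g) * state
    return state

facts = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]])
start = np.array([0.2, -0.1])

# With no attention anywhere, the memory passes through unchanged.
ignored = episode(facts, gates=[0.0, 0.0, 0.0], state=start)

# With full attention on the second fact, only that fact updates the memory.
focused = episode(facts, gates=[0.0, 1.0, 0.0], state=start)
```

This is also why masked (non-existent) facts are harmless: their attention is zero, so the state is simply carried forward.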

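Answer module

The final module is the answer module. It uses a fully connected layer to regress from the outputs of the question and episodic memory modules to a "final result" word vector, and the word in the context that is closest in distance to that result becomes our final output (which guarantees that the result is an actual word). We find the closest word by creating a "score" for each word, which reflects the final result's distance from that word. While you can design an answer module that returns multiple words, that isn't needed for the bAbI tasks we attempt in this article.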

# Answer Module

# a0: Final memory state. (Input to answer module)
a0 = tf.concat([memory[-1], q], -1)

# fc_init: Initializer for the final fully connected layer's weights.
fc_init = tf.random_normal_initializer(stddev=0.1)

with tf.variable_scope("answer"):
    # w_answer: The final fully connected layer's weights.
    w_answer = tf.get_variable("weight", [recurrent_cell_size*2, D],
                               tf.float32, initializer = fc_init)
    # Regulate the fully connected layer's weights
    tf.add_to_collection(tf.GraphKeys.REGULARIZATION_LOSSES,
                         tf.nn.l2_loss(w_answer))

    # The regressed word. This isn't an actual word yet;
    #    we still have to find the closest match.
    logit = tf.expand_dims(tf.matmul(a0, w_answer),1)

    # Make a mask over which words exist.
    with tf.variable_scope("ending"):
        all_ends = tf.reshape(input_sentence_endings, [-1,2])
        range_ends = tf.range(tf.shape(all_ends)[0])
        ends_indices = tf.stack([all_ends[:,0],range_ends], axis=1)
        ind = tf.reduce_max(tf.scatter_nd(ends_indices, all_ends[:,1],
                                          [tf.shape(q)[0], tf.shape(all_ends)[0]]),
                            axis=-1)
        range_ind = tf.range(tf.shape(ind)[0])
        mask_ends = tf.cast(tf.scatter_nd(tf.stack([ind, range_ind], axis=1),
                                          tf.ones_like(range_ind), [tf.reduce_max(ind)+1,
                                                                    tf.shape(ind)[0]]), bool)
        # A bit of a trick. With the locations of the ends of the mask (the last periods in
        #  each of the contexts) as 1 and the rest as 0, we can scan with exclusive or
        #  (starting from all 1). For each context in the batch, this will result in 1s
        #  up until the marker (the location of that last period) and 0s afterwards.
        mask = tf.scan(tf.logical_xor,mask_ends, tf.ones_like(range_ind, dtype=bool))

    # We score each possible word inversely with their Euclidean distance to the regressed word.
    #  The highest score (lowest distance) will correspond to the selected word.
    logits = -tf.reduce_sum(tf.square(context*tf.transpose(tf.expand_dims(
                    tf.cast(mask, tf.float32),-1),[1,0,2]) - logit), axis=-1)
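The scoring step can be illustrated with NumPy on a toy context. The two-dimensional vectors and the predicted vector below are made-up values for illustration, and the masking is left out: each context word is scored by its negative squared Euclidean distance to the regressed vector, and the highest score selects the answer.

```python
import numpy as np

context_words = ["bill", "gave", "the", "milk", "to", "jeff"]

# Toy 2-D word vectors, one row per context word (made-up values).
context_vecs = np.array([
    [0.9, 0.1],
    [0.0, 0.5],
    [0.1, 0.1],
    [0.4, 0.9],
    [0.2, 0.2],
    [0.8, 0.3],
])

# Stand-in for the vector regressed by the fully connected layer.
predicted = np.array([0.75, 0.35])

# Negative squared distance: the closest word gets the highest score.
scores = -np.sum((context_vecs - predicted) ** 2, axis=-1)
answer = context_words[int(np.argmax(scores))]
```

Restricting the search to words in the context is what guarantees that the output is a real word from the story.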

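Optimizing optimization

Gradient descent is the default optimizer for a neural network. Its goal is to decrease the network's "loss," which is a measure of how poorly the network performs. To do so, it finds the derivative of the loss with respect to each of the weights under the current input, and then "descends" the weights so that they reduce the loss. Most of the time this works well enough, but it is often not ideal. There are various schemes that use "momentum" or other approximations of a more direct path to the optimal weights. One of the most widely used is adaptive moment estimation, known as Adam.

Adam estimates the first two moments of the gradient by computing an exponentially decaying average of the gradients and of the squared gradients from past iterations. These two averages correspond to estimates of the mean and the variance of the gradients. The calculation uses two additional hyperparameters that control how quickly the averages decay as new information arrives. The averages are initialized to zero, which biases them toward zero, especially when those hyperparameters are close to one.

To counteract that bias, Adam computes bias-corrected moment estimates, which are greater in magnitude than the originals. The corrected estimates are then used to update the weights throughout the network. The combination of these estimates makes Adam one of the best choices overall for optimization, especially for complex networks. It is particularly well suited to data that is very sparse, as is common in natural language processing tasks.

In TensorFlow, we can use Adam by creating a tf.train.AdamOptimizer.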
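For readers who want to see the estimates written out, here is a minimal single-parameter sketch of the Adam update in plain Python. The function name adam_minimize and the quadratic objective are assumptions for illustration, and the defaults beta1=0.9, beta2=0.999, and eps=1e-8 follow commonly quoted values. It is not how TensorFlow implements its optimizer, only the arithmetic described above.

```python
import math

def adam_minimize(grad_fn, x, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    m = 0.0  # decaying average of gradients (first moment)
    v = 0.0  # decaying average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Both averages start at zero, so rescale them to undo that bias.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
grad = lambda x: 2 * (x - 3.0)

one_step = adam_minimize(grad, x=0.0, steps=1)
result = adam_minimize(grad, x=0.0, steps=500)
```

On the very first step the corrected estimates cancel the gradient's scale, so the parameter moves by roughly the learning rate regardless of how large the gradient is.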

# Training

# gold_standard: The real answers.
gold_standard = tf.placeholder(tf.float32, [None, 1, D], "answer")
with tf.variable_scope('accuracy'):
    eq = tf.equal(context, gold_standard)
    corrbool = tf.reduce_all(eq,-1)
    logloc = tf.reduce_max(logits, -1, keep_dims = True)
    # locs: A boolean tensor that indicates where the score
    #  matches the minimum score. This happens on multiple dimensions,
    #  so in the off chance there's one or two indexes that match
    #  we make sure it matches in all indexes.
    locs = tf.equal(logits, logloc)

    # correctsbool: A boolean tensor that indicates for which
    #   words in the context the score always matches the minimum score.
    correctsbool = tf.reduce_any(tf.logical_and(locs, corrbool), -1)
    # corrects: A tensor that is simply correctsbool cast to floats.
    corrects = tf.where(correctsbool, tf.ones_like(correctsbool, dtype=tf.float32),
                        tf.zeros_like(correctsbool,dtype=tf.float32))

    # corr: corrects, but for the right answer instead of our selected answer.
    corr = tf.where(corrbool, tf.ones_like(corrbool, dtype=tf.float32),
                    tf.zeros_like(corrbool,dtype=tf.float32))
with tf.variable_scope("loss"):
    # Use sigmoid cross entropy as the base loss,
    #  with our distances as the relative probabilities. There are
    #  multiple correct labels, for each location of the answer word within the context.
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = tf.nn.l2_normalize(logits,-1),
                                                   labels = corr)

    # Add regularization losses, weighted by weight_decay.
    total_loss = tf.reduce_mean(loss) + weight_decay * tf.add_n(
        tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

# TensorFlow's default implementation of the Adam optimizer works. We can adjust more than
#  just the learning rate, but it's not necessary to find a very good optimum.
optimizer = tf.train.AdamOptimizer(learning_rate)

# Once we have an optimizer, we ask it to minimize the loss
#   in order to work towards the proper training.
opt_op = optimizer.minimize(total_loss)

# Initialize variables
init = tf.global_variables_initializer()

# Launch the TensorFlow session
sess = tf.Session()
sess.run(init)

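Train the network

With everything set and ready, we can begin batching up our training data to train the network. While the system is training, we should check how well the network is doing in terms of accuracy. We do this with a validation set, which is taken from the testing data so that it has no overlap with the training data.

Using a validation set based on testing data lets us get a better sense of how well the network can generalize what it learns and apply it to contexts it has not seen. If we validated on the training data, the network might overfit: in other words, learn specific examples and memorize the answers to them, which doesn't help the network answer new questions.

If you installed TQDM, you can use it to keep track of how long the network has been training and to get an estimate of when training will finish. You can stop the training at any time, once you feel the results are good enough, by interrupting the Jupyter Notebook kernel.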

def prep_batch(batch_data, more_data = False):
    """
    Prepare all the preprocessing that needs to be done on a batch-by-batch basis.
    """
    context_vec, sentence_ends, questionvs, spt, context_words, cqas, answervs, _ = zip(*batch_data)
    ends = list(sentence_ends)
    maxend = max(map(len, ends))
    aends = np.zeros((len(ends), maxend))
    for index, i in enumerate(ends):
        for indexj, x in enumerate(i):
            aends[index, indexj] = x-1
    new_ends = np.zeros(aends.shape+(2,))

    for index, x in np.ndenumerate(aends):
        new_ends[index+(0,)] = index[0]
        new_ends[index+(1,)] = x

    contexts = list(context_vec)
    max_context_length = max([len(x) for x in contexts])
    contextsize = list(np.array(contexts[0]).shape)
    contextsize[0] = max_context_length
    final_contexts = np.zeros([len(contexts)]+contextsize)

    contexts = [np.array(x) for x in contexts]
    for i, context in enumerate(contexts):
        final_contexts[i,0:len(context),:] = context
    max_query_length = max(len(x) for x in questionvs)
    querysize = list(np.array(questionvs[0]).shape)
    querysize[:1] = [len(questionvs),max_query_length]
    queries = np.zeros(querysize)
    querylengths = np.array(list(zip(range(len(questionvs)),[len(q)-1 for q in questionvs])))
    questions = [np.array(q) for q in questionvs]
    for i, question in enumerate(questions):
        queries[i,0:len(question),:] = question
    data = {context_placeholder: final_contexts, input_sentence_endings: new_ends,
            query:queries, input_query_lengths:querylengths, gold_standard: answervs}
    return (data, context_words, cqas) if more_data else data

# Use TQDM if installed
tqdm_installed = False
try:
    from tqdm import tqdm
    tqdm_installed = True
except ImportError:
    pass

# Prepare validation set
batch = np.random.randint(final_test_data.shape[0], size=batch_size*10)
batch_data = final_test_data[batch]

validation_set, val_context_words, val_cqas = prep_batch(batch_data, True)

# training_iterations_count: The number of data pieces to train on in total
# batch_size: The number of data pieces per batch
def train(iterations, batch_size):
    training_iterations = range(0,iterations,batch_size)
    if tqdm_installed:
        # Add a progress bar if TQDM is installed
        training_iterations = tqdm(training_iterations)

    wordz = []
    for j in training_iterations:

        batch = np.random.randint(final_train_data.shape[0], size=batch_size)
        batch_data = final_train_data[batch]

        sess.run([opt_op], feed_dict=prep_batch(batch_data))
        # j is always a multiple of batch_size, so integer division is exact.
        if (j // batch_size) % display_step == 0:

            # Calculate batch accuracy
            acc, ccs, tmp_loss, log, con, cor, loc  = sess.run([corrects, cs, total_loss, logit,
                                                                context_placeholder,corr, locs],
                                                               feed_dict=validation_set)
            # Display results
            print("Iter " + str(j // batch_size) + ", Minibatch Loss= ",tmp_loss,
                  "Accuracy= ", np.mean(acc))

train(30000,batch_size) # Small amount of training for preliminary results

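After a little bit of training, let's peek inside the network and see what kinds of answers it gives. In the diagrams below, we visualize attention over each of the episodes (rows) for all the sentences (columns) in the context; darker colors represent more attention paid to that sentence in that episode.

You should see attention shift between at least two episodes for each question, but sometimes attention will find the answer within one episode, and sometimes it will take all four. If the attention appears to be blank, it may be saturating and paying attention to everything at once. In that case, you can try training with a higher weight_decay to discourage that from happening. Later on in training, saturation becomes extremely common.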

ancr = sess.run([corrbool,locs, total_loss, logits, facts_0s, w_1]+attends+
                [query, cs, question_module_outputs],feed_dict=validation_set)
a = ancr[0]
n = ancr[1]
cr = ancr[2]
attenders = np.array(ancr[6:-3])
faq = np.sum(ancr[4], axis=(-1,-2)) # Number of facts in each context

limit = 5
for question in range(min(limit, batch_size)):
    plt.yticks(range(passes,0,-1))
    plt.ylabel("Episode")
    plt.xlabel("Question "+str(question+1))
    pltdata = attenders[:,question,:int(faq[question]),0]
    # Display only information about facts that actually exist, all others are 0
    pltdata = (pltdata - pltdata.mean()) / ((pltdata.max() - pltdata.min() + 0.001)) * 256
    plt.pcolor(pltdata, cmap=plt.cm.BuGn, alpha=0.7)
    plt.show()

#print(list(map((lambda x: x.shape),ancr[3:])), new_ends.shape)

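To see what the answers to the questions above are, we can use the location of the best distance score within the context as an index and look up which word sits at that index.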

# Locations of responses within contexts
indices = np.argmax(n,axis=1)

# Locations of actual answers within contexts
indicesc = np.argmax(a,axis=1)

for i,e,cw, cqa in list(zip(indices, indicesc, val_context_words, val_cqas))[:limit]:
    ccc = " ".join(cw)
    print("TEXT: ",ccc)
    print ("QUESTION: ", " ".join(cqa[3]))
    print ("RESPONSE: ", cw[i], ["Correct", "Incorrect"][i!=e])
    print("EXPECTED: ", cw[e])
    print()

TEXT:  mary travelled to the bedroom . mary journeyed to the bathroom . mary got the football there . mary passed the football to fred .

QUESTION:  who received the football ?

RESPONSE:  mary Incorrect

EXPECTED:  fred

TEXT:  bill grabbed the apple there . bill got the football there . jeff journeyed to the bathroom . bill handed the apple to jeff . jeff handed the apple to bill . bill handed the apple to jeff . jeff handed the apple to bill . bill handed the apple to jeff .

QUESTION:  what did bill give to jeff ?

RESPONSE:  apple Correct

EXPECTED:  apple

TEXT:  bill moved to the bathroom . mary went to the garden . mary picked up the apple there . bill moved to the kitchen . mary left the apple there . jeff got the football there . jeff went back to the kitchen . jeff gave the football to fred .

QUESTION:  what did jeff give to fred ?

RESPONSE:  apple Incorrect

EXPECTED:  football

TEXT:  jeff travelled to the bathroom . bill journeyed to the bedroom . jeff journeyed to the hallway . bill took the milk there . bill discarded the milk . mary moved to the bedroom . jeff went back to the bedroom . fred got the football there . bill grabbed the milk there . bill passed the milk to mary . mary gave the milk to bill . bill discarded the milk there . bill went to the kitchen . bill got the apple there .

QUESTION:  who gave the milk to bill ?

RESPONSE:  jeff Incorrect

EXPECTED:  mary

TEXT:  fred travelled to the bathroom . jeff went to the bathroom . mary went back to the bathroom . fred went back to the bedroom . fred moved to the office . mary went back to the bedroom . jeff got the milk there . bill journeyed to the garden . mary went back to the kitchen . fred went to the bedroom . mary journeyed to the bedroom . jeff put down the milk there . jeff picked up the milk there . bill went back to the office . mary went to the kitchen . jeff went back to the kitchen . jeff passed the milk to mary . mary gave the milk to jeff . jeff gave the milk to mary . mary got the football there . bill travelled to the bathroom . fred moved to the garden . fred got the apple there . mary handed the football to jeff . fred put down the apple . jeff left the football .

QUESTION:  who received the football ?

RESPONSE:  mary Incorrect

EXPECTED:  jeff

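Let's keep training! To get good results, you may have to train for a long time (on my laptop, it took about 12 hours), but you should eventually be able to reach very high accuracy (over 90%). Experienced Jupyter Notebook users should note that you can interrupt training at any time and still keep the progress the network has made so far, as long as you keep using the same tf.Session. This is useful if you want to visualize the attention and the answers the network is currently giving.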

train(training_iterations_count, batch_size)

# Final testing accuracy

print(np.mean(sess.run([corrects], feed_dict= prep_batch(final_test_data))[0]))

0.95

Once we're done looking at what our model returns, we can close the session to free up system resources.

sess.close()

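Looking for more?

There is a lot still left to be done and experimented with:

Other tasks in bAbI. We've only sampled the many tasks that bAbI has to offer. Try changing the preprocessing to fit another task and see how the dynamic memory network performs on it. You may want to retrain the network before you try running it on the new task, of course. If the task doesn't guarantee that the answer is inside the context, you may want to compare the network's output to a dictionary of words and their corresponding vectors instead (tasks 6-10 and 17-20 are of this kind). I recommend trying tasks 1 or 3, which you can do by changing the values of test_set_file and train_set_file.
Supervised training. Our attention mechanism is unsupervised, in that we don't explicitly specify which sentences should be attended to and instead let the network figure that out on its own. Try adding a term to the loss that encourages the attention mechanism to attend to the right sentences.
Coattention. Instead of attending only over the input sentence, some researchers have found success with what they call "dynamic coattention networks," which attend over a matrix representing two locations in two sequences simultaneously.
Alternate vectorization schemes and sources. Try making more intelligent mappings between sentences and vectors, or use a different data set. GloVe offers vectors trained on corpora of up to 840 billion tokens, with 300 dimensions each.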

This article is a collaboration between O'Reilly and TensorFlow. See our statement of editorial independence.

This article originally appeared in English: 'Question answering with TensorFlow'.

Steven Hewitt

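Steven Hewitt is currently a graduate student in computer science in the EECS department at the University of California, Berkeley. His research interests include AI, natural language processing, education, and robotics. His current research projects include teaching programs to understand patterns in code and present them in a way people can understand, as well as word vectorization methods and question answering systems. When he isn't in class or writing code, he composes music or creates fractal flame art.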