0. What this article covers

This article introduces:
1. Neurons, the basic building block of a neural network, and a common activation function used inside them: the sigmoid;
2. How the neurons in a neural network are connected and interact;
3. A dataset with height and weight as inputs (features) and gender as the output (label) -- in other words, a training set;
4. Loss functions and the mean squared error (MSE) loss; training a network is equivalent to minimizing the loss;
5. Computing partial derivatives with backpropagation;
6. Training the network with stochastic gradient descent (SGD).
Complete code:

import numpy as np

def sigmoid(x):
  # Sigmoid activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  # Derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
  fx = sigmoid(x)
  return fx * (1 - fx)

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)

  *** DISCLAIMER ***:
  The code below is intended to be simple and educational, NOT optimal.
  Real neural net code looks nothing like this. DO NOT use this code.
  Instead, read/run it to understand how this specific network works.
  '''
  def __init__(self):
    # Weights
    self.w1 = np.random.normal()
    self.w2 = np.random.normal()
    self.w3 = np.random.normal()
    self.w4 = np.random.normal()
    self.w5 = np.random.normal()
    self.w6 = np.random.normal()

    # Biases
    self.b1 = np.random.normal()
    self.b2 = np.random.normal()
    self.b3 = np.random.normal()

  def feedforward(self, x):
    # x is a numpy array with 2 elements.
    h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
    h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
    o1 = sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)
    return o1

  def train(self, data, all_y_trues):
    '''
    - data is a (n x 2) numpy array, n = # of samples in the dataset.
    - all_y_trues is a numpy array with n elements.
      Elements in all_y_trues correspond to those in data.
    '''
    learn_rate = 0.1
    epochs = 1000  # number of times to loop through the entire dataset

    for epoch in range(epochs):
      for x, y_true in zip(data, all_y_trues):
        # --- Do a feedforward (we'll need these values later)
        sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
        h1 = sigmoid(sum_h1)

        sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
        h2 = sigmoid(sum_h2)

        sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
        o1 = sigmoid(sum_o1)
        y_pred = o1

        # --- Calculate partial derivatives.
        # --- Naming: p_L_p_w1 stands for "partial L partial w1"
        p_L_p_ypred = -2 * (y_true - y_pred)

        # Neuron o1
        p_ypred_p_w5 = h1 * deriv_sigmoid(sum_o1)
        p_ypred_p_w6 = h2 * deriv_sigmoid(sum_o1)
        p_ypred_p_b3 = deriv_sigmoid(sum_o1)

        p_ypred_p_h1 = self.w5 * deriv_sigmoid(sum_o1)
        p_ypred_p_h2 = self.w6 * deriv_sigmoid(sum_o1)

        # Neuron h1
        p_h1_p_w1 = x[0] * deriv_sigmoid(sum_h1)
        p_h1_p_w2 = x[1] * deriv_sigmoid(sum_h1)
        p_h1_p_b1 = deriv_sigmoid(sum_h1)

        # Neuron h2
        p_h2_p_w3 = x[0] * deriv_sigmoid(sum_h2)
        p_h2_p_w4 = x[1] * deriv_sigmoid(sum_h2)
        p_h2_p_b2 = deriv_sigmoid(sum_h2)

        # --- Update weights and biases
        # Neuron h1
        self.w1 -= learn_rate * p_L_p_ypred * p_ypred_p_h1 * p_h1_p_w1
        self.w2 -= learn_rate * p_L_p_ypred * p_ypred_p_h1 * p_h1_p_w2
        self.b1 -= learn_rate * p_L_p_ypred * p_ypred_p_h1 * p_h1_p_b1

        # Neuron h2
        self.w3 -= learn_rate * p_L_p_ypred * p_ypred_p_h2 * p_h2_p_w3
        self.w4 -= learn_rate * p_L_p_ypred * p_ypred_p_h2 * p_h2_p_w4
        self.b2 -= learn_rate * p_L_p_ypred * p_ypred_p_h2 * p_h2_p_b2

        # Neuron o1
        self.w5 -= learn_rate * p_L_p_ypred * p_ypred_p_w5
        self.w6 -= learn_rate * p_L_p_ypred * p_ypred_p_w6
        self.b3 -= learn_rate * p_L_p_ypred * p_ypred_p_b3

      # --- Calculate total loss at the end of each epoch
      if epoch % 10 == 0:
        y_preds = np.apply_along_axis(self.feedforward, 1, data)
        loss = mse_loss(all_y_trues, y_preds)
        print("Epoch %d loss: %.3f" % (epoch, loss))

# Define dataset
data = np.array([
  [-2, -1],   # Alice
  [25, 6],    # Bob
  [17, 4],    # Charlie
  [-15, -6],  # Diana
])
all_y_trues = np.array([
  1,  # Alice
  0,  # Bob
  0,  # Charlie
  1,  # Diana
])

# Train our neural network!
network = OurNeuralNetwork()
network.train(data, all_y_trues)

# Make some predictions
emily = np.array([-7, -3])  # 128 pounds, 63 inches
frank = np.array([20, 2])   # 155 pounds, 68 inches
print("Emily: %.3f" % network.feedforward(emily))  # 0.951 - F
print("Frank: %.3f" % network.feedforward(frank))  # 0.039 - M

I. The basic building block -- the neuron

Before talking about neural networks, let's look at the neuron, the basic unit of a neural network. A neuron takes some inputs, does some math with them, and produces one output. Here is an example of a 2-input neuron:

1. How a single neuron works

Inside this neuron the inputs go through three math operations. Briefly: inputs --> multiply by the weights, sum the products, feed the sum into an activation function --> output.

(1) First, each input is multiplied by a weight w (the brown boxes in the figure):
x1 --> x1 × w1
x2 --> x2 × w2

(2) Next, the two results are added together, plus a bias b (the green box in the figure):
(x1 × w1) + (x2 × w2) + b

(3) Finally, the sum is passed through an activation function f (the yellow box in the figure) to get the output:
y = f(x1 × w1 + x2 × w2 + b)

The job of the activation function is to turn an unbounded input into an output with a predictable form.

2. A commonly used activation function: the sigmoid

The sigmoid function only outputs numbers between 0 and 1: you can think of it as compressing the range (-∞, +∞) into (0, 1). Larger positive inputs give outputs closer to 1; larger negative inputs give outputs closer to 0.

For example, suppose the neuron above has the following weights and bias:
w = [0, 1]
b = 4

w = [0, 1] is just the vector form of w1 = 0, w2 = 1. Given the input x = [2, 3], the neuron's output can be written as a dot product:
w · x + b = (w1 × x1) + (w2 × x2) + b = 0×2 + 1×3 + 4 = 7
y = f(w · x + b) = f(7) = 0.999

The Python code for this single-neuron computation:

import numpy as np  # NumPy, a powerful numerical library for Python

def sigmoid(x):
  # Our activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

class Neuron:
  def __init__(self, weights, bias):
    # A neuron holds two kinds of parameters: weights and a bias
    self.weights = weights
    self.bias = bias

  def feedforward(self, inputs):
    # Multiply the inputs by the weights, add the bias, then apply the activation function
    total = np.dot(self.weights, inputs) + self.bias
    return sigmoid(total)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)   # create and initialize a neuron

x = np.array([2, 3])        # inputs: x1 = 2, x2 = 3
print(n.feedforward(x))     # 0.9990889488055994

II. Neural networks

A neural network is nothing more than a bunch of neurons connected together. Here is a simple example of a neural network:

This network has 2 inputs (x1 and x2), a hidden layer with 2 neurons (h1 and h2), and an output layer with 1 neuron (o1).

A hidden layer is any layer between the input layer and the output layer; a neural network can have multiple hidden layers.

The process of passing the inputs forward through the neurons to get an output is called feedforward.

Suppose every neuron in the network above has the same weights w = [0, 1] and bias b = 0 (in reality, the weight on each connection would usually be different), and every activation function is the sigmoid. What output do we get for the input x = [2, 3]?

h1 = h2 = f(w · x + b) = f((0 × 2) + (1 × 3) + 0) = f(3) = 0.9526
o1 = f(w · [h1, h2] + b) = f((0 × h1) + (1 × h2) + 0) = f(0.9526) = 0.7216

import numpy as np

# ... code from the previous section (sigmoid and the Neuron class) goes here

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)
  Each neuron has the same weights and bias:
    - w = [0, 1]
    - b = 0
  '''
  def __init__(self):
    weights = np.array([0, 1])
    bias = 0

    # The Neuron class here is from the previous section
    self.h1 = Neuron(weights, bias)
    self.h2 = Neuron(weights, bias)
    self.o1 = Neuron(weights, bias)

  def feedforward(self, x):
    out_h1 = self.h1.feedforward(x)
    out_h2 = self.h2.feedforward(x)

    # The inputs for o1 are the outputs from h1 and h2
    out_o1 = self.o1.feedforward(np.array([out_h1, out_h2]))

    return out_o1

network = OurNeuralNetwork()
x = np.array([2, 3])
print(network.feedforward(x))  # 0.7216325609518421
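As the Emily and Frank comments in the listing suggest, the inputs are not raw measurements: each weight has 135 pounds subtracted and each height has 66 inches subtracted (128 - 135 = -7, 63 - 66 = -3, and so on). Below is a minimal sketch, not part of the original code, of how a new person's measurements could be shifted the same way before calling feedforward; the helper name shift_inputs is our own.

import numpy as np

def shift_inputs(weight_lb, height_in):
    # Hypothetical helper: apply the same fixed shift the dataset uses
    # (weight - 135 pounds, height - 66 inches).
    return np.array([weight_lb - 135.0, height_in - 66.0])

print(shift_inputs(128, 63))  # [-7. -3.]  -> Emily's input in the listing above
print(shift_inputs(155, 68))  # [20.  2.]  -> Frank's input in the listing above

# Usage with a trained network (assuming `network` from the listing above):
# print(network.feedforward(shift_inputs(128, 63)))  # ~0.951 -> predicted female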
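The same feedforward pass can also be written with matrix-vector products instead of individual Neuron objects. Here is a minimal sketch of that idea under the same assumptions (every neuron uses w = [0, 1] and b = 0); the names W_hidden, b_hidden, W_output, and b_output are our own and not from the article.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The same 2-2-1 network as above, written with matrix-vector products.
W_hidden = np.array([[0.0, 1.0],    # weights of h1
                     [0.0, 1.0]])   # weights of h2
b_hidden = np.array([0.0, 0.0])
W_output = np.array([0.0, 1.0])     # weights of o1
b_output = 0.0

x = np.array([2.0, 3.0])
h = sigmoid(W_hidden @ x + b_hidden)    # hidden layer outputs [h1, h2] = [0.9526, 0.9526]
o1 = sigmoid(W_output @ h + b_output)   # output neuron
print(o1)  # ~0.7216, matching the Neuron-based version above

Expressing a whole layer as one matrix multiply is how implementations typically handle layers with many neurons.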
III. Training a neural network

Now that we know how to build a neural network, let's learn how to train one. Training is really just an optimization problem.

Suppose we have a dataset with the weight, height, and gender of four people:

Name      Weight (lb)   Height (in)   Gender
Alice     133           65            F
Bob       160           72            M
Charlie   152           70            M
Diana     120           60            F

Our goal is to train a network that predicts someone's gender from their weight and height.

To keep things simple, we subtract a fixed amount from each weight and height (the shifted values used in the code correspond to weight - 135 and height - 66), and we encode gender as a number: female = 1, male = 0.

Before training the network we need a way to quantify how good it is, so that we can try to make it better. That measure is the loss. For example, we can define the loss as the mean squared error (MSE):

MSE = (1/n) × Σ (y_true - y_pred)²

where:
- n is the number of samples, which is 4 in the dataset above;
- y represents a person's gender: 1 for female, 0 for male;
- y_true is the true value of the variable and y_pred is the predicted value.

As the name suggests, the mean squared error is the average of the squared errors over all samples, and we take it as our loss function. Better predictions mean lower loss, so training a neural network means minimizing the loss.

If the network always output 0, i.e., it predicted everyone to be male, the loss would be:

MSE = (1/4) × ((1-0)² + (0-0)² + (0-0)² + (1-0)²) = 0.5

The code to compute the loss:

import numpy as np

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])

print(mse_loss(y_true, y_pred))  # 0.5

IV. Training a neural network (2) -- reducing the loss

This network is not good enough yet; we have to keep optimizing it to make the loss as small as possible. We know that changing the network's weights and biases changes its predictions, but how exactly should we change them?

For simplicity, let's shrink the dataset down to just Alice. The loss then reduces to Alice's squared error alone:

L = (1 - y_pred)²   (since y_true = 1 for Alice)

The prediction y_pred is computed from the network's weights and biases (y_pred is just o1, which depends on w1, ..., w6 and b1, b2, b3), so the loss is really a multivariable function of all the weights and biases:

L(w1, w2, w3, w4, w5, w6, b1, b2, b3)

(Heads up: the next part assumes some basic multivariable calculus, such as partial derivatives and the chain rule.)

1. Example: if we tweak w1, does the loss increase or decrease?

To answer this we need to know whether the partial derivative ?L/?w1 is positive or negative. By the chain rule:

?L/?w1 = (?L/?y_pred) × (?y_pred/?w1)

Since L = (1 - y_pred)², the first factor is:

?L/?y_pred = -2 × (1 - y_pred)

Next we need the relationship between y_pred and w1. We already know how the neurons h1, h2, and o1 compute their outputs:

h1 = f(w1·x1 + w2·x2 + b1)
h2 = f(w3·x1 + w4·x2 + b2)
y_pred = o1 = f(w5·h1 + w6·h2 + b3)

Only neuron h1 involves the weight w1, so we apply the chain rule again:

?y_pred/?w1 = (?y_pred/?h1) × (?h1/?w1)
?y_pred/?h1 = w5 × f'(w5·h1 + w6·h2 + b3)

and then compute ?h1/?w1:

?h1/?w1 = x1 × f'(w1·x1 + w2·x2 + b1)

We used the derivative of the sigmoid, f'(x), twice above. It is easy to derive:

f(x) = 1 / (1 + e^(-x))
f'(x) = f(x) × (1 - f(x))

Putting it all together, the full chain-rule expression is:

?L/?w1 = (?L/?y_pred) × (?y_pred/?h1) × (?h1/?w1)

This system of computing partial derivatives by working backwards is known as backpropagation.

That is a lot of symbols, so let's plug in actual numbers. Take Alice's data x = [-2, -1] with y_true = 1, and suppose all the weights are 1 and all the biases are 0. Then h1, h2, and o1 are:

h1 = f(w1·x1 + w2·x2 + b1) = f(-2 - 1 + 0) = f(-3) = 0.0474
h2 = f(w3·x1 + w4·x2 + b2) = f(-3) = 0.0474
o1 = f(w5·h1 + w6·h2 + b3) = f(0.0474 + 0.0474 + 0) = f(0.0948) = 0.524

The network outputs y_pred = 0.524, which doesn't strongly favor either female (1) or male (0). The prediction is still poor.

Now let's compute the partial derivative ?L/?w1 for the current network:

?L/?y_pred = -2 × (1 - y_pred) = -2 × (1 - 0.524) = -0.952
?y_pred/?h1 = w5 × f'(0.0948) = 1 × 0.524 × (1 - 0.524) = 0.249
?h1/?w1 = x1 × f'(-3) = -2 × 0.0474 × (1 - 0.0474) = -0.0904
?L/?w1 = (-0.952) × 0.249 × (-0.0904) ≈ 0.0214

This tells us that if we increase w1, the loss L increases by a tiny amount.
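As a sanity check on the chain-rule result, the hand-derived ?L/?w1 can be compared against a numerical estimate obtained by nudging w1 and re-running the forward pass. This is a minimal sketch under the same assumptions as the worked example (all weights 1, all biases 0, Alice's input [-2, -1], y_true = 1); it is our own check, not part of the original article.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Worked example: Alice's input, y_true = 1, all weights = 1, all biases = 0.
x, y_true = np.array([-2.0, -1.0]), 1.0

def loss(w1):
    # Forward pass with only w1 varied; the other weights stay at 1 and the biases at 0.
    h1 = sigmoid(w1 * x[0] + 1.0 * x[1] + 0.0)
    h2 = sigmoid(1.0 * x[0] + 1.0 * x[1] + 0.0)
    o1 = sigmoid(1.0 * h1 + 1.0 * h2 + 0.0)
    return (y_true - o1) ** 2

# Analytic gradient from the chain rule derived above.
sum_h1 = 1.0 * x[0] + 1.0 * x[1]            # = -3
sum_o1 = sigmoid(sum_h1) + sigmoid(sum_h1)  # = 0.0948
y_pred = sigmoid(sum_o1)                    # = 0.524
d_L_d_ypred = -2 * (y_true - y_pred)
d_ypred_d_h1 = 1.0 * deriv_sigmoid(sum_o1)
d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
analytic = d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1

# Finite-difference estimate around w1 = 1.
eps = 1e-5
numeric = (loss(1.0 + eps) - loss(1.0 - eps)) / (2 * eps)

print(analytic, numeric)  # both approximately 0.0214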
2. Stochastic gradient descent

We will now train the network with an optimization algorithm called stochastic gradient descent (SGD). The calculations above give us everything we need; but how do we actually use them? SGD defines how to change each weight and bias, for example:

w1 <- w1 - η × ?L/?w1

η is a constant called the learning rate, which controls how fast we train. Subtracting η × ?L/?w1 from w1 gives the new value of w1.

When ?L/?w1 is positive, w1 decreases; when ?L/?w1 is negative, w1 increases.

If we update every weight and bias in the network this way, the loss slowly decreases and the network improves.

The training procedure is:

(1) Choose one sample from the dataset;
(2) Compute the partial derivative of the loss with respect to every weight and bias;
(3) Update each weight and bias with the update rule above;
(4) Go back to step (1).

We implement this process with the following Python code:

import numpy as np

def sigmoid(x):
  # Sigmoid activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  # Derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
  fx = sigmoid(x)
  return fx * (1 - fx)

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)

  *** DISCLAIMER ***:
  The code below is intended to be simple and educational, NOT optimal.
  Real neural net code looks nothing like this. DO NOT use this code.
  Instead, read/run it to understand how this specific network works.
  '''
  def __init__(self):
    # Weights
    self.w1 = np.random.normal()
    self.w2 = np.random.normal()
    self.w3 = np.random.normal()
    self.w4 = np.random.normal()
    self.w5 = np.random.normal()
    self.w6 = np.random.normal()

    # Biases
    self.b1 = np.random.normal()
    self.b2 = np.random.normal()
    self.b3 = np.random.normal()

  def feedforward(self, x):
    # x is a numpy array with 2 elements.
    h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
    h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
    o1 = sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)
    return o1

  def train(self, data, all_y_trues):
    '''
    - data is a (n x 2) numpy array, n = # of samples in the dataset.
    - all_y_trues is a numpy array with n elements.
      Elements in all_y_trues correspond to those in data.
    '''
    learn_rate = 0.1
    epochs = 1000  # number of times to loop through the entire dataset

    for epoch in range(epochs):
      for x, y_true in zip(data, all_y_trues):
        # --- Do a feedforward (we'll need these values later)
        sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
        h1 = sigmoid(sum_h1)

        sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
        h2 = sigmoid(sum_h2)

        sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
        o1 = sigmoid(sum_o1)
        y_pred = o1

        # --- Calculate partial derivatives.
        # --- Naming: d_L_d_w1 represents "partial L / partial w1"
        d_L_d_ypred = -2 * (y_true - y_pred)

        # Neuron o1
        d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
        d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
        d_ypred_d_b3 = deriv_sigmoid(sum_o1)

        d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
        d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

        # Neuron h1
        d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
        d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
        d_h1_d_b1 = deriv_sigmoid(sum_h1)

        # Neuron h2
        d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
        d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
        d_h2_d_b2 = deriv_sigmoid(sum_h2)

        # --- Update weights and biases
        # Neuron h1
        self.w1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
        self.w2 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
        self.b1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1

        # Neuron h2
        self.w3 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
        self.w4 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
        self.b2 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2

        # Neuron o1
        self.w5 -= learn_rate * d_L_d_ypred * d_ypred_d_w5
        self.w6 -= learn_rate * d_L_d_ypred * d_ypred_d_w6
        self.b3 -= learn_rate * d_L_d_ypred * d_ypred_d_b3

      # --- Calculate total loss at the end of each epoch
      if epoch % 10 == 0:
        y_preds = np.apply_along_axis(self.feedforward, 1, data)
        loss = mse_loss(all_y_trues, y_preds)
        print("Epoch %d loss: %.3f" % (epoch, loss))

# Define dataset
data = np.array([
  [-2, -1],   # Alice
  [25, 6],    # Bob
  [17, 4],    # Charlie
  [-15, -6],  # Diana
])
all_y_trues = np.array([
  1,  # Alice
  0,  # Bob
  0,  # Charlie
  1,  # Diana
])

# Train our neural network!
network = OurNeuralNetwork()
network.train(data, all_y_trues)

As training progresses, the loss steadily decreases.

Now we can use the trained network to predict each person's gender:

# Make some predictions
emily = np.array([-7, -3])  # 128 pounds, 63 inches
frank = np.array([20, 2])   # 155 pounds, 68 inches
print("Emily: %.3f" % network.feedforward(emily))  # 0.951 - F
print("Frank: %.3f" % network.feedforward(frank))  # 0.039 - M

More

This tutorial is only the first step of a long journey. There is much more to learn:

1. Build neural networks with bigger, better machine learning libraries such as TensorFlow, Keras, and PyTorch (see the sketch after this list);
2. Play with neural networks in your browser: https://playground./
3. Learn about activation functions other than the sigmoid: https:///activations/
4. Learn about optimizers other than SGD: https:///optimizers/
5. Learn about convolutional neural networks (CNNs);
6. Learn about recurrent neural networks (RNNs).
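As an illustration of the first item above, here is a minimal sketch, our own and not from the article, of roughly what the same 2-2-1 network could look like in Keras. The hyperparameters mirror the hand-written loop, but Keras updates on batches rather than one sample at a time, so the training behavior is not guaranteed to match exactly.

import numpy as np
from tensorflow import keras

# Same dataset as above (weight - 135 lb, height - 66 in; female = 1, male = 0).
data = np.array([[-2, -1], [25, 6], [17, 4], [-15, -6]], dtype=float)
all_y_trues = np.array([1, 0, 0, 1], dtype=float)

# A 2-input network with one 2-neuron sigmoid hidden layer and one sigmoid output neuron.
model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(2, activation="sigmoid"),   # hidden layer (h1, h2)
    keras.layers.Dense(1, activation="sigmoid"),   # output neuron (o1)
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
model.fit(data, all_y_trues, epochs=1000, verbose=0)

# Predict for Emily (-7, -3) and Frank (20, 2), as in the hand-written version.
print(model.predict(np.array([[-7, -3], [20, 2]]), verbose=0))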
These are all topics Victor has set up for himself to cover. He says he "may" write about them in the future; hopefully he gets around to all of them. If you want to get started with neural networks, consider subscribing to his blog.

About the author

Victor Zhou is a 2019 computer science graduate of Princeton who has accepted a software engineering offer from Facebook and starts this August. He has built a JS compiler, two browser games, and a hate-speech detection library.

His blog: https:///

References:
https://www.toutiao.com/a6668491982555316750/?tt_from=weixin&utm_campaign=client_share&wxshare_count=1&timestamp=1553137961&app=news_article&utm_source=weixin&utm_medium=toutiao_ios&group_id=6668491982555316750
https:///blog/intro-to-neural-networks/