When using a k-NN model, we need to compute the Euclidean distance from every point in the test set to every point in the training set; in other words, we need the matrix of pairwise Euclidean distances between two matrices. There are typically three ways to implement this in the k-NN algorithm: with two nested loops, with one loop, or with no loops at all.
Using two loops

Iterate over the test set and the training set separately, compute the Euclidean distance between each pair of points, and store it in the dists matrix. This version has no optimization at all.
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training
    point in self.X_train using a nested loop over both sets.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # l2 distance between the ith test point and the jth training
            # point, without looping over the feature dimension.
            dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
    return dists
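As a quick standalone check of the two-loop computation (the array names below are illustrative, not part of the class above):

```python
import numpy as np

# Tiny illustrative arrays: two training points and one test point in 2-D.
X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
X_test = np.array([[0.0, 0.0]])

dists = np.zeros((X_test.shape[0], X_train.shape[0]))
for i in range(X_test.shape[0]):
    for j in range(X_train.shape[0]):
        dists[i, j] = np.sqrt(np.sum(np.square(X_test[i] - X_train[j])))

print(dists)  # [[0. 5.]] -- the 3-4-5 right triangle
```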
Using one loop

By representing the training set as a matrix and computing the distance from each test point to that entire matrix at once, the algorithm can be reduced to a single loop.
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        # Broadcast X[i] against every training row, then reduce over the
        # feature axis to fill the ith row of dists.
        dists[i] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis=1))
    return dists
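The key step is that NumPy broadcasts the 1-D test point X[i] against every row of the 2-D training matrix. A minimal sketch with made-up arrays:

```python
import numpy as np

X_train = np.array([[1.0, 2.0], [4.0, 6.0]])  # N=2 training points, D=2
x = np.array([1.0, 2.0])                      # a single test point, shape (D,)

diff = X_train - x   # x is broadcast across all N rows -> shape (2, 2)
row = np.sqrt(np.sum(np.square(diff), axis=1))  # one row of the dists matrix

print(row)  # [0. 5.]
```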
Using no loops

The most efficient version represents both the training set and the test set as matrices and replaces the earlier loops with matrix operations. This requires being comfortable with the rules of matrix arithmetic, so the following focuses on how to compute the Euclidean distances between two matrices.
Let the test matrix P have size M×D and the training matrix C have size N×D (the test set contains M points and the training set N points, each a D-dimensional feature vector).
Let Pi denote the ith row of P and Cj the jth row of C:
$$
P_i = \begin{pmatrix} P_{i1} & P_{i2} & \cdots & P_{iD} \end{pmatrix}, \qquad
C_j = \begin{pmatrix} C_{j1} & C_{j2} & \cdots & C_{jD} \end{pmatrix}
$$
First compute the distance dist(i, j) between Pi and Cj:
$$
\begin{aligned}
d(P_i, C_j) &= \sqrt{(P_{i1}-C_{j1})^2 + (P_{i2}-C_{j2})^2 + \cdots + (P_{iD}-C_{jD})^2} \\
&= \sqrt{\left(P_{i1}^2 + P_{i2}^2 + \cdots + P_{iD}^2\right) + \left(C_{j1}^2 + C_{j2}^2 + \cdots + C_{jD}^2\right) - 2\left(P_{i1}C_{j1} + P_{i2}C_{j2} + \cdots + P_{iD}C_{jD}\right)} \\
&= \sqrt{\|P_i\|^2 + \|C_j\|^2 - 2\,P_i C_j^{\mathsf T}}
\end{aligned}
$$
This generalizes to a formula for the ith row of the distance matrix (with the square root applied element-wise):
$$
\begin{aligned}
\mathrm{dist}[i] &= \sqrt{\begin{pmatrix}\|P_i\|^2 & \|P_i\|^2 & \cdots & \|P_i\|^2\end{pmatrix}
+ \begin{pmatrix}\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2\end{pmatrix}
- 2\,P_i \begin{pmatrix}C_1^{\mathsf T} & C_2^{\mathsf T} & \cdots & C_N^{\mathsf T}\end{pmatrix}} \\
&= \sqrt{\begin{pmatrix}\|P_i\|^2 & \|P_i\|^2 & \cdots & \|P_i\|^2\end{pmatrix}
+ \begin{pmatrix}\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2\end{pmatrix}
- 2\,P_i C^{\mathsf T}}
\end{aligned}
$$
Extending the formula once more gives the entire distance matrix:
$$
\mathrm{dist} = \sqrt{
\begin{pmatrix}
\|P_1\|^2 & \|P_1\|^2 & \cdots & \|P_1\|^2 \\
\|P_2\|^2 & \|P_2\|^2 & \cdots & \|P_2\|^2 \\
\vdots & \vdots & \ddots & \vdots \\
\|P_M\|^2 & \|P_M\|^2 & \cdots & \|P_M\|^2
\end{pmatrix}
+
\begin{pmatrix}
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \\
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \\
\vdots & \vdots & \ddots & \vdots \\
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2
\end{pmatrix}
- 2\,P C^{\mathsf T}}
$$
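The expansion above can be sanity-checked numerically. The snippet below (using arbitrary random matrices, not data from the assignment) compares the direct pairwise squared distances against the expanded form:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 4))  # M=3 test points, D=4
C = rng.normal(size=(5, 4))  # N=5 training points, D=4

# Direct pairwise squared distances, one pair at a time.
direct = np.array([[np.sum((p - c) ** 2) for c in C] for p in P])

# The expanded form: ||P_i||^2 + ||C_j||^2 - 2 * P C^T, via broadcasting.
expanded = (np.sum(P ** 2, axis=1)[:, np.newaxis]
            + np.sum(C ** 2, axis=1)
            - 2 * P @ C.T)

print(np.allclose(direct, expanded))  # True
```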
Expressed as Python code:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    # ||P_i - C_j||^2 = ||P_i||^2 + ||C_j||^2 - 2 * P_i . C_j, computed for
    # the whole matrix with one matrix product and two broadcast sums.
    dists = np.sqrt(-2 * np.dot(X, self.X_train.T)
                    + np.sum(np.square(self.X_train), axis=1)
                    + np.sum(np.square(X), axis=1)[:, np.newaxis])
    return dists
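One caveat worth noting (not raised in the assignment itself): with this expansion, floating-point cancellation can occasionally leave tiny negative values inside the square root. A sketch of a defensive variant that clamps at zero, checked against a fully broadcast reference (array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(6, 3))  # N=6 training points, D=3
X = rng.normal(size=(4, 3))        # M=4 test points, D=3

sq = (-2 * X @ X_train.T
      + np.sum(np.square(X_train), axis=1)
      + np.sum(np.square(X), axis=1)[:, np.newaxis])
# Clamp tiny negatives produced by floating-point cancellation before sqrt.
dists = np.sqrt(np.maximum(sq, 0.0))

# Reference: expand all pairwise differences via an (M, N, D) broadcast.
ref = np.sqrt(np.sum((X[:, np.newaxis, :] - X_train[np.newaxis, :, :]) ** 2,
                     axis=2))

print(np.allclose(dists, ref))  # True
```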