When using a k-NN model, we need to compute the Euclidean distance from every point in the test set to every point in the training set; in other words, we need the matrix of pairwise Euclidean distances between two matrices. There are typically three ways to implement this in the k-NN algorithm: with two nested loops, with one loop, or with no loops at all.
Using two loops

Iterate over the test set and the training set separately, compute the Euclidean distance between each pair of points, and store it in the dists matrix. This version has no optimization at all.
def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training
    point in self.X_train using a nested loop over both sets.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        for j in range(num_train):
            # l2 distance between the ith test point and the jth training
            # point, without looping over the feature dimension.
            dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
    return dists
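As a quick standalone check of the two-loop computation (the array names below are illustrative, not part of the class above):

```python
import numpy as np

# Tiny illustrative arrays: two training points and one test point in 2-D.
X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
X_test = np.array([[0.0, 0.0]])

dists = np.zeros((X_test.shape[0], X_train.shape[0]))
for i in range(X_test.shape[0]):
    for j in range(X_train.shape[0]):
        dists[i, j] = np.sqrt(np.sum(np.square(X_test[i] - X_train[j])))

print(dists)  # [[0. 5.]] -- the 3-4-5 right triangle
```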
Using one loop

By representing the training set as a matrix and computing the distance from each test point to that entire matrix at once, the algorithm can be reduced to a single loop.
def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in range(num_test):
        # Broadcast X[i] against every training row, then reduce over the
        # feature axis to fill the ith row of dists.
        dists[i] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis=1))
    return dists
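The key step is that NumPy broadcasts the 1-D test point X[i] against every row of the 2-D training matrix. A minimal sketch with made-up arrays:

```python
import numpy as np

X_train = np.array([[1.0, 2.0], [4.0, 6.0]])  # N=2 training points, D=2
x = np.array([1.0, 2.0])                      # a single test point, shape (D,)

diff = X_train - x   # x is broadcast across all N rows -> shape (2, 2)
row = np.sqrt(np.sum(np.square(diff), axis=1))  # one row of the dists matrix

print(row)  # [0. 5.]
```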
Using no loops

The most efficient version represents both the training set and the test set as matrices and replaces the earlier loops with matrix operations. This requires being comfortable with the rules of matrix arithmetic, so the following focuses on how to compute the Euclidean distances between two matrices.
Let the test matrix P have size M×D and the training matrix C have size N×D (the test set contains M points and the training set N points, each a D-dimensional feature vector).
Let Pi denote the ith row of P and Cj the jth row of C:
$$
P_i = \begin{pmatrix} P_{i1} & P_{i2} & \cdots & P_{iD} \end{pmatrix}, \qquad
C_j = \begin{pmatrix} C_{j1} & C_{j2} & \cdots & C_{jD} \end{pmatrix}
$$
First compute the distance dist(i, j) between Pi and Cj:
$$
\begin{aligned}
d(P_i, C_j) &= \sqrt{(P_{i1}-C_{j1})^2 + (P_{i2}-C_{j2})^2 + \cdots + (P_{iD}-C_{jD})^2} \\
&= \sqrt{\left(P_{i1}^2 + P_{i2}^2 + \cdots + P_{iD}^2\right) + \left(C_{j1}^2 + C_{j2}^2 + \cdots + C_{jD}^2\right) - 2\left(P_{i1}C_{j1} + P_{i2}C_{j2} + \cdots + P_{iD}C_{jD}\right)} \\
&= \sqrt{\|P_i\|^2 + \|C_j\|^2 - 2\,P_i C_j^{\mathsf T}}
\end{aligned}
$$
This generalizes to a formula for the ith row of the distance matrix (with the square root applied element-wise):
$$
\begin{aligned}
\mathrm{dist}[i] &= \sqrt{\begin{pmatrix}\|P_i\|^2 & \|P_i\|^2 & \cdots & \|P_i\|^2\end{pmatrix}
+ \begin{pmatrix}\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2\end{pmatrix}
- 2\,P_i \begin{pmatrix}C_1^{\mathsf T} & C_2^{\mathsf T} & \cdots & C_N^{\mathsf T}\end{pmatrix}} \\
&= \sqrt{\begin{pmatrix}\|P_i\|^2 & \|P_i\|^2 & \cdots & \|P_i\|^2\end{pmatrix}
+ \begin{pmatrix}\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2\end{pmatrix}
- 2\,P_i C^{\mathsf T}}
\end{aligned}
$$
Extending the formula once more gives the entire distance matrix:
$$
\mathrm{dist} = \sqrt{
\begin{pmatrix}
\|P_1\|^2 & \|P_1\|^2 & \cdots & \|P_1\|^2 \\
\|P_2\|^2 & \|P_2\|^2 & \cdots & \|P_2\|^2 \\
\vdots & \vdots & \ddots & \vdots \\
\|P_M\|^2 & \|P_M\|^2 & \cdots & \|P_M\|^2
\end{pmatrix}
+
\begin{pmatrix}
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \\
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2 \\
\vdots & \vdots & \ddots & \vdots \\
\|C_1\|^2 & \|C_2\|^2 & \cdots & \|C_N\|^2
\end{pmatrix}
- 2\,P C^{\mathsf T}}
$$
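The expansion above can be sanity-checked numerically. The snippet below (using arbitrary random matrices, not data from the assignment) compares the direct pairwise squared distances against the expanded form:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.normal(size=(3, 4))  # M=3 test points, D=4
C = rng.normal(size=(5, 4))  # N=5 training points, D=4

# Direct pairwise squared distances, one pair at a time.
direct = np.array([[np.sum((p - c) ** 2) for c in C] for p in P])

# The expanded form: ||P_i||^2 + ||C_j||^2 - 2 * P C^T, via broadcasting.
expanded = (np.sum(P ** 2, axis=1)[:, np.newaxis]
            + np.sum(C ** 2, axis=1)
            - 2 * P @ C.T)

print(np.allclose(direct, expanded))  # True
```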
Expressed as Python code:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.
    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    # ||P_i - C_j||^2 = ||P_i||^2 + ||C_j||^2 - 2 * P_i . C_j, computed for
    # the whole matrix with one matrix product and two broadcast sums.
    dists = np.sqrt(-2 * np.dot(X, self.X_train.T)
                    + np.sum(np.square(self.X_train), axis=1)
                    + np.sum(np.square(X), axis=1)[:, np.newaxis])
    return dists
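One caveat worth noting (not raised in the assignment itself): with this expansion, floating-point cancellation can occasionally leave tiny negative values inside the square root. A sketch of a defensive variant that clamps at zero, checked against a fully broadcast reference (array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(6, 3))  # N=6 training points, D=3
X = rng.normal(size=(4, 3))        # M=4 test points, D=3

sq = (-2 * X @ X_train.T
      + np.sum(np.square(X_train), axis=1)
      + np.sum(np.square(X), axis=1)[:, np.newaxis])
# Clamp tiny negatives produced by floating-point cancellation before sqrt.
dists = np.sqrt(np.maximum(sq, 0.0))

# Reference: expand all pairwise differences via an (M, N, D) broadcast.
ref = np.sqrt(np.sum((X[:, np.newaxis, :] - X_train[np.newaxis, :, :]) ** 2,
                     axis=2))

print(np.allclose(dists, ref))  # True
```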