Matrix Vectorization for Calculating Distance (without a loop!)

Stanford 강의 CS231N: Convolutional Neural Networks for Visual Recognition의 첫 assignment인 “k nearest neighbor algorithm 구현하기”를 하며 Euclidean distance를 3가지 방법으로 구현해 보았는데, 그 과정을 정리했습니다. 이 assignment는 numpy에서 matrix manipulation을 자유자재로 할 수 있어야 감을 잡을 수 있습니다. Broadcasting, trasnpose, adding new axis와 같은 numpy의 기능들을 사용하여 assignment를 풀었습니다.

Euclidean Distance (L2-distance) 는 두 벡터 간의 거리를 구하는 방법 중 하나입니다. 얼마나 두 벡터가 비슷한지 다른지를 측정할 때 사용됩니다. K nearest neighbor 알고리즘을 구현할 때, 트레이닝 데이터인 X_train matrix $R^{N_{train} \times D}$ 과 테스트 데이터인 X matrix $R^{N_{test} \times D}$ 사이의 L2 distance를 구해야 합니다. (여기서 D는 이미지의 dimension입니다). Assignment에서는 train 데이터는 5000개, test 데이터는 500개가 주어졌습니다. 이 두 matrices 사이의 distance를 보여주는 distance matrix $R^{N_{test} \times N_{train}}$ 를 구하는 것이 목표입니다. 이 distance matrix는 test matrix안에 있는 각각의 벡터를 train matrix에 있는 각각의 벡터와의 euclidean distance를 계산한 값이 들어가 있습니다. 어떻게 거리를 효율적으로 구할 수 있을지 제일 비효율적인 방법에서 제일 효율적인 방법 순으로 알아 봅니다.

1. Two Loops

가장 비효율적인 방법이지만 처음 거리를 계산한다고 생각했을 때, 가장 먼저 떠오르는 방법입니다. Training data에 있는 각 벡터와 test data에 있는 모든 벡터와의 거리를 구해야 하기 때문에 nested for loop으로 거리를 계산합니다. 코드는 아래와 같습니다.

1	import numpy as np

def compute_distances_two_loops(self, X):
    """
    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
        for j in xrange(num_train):
            dists[i, j] = np.sqrt(np.sum((X[i] - self.X_train[j])**2))
    return dists

2. Single Loop

아래에서 시간을 확인해 보겠지만, nested for loop으로 거리를 계산하는데에는 시간이 상당히 오래 걸립니다. 그렇기에 이번에는 하나의 for loop만 사용하여 거리를 계산해 봅니다. 이번에는 test data에만 for loop을 사용합니다. 여기서 single loop을 쓰면서 달라진 것은 numpy의 broadcasting을 이용했다는 점입니다. 다른 모양을 가진 array/tensor들의 연산을 표현할 때 Numpy broadcasting 이라고 합니다. 간단하게 말하면, “작은 array가 큰 array에게 ‘broadcast’된다” (물론 constraints는 존재합니다)라고 할 수 있습니다. 2개의 broadcasting rule이 존재하는데 첫번째로는, 두 array/tensor가 있을 때,각 차원이 같거나, 두번째로는, 둘 중 하나의 차원이 1일 때 broadcasting이 가능합니다. 예를 들면 아래와 같습니다.

a라는 4D array는 $3 \times 1 \times 6 \times 1$ 의 shape을 가지고 있습니다.

1 2	a = np.random.randn(3,1,6,1) a.shape

(3, 1, 6, 1)

b라는 3D array는 $2 \times 1 \times 5$ 의 shape을 가지고 있습니다.

1 2	b = np.random.randn(2,1,5) b.shape

(2, 1, 5)

두 array를 더했을 때, $3 \times 2 \times 6 \times 5$ 의 shape을 가지게 됩니다.

1	(a + b).shape

(3, 2, 6, 5)

위의 두 array를 더할 때, broadcasting이 적용이 되는데, 아래와 같이 dimension을 맞추어 보면, broadcasting rule에서 언급했던 것 처럼, dimension이 1이면 더할 때, 큰 array의 dimension을 따라간다는 것을 확인할 수 있습니다.

array	dimension	shape
a	4D	3	1	6	1
b	3D		2	1	5
a+b	4D	3	2	6	5

즉, numpy의 broadcasting을 사용하면 거리를 구할 때, row-wise subtraction이 가능하게 됩니다. 즉 X_train 의 dimension인 (num_train $\times D$)에서 각 테스트의 dimension인 ($1 \times D$)를 각 training data row에서 subtraction을 하면서 거리를 계산합니다.

def compute_distances_one_loop(self, X):
  """
  Compute the distance between each test point in X and each training point
  in self.X_train using a single loop over the test data.

  Input / Output: Same as compute_distances_two_loops
  """
  num_test = X.shape[0]
  num_train = self.X_train.shape[0]
  dists = np.zeros((num_test, num_train))
  for i in xrange(num_test):
      dists[i, :] = np.sqrt(np.sum((self.X_train - X[i, :])**2, axis=1))
  return dists

Without Loop

한 단계 더 나아가서, 이제는 loop없이 거리를 계산해 봅니다. 어떻게 계산을 해야하나 고민을 했었는데 힌트는 Wikipedia Euclidean distance page에서 얻었습니다. 아래와 같이 두 벡터의 거리를 연산할 때 아래와 같이 계산을 하므로,

$|| p - q || = \sqrt{|| p ||^2 + || q ||^2 - 2 \cdot p \cdot q}$

numpy broadcasting을 사용하면 한번에 모든 벡터와의 거리를 계산할 수 있습니다. 첫 두 term은 X_train과 X_test의 row의 L2 norm을 구하는 것이므로 쉽고 마지막 term인 $2 \cdot p \cdot q$ 는 vector를 dot product시킬 때처럼, X_train을 transpose시켜서 곱하면 됩니다. 하지만 어려운 점은 이 세개의 dimension이 맞지 않은 matrices를 더해야 합니다.

def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0] 
    dists = np.sqrt(np.sum(self.X_train**2, axis=1) + np.sum(X**2, axis=1)[:, np.newaxis] -2 * np.dot(X, self.X_train.T))
    
    return dists

위의 코드를 보면 np.sum(self.X_train**2, axis=1)의 shape는 (5000,) 이고, np.sum(X**2, axis=1)은 (500,)이며 두 matrix간의 dot product를 한 np.dot(X, self.X_train.T)의 shape은 (500,5000) 입니다. 이렇게 세 matrices를 더하려고 하면 오류가 나는데 그 이유는, np.sum(X**2, axis=1)의 dimension이 맞지 않기 때문입니다. 500 은 distance matrix의 column의 dimension이기 때문에 row vector를 column vector로 만들어야 하는데, 이 때 새 dimension을 만들어 주는 [:, np.newaxis]를 사용하여 (500,1)의 차원으로 만들어 주었습니다.

Compare 3 methods of calculating Euclidean distance

위에 3가지의 L2 distance로 training과 prediction을 하는데 걸리는 시간을 확인합니다 (time function은 CS231N assignment에서 구현해져 있는 함수로 아래에는 넣지 않았습니다).

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

Two loop version took 29.039080 seconds
One loop version took 20.465272 seconds
No loop version took 0.139282 seconds

Nested loop으로 distance를 구하는 데에는 29초가 걸렸고, 1개의 loop은 20초가 걸리면서 약 9초가 줄어들었음을 확인할 수 있습니다. 엄청난 차이는 아닙니다. 하지만 loop없이 broadcasting을 사용하여 구한 distance는 약 0.13초 밖에 걸리지 않습니다. numpy를 잘 활용만 하면 약 계산을 하는데 150배의 시간을 절약할 수 있습니다.