3-sum nzp algorithm

We introduce 3-sum nzp, an algorithm that solves 3-sum in quadratic time but roughly 3x faster than the fastest algorithm presently found in the literature (as of September 2012). This post can also serve as a tutorial on how algorithm complexity affects performance.

You can download all of the code discussed in this post here; it includes every algorithm and the running-time tests (Python 2.7, 8 KB)

Introduction

I first learned about this problem recently, while auditing Coursera’s class on algorithms by Sedgewick and Wayne (Sep. 2012). Sedgewick and Wayne use different versions of 3-sum to illustrate that the complexity of an algorithm is the most important factor to take into account when attacking a problem (versus the very common practice of using a slow algorithm and simply throwing money at it in the form of expensive hardware).

The 3-sum problem is so basic that I kept thinking there had to be a better way to solve it than the presently accepted ‘best solution’ and, indeed, there was: the 3-sum-nzp algorithm presented here, about 3-4x faster than the current ‘fastest’ solution. On Sep. 9, 2012, I posted a subset of the timings presented here in the class’s forum.

We now go over the problem definition and over pretty much every solution of the 3-sum problem, starting with the brute force approach and ending with the new nzp algorithm. Finally, we present timing results for all the routines over arrays of various sizes, along with links to related online references.

The 3-sum problem

The 3-sum problem is defined as follows:

“Given $%N$% distinct integers, what triples $%(A,B,C)$% are such that

$%A+B+C=0$%?”

Some of the variations of this problem are:

  • Is there at least one triple $%(A,B,C)$% such that $%A+B+C=0$%?
  • Is there a pair $%(A,B)$% that adds up to a third integer $%C$%, i.e., $%A+B=C$%?

In addition, there are problems that are 3-sum-hard, i.e., problems to which 3-sum can be reduced in linear time, so that an algorithm solving one of them can be adapted to solve 3-sum. One such problem is 3-collinear, which is defined as:

“Given $%N$% distinct points on a plane, are there 3 points that lie on the same line?”
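
To make the connection concrete, the classic reduction maps each integer $%a$% to the point $%(a, a^3)$%: three such points are collinear exactly when the three integers add up to zero. The following snippet, with a collinear() helper of our own (not part of the downloadable code), illustrates this:

def collinear(p, q, r):
  """ True when the 3 points lie on the same line (cross-product test) """
  (x1, y1), (x2, y2), (x3, y3) = p, q, r
  return (x2 - x1) * (y3 - y1) == (x3 - x1) * (y2 - y1)

print collinear( (1, 1), (2, 8), (-3, -27) )  # True:  1 + 2 + (-3) == 0
print collinear( (1, 1), (2, 8), (4, 64) )    # False: 1 + 2 + 4 != 0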

There are a number of algorithms that solve 3-sum in $%O( N^2\cdot \log N)$% or in $%O( N^2)$%. Whether there is an algorithm that can solve 3-sum in sub-quadratic time is an open problem; nobody has found one for the general case (no parallelism, no specific statistical distribution of the data) and it is widely believed that there is none.

Update (7/2015): In 2014, the original 3SUM conjecture was refuted by Allan Grønlund and Seth Pettie, who showed a sub-quadratic algorithm that solves it (Wikipedia).

In this post we discuss several 3-sum algorithms and introduce a new one, 3-sum-nzp, that, although also quadratic, has a small enough constant that it performs about 3.0-3.5x faster than the fastest known 3-sum algorithm. This is nothing to sneeze at: short of finding a sub-quadratic algorithm, which would be a life achievement, the only thing we can do is improve on the constant of the running time. An $%\sim N/2$% algorithm will not let us solve orders-of-magnitude larger problems than an $%\sim N$% one would, but it is definitely the better choice.

Brute force 3-sum $%- O(N^3)$%

This is the naive implementation of 3-sum; it simply loops over the array 3 times. Its order is cubic so it quickly loses its usefulness for anything other than benchmarking.

def sum3_brute_force(a):
  res = []
  N = len(a)

  # Python ranges exclude the last term, i.e., range(0,3) = [0,1,2]
  for i in range(0, N-2):           # for i = 0 to N-3
    for j in range(i+1, N-1):       # for j = i+1 to N-2
      for k in range(j+1, N):       # for k = j+1 to N-1
        if a[i] + a[j] + a[k] == 0:
          res.append( [a[i], a[j], a[k]] )
  return res


3-sum with binary search $%- O(N^2 \cdot \log N)$% average

This is a slight variation of the brute force version. First, we sort the array in $%O( N\cdot \log N)$% time with Merge sort or Quick sort. Then we iterate over the array twice to find the first two elements of the triple, $%a[i]$% and $%a[j]$%, in the same way as the brute force algorithm; this process is $%O( N^2)$%. However, instead of searching for the third element of the triple with a third loop, as in the brute force approach, we look for it using a binary search over the already sorted array. Since the complexity of the binary search is $%O(\log N)$%, we have a total complexity of $%O(N^2\cdot \log N)$%.

def binary_search(a, key):
  lo, hi = 0, len(a)-1
  while (lo <= hi):
    mid = lo + (hi-lo)/2
    if key < a[mid]:    hi = mid - 1
    elif key > a[mid]:  lo = mid + 1
    else:               return mid
  return None

def sum3_with_sort(a):
  res = []
  N = len(a)
  a.sort()                          # O(N log N)

  # Python ranges exclude the last term, i.e., range(0,3) = [0,1,2]
  for i in range(0, N-2):           # for i = 0 to N-3
    for j in range(i+1, N-1):       # for j = i+1 to N-2
      val = -(a[i] + a[j])
      # 'is not None' guards against a match at index 0, which is falsy
      if a[i] < a[j] < val and binary_search(a, val) is not None:
        res.append( [a[i], a[j], val] )
  return res


3-sum with hash table $%- O(N^2)$% average

We sped up the brute force version by realizing that if we already have two elements of the triple we can search for the third one in $%O(\log N)$% time using binary search. The hash table approach extends this idea: we can load the elements of the array into a hash table and then verify the existence of a third element $%a[k]$% such that $%a[i] + a[j] + a[k]=0$% in constant time. Again, the first two loops remain as they were in the brute force approach, but the third loop is replaced, not with an $%O(\log N)$% binary search, but with an $%O(1)$% hash table lookup:

def sum3_with_hash_table(a):
  res = []
  N = len(a)
  d = {}
  for i in range(N):
    d[a[i]] = 1
  a.sort()                          # O(N log N)

  # Python ranges exclude the last term, i.e., range(0,3) = [0,1,2]
  for i in range(0, N-2):           # for i = 0 to N-3
    for j in range(i+1, N-1):       # for j = i+1 to N-2
      val = -(a[i] + a[j])
      if a[i] < a[j] < val and val in d:
        res.append( [a[i], a[j], val] )
  return res

Python, Java and other modern languages include hash tables as standard data structures. In the sample code above, we used dictionaries, the hash tables of Python. We create a hash table $%d$% using the elements of the array as keys. We do not care about the values of the hash table, only about the hash table’s ability to verify membership in constant time. Thus, all the keys of the dictionary have a dummy value of 1. Alternatively, in Python, we could use sets, which are also implemented with hash tables and thus also offer $%O(1)$% membership checking.
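
As a quick illustration of this alternative, the routine above can be rewritten with a set; the name sum3_with_set is ours and is not part of the downloadable code:

def sum3_with_set(a):
  res = []
  N = len(a)
  d = set(a)                        # O(1) membership, like the dictionary
  a.sort()                          # O(N log N)

  for i in range(0, N-2):           # for i = 0 to N-3
    for j in range(i+1, N-1):       # for j = i+1 to N-2
      val = -(a[i] + a[j])
      if a[i] < a[j] < val and val in d:
        res.append( [a[i], a[j], val] )
  return res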

3-sum quadratic time $%- O(N^2)$% for average and worst cases

This was the fastest known 3-sum algorithm. It keeps two pointers that move from the extremes of the array toward the center, bracketing possible triples. The following is the algorithm shown in Wikipedia’s article on 3-sum, modified so that it returns all the triples that add up to zero (Wikipedia presents the existence version of 3-sum, i.e., it returns as soon as it finds a triple that adds up to zero):

def sum3_quad(S):
  res = []
  n = len(S)
  S.sort()
  for i in xrange(0,n-2):  # for i=0 to n-3 do:
    a = S[i]
    k = i+1
    l = n-1
    while k<l:
      b = S[k]
      c = S[l]
      if a+b+c == 0:
        res.append([a,b,c])
        l = l-1
        k = k+1
      elif a+b+c > 0:
        l = l-1
      else:
        k = k+1
  return res


3-sum quadratic nzp $%- O(N^2)$% for average and worst cases

Presently it is widely believed that 3-sum’s lower bound for the general case is quadratic and, in particular, that the algorithm just shown is the fastest known. This does not mean that we cannot find a faster algorithm; only that any faster algorithm would also have quadratic performance. Here we present such an algorithm which, in general, is about 3.5x faster than the one described in Wikipedia.

We are interested in triples $%(A, B, C)$% such that $$A+B+C=0$$

Like the previous algorithms, we start by sorting the array and adding its elements to a hash table. We then divide the problem into two parts: finding the triples that contain a zero as one of their elements and then finding those that don’t. These two cases cover all possibilities because, since all the elements of the array are distinct, no triple can contain more than one zero.

The triple has a 0

We first address the case in which one of the elements of the triple $%(A,B,C)$% is zero. Necessarily, this element lies between the other two, i.e., if, say, $%B=0$%, then $%A = -C$%. Thus, to find triples of the form $%(A, 0, C)$% we iterate over the positive values of the array, in linear time, and find, in constant time, whether their negatives are in the array by checking their membership in the hash table. If there are more positive numbers than negative ones, we iterate over the negative ones instead. The running time of this process is

  • $%O(1)$% in the best case, when there is a single positive (or negative) value
  • $%\sim N/2$% in the worst case, when there are the same number of positive and negative entries in the array.
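
A minimal sketch of this first step, assuming the sorted values have already been split into lists neg and pos and loaded into a hash table d (these names are ours, for illustration; the complete routines appear later in the post):

def zero_triples(neg, pos, d):
  """ triples (A, 0, -A); only valid when 0 is in the array """
  res = []
  v = neg if len(neg) <= len(pos) else pos
  for A in v:                       # linear scan over the smaller half
    if -A in d:                     # O(1) membership check
      res.append([A, 0, -A])
  return res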

Let’s see this with an example. We start with the following set with $%N=8$% entries:

$$a = [30, -40, -20, -10, 40, 0, 10, 5]$$

First we sort the array and mark 3 locations: that of the last negative value, $%n$%, that of zero, $%z$%, and that of the first positive value, $%p$%. Thus, if $%0$% is an element of the array then $%z = n+1$% and $%p = z+1 = n+2$%; otherwise, $%z$% is undefined and $%p = n+1$%.

a = [-40, -20, -10, 0, 5, 10, 30, 40]
      ^         ^   ^  ^          ^
      0         n   z  p          N-1

In our example we have $%n=2$%, $%z=3$% and $%p=4$%.

For both implementation and performance reasons, we rearrange the negative numbers so they are sorted in descending order, i.e., both the positive and the negative numbers are now sorted in ascending order of their magnitudes. This change does not alter the values of $%n$%, $%z$% or $%p$%:

a = [-10, -20, -40, 0, 5, 10, 30, 40]
      ^         ^   ^  ^          ^
      0         n   z  p          N-1
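
In code, this reversal can be done in place with a single slice assignment (a one-line sketch; the full routines below use an explicit swap loop and list.reverse() instead):

# assuming 'a' is the sorted array and 'n' the index of the last negative:
a[:n+1] = a[n::-1]    # negatives now sorted by ascending magnitude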

Now we execute the first step, namely the discovery of the triples that contain a zero. Since there are fewer negative numbers than positive ones, we iterate over the negative values of the array, from $%0$% to $%n$%, where $%n$% is the largest index of a negative value. For each such negative number $%A$% we check for the membership of $%-A$% in the hash table. If it is present, then $%(A, 0, -A)$% is a triple that adds up to zero and we add it to the result. The partial result at the end of this step is:

$$res = [ [-10, 0, 10], [-40, 0, 40] ]$$

The triple does not have a 0

Now we consider the case in which the triple does not have a zero as one of its elements. We are going to split this problem in two: first, finding the triples whose largest element in absolute terms is positive and then, finding those triples in which it is negative.

Hence, first consider the case in which the element of the triple with the largest absolute value is positive, i.e., $$|A|>|B|>|C|$$ and $%A>0$%. Since $%0$% is not in the triple, both $%B$% and $%C$% have to be negative. Now consider the possible values of $%C$%: $%C$% cannot be smaller than $%-A/2$% because if it were, $%B$% would have to be larger than $%-A/2$%, which contradicts $%|B|>|C|$%. In addition, $%B$% cannot be equal to $%C$% because all the integers of the array are distinct. Hence, both $%B$% and $%C$% are negative numbers such that

$$-A \lt B \lt -A/2 \lt C \lt 0$$.

To give a numerical example, if $%A=5$%, $%C$% has to be larger than $%-A/2= -2.5$%, i.e., $%C \in \{-1,-2\}$%, and $%B$% has to be smaller than $%-A/2$% but larger than $%-A$%, i.e., $%B \in \{-3,-4\}$%.

Since the negative numbers are sorted in ascending order of magnitude, we know that if there is a triple $%(A,B,C)$% with $%A>0$% then $%C$% is to be found at the beginning of the array and it will have a magnitude smaller than $%A/2$%. Thus, we iterate over such possible values of $%C$% and, in constant time, verify the membership of the remaining member of the triple, $%B=-(A+C)$%, in the hash table.
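
A sketch of this search, with pos holding the positive values in ascending order and neg holding the negative values sorted by ascending magnitude (again, names of our own; the complete code appears below):

def pos_major_triples(pos, neg, d):
  """ triples where the largest-magnitude element A is positive """
  res = []
  for A in pos:
    for C in neg:                   # candidates in ascending magnitude
      if -C >= A/2.:                # |C| >= A/2: no later C can work
        break
      B = -(A + C)                  # the only value completing the triple
      if B in d:                    # O(1) membership check
        res.append([A, B, C])
  return res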

Notice that it is here, in this second iteration, that the quadratic factor creeps in. Going over all the positive numbers has a cost of $%\sim N/2$% in the worst case, when there are the same number of positive and negative numbers in the array. Going over the negative numbers, although quadratic, is inexpensive because we only iterate over the negative values larger than $%-A/2$%. This makes the search efficient.

Continuing with our example, we first check for triples in which $%A = 5$%. Now we iterate over the negative numbers, starting with $%C = -10$% but $$-10 < -A/2 = -2.5$$ and thus no possible triple can have both $%5$% and $%-10$% under the assumption that $%5$% is the element of the triple with the largest magnitude. Thus, instead of iterating over all the negative values of the array, we iterated over a single one.

The same thing happens when we consider the next positive value, $%A = 10$%. The first possible negative value is, again, $%C = -10$%, but $$-10 < -A/2 = -5$$ and thus no triple that adds to zero can contain both $%-10$% and $%10$% (unless, of course, one of the elements of the triple is $%0$%, in which case the triple would have already been found in the first step).

Now we consider the next positive value, $%A=30$%. The first possible negative match is $%C=-10$%, which is larger than $%-A/2 = -15$%. Thus we look up, in constant time, whether the hash table has a key equal to $$B = -(A+C) = -(30-10) = -20$$ and, indeed, it does, so $%(A, B, C) = (30, -20, -10)$% is added to the result.

After iterating over the positive values and adding the results to those obtained when checking for triples with zeroes, we have:

$$res = [ [-10, 0, 10], [-40, 0, 40], [30, -20, -10]]$$

The third (and last) step is the dual of the second: the case in which the number of the triple with the largest absolute value is negative. This step is exactly the same as the second, simply exchanging the roles of the positive and the negative numbers, i.e., we iterate over the negative numbers and, for each negative $%A$%, iterate over the positive numbers as long as they are smaller than $%-A/2$%, searching for the remaining value among the keys of the hash table. The result after this last step is:

$$res = [ [-10, 0, 10], [-40, 0, 40], [30, -20, -10], [-40, 30, 10]]$$

This concludes the search.

Sample code

The following code, easy to port to languages like C that lack iterators, achieves an average speedup over the other quadratic algorithms of about 3x. Hash tables are not a standard data type in C, but they are not difficult to write.

def signum(x):
  return (x > 0) - (x < 0)          # -1, 0 or +1

def binary_search_2(a, key):
  """ searches a sorted array 'a' for 'key' in O(log n) time;
      returns whether it found the item and the last array pos used """
  lo, hi = 0, len(a)-1
  while (lo <= hi):
    mid = lo + (hi-lo)/2
    if key < a[mid]:    hi = mid - 1
    elif key > a[mid]:  lo = mid + 1
    else:               return True, mid
  return False, mid

def sum3_nzp(a):
  """ nzp without iterators (easy to port to, say, C) """
  res = []
  N = len(a)
  a.sort()

  # fast rejection: an all-non-negative or all-non-positive array
  # cannot contain a triple of distinct values that adds up to zero
  if ( a[0]*a[-1] == 0 ) or ( signum(a[0]) == signum(a[-1]) ):
    return res

  d = {}
  for i in xrange(N):
    d[a[i]] = 1

  # locate n (index of the last negative) and p (index of the first positive)
  zero_found, i = binary_search_2(a, 0)
  if zero_found:  n, p = i-1, i+1
  elif a[i] < 0:  n, p = i, i+1
  else:           n, p = i-1, i

  # reverse the negatives so they are sorted in ascending order of magnitude
  for i in xrange((n+1)/2):
    a[i], a[n-i] = a[n-i], a[i]

  # first deal with the triples that contain a zero
  if zero_found:
    num_neg = n+1
    num_pos = N - num_neg - 1       # exclude the zero itself

    # iterate over the smaller of the two halves
    if num_neg <= num_pos:
      l, h = 0, n+1
    else:
      l, h = p, N
    for i in xrange(l, h):
      if -a[i] in d:
        res.append([a[i], 0, -a[i]])

  # now assume |A| > |B| > |C| and A > 0.. then B < 0 and C < 0
  for i in xrange(p,N):
    A = a[i]
    bound = A/2.
    for j in xrange(0, n+1):
      C = a[j]
      if -C < bound:
        B = -(A + C)
        if B in d:
          res.append([A, B, C])
      else:
        break

  # now assume that |A| > |B| > |C| and A < 0.. then B > 0 and C > 0
  for i in xrange(0, n+1):
    A = a[i]
    bound = -A/2.
    for j in xrange(p,N):
      C = a[j]
      if C < bound:
        B = -(A + C)
        if B in d:
          res.append([A, B, C])
      else:
        break

  return res

The auxiliary signum() and binary_search_2() functions used in this version are also used in the following version, which relies on iterators. Languages like Java and Python that support them can do away with the indexing of the arrays and marginally improve the performance, to about 3.5x.

def sum3_nzp(a):
  res = []
  a.sort()

  # fast rejection
  if ( a[0]*a[-1] == 0 ) or ( signum(a[0]) == signum(a[-1]) ):
    return res

  # using a set to check for membership (same as a hash table)
  d = set(a)

  zero_found, i = binary_search_2(a, 0)
  if zero_found:  n, p = i-1, i+1
  elif a[i] < 0:  n, p = i, i+1
  else:           n, p = i-1, i

  neg = a[:n+1]
  neg.reverse()    # reverse negatives
  pos = a[p:]

  # first deal with zero case
  if zero_found:
    if len(neg) <= len(pos):  v = neg
    else:                     v = pos
    for A in v:
      if -A in d:
        res.append([A, 0, -A])

  # now assume |A| > |B| > |C| and A > 0.. then B < 0 and C < 0
  for A in pos:
    bound = A/2.
    for C in neg:
      if -C < bound:
        if -(A+C) in d:
          res.append([A, -(A+C), C])
      else:
        break

  # now assume that |A| > |B| > |C| and A < 0.. then B > 0 and C > 0
  for A in neg:
    bound = -A/2.
    for C in pos:
      if C < bound:
        if -(A+C) in d:
          res.append([A, -(A+C), C])
      else:
        break

  return res
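
A quick sanity check with the example array used throughout this post, assuming the routines above are in scope:

a = [30, -40, -20, -10, 40, 0, 10, 5]
print sum3_nzp(a[:])    # pass a copy; sum3_nzp sorts its argument
# [[-10, 0, 10], [-40, 0, 40], [30, -20, -10], [-40, 30, 10]]

# the triples should match those of the brute force reference, order aside
ref = sorted(sorted(t) for t in sum3_brute_force(a[:]))
nzp = sorted(sorted(t) for t in sum3_nzp(a[:]))
assert ref == nzp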


Experimental results

The following are timings of runs with arrays of up to 320 elements; both the brute force and the binary search versions take several minutes to complete with arrays not much bigger than that. These timings were obtained on a 2.4 GHz dual-core MacPro with 4 GB of RAM and 4 MB of L2 cache, averaged over 100 runs using arrays with uniformly-distributed random data.
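
The harness used for these measurements is included in the downloadable code; a minimal sketch of how such timings can be gathered (not necessarily the exact code used here) is:

import time, random

def avg_time(f, N, runs=100):
  """ average wall-clock time of f over 'runs' arrays of N distinct,
      uniformly-distributed random integers """
  total = 0.0
  for _ in xrange(runs):
    a = random.sample(xrange(-10*N, 10*N), N)   # N distinct integers
    t0 = time.time()
    f(a)
    total += time.time() - t0
  return total / runs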

Below the timings we show the corresponding estimate of the running time of each algorithm: that of the brute force 3-sum is roughly cubic, i.e., $%\sim N^{2.8}$%, and that of the 3-sum based on binary search is slightly above quadratic, i.e., $%\sim N^{2.1}$%, which is what is expected because its order is $%O(N^2\cdot \log N)$%. The solutions of all the algorithms are checked against those of the brute force algorithm. The running times of the quadratic algorithms are presented for reference, but the size of the arrays is too small to generate confidence; we will improve on them in the following section, when we use large arrays.

    N   brute f. w/ binsrch w/ hash t.   quadrat.  quad. nzp speedup
   10  1.332e-04  1.240e-04  6.053e-05  4.687e-05  4.069e-05  1.2x
   20  7.839e-04  5.628e-04  1.902e-04  1.696e-04  8.370e-05  2.0x
   40  5.221e-03  2.383e-03  6.823e-04  6.679e-04  2.336e-04  2.9x
   80  3.753e-02  1.031e-02  2.551e-03  2.618e-03  7.524e-04  3.5x
  160  2.804e-01  4.581e-02  9.890e-03  1.045e-02  2.703e-03  3.9x
  320  2.231e+00  2.003e-01  4.054e-02  4.281e-02  1.121e-02  3.8x

    brute f.:  T(N)=1.798e-07 x N ^ 2.813
  w/ binsrch:  T(N)=9.374e-07 x N ^ 2.127
  w/ hash t.:  T(N)=7.062e-07 x N ^ 1.884
    quadrat.:  T(N)=4.766e-07 x N ^ 1.971
   quad. nzp:  T(N)=6.961e-07 x N ^ 1.636


Next we have timings that extend to large arrays, with thousands and tens of thousands of entries. We no longer compute the timings for the brute force and binary-search-based versions; they take too long. However, now that we are using large arrays we can get reliable running times for the quadratic algorithms. The following timings were averaged over 20 runs.

    N w/ hash t.   quadrat.  quad. nzp speedup
   10  6.165e-05  4.745e-05  3.960e-05  1.2x
   20  1.884e-04  1.718e-04  8.515e-05  2.0x
   40  6.581e-04  6.553e-04  2.169e-04  3.0x
   80  2.555e-03  2.584e-03  7.543e-04  3.4x
  160  9.699e-03  1.032e-02  2.591e-03  4.0x
  320  3.938e-02  4.261e-02  1.105e-02  3.9x
  640  1.592e-01  1.732e-01  4.283e-02  4.0x
 1280  6.710e-01  7.234e-01  1.888e-01  3.8x
 2560  2.861e+00  3.018e+00  8.962e-01  3.4x
 5120  1.154e+01  1.266e+01  3.941e+00  3.2x
10240  4.897e+01  5.216e+01  1.648e+01  3.2x
20480  2.363e+02  2.329e+02  8.252e+01  2.8x

  w/ hash t.:  T(N)=4.409e-07 x N ^ 2.000
    quadrat.:  T(N)=3.819e-07 x N ^ 2.026
   quad. nzp:  T(N)=1.933e-07 x N ^ 1.957

As expected, both the version of 3-sum based on the hash table and the quadratic version from Wikipedia show a quadratic running time, i.e., $%\sim N^{2.0}$%. The 3-sum nzp algorithm is also quadratic, but its coefficient, $%\sim 2.0\times 10^{-7}$%, is half that of the other two, $%\sim 4.0\times 10^{-7}$%. This translates into a consistent 3x speedup over the other two quadratic versions, i.e., a 300% speedup.
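
For reference, power-law fits like the ones above can be obtained with a least-squares regression in log-log space; the following sketch (not necessarily the code used for these tables) recovers an exponent close to 2 from the nzp column:

import math

def fit_power_law(sizes, times):
  """ least-squares fit of T(N) = c * N^b in log-log space """
  xs = [math.log(n) for n in sizes]
  ys = [math.log(t) for t in times]
  k = len(xs)
  mx, my = sum(xs)/k, sum(ys)/k
  b = sum((x-mx)*(y-my) for x, y in zip(xs, ys)) / \
      sum((x-mx)**2 for x in xs)
  c = math.exp(my - b*mx)
  return c, b

sizes = [10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120, 10240, 20480]
times = [3.960e-05, 8.515e-05, 2.169e-04, 7.543e-04, 2.591e-03, 1.105e-02,
         4.283e-02, 1.888e-01, 8.962e-01, 3.941e+00, 1.648e+01, 8.252e+01]
print fit_power_law(sizes, times)   # close to (1.9e-07, 1.96), as above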

In practice the situation is even better than this. Consider a typical algorithmic problem, e.g., sorting: the running time is primarily a function of the size of the input data. We can have 10, 1000 or a trillion numbers to sort, i.e., the size of the input array is not constrained. 3-sum, however, is an unusual problem because its definition constrains the size of the array. If the data type we are using is an 8-bit number (i.e., type char), then the maximum size of the input array is 256 (i.e., from -128 to 127) because of the constraint that all the numbers of the array have to be distinct; it is simply not possible to have an array of more than 256 chars in which all the values are distinct. Likewise, if the data corresponds to pixels on a screen, with a typical linear resolution of, say, about 1500 pixels, then the maximum size of the array is about 1500 elements. Moreover, these are maximums for which every element of the domain is in the array; in the average case, if we are dealing with chars we are likely to have fewer than 256 elements in the array.

With this in mind, we can consider typical cases to be arrays with hundreds to thousands of elements, due to the requirement that each data point be unique. As the timings show, 3-sum-nzp beats the other two quadratic algorithms for any array size, large or small. However, it is around the range of hundreds to thousands of elements that 3-sum nzp has its best relative performance, close to 4x. In this range, the time spent searching for triples dominates both the algorithm’s overhead (creating the hash table and sorting, which lowers performance for very small arrays) and its resource limitations (cache misses incurred when comparing widely separated locations of an array, which lower performance for very large ones).

Related links

Reductions, Robert Sedgewick and Kevin Wayne

Algorithms, Robert Sedgewick and Kevin Wayne, chapter 1 – Fundamentals

Visibility graphs and 3-sum, Michael Hoffman

A survey of 3-sum-hard problems, James King

3Sum, Wikipedia


This post is an extended version of the following original post:

NZP – a new quadratic 3-sum algorithm
Forum of Algorithms I, by Robert Sedgewick and Kevin Wayne, offered under Coursera
First publication date: Sun 9 Sep 2012 11:56:17 PM PDT
Link: https://class.coursera.org/algs4partI-2012-001/forum/thread?thread_id=2043