This Xmas I took part in the Kaggle Traveling Santa 2018 - Prime Paths competition. I had not invested a significant amount of time into a Kaggle competition before, and found this one to be a real challenge and great fun.
The challenge involved a large scale traveling salesman problem (TSP) consisting of 197,768 points which must all be visited exactly once whilst minimising the overall length of the tour. Each point (x, y) was associated with an ID number which was used to add a fiendish twist to the problem: For every 10th point visited in the tour, if the previous point ID is *not* a prime number, then the distance to the 10th point is increased by 10%.
Specifically, let $\mathbb{P}$ be the set of primes, $Id_i$ be the ID number of the point at the $i^{th}$ index of a permutation $s$ of $N$ points, and $distance_{p, q}$ return the (Euclidean) distance between 2 points $p, q$. Then, the objective $g$ is to minimise:
$$ g(s) = distance_{N, 0} + \sum_{i = 1}^{N} P_i \cdot distance_{i - 1, i} $$

Where

$$ P_i = \begin{cases} 1.1, & \text{if } i\bmod 10 = 0 \wedge Id_{i-1} \notin \mathbb{P} \\ 1, & \text{otherwise} \end{cases} $$

What makes this problem challenging is the fact that it is NP-hard: a brute force attempt would involve evaluating all $N!$ possible tour permutations. What makes it especially interesting is the prime number soft constraint. Introducing a penalty which depends on the previously visited point in the tour makes the problem asymmetric: evaluating the cost of a tour in reverse order will result in a different score, $g(s) \neq g(reverse(s))$.
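To make the objective concrete, here is a minimal Java sketch of the scoring function. It assumes the coordinates live in parallel `xs`/`ys` arrays indexed by city ID and that `isPrime` is a precomputed primality table over the IDs (the names are illustrative, not taken from the competition code):

```java
// Sketch of the objective g(s): sum the step lengths, adding 10% to every 10th
// step whenever the ID of the city we are coming from is not prime.
static double tourCost(int[] tour, double[] xs, double[] ys, boolean[] isPrime) {
    double total = 0.0;
    for (int i = 1; i <= tour.length; i++) {
        int from = tour[i - 1];
        int to = tour[i % tour.length];                 // wraps back to the start city
        double d = Math.hypot(xs[from] - xs[to], ys[from] - ys[to]);
        if (i % 10 == 0 && !isPrime[from]) {            // the prime penalty soft constraint
            d *= 1.1;
        }
        total += d;
    }
    return total;
}
```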
The first thing I tried was to ignore the prime penalty and use the famous 2-Opt heuristic to generate a "2-optimal" (explained below) tour. I had implemented a variation of 2-Opt some time ago in Java and Lisp, so took the chance to revisit my old (messy) code.
The 2-Opt algorithm is a well known local search heuristic for the traveling salesman problem (TSP). Like all local search approaches, the algorithm works by exploring a search space of candidate solutions, which in the case of the TSP, will be different permutations of a tour.
To explore a search space, it is first necessary to define a neighbourhood function. A naive neighbourhood function for the TSP may be to create a new permutation of a tour by swapping 2 elements and then checking to see if the resulting tour is an improvement or not. If the swap yields an improvement, then the search accepts the candidate solution and transitions to a new point in the search space. If it is not, the swap may be reverted and then another neighbouring candidate evaluated.
Each candidate solution may have a number of neighbours (defined by the neighbourhood function). The choice of neighbour on each iteration of the algorithm is defined by a search strategy, which may be *opportunistic* (accept the first found move which results in an improvement on the objective function), *greedy* (accept the move which yields the best improvement in the candidate set), or *stochastic* (accept a move probabilistically according to some policy - e.g., simulated annealing).
The key difficulty when formulating neighbourhood functions is finding a scheme which results in a "smooth" search space. The trouble with the 2-element-swap approach is that whilst it returns valid candidate solutions (any permutation of points is valid), the neighbourhood operator is *destructive*, resulting in poor quality tours: swapping 2 elements is likely to just tangle the tour further.
The 2-Opt heuristic on the other hand works by swapping 2 *edges* whilst preserving the overall order of the tour. To illustrate, consider the following tour (A to L and back to A): A-B-C-D-E-F-G-H-I-J-K-L-A, which spatially might be arranged like this:
```
(A) -> (B) -> (C)     (I) <- (H) <- (G)
 ^              \    /               ^
 |               \  /                |
 ^                \/                 ^
 |                /\                 |
 ^               /  \                ^
 |              /    \               |
(L) <- (K) <- (J)     (D) -> (E) -> (F)
```
Note that the tour is obviously inefficient. In 2-Opt, 2 candidate edges are selected and swapped with one another to yield a new tour. In this example, the 2 edges may be (C-D) and (I-J):
A-B-C-D-E-F-G-H-I-J-K-L-A.
Swapping these 2 edges will result in 2 new edges (C-I) and (D-J) respectively. To do this, the (C-D) and (I-J) edges are first deleted, resulting in an invalid, disconnected tour:
```
(A) -> (B) -> (C)     (I) <- (H) <- (G)
 ^                                   ^
 |                                   |
 ^                                   ^
 |                                   |
 ^                                   ^
 |                                   |
(L) <- (K) <- (J)     (D) -> (E) -> (F)
```
The next step is to reverse the order of one of the disconnected parts. Here we arbitrarily select the right-hand sub-tour given that both parts are the same size; otherwise we would select the shorter of the 2 so as to minimise the computation time of the reverse operation.
```
(I) -> (H) -> (G)
               |
               v
               |
               v
               |
               v
(D) <- (E) <- (F)
```
The final step is to stitch the tour back together with the new (C-I), (D-J) edges, yielding a valid (and in this case, better) tour:
A-B-C-I-H-G-F-E-D-J-K-L-A
```
(A) -> (B) -> (C) -> (I) -> (H) -> (G)
 ^                                  |
 |                                  v
 ^                                  |
 |                                  v
 ^                                  |
 |                                  v
(L) <- (K) <- (J) <- (D) <- (E) <- (F)
```
As such, a single 2-Opt operation is a neighbourhood function and the neighbourhood of a tour is defined as the set of all edge *pair* combinations. Choosing which pair of edges to consider and which pair to select when multiple improvements are possible is the task of the search algorithm.
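As a concrete sketch, assuming the tour is stored as an `int[]` of city IDs and that `i < j` are the positions of the first endpoints of the 2 chosen edges, a single 2-Opt move is just an in-place segment reversal:

```java
// Apply a 2-Opt move: remove edges (tour[i], tour[i+1]) and (tour[j], tour[j+1]),
// then reconnect by reversing the segment tour[i+1 .. j] in place. Assumes i < j.
static void twoOptMove(int[] tour, int i, int j) {
    int lo = i + 1, hi = j;
    while (lo < hi) {
        int tmp = tour[lo];
        tour[lo] = tour[hi];
        tour[hi] = tmp;
        lo++;
        hi--;
    }
}
```

In practice you would reverse whichever of the 2 segments is shorter (treating the tour as circular), as described above.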
Having defined a neighbourhood function, it is then possible to evaluate potential 2-Opt moves. One strategy might be to continuously select 2 random edges and check whether 2-Opting those edges would result in a shorter tour. Instead of random edge selection, Fast Local Search (FLS) iterates over all possible edge combinations, checking each for potential moves. If a move is found, the corresponding edges are 2-Opted and the search continues from the previous edge in the tour. This process continues until no possible moves can be found, at which point the tour is known to be 2-Optimal.
The trouble with the approach above is that a lot of time is spent (re)evaluating moves which will not result in an improvement. Jon Bentley, in his 1992 paper, describes an optimisation for this process known as "Don't look bits". Each point/city in the tour is initialised as active at the start of the local search procedure. When no improving move can be found for the edges leaving a point (each is compared with every other edge), that point is set to inactive. If a move is found, the endpoints of the 2 edges involved in the 2-Opt move are set back to active. Inactive points are ignored when evaluating potential moves. This has the effect of forcing the local search procedure to only consider regions of the search space which might still yield an improvement. Local search terminates when every point is inactive.
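A rough sketch of the FLS sweep with don't look bits, building on the `twoOptMove` reversal above (the bit is kept per city here, and city IDs are assumed to run from 0 to n-1):

```java
// Fast Local Search with don't-look bits. Repeatedly sweep the tour; a city whose
// outgoing edge yields no improving 2-Opt move is deactivated and skipped until
// one of its edges is touched by a later move.
static void flsOptimise(int[] tour, double[] xs, double[] ys) {
    int n = tour.length;
    boolean[] active = new boolean[n];                     // don't-look bit per city
    java.util.Arrays.fill(active, true);
    boolean improvedInSweep = true;
    while (improvedInSweep) {                              // stop once a full sweep finds nothing
        improvedInSweep = false;
        for (int i = 0; i < n; i++) {
            int a = tour[i];
            if (!active[a]) continue;                      // don't look at this city
            int b = tour[(i + 1) % n];
            boolean improved = false;
            for (int j = 0; j < n; j++) {
                // Skip the same edge and edges adjacent to (a, b).
                if (j == i || (j + 1) % n == i || (i + 1) % n == j) continue;
                int c = tour[j], d = tour[(j + 1) % n];
                double delta = dist(xs, ys, a, c) + dist(xs, ys, b, d)
                             - dist(xs, ys, a, b) - dist(xs, ys, c, d);
                if (delta < -1e-9) {                       // improving 2-Opt move found
                    twoOptMove(tour, Math.min(i, j), Math.max(i, j));
                    active[a] = active[b] = active[c] = active[d] = true;
                    improved = true;
                    improvedInSweep = true;
                    break;
                }
            }
            if (!improved) active[a] = false;              // nothing found: stop looking
        }
    }
}

static double dist(double[] xs, double[] ys, int p, int q) {
    double dx = xs[p] - xs[q], dy = ys[p] - ys[q];
    return Math.sqrt(dx * dx + dy * dy);
}
```

A real implementation would restrict the inner scan to each city's nearest neighbours rather than all $n$ edges, but the structure is the same.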
When evaluating a potential move between 2 edges, the Euclidean difference between the original edges $ab$ and $cd$ and the candidate edges $ac$ and $bd$ is calculated:

$$ \left(\sqrt{\delta_{ac}} + \sqrt{\delta_{bd}}\right) - \left(\sqrt{\delta_{ab}} + \sqrt{\delta_{cd}}\right) $$

Where $\delta_{pq} = (p_x-q_x)^2+(p_y-q_y)^2$
This operation is the bottleneck in the FLS algorithm. The trouble is the sqrt() call, which is slow relative to the simple arithmetic around it, and the Euclidean difference calculation needs 4 of them, which makes matters worse.
An obvious optimisation would involve precomputing the distances for all edge combinations and stashing the results in a lookup table/matrix, with the expectation that a lookup will be computationally less expensive than a sqrt() call. This is fine for small enough problems, but what if the number of cities/points in the tour is sufficiently large to make this approach infeasible? In terms of an array based implementation, this would involve an array of size $\frac{1}{2}n(n-1) \cdot dtype$, where $dtype$ might be 64 bits for a double - for this problem, roughly $1.96 \times 10^{10}$ entries.
If precaching is infeasible, it is still possible to minimise the number of sqrt() calls by exploiting the fact that for the Euclidean difference above to be < 0, at least one of the candidate edges must be shorter than one of the edges it replaces. If both candidate edges are longer, there is no point in calling sqrt(): since sqrt() is monotonic, we can compare the squared deltas directly, bypassing sqrt() most of the time. The Euclidean difference function can then return early with a value > 0 - a degradation in tour length if the edges were swapped - whenever $\delta_{ab} \lt \delta_{ac} \wedge \delta_{cd} \lt \delta_{bd}$.
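A sketch of this early-exit evaluation, using the same coordinate arrays as before; any positive return value is enough to signal "not an improvement", so the exact constant on the fast path does not matter:

```java
// Cost difference of 2-Opting edges (a, b) and (c, d) into (a, c) and (b, d).
// Squared distances are compared first, so sqrt() is skipped whenever both
// candidate edges are longer than the edges they would replace.
static double moveCost(double[] xs, double[] ys, int a, int b, int c, int d) {
    double dAB = sq(xs[a] - xs[b]) + sq(ys[a] - ys[b]);
    double dCD = sq(xs[c] - xs[d]) + sq(ys[c] - ys[d]);
    double dAC = sq(xs[a] - xs[c]) + sq(ys[a] - ys[c]);
    double dBD = sq(xs[b] - xs[d]) + sq(ys[b] - ys[d]);
    if (dAB < dAC && dCD < dBD) {
        return 1.0;                               // both new edges are longer: no improvement
    }
    return Math.sqrt(dAC) + Math.sqrt(dBD) - Math.sqrt(dAB) - Math.sqrt(dCD);
}

static double sq(double v) { return v * v; }
```

This check could replace the plain delta computation in the FLS sweep sketched earlier.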
I was first introduced to GLS back in university whilst studying a constraint satisfaction course led by Professor Edward Tsang.
GLS is a metaheuristic which works in combination with an existing local search method. It guides the local search by augmenting the underlying objective function, assigning additional penalties to features present in the candidate solution. The underlying local search algorithm remains unchanged, in the sense that GLS "sits on top" of the search procedure.
The overall idea of GLS is to allow the underlying local search to reach a local minimum, and then to penalise features which frequently occur in subsequent minima, so that over time the local search can break out of the local minima in the search landscape defined by the objective function.
In the context of the TSP, a solution feature is an $edge_{p, q}$ defining a connection between points $p$ and $q$ which is present in the tour. GLS maintains a penalty matrix where each possible $edge_{p, q}$ is associated with a $penalty_{p, q}$ term. An augmented cost function $h(s)$ is then created by including the penalty terms in the original $g(s)$ cost function as follows:
$$ h(s) = g(s) + \lambda \sum_{i=1}^{N} penalty_{i-1, i} $$

The $\lambda$ parameter controls the importance of the penalties to the underlying local search. Since the local search procedure will consider candidate edges to swap, $\lambda$ is set to the average contributing cost of an edge with respect to the overall tour cost:
$$ \lambda = \alpha\frac{g(minima)}{N} $$

The $\alpha$ parameter is then a meta-parameter of the GLS algorithm and must be determined upfront. There is some guidance as to what this should be in this paper depending on the choice of underlying local search. In my testing, I've found a low value of $0.05$ to be effective in most cases. One possible improvement may be to control this meta parameter by some other metaheuristic such as Simulated Annealing, which would result in a meta-metaheuristic!
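To show how the penalties feed back into the move evaluation, here is a hedged sketch; the `PenaltyLookup` interface is just a placeholder for whichever structure ends up holding the penalty counts (discussed below), and the method names are illustrative:

```java
// Placeholder for whatever structure holds the penalty counts
// (a plain matrix, the triangular array, or the memory mapped variant below).
interface PenaltyLookup {
    int get(int p, int q);
    void increment(int p, int q);       // used by the penalise step further down
}

// lambda = alpha * g(local minimum) / N, recomputed at each new local minimum.
static double lambda(double alpha, double localMinimumCost, int n) {
    return alpha * localMinimumCost / n;
}

// Augmented delta of a 2-Opt move under GLS: the plain distance delta plus lambda
// times the net change in penalty counts over the removed and added edges.
static double augmentedMoveCost(double rawDelta, double lambda, int a, int b, int c, int d,
                                PenaltyLookup penalties) {
    int penaltyDelta = penalties.get(a, c) + penalties.get(b, d)
                     - penalties.get(a, b) - penalties.get(c, d);
    return rawDelta + lambda * penaltyDelta;
}
```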
As mentioned previously, GLS allows the underlying local search procedure to run until it becomes stuck in a local minimum. With 2-Opt FLS, this means waiting until it is no longer possible to perform any 2-Opt moves which would yield an improvement. At this point, GLS takes over and assigns penalties to the features (edges) present in the 2-Optimal solution $s$.
In each iteration, GLS decides which edges to penalise by selecting those which maximise a utility function, designed to penalise individually long edges so that shorter edges are favoured:
$$ util(s, edge_{p, q}) = I_{edge_{p, q}}(s) \cdot \frac{distance_{p, q}}{1 + penalty_{p, q}} $$

Where

$$ I_{edge_{p, q}}(s) = \begin{cases} 1 & \text{if } edge_{p, q} \in s \\ 0 & \text{otherwise.} \end{cases} $$

This has the effect that an edge which keeps appearing in local minima is penalised up until the point where the utility of penalising it further becomes less than that of some other edge. This is the really special part of GLS: the search landscape is continuously modified by levelling over minima until the search can fall into another. I like to think of it as a strategy to escape from a well by filling the well with water until you can climb over the top (and then fall into another well).
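A sketch of this penalise step, run each time FLS gets stuck; it reuses the hypothetical `PenaltyLookup` and the `dist` helper from earlier:

```java
// GLS penalise step: over the edges of the current local minimum, find the maximum
// utility distance / (1 + penalty), then bump the penalty of every edge attaining it.
static void penalise(int[] tour, double[] xs, double[] ys, PenaltyLookup penalties) {
    int n = tour.length;
    double maxUtil = -1.0;
    for (int i = 0; i < n; i++) {
        int p = tour[i], q = tour[(i + 1) % n];
        maxUtil = Math.max(maxUtil, dist(xs, ys, p, q) / (1.0 + penalties.get(p, q)));
    }
    for (int i = 0; i < n; i++) {
        int p = tour[i], q = tour[(i + 1) % n];
        double util = dist(xs, ys, p, q) / (1.0 + penalties.get(p, q));
        if (util == maxUtil) {
            penalties.increment(p, q);      // this edge now costs lambda more under h(s)
        }
    }
}
```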
My first submission to the competition was a 2-Optimal tour (score 1641736.82), found with the standalone 2-Opt FLS algorithm. For this I ignored the prime penalty. It placed me quite low on the leaderboard, although far from the bottom. The next step was to try out GLS, still ignoring the prime penalty soft constraint for now.
The problem I've always had with GLS is the dependence on the penalty matrix. For large enough problems, use of a penalty matrix becomes an issue.
For symmetric instances of the TSP, we do not need the entire $N^2$ matrix, and can store the penalties in an array of length $N(N-1)/2$. We can also make use of 16 bit shorts instead of 32 bit integers, reducing the memory footprint further. In Java, the maximum array length is $2^{31} - 3$, which means we could fit a $2^{16} = 65,536$ point problem into a single array, which would be approx 4g as shorts or 8g as integers (excluding the extra JVM memory overhead).
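A sketch of this triangular packing; the class name is illustrative, and it assumes at most $2^{16}$ points so that the array length stays within the limit above:

```java
// Symmetric penalty storage: only the strict lower triangle is kept, so a pair
// (i, j) with i > j maps to the flat index i * (i - 1) / 2 + j.
final class TriangularShortMatrix {
    private final short[] data;

    TriangularShortMatrix(int n) {                          // assumes n <= 65,536
        data = new short[(int) ((long) n * (n - 1) / 2)];
    }

    private static int index(int i, int j) {
        if (i < j) { int t = i; i = j; j = t; }             // enforce i > j
        return (int) ((long) i * (i - 1) / 2) + j;          // long maths avoids overflow
    }

    short get(int i, int j) { return data[index(i, j)]; }

    void increment(int i, int j) { data[index(i, j)]++; }
}
```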
For problems with more than $2^{16}$ points, the penalties could be split across a bunch of arrays held in a single collection. For big enough problems such as the Kaggle TSP (198k points), even this becomes impractical: my machine has 32g of RAM, which is less than the theoretical 36.5g of memory required to hold a (short) penalty matrix of this size.
One possibility could be to store the penalties in a sparse-array or hashmap. This would work if the penalty matrix is usually sparse - which I have not yet checked.
In the case of a non-sparse matrix, an alternative could be to make use of a memory mapped file, inspired from here, which is the approach I've taken in this case. This allows a truly massive array to be accessible to the Java process. Memory mapped files can also be used for high performance IPC, such as in the Chronicle Queue project.
Implementation of the memory mapped file matrix is similar to the approach above in which a single virtual array is backed by a collection of arrays. A request for an array index $i$ will be mapped to a specific array $i / arrayLength$ and a specific position in that array $i \bmod arrayLength$. The difference is that each array is a ByteBuffer aligned and mapped to a single RandomAccessFile.
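Here is a minimal sketch of that idea (not the exact implementation): a virtual array of shorts split across 1 GiB memory mapped chunks, each chunk a MappedByteBuffer over its own file. The file naming and chunk size are illustrative choices:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// A huge virtual short[] backed by memory mapped files: element index i lives in
// chunk i / CHUNK_ELEMENTS at byte offset (i % CHUNK_ELEMENTS) * 2.
final class MappedShortArray {
    private static final long CHUNK_ELEMENTS = 1L << 29;    // 2^29 shorts = 1 GiB per chunk
    private final MappedByteBuffer[] chunks;

    MappedShortArray(String dir, long totalElements) throws IOException {
        int nChunks = (int) ((totalElements + CHUNK_ELEMENTS - 1) / CHUNK_ELEMENTS);
        chunks = new MappedByteBuffer[nChunks];
        for (int c = 0; c < nChunks; c++) {
            long elements = Math.min(CHUNK_ELEMENTS, totalElements - (long) c * CHUNK_ELEMENTS);
            try (RandomAccessFile raf = new RandomAccessFile(dir + "/penalties-" + c + ".bin", "rw")) {
                // The mapping stays valid after the file handle is closed.
                chunks[c] = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, elements * 2);
            }
        }
    }

    short get(long i) {
        return chunks[(int) (i / CHUNK_ELEMENTS)].getShort((int) ((i % CHUNK_ELEMENTS) * 2));
    }

    void put(long i, short value) {
        chunks[(int) (i / CHUNK_ELEMENTS)].putShort((int) ((i % CHUNK_ELEMENTS) * 2), value);
    }
}
```

The (p, q) to flat-index mapping is the same triangular formula as before, just carried out in long arithmetic.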
This matrix will ultimately be accessed from within a for loop. As such, creating a memory mapped file on a mechanical drive would be a pretty bad idea. It would be even worse if the underlying local search algorithm was stochastic (evaluating random edges), since this would result in random, as opposed to sequential, reads. I initially placed the memory mapped file on my Samsung SSD 850, which is advertised as capable of 550MB sequential reads per second (10,000 read IOPS):
```
[phil@arasaka ~]$ sudo hdparm -Tt /dev/sda

/dev/sda:
 Timing cached reads:   37656 MB in  1.99 seconds = 18909.36 MB/sec
 Timing buffered disk reads: 1496 MB in  3.00 seconds = 498.22 MB/sec
```
This worked, however it really slowed down the algorithm (I'll add some benchmarks later). I used this as an excuse to treat myself to an NVM Express (NVMe) stick capable of 480K IOPS:
```
[phil@arasaka ~]$ sudo hdparm -Tt /dev/nvme0n1p1

/dev/nvme0n1p1:
 Timing cached reads:   37378 MB in  1.99 seconds = 18769.15 MB/sec
 Timing buffered disk reads: 7408 MB in  3.00 seconds = 2469.10 MB/sec
```
That works out at roughly 1.3 billion (short) penalties per second, which helped a great deal. To speed things up a little more, I implemented a least recently used (LRU) cache on top of the memory mapped file penalty matrix, with the intention of minimising the number of read operations.
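One simple way to get such a cache in Java is a LinkedHashMap in access-order mode with removeEldestEntry eviction. A sketch, written against the hypothetical MappedShortArray above with a write-through policy:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache in front of the mapped penalty matrix. Keys are flat matrix indices;
// capacity is whatever fits comfortably on the heap.
final class LruPenaltyCache {
    private final MappedShortArray backing;
    private final LinkedHashMap<Long, Short> cache;

    LruPenaltyCache(MappedShortArray backing, int capacity) {
        this.backing = backing;
        this.cache = new LinkedHashMap<Long, Short>(capacity, 0.75f, true) {   // true = access order
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Short> eldest) {
                return size() > capacity;                    // evict the least recently used entry
            }
        };
    }

    short get(long index) {
        Short cached = cache.get(index);
        if (cached != null) return cached;
        short value = backing.get(index);                    // miss: read from the mapped file
        cache.put(index, value);
        return value;
    }

    void put(long index, short value) {
        backing.put(index, value);                           // write through to disk
        cache.put(index, value);
    }
}
```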
The coolest thing about having a memory mapped file penalty matrix is the fact that its contents (37G in this case) are persisted to disk. As such, I was able to stop and start the algorithm at will to make tweaks/fix bugs etc.
Having found a way to use GLS with a large matrix, I was able to escape the local minimum found by my first 2-optimal submission. I then left the algo. running with the plan to do some other things with my vacation, such as reading books and other analogue stuff (my actual plan was to learn Rust by porting this GLS-FLS algo.). The plan was to do that, then check in at the end of the competition to collect my gold medal. I was, of course, deluded.
While my solution was making slow but steady progress without even considering the prime penalty, I found it hard to progress on the leaderboard. It turned out that many of the participants were using existing TSP solvers, notably the Concorde TSP solver and the LKH solver. LKH is an implementation of the Lin–Kernighan heuristic and was (of course) able to produce significantly better tours in a much shorter period of time than my own solver. The rules of the competition did not forbid this, so it was really nice when William Cook, author of Concorde, and Keld Helsgaun, author of LKH, showed up in the competition as a team, straight in at number 1 where they stayed to eventually win.
I made some changes to my code so that it could accept an existing 2-optimal tour as input (a 4-optimal tour is also 3- and 2-optimal), and started work on adding the prime penalty to the FLS objective function. The idea was to generate a good tour ignoring the prime penalties using the LKH solver, and then use this as a starting point to fine tune with GLS-FLS with prime penalties enabled.
Before I could finish, some really good kernels started to appear. The trouble was that anyone could run these high quality solutions and get a good score. To make matters worse, the kernels were posted with output solutions, so it was not even necessary to run, let alone understand the code to obtain a good score. There were then 100s of entries (above my solution) which were the result of dragging and dropping a file without even the need to touch a keyboard! This continued to the very last few days, and even 24 hours before the deadline.
Having added the prime penalties to my objective function, I fired up GLS-FLS again using as input a good tour found from LKH. Again, I was producing slow improvements, slowly edging my way up the leaderboard. The tipping point came with a kernel, aptly titled Shame on me.. This was a really nice idea, however it was posted quite late in the competition - I was very quickly buried much further down by 100s of drag-and-drop entries. Given the limited time, my strategy then switched to taking the output of these high scoring kernels and feeding them into my own algo.
In the end, I managed to come away with a bronze medal (140/1874). My highest position in the game was a silver, at position 73. This was the first competition I had invested a significant amount of time into and was optimising right up until the last minute of the game. It was great fun revisiting GLS and especially getting it to work on such a large TSP instance. In the process, I've picked up a fresh understanding (and appreciation) for this elegant metaheuristic and started to think of some possible extensions. I'm looking forward to next year :)
GLS-FLS code on Github.