3

In Travelling Salesman Problem setting, a dataset of observed paths that visit each vertex once with a fixed starting point is given. The shortest paths (TSP paths) and Nearest Neighbour paths are considered and my task is to decide which algorithm describes the observed paths better. My only idea is to compare the percentage of TSP paths and NN paths in the given dataset. Is there a statistical method applicable to this problem?

UPDATE Let us consider two "closeness" of the given path starting from the fixed vertex $O$ to the TSP and NN paths respectively $$ \rho_{TSP}={\rm the\, length\, of\, the\, given\, path - the\, length\, of\, the\, shortest\, TSP\, path} $$ and $$ \rho_{NN}=\# {\rm edges}\, AB\, {\rm on\, the\, given\, path\, such\, that}\, B\, {\rm is\, the\, nearest\, vertex\, to}\, A\, {\rm among\, the\, unvisited\, vertices} - \# {\rm all\, vertices} $$ Then one can consider the set of the points on the plane with coordinates $(\rho_{TSP}({\rm path}),\rho_{NN}({\rm path}))$ for all possible paths and the distinguished point $(\rho_{TSP}({\rm given\, path}),\rho_{NN}({\rm given\, path})$. I think that this data somehow can allow one to determine if the given path is closer to the TSP path or the NN path.

8k14
  • 181
  • 7

1 Answers1

1

From your problem description, it sounds like you are most interested in the question "Are path costs for my algorithm more similar to what I would expect from the optimal traveling salesman solution or the nearest neighbour solution?"

The underlying problem is that you want to resolve how similar the distributions of costs are and I can think of two statistics-based approaches to resolving this question:

  1. comparing their histograms using a metric like correlation or $\ell_2$. More metrics can be found in this answer.
  2. use a statistical goodness of fit test like the Kolmogorov–Smirnov test, the $\chi^2$ test, or the Anderson-Darling test. Which test you use depends what your costs on the graph look like for instance are they integers, positive integers, or real-valued?

In general, I think this problem is a little tricky to put into a statistical setting because it depends on the structure of the graphs and how the edge weights are assigned. For instance you can imagine families of graphs for which the NN and optimal TSP solution always coincide. If these graphs are say exponential random graphs with some distribution over the edge weights it may be possible to say more.

In your case, if I wanted a quick answer I would a take representative sample of the class of graphs of interest with similar size of vertices and edges. Then I'd compute the total path cost for each different graphs and different algorithm then compare histograms from all the graphs I have in my sample (although this is not a statistically well-founded solution!).

Finally, if you are really interested in analysing the performance, it may be worth pulling out a copy of your favourite analysis of algorithms text and formally going through the process, but that's not possible here without a description of the algorithm in question.

MachineEpsilon
  • 2,686
  • 1
  • 17
  • 29
  • Thank you for your answer. Could you please clarify your idea of comparing the distributions. What distributions do you mean? – 8k14 Jan 04 '18 at 06:48
  • I'm not sure tha it's appropriate to focus solely on the aggregate costs as NN solution is local – 8k14 Jan 04 '18 at 06:51
  • You take a representative sample of graphs of interest with similar parameters. You compute the total path costs according to your three graphs algorithms for each graph in your sample. This gives your three distributions $F_{OBS}$, $F_{NN}$ and $F_{TSP}$. You bin them into equal length histograms $H_{OBS}$, $H_{NN}$ and $H_{TSP}$. You use a measure of histogram distance $d$ and calculate $d(H_{OBS}, H_{TSP})$ and $d(H_{OBS}, H_{NN})$ and select the maximum to determine whether the observed paths are more like the nearest neighbour or optimal TSP solution. – MachineEpsilon Jan 04 '18 at 07:17
  • Are your observed paths drawn from a single graph? Can you describe a bit more the structure of your graph (approximately how many vertices and how many edges, sparse, symmetric)? I'm not sure it's very easy to incorporate local information about the structure of subgraphs associated with the paths. – MachineEpsilon Jan 04 '18 at 07:27
  • Thank you. For choosing the sample which parameters should be taken into account? Number of vetrex? Mean edge cost? – 8k14 Jan 04 '18 at 07:40
  • Every observed path is for a unique symmetric complete euclidean graph. The number of vertices is from 20 to 3, the costs are integers from 1500 to 20 – 8k14 Jan 04 '18 at 08:21
  • I thought the graphs might be more varied than that, and so it might be worth trying something like stratified sampling across the graph parameters, but I this case I would just try random sampling as a first approximation. This is an interesting question, I'm sorry I can't give more specific advice. – MachineEpsilon Jan 05 '18 at 14:14
  • Thank you. What do you think about the idea described in the question update? – 8k14 Jan 05 '18 at 15:09