Today I would like to talk about some difficulties regarding benchmarking of hard problems. With hard problems I mean problems which are supposed to be harder than NP, or related classes, from a computational complexity point of view. This is a very interesting topic on which I am currently working in collaboration with Cassio P. de Campos from Queen’s University Belfast. The motivation for this work comes from the fact that we are both confronted with very difficult computational problems in our research. In the past I have worked a lot with stochastic combinatorial optimization problems, which might be extremely difficult to solve or even to approximate. For example, the *Probabilistic Traveling Salesman Problem with Deadlines *is #P-hard, which implies NP-hardness and is a much stronger hardness result. In fact, already the evaluation of the objective function for this problem is #P-hard. The question is, how can we empirically assess the performance of algorithms for difficult problems such as the *Probabilistic Traveling Salesman Problem with Deadlines*?

The situation is not too bad if we are able to compute the value of an optimal solution in a reasonable amount of time and if we are able to evaluate solutions in a reasonable amount of time. But since the problem is #P-hard this is only possible for very small instances. Instances of practical relevance might not belong to that category. What can we do in that case?

If we create a benchmark by sampling from the set of relevant instances in a fair way, we run into the following two problems: We are in general not able to compute an optimal solution (or the value of an optimal solution) for the benchmark instances and we are in general not able to evaluate solutions provided by algorithms on the benchmark instances (in many cases not even approximately). How can we empirically assess the performance of algorithms in such a setting? It seems that we cannot do a lot here. An algorithm returns a solution and we are not able to evaluate, or even estimate, the quality of such a solution.

### How can we circumvent this problem?

Are there any ways to circumvent this problem? Well, we could sample and select instances of relevant size for which the above problems do not apply for some reason, or we could use an instance generator in order to create instances together with certain information about these instances, for example the value of an optimal solution. These two approaches are commonly used and therefore I will thoroughly discuss them in the following.

The problem with the first approach is that we are not really creating a benchmark which is representative for the set of relevant instances. Let me give you an example. Let us consider the following approach: We sample a certain number of instances of relevant size. We then apply an exact solver to these instances with a quite huge time limit. After that, we select only those instances for further usage, which could be solved by the exact solver within the given time limit. The problem is that we are not sampling anymore from the set of relevant instances, but instead from the set of instances which can be solved by an exact solver within a given time limit. This introduces a kind of bias and there is a serious threat (in fact it might be quite likely) that the resulting instances are not representative for the set of relevant instances.

The second approach runs into some theoretical limitations. Whenever we use an instance generator to create instances together with certain additional information about these instances, we are also able to retrieve these additional information using an algorithm in NP or some related complexity class. If these are information that are useful for the benchmarking process, then they are also very likely useful for the given computational task. That means for problems beyond NP such an instance generator cannot exist, unless the corresponding complexity classes collapse. Or in other words: Any attempt to generate instances in such a way leads to a bias in the benchmark, because it is not possible to create instances from the set of relevant instances together with certain additional information without introducing some kind of bias.

### Conclusions

Both approaches are not very satisfactory and introduce a certain kind of bias into the benchmarking process. It seems that we are not able to empirically assess the performance of algorithms for hard problems on instances which are representative for the set of relevant instances. We can only hope that the results on our biased benchmark instances generalize to the set of relevant instances, but that does not seem very scientific to me. Computational problems are becoming more and more complex nowadays and there is a huge amount of problems with a computational complexity beyond NP. We are not able to solve them exactly and we are also not able to empirically compare algorithms for such problems. I think what we ultimately can do, is to attack these difficult problems from a more theoretical point of view, for example by finding approximation algorithms for related computational tasks or by identifying important special cases which are tractable.