The material I got only described how to test if there is difference (null hypothesis: H0 = H1).
However, what I want to test is if the test version is doing better than control: null hypothesis: H0 >= H1 .
How to do it?
The material I got only described how to test if there is difference (null hypothesis: H0 = H1).
However, what I want to test is if the test version is doing better than control: null hypothesis: H0 >= H1 .
How to do it?
One answer is simply to look at the data to see.
Absent real data from the OP, I just steal the fake data given by @Glen_b in a comment (and thereby reinforce his point that the median is not the message; nod to Marshall McLuhan).
A common recipe is two box plots side by side. Here I give a hybrid box and quantile plot in the manner of Emanuel Parzen. That is, all the data points are plotted versus a tacit cumulative probability scale. The letter as well as the spirit of a box plot are honoured: half the data points are inside the box, as every account explains, and half are outside, which some accounts fail to emphasise, perhaps because it seems too obvious.
The medians are the same, by construction, but the distributions are not at all the same.
Another recipe is a quantile-quantile plot. We start with the idea of plotting minimum in sample 1 versus minimum in sample 2, and so on, up to maximum in sample 1 versus maximum in sample 2. Unequal sample sizes do not undermine the idea, as we can just interpolate within the larger sample to get quantiles corresponding to those of the smaller sample. This plot shows that for corresponding quantiles sample 1 is almost always scoring lower than sample 2; the equal medians are the only exception.
The Mann Whitney U Test will tell you if the medians of the datasets are significantly different from each other. In order to tell which dataset is doing better you need to consider what the data is and what you are trying to find out about it.
For example, if you were measuring some error metric where lower is better and your test returns $p<0.05$ then you can compare the medians and conclude that the dataset with the lower median was performing significantly better.
Conversely if you were measuring accuracy (0-100%) where the greater number is more desirable you may conclude that the dataset with the higher median was performing better.
Edit: Thanks to @Glen_b for pointing out a flaw in my answer, while it appears the test will identify a difference in the median...
In practice, the Mann-Whitney U test is more broadly used to interpret whether there are differences in the "distributions" of two groups or differences in the "medians" of two groups.
this paper rightly points out that differences in spread are also important.