Here it is suggested to use the Hodges-Lehmann estimator. However, I have just read (source) that this estimator is not reliable for skewed distributions. This is the case for the data I have: I know what the population looks like, and it is skewed to the right. Apart from the issue with asymmetric distributions, there are several papers criticising the Hodges-Lehmann estimator, which is unsurprising because statistical methods have developed since the 1960s, when it was developed.

The other solution is Z divided by the square root of N (r = Z / sqrt(N)), as described here and here. But what are the assumptions of this formula? Random sampling? Equal variances? Anything else?
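For reference, a minimal sketch of how I understand this calculation, in Python. It assumes Z comes from the normal approximation to the Mann-Whitney U statistic with no tie or continuity correction, and that N is the total number of observations in both groups. The function name `mann_whitney_r` and the example values are hypothetical, not taken from the linked sources.

```python
import math

def mann_whitney_r(x, y):
    """Effect size r = Z / sqrt(N) from a Mann-Whitney U statistic.

    Assumption: Z uses the plain normal approximation (no tie or
    continuity correction); N is the combined sample size.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs where x beats y; ties count as half a win.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return z / math.sqrt(n1 + n2)

# Hypothetical scores for two groups of unequal size.
group_a = [0.2, 0.5, 0.9, 1.4]
group_b = [0.1, 0.3, 0.4]
print(mann_whitney_r(group_a, group_b))
```

With this sign convention, r is positive when values in the first group tend to be larger. The arithmetic works for unequal sample sizes, but the sketch does nothing to account for dependence between scores.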

In this great article, the authors suggest the probability of superiority. "Probability of Superiority is the probability that a randomly sampled score from one population is larger than a randomly sampled score from a second population." So here comes my issue: my data isn't randomly collected, and the individual scores in my data set are not independent. My interpretation is that I cannot use this effect size calculation and that there is no way round it except collecting new data, which isn't feasible. Is that correct?
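To make the definition concrete, here is a minimal sketch of the sample estimate in Python: the proportion of all cross-group pairs in which the score from the first group is larger, with ties counted as half. The function name `probability_of_superiority` and the example values are hypothetical. The code only computes the descriptive proportion and does not address the sampling or independence concerns.

```python
def probability_of_superiority(x, y):
    """Estimate P(X > Y) from two samples, counting ties as 0.5.

    This equals U / (n1 * n2), where U is the Mann-Whitney statistic
    for the first sample.
    """
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))

# Hypothetical scores for two groups of unequal size.
group_a = [0.2, 0.5, 0.9, 1.4]
group_b = [0.1, 0.3, 0.4]
print(probability_of_superiority(group_a, group_b))
```

A value of 0.5 means neither group tends to score higher, and values near 0 or 1 mean one group almost always does.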

What solutions do you recommend for rank-based data?

Update: I retrieve historical tweets for categories A and B. As a tweet contains more than one word, the single words in each tweet depend on each other and are not statistically independent. The words in the tweets are replaced with a score from a database. Accordingly, the scores also depend on each other. The sample sizes are also unequal. The differences between the two categories are normally distributed.

Simone
  • "unsurprisingly because statistical methods have developped since the 1960" ... by that criterion, the mean and the median should be even more criticized! At the very least they're *hundreds* of years old. Can you offer a quote for exactly what criticism was given? In what specific sense is the Hodges-Lehmann estimator unreliable? – Glen_b Sep 12 '16 at 05:42
  • Can you describe the way in which you obtained the samples and the manner of the dependence? Your question is really very vague about just the kind of things that might allow people to do something more than shrug. – Glen_b Sep 12 '16 at 05:45
  • @Glen_b I have edited my question with the data I am using. Please let me know if you need more information. Re Hodges-Lehmann estimator: the biggest problem is an asymmetric distribution. Since I have such a distribution I didn't read the other 4+ papers talking about other issues but just wanted to flag it for others. – Simone Sep 13 '16 at 12:52
  • You already said there was a problem with an asymmetric distribution but what problem is it exactly? What are they saying is wrong, precisely?? – Glen_b Sep 13 '16 at 17:24
  • Glen_b my problem is that the estimator is unsuitable for asymmetric distributions. So I cannot use it. I am trying to find a solution to the problem at hand: what effect size can I use for my data? and save the theoretical discussion about the pros and cons of the Hodges-Lehmann estimator for later. Let me know if you want to have the corresponding pages of the book where Wilcox explains why above estimator shouldn't be used with asymmetrically distributed data including the references to the other papers discussing the issue in-depth. – Simone Sep 18 '16 at 09:41
  • You keep saying the same thing but without explaining the actual sense in which it is unsuitable. If you could explain the sense in which you think it's unsuitable for your problem it might be easier to suggest something. If one were trying to estimate the thing it actually estimates, it would seem to be ideal whether it's a symmetric distribution or not. Any information you think is relevant would be reasonable to quote in your question (unless it's really long). It sounds like you're trying to measure a difference in locations. Is that right? – Glen_b Sep 18 '16 at 10:15
  • It may be that what they say is important or it may be that it has little bearing on your problem. It seems to me that the problems you raise with your scores (some of which I am not sure are actually problems, but some may be) would be a problem for any estimator. You should explain more about the sampling which you have done. It might be best to focus more on the underlying problem you're trying to solve than your thoughts about a solution. – Glen_b Sep 18 '16 at 10:18
  • I think this part: "*So here comes my issue: My data isn't randomly collected and the individual scores in my data set are not independent.*" is central to a good answer to your question, and you most urgently need to expand on that, to make it clearer how things are not random and not independent. – Glen_b Sep 21 '16 at 10:06

0 Answers