I'm running an A/B test on a website and capturing the number of clicks in a given area of a page.
I am calculating the average number of clicks per user session and its standard deviation. The data processing is quite heavy, so my plan is to run it nightly on each day's data and then, at the end of a given period, aggregate those daily results and use a t-test calculator to check statistical significance.
For each day's standard deviation calculation I'm using the PySpark SQL stddev_samp function, but I'm struggling to find a way to aggregate those values at the end. I came across this question, and the oldest answer doesn't seem to work in the tests I ran (not only with my A/B testing data), so I'm not sure it's valid. The newest answer does work for the data I've tested it with, but only when I use the population standard deviation.
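For concreteness, the aggregation that works in my tests combines each day's count, mean, and population standard deviation via the second moment (this is my own plain-Python sketch; combine_pop_std is just a name I made up, and the per-day tuples stand in for what the nightly PySpark job would emit):

```python
import math

def combine_pop_std(groups):
    """Combine per-day (n, mean, population_std) tuples into
    overall (n, mean, population_std) for the whole period.

    Uses Var(X) = E[X^2] - E[X]^2, where each day contributes
    n * (std^2 + mean^2) to the total sum of squares.
    """
    n_total = sum(n for n, _, _ in groups)
    mean_total = sum(n * m for n, m, _ in groups) / n_total
    # Overall second moment: sum of each day's sum(x^2), divided by N
    second_moment = sum(n * (s**2 + m**2) for n, m, s in groups) / n_total
    return n_total, mean_total, math.sqrt(second_moment - mean_total**2)
```

This reproduces exactly the population standard deviation I'd get by running the computation over all days at once, which matches what I observed with the newest answer.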
So my question is: is it fine if I use the population standard deviation? (At the end of the day I am using every datapoint for each feature to calculate it.) I've read here that
Therefore, you would normally calculate the population standard deviation if: (1) you have the entire population or (2) you have a sample of a larger population, but you are only interested in this sample and do not wish to generalize your findings to the population
so I wonder whether my case fits point 2.
Otherwise, how could I aggregate the daily sample standard deviations?
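To show the kind of exact aggregation I'm hoping exists: if each nightly job also stored the count and mean alongside stddev_samp, I believe the daily sample statistics could be combined using the identity "total sum of squares = within-day SS + between-day SS". A sketch (combine_sample_std is my own name; I'd welcome confirmation that this is the right approach):

```python
import math

def combine_sample_std(groups):
    """Combine per-day (n, mean, sample_std) tuples into
    overall (n, mean, sample_std) for the whole period.

    Total sum of squared deviations about the overall mean =
    within-day part (n-1)*s^2 plus between-day part n*(m - M)^2.
    """
    n_total = sum(n for n, _, _ in groups)
    mean_total = sum(n * m for n, m, _ in groups) / n_total
    ss = sum((n - 1) * s**2 + n * (m - mean_total)**2
             for n, m, s in groups)
    # Bessel-corrected (sample) variance over the pooled data
    return n_total, mean_total, math.sqrt(ss / (n_total - 1))
```

In my quick checks this matches the sample standard deviation computed over all days at once, but I'm unsure whether it's the statistically appropriate thing to feed into the t-test.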
Thanks