I've read "A/B Testing other factors besides conversion rate", but I don't know whether it's relevant to my question, even though it seems similar.
Here is my problem. Let's assume I am creating a new tool designed to help users do a task. I A/B test this tool: some users get the tool and others don't. I want to know whether the tool really helps users with the task.
My null hypothesis: users with the tool finish their task as fast as users without the tool (i.e. the tool has no effect).
1. Is my hypothesis valid? Can we statistically test this? Should I phrase it as "no faster" instead of "slower"?
Assuming it is valid (or at least going the right way): 2. What data do I need to show that this tool helps users?
During the experiment, each user does the task more than once. I don't know what effect that has on the experiment, but users with the tool will consistently be able to use it, and users without it consistently won't.
Let's say I assume that the time users need to complete the task is what I need to collect. For each user I have the number of tasks they did and how long each task took. The task is identical for all users in the experiment.
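To make the per-user data concrete, here is a rough sketch (with made-up numbers) of how I would collapse each user's repeated tasks into one average, so that repeated attempts by the same user aren't treated as independent samples:

```python
from statistics import mean

# Hypothetical raw data: each user did the task several times.
# Keyed by user id -> list of completion times in seconds.
task_times = {
    "u1": [210, 195, 188],        # a user without the tool
    "u2": [176, 170, 168, 165],   # a user with the tool
}

# Aggregate to one number per user, so the unit of analysis is the user,
# not the individual task attempt.
per_user_avg = {user: mean(times) for user, times in task_times.items()}
print(per_user_avg)
```

Is collapsing to one average per user like this the right way to handle the repeated attempts?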
And then I gathered the following data:

| Sample | Avg. time / user | Actions / user |
|---|---|---|
| Without tool | 184 | 17 |
| With tool | 176 | 12 |
2. Is this kind of data correct? I mean, am I going the right way? Should I average it? If not, how should I present the data? Should I only use each user's first attempt? But I want to show that the tool really helps people, and on the first attempt users might need extra time to learn it.
From the data, it is statistically intuitive that the tool helped users, because users finish their work faster on average and there are fewer actions to perform.
3. How do I prove it mathematically? Is my intuition even correct?
We can calculate the standard deviation for each set (with and without the tool), and then calculate their standard errors. From what I've found, we can calculate the standard error of the difference as
se(without - with) = sqrt(se(with)^2 + se(without)^2)
Then we can calculate the z-score as
z = (mean(without) - mean(with)) / se(without - with)
Based on the data distribution, we can compare the z-score against the critical value at p = 0.05. The null hypothesis states that the average time of people without the tool is the same as with it (no effect of the tool). We are only interested in whether using the tool is faster than not using it, so we use a one-tailed test.
That means that if the z-score of the data exceeds the critical value, i.e. if z > 1.645, the null hypothesis can be rejected confidently; that is, using the tool makes people finish their task faster than without it. Here I assumed the data is normally distributed because of the Central Limit Theorem, but I don't know whether that's a valid assumption.
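To check my own calculation, here is a rough sketch of the test described above, using only the standard library and made-up per-user average times (I would replace these with my real data):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical per-user average completion times in seconds.
without_tool = [190, 178, 201, 169, 185, 192, 175, 188, 180, 183]
with_tool    = [172, 181, 165, 179, 170, 176, 168, 184, 173, 169]

def one_sided_z_test(control, treatment):
    # Standard error of each group's mean: sd / sqrt(n)
    se_c = stdev(control) / sqrt(len(control))
    se_t = stdev(treatment) / sqrt(len(treatment))
    # Standard error of the difference of the two means
    se_diff = sqrt(se_c**2 + se_t**2)
    z = (mean(control) - mean(treatment)) / se_diff
    # One-tailed p-value: P(Z >= z) under the null of no difference
    p = 1 - NormalDist().cdf(z)
    return z, p

z, p = one_sided_z_test(without_tool, with_tool)
print(f"z = {z:.3f}, one-tailed p = {p:.4f}")
```

Is this the calculation I should be doing, or should I use a t-test instead, given the small sample sizes?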
4. Am I calculating this correctly? How should I actually calculate it?
5. If I'm still not confident, or don't have enough data, how much data do I need? How long should I run the test?
Thank you.