I will provide R code for a reproducible example. I am calculating the difference in means for two groups. I get a sampling distribution by permutations but instead of a normally shaped distribution centered at 0 I am getting this. Can someone explain why? or suggest some change in R code?
***** some further details:
These groups are test and control groups..(2 is a test group) , x
is amount.
An external provider of marketing software calculates incremental deposit (and other amounts and counts). It calculates statistical significance for difference in means among participants who reacted to the campaign and statistical significance for difference in proportions of participants who reacted to the campaign. And at the end it has a different formula for calculation of increment based on whether both, one or none (diffs in means,props) are significant.
Within my company they want me to reproduce the calculation to put in into a reporting tool. But within this software they say they use Bayesian Monte Carlo with no further info (except conf level 90%) But in Monte Carlo you have to model distribution of variables and I don't know how they do it because it is automated for campaigns ranging from few 10s participants to 1000s and for summing and counting variables. So my lay opinion is that it is a very non-scientific approach. I am trying to come up with a method to do inference find out statistical significance and it should be generalizable for the lack of better term to all campaigns coming.
I am far from an expert here...but thanks for the answers and comments.
library(infer)
library(tidyverse)
library(ggplot2)
x<-structure(list(group = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("C",
"R"), class = "factor"), amount = c(1000, 500, 1200, 36700, 1500,
1500, 500, 1500, 2500, 300, 500, 2500, 1400, 3050, 1098, 2000,
2156, 1885.78, 1000, 500, 200, 5000, 200, 500, 1100, 500, 1000,
100, 1500, 2370, 1470, 500, 1000, 500, 1200, 21000, 12000, 11000,
7350, 6000, 350, 500, 1700, 400, 500, 500, 2000, 1500, 300, 2600,
100, 480, 3900, 1500, 2650, 600, 900, 4100, 1980, 300, 2300,
200, 54000, 600, 9000, 5000, 100, 300, 323, 500, 2000, 200, 2709.42,
2000, 550, 500, 1800, 300, 6000, 500, 2000, 1911, 5700)), class = "data.frame", row.names = c(NA,
-83L))
permuted<-x%>%
specify(amount~group) %>%
hypothesize(null="independence") %>%
generate(reps=15000,type="permute")%>%
calculate(stat='diff in means',order=c("R", "C"))
ggplot(permuted,aes(x=stat))+
geom_histogram()