
Consider a population characterized by a small number of categorical features (age category, income bracket, gender, level of education, etc.). I am interested in how these features affect the probability that an individual will respond to an online advertisement. Assume that I can sample individuals with any set of characteristics from this population (i.e., show them the ad and see how they respond).

What is the standard procedure for using design of experiments to build a model that tells me which features are relevant? Is it simply a full/fractional factorial design where I show the ad to all possible combinations of feature values and use ANOVA to identify main effects and interaction effects?

Or, is ANOVA invalidated by the fact that my response variable is binary? In that case, should I rather sample from the population to "generate" the response variable, and then build a logistic regression model? Do I miss out on interaction effects by doing that?

Finally, what are the main "tradeoffs" one faces in this problem? Is it simply that sampling more data points is expensive but, in theory, yields more precise results?

  • How many samples can you afford to collect? A binary response with at least 6 factors, and potentially multiple levels per factor, will require a lot of samples to get meaningful results. – Dave2e Oct 22 '20 at 03:07
  • That is a good point. I'm fine with assuming first that sampling is cheap to get the ball rolling. – arni Oct 22 '20 at 07:46
  • How many levels for each of the categorical factors? Do you have any idea of the response probability, approximately? – kjetil b halvorsen Oct 22 '20 at 16:04
  • You are calling this an experiment, but you aren't randomly assigning participants to be male or female, so you won't be able to infer causality from this as you would from a true experiment. – gung - Reinstate Monica Oct 22 '20 at 16:10
  • @kjetilbhalvorsen If it helps we can assume 2-5 levels for each factor, and a relatively low response probability, say 10-20%. – arni Oct 22 '20 at 21:31

1 Answer


With a binary response it is better to plan for using logistic regression. You can still include interactions with logistic regression, but you will need quite a large sample size. First look at your candidate variables and how they can be modeled: age, income bracket and level of education have underlying continuous (or ordinal) scales, and maybe that can be used to reduce the complexity of the model. As a start, assume we code those variables as integers, fit quadratic terms, and take about an equal number of observations at each level.

As an example, assume gender (which I take as binary ...), those three ordinal variables, each with 5 levels, and one more categorical variable, nominal, with 5 levels. Then one replicate of a full factorial needs $2\cdot 5^3\cdot 5=1250$ runs. The number of model columns (counting the dummies needed for the nominal variable) is $p=\underbrace{1}_{\text{intercept}}+3\cdot \underbrace{2}_{\text{linear and square}}+\underbrace{1}_{\text{gender}}+\underbrace{4}_{5-1}=12$.

One rule of thumb for logistic regression is that the number of (candidate) predictors $p$ should be less than $m/10$ or $m/20$, where $m=n\cdot\min(\pi,1-\pi)$ and $\pi$ is the response probability (see for instance Understanding the 10:1 events per variable rule). With $\pi=0.1$ (pessimistic ...) that gives $m=n\cdot 0.1=n/10$. Putting this together, using $m/20$, we get $n\ge 200\cdot p=200\cdot 12=2400$. That is about two replicates of the full factorial, and might even be pessimistic for a designed experiment.
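For concreteness, here is a minimal R sketch of that bookkeeping; the factor names and the extra nominal factor are assumptions made only for illustration:

```r
## Assumed factor structure (illustrative names): gender with 2 levels, three
## ordinal factors coded as integers 1..5, and one nominal factor with 5 levels.
design <- expand.grid(
  gender    = factor(c("F", "M")),
  age       = 1:5,
  income    = 1:5,
  education = 1:5,
  channel   = factor(paste0("c", 1:5))   # hypothetical nominal 5-level factor
)
nrow(design)   # 1250 runs in one replicate of the full factorial

## Model matrix: intercept + gender + 3 * (linear + quadratic) + 4 dummies = 12 columns
X <- model.matrix(~ gender + poly(age, 2, raw = TRUE) + poly(income, 2, raw = TRUE) +
                    poly(education, 2, raw = TRUE) + channel, data = design)
ncol(X)        # 12

## Events-per-variable rule of thumb: p <= m/20 with m = n * min(pi, 1 - pi)
pi_resp <- 0.1                               # pessimistic response probability
n_min   <- 20 * ncol(X) / min(pi_resp, 1 - pi_resp)
n_min                                        # 2400, i.e. about two full replicates
```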

It is not clear that this is the best design, but it could be a starting point. Optimal design ideas could be used: in R (on CRAN) there is a package OptimalDesign that may be usable, or you could use simulation to test different candidate designs. There is also a package acebayes, covered in an accompanying arXiv paper.
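As a rough illustration of the simulation route (reusing the design grid from the sketch above; the coefficient values are assumed, not estimated from anything), you can generate binary responses under a guessed model and check how precisely the parameters are recovered:

```r
## Simulate responses for two replicates of the full factorial under assumed effects,
## then inspect the standard errors of the fitted logistic-regression coefficients.
set.seed(1)
runs <- design[rep(seq_len(nrow(design)), times = 2), ]   # two replicates
Xr   <- model.matrix(~ gender + poly(age, 2, raw = TRUE) + poly(income, 2, raw = TRUE) +
                       poly(education, 2, raw = TRUE) + channel, data = runs)
beta <- c(-2.5, 0.3, 0.2, -0.02, 0.1, 0, 0.15, -0.01, 0.2, 0.1, -0.1, 0.05)  # assumed effects
runs$y <- rbinom(nrow(runs), size = 1, prob = plogis(drop(Xr %*% beta)))

fit <- glm(y ~ gender + poly(age, 2, raw = TRUE) + poly(income, 2, raw = TRUE) +
             poly(education, 2, raw = TRUE) + channel,
           family = binomial, data = runs)
summary(fit)$coefficients[, "Std. Error"]    # precise enough for your purposes?
```

Repeating this for different candidate designs (and different assumed response probabilities) gives a quick, if crude, way to compare them before committing to an expensive data collection.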

kjetil b halvorsen