3

The fundamental problem of causal inference says that only one potential outcome is observed for each unit.

What happens if both outcomes from control and treatment can be observed? Can we still make use of analysis tools like causal trees to understand heterogeneous treatment effects?

As a concrete example, suppose we are an online search engine and want to better understand how to serve ads. Each time a user enters a search query (request), we pick an ad from a collection of $N$ ads and show it to the user. For each of these $N$ ads, there are 2 versions, one with an image and one without. We randomize users into a control group and a treatment group (same distribution of users in each group), where users in the control group will see ads without an image and users in the treatment group will see ads with an image. By the end of the experiment, for each ad we record the total number of clicks by users in the control group as well as in the treatment group.

In this particular case, each ad is an experimental unit, and we are able to observe outcomes from both the treatment group and the control group. We want to understand if we should add an image to the ads or not.

In addition, each ad has its own features (e.g., associated company, promoted product, etc.), and we want to understand which group of ads will benefit the most from the addition of images. My question is: in such a case, can we follow the usual routines in treatment effect analysis? If not, what's a more suitable framework?

Thomas Bilach
Ryan
  • 2
    "if both outcomes from control and treatment can be observed?" Not possible. See [The fundamental problem of causal inference](https://en.wikipedia.org/wiki/Rubin_causal_model#The_fundamental_problem_of_causal_inference). However, with [matching](https://en.wikipedia.org/wiki/Propensity_score_matching) one could build similar subjects. – msuzen Dec 13 '21 at 21:29
  • 1
    You don't observe both states simultaneously. This is only true if the ads have no permanent effect on demand and updating beliefs about the platform and there is no selection in who returns for a second exposure. This assumes away things like satiation. – dimitriy Dec 13 '21 at 22:15
  • @dimitriy I am assuming the users are split into 2 groups A and B (with the same underlying distribution), and group A only sees ads with an image while group B only sees ads without. But what I am trying to understand is how to identify the cohort of ads that benefits most from the addition of images. What techniques can I use here? I am pretty new to the area, as you can tell. – Ryan Dec 13 '21 at 22:32
  • I think you need to edit your question with that detail. – dimitriy Dec 13 '21 at 22:41
  • I did mention "We randomize users into control group and treatment group, where users in control group will see ads without image and users in treatment group will see ads with image." I just added more detail, if that's helpful. – Ryan Dec 13 '21 at 23:02
  • You are confusing the intervention (ads) with the (experimental) unit of analysis (which is individuals who may or may not see an advert). In the study you describe you only observe a single outcome in each unit of analysis, with the counterfactual outcome unobserved in each unit of analysis. – Alexis Dec 14 '21 at 00:35
  • @Alexis I understand that now. How can I go about understanding heterogeneous treatment effects at the ad level? Most studies show how to do that at the experimental-unit level. – Ryan Dec 14 '21 at 00:52
  • When you say "heterogeneous treatment effect" what are these effects *on*? I think answering that question will help you understand unit of analysis a bit better. – Alexis Dec 14 '21 at 05:36
  • The subject of my study is the ads. I want to answer questions like "what kind of ads benefit the most from adding an image?" – Ryan Dec 14 '21 at 06:24

3 Answers

5

I agree that there is some confusion about the "unit" of analysis here. It's neither the ad nor the viewer, though; it's the instance of showing an ad to a viewer. And there is only one potential outcome observed because that instance can only either have an image or not. Because you randomly assigned, you don't have to worry about confounding, which is nice, but that's not the same thing as having both potential outcomes for each unit.

It happens to be that instances are nested within specific ads, but the specific ad is a characteristic of the instance.

You can estimate a number of quantities from this design. You can estimate the average treatment effect of pictures by simply comparing the outcomes between the instances with pictures and the instances without. You should additionally control for the specific ad and any user-level characteristics to increase the precision of your estimate and improve estimation of the standard error. To do this, you could fit a fixed effects or random effects model with the treatment as the primary predictor and the specific ad as the fixed or random effect grouping variable, e.g., Y ~ treat + (1|ad) if using lme4 for random effects or Y ~ treat | ad if using fixest for fixed effects (the results should be similar).
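To make the fixed effects idea concrete, here is a minimal plain-Python sketch of the within-ad estimator: demean both the outcome and the treatment indicator within each ad, then regress one on the other. It is only an illustration of the point estimate that an ad fixed effects model targets, not a replacement for lme4 or fixest, and it computes no standard errors. The function name `within_ad_effect` and the toy `impressions` data are made up for the example.

```python
from collections import defaultdict

def within_ad_effect(rows):
    """rows: iterable of (ad_id, treat, y) with treat in {0, 1}.

    Returns the coefficient on treat from a regression of y on treat
    with a separate intercept for each ad (the within-ad estimator).
    """
    by_ad = defaultdict(list)
    for ad, treat, y in rows:
        by_ad[ad].append((treat, y))

    num = 0.0
    den = 0.0
    for obs in by_ad.values():
        # Demean treatment and outcome within the ad.
        t_bar = sum(t for t, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        for t, y in obs:
            num += (t - t_bar) * (y - y_bar)
            den += (t - t_bar) ** 2
    if den == 0:
        raise ValueError("no ad has both treated and control impressions")
    return num / den

# Hypothetical impression-level data: (ad, image shown?, clicked?)
impressions = [
    ("a", 0, 0), ("a", 0, 0), ("a", 0, 1), ("a", 0, 0),
    ("a", 1, 1), ("a", 1, 0), ("a", 1, 1), ("a", 1, 0),
    ("b", 0, 0), ("b", 0, 0), ("b", 1, 1), ("b", 1, 0),
]
effect = within_ad_effect(impressions)
```

With a binary treatment, this estimate is a weighted average of the per-ad differences in click rates, with more weight on ads that have more (and more balanced) impressions.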

You can also estimate the ad-specific treatment effect, which is the effect of showing a picture for a specific ad. This is no different from a subgroup average treatment effect; it is essentially interpreted as if you only had one ad but showed it several times with and without the picture. You can estimate these effects in a single model using the following syntax in R: Y ~ ad/treat - 1 in lm(). This gives you a treatment effect for each ad. This would only make sense if you had many instances of each ad with both a picture and no picture.
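Because the treatment is binary, the per-ad treatment coefficient in that model is just the difference between the treated and control mean outcomes within each ad. A rough plain-Python sketch of that calculation, with a hypothetical function name and toy data, could look like this:

```python
from collections import defaultdict

def ad_specific_effects(rows):
    """rows: iterable of (ad_id, treat, click) with treat in {0, 1}.

    Returns {ad_id: click rate with image - click rate without image}.
    Ads that are missing one of the two arms are skipped.
    """
    # Per ad: [control clicks, control n, treated clicks, treated n]
    sums = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for ad, treat, click in rows:
        s = sums[ad]
        if treat:
            s[2] += click
            s[3] += 1
        else:
            s[0] += click
            s[1] += 1

    effects = {}
    for ad, (c_clicks, c_n, t_clicks, t_n) in sums.items():
        if c_n and t_n:
            effects[ad] = t_clicks / t_n - c_clicks / c_n
    return effects

# Hypothetical impression-level data: (ad, image shown?, clicked?)
impressions = [
    ("a", 0, 0), ("a", 0, 0), ("a", 0, 1), ("a", 0, 0),
    ("a", 1, 1), ("a", 1, 0), ("a", 1, 1), ("a", 1, 0),
    ("b", 0, 0), ("b", 0, 0), ("b", 1, 1), ("b", 1, 0),
]
effects = ad_specific_effects(impressions)
```

These per-ad differences are noisy when an ad has few impressions in either arm, which is why this approach needs many instances per ad.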

If you are interested not in specific ads but perhaps the effect of showing the picture for other user- or ad-level characteristics, you can estimate heterogeneous treatment effects using causal trees in the standard way; you just would not include the specific ad as a predictor if you were looking at ad-level characteristics. If you had specific hypotheses, you could also test them in the models above by including the predictor of interest in an interaction with treatment.

Noah
  • 20,638
  • 2
  • 20
  • 58
  • I am indeed interested in ad-level characteristics, basically trying to identify the cohort of ads that benefits most from the addition of an image. How can I build a causal tree? For each ad I have the following data (total clicks with image, total clicks without image, ad characteristics). When using a causal tree, I think the data is expressed as (realized outcome, indicator of treatment or control, characteristics)? Could you give me some pointers on how to proceed? – Ryan Dec 14 '21 at 00:46
  • Thanks for the complement, Noah, that's what I meant. If it's still not clear, Ryan, you can check the definition of [SUTVA](https://en.wikipedia.org/wiki/Rubin_causal_model#Stable_unit_treatment_value_assumption_(SUTVA)), which uses the word unit in a causal sense and talks about it. – mribeirodantas Dec 14 '21 at 00:47
  • I understand the unit of the experiment is the "user request", i.e. a (user, ad) pair as an instance, or an impression in the terminology of online advertising. I know we can estimate heterogeneous treatment effects at the experimental-unit level. Now I am seeing two options to analyze heterogeneous treatment effects at the ad level: (1) treat each impression as a data point (outcome, indicator, characteristics). Here the outcome is click or non-click, and the characteristics combine attributes of the user and the ad. With this setup I can directly feed the data to build a causal tree. – Ryan Dec 14 '21 at 01:01
  • (2) As mentioned above, for each ad we observe (total clicks with image, total clicks without image, ad characteristics). We can treat it as TWO data points, (total clicks, image indicator = 1, ad characteristics) and (total clicks, image indicator = 0, ad characteristics), and proceed as usual. Do any of these ideas make sense? – Ryan Dec 14 '21 at 01:04
  • Yes that makes sense. I think it's a better idea to treat impressions as data points but if you only care about ad-level characteristics and your outcome metrics are a bit complicated (i.e., more complicated than total clicks), it can be defensible to look at each ad as the data point, where you indeed have two data points for each ad (one with the picture and one without). You can use causal trees on either type of data. – Noah Dec 14 '21 at 01:12
4

You misunderstood the definition of unit there. One unit (an individual) cannot be in the control group and the treatment group at the same time. You can only observe the effect of ONE intervention on an individual at a given time. The two types of ads in your example are the treatments, say A and B, and the visitors to your website are the individuals, the "patients".

In experiments we see all the time what happens to participants of the study in all groups, but we cannot see what happened to Bob both when he took the pill and when he didn't take it, ceteris paribus (keeping all the rest constant).

This is not an assumption, so you can not violate it. It's a problem, and one way of solving it, maybe, is traveling in time :-) A bit tricky

mribeirodantas
  • 796
  • 3
  • 17
  • In my A/B test framework, I am applying the intervention (i.e. adding an image) to the ads and trying to see its effect. In this sense the ads are the units of the experiment. In this case, how can I answer a question like "which group of ads will benefit the most from the addition of images?" – Ryan Dec 13 '21 at 21:21
  • 1
    How are you measuring the benefit? More clicks on the ads with images when compared to the ads without it? More purchases after clicking on the ad with/out images? – mribeirodantas Dec 13 '21 at 21:27
  • I think whichever metric I am looking at won't make a lot of difference? Let's say I am looking at the number of clicks. – Ryan Dec 13 '21 at 21:29
  • Let's say that the ads are the units, as you're saying. Can you show, at the same time, to the same individual, the same ad WITH and WITHOUT the image separately? No. The best you can do is to show both at the same time, or one after the other, which is not the same thing. If you give two drugs to a lot of people, and they get better, this does not help you find out which drug, A or B, contributed to what you saw. The ideal way would be to give the ad without image, measure what happened. Then you travel back in time, and show the ad with the image, making sure **nothing else changed**. – mribeirodantas Dec 13 '21 at 21:34
  • Continuing the comment, sorry for it being so long. It's a fundamental problem of causal inference because you can not observe how Bob behaved, for each of the ads (with/out image). You can only observe with one. – mribeirodantas Dec 13 '21 at 21:36
  • 1
    Yes, I am not giving the same ad (with and without image) to the same user. A user can only see one version. I am assuming the two groups of users follow the same distribution. In addition, I don't care about what happens at the per-user level. We care about the total clicks accumulated during the whole period of the experiment. I think one potential workaround is to treat an ad as two, one that only sees the control flow and one that only sees the treatment flow. – Ryan Dec 13 '21 at 22:14
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/132305/discussion-between-ryan-and-mribeirodantas). – Ryan Dec 13 '21 at 22:18
  • Sure. The comment I upvoted seems to be the right approach. – mribeirodantas Dec 13 '21 at 22:25
4

I would set up your data as an ad-level cross-section, where:

  • Each row is a distinct ad
  • Ad characteristics are the other columns.
  • The outcome column is the treatment-control difference in the two clicks-per-impression rates.
  • I would also include the number of impressions that the difference is calculated from as a column, in case some ads appear more frequently than others.

You can fit a model of the effect as a function of the ad-level variables, possibly using the number of impressions to weight the data. If you have few ad characteristics relative to ads, linear regression or a non-parametric model will work. If you have many covariates relative to the number of ads, the lasso modified for inference is a good place to start:

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014. "High-Dimensional Methods and Inference on Structural and Treatment Effects." Journal of Economic Perspectives, 28 (2): 29-50. https://www.aeaweb.org/articles?id=10.1257/jep.28.2.29

This modification is similar in spirit to the causal trees approach you mentioned.
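For the simple case with few covariates, the ad-level setup described above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names, the single binary characteristic `is_retail`, the numbers, and the choice of total impressions as the weight are all hypothetical. The code builds the difference in click rates per ad and fits a weighted least squares line of that difference on the characteristic.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares for y = a + b * x; returns (a, b)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    if sxx == 0:
        raise ValueError("the covariate does not vary across ads")
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical ad-level table: one row per ad.
ads = [
    {"is_retail": 1, "imp_ctl": 1000, "clk_ctl": 20, "imp_trt": 1000, "clk_trt": 40},
    {"is_retail": 1, "imp_ctl": 500,  "clk_ctl": 10, "imp_trt": 500,  "clk_trt": 25},
    {"is_retail": 0, "imp_ctl": 1000, "clk_ctl": 30, "imp_trt": 1000, "clk_trt": 30},
    {"is_retail": 0, "imp_ctl": 2000, "clk_ctl": 40, "imp_trt": 2000, "clk_trt": 60},
]

# Outcome: treatment-control difference in clicks per impression.
diff = [a["clk_trt"] / a["imp_trt"] - a["clk_ctl"] / a["imp_ctl"] for a in ads]
# Weight: total impressions behind each difference (an assumed choice).
weight = [a["imp_ctl"] + a["imp_trt"] for a in ads]
feature = [a["is_retail"] for a in ads]

intercept, slope = weighted_line_fit(feature, diff, weight)
```

Here `intercept` is the weighted average effect of the image for ads with `is_retail == 0`, and `slope` is how much larger that effect is for ads with `is_retail == 1`. With many characteristics you would swap this for a regression library or the lasso-based approach cited above.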

dimitriy
  • 31,081
  • 5
  • 63
  • 138