Choosing Location Based Control Group

Question

I have a dataset that tracks activities at particular locations. Each record is essentially a timestamp, a latitude and a longitude, indicating an activity at that lat-long and that time. I plan on collapsing this data down to daily data for a treatment location and a control location, and dropping the remaining data. My question is how to determine the control group for my analysis.

I have a situation where I want to test the impact of a policy that is location specific. The policy encourages activity in a particular location. Assume for now that this location is well defined as 10 meters around a particular lat-long. Call this area A. Assume that all activities outside A are unaffected by the policy affecting area A and that the policy affects some of the activity within A. I then want to choose a location (geofence, area, etc; however you want to think of it) that will act as my control group. How, then, can I choose an appropriate control group area B?

I plan on using some panel data techniques to analyze the impact.

My initial thought is to do some optimization procedure that will attempt to choose the area beyond A such that the means and variances of the daily activities in the two locations are as close as possible. I don't have any real statistical reasoning behind this other than that it intuitively makes sense.

Additionally, I am open to any statistical procedures that will allow me to infer the impact of the policy.

If I am unclear in any way, please let me know and I can clarify myself.

Does the policy 'kick-in' at a certain known time? (BTW, it sometimes takes a while before a question gets answered ;-). — gung - Reinstate Monica, Dec 09 '14 at 00:20
@gung It does 'kick-in'. You can consider the data to go back about two years before the treatment period and about 2 months into it. EDIT: Also, I guess I'm more used to stackoverflow where they are answered relatively quickly :). Sorry for my impatience! — stanekam, Dec 09 '14 at 00:30
Sounds like synthetic controls might work: http://web.stanford.edu/~jhain/synthpage.html — Jeremy Miles, Dec 09 '14 at 01:23
This sounds like a difference-in-difference design. Try searching the site & reading some of our existing threads. I think any location with the same pre-policy trend could be used. — gung - Reinstate Monica, Dec 09 '14 at 03:31
@gung Yes that's how I envision it as well. However, the choice of the control group is the question. How do I select the control group appropriately to match my treatment? — stanekam, Dec 09 '14 at 04:09
@JeremyMiles Very interesting. I'll take a closer look later. If you can flesh it out to a fuller answer I'd love to upvote you and select it as the answer :) — stanekam, Dec 09 '14 at 04:09
@iShouldUseAName I'll try to do that. In the meantime, this might also help: http://rd.springer.com/article/10.1007/s10940-014-9226-5 , it's a paper where we had a similar problem, and addressed it with this approach. — Jeremy Miles, Dec 09 '14 at 04:12

score 5 · Answer 1 · edited Apr 13 '17 at 12:44

Depending on how many control units and different treatment periods you have, this would be a good application for either difference-in-differences (DiD) with matching on the pre-treatment outcome, or the synthetic control method.

DiD with matching on the pre-treatment outcome
The most important thing in DiD is that the pre-treatment trends are the same in the treatment and control group. For this you can do matching in an initial step where you match treatment and control locations based on the pre-treatment outcome via nearest neighbor, propensity score, or caliper matching methods. This ensures that you select the most similar pre-treatment locations for the analysis. Here is an explanation for how to do this in Stata. The logic would be the same for R with the corresponding matching package. In any case, if you want to convince your audience with a DiD design you need to plot these pre-treatment trends. A nice graph would show the average in the outcome for the treatment and control groups over time with the two lines having the same trend before the treatment and a divergence after the treatment.
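As a rough illustration of the matching step, the following Python sketch ranks candidate control locations by how closely their pre-treatment daily series track the treated location and keeps the k nearest. The location ids, the toy counts and the choice of RMSE as the distance are assumptions made for this example; a real analysis would more likely rely on a dedicated matching package.

```python
from math import sqrt


def rmse(a, b):
    """Root mean squared distance between two equally long daily series."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def nearest_controls(treated_pre, controls_pre, k=1):
    """Return the ids of the k control locations whose pre-treatment
    outcome series lie closest to the treated series."""
    ranked = sorted(controls_pre, key=lambda cid: rmse(treated_pre, controls_pre[cid]))
    return ranked[:k]


# Hypothetical daily activity counts before the policy starts.
treated_pre = [10, 12, 11, 13, 12]
controls_pre = {
    "cell_1": [30, 28, 35, 31, 29],
    "cell_2": [9, 12, 10, 13, 11],
    "cell_3": [11, 15, 9, 16, 10],
}

matched = nearest_controls(treated_pre, controls_pre, k=2)
print(matched)
```

Matching on levels, as done here, is stricter than DiD requires, since only the trends need to agree. Subtracting each series' own pre-treatment mean before calling `rmse` would match on trends instead.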

Once you have done the matching, you can use the matched sample for the following regression:

$$y_{lt} = \beta_1 \text{location}_l + \beta_2 \text{time}_t + \beta_3 \text{treated}_{lt} + \beta_4 X'_{lt} + \epsilon_{lt}$$

where $y_{lt}$ is the outcome for location $l$ at time $t$, $\text{location}_l$ is a dummy for whether location $l$ is in the treatment group, $\text{time}_t$ is a dummy for each time period, and $\text{treated}_{lt}$ is a dummy which equals one if a location is in the treatment group AND the time period is in the treatment period. $X'_{lt}$ are location specific controls (if you have any), and $\epsilon_{lt}$ is a random error. The treatment effect is captured by the coefficient $\beta_3$.

This is the most general representation of a DiD model, which allows for multiple time periods and treatments at different points in time. This is useful if location $A$ is treated at $t=5$, location $B$ is treated at $t=7$, etc., so you don't need to drop other treated observations.
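In the simplest case (one treated group, one control group, a single treatment date and no additional controls), $\beta_3$ reduces to the difference between the two groups' changes in mean outcome. The sketch below computes that quantity directly; the daily counts are made up for illustration and the function name `did_estimate` is hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: change in the treated group's mean
    minus change in the control group's mean."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical daily activity counts.
treated_pre = [10, 12, 11, 13]
treated_post = [16, 18, 17, 19]
control_pre = [9, 11, 10, 12]
control_post = [11, 13, 12, 14]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(effect)
```

This gives only the point estimate. Standard errors, additional controls and staggered treatment dates call for the regression form above in a proper statistics package.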

If you have several treated locations then it is a good idea to keep those and not just keep one. Discarding other treatment locations will result in a loss of statistical power for your hypothesis tests and you may end up finding no significant effect even though you should have. Regarding testing you should also definitely cluster the standard errors on the location identifier. Bertrand et al. (2004) have shown that without an adjustment for autocorrelation the standard errors in DiD designs will be too small and may lead to false inference. Clustering on the location identifier is the easiest way to get around it and it deals with both heteroscedasticity and autocorrelation.

Synthetic control
If you only have a single treated location then the synthetic control method can be useful. It compares your treatment location with all the control locations in the sample and builds a synthetic location from those controls that are most similar to the treated one. It does so by assigning a weight to each control location such that the resulting weighted combination most closely fits the pre-treatment outcome of the treatment location.
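To make the weighting idea concrete, here is a minimal Python sketch that searches for non-negative weights summing to one so that the weighted control series tracks the treated series before treatment. It assumes matching on the pre-treatment outcome only (no covariates) and uses plain projected gradient descent as a stand-in optimizer. It is not how the canned packages are implemented, and the data and function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def synthetic_weights(treated_pre, controls_pre, steps=20000):
    """Minimize the mean squared pre-treatment gap between the treated
    series and the weighted average of the control series."""
    n, T = len(controls_pre), len(treated_pre)
    # Conservative step size from an upper bound on the curvature.
    lipschitz = 2.0 * sum(x * x for row in controls_pre for x in row) / T
    lr = 1.0 / lipschitz
    w = [1.0 / n] * n
    for _ in range(steps):
        synth = [sum(w[j] * controls_pre[j][t] for j in range(n)) for t in range(T)]
        resid = [synth[t] - treated_pre[t] for t in range(T)]
        grad = [2.0 * sum(resid[t] * controls_pre[j][t] for t in range(T)) / T
                for j in range(n)]
        w = project_to_simplex([w[j] - lr * grad[j] for j in range(n)])
    return w


# Hypothetical pre-treatment series: the treated location is, by
# construction, the average of the first two control locations.
controls_pre = [
    [10, 12, 14, 13],
    [20, 18, 16, 21],
    [40, 45, 50, 42],
]
treated_pre = [15, 15, 15, 17]

weights = synthetic_weights(treated_pre, controls_pre)
print([round(w, 3) for w in weights])
```

The estimated effect would then be the post-treatment gap between the treated series and the same weighted combination of the control series.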

The original application of this was the question of how terrorist attacks affected economic growth in the Basque region. Given that no single region in Spain is sufficiently similar to the Basque region, Abadie and his co-authors took a weighted average of the control regions. This is how the synthetic control method came about. So if your application is similar to theirs then this is not a bad choice. At least in economics this method has not yet found many followers. The main problem is that you cannot really do statistical inference in the traditional sense, i.e. there won't be any standard errors or confidence intervals, and most of the "inference" is done via graphical analysis and the use of placebo tests. This lecture introduces DiD and the synthetic control method if you want to know more about the details.

As with DiD, graphical evidence is needed to show the validity of your approach. Here is a discussion on the Statalist on how to implement these placebo tests. The basic idea behind them is that you randomly re-assign the treatment to different time periods that lie before the treatment or you re-assign the treatment identifier to control units. In either case you should not find an effect, otherwise it casts doubt on your initial finding.
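The second kind of placebo test (re-assigning the treatment identifier to control units) might look roughly like the Python sketch below. As a simplification, each unit is compared against the plain average of the remaining control units with a simple DiD rather than against a refitted synthetic control. The series, unit names and function names are all hypothetical.

```python
from statistics import mean


def did_against_rest(unit, others, n_pre):
    """DiD of one unit against the day-by-day average of the other units."""
    rest = [mean(values) for values in zip(*others)]
    unit_change = mean(unit[n_pre:]) - mean(unit[:n_pre])
    rest_change = mean(rest[n_pre:]) - mean(rest[:n_pre])
    return unit_change - rest_change


def placebo_in_space(treated, controls, n_pre):
    """Pretend each control unit was treated and compare the resulting
    placebo effects with the effect estimated for the real treated unit."""
    actual = did_against_rest(treated, list(controls.values()), n_pre)
    placebos = {}
    for cid, series in controls.items():
        others = [s for other, s in controls.items() if other != cid]
        placebos[cid] = did_against_rest(series, others, n_pre)
    # Share of units (including the treated one) with an effect at least
    # as large in absolute value as the actual estimate.
    as_extreme = sum(abs(p) >= abs(actual) for p in placebos.values())
    share = (as_extreme + 1) / (len(placebos) + 1)
    return actual, placebos, share


# Hypothetical series: 3 pre-treatment days followed by 3 post-treatment days.
treated = [10, 11, 12, 18, 19, 20]
controls = {
    "a": [9, 10, 11, 12, 13, 14],
    "b": [20, 21, 22, 23, 24, 25],
    "c": [5, 6, 7, 9, 10, 11],
}

actual, placebos, share = placebo_in_space(treated, controls, n_pre=3)
print(actual, placebos, share)
```

With only a handful of control units the share cannot get very small, which is one reason a larger pool of control locations helps.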

Advantages and Disadvantages
Both methods are easily implemented in any statistical software. The initial matching for the DiD approach can be a bit tricky, especially if you don't have a balanced panel, but the answer that I linked above provides a good solution for this problem. Otherwise DiD is the easier method to implement, as a regression with some generated dummy variables is straightforward. You get interpretable results with standard errors and confidence intervals right away, something you won't have in the synthetic control approach. This is probably the main reason why economists have not used synthetic control much. You rarely have just one treated case, and most often what you can do with synthetic control you can also do with DiD.

Synthetic control has the advantage that it is relatively new. New ways of doing things are always more interesting than old ones, plus it works very well if you only have one treated case. The construction of the placebo graphs can be a bit annoying, but some software packages do it for you. In general, this approach is slightly more difficult to implement, but canned packages are available for R, Matlab and Stata (link). The R package has nice documentation. People are also currently working on more solid inference methods for synthetic control (see this working paper, for example), though most of them you will have to program yourself.

This is a great exposition of DiD and Synthetic control methods and I appreciate the work that you did to put this together. The question, however, is about how to set a geofence(s) that can appropriately act as a control group against which to compare the treatment area. — stanekam, Dec 15 '14 at 17:18
That's where the matching would come in. Span an arbitrary grid over the map, say 20x20 meter cells (or whatever sufficiently distinguishes treatment from control regions). Match the treatment cells with the $k$ closest control cells. Then take the closest control cell $c_1$ and match the other $k-1$ control cells to that one, after which you only keep control cells that match with the best fitting control cell. These cells then make up your control region. Then you can proceed with DiD or synthetic control. — Andy, Dec 15 '14 at 22:46
Perhaps this way seems a bit artificial but it's a good method to subdivide continuous locations without having natural boundaries for each location (e.g. state/county/district borders). As a robustness check for your results you should definitely change the grid size or the number of control cells that make up the control region(s), and show that this choice does not affect your results. Then you can convince your audience of the merit of choosing the control location(s) in this way. — Andy, Dec 16 '14 at 10:04
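The gridding step described in the comments above could be sketched in Python as follows. The sketch assumes an area small enough for a flat-earth approximation (about 111,320 meters per degree of latitude, with longitude scaled by the cosine of the latitude). The reference point, the event tuples and the function names are hypothetical.

```python
from collections import Counter
from math import cos, floor, radians

METERS_PER_DEGREE_LAT = 111_320.0  # approximate; adequate for small areas


def cell_of(lat, lon, ref_lat, ref_lon, size_m=20.0):
    """Map a lat-long to the (east, north) index of a square grid cell,
    measured from a reference corner."""
    north_m = (lat - ref_lat) * METERS_PER_DEGREE_LAT
    east_m = (lon - ref_lon) * METERS_PER_DEGREE_LAT * cos(radians(ref_lat))
    return (floor(east_m / size_m), floor(north_m / size_m))


def daily_counts(events, ref_lat, ref_lon, size_m=20.0):
    """Collapse (day, lat, lon) events into activity counts per cell and day."""
    counts = Counter()
    for day, lat, lon in events:
        counts[(cell_of(lat, lon, ref_lat, ref_lon, size_m), day)] += 1
    return counts


# Hypothetical events: two close together, one roughly 33 m further north.
events = [
    ("2014-12-01", 40.00005, -74.99995),
    ("2014-12-01", 40.00006, -74.99994),
    ("2014-12-01", 40.00030, -75.00000),
]

counts = daily_counts(events, ref_lat=40.0, ref_lon=-75.0, size_m=20.0)
print(counts)
```

Re-running with a different `size_m` is a cheap way to do the grid-size robustness check suggested above.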
