10

I'm working on a homework assignment where my professor would like us to create a true regression model and simulate a sample of data from it; he will then attempt to recover our true regression model using some of the techniques we have learned in class. We will likewise have to do the same with a dataset he's given us.

He says that he's been able to produce a pretty accurate model for every past attempt to trick him. Some students have created insane models, but he was arguably able to produce a simpler model that was still sufficient.

How can I go about developing a tricky model for him to find? I don't want to be super cheap by doing 4 quadratic terms, 3 observations, and massive variance. How can I produce a seemingly innocuous dataset that has a tough little model underneath it?

He simply has 3 Rules to follow:

  1. Your dataset must have one "Y" variable and 20 "X" variables labeled as "Y", "X1", ..., "X20".

  2. Your response variable $Y$ must come from a linear regression model that satisfies:
    $$ Y_i^\prime = \beta_0 + \beta_1 X_{i1}^\prime + \ldots + \beta_{p-1}X_{i,p-1}^\prime + \epsilon_i $$ where $\epsilon_i \sim N(0,\sigma^2)$ and $p \leq 21$.

  3. All $X$-variables that were used to create $Y$ are contained in your dataset.

It should be noted that not all 20 X variables need to be in your real model.

I was thinking of using something like the Fama-French 3 Factor Model and having him start with the stock data (SPX and AAPL), which he would have to transform to continuously compounded returns, in order to obfuscate it a little more. But that leaves me with missing values in the first observation, and it's time series data (which we haven't discussed in class yet).
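For what it's worth, the transformation I have in mind would be something like this (a minimal sketch in Python; the price values are just placeholders):

```python
import numpy as np
import pandas as pd

# Placeholder prices; in practice these would be the SPX and AAPL closing prices.
prices = pd.DataFrame({
    "SPX":  [2700.0, 2715.3, 2680.1, 2695.8],
    "AAPL": [172.5, 174.1, 171.9, 173.3],
})

# Continuously compounded (log) returns: r_t = ln(P_t / P_{t-1}).
log_returns = np.log(prices / prices.shift(1))
print(log_returns)  # the first row is NaN -- the missing-value problem mentioned above
```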

Unsure if this is the proper place to post something like this. I felt like it could generate some good discussion.

Edit: I'm also not asking for "pre-built" models in particular. I'm more curious about topics/tools in Statistics that would enable somebody to go about this.

dylanjm
  • 374
  • 2
  • 17
  • 4
    Going to be hard if he's limiting you to a linear model... – Frank H. Mar 07 '18 at 21:17
  • 3
    How will your professor's reconstruction be evaluated? – Stephan Kolassa Mar 07 '18 at 21:21
  • 2
    Hint: look at [tag:multicollinearity]. – Stephan Kolassa Mar 07 '18 at 21:28
  • @FrankH. I might not be familiar enough, but we can use squared terms or any transformation of the X variables we'd like. I'll be giving him the untransformed variables and he'll have to figure out the transformation himself. So we can have curvilinear models. Is that still considered a linear model? – dylanjm Mar 07 '18 at 21:35
  • @StephanKolassa I think if he can estimate the coefficients of my true model to a 95% level of confidence then he will "win". Ahh yes, some strategic use of multicollinearity could be deceiving. – dylanjm Mar 07 '18 at 21:37
  • 4
    If your professor wins if your true coefficients are inside the 95% confidence intervals, then multicollinearity will not help, because multicollinearity enormously inflates CIs. If, on the other hand, evaluation is done on the difference between predicted and actual data on new predictors (the "actual" data having been generated using your true DGP), then multicollinearity will be a much better approach. Bottom line: find out what the target function is and tailor your approach to it. (This applies more generally in life...) – Stephan Kolassa Mar 07 '18 at 22:02
  • 4
    @dylanjm Could you *precisely* define your victory conditions? – Matthew Gunn Mar 07 '18 at 23:41
  • 11
    The point of such exercise is for you to learn *by trying to think of something yourself*. If you pit experts here against him, your opportunity to actually stretch your brain by consolidating different pieces of information you have been given in relation to regression is dramatically reduced (as well as being unfair to the professor). Further, at any reputable institution presenting work to him as yours when it was partly done by someone else may lay somewhere between academic misconduct and fraud (esp. if it's worth any part of your mark). Be very careful about exactly how you ask this. – Glen_b Mar 08 '18 at 00:42
  • 1
  • @Glen_b I appreciate your concern for my academic integrity. To be clear, my grade does not depend on my ability to fool him with my created dataset. My grade comes from my ability to develop a model for the data he gave us (which I have not asked about here). This was posed as more of a challenge for us. I would have no problem telling him I used CV as a resource. As mentioned above, there are only 3 rules to the challenge. Likewise, I was not asking for complicated models, but ideas I could use to obfuscate my model (i.e. multicollinearity was mentioned) – dylanjm Mar 08 '18 at 01:26
  • 1
    @MatthewGunn I have confirmed. The precise wording is, *"I will win if I estimate, within 95% confidence, your true regression model, or a mathematically equivalent one."* – dylanjm Mar 08 '18 at 01:33
  • 2
    This may be a naive question, but what (apart from sportsmanship) is there to prevent the professor giving the confidence intervals as $(-\infty, \infty)$? – Hugh Mar 08 '18 at 06:49
  • How do you score in this game/battle? What do you mean by 'able to produce a simpler model that was just sufficient'? When is the model considered sufficient or 'mathematically equivalent'? – Sextus Empiricus Mar 08 '18 at 10:59
  • @dylanjm, your first comment seems to contradict your requirements 1 and 2. Could you clarify this? – Richard Hardy Mar 08 '18 at 12:51
  • 4
    Despite the popularity of this question, I feel obliged to close it at this point because even after repeated requests for clarifications concerning the rules of the game (what criteria will be used to evaluate success, how many samples must you supply, etc) this important information still has not appeared in the question. Our aims are narrower and more focused than "generate discussion": please consult our [help] for the kinds of questions we can address on this site. – whuber Mar 08 '18 at 13:55

5 Answers

6

Simply make the error term much larger than the explained part. For instance: $y_i=X_{i1}+\epsilon_i$, where $X_{ij}=\sin(i+j)$, $i=1..1000$ and $\sigma=1000000$. Of course, you have to remember what your seed was, so that you can prove to your professor that you were right and he was wrong.

Good luck identifying the phase with this noise/signal ratio.
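A minimal simulation sketch of this idea (the seed, sample size, and noise level are arbitrary placeholders):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)  # keep the seed so you can prove what the true model was

n = 1000
i = np.arange(1, n + 1)
x1 = np.sin(i + 1)                          # X_{i1} = sin(i + 1)
y = x1 + rng.normal(0, 1_000_000, size=n)   # signal of order 1, noise of order 10^6

fit = sm.OLS(y, sm.add_constant(x1)).fit()
print(fit.summary())  # the slope estimate is swamped by the noise
```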

Aksakal
  • 55,939
  • 5
  • 90
  • 176
4

If his goal is to recover the true data generating process that creates $Y$, fooling your professor is fairly trivial. To give you an example, consider disturbances $\epsilon_i\sim N(0,1)$ and the following structural equations:

$$ X_1 = \epsilon_1 + \epsilon_0\\ X_2 =\epsilon_1 + \epsilon_2\\ y = X_1 + \epsilon_2 $$

Note that the true DGP of $Y$, which includes only $X_1$, trivially satisfies condition 2. Condition 3 is also satisfied, since $X_1$ is the only variable used to create $Y$ and you are providing both $X_1$ and $X_2$.

Yet, there's no way your professor can tell whether he should include only $X_1$, only $X_2$, or both $X_1$ and $X_2$ to recover the true DGP of $Y$ (if you end up using this example, change the numbering of the variables). Most likely, he will just give you as an answer the regression with all the variables, since they will all show up as significant predictors. You can extend this to 20 variables if you want to; you might also want to check this answer here and a Simpson's paradox machine here.
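A quick simulation illustrates this (a minimal sketch; the sample size and seed are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
e0, e1, e2 = rng.standard_normal((3, n))

X1 = e1 + e0
X2 = e1 + e2
y = X1 + e2            # the true DGP uses only X1

# Each specification fits well and every coefficient comes out highly significant;
# regressing on both gives coefficients near 2/3 each, so nothing flags X2 as spurious.
for regressors in (X1, X2, np.column_stack([X1, X2])):
    print(sm.OLS(y, sm.add_constant(regressors)).fit().params)
```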

Note that all of the conditional expectations $E[Y|X_1]$, $E[Y|X_2]$, and $E[Y|X_1, X_2]$ are correctly specified, but only $E[Y|X_1]$ reflects the true DGP of $Y$. Thus, after your professor inevitably fails the task, he might argue that his goal was simply to recover any conditional expectation, or to get the best prediction of $Y$, etc. You can argue back that that wasn't what he said, since he states:

variable Y must come from a linear regression model that satisfies (...) variables that were used to create Y (...) your real model (...)

And you might spark a good discussion in class about causality, what true DGP means and identifiability in general.

Carlos Cinelli
  • 10,500
  • 5
  • 42
  • 77
3

Use variables with multicollinearity and heteroscedasticity, like income versus age; do some painful feature engineering that creates scaling problems; sprinkle in NAs for some sparseness. The linearity requirement really makes it more challenging, but it can still be made painful. Also, outliers would add to the problem for him upfront.
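A rough sketch of the kind of dataset this describes (all variable names, coefficients, and noise levels are placeholders, and the error on $Y$ itself is kept homoscedastic to respect rule 2):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

age = rng.uniform(20, 70, n)
# Income is strongly collinear with age, with noise that grows with age (heteroscedastic in X).
income = 1_000 * age + rng.normal(0, 50, n) * age
# An awkwardly scaled, feature-engineered regressor.
x3 = np.exp(age / 50) + rng.normal(0, 0.1, n)

# The response still follows a plain linear model with N(0, sigma^2) errors.
y = 2 * age + 0.001 * income + rng.normal(0, 5, n)

df = pd.DataFrame({"Y": y, "X1": age, "X2": income, "X3": x3})
df.loc[rng.choice(n, 20, replace=False), "X3"] = np.nan  # sprinkle in some missing values
```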

David
  • 31
  • 4
  • I think heteroscedasticity is outside the scope of the problem, but definitely agree multicollinearity is one of the best ways of making the true specification hard to find. – JDL Mar 08 '18 at 11:06
1

Are interaction terms allowed? If so, set all the lower-order coefficients to 0 and build the entire model out of N-th order interactions (e.g. terms like $X_5X_8X_{12}X_{13}$). For 20 regressors the number of possible interactions is astronomically large, and it would be very difficult to find just the ones you included.
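For instance, something along these lines (a minimal sketch; which interaction is used and its coefficient are arbitrary, and it assumes the rules allow interaction terms as transformations of the X's):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200

X = pd.DataFrame(rng.standard_normal((n, 20)),
                 columns=[f"X{j}" for j in range(1, 21)])

# All main effects are zero; Y is driven by a single 4-way interaction plus noise.
Y = 3 * X["X5"] * X["X8"] * X["X12"] * X["X13"] + rng.normal(0, 1, n)

data = pd.concat([Y.rename("Y"), X], axis=1)
```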

Ruben van Bergen
  • 6,511
  • 1
  • 20
  • 38
0

Choose any linear model. Give him a dataset where most samples are around x=0. Give him a few samples around x=1,000,000.

The nice thing here is that the samples around x=1,000,000 are not outliers. They are generated from the same source. However, since the scales are so different, the errors around 1M won't fit with the errors around 0.

Let's consider an example. Our model is just $$ Y_i^\prime = \beta_0 +\beta_1 X_{i1}^\prime + \epsilon_i $$

We have a dataset of n samples near x=0. We will choose 2 more points at "far enough" values. We assume that these two points have some error.

A "far enough" value is such a value that the error for an estimation the doesn't pass directly in these two points is much larger than the error of the rest of the dataset.

Hence, linear regression will choose coefficients that pass through these two points, miss the rest of the dataset, and differ from the underlying model.

See the following example: {{1, 782}, {2, 3099}, {3, 110}, {4, 1266}, {5, 1381}, {1000000, 1002169}, {1000001, 999688}}

This is in Wolfram Alpha series format. In each pair the first item is x, and the second was generated in Excel using the formula =A2+NORMINV(RAND(),0,2000).

Hence, $\beta_0=1, \beta_1=1$, and we add normally distributed random noise with mean 0 and standard deviation 2000. This is a lot of noise near zero but a small amount near a million.

Using Wolfram Alpha, you get the linear regression $y = 178433x - 426805$, which is quite different from the underlying model $y=x$.

DaL
  • 4,462
  • 3
  • 16
  • 27
  • How exactly should this work and what effect is this supposed to create? – Richard Hardy Mar 08 '18 at 12:49
  • It works since the noise and precision work differently at the different scales. For the high numbers, taking it to the extreme and considering a single point, the line should go directly through it or suffer a lot of cost. Some noise is enough to miss the right values. Around zero, again in the extreme with no intercept, you are left with the noise. – DaL Mar 11 '18 at 07:06
  • Use a small value for the variable with the wrong coefficient and you are paying a cost. – DaL Mar 11 '18 at 07:12
  • Yes, but why would it be hard for the professor to discover the model that generated this? It looks like a particularly easy task when there is so much variation in the given regressor. – Richard Hardy Mar 11 '18 at 07:52
  • Because no model will fit well both groups. – DaL Mar 12 '18 at 07:59
  • No model except the model that was used to generate the data will fit it nicely, won't it? That is how the professor will easily guess which one it was. – Richard Hardy Mar 12 '18 at 08:12
  • The model is not important there. The problem is due to the different behaviour of the noise. That happens even if you know what the true model is. – DaL Mar 13 '18 at 06:17
  • I do not follow. Your description is so brief that I would not be able to construct an example based on it. Would you? – Richard Hardy Mar 13 '18 at 09:04
  • I added an example. Is it clear now? – DaL Mar 14 '18 at 13:13
  • So let's say $x=(1,2,3,4,5,100,200)$, $\varepsilon=(0.01,-0.01,0.02,-0.02,0.00,0.01,-0.01)$, $\beta_0=0$ and $\beta_1=1$ and so $y=(1.01,1.99,3.02,3.98,5.00,100.01,199.99)$. The fit by OLS would be very nice, and so the model would be easy to judge as a good one. So how is this model supposed to be hard to detect among competing models? (I consider 1,2,3,4,5 to be near zero while 100 and 200 to be far off.) – Richard Hardy Mar 14 '18 at 13:38
  • I added numbers based on your example. – DaL Mar 14 '18 at 14:20
  • I think this is impossible. Can you give an example where this is true? I do not think such an example exists. – Richard Hardy Mar 14 '18 at 17:07
  • I think that you are right regarding the previous example. I didn't compute the model there, but I think the difference would be too small. I gave another example instead, with a noise level that is significant near zero and low next to the high values. – DaL Mar 15 '18 at 09:56
  • I think what you did now is bury the signal in the noise by using extremely high variance. This was considered a cheap trick by the author. More importantly, the fact that $x$ has some values close to zero and others close to 1,000,000 does not have the effect you expect it to have; if anything, the effect is the opposite: the higher the variance of $x$, the easier it is to identify the model because there is a strong signal. However, your noise is so humongous that it buries even that strong signal. I suggest you play with the numbers to see this for yourself, as you got the intuition wrong here. – Richard Hardy Mar 15 '18 at 10:18