The question pretty much explains itself. When running a Lasso regression on many explanatory variables that are indexed (say, by time and location), is it best practice to first apply a within transformation to all of the data, demeaning over location and then over time? For example:

$ y^{new}_{l,t} = y_{l,t} - \frac1T\sum_{t=1}^{T} y_{l,t} - \frac1N\sum_{l=1}^{N} y_{l,t} + \bar{\bar{y}} $
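A minimal sketch of that transformation in Python with NumPy, assuming a balanced panel stored as an `(N, T)` array with one row per location and one column per time period. The function name `within_two_way` and the toy data are made up for illustration.

```python
import numpy as np

def within_two_way(y):
    """Two-way within transformation of a balanced panel.

    y has shape (N, T): rows are locations l, columns are times t.
    """
    loc_mean = y.mean(axis=1, keepdims=True)   # (1/T) * sum over t, one value per location
    time_mean = y.mean(axis=0, keepdims=True)  # (1/N) * sum over l, one value per time
    grand_mean = y.mean()                      # the double-bar mean over all (l, t)
    return y - loc_mean - time_mean + grand_mean

# Toy panel: noise plus an additive location effect and an additive time effect.
rng = np.random.default_rng(0)
y = rng.normal(size=(4, 6)) + np.arange(4)[:, None] + np.arange(6)[None, :]
y_new = within_two_way(y)
```

After the transformation every location mean and every time mean of `y_new` is zero, so additive location and time effects have been removed. The same function would be applied to each explanatory variable.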

wolfsatthedoor

1 Answer

The first thing to do is to rearrange your data into a standard form. If you've got $n$ samples and $d$ features, that means you want

  • An $n \times d$ input matrix $X$, in which each column has a mean of $0$ and variance of $1$. This ensures that LASSO's regularization effect treats each dimension "fairly" when deciding whether to shrink it to zero.
  • A length-$n$ vector $y$ of outputs, which has a mean of $0$. This ensures the LASSO model doesn't need to use a constant term.
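As a rough sketch of that preprocessing in Python with NumPy: the helper name `standardize` and the toy data are hypothetical, and the code assumes no column of $X$ is constant (otherwise its standard deviation is zero and the division fails).

```python
import numpy as np

def standardize(X, y):
    """Scale each column of X to mean 0 and variance 1, and center y."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0)
    y_centered = y - y.mean()
    return X_std, y_centered

# Toy data: 50 samples, 3 features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0]) + 5.0
y = rng.normal(size=50) + 2.0
X_std, y_centered = standardize(X, y)
```

Without this step the L1 penalty would hit the coefficient of a small-scale feature much harder than that of a large-scale one, purely because of units.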

You'll probably want to encode time and location as dimensions (i.e., as extra columns in $X$), though without knowing the details of the problem I can't say for sure.

Anyway, if you feed $X$ and $y$ into a LASSO solver, you'll then get back a length-$d$ weights vector $w$, that you can then interpret in terms of time, location, and whatever other explanatory variables you have.

Andy Jones
  • So you are arguing for a fixed effects approach with dummies? – wolfsatthedoor Dec 14 '14 at 14:25
  • Pretty much. If you want to use a different approach, you'll probably need to consider a different kind of model than plain L1-regularized linear regression. Edit: well, you could add various basis functions, but without knowing exactly what the problem is I can't offer advice on which might be useful. – Andy Jones Dec 14 '14 at 14:29
  • I mean what if I want to find out what covariates are relevant (as dictated by Lasso) conditional on including all of the time and fixed effects, which I consider nuisance parameters? – wolfsatthedoor Dec 19 '14 at 03:03
  • If you can find, or write, a LASSO solver that lets you weight the individual terms in the regularization penalty, that might do what you need. The objective would be $\|y - Xw - X^\prime v\|^2_2 + \lambda\|w\|_1 + 0\times\|v\|_1$, where $X^\prime$ holds the columns you always want to include (here the time and location dummies) and $v$ holds their unpenalized coefficients. – Andy Jones Dec 19 '14 at 05:59
  • Why is this better than using a within estimator if I consider the fixed effects nuisance parameters? – wolfsatthedoor Dec 19 '14 at 15:56
  • What're you proposing as the alternative exactly? To do two fittings, one to time and co and then to fit again against the other parameters? – Andy Jones Dec 19 '14 at 16:11
  • To demean the data along multiple levels, basically using the "within" transformation. See https://en.wikipedia.org/wiki/Fixed_effects_model. Then, after the within transformation, use Lasso. – wolfsatthedoor Dec 19 '14 at 16:44
  • Sorry, my frequentist stats is sketchy as hell. Now I understand you. If you're absolutely sure that time and location are nuisances, then yes, demeaning your measurements like that would be a good idea, since it'll effectively 'exaggerate' the variation within each time/location series, which is exactly what you want the model to focus on explaining. Having demeaned the data, though, you still have to center and standardize each feature (which is what I mistakenly thought you were originally asking about). – Andy Jones Dec 19 '14 at 17:15
  • Apologies for wasting your time on this wild goose chase! – Andy Jones Dec 19 '14 at 17:16
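
The two approaches discussed in these comments, unpenalized dummies versus demeaning first, can be compared numerically. For a balanced panel, minimizing the objective over the unpenalized coefficients $v$ projects the dummy columns out of both $y$ and $X$, and that projection is the two-way within transformation. The sketch below checks this on simulated data. Everything in it is an assumption made for illustration: the toy data, the helper names, the small coordinate-descent solver, and the scaling $\frac12\|\cdot\|_2^2 + \lambda\|w\|_1$ of the objective. The final feature standardization is left out to keep the comparison short.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimize 0.5*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    w = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                                   # residual y - Xw for w = 0
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            rho = X[:, j] @ r + col_sq[j] * w[j]   # correlation with partial residual
            w_new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (w[j] - w_new)          # keep r equal to y - Xw
            w[j] = w_new
    return w

# Simulated balanced panel, rows stacked location-major: row index = l*T + t.
rng = np.random.default_rng(0)
N, T, d = 6, 5, 4
loc = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)
X = rng.normal(size=(N * T, d)) + loc[:, None] + 0.5 * time[:, None]
y = X @ np.array([2.0, 0.0, -1.5, 0.0]) + 3.0 * loc - 2.0 * time \
    + 0.1 * rng.normal(size=N * T)

# Location and time dummies (the X' of the comment above).
D = np.hstack([np.eye(N)[loc], np.eye(T)[time]])

def within(a):
    """Two-way demeaning of stacked panel rows."""
    a3 = a.reshape(N, T, -1)
    out = (a3 - a3.mean(axis=1, keepdims=True) - a3.mean(axis=0, keepdims=True)
           + a3.mean(axis=(0, 1), keepdims=True))
    return out.reshape(a.shape)

def project_out(D, a):
    """Residual of a after least-squares regression on the dummies."""
    return a - D @ np.linalg.lstsq(D, a, rcond=None)[0]

# 1) Projecting out the dummies equals the within transformation.
X_w, y_w = within(X), within(y)
same_X = np.allclose(project_out(D, X), X_w)
same_y = np.allclose(project_out(D, y), y_w)

# 2) Lasso on the demeaned data, then recover v by least squares.
lam = 1.0
w_hat = lasso_cd(X_w, y_w, lam)
v_hat = np.linalg.lstsq(D, y - X @ w_hat, rcond=None)[0]

# 3) (w_hat, v_hat) satisfies the optimality conditions of the joint problem
#    0.5*||y - Xw - Dv||^2 + lam*||w||_1 with v unpenalized.
r = y - X @ w_hat - D @ v_hat
g = X.T @ r
active = w_hat != 0
kkt_ok = bool(
    np.allclose(D.T @ r, 0.0, atol=1e-8)
    and np.allclose(g[active], lam * np.sign(w_hat[active]), atol=1e-6)
    and np.all(np.abs(g[~active]) <= lam + 1e-6)
)
```

If `same_X`, `same_y` and `kkt_ok` all come out true, then on this balanced panel demeaning first and then running Lasso selects the same $w$ as penalizing only $w$ while leaving the dummies unpenalized. With an unbalanced panel the simple mean formula no longer matches the projection, so this equivalence should not be assumed there.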