I'm working on a problem that involves a large number of NAs. How does VW handle them? Should I impute the NAs with column means or something similar before converting the data to VW format?

Frank P.

1 Answer

To elaborate on my answer:

Let's say the first line of your data is:

y, v1, v2, v3
10, 5, NA, 3

The VW string encoding of that line is:

10 | v1:5 v2:NA v3:3

(Note the space after the bar: VW reads text attached directly to | as a namespace name, not as a feature.)

As you probably discovered, v2:NA doesn't work in VW, because the part after the colon must be a number.

An easy solution is to find :NA in your VW string and replace it with _NA:

10 | v1:5 v2_NA v3:3

This will work fine in VW, as it will internally recode v2_NA as v2_NA:1.

This will allow the model to learn what happens when v2 is NA, and how that differs from the case where it is known.
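
As a rough illustration of that find-and-replace step, here is a minimal Python sketch. The function name `flag_missing` is hypothetical, and the sketch assumes missing values appear literally as `:NA` at the end of a feature token; adjust the pattern if your data marks them differently.

```python
import re

def flag_missing(vw_line: str) -> str:
    """Rewrite `name:NA` tokens as `name_NA` indicator features."""
    # A feature name is any run of characters other than whitespace, ':' or '|'.
    # The lookahead ensures we only match a token that ends right after "NA".
    return re.sub(r"([^\s:|]+):NA(?=\s|$)", r"\1_NA", vw_line)

line = "10 | v1:5 v2:NA v3:3"
print(flag_missing(line))
```

Because a feature written without a value is treated as having value 1, the resulting `v2_NA` token acts as an indicator that is present only on rows where v2 was missing.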


You could impute medians, but it's probably a better idea to:

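  1. Compute an "NA flag" for each variable that is 1 when the variable is NA and 0 when it is not.
  2. Omit the NA values themselves from your VW training file. VW uses a sparse encoding, so a missing feature can simply be left out of the line.
  3. Train on the resulting file, which contains the known values plus the flags.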

This will let VW build a model that predicts one thing for an NA variable and another when it is present.
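
The steps above could be sketched in Python roughly as follows. The helper names `is_missing` and `to_vw_line` are hypothetical, and the sketch assumes each row arrives as a dict in which missing values are `None` or NaN. Since the encoding is sparse, a flag of 0 is represented by leaving the flag feature out.

```python
import math

def is_missing(value) -> bool:
    """Treat None and float NaN as missing (an assumption about the input)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def to_vw_line(label, features: dict) -> str:
    """Build one VW-format line, flagging missing values instead of encoding them."""
    tokens = []
    for name, value in features.items():
        if is_missing(value):
            # Omit the value and emit the flag; VW treats a bare feature as value 1.
            tokens.append(f"{name}_NA")
        else:
            tokens.append(f"{name}:{value}")
    # The space after the bar keeps the first feature from being read as a namespace.
    return f"{label} | " + " ".join(tokens)

print(to_vw_line(10, {"v1": 5, "v2": None, "v3": 3}))
```

Running the same function over the data at prediction time produces the same flags, so nothing like a table of imputed means has to be stored alongside the model.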

Zach
  • what does `omitting NAs and including flags` mean actually? – matt Sep 01 '18 at 06:10
  • The problem with imputing column means is that you need to remember what the means you used were at prediction time. VW uses a sparse encoding of the input variables, so my suggestion was to omit the NAs from the sparse coding. I'll update my answer. – Zach Sep 01 '18 at 21:34