3

I commonly see {0,1} used in representing pass/fail data, but I recently found a case where I was given a data set with {-1,1} to perform some pre-processing. So, I'm wondering whether I should keep {-1,1} for some downstream purposes.

Question: When dealing with binary data, what are some advantages/disadvantages of using {0,1} vs {-1,1}? Will this affect my result when performing correlation or applying machine learning algorithms e.g. scipy/scikit/tensorflow?

techtana
  • 45
  • 4

2 Answers2

3

If you have a binary variable coded as -1/1, this has the benefit of making the linear combination of coefficients for that category a contrast. That may be nice for statistical models, but for machine learning my opinion is that 0/1 is better.

Not only does 0/1 make interpretation of means and variances easier, but it also makes interpreting conditions of the data (e.g ad was present or absent) easier as well as automatically being a min/max scaling which some methods (like neural nets) seem to benefit from.

All in all, I would prefer 0/1 over -1/1.

Demetri Pananos
  • 24,380
  • 1
  • 36
  • 94
1

I would recommend you to read this post (if not duplicate question)

Why there are two different logistic loss formulation / notations?

The highlight is that

  • Using 0, 1 representation is more natural and can be easily related to probability.

  • Using -1, +1 is more concise for some cases (such as hinge loss or zero one loss). The reason it is concise is because we are getting a +1, if we are making right predictions (the product of ground truth predicted value are positive) and -1 if we are getting a wrong prediction.

Haitao Du
  • 32,885
  • 17
  • 118
  • 213