4

I have a problem where I need to learn a classifier (such as an SVM) such that all of the learned weights are non-negative, due to a constraint on the classifier function.

I found out that "SVM Struct" is able to do this, as mentioned in the comments inside some of its source code. But I'm still unable to find any reference (paper or lecture notes) describing the theoretical aspects of how to solve this problem. Is there any standard method for integrating such constraints into the learning of the classification weights? Note that a similar question has already been asked, but I believe no clear and complete answer was given to it.

user58419
  • 211
  • 2
  • 5
  • I think the ramifications depend largely on the kernel you're using. I assume you are talking about a linear SVM, and by "weights" you mean the actual support vectors? Since a linear SVM separates the feature space with a hyperplane rather than a more or less arbitrary boundary, restricting the weight components to be non-negative means, by my understanding, that the normal vector of your hyperplane will lie in the positive orthant of your feature space, so to speak. – Christian Feb 15 '13 at 17:51
  • There is another point of confusion here. The decision rule for the SVM is generally expressed as sign of $\sum_{i}\alpha_{i}x_{i}^{T}x$, where $\alpha_{i}$ will only be non-zero for the support vectors (assuming no kernel). This is generally the whole point: you get to have a sparse representation of the separating hyperplane that involves only the measured contact with a few critical data points. In that case, you're not worried about the actual coefficients of the hyperplane itself (which may only make sense in the kernel space). – ely Aug 14 '13 at 18:14
  • I am also looking for a solution for this. Have you found one? If so, could you please add some references/solutions for this problem? – user570593 Jul 31 '15 at 21:57

4 Answers

1

I successfully trained a linear Support Vector Machine (SVM) with non-negative weights by adding some dummy data (every possible one-hot vector, excluding the bias) at training time and using an off-the-shelf SVM implementation such as LibLinear (though any implementation would do) -- see my blog post. My goal was to quickly achieve diagonal Mahalanobis metric learning. Be careful with the class weighting (wi in LibLinear).
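A minimal sketch of this construction, using scikit-learn's LinearSVC as the LibLinear wrapper: the choice of label and the large per-sample weight for the dummy rows below are illustrative assumptions (the answer itself speaks of LibLinear's class weighting), and the dummy examples only push the weights toward non-negativity rather than strictly guaranteeing it.

```python
# Sketch: encourage non-negative weights by appending one-hot dummy examples.
# The dummy labels and weights are illustrative assumptions, not a transcription
# of the original recipe.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ np.array([1.0, 0.5, 0.2, 1.5, 0.3]) + 0.1 * rng.normal(size=n))

# One dummy example per dimension: the one-hot vector e_j, labelled +1.
# Its margin constraint reads w_j + b >= 1 - xi_j, so a large weight on these
# rows pushes each w_j upward.
X_aug = np.vstack([X, np.eye(d)])
y_aug = np.concatenate([y, np.ones(d)])
sample_weight = np.concatenate([np.ones(n), 1e3 * np.ones(d)])

clf = LinearSVC(C=1.0).fit(X_aug, y_aug, sample_weight=sample_weight)
print(clf.coef_)  # should be (close to) non-negative
```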

0

I developed a Python library that solves the linear SVM with a non-negativity constraint; see https://github.com/statmlben/Variant-SVM.

logistic
  • 148
  • 9
0

I don't have any references on this either, but maybe the following thoughts will be helpful. First, why would you need all $w_i$ to be non-negative? If you need a black box that classifies your data, why would you care about how the box works? And if you are studying your data, then the negativity of some $w_i$ is a property of the data reflected in the SVM classifier, so why deny it?

Answering your question, I'd suggest the following two approaches.

  1. Add the constraints $w_i \geq 0$ to the original SVM optimization problem. I don't know whether there are any SVM libraries that allow you to do that, but if your dataset is not very large, you can solve this new optimization problem yourself using any general solver for quadratic programs (see the sketch after this list). Note that, generally speaking, the properties of your hyperplane will not be the same as those of the common SVM hyperplane.

  2. Get $w, b$ by solving the original SVM optimization problem and if some $w_i$ are negative, transform your data using one of the following approaches.

    • For each $w_i < 0$ you can flip the sign of the $i$-th feature, $x'_i = -x_i$ (a "reflection" of the $i$-th dimension), which flips the sign of the corresponding $w_i$.
    • Shift and rotate your data to get all $w_i \geq 0$. First, shift your data along the unit normal $w/\|w\|$ by a distance $b/\|w\|$ (i.e., $x' = x + \frac{b}{\|w\|^2} w$) so that the hyperplane passes through the origin. Then rotate your data so that $w$ points into the positive orthant of $R^n$. For example, if $\|w\| = d$, you can calculate the angle $\alpha$ between $w$ and $w' = (d, 0, \ldots, 0)$, and then rotate your dataset by $\alpha$. Note that in this case you will get a classifier with all $w'_i = 0$ except $w'_1 = d$.
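To make option 1 concrete, here is a minimal sketch that solves the soft-margin primal with the extra constraints $w_i \geq 0$, using cvxpy as the general solver (the choice of solver and the example data are illustrative, not something the approach prescribes). The problem remains convex; only the feasible set shrinks.

```python
# Soft-margin linear SVM primal with non-negativity constraints on w,
# solved with a general-purpose convex solver (cvxpy). Only a prototype
# for small/medium datasets.
import cvxpy as cp
import numpy as np

def nonneg_svm(X, y, C=1.0):
    """X: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [
        cp.multiply(y, X @ w + b) >= 1 - xi,  # usual margin constraints
        xi >= 0,
        w >= 0,                               # the extra non-negativity constraints
    ]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X @ np.array([1.0, 0.5, 2.0]) + 0.1)
w, b = nonneg_svm(X, y)
print(w, b)
```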
Leo
  • 2,484
  • 3
  • 22
  • 29
  • 1
    Thank you. I think these (especially the first option) are the only available options. The only concern is whether adding such a constraint violates the convexity of the optimization problem. Since this has been implemented in "SVM struct", and people have asked how to do it in "SVMLib" [here](http://agbs.kyb.tuebingen.mpg.de/km/bb/showthread.php?tid=1003), I suppose that is not the case, at least for SVM. – user58419 Jan 17 '13 at 18:37
  • As for the reasons why one would want to do this, one example is [here](http://metaoptimize.com/qa/questions/10178/using-svmstruct-with-additional-contraints), which is similar to my requirements. Also, a domain expert might have requirements or guidelines that call for such constraints. For instance, a paper by Kevin Small et al. (ICML 2011) discusses a constrained weight-space SVM, where a ranking of features with respect to each other can be specified by the domain expert in the form of constraints. I haven't read the paper completely yet, though. – user58419 Jan 17 '13 at 18:41
  • 2
    I strongly dislike the reference to black boxes here. In choosing an SVM, it's highly likely, possibly even the central motivating factor, that the analyst wants a *transparent box* not a black box. SVM is just an algorithm, like logistic regression, perceptron, decision trees, etc. None of these things is inherently black-box-like at all. – ely Aug 14 '13 at 18:10
0

The decision rule for a (linear kernel) SVM is given by $f(x_{new}) = \textrm{sign}(\sum_{i}\alpha_{i}x_{i}^{T}x_{new} + b)$, where $\alpha_{i}$ is non-zero for only a subset of the $x_{i}$ (the support vectors) and I lump the training label $y_{i}$ in with $\alpha_{i}$ as part of the "weight" for support vector $x_{i}$.

The decision rule is generally not expressed as $f(x_{new}) = \textrm{sign}(w^{T}x_{new} + b)$, with $(w,b)$ being the parameters of the optimal separating hyperplane from the SVM optimization.
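(For a linear kernel the two forms coincide, since $w = \sum_{i}\alpha_{i}x_{i}$ with the label folded into $\alpha_{i}$ as above. The quick check below, using scikit-learn's SVC, which exposes those weights as `dual_coef_`, is added purely for illustration.)

```python
# Illustration: the support-vector form of the decision rule versus the
# hyperplane form, for a linear kernel (they coincide, since
# w = sum_i (alpha_i * y_i) * x_i).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.sign(X @ np.array([1.0, -0.5, 0.3, 2.0]) + 0.1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
X_new = rng.normal(size=(5, 4))

# Support-vector form: sum_i (alpha_i * y_i) x_i^T x_new + b.
# sklearn stores alpha_i * y_i for the support vectors in dual_coef_.
f_dual = X_new @ (clf.dual_coef_ @ clf.support_vectors_).ravel() + clf.intercept_

# Hyperplane form: w^T x_new + b, with w = clf.coef_.
f_primal = X_new @ clf.coef_.ravel() + clf.intercept_

print(np.allclose(f_dual, f_primal))  # True
```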

If you're looking for algorithms where representing the decision rule in that particular inner-product format is desirable, SVM may not be appropriate for you. You could instead use constrained logistic regression if you want to enforce the hyperplane coefficient condition $w_{i} > 0, \forall{} i$, but you make the optimization much harder and lose guarantees on properties of the classifier.
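A minimal sketch of such a constrained logistic regression, using bound-constrained L-BFGS from SciPy; the particular loss, regularizer, and solver below are illustrative choices, not a prescription.

```python
# Logistic regression with non-negative weights via bound-constrained
# L-BFGS. The bounds keep every w_i >= 0 while leaving the intercept free.
import numpy as np
from scipy.optimize import minimize

def fit_nonneg_logreg(X, y, C=1.0):
    """X: (n, d), y: (n,) in {-1, +1}. Returns (w, b) with w >= 0."""
    n, d = X.shape

    def loss(params):
        w, b = params[:d], params[d]
        margins = y * (X @ w + b)
        # log(1 + exp(-margin)) summed over samples, plus L2 penalty on w
        return np.logaddexp(0.0, -margins).sum() + 0.5 / C * w @ w

    x0 = np.zeros(d + 1)
    bounds = [(0.0, None)] * d + [(None, None)]   # w_i >= 0, b unconstrained
    res = minimize(loss, x0, method="L-BFGS-B", bounds=bounds)
    return res.x[:d], res.x[d]
```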

I haven't thought too much about it, but you may also lose feasibility. At the very least, you must by definition produce a classifier whose margin is no better than that of the traditional SVM. Suppose you were only considering a linear kernel. The SVM will pick out the line (among all lines) that yields the maximum margin on the data (in the case where it is linearly separable). If you add a constraint like $w_{i} > 0, \forall{} i$, then you are asking the SVM algorithm to search only over a special subset of lines, those with positive coefficients. So the best achievable margin there will be no better than the best margin among all possible lines, and probably a lot worse. And you will lose the ability to use the same Lagrangian dual representation that allows for the support-vector-based decision rule I mentioned above, since the dual will now involve additional multipliers.
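To spell out the last point (a sketch under the usual soft-margin formulation, writing the label $y_i$ explicitly): adding $w \geq 0$ to the primal introduces an extra multiplier vector $\beta \geq 0$ into the Lagrangian, and the stationarity condition becomes

$$\frac{\partial L}{\partial w} = w - \sum_{i}\alpha_{i}y_{i}x_{i} - \beta = 0 \quad\Longrightarrow\quad w = \sum_{i}\alpha_{i}y_{i}x_{i} + \beta,$$

so $w$ is no longer a combination of the support vectors alone, and the usual kernelized dual no longer carries over unchanged.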

Can you share more information about the nature of your problem and why it requires this constraint?

Lastly, as far as I can tell, the mention of SVM-struct is a complete non sequitur. That is a method for producing vector- or structure-valued outputs instead of mere class labels (a scalar enumeration data type) as the outputs.

ely
  • 2,272
  • 18
  • 31
  • There are plenty of circumstances where you want $w\ge 0$. E.g., suppose you know that the "+1" class is monotonic, in the sense that if $x$ is in "+1" (or is in "+1" in expectation) then for all $y\ge x$, $y$ is in "+1" (or is so in expectation). By imposing the constraint, you (trivially) reduce the in-sample fit, but you improve the out-of-sample fit, which is what you care about. – cfp May 22 '19 at 13:55
  • @cfp I don't understand your example in that case. You could just have a single decision threshold somewhere below the value `x` you describe, since on one side of `x`, things are perfectly separable between the classes. That example would only emphasize *not* adding the weight constraint in the SVM case. Maybe there are other examples though which would demonstrate when the weight constraint is needed? – ely May 22 '19 at 14:06
  • $x$ is a vector in my example. Consider e.g. if the "true" "+1" class were $x_1+0.01 x_2\ge 1$. In small samples the estimated weight on $x_2$ might easily be negative. Imposing the positivity constraint will improve the small sample performance. – cfp May 22 '19 at 14:45