8

My feature data is defined in such a way that I believe all weights must be non-negative. I am looking for a reference discussing how to optimize the weights of a logistic regression classifier with the constraint that the weights must be non-negative, and perhaps also a constraint on the sign of the bias.

Will replacing the weights with squared weights and using the regular maximum-likelihood cost function with a local optimization scheme work?
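
For concreteness, a minimal sketch of the reparameterization I have in mind (toy data, numpy/scipy, purely illustrative):

import numpy as np
from scipy.optimize import minimize

# Toy data: X is n x p, y in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, 0.5, 2.0]) + rng.normal(size=100) > 0).astype(float)

def nll(v):
    w = v ** 2                                 # squared weights are non-negative by construction
    z = X @ w
    return np.sum(np.logaddexp(0, z) - y * z)  # logistic negative log-likelihood

res = minimize(nll, x0=np.ones(X.shape[1]), method="BFGS")
w_hat = res.x ** 2                             # recovered non-negative weights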

mpiktas
o17t H1H' S'k
  • Did you find any theoretical advantages of setting this constraint? I also have a problem with an SVM, where I need the weights (the W) to be non-negative. If you know of any references or advantages, could you please reply? – user570593 Aug 01 '15 at 15:12

6 Answers

4

If you are familiar with convex optimization, this can be formalized as a constrained convex optimization problem and solved with a standard convex solver (the logistic log-likelihood is not quadratic, so it is not literally a QP, but it is convex and the non-negativity constraints are linear). If this direction interests you, I can elaborate further.
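
For illustration, one way to write the constrained maximum-likelihood problem down in a convex modeling tool such as CVXPY (my own sketch, on toy data):

import cvxpy as cp
import numpy as np

# Toy data: X is n x p, y in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, 0.5, 2.0]) + rng.normal(size=100) > 0).astype(float)

w = cp.Variable(X.shape[1], nonneg=True)  # non-negative weights
b = cp.Variable()                         # intercept; pass nonneg=True here too to constrain its sign

z = X @ w + b
# Logistic negative log-likelihood: sum(log(1 + exp(z))) - y'z, convex in (w, b)
loss = cp.sum(cp.logistic(z)) - y @ z

cp.Problem(cp.Minimize(loss)).solve()
print(w.value, b.value)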

Regardless of the method used, if the problem is convex (there are a number of ways to check this, and unconstrained logistic regression is known to be convex), you are guaranteed that any local minimum is also a global minimum.

I might be mistaken, but the constrained problem does appear to be convex: the objective is unchanged (and convex), and the feasible set is the intersection of the original feasible set with the non-negative orthant, which is a convex cone; since an intersection of convex sets is convex, the constrained problem remains convex. But this is hand-waving; it is best to examine it formally.
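
Concretely, the constrained problem is (writing $\sigma$ for the logistic function)

$$\min_{w \ge 0,\; b} \;\; -\sum_{i=1}^{n}\Big[\, y_i \log \sigma(w^\top x_i + b) + (1-y_i)\log\big(1-\sigma(w^\top x_i + b)\big) \Big],$$

where the objective is the usual logistic negative log-likelihood (convex in $(w, b)$) and the feasible set $\{w \ge 0\}$ is convex.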

Bitwise
4

I have done this successfully using projected gradient descent.

The algorithm is very simple - take a gradient step, then set all negative coefficients to zero (i.e. project onto the feasible set).

I started with Leon Bottou's code here: http://leon.bottou.org/projects/sgd.
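
A minimal numpy sketch of the idea (mine, using plain batch gradient rather than Bottou's SGD code, and assuming y in {0, 1}):

import numpy as np

def nonneg_logreg_pgd(X, y, lr=0.1, n_iter=1000):
    # Projected gradient descent for logistic regression with w >= 0
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted probabilities
        grad = X.T @ (p - y) / len(y)       # gradient of the mean log-loss
        w -= lr * grad                      # gradient step
        w = np.maximum(w, 0.0)              # project onto the feasible set {w >= 0}
    return w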

user1149913
4

Your objective can be achieved with the R package glmnet, which fits generalized linear models via penalized maximum likelihood and also allows coefficients to be constrained, so it can be used to solve your problem. For more details, see: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

Coming back to your problem: glmnet works only on matrices, so be sure to get all your data into matrix form. Create two separate matrices, one for the predictors (X) and one for the response (Y), and run the following code:

library(glmnet)
loReg <- glmnet(x=X, y=Y, family = "binomial", lower.limits = 0, lambda = 0, standardize=TRUE)

The above line fits a logistic model with the penalty coefficient (lambda) set to zero, i.e. unpenalized maximum likelihood, which is what you want. Since the lower limit is the same for all of your variables (zero), passing the scalar lower.limits = 0 does the job.

To predict new observations: suppose you want to predict m new observations. Put them in an m x p matrix, where p is the number of predictors, and call this matrix newX.

predict(loReg, newx = newX, type="response")

Hope this helps.

user2974951
1

I have answered that question in this blog post. It is about adding some dummy data, but the real solution is of course to do projected gradient descent in the primal, as user1149913 said.

1

I wanted to mention a few more solutions that are quite straightforward and have not been mentioned yet.

Bayesian logistic regression

We can use Bayesian logistic regression and set prior distributions that put no prior weight on negative coefficient values. Of course, your exact choice of prior distribution may have an influence on the inference, but since you seem to have some prior information, that may be a good thing. A plus of this approach is that you get credible intervals for the parameters.

For example, if you use R and the brms package, something like this should work in theory, but runs into some challenges:

library(tidyverse)
library(brms)

set.seed(1234)
mydata = tibble(y=rep(c(0,1), each=25),
                x1=rnorm(n=50, mean=y, sd=1),
                x2=rnorm(n=50, mean=y, sd=1))

fit1 = brm(data=mydata, 
           formula=y~0+x1+x2, 
           family = bernoulli(link = "logit"),
           prior=set_prior(class="b", prior="uniform(0,10)"),
           control=list(adapt_delta=0.999))

We can of course debate whether a uniform(0,10) prior is sensible, but a log-odds ratio of 10 is huge, so I thought I'd cut off the distribution there. However, we run into sampling problems, because we did not tell the sampler that the parameter is always >=0. With e.g. an exponential(0.01) prior on the coefficients, that issue still occurs. There are two solutions to this (and possibly a third if you can tell brms about the constraint):

Firstly, one of the cool things about MCMC sampling is that you can introduce constraints retrospectively, e.g. by sampling without a constraint and then subsetting the samples:

fit2 = brm(data=mydata, 
           formula=y~x1+x2, 
           family = bernoulli(link = "logit"),
           prior=c(set_prior(class="b", prior="normal(0,10)"),
                   set_prior(class="Intercept", prior="normal(0,10)")),
           control=list(adapt_delta=0.999))

library(tidybayes) # Only needed for the median_qi function

as.data.frame(fit2) %>%
  filter(b_x1>=0 & b_x2>=0) %>%
  dplyr::select(-lp__) %>%
  median_qi()

This gets us:

  b_Intercept b_Intercept.lower b_Intercept.upper     b_x1 b_x1.lower b_x1.upper     b_x2 b_x2.lower b_x2.upper .width .point .interval
1  -0.7483035         -1.638089        0.03658298 1.136726  0.2859683   2.247933 1.011623   0.354213   1.851013   0.95 median        qi

The main limitation of this approach is that when you have lots of coefficients (curse of dimensionality), or the data favors a negative coefficient for some of them, you may end up with few (or no) samples in which all coefficients are positive.

The more general solution is to tell the sampler about the constraint (which is easy enough to do in rstan by coding the Stan model ourselves):

library(rstan)
scode = "
data {
  int n;
  int<lower=0, upper=1> y[n];
  real x1[n];
  real x2[n];
}
parameters {
  real<lower=0> beta[2];
}
model {
  beta ~ uniform(0,10);
  for (record in 1:n){
    y[record] ~ binomial_logit(1, beta[1]*x1[record] + beta[2]*x2[record]);
  }
}
"
smodel = stan_model(model_code = scode)

sfit = sampling(object = smodel,
                data=list(n=dim(mydata)[1], y=mydata$y, x1=mydata$x1, x2=mydata$x2))

summary(sfit)$summary

Note that this is rather inefficiently coded (it loops over records instead of using a matrix of covariates and a single matrix multiplication to get the linear predictor).

              mean     se_mean        sd          2.5%          25%          50%         75%       97.5%     n_eff     Rhat
beta[1]   1.166326 0.012648042 0.5134742   0.215388737   0.80786712   1.12813511   1.5047327   2.2471028 1648.1265 1.001271
beta[2]   0.122852 0.002284301 0.1072845   0.004358814   0.04249053   0.09305318   0.1763618   0.4019377 2205.8067 1.002139
lp__    -33.951458 0.041777118 1.1712864 -37.180331160 -34.42739042 -33.57252903 -33.1131970 -32.8095330  786.0483 1.004368

General minimization tools such as PyTorch

If you don't mind not having credible intervals/standard errors, then using any generic minimization tool and defining a suitable averaging function works. E.g. you can use a tool like PyTorch like this, if you want no intercept term and not just positivity, but also a sum-to-one constraint (the softmax enforces both):

import torch
import torch.nn as nn

class MyAverager(nn.Module):
    def __init__(self, n_models, n_targets):         
        super(MyAverager, self).__init__()
        self.betas = nn.Parameter(torch.randn(size=(n_models, n_targets)))
        self.softmax = nn.Softmax(dim=0)        

    def forward(self, inputs):
        # Assume input tensors indexed as sample, model, target
        # self.betas for weights indexed as model, target
        wgts = self.softmax(self.betas)
        x = torch.mul(inputs, wgts).sum(dim=1)        
        return x

And like this if you only want positive coefficients (but no sum-to-one constraint) and no intercept:

class MyAverager(nn.Module):
    def __init__(self, n_models, n_targets):         
        super(MyAverager, self).__init__()
        self.betas = nn.Parameter(torch.randn(size=(n_models, n_targets)))

    def forward(self, inputs):
        # Assume input tensors indexed as sample, model, target
        # self.betas for weights indexed as model, target
        wgts = torch.exp(self.betas)
        x = torch.mul(inputs, wgts).sum(dim=1)        
        return x
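
For completeness, a minimal (made-up) training loop for the second module above; swap in whatever loss fits your problem:

import torch
import torch.nn as nn

torch.manual_seed(0)
# Dummy data: predictions of 3 models for 2 targets, 200 samples
inputs = torch.randn(200, 3, 2)   # indexed as sample, model, target
targets = inputs.mean(dim=1)      # pretend the truth is the plain average

model = MyAverager(n_models=3, n_targets=2)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Fitted non-negative weights (apply the softmax instead if you used the first version)
print(torch.exp(model.betas).detach())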

Obviously, there's nothing unique about PyTorch here: you can just as easily do this with Keras / TensorFlow or, if you are not a Python user, with the corresponding R (or whatever else you are using) packages.

Björn
0

For Python users: LogisticRegressionBinaryClassifier() in nimbusml has an option enforce_non_negativity that imposes a non-negativity constraint. See here.
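
An untested sketch of what the usage might look like (nimbusml follows a scikit-learn-style fit/predict interface; check the documentation for the exact data types it expects):

import numpy as np
from nimbusml.linear_model import LogisticRegressionBinaryClassifier

# Toy data, cast to float32 since ML.NET works in single precision
X = np.random.default_rng(0).normal(size=(100, 3)).astype(np.float32)
y = (X.sum(axis=1) > 0).astype(np.float32)

clf = LogisticRegressionBinaryClassifier(enforce_non_negativity=True)
clf.fit(X, y)
preds = clf.predict(X)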

aahr1