Stat models to use (1) to model % values and (2) to classify objects

Question

I have a dataset of ~300 chemical compounds, each described by ~100 independent dummy variables that encode the presence of a particular chemical group (i.e., each X variable is 1 if that chemical group is present and 0 if it is not). For these 300 compounds I have two dependent variables Y.

1) In the first case my Y is a percentage, ranging from 0% to 100%, and I want to build a model relating this Y to the predictors described above. I then want to use this model to predict outcomes for new compounds.

2) In the second case my Y is a category (I have 3 categories in total) and, as before, I want to build a model that relates the predictors to the category and use it to make predictions.

Please note that the two models are estimated separately. Which statistical models would you suggest for these two cases?

score 1 · Answer 1 · answered Aug 31 '14 at 20:51

For a very general model of proportion or percentage data, you might want to consider a zero-one-inflated beta distribution, and the corresponding regression model. There's a blog post with an R example on R-bloggers.com.
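To make the suggestion a bit more concrete, here is a minimal sketch of a zero-one-inflated beta log-likelihood in plain Python. It assumes one common parameterization that is not taken from the answer itself: `zoi` is the probability that the response is exactly 0 or 1, `coi` is the probability of 1 given that it is 0 or 1, and the beta part uses a mean `mu` and precision `phi`, with `mu` linked to the predictors through a logit link. All function names are hypothetical, and percentages are assumed to be rescaled to proportions first (y = % / 100).

```python
import math

def inv_logit(eta):
    """Map a linear predictor to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def zoib_logpdf(y, mu, phi, zoi, coi):
    """Log-density of a zero-one-inflated beta at a proportion y in [0, 1].

    Assumed parameterization (one of several in use):
      zoi = P(y is exactly 0 or 1), coi = P(y = 1 | y is 0 or 1),
      beta part with shape parameters a = mu * phi, b = (1 - mu) * phi.
    """
    if y == 0.0:
        return math.log(zoi) + math.log(1.0 - coi)
    if y == 1.0:
        return math.log(zoi) + math.log(coi)
    a, b = mu * phi, (1.0 - mu) * phi
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_beta_pdf = (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_beta_fn
    return math.log(1.0 - zoi) + log_beta_pdf

def zoib_loglik(ys, X, beta, phi, zoi, coi):
    """Total log-likelihood when the beta mean depends on dummy predictors X."""
    total = 0.0
    for y, x in zip(ys, X):
        mu = inv_logit(sum(b * xi for b, xi in zip(beta, x)))
        total += zoib_logpdf(y, mu, phi, zoi, coi)
    return total
```

In a real analysis you would maximize this likelihood (or use an existing package) rather than evaluate it by hand, and `zoi` and `coi` could get their own regression equations as well.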

shadowtalker

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

Unless you are confident about the numerical dependence of $Y$ on your various $X$, I would prefer your second approach (though this is only personal intuition), so I will comment mainly on the second approach.

Logistic regression (or sometimes a probit model, depending on the problem) is the usual choice for a classification problem. It assumes a binomial probability model for the outcome. Under this assumption the model is fit by maximum likelihood, and each coefficient quantifies the change in the log-odds of the outcome $Y$ associated with a one-unit difference in the corresponding predictor $X$. The loss function of logistic regression also comes from the likelihood: it is the negative log-likelihood.
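As a rough illustration of how the loss ties to the likelihood, the sketch below writes out the Bernoulli negative log-likelihood for a binary logistic model in plain Python. The function names are hypothetical, and the labels are assumed to be coded 0/1.

```python
import math

def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(weights, intercept, X, y):
    """Negative log-likelihood of 0/1 labels y; this is the logistic loss."""
    nll = 0.0
    for x_row, label in zip(X, y):
        p = sigmoid(intercept + sum(w * x for w, x in zip(weights, x_row)))
        nll -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return nll

def odds_ratio(weight):
    """Multiplicative change in the odds when a 0/1 predictor switches on."""
    return math.exp(weight)
```

With dummy predictors, `odds_ratio(w)` reads as the factor by which the odds of the outcome change when the corresponding chemical group is present.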

So for your second problem, you could try a one-vs-all strategy, in which each class gets its own binary classifier that distinguishes it from the other two classes. To predict, you run every binary classifier and choose the class whose classifier gives the highest confidence score (loosely similar to how a naive Bayes classifier picks the most probable class).
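The prediction step of one-vs-all can be sketched as follows, assuming each class already has a fitted binary linear model. The coefficients below are hand-picked toy values for illustration only, not estimates from any data, and the function name is hypothetical.

```python
def one_vs_rest_predict(x, classifiers):
    """Return the class whose binary classifier scores x highest.

    classifiers maps a class label to (weights, intercept) of a binary
    model trained to separate that class from all the others.
    """
    def score(params):
        weights, intercept = params
        return intercept + sum(w * xi for w, xi in zip(weights, x))
    return max(classifiers, key=lambda label: score(classifiers[label]))

# Hypothetical coefficients for three classes and three dummy predictors.
toy_classifiers = {
    "A": ([2.0, -1.0, 0.0], -0.5),
    "B": ([-1.0, 2.0, 0.0], -0.5),
    "C": ([0.0, 0.0, 2.0], -0.5),
}
```

Comparing the linear scores gives the same ranking as comparing the sigmoid-transformed scores, because the sigmoid is monotone.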

Since your training set is small relative to the number of features, logistic regression may overfit. Cross-validation helps you check whether that is happening. If it is, you may consider reducing the number of features, or using an SVM with a Gaussian kernel (which maps your data to a space in which it is linearly separable).
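For the cross-validation check, a minimal sketch of k-fold splitting in plain Python is given here. The function name, the choice of 5 folds, and the fixed seed are all assumptions for illustration; in practice a library routine would do the same job.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation
```

You would fit the model on each training part, score it on the matching validation part, and compare the average validation error with the training error; a large gap between the two suggests overfitting.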

EDIT

Regarding the regression, the basic strategy is almost the same as in my analysis of classification. Note that logistic regression can be used for both classification and regression, since it models a value between 0 and 1. While SVMs are mostly used for classification, a neural network may give you better performance on regression: you only change the cost function from the cross-entropy used with a sigmoid output to a least-squares (quadratic) cost, while the backpropagation procedure stays the same. Each model has only a single response variable, but with ~100 predictors and ~300 compounds overfitting may still be a problem. You can check this by fitting on a small subset of the data with cross-validation and observing whether the validation error curve converges to the same expected error as the training error curve.
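Here is a small sketch of what swapping the cost function changes. It assumes a single sigmoid output unit and compares the usual log-loss (cross-entropy) with a squared-error cost. Only the error term passed back from the output unit differs; everything upstream in backpropagation is unchanged. The function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(z, y):
    """Log-loss for a target y in [0, 1] and output pre-activation z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(z, y):
    """Least-squares cost for the same sigmoid output."""
    return 0.5 * (sigmoid(z) - y) ** 2

def cross_entropy_delta(z, y):
    """d(cross_entropy)/dz: the output-layer error term under log-loss."""
    return sigmoid(z) - y

def squared_error_delta(z, y):
    """d(squared_error)/dz: the same term with an extra sigmoid-derivative factor."""
    p = sigmoid(z)
    return (p - y) * p * (1 - p)
```

The extra factor `p * (1 - p)` in the squared-error case is the derivative of the sigmoid, which is probably where the mention of that derivative comes from.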

Is there some reason you do not suggest multinomial logistic regression for the second problem? BTW, this is really two separate questions about two different models for two different responses, so there's no question about preferring one over the other: the O.P. seeks advice about each separately. — whuber, Jan 10 '14 at 22:39
Thank you for your suggestion whuber. Yes softmax regression is also an option. Thanks. — lennon310, Jan 10 '14 at 23:05
I need to do both models. I have 2 dependent variables: in one case it is a % value, in the other case it is a category. The predictors are the same anyway. Sorry for not being clear. — mimenico, Jan 13 '14 at 15:34
@mimenico I updated my answer by adding the possible model in regression, thanks. — lennon310, Jan 13 '14 at 16:00
