
I have a categorical response variable. It is binary and represents the win or loss of a deal. Some of the independent variables used to predict the response are also categorical (such as Geo, Region, and others). These categorical variables each have more than 3 categories. The rest of the variables are counts (such as the number of face-to-face activities, the number of CXO/VP meetings, and the number of business development activities).

Should I use logistic regression to predict the response variable? If yes, please specify the steps needed to come up with the best model for prediction.

How should I check the quality of the model, so as to decide which one is the best?

logc
user43247
  • Search for questions with the 'model-selection' tag - it's a broad topic, & there are diverse approaches; there aren't four or five "steps". Model 'quality' requires some thought - are you interested in a specific classification task (having costs associated with mis-classification), with discrimination (scoring wins higher than losses), or with calibration (accurate estimation of probabilities) - see [here](http://stats.stackexchange.com/questions/91088/). – Scortchi - Reinstate Monica Jun 05 '14 at 12:24
  • "What model should I use? What are the steps needed? & How should I check the quality of the model?" is too broad to be answerable. Way too broad to be answerable. Such questions cannot be answered reasonably in a format like this. You will need to take several statistics classes to develop an adequate understanding of these issues. In the interim, we would be happy to help you with suitably focused questions. – gung - Reinstate Monica Jun 07 '14 at 02:22

1 Answer


You may want to look into the field of statistical classification, since you are unsure whether to use logistic regression to predict the response variable. Classification is a related field - some sources, including Wikipedia, regard classification as the field encompassing logistic regression.

To handle the categorical variables, use a suitable encoding that transforms them into numbers, and consider normalizing all data features: many classification algorithms require this to prevent the counts with the highest maxima from dominating the result.
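As a minimal sketch of one such encoding, the snippet below applies dummy (one-hot) coding with a reference level to a categorical variable. The `dummy_encode` helper and the Geo values are hypothetical illustrations, not taken from the question's data, and in practice a statistics or machine learning package would normally do this for you.

```python
def dummy_encode(values, reference=None):
    """Dummy-code a categorical variable.

    Returns (kept_levels, rows): each row holds one 0/1 indicator per
    non-reference level, so the reference level is encoded as all zeros.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    kept = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


# Hypothetical Geo values, invented for illustration only.
geo = ["EMEA", "APAC", "AMER", "EMEA", "LATAM"]
columns, encoded = dummy_encode(geo, reference="AMER")

for value, row in zip(geo, encoded):
    print(value, dict(zip(columns, row)))
```

Dropping one reference level keeps the indicator columns from being collinear with an intercept, which matters for regression-type models; whether to scale the count variables afterwards depends on the method you choose.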

In order to tell which algorithm works best, or to tune the parameter values once you have one algorithm selected, use cross validation on the data you already have.
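The idea of cross validation can be sketched as follows, using only the Python standard library. Everything here is an assumption made for illustration: the data are simulated, the logistic model is fitted by plain gradient descent rather than by a proper statistical routine, and the Brier score (mean squared error of the predicted probabilities) is just one possible quality measure.

```python
import math
import random


def sigmoid(z):
    # Clamp the linear predictor to avoid overflow in exp().
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, xi):
    """Predicted probability of a win; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic model by batch gradient descent (illustrative only)."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def k_fold_brier(X, y, k=5, seed=0):
    """Return the Brier score on each of k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the training folds only, then score the held-out fold.
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        sq_err = sum((predict_proba(w, X[i]) - y[i]) ** 2 for i in held_out)
        scores.append(sq_err / len(held_out))
    return scores


# Simulated counts (e.g. meetings, activities) and win/loss outcomes.
rng = random.Random(42)
X = [[rng.randint(0, 5), rng.randint(0, 3)] for _ in range(200)]
y = [1 if rng.random() < sigmoid(-2.0 + 0.6 * a + 0.5 * b) else 0 for a, b in X]

scores = k_fold_brier(X, y, k=5)
print("Brier score per fold:", [round(s, 3) for s in scores])
print("Mean Brier score:", round(sum(scores) / len(scores), 3))
```

Lower Brier scores are better. The same loop can be repeated for each candidate model or parameter setting, comparing the mean held-out score across candidates.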

logc
  • This is *not* a place for a classifier, and normalizing the data will create confusion. – Frank Harrell Jun 05 '14 at 12:39
  • @FrankHarrell: I may be wrong, but I understand the OP wants to predict the response variable with *something*; since [logistic regression is a type of classification](http://en.wikipedia.org/wiki/Logistic_regression#cite_ref-1), I would recommend to him to look at his problem from the broader perspective. Otherwise, let me ask: why do you say that normalizing the data will create confusion? I am ready to be proved wrong but I do not understand the reasons behind your comment. – logc Jun 05 '14 at 12:50
  • Absolutely not. Logistic regression *is not* a type of classification. It is a direct probability model that does away with the need for arbitrary classification entirely. And normalizing the data means that you need to save and re-use the normalizing constants when evaluating or developing predictions, thus effectively creating parameters that are external (and must be remembered) to the model's parameter estimates. – Frank Harrell Jun 05 '14 at 16:07
  • @FrankHarrell: sorry, but in the Wikipedia article that I linked previously, the first sentence states that logistic regression "is a type of probabilistic statistical classification model". Since Wikipedia is not a reliable source, I linked to the textbook citation from where they derive this claim: *Pattern Recognition and Machine Learning* by Christopher Bishop. See also ["Choosing between Logistic Regression and Discriminant Analysis", Journal of the American Statistical Association](http://www.tandfonline.com/doi/abs/10.1080/.U5CzvxbLiZw): the abstract mentions their equivalence. – logc Jun 05 '14 at 18:19
  • @FrankHarrell: on the issue of normalization, I now understand your concern better, but let me state that it is a [common requirement for many machine learning estimators](http://scikit-learn.org/stable/modules/preprocessing.html), and many software packages will remember the scaling parameters for you; see the `inverse_transform` on [this example from scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler). – logc Jun 05 '14 at 18:24
  • For a method invented by statisticians by 1958 (DR Cox, Journal of the Royal Statistical Society B: 20:215-242; 1958) it is appalling that anyone would consider the authority on the subject to be a machine learning expert. And note that for regression models, normalization is not needed. – Frank Harrell Jun 05 '14 at 21:20
  • Note that logistic regression and linear discriminant analysis are not equivalent but they do have a relationship. And LDA provides posterior probabilities of class membership, not all-or-nothing classification. – Frank Harrell Jun 06 '14 at 04:48
  • @FrankHarrell: I do not agree that the final authority on the technique has to be that of its inventors, but I see your point in distinguishing 'classification' from 'regression'. I will edit my answer to read more like a suggestion, because I still think it could be of help to the OP in solving his problem. You are welcome to provide another answer here (or edit Wikipedia to express your distrust of machine learning experts :) ). – logc Jun 06 '14 at 08:58
  • On a side note, the original inventor is not even acknowledged in the Wiki article. – Thomas Speidel Jun 06 '14 at 16:06