How to determine which category is most likely given an observation with a set of characteristics

Question

I have a question concerning the data analysis methods to use for a specific situation. Here is the situation:

There is a dataset from an e-commerce site about its customers' purchases, with 875 observations in total. Each observation (one per customer) consists of 5 values. The measurement scale of each value is summarised below:

Package Type (Nominal): Type 1, Type 2, Type 3
Sex (Nominal): Male, Female
Age Group (Ordinal): NULL, Younger than 20, 20-25, Older than 25
Location (Nominal): NULL, Region 1, Region 2
Order Count (Ratio): Integers

NULL represents missing data.

The task is to identify which Package Type is preferred by which client type, where a client type is defined by Sex, Age Group, Location and Order Count. Put another way: which Package Type is most likely given a set of characteristics consisting of Sex, Age Group, Location and Order Count?

What I am asking for is not a ready-made solution to this problem - that just wouldn't be so interesting :). I would like you to point me towards the methodology that would be used to answer this question. What branch of statistics handles this kind of problem? Maybe you could recommend a good classic book covering the subject, or a forum thread?

A couple of small notes about the scales of measurement that you list: (1) Only the scale of the response variable is of primary importance; (2) The concept of measurement scales is largely overrated, IMHO; (3) I wouldn't call a count variable (i.e., your #5) a ratio scale variable--it's sort of true, but not really the right way to think about it. — gung - Reinstate Monica, Nov 12 '12 at 23:24
No time for a proper answer but my proto-answer on key words is classification analysis as a general branch; and multinomial regression as one particular technique that would be useful. — Peter Ellis, Nov 12 '12 at 23:34
Great, thank you all for the expanded answers! This is a great place there. I'll try everything, thank you again. — Anton Ivanov, Nov 17 '12 at 16:07

score 2 · Answer 1 · edited Apr 13 '17 at 12:44

The appropriate method for your situation is multinomial logistic regression (MLR), because your response variable is categorical with more than two categories. MLR is a generalization of logistic regression, with which you are probably familiar (if not, my answer here: difference-between-logit-and-probit-models may be helpful, although it was written in a different context). A nice discussion of MLR on CV can be found here: interpreting-expb-in-multinomial-logistic-regression. For books, I would recommend the works by Agresti. He has written a rigorous treatment of the subject (Categorical Data Analysis) and an introductory version. For a quick guide to get you to the point where you can run an MLR, you may want to peruse the UCLA stats help pages.
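To give a feel for what MLR actually computes, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the nine toy customers, the encoding (intercept, is_male, order count divided by 10), and the helper names `softmax`, `predict_proba` and `fit_softmax`. It fits the model by simple gradient descent rather than the algorithms a real statistics package would use, and it gives no standard errors or tests, so treat it as a teaching aid only.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, x):
    """Class probabilities for one feature vector x (x[0] is the intercept term)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])

def fit_softmax(X, y, n_classes, lr=0.5, epochs=2000):
    """Fit by batch gradient descent on the average negative log-likelihood."""
    n_features = len(X[0])
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, label in zip(X, y):
            p = predict_proba(weights, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    grad[k][j] += err * x[j]
        for k in range(n_classes):
            for j in range(n_features):
                weights[k][j] -= lr * grad[k][j] / len(X)
    return weights

# Hypothetical toy data (NOT the asker's): [intercept, is_male, order_count / 10]
X = [
    [1.0, 0.0, 0.1], [1.0, 1.0, 0.2], [1.0, 0.0, 0.3],   # bought Type 1
    [1.0, 1.0, 0.5], [1.0, 0.0, 0.6], [1.0, 1.0, 0.7],   # bought Type 2
    [1.0, 0.0, 0.9], [1.0, 1.0, 1.0], [1.0, 0.0, 1.1],   # bought Type 3
]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # 0 = Type 1, 1 = Type 2, 2 = Type 3

weights = fit_softmax(X, y, n_classes=3)

# Estimated P(Package Type | characteristics) for a female customer with 1 order
probs = predict_proba(weights, [1.0, 0.0, 0.1])
best = max(range(3), key=lambda k: probs[k])
print("P(Type 1..3):", [round(p, 3) for p in probs], "-> most likely: Type", best + 1)
```

The output of `predict_proba` is exactly the quantity the question asks about: an estimated probability for each Package Type given the customer's characteristics. With the real data, the nominal predictors (including a separate NULL level, if you choose to keep missing values as their own category) would be dummy-coded in the same way as `is_male` here.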

score 2 · Answer 2 · edited Nov 13 '12 at 14:16

If you are more focused on the results than on hardcore statistical validity and soundness, then this is the domain of machine learning (Link is Wikipedia).

More precisely: classification (again, Wikipedia; your classes are "Type 1", "Type 2" and "Type 3").

Broader applicable terms are artificial intelligence (but this includes various other tasks, such as simulating swarm behaviour and optimal path finding) and data mining (which also includes various non-learning data analysis methods, such as outlier detection and clustering; it is more of an application area for machine learning). As you want to predict the package, you are dead on target with classification/ML.

There are various methods around. Literally thousands. Probably a good introductory approach to the general idea is decision trees, because they are less heavy on the statistics and can actually be computed by hand. More advanced techniques include naive Bayes, neural networks, SVMs, etc.
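To show what "computed by hand" means for a decision tree, here is a small sketch of the core step: picking the attribute to split on by information gain (the criterion used by ID3-style trees; other tree algorithms use other criteria such as Gini impurity). The eight toy customers and the names `entropy` and `information_gain` are assumptions invented for this example, not the asker's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="package"):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - remainder

# Hypothetical toy customers (NOT the asker's data)
rows = [
    {"sex": "M", "location": "Region 1", "package": "Type 1"},
    {"sex": "F", "location": "Region 1", "package": "Type 1"},
    {"sex": "M", "location": "Region 1", "package": "Type 1"},
    {"sex": "F", "location": "Region 1", "package": "Type 2"},
    {"sex": "M", "location": "Region 2", "package": "Type 2"},
    {"sex": "F", "location": "Region 2", "package": "Type 3"},
    {"sex": "M", "location": "Region 2", "package": "Type 3"},
    {"sex": "F", "location": "Region 2", "package": "Type 3"},
]

gains = {a: information_gain(rows, a) for a in ("sex", "location")}
best_attribute = max(gains, key=gains.get)
print(gains, "-> split on:", best_attribute)
```

A full tree repeats this step inside each branch until the branches are pure or too small, and then predicts the majority Package Type in each leaf. In this toy data, splitting on location separates the packages far better than splitting on sex, which is the kind of statement the question is after.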

Well known books include:

Ian H. Witten, Eibe Frank, and Mark A. Hall (2011). Data Mining: Practical machine learning tools and techniques ("The Weka book")
Christopher M. Bishop (2006) Pattern Recognition and Machine Learning
Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (This one is a bit broader in focus. Peter Norvig is at Google, but was involved in the AI part of the famous AI-ML Stanford online classes that got Coursera started; you can still watch the classes on YouTube.)
