
Assume a data set with multiple columns in which the categorical data are coded. What is the best rule, or rule of thumb, for determining whether each column contains qualitative or quantitative data?

One possible way is to count the number of unique values in a column and, if that count is below some threshold, treat the column as qualitative. But since a suitable threshold can differ with the size of the data (e.g. for big data), is there a particular method for choosing it?

Or is there a completely different approach to this?

Nick Cox
Supun Setunga
  • **Closely related:** http://stats.stackexchange.com/questions/23200 (concerning design of an object-oriented system based on measurement types) and http://stats.stackexchange.com/questions/539 (a discussion on "treating categorical data as continuous"). – whuber Sep 03 '14 at 17:03

1 Answer


Whether to treat data as categorical or quantitative is a decision made by the analyst, taking into account what the data represent, & has nothing to do with how many unique values there are in a sample. Having thousands of unique nine-digit customer IDs doesn't imply that customer ID should be treated as quantitative; having only five different levels of temperature measurement doesn't imply that temperature should be treated as qualitative. (And what about ordered categories, counts, angles, &c.?)
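To make the two counter-examples concrete, here is a minimal Java sketch of a naive unique-count rule applied to made-up data. The class name `UniqueCountPitfall`, the method `looksCategorical`, the sample values, and the threshold of 6 are all hypothetical and chosen only for illustration; this is not a recommended rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueCountPitfall {

    // Hypothetical naive rule: fewer distinct values than the threshold
    // means "treat the column as categorical".
    static boolean looksCategorical(List<String> column, int threshold) {
        Set<String> distinct = new HashSet<>(column);
        return distinct.size() < threshold;
    }

    public static void main(String[] args) {
        // Made-up nine-digit customer IDs: every value is distinct,
        // yet they are labels, not quantities.
        List<String> customerIds = Arrays.asList(
            "100234871", "100234872", "100917345", "203345118",
            "203345119", "310000452", "310000453", "498812006");

        // Made-up temperatures recorded at only five levels:
        // a genuine measurement despite the small number of values.
        List<String> temperatures = Arrays.asList(
            "10", "20", "30", "40", "50", "10", "20", "30", "40", "50");

        int threshold = 6; // arbitrary, for this illustration only

        // Prints false: the IDs are wrongly treated as quantitative.
        System.out.println(looksCategorical(customerIds, threshold));
        // Prints true: the temperatures are wrongly treated as categorical.
        System.out.println(looksCategorical(temperatures, threshold));
    }
}
```

Both columns are misclassified, and no other choice of threshold would get both right, since the label column has more distinct values than the measured one.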

Scortchi - Reinstate Monica
  • The requirement is to automate the identification of the data type as quantitative or categorical in Java code. The code should be general enough to work with any data set, so it has to be based on some decision rule. – Supun Setunga Sep 03 '14 at 09:13
  • (1) That sounds like a bad idea, if it's more than a provisional identification to be reviewed by someone with domain knowledge. (2) Any such decision rule could perform very badly on average depending on the kind of data-sets to which it's commonly applied - you could sample some & look for patterns. – Scortchi - Reinstate Monica Sep 03 '14 at 09:26
  • I endorse @Scortchi's answer and comments from a statistical point of view. I suspect that your question is really a programming question in disguise, such as: For an unstated purpose, I need to identify which variables (columns, fields, attributes, whatever) in databases (datasets) are "coded". I think you would need to say much, much more about your data files, software set-up, etc. to get good answers. Using Java doesn't tell us much except that you are apparently working outside the framework of specifically statistical software such as R, SAS, SPSS, Stata. – Nick Cox Sep 03 '14 at 09:35
  • Yes, it's hard to imagine what you'd be wanting to *do* with data you knew so little about. Some sort of automated descriptive/graphical analysis? - even then it seems rather barmy to have to guess an appropriate scale of measure rather than stipulating it with meta-data. If for some reason you do have to guess, knowing something about the context - behavioural/attitudinal surveys, order lines from a transactional database, or whatever - might justify making some assumptions about typical differences in no. unique values, & the distribution across those values, between qualitative & ... – Scortchi - Reinstate Monica Sep 03 '14 at 11:10
  • ... quantitative variables, which, together with knowledge of the costs of different mis-classifications, you could use to formulate a decision rule. E.g. it might seem reasonable to suppose that a variable with 5 unique values is a Likert item, or one that tails off quite slowly toward higher values is a count & not a category code. – Scortchi - Reinstate Monica Sep 03 '14 at 11:14
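The suggestions in the comments above (stipulate the scale with metadata where possible, treat any guess as provisional, and lean on context-specific assumptions such as "a few integer levels suggests a Likert item") could be sketched in Java roughly as follows. Everything named here is hypothetical: the class `ProvisionalScaleGuess`, the `classify` method, the `declared` metadata map, and the `maxLevels` cut-off are assumptions made for illustration. The sketch assumes non-null string values, and it does not attempt the distribution-shape check for counts or any weighting by misclassification cost.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ProvisionalScaleGuess {

    enum Scale { CATEGORICAL, ORDINAL, QUANTITATIVE }

    /** A guessed scale plus a flag saying whether a person should confirm it. */
    static final class Guess {
        final Scale scale;
        final boolean needsReview;

        Guess(Scale scale, boolean needsReview) {
            this.scale = scale;
            this.needsReview = needsReview;
        }
    }

    static Guess classify(String column, List<String> values,
                          Map<String, Scale> declared, int maxLevels) {
        // 1. Stipulated metadata always wins over guessing.
        Scale stipulated = declared.get(column);
        if (stipulated != null) {
            return new Guess(stipulated, false);
        }

        Set<Double> distinct = new HashSet<>();
        boolean allIntegers = true;
        for (String raw : values) {
            double parsed;
            try {
                parsed = Double.parseDouble(raw.trim());
            } catch (NumberFormatException e) {
                // 2. Non-numeric text is not quantitative as it stands.
                return new Guess(Scale.CATEGORICAL, true);
            }
            distinct.add(parsed);
            allIntegers &= (parsed == Math.rint(parsed));
        }

        // 3. A handful of integer levels: guess an ordered item (e.g. Likert).
        if (allIntegers && distinct.size() <= maxLevels) {
            return new Guess(Scale.ORDINAL, true);
        }
        // 4. Otherwise guess quantitative, still flagged for review.
        return new Guess(Scale.QUANTITATIVE, true);
    }

    public static void main(String[] args) {
        List<String> likert = Arrays.asList("1", "2", "3", "4", "5", "3", "4");
        List<String> ids = Arrays.asList(
            "100234871", "100234872", "100917345", "203345118",
            "203345119", "310000452", "310000453", "498812006");

        Map<String, Scale> declared = new HashMap<>();
        declared.put("customer_id", Scale.CATEGORICAL);

        Guess a = classify("satisfaction", likert, declared, 7);
        Guess b = classify("customer_id", ids, declared, 7);
        Guess c = classify("customer_id", ids, Collections.<String, Scale>emptyMap(), 7);

        System.out.println(a.scale + " review=" + a.needsReview); // ORDINAL review=true
        System.out.println(b.scale + " review=" + b.needsReview); // CATEGORICAL review=false
        System.out.println(c.scale + " review=" + c.needsReview); // QUANTITATIVE review=true
    }
}
```

The last call shows why the review flag matters: without metadata the customer IDs are still guessed to be quantitative, so the output is only a starting point for someone with domain knowledge.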