How to statistically prove if a column has categorical data or not using Python

Question

I have a data frame in Python in which I need to find all categorical variables. Checking the type of the column doesn't always work, because a column of int type can also be categorical.

So I am looking for the right hypothesis test to identify whether a column is categorical or not.

I tried the chi-square test below, but I am not sure whether it is good enough:

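import numpy as np
import scipy.stats as ss

# 100 random integers drawn from {0, 1, 2, 3, 4}
data = np.random.randint(0, 5, 100)
# note: chisquare treats each entry of `data` as an observed frequency
ss.chisquare(data)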

Please advise.

Could you clarify whether I have understood correctly that you want to distinguish between int used as an ordinal variable and int used as an arbitrary categorical coding? I don't have an answer, but it might help other readers be clear about what you want. Intuitively I don't think Chi2 could reliably do the job. — ReneBt, Mar 23 '18 at 11:00
Yes, you are right, except that my focus is not on the type of the variable but rather on the data it carries. So in the example code I have shared, the result should be that it is a categorical variable because it has only 5 unique values. — Amit, Mar 23 '18 at 11:03
Where does the rule "the result should be that it is a categorical variable because it has only 5 unique values." come from? — Dennis Soemers, Mar 23 '18 at 11:12
Oh, it's not a rule. Intuitively, I know that all my column data consists of these 5 values only. — Amit, Mar 23 '18 at 11:16
Even though others have correctly pointed out that it can't be done, this nevertheless would make for an interesting machine learning problem. Doubtless there are problem domains in which reasonably accurate predictions can be made. — John Coleman, Mar 23 '18 at 18:21

score 35 · Accepted Answer · answered Mar 23 '18 at 11:02

Short answer: you can't.

There is no statistical test that will tell you whether a predictor that contains the integers between 1 and 10 is a numeric predictor (e.g., number of children) or encodes ten different categories. (If the predictor contains negative numbers, or the smallest number is larger than one, or it skips integers, this might argue against its being a categorical encoding - or it may just mean that the analyst used nonstandard encoding.)

The only way to be sure is to leverage domain expertise, or the dataset's codebook (which should always exist).
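To illustrate why a chi-square test cannot settle this, here is a minimal sketch with made-up columns (the names `n_children`, `region_code` and `balanced_code`, and the helper `uniformity_pvalue`, are hypothetical and not from the question). A goodness-of-fit test on the value counts only asks whether the values occur equally often, so a genuine count and an arbitrary label coding can both "reject" for the same reason.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical columns: all three hold small non-negative integers (0..4).
n_children = rng.binomial(n=4, p=0.3, size=500)  # a count, i.e. numeric
region_code = rng.choice(5, size=500, p=[0.4, 0.3, 0.15, 0.1, 0.05])  # arbitrary labels, unbalanced
balanced_code = rng.integers(0, 5, size=500)  # arbitrary labels, balanced


def uniformity_pvalue(column):
    """Chi-square goodness-of-fit of the value counts against equal frequencies."""
    counts = np.bincount(column, minlength=5)
    return stats.chisquare(counts).pvalue


p_children = uniformity_pvalue(n_children)
p_region = uniformity_pvalue(region_code)
p_balanced = uniformity_pvalue(balanced_code)

# The p-values reflect how balanced the frequencies are, not what the integers mean.
print(p_children, p_region, p_balanced)
```

The count column and the unbalanced label column are both far from uniform, so the test says nothing that separates them; only the codebook or domain knowledge does.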

"the dataset's codebook (which should always exist)" — Ha, good one. — Kodiologist, Mar 23 '18 at 16:31

score 12 · Answer 2 · answered Mar 23 '18 at 12:26

Whatever criteria -- or rules of thumb -- work for your dataset are welcome to you, but we can't see your data. In any case the problem is better pitched generally, and without reference to any particular software either.

It's worse than you think, even if you think it's worse than you think.

@Stephan Kolassa's answer already makes one key point. Small integers could mean counts rather than categories: 3, meaning 3 cars or cats, is not the same as 3, meaning "person owns a car" or "person is owned by a cat".
Decimal points could lurk within categorical variables, as part of coded classifications, e.g. of industries or diseases.
Measurements in the strict sense could just be integers by convention, e.g. heights of people may just be reported as integer cm or inches, blood pressures as integer mm Hg.
The number of distinct (a better term than "unique", which still has the primary meaning of occurring just once) values is not a good guide either. The number of different heights of people possible in moderate samples is probably much less than the number of different religious affiliations or ethnic origins.
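The last point can be sketched with simulated data. Everything below is an assumption made for illustration: the height distribution, the 100 equally likely affiliation codes, and the threshold of 50 are all arbitrary and not taken from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical measured variable: adult heights reported as whole centimetres.
heights_cm = np.round(rng.normal(loc=170, scale=7, size=n)).astype(int)

# Hypothetical nominal variable: codes for 100 possible affiliations, equally likely.
affiliation_code = rng.integers(0, 100, size=n)

distinct_heights = np.unique(heights_cm).size
distinct_affiliations = np.unique(affiliation_code).size

print("distinct heights:", distinct_heights)
print("distinct affiliation codes:", distinct_affiliations)

# A rule such as "few distinct values means categorical" (threshold is arbitrary)
# gets both columns the wrong way round in this simulation.
threshold = 50
print("heights flagged categorical:", distinct_heights <= threshold)
print("codes flagged categorical:", distinct_affiliations <= threshold)
```

In a sample of this size the measurement has fewer distinct values than the nominal coding, so a cut-off on the number of distinct values mislabels both.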

+1. This is a good list of things to consider. You should combine this with your domain knowledge about the dataset (and any documentation) to automate categorical variable detection. — Cacti, Mar 23 '18 at 20:19
@Anna I would say that *automated detection* should not be performed and is exactly what can get you into trouble as outlined in this thread. The domain knowledge and documentation should readily identify polytomous variables from among the other variables, so that you don't have to guess. — prince_of_pears, Sep 24 '18 at 18:38

score 7 · Answer 3 · answered Mar 23 '18 at 15:01

Well, I think it's even worse than the other answers suggest: data aren't categorical or numeric sub specie æternitatis; "level of measurement" is something stipulated by the analyst to answer a particular question on a particular occasion. See Glen_b's answer here.

It's of practical importance to understand that. For example, with a classification tree the distinction between ratio, interval, & ordinal level predictors is of no consequence: the only distinction that matters is that between ordinal & nominal predictors. Constraining the algorithm to split the predictor at a point along a line, separating higher from lower values, can have a significant effect on its predictive performance—for good or ill, depending on the smoothness of the (putatively ordinal) predictor's relation to the response & the size of the data-set. There's no sensible way to make the decision based solely on musing about how the predictor variable represents reality irrespective of the analysis you're about to undertake, let alone on what values you've found it takes in a sample.
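A toy sketch of the ordinal-versus-nominal distinction for a single tree split follows. The data are invented (a five-level predictor whose response is 1 exactly for levels 1 and 3), and `errors` is a hypothetical helper, not part of any library. Treated as ordinal, the predictor can only be cut at a threshold; treated as nominal, any subset of levels may go to one side.

```python
from itertools import combinations

# Hypothetical toy data: levels 0..4, response is 1 exactly for levels 1 and 3.
levels = [0, 1, 2, 3, 4]
x = levels * 20
y = [1 if v in (1, 3) else 0 for v in x]


def errors(left_levels):
    """Misclassifications when each side of the split predicts its majority class."""
    total = 0
    for side in (True, False):
        labels = [yi for xi, yi in zip(x, y) if (xi in left_levels) == side]
        ones = sum(labels)
        total += min(ones, len(labels) - ones)
    return total


# Ordinal treatment: only threshold splits {level <= t} are allowed (k - 1 candidates).
ordinal_splits = [set(levels[: t + 1]) for t in range(len(levels) - 1)]

# Nominal treatment: any proper non-empty subset may go left. This enumerates
# each of the 2**(k - 1) - 1 distinct partitions twice, which is harmless here.
nominal_splits = [set(c) for r in range(1, len(levels)) for c in combinations(levels, r)]

best_ordinal = min(errors(s) for s in ordinal_splits)
best_nominal = min(errors(s) for s in nominal_splits)

print("best threshold split errors:", best_ordinal)
print("best subset split errors:", best_nominal)
```

Here the nominal treatment wins because the response is not monotone in the coding. With a smooth relationship and little data, the extra freedom of subset splits could just as easily lead to overfitting, which is why the choice depends on the analysis rather than on the values alone.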

score 0 · Answer 4 · edited Sep 24 '18 at 18:43


This is an open research question. See for example the work by Valera et al. (paper) or extensions (e.g. one by Dhir et al. - paper).

Edit:

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually, also the likelihood model is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset.

(From the Valera paper.)

So when we say that this is an "open question" (oddly enough quoting myself), we mean to say that currently there are no good automatic methods for inferring the type of data given a finite sample. If you had an infinite sample this would be easy, but since that is not possible, we need to revert to other means.

edited Sep 24 '18 at 18:43 by Nick Cox · answered Sep 24 '18 at 15:24 by Astrid
Could you tell us what you are referring to by "open research question"? Please consider also explaining how your answer does not (or does!) contradict other answers in this thread. – whuber Sep 24 '18 at 17:46
Sure, let me edit my answer. – Astrid Sep 24 '18 at 17:54
Thank you. It seemed to me, upon perusing the Valera paper, that it makes a much stronger claim: namely, it does purport to have a method to guess about variable types, and in particular to distinguish between categorical and ordinal data. I didn't study the method, but presume it must be based (at least in part) on looking at relationships between such variables and other variables they are assumed to be related to. I am unable to understand how an "infinite sample" (whatever that might be) would be of any additional use: could you explain how that would make the problem "easy"? – whuber Sep 24 '18 at 18:04
It is actually a very robust method, and I have myself studied it in detail (which does make me somewhat biased, mind you); but the idea is very clever. We presume that each column type can be described as a mixture of types (much like a mixture model), then we seek the type with the highest weight and call the corresponding 'type' the real type of the variable. As far as type inference goes, it is very clever, and the best automatic method (that I know of). If others know of any, please do share! – Astrid Sep 24 '18 at 18:07
