7

Suppose we have a data set with millions of rows and thousands of columns, and the task is binary classification. When we run a logistic regression model, the performance is much better than expected, e.g., almost perfect classification.

We suspect there are some cheating variables in the data. How can we quickly detect them?

Here a cheating variable means a variable that is highly indicative of the response but that we should not use. For example, using whether a person made a customer service call to predict whether that person purchased a product.

Haitao Du
  • Your idea of a "cheating variable" here seems similar to the "correlation = causation" fallacy (or perhaps "[postdiction](https://en.wikipedia.org/wiki/Postdiction)"). But your proposal seems more like "How to determine if one predictor dominates outcomes?". This could be useful to flag variables for further [QA/QC](https://en.wikipedia.org/wiki/QA/QC), but is not determinative. In your ending example, a timestamp on the data would be determinative (i.e. the call follows the purchase, so it is a "cheat" because the causal arrow is backwards). – GeoMatt22 May 05 '17 at 16:43
  • @GeoMatt22 thanks for your comment, I admit the definition is not clear. I am also wondering whether the definition should include linear combinations of variables instead of a single variable, and how strong the association must be to treat it as "cheating". – Haitao Du May 05 '17 at 16:50
  • I think that "strong association" cannot be used to infer "causal vs. cheat" in a purely logical manner. However in a Bayesian sense, the "too good to be true" prior does not seem worthless. But I am not sure how to formalize this :) (In a particular domain, I guess you could accumulate a "prior causal $R^2$ PDF"?) – GeoMatt22 May 05 '17 at 16:53
  • process note: you may wish to delay before posting an answer, to encourage feedback. (Or post a note about your intent, as I did [here](https://stats.stackexchange.com/questions/252968/confusion-matrix-metrics-joint-vs-conditional-probabilities).) – GeoMatt22 May 05 '17 at 16:57
  • This is an interesting question: *in practice*, it's actually an important question. For example, if you set up a Kaggle competition and include a "cheating variable", you may greatly underestimate the difficulty of predicting the outcome based on competition results. *In theory*, it's not an important question: the real question is whether the variables will be available when it comes time for prediction. If they are available, use the cheating variable! But this is an interesting idea for double checking datasets before putting on Kaggle, as an example. – Cliff AB May 05 '17 at 17:59
  • thanks @CliffAB I have this problem all the time in my work. I know there is no way we can predict with such accuracy, but I have a difficult time debugging what went wrong when I have gigabytes of data. The only thing I came up with is trying to fit a tree to debug... – Haitao Du May 05 '17 at 18:42
  • @OP Aren't you just saying you shouldn't train on variables that won't be available for the prediction task? "Cheating variables" seems poorly defined - if they're giving you good classification, that is good. Unless you shouldn't have access to them when trying to predict. – Apollys supports Monica May 05 '17 at 18:43
  • @Apollys in the real world it is not easy to tell what's happening. For example, all the data used in modeling are historical, and with a large amount of data it is hard to see which attributes were collected at what time. – Haitao Du May 05 '17 at 18:45
  • I'm not saying it's trivial, I'm just saying it doesn't seem to be a mathematical issue. I.e. a variable being a good predictor doesn't imply it's a cheating variable. – Apollys supports Monica May 05 '17 at 18:47
  • @Apollys: as I stated, *in theory*, it's not interesting. But in practice, it could be! For example, suppose a company wants to know if tastes are easily predictable given their huge dataset. If yes, they will spend a lot of money building a data science team to put everything in production. If no, they won't. It may be nice to get a quick snapshot of "is this predictable?" by fitting an ML model on their huge dataset with diverse data, some of which will be available at prediction time, some of which will not. So you'd better make sure you have no cheating variables in this snapshot! – Cliff AB May 05 '17 at 19:28
  • @CliffAB it depends on what you mean by "in theory". Issues of data/metadata management, model explainability/auditability, "trustable AI", etc. seem to have enough regularity to perhaps allow some generalization. (Isn't there now effectively a large body of "software engineering theory" arising over the past few decades?) Though it would be interesting perhaps to see what response this Q would get at DatSci.SE also! – GeoMatt22 May 05 '17 at 19:35
  • @GeoMatt22: yes, my thought was that "in theory", we know what variables will be available at prediction time. According to me, "in practice", it is possible we might not. We could formalize the problem by saying there's a cost in checking whether a variable is available at prediction time. – Cliff AB May 05 '17 at 19:42
  • @hxd1011: If you are interested in seeing another instance involving data leakage see this recent [thread](https://stats.stackexchange.com/questions/275189). Some further references and links provided might be helpful too. – usεr11852 May 05 '17 at 23:36

2 Answers

11

This is sometimes referred to as "Data Leakage." There's a nice paper on this here:

Leakage in Data Mining: Formulation, Detection, and Avoidance

The above paper has plenty of amusing (and horrifying) examples of data leakage, for example a cancer prediction competition where it turned out that patient ID numbers were near-perfect predictors of future cancer, unintentionally, because of how patient groups were formed throughout the study.

I don't think there's a clear-cut way of identifying data leakage. The above paper has some suggestions, but in general it is very problem specific. As a first pass, you can look at the correlations (or other measures of association) between your individual features and the target, and flag anything that looks too good to be true. However, sometimes you'll miss things. For example, imagine you're building a spam-bot detector for a website like Stack Exchange, where in addition to collecting features like message length, content, etc., you can potentially collect information on whether a message was flagged by another user. Spam bots naturally accumulate a ton of user-generated flags, so your classifier might start relying on those flags rather than on the content of the messages. But if you want your bot detector to act as fast as possible, it shouldn't rely on user-generated flags at all: you should consider removing them as a feature, so that you can tag bots faster than the crowd-sourced flagging effort, i.e., before a wide audience has been exposed to their messages.
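
Here is a minimal sketch of that kind of first-pass screen (not from the original answer; the function name screen_features and the 0.9 cutoff are arbitrary placeholders, and it assumes a numeric feature matrix x and a binary 0/1 target y):

# Flag features whose single-variable association with the target looks
# "too good to be true" relative to a chosen threshold.
screen_features <- function(x, y, threshold = 0.9) {
  assoc <- apply(x, 2, function(col) abs(cor(col, y)))  # |correlation| with the target
  sort(assoc[assoc > threshold], decreasing = TRUE)     # suspicious features, strongest first
}

# Toy example: 20 noise features plus one leaked copy of the target.
noise <- matrix(rnorm(1000 * 20), ncol = 20,
                dimnames = list(NULL, paste0("noise", 1:20)))
leak  <- rbinom(1000, 1, 0.5)
x <- cbind(noise, leak = leak)
y <- leak
screen_features(x, y)  # only "leak" is flagged

A screen like this only catches single-variable leaks; a leak spread across a combination of features would still slip through, which is one reason to also try model-based checks such as the tree in the answer below.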

Other times, you'll have a seemingly silly feature driving your predictions. There's a well-known anecdote about the Army trying to build a tank detector that achieved near-perfect accuracy but turned out to be detecting cloudy days instead: all the training images with tanks had been taken on cloudy days, and all the training images without tanks on clear days. A very relevant paper on this kind of model inspection is "Why Should I Trust You?": Explaining the Predictions of Any Classifier - Ribeiro, et al.

Alex R.
  • +1 thanks for answering my not-so-well-defined question! Now I know the term and will read their formulation! – Haitao Du May 05 '17 at 17:18
  • On the last paragraph: the army apparently [took that lesson to heart](http://www.darpa.mil/program/explainable-artificial-intelligence)! – GeoMatt22 May 05 '17 at 17:30
  • +1 The most common leak I run into is a leak from the future. This is why I'm not too worried about the new spate of "Just run your data through our algorithm and it'll make a model just like a Data Scientist!" products. Machine learning follows the signal, even if the signal is a leak. – Wayne May 05 '17 at 20:57
2

One way of detecting cheating variables is to fit a tree model and look at the first few splits. Here is a simulated example: the response y is generated directly from cheating_variable, while the other 100 columns of x are pure noise, so the tree should split on the cheating variable almost immediately.

# Simulate data: 100 pure-noise predictors plus one "cheating" variable
# that directly generates the binary response.
set.seed(1)  # for reproducibility (any seed shows the same pattern)
cheating_variable <- runif(1e3)
x <- matrix(runif(1e5), nrow = 1e3)     # 1000 x 100 matrix of noise features
y <- rbinom(1e3, 1, cheating_variable)  # response driven by the cheating variable

d <- data.frame(x = cbind(x, cheating_variable), y = y)

# Fit a tree and plot it; the first splits are all on the cheating variable.
library(rpart)
library(partykit)
tree_fit <- rpart(y ~ ., data = d)
plot(as.party(tree_fit))

[Plot of the fitted tree from plot(as.party(tree_fit)): the splits are on cheating_variable]
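
As a quick numeric counterpart to the plot (an added check, not part of the original answer), the variable importance stored on the fitted rpart object should also be dominated by the cheating variable:

# The importance ranking should put cheating_variable far ahead of the noise columns.
head(sort(tree_fit$variable.importance, decreasing = TRUE))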

Haitao Du