Defining the usability / applicability of the data for machine learning

Question

Are there any indicators in the data of the maximal accuracy that machine learning solutions can achieve? Let's say you have labeled cookie data (web address, time, data about OS and browser...) for 10000 people, and you have to build ML models to predict their sex. You try a few and get 70% accuracy (compared to the 50% benchmark of simply guessing that everyone is female). How could you know that this is all the cookie data can give you, that is, that you can only get that much accuracy with that kind of data?

For me, the more causality there is between the independent and dependent variables in the data, the better the accuracy you can get from ML. But to translate this into a number so that I can sell ML projects to clients, I have to evaluate my ML models on their data. How could I know whether my models have reached the upper limit possible?

You can't know for sure. That's why you should try a few different models. — Alex R., Aug 09 '16 at 19:43
This question is like "How can you know what is in the box before you look into the box?" This is Schrodinger's cat. A decent follow-up question might be "after I have opened the box, how do I figure out how valuable the stuff inside it is?" The information could be hidden in time series, so that no single event is as informative as several together. You might google "kaggle telstra network interruption blog winner" (Mario Filho). — EngrStudent, Aug 10 '16 at 12:04
As @EngrStudent said, you do not know this *before* seeing the data, but I'd add that there are cases where you do not know this even *after* seeing the data, see: http://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless — Tim, Aug 10 '16 at 12:30
@Tim - that is a beautiful question. That is a beautiful point. I also liked the secondary answers. Economics and business pay for research, so understanding the economics of the work is also quite important. — EngrStudent, Aug 10 '16 at 14:24
Thank you all for your answers. So I guess the more you understand your data and your ML tools, the better results you will get then. In this case, the answer for the best solution is actually what Kaggle does — Vinh, Aug 11 '16 at 08:03

score 4 · Answer 1 · edited Apr 13 '17 at 12:50

It seems that we all want to know where we stand with respect to the optimal solution. Are we there yet? Should we invest more time or will it be wasted?

Unfortunately, there is a negative result saying that we cannot know this in general. In "The Strength of Weak Learnability" (1990), Robert E. Schapire showed that hard-to-learn functions are also hard to approximate (see section 6). That means that you might be trying to predict a function and getting no success even though a perfect prediction is possible.

Please note that despite this negative result, in practice you can estimate where you stand. Trying some models, as Alex R. wrote, is always a good suggestion.
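As a rough illustration of "trying some models", here is a minimal sketch that compares a majority-class baseline with two simple lookup models on a held-out split. Everything in it is an assumption made for illustration: the toy data is synthetic (the label follows the first feature 70% of the time, so the ceiling is known by construction), and the helpers `majority_baseline` and `lookup_model` are hypothetical names, not part of any library. On real data the ceiling is unknown, and models agreeing on a plateau is only a hint, not a proof.

```python
import random
from collections import Counter, defaultdict

def majority_baseline(train, test):
    """Accuracy of always predicting the most common training label."""
    top = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(y == top for _, y in test) / len(test)

def lookup_model(train, test, key):
    """Accuracy of predicting the most common training label for each key(x)."""
    votes = defaultdict(Counter)
    for x, y in train:
        votes[key(x)][y] += 1
    fallback = Counter(y for _, y in train).most_common(1)[0][0]

    def predict(x):
        counts = votes.get(key(x))  # None if this key was never seen in training
        return counts.most_common(1)[0][0] if counts else fallback

    return sum(predict(x) == y for x, y in test) / len(test)

# Hypothetical toy data: the label copies the first feature 70% of the time
# and is flipped otherwise, so about 0.70 is the best accuracy to expect.
rng = random.Random(0)
rows = []
for _ in range(10000):
    x = (rng.randint(0, 1), rng.randint(0, 2))
    y = x[0] if rng.random() < 0.7 else 1 - x[0]
    rows.append((x, y))
train, test = rows[:8000], rows[8000:]

scores = {
    "baseline": majority_baseline(train, test),
    "first feature": lookup_model(train, test, key=lambda x: x[0]),
    "both features": lookup_model(train, test, key=lambda x: x),
}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

In this toy setting the richer model does not improve on the simpler one, which is consistent with the known ceiling; with real data the same pattern could also mean that none of the models tried is expressive enough.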

Correlations (e.g., mutual information) can help too, but they are risky in both directions. For details see here.
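One direction of that risk can be sketched with a toy XOR example: each feature on its own carries zero mutual information about the label, yet the two features together determine it completely. The data and the helper `mutual_information` below are assumptions made for illustration, using the standard plug-in estimate for discrete variables.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Toy XOR data (illustrative only): the label is a XOR b.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
label = [x ^ y for x, y in zip(a, b)]

mi_a = mutual_information(a, label)                  # feature a alone
mi_b = mutual_information(b, label)                  # feature b alone
mi_ab = mutual_information(list(zip(a, b)), label)   # both features jointly

print(mi_a, mi_b, mi_ab)
```

A per-feature screening would discard both features here, even though a model that combines them can predict the label perfectly.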
