
Basically, I have a simple question:
Are there signs that tell me I just can't classify my data?


To make things clear: I have a dataset with which I try to predict the number of tire changes, but something tells me that this will be quite hard.

For example, here is a sneak peek of my training set:

  • 2148 -- 0 tire changes
  • 230 -- 1 tire change
  • 1984 -- 2 tire changes
  • 570 -- 3 tire changes
  • 2791 -- 4 tire changes
  • 889 -- 5 tire changes
  • 2645 -- 6 tire changes
  • 807 -- 7 tire changes
  • 1819 -- 8 tire changes
  • 512 -- 9 tire changes
  • 1699 -- more than 9 tire changes

As you can see, it is very imbalanced, especially for the odd numbers of tire changes.

Furthermore, my dataset has roughly 150 features, and 95% of them are dummy variables, e.g.

  • gear_4_dummy -- {0,1}
  • gear_5_dummy -- {0,1}
  • gear_6_dummy -- {0,1}
  • gear_7_dummy -- {0,1}
  • doors_3_dummy -- {0,1}
  • doors_5_dummy -- {0,1}
  • seats_2_dummy -- {0,1}
  • seats_5_dummy -- {0,1}
  • break_dummy -- {0,1}
  • berline_dummy -- {0,1}
  • sport_dummy -- {0,1}
  • ...


and only 5% of the features are numeric, e.g.

  • price -- numeric
  • tire_size -- numeric
  • end_km -- numeric
  • km_a_year -- numeric
  • car_price -- numeric
  • ...



Now the thing is, whatever I try, the results are just awful.
I've tried regression, classification, and clustering, but none of them gives a decent result.

The best I can achieve is roughly 20-35% accuracy, which is almost the same as guessing, given the imbalanced data.

For example, this is what I did recently:

  • I created a training/validation set (data > 2003 and data <= 2012)
  • I created a test set (data > 2013)
  • I randomly resampled the training/validation set so that every target class (number of tire changes) is represented equally, i.e. each class contains 460 observations (a minimal sketch of this step is shown below):
  • 460 -- 0 tire changes -- (random sample)
  • 460 -- 1 tire change -- (doubled the amount)
  • 460 -- 2 tire changes -- (random sample)
  • 460 -- 3 tire changes -- (random sample)
  • ...
  • By resampling the training set, I hoped the classifier would not be biased towards the even numbers of tire changes (in other words, the odd numbers of tire changes are also important).
  • I trained / tested on my dataset with:
  • Decision tree
  • Random forest
  • Naive Bayes
  • Logistic regression
  • Regression
  • ...

None of them made any noticeable difference.
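
For reference, a minimal sketch of the resampling step above, using pandas (the tire_changes column name is illustrative):

    # A minimal sketch of the class-balancing step, assuming a pandas DataFrame
    # `train` with a (hypothetical) tire_changes target column.
    import pandas as pd

    def balance_classes(train, target="tire_changes", n_per_class=460, seed=0):
        parts = []
        for _, group in train.groupby(target):
            # under-sample large classes; over-sample (with replacement) small ones
            parts.append(group.sample(n=n_per_class,
                                      replace=len(group) < n_per_class,
                                      random_state=seed))
        return pd.concat(parts).sample(frac=1, random_state=seed)  # shuffle rows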

But when I investigated my data more closely than before, I saw something strange, and that is basically the purpose of my question.

When I looked at the data in Weka (I had been working in Python / notebooks / sklearn), I noticed something and wanted to ask whether it is normal: all of my target classes are almost equally represented in each and every variable/parameter/bucket.

E.g., here is a plot of my target distribution after the resampling, and here are plots of other features, with the target classes shown in colour:

[Plots omitted: start year, end year, sales year, lifecycle ages, month vs new model, rw per, end km, Euro, 4th quarter, tire diameter]

As you can see, it is as if every feature contains a roughly equal proportion of each target class (I would even say it looks 80-90% random).



In contrast, if you look at the Iris data set, for example, the picture is clearly different: not every feature contains an equal proportion of each target class, and it makes much more sense that a classifier can separate that data set much better (you can almost do it just by looking at it).

My confusion matrix: [plot omitted]

Nevertheless, if someone has some good tips or ideas, or an answer to my question, I would greatly appreciate it.

Regression (sklearn)

As mentioned before, I also tried a regression model, but the results were not that good either (probably because of the lack of numeric features?).

What I tried was:

  • Loop over all the Features
    • Calculate the P-value and MSE
  • Take the Feature with the best result
  • Loop over all the Features in combination with my previous best feature set
    • Calculate the P-value and MSE
  • Take the best new feature set
  • Loop over all the Features in combination with my previous best feature set
  • ...

The above algorithm is a bit simplified (I put every improvement in a priority queue and took the best value to calculate a new regression, removing previously calculated feature sets). A rough sketch of this greedy forward selection is shown below.
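
A minimal sketch, assuming a pandas DataFrame X of candidate features and a target vector y (names are illustrative; cross-validated MSE stands in for the combined p-value/MSE criterion):

    # A minimal sketch of greedy forward feature selection by cross-validated MSE.
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features=10, cv=5):
        selected, remaining, history = [], list(X.columns), []
        while remaining and len(selected) < max_features:
            candidates = []
            for feat in remaining:
                cols = selected + [feat]
                # cross-validated MSE of a linear model on the candidate feature set
                mse = -cross_val_score(LinearRegression(), X[cols], y,
                                       scoring="neg_mean_squared_error",
                                       cv=cv).mean()
                candidates.append((mse, feat))
            best_mse, best_feat = min(candidates)   # lowest MSE wins
            selected.append(best_feat)
            remaining.remove(best_feat)
            history.append((best_mse, list(selected)))
        return history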

Below you can see a list of feature combinations and their values:

P-value, MSE, #Features, Features-used

0.23001432437,10.624070884777447,1,end_kms
0.252706916701,10.310964138620037,2,tire_diameter,end_kms
0.261575392281,10.188599117839662,3,tire_diameter,end_duration,end_kms
0.269643515788,10.077277155947542,4,personalcar_dummy,tire_diameter,end_duration,end_kms
0.276184144098,9.987031192941254,5,end_kms,tire_diameter,end_duration,schade_per,personalcar_dummy
0.280545142538,9.926859220879235,6,personalcar_dummy,end_duration,end_kms,schade_per,mnth_vs_new_model,tire_diameter
0.284413337973,9.873486822151795,7,personalcar_dummy,end_duration,end_kms,schade_per,seats2_dummy,mnth_vs_new_model,tire_diameter
0.287928576315,9.824984437619571,8,personalcar_dummy,merkcode_AL_dummy,end_duration,end_kms,schade_per,seats2_dummy,mnth_vs_new_model,tire_diameter
0.291212518569,9.779673419004128,9,personalcar_dummy,end_duration,end_kms,seats2_dummy,schade_per,berline_dummy,tire_diameter,mnth_vs_new_model,merkcode_AL_dummy
...
130 iterations further
...
0.323463084764,9.33468812619479,84,seats4_dummy,merkcode_LA_dummy,coupe_dummy,sale_mnth_1_dummy,end_duration,sale_year_2011_dummy,merkcode_RE_dummy,inzet_year_2012_dummy,euro5_dummy,lifecycle_age_class,sale_year_2010_dummy,seats2_dummy,tractie_a_dummy,berline_dummy,merkcode_IV_dummy,break_dummy,merkcode_AU_dummy,motor_pk,sale_year_2007_dummy,euro4_dummy,inzet_mnth_5_dummy,seats8_dummy,inzet_year_2007_dummy,cat_prijs,merkcode_RO_dummy,merkcode_SU_dummy,inzet_year_2009_dummy,euro3_dummy,sale_mnth_12_dummy,business_dummy,sale_year_2016_dummy,merkcode_VO_dummy,uitvoering_HYBRIDE SPORT_dummy,roadster_dummy,gear_auto_dummy,merkcode_MA_dummy,sale_b2c_dummy,tire_ratio,merkcode_PE_dummy,seats6_dummy,uitvoering_BUS_dummy,merkcode_NI_dummy,sale_mnth_5_dummy,merkcode_SS_dummy,inzet_year_2004_dummy,merkcode_DC_dummy,model_year_2008_dummy,gear7_dummy,kwart_Q4_dummy,sale_year_2012_dummy,model_year_2011_dummy,merkcode_SM_dummy,gear5_dummy,inzet_mnth_4_dummy,uitvoering_HYBRIDE_dummy,sale_mnth_7_dummy,schade_per,inzet_mnth_12_dummy,sale_year_2015_dummy,merkcode_TO_dummy,merkcode_AL_dummy,tire_diameter,inzet_year_2006_dummy,inzet_mnth_10_dummy,uitvoering_HYBRIDE HIGH_dummy,uitvoering_vip_dummy,rw_per,model_year_2012_dummy,kwart_Q2_dummy,seats5_dummy,personalcar_dummy,model_year_2007_dummy,merkcode_OP_dummy,merkcode_CV_dummy,merkcode_CH_dummy,seats9_dummy,end_kms,merkcode_FO_dummy,missing_gear_dummy,mnth_vs_new_model,seats7_dummy,sport_dummy,tire_width,merkcode_VW_dummy
Dieter
  • Wouldn't "predict the amount of tires changes" be more a *regression* problem than a *classification* one? – GeoMatt22 Apr 30 '17 at 14:54
  • @GeoMatt22 - My best p-value was 0.42 with an MSE of 10 :s – Dieter Apr 30 '17 at 14:55
  • @GeoMatt22 - I've already tried regression (sklearn), but even then my values aren't that great at all. I'm just wondering: could it be that the number of tire changes is basically a random event which has nothing to do with the type of car you're driving (random, unless I have better features)? – Dieter Apr 30 '17 at 14:59
  • Dieter: that sort of information (p-value) should be edited into your question, rather than as a comment. As for my comment, I meant you seem to be predicting a count, rather than a label. On your 2nd comment: I would be highly surprised if the # tire changes is unrelated to things like age and/or cumulative driving distance. – GeoMatt22 Apr 30 '17 at 15:03
  • 2
    Too long question, couldn't read it all, but I have a suggestion: find a more appropriate evaluation metric. Predicting (exactly) the number of tire changes might be unfeasible, but is it actually necessary? Perhaps an expected value and a prediction interval would be better. Also, give a look at ordered regression (ordered logistic regression) and poisson regression as well. – Firebug Apr 30 '17 at 15:06
  • @GeoMatt22, I've added extra info about the regression. And yes, end_km and tire width etc. have an influence, but it isn't what I would expect – Dieter Apr 30 '17 at 15:21
  • 2
    I would take @Firebug's comment seriously w.r.t. length. In particular, I do not think the extensive diagnostic figures are likely to get you a better answer, and may or may not be relevant to your issue. (At the least, you could consolidate the data distributions into one figure, cropping vs. full-on screen-caps; and also omit the Iris figures) – GeoMatt22 Apr 30 '17 at 15:31
  • Also note that a regression problem for count data may require a different approach than for a continuous regression problem or a classification problem (e.g. see [here](https://stats.stackexchange.com/questions/3024/why-is-poisson-regression-used-for-count-data)). – GeoMatt22 Apr 30 '17 at 15:49
  • 2
    I'd recommend treating this as a regression problem; if you predict 3 changes and the actual is 4, that's pretty good. If you predict 3 and the actual is 9, that's not so great. And personally I'd try a randomForest -- as a regression, not as a classifier. It's plausible to me that most of your variables aren't important, e.g. number of seats. – zbicyclist Apr 30 '17 at 15:51
  • But you may have a lot of randomness left -- you don't know the quality of the roads, whether the car is being driven past a lot of construction sites where there are nails, etc. – zbicyclist Apr 30 '17 at 15:53
  • I also find it interesting how even tire changes are more likely. It makes sense actually. You can try to use this information as well, perhaps concatenate the "classes" into "1 OR 2 changes" instead for example. – Firebug Apr 30 '17 at 15:58
  • @Firebug perhaps a result of sales pressure when replacing a problematic tire? – GeoMatt22 Apr 30 '17 at 16:46
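
Following up on the Poisson-regression suggestion in the comments, a minimal sketch with sklearn's PoissonRegressor (synthetic data; names and values are purely illustrative):

    # A minimal sketch of Poisson regression for a count target, on synthetic data.
    import numpy as np
    from sklearn.linear_model import PoissonRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.poisson(lam=np.exp(0.3 * X[:, 0] + 0.1 * X[:, 1]))

    model = PoissonRegressor(alpha=1e-3, max_iter=300)
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_poisson_deviance", cv=5)
    print("mean Poisson deviance:", -scores.mean())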

2 Answers


Besides the poorly formed, overly long question, you are using classification in an incorrect context. You need risk estimation. See http://www.fharrell.com/2017/01/classification-vs-prediction.html

Also be sure to use a proper accuracy score: http://www.fharrell.com/2017/03/damage-caused-by-classification.html
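
As a rough illustration (a sketch, not taken from the linked posts), a proper scoring rule such as the log loss evaluates predicted probabilities rather than hard class labels:

    # Hypothetical sketch: score predicted class probabilities with a proper
    # scoring rule (log loss) instead of hard-label accuracy. Synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6,
                               random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)             # probabilities, not hard labels
    print("log loss:", log_loss(y_va, proba))   # lower is better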

Frank Harrell

Very long question... However, let me say that classification algorithms usually try to pick up repeated patterns and add rules accordingly, no matter what format your data is in; it only depends on the classifier's ability to find these hidden patterns. Numerical values can sometimes cause problems if they vary widely, which leads to messy rules. You can limit the data ranges to reduce the classifier's bias towards large values, either by normalizing the data (which makes the learning process easier), or by building your own rules that turn numerical values into nominal ones. As an example of such rules:

if age ranges between 00-20 --> young
if age ranges between 20-40 --> Middle-age
if age ranges between 40-60 --> old
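
A hypothetical sketch of such binning with pandas.cut (bin edges and labels are only illustrative):

    # Turn a numeric column into nominal bins; edges/labels mirror the rules above.
    import pandas as pd

    df = pd.DataFrame({"age": [5, 18, 25, 37, 44, 59]})
    df["age_group"] = pd.cut(df["age"],
                             bins=[0, 20, 40, 60],
                             labels=["young", "middle-age", "old"])
    print(df)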

Classification may also fail if there is a large amount of missing data. If that is the case, preprocessing your data before classification is your best option.

Hope that helps.