I have an imbalanced-class classification problem on my hands, with the majority class making up 92.7% of the data. I have tried most of the standard training-data manipulation techniques, such as oversampling and undersampling (both plain and with SMOTE). And I have tried almost all the classifiers at my disposal, e.g. logistic regression, SVM, KNN, neural networks, etc.

The steps I am following:

1) Feature creation
2) Data cleaning and preprocessing (outlier detection and removal of outliers or replacing them by the mean, standardization, splitting the data into train and test sets)
3) Undersampling/oversampling on the train data
4) Applying the model to the train data
5) Testing the model on the test data
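
For concreteness, here is a minimal sketch of steps 3)–5), assuming scikit-learn and imbalanced-learn (the synthetic data is only a stand-in that mimics my 92.7% split, since my real fields are anonymous). The point is that the resampling sits inside the pipeline, so it only ever touches training data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, so SMOTE runs on train data only

# stand-in data with roughly the same 92.7 / 7.3 class split
X, y = make_classification(n_samples=250_000, n_features=40,
                           weights=[0.927], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),           # step 3: resample the train data only
    ("clf", LogisticRegression(max_iter=1000)),  # step 4: fit on the resampled train data
])
pipe.fit(X_train, y_train)                       # step 5: evaluate on the untouched X_test
```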

I am not getting more than 10% precision for the minority class, and my target is at least 60%. Am I doing anything fundamentally wrong here?

P.S. I have tried ensemble techniques as well.
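
This is roughly how I measure the minority-class precision (continuing from the pipeline sketch above; the threshold scan is only an illustration of trading recall for precision, not something I claim fixes the problem):

```python
import numpy as np
from sklearn.metrics import classification_report, precision_recall_curve

proba = pipe.predict_proba(X_test)[:, 1]   # predicted probability of the minority class
print(classification_report(y_test, (proba >= 0.5).astype(int), digits=3))

# try moving the decision threshold: best precision with recall still >= 0.6
prec, rec, thr = precision_recall_curve(y_test, proba)
ok = rec[:-1] >= 0.6                        # thr is one element shorter than prec/rec
best = np.argmax(np.where(ok, prec[:-1], 0.0))
print(f"threshold={thr[best]:.3f}  precision={prec[best]:.3f}  recall={rec[best]:.3f}")
```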

EDIT:

It is an attrition (churn) model and the sample size is 250k. The fields are not known to me; the data is anonymized.

  • Possible duplicate of [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless) – Stephan Kolassa Oct 26 '17 at 09:22
  • @StephanKolassa, I went through the question and answer you referred to. That is a forecasting problem. Can it be generalised to a classification problem? (All I want to say is that there is no time component in my data.) – Artiga Oct 26 '17 at 09:38
  • The question I linked to is agnostic as to forecasting or classification, although my answer uses a specific forecasting perspective in the *examples* I used. The underlying argumentation applies equally to classification, especially the conclusions and the bottom line sections. The same goes for the other answers. – Stephan Kolassa Oct 26 '17 at 09:56
  • Removing step (3) may be a good idea. – Tim Oct 26 '17 at 10:00
  • @CagdasOzgenc, I have added the information in the EDIT. – Artiga Oct 26 '17 at 10:00
  • @Tim, I tried that as well at first. If I remove step (3), only 1 observation gets classified into the minority class for some classifiers, at most 5 for a few, and 0 for others. – Artiga Oct 26 '17 at 10:02
  • @CagdasOzgenc Altogether there are 72 fields, of which 30 are categorical. The target variable is whether a customer unsubscribes from a particular service. It is not time-series data, so there is no question of stationarity. – Artiga Oct 26 '17 at 10:09
  • @Artiga Do you worry about correct classifications, or the number of "1"s in the output of the classifier? If you want more "1"s you can always predict "1" for all cases... – Tim Oct 26 '17 at 10:14
  • In this kind of analysis, data aging is an important aspect. I am assuming that 92.7% did not churn (otherwise it is a bad business anyhow). What's crucial is to have the same time distance between when the non-target data was collected and when the target data was observed, for all samples. You should be modeling the probability of churn within the next 12 months, for example. You cannot make a model of "ever churn". For this reason whoever supplied the data should take a snapshot point in time and tell you what happened within the next 12 months. You need at least information about the data collection. – Cagdas Ozgenc Oct 26 '17 at 10:17
  • @Tim, I worry about the correct classifications, because the cost lies in calling all those people. Those customers who are likely to churn (according to the model) will be called to prevent the unsubscription. However, 60% precision and recall for "1" would be enough. – Artiga Oct 26 '17 at 10:18
  • ...cont... If they are dumping all the data, obviously they are giving you non-aged data. New clients will not churn tomorrow, hence they are bloating your non-churn class and getting you to the 92%. All clients must have the same waiting time between the non-target and target data observation. Eliminating non-aged data is a good starting point to fix the class imbalance. – Cagdas Ozgenc Oct 26 '17 at 10:20
  • @CagdasOzgenc, thank you for the comment. So if I get information about each customer's tenure (the time they have been using the product), are you suggesting that I should chop off the new customers from the analysis, or carry out the analysis tenure-wise? – Artiga Oct 26 '17 at 10:23
  • No, it is not the same thing. A client can have been using the product for 5 years; this can be a parameter in the model, and it will be a good parameter, because long-term clients are more loyal. The length I am referring to is the time between when the non-target data is collected and when the target field is collected. For example: the client bought the product in Jan 2005, non-target data was collected in Jan 2016, and target data was collected in Jan 2017. There should be exactly one year's difference for all data points. You throw away customers that are under the 1-year waiting period. – Cagdas Ozgenc Oct 26 '17 at 10:27
  • ...cont... Another important aspect is that systems usually keep only up-to-date information on clients. This is a problem. If you are taking a snapshot of non-target data for Jan 2015, you should be getting information about the client as of that time. For example, if he wasn't married back then but is married now in the records, you should take him as unmarried. This is usually difficult to establish unless you have a good data warehouse with historical info. All fields should be collected in this manner. – Cagdas Ozgenc Oct 26 '17 at 10:31
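
To make the snapshot/aging scheme from the last few comments concrete, here is a hypothetical pandas sketch; every column name is invented for illustration, since the real fields are anonymous:

```python
import pandas as pd

# Toy stand-in for the real data; all column names here are invented.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup_date": pd.to_datetime(["2005-03-01", "2015-10-01", "2016-06-01", "2010-01-01"]),
    "churn_date":  pd.to_datetime(["2016-06-15", pd.NaT, pd.NaT, "2014-02-01"]),
})

snapshot = pd.Timestamp("2016-01-01")        # features must describe clients as of this date
end = snapshot + pd.DateOffset(months=12)    # fixed 12-month outcome window for every client

df = customers[customers["signup_date"] <= snapshot].copy()  # drop non-aged (too new) clients
df = df[~(df["churn_date"] <= snapshot)]                     # drop clients already gone

# target = churned within the 12 months after the snapshot; NaT comparisons are False, so
# still-active clients get 0
df["target"] = ((df["churn_date"] > snapshot) & (df["churn_date"] <= end)).astype(int)
```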

0 Answers