
I built a classification model (logistic regression) to classify transactions as Fraud or Not Fraud. The data relates to online CNP (Card Not Present) transactions. After choosing some parameters that seemed related to fraud, I tested the model, using a training set of 225,000 examples and a test set of 75,000.

I conducted two different tests. In the first I used 7 parameters and obtained 96% overall classification accuracy. The problem is that the number of Fraud cases is much lower than the Not Fraud ones, so on the Fraud cases alone the accuracy was only 11%, while on the Not Fraud cases it was 90-something percent.

In the second test I included more parameters, for a total of 15, with the same training and test set sizes. Although the overall accuracy dropped to 92%, the accuracy on Fraud cases improved to 30%, with Not Fraud still around 90%.

I would like to keep the overall classification accuracy around 90% and improve the accuracy on Fraud cases to something like 65-75%, but I can't find any more parameters that seem relevant to include in my model, and beyond that no more ideas come to mind. Can someone please give me some hints or ideas on what to try next to achieve these goals?

I also have another doubt. Because the values of the parameters I am using have a very wide range, I applied feature scaling and mean normalization to them. I have 300,000 example samples (training set: 225,000; test set: 75,000). My question is: should I calculate the average and max − min of each column separately for each of these sets, or should I calculate them from the whole sample (all 300,000 examples)?

Cartz

2 Answers


What proportion of your 225,000 training samples are cases of fraud? I suspect very few. This will cause problems unless you take care in how you build a classifier from the logistic regression.

Given the issue you've described, I assume you are making your classifications based on a cut-off of a probability of $0.5$ from the logistic regression. You either need to choose a more appropriate cut-off, or weight the fraud samples (a weight of $19$ would be appropriate given you have $19$ times more not fraud cases in your data set), or discard a lot of the not fraud cases so you have a balanced data set.
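For example, here is a minimal sketch in Octave of classifying with a lower cut-off. The names `theta`, `Xtest`, and `ytest` are illustrative, and the 0.05 threshold (matching a roughly 5% fraud rate) is only a starting point to be tuned on a validation set:

```
% Sketch: classify with a custom probability cut-off instead of 0.5.
probs  = 1 ./ (1 + exp(-(Xtest * theta)));   % predicted P(fraud) per case
cutoff = 0.05;                               % a lower cut-off flags more fraud
p      = probs >= cutoff;                    % 1 = Fraud, 0 = Not Fraud

fraud_acc    = mean(p(ytest == 1) == 1) * 100;   % accuracy on true Fraud cases
notfraud_acc = mean(p(ytest == 0) == 0) * 100;   % accuracy on true Not Fraud
```

Sweeping `cutoff` over a grid and recording both accuracies shows the trade-off directly, which is more informative than a single overall accuracy figure.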

As for your second question: to properly assess out-of-sample performance, I would calculate the average/min/max of each column using only your training set, then apply those same values to the test set. As an alternative to scaling you might consider binning the relevant variables.
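As a rough sketch in Octave (assuming `Xtrain` and `Xtest` hold one example per row and one feature per column):

```
% Sketch: compute scaling parameters on the training set only,
% then apply the same parameters to the test set.
mu = mean(Xtrain);                 % per-column average (training data only)
r  = max(Xtrain) - min(Xtrain);    % per-column range
r(r == 0) = 1;                     % guard against constant columns

Xtrain_s = (Xtrain - mu) ./ r;     % mean normalization + feature scaling
Xtest_s  = (Xtest  - mu) ./ r;     % reuse the TRAINING parameters here
```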

M. Berk
  • Yes, in the training sample, the number of Fraud cases is much smaller than the cases of Not Fraud (like 5% Fraud, 95% Not Fraud). I thought about picking all the cases of Fraud and an equal number of Not Fraud cases, but would that make a difference, since all those cases are used in the training sample anyway? Regarding my second question, what you are saying is to use the calculated average/min/max (based on the training set), and apply these values to the test set? Is that it? – Cartz Jan 31 '14 at 20:12
  • I've edited my answer to go into a bit more detail about the issue with the unbalanced data. For the second question, yes, calculate the parameters of the transformation based on your training set and apply them to the test set – M. Berk Feb 01 '14 at 11:57
  • Thanks for the reply. I did as you suggested and used a training set with approximately the same number of positive and negative cases (5,000 each). I repeated the tests, and in both cases (7 and 15 parameters) the results were very similar: overall classification accuracy dropped to 50%, but the proportion of Fraud cases correctly classified also improved to 50%. I would like to try applying weights, but regarding that I have this very same problem: http://stats.stackexchange.com/questions/65382/adding-weights-for-highly-skewed-data-sets-in-logistic-regression – Cartz Feb 03 '14 at 10:52
  • I'm not sure what the link says about there being a problem with using weights. Either using weights or using a different cut-off should help with both your issue and the linked issue (which are the same). – M. Berk Feb 03 '14 at 11:06
  • Sorry, you misunderstood me. There is no problem with using weights. The problem is that I don't know how to apply them in my implementation of logistic regression, exactly the same problem referred to in the link. – Cartz Feb 03 '14 at 11:36
  • OK, now I understand. What software are you using to fit the logistic regression? – M. Berk Feb 03 '14 at 11:38
  • I'm using Octave, and in my implementation I used the very same equations that the person in the other link used. – Cartz Feb 03 '14 at 11:43
  • Unfortunately I'm not familiar with `Octave` but in `R` it's trivial to specify weights for each observation by setting the `weights` parameter to `glm()`. I would expect it to be equally straightforward for any statistical software (a sketch of a weighted cost function in Octave appears after this thread). Alternatively you could avoid the issue of setting weights by focusing on the classifier cut-off instead. – M. Berk Feb 03 '14 at 11:46
  • Thanks for your patience @M.Berk. Could you please be more specific about what I have to do regarding the cut-off? – Cartz Feb 03 '14 at 11:50
  • How are you currently performing classification? In other words, how do you go from your logistic regression, the output of which is the probability of each case being fraud, to assigning a label to each case? – M. Berk Feb 03 '14 at 12:13
  • I have a function _oneVsAll_ that receives an _X_ training sample and the respective _y_, the num_labels(2), and _lambda_(0.1). Based on these parameters I use the Octave function _fminunc_ and get an [all_theta]. Then in order to classify the test batch, I have a function _predictOneVsAll_ that receives all_theta and X2 (the Test sample), and I pass the max of each row from sigmoid(X2 * all_theta') to a vector _p_. Then I compare each value of _p_ with _y2_ – Cartz Feb 03 '14 at 12:36
  • But $p$ is a probability and $y2$ is a label (either 0 or 1) correct? Perhaps you need to explain more about what you mean by classifier accuracy in your question. – M. Berk Feb 03 '14 at 13:14
  • No, _p_ is the vector of predictions, based on the max of each row from sigmoid(X2 * all_theta'). _y2_ is a label. The classifier accuracy is the (mean(double(p == y2)) * 100) – Cartz Feb 03 '14 at 13:33
  • What are the dimensions of X2 and all_theta? (I'm confused as to why their product would not be a vector) Isn't the output of the sigmoid function a value between 0 and 1? (I still don't get how the "max of each row from sigmoid(X2*all_theta')" gives a value of either 0 or 1, not between 0 and 1) – M. Berk Feb 06 '14 at 08:22
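Since the thread above leaves open how to apply observation weights in a hand-rolled Octave implementation, here is a rough sketch of one way to do it. The function name `weightedCost` and the weight vector `w` are illustrative, not taken from the question's actual code; `w` could hold 19 for fraud rows and 1 otherwise, mirroring the weighting suggested in the answer:

```
% Sketch: per-observation weights in a regularized logistic regression
% cost, usable with fminunc as in the question's setup.
function [J, grad] = weightedCost(theta, X, y, w, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                      % sigmoid
  J = -(1/m) * sum(w .* (y .* log(h) + (1 - y) .* log(1 - h))) ...
      + (lambda / (2*m)) * sum(theta(2:end) .^ 2);       % don't penalize bias
  grad = (1/m) * (X' * (w .* (h - y)));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
end

% Usage, mirroring the question's fminunc setup:
%   options = optimset('GradObj', 'on', 'MaxIter', 400);
%   theta = fminunc(@(t) weightedCost(t, X, y, w, lambda), init_theta, options);
```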

Depending on which implementation of logistic regression you're using, scaling will probably already be done. And yes, normalizing your features is very important, or the least squares cost at the heart of your algorithm will have some problems. Here's a python standard scaler if you can't find one elsewhere.

As an additional note, the overall "accuracy" isn't something you should worry about, since the recall of the fraud cases is probably more important. Recall is just the number of fraud cases you correctly predict divided by the total number of true fraud cases.
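In the question's Octave setup, recall (and precision) could be computed along these lines, using the prediction vector `p` and labels `y2` from the comments above, with 1 = Fraud (a sketch, assuming binary labels):

```
% Sketch: recall and precision of the Fraud class.
recall    = sum(p == 1 & y2 == 1) / sum(y2 == 1);   % fraction of fraud caught
precision = sum(p == 1 & y2 == 1) / sum(p == 1);    % flagged cases that are real
```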

EDIT:

@Drew75 pointed out that I misunderstood how the coefficients are estimated. Scaling shouldn't matter for this particular classifier.

eric chiang
  • Is this true? Logistic regression doesn't work in the same way as a linear (ordinary least-squares) regression. As far as I know, the fitting criterion for the logit is the deviance, so scaled vs. unscaled shouldn't have an impact. – Drew75 Jan 31 '14 at 18:14
  • I thought that logistic regression still determines linear coefficients before making classifications. Least squares is usually involved in the cost function of that process, which is probably affected by scaling. That's not the only way to determine those coefficients, but I guess it might depend on the logistic regression algorithm? Damn, now I've got a question. – eric chiang Jan 31 '14 at 18:26
  • There are linear coefficients, but they are estimated in relation to the link function (here the logit) via an iterative process. Unlike OLS, there is no closed-form solution. – Drew75 Jan 31 '14 at 18:30
  • My bad, post edited. – eric chiang Jan 31 '14 at 18:48