
I am working on telecom data for churn modelling. My data set has 18 categorical and 2 numeric variables (total charges and monthly charges). After handling the missing values, I checked for outliers. The boxplots and boxplot statistics show that I have quite a lot of outliers. Please see the output below:

> data=dataset_churn
> boxplot.stats(data$TotalCharges)
$stats
[1]   18.800  403.775 1411.900 3867.800 8684.800

$n
[1] 7043

$conf
[1] 1346.683 1477.117

$out
 [1] 676695 707635  15669  16195  51125  53764 116405  47331  10726 395315  13986 203305 490485  20488 535645 216505  58694  71113 703065 630085  30555 788725 275785
[24]  60299  18131 309465 747585 469065  18605 486935 367315 144365   9187  20127 154035 227185 318295 296405  23754 439125 614585 584465  60815  68975  11086 266675
[47]  40264  31716 561775  16926  10685  75443
> 
> 
> 
> boxplot.stats(data$MonthlyCharges)
$stats
[1]  18.25  35.80  70.50  90.15 118.75

$n
[1] 7043

$conf
[1] 69.47676 71.52324

$out
[1]  2015  9635 10845  2525  2025   192  7385   443  1051  9385   356   943   644  8315  8215  8485   786  8985   854 10895  1005  8545  4965 10955  8335  8545  2635
[28]   693  6565   243  1079   948  9475  8405  7465  5515  1053   257  6995  1033   209   207   698  8505 10435  9125  5395   998  4455 10645  8985  1965   203   416
[55]  5505  2035   354   686   443   557   988   744  3545 10495

    

My question is how to handle these outliers. I tried sqrt() and log() transformations, but neither worked. So I thought that removing all the outliers, or replacing them with the median of the data, might be an option. (But none of the code I wrote or found worked; I always get the error below.) Would you recommend deleting or replacing the data?

Or do you have any other recommendation for me?

I tried to remove the outliers with the code below, but it does not work and gives me the following error:

> outliers <- boxplot(data$TotalCharges, plot=FALSE)$out

> data$TotalCharges[which(data$TotalCharges %in% outliers),]
Error in data$TotalCharges[which(data$TotalCharges %in% outliers), ] : 
  incorrect number of dimensions

> data$TotalCharges = data$TotalCharges[-which(data$TotalCharges %in% outliers),]
Error in data $ TotalCharges[-which(data $ TotalCharges %in% outliers),  : 
  incorrect number of dimensions
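
On the error itself: `data$TotalCharges` is a plain vector, so it takes a single index with no comma; the comma form belongs on the data frame. Below is a minimal sketch of both the removal and the median replacement, assuming a small made-up data frame `toy` (hypothetical values, not the churn data) in place of `dataset_churn`. Whether the flagged values should be removed at all is a separate question.

```r
# Toy stand-in for the churn data (hypothetical values, for illustration only)
toy <- data.frame(
  TotalCharges   = c(18.8, 403.8, 1411.9, 3867.8, 8684.8, 676695),
  MonthlyCharges = c(18.25, 35.8, 70.5, 90.15, 118.75, 70.5)
)

# Values flagged by the boxplot rule (more than 1.5 * IQR beyond the hinges)
outliers <- boxplot.stats(toy$TotalCharges)$out
is_out   <- toy$TotalCharges %in% outliers

# A vector takes a single index, with no comma
toy$TotalCharges[is_out]

# Dropping whole rows: index the data frame, where the comma is required
toy_trimmed <- toy[!is_out, ]

# Replacing flagged values with the median of the unflagged ones
toy_replaced <- toy
toy_replaced$TotalCharges[is_out] <- median(toy$TotalCharges[!is_out])
```

The logical index `!is_out` is used instead of `-which(...)` because `-which(...)` returns an empty result when nothing is flagged.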

I am quite new to data analysis and am struggling a lot with my very first assignment. I would really appreciate any help you might provide!

  • Why not just leave them in? Do you suspect something about those points is incorrect (measurement error or a typo, for instance)? – Dave Sep 09 '21 at 11:48
  • Hey Dave thank you for your response. I believe one of the assumptions of logistic regression is to remove the outliers. And indeed looking at the rest of the data, I suspect that there is a mistake with the very high figures. That's why I decided to remove or replace them. – newbie-data-student Sep 09 '21 at 11:59
  • Your belief is incorrect: logistic regression makes no assumptions about the presence or absence of "outliers." I use quotation marks here because what constitutes an outlier depends on what you compare it to. It looks like you use univariate methods to identify extreme values. That's OK, but it's practically irrelevant for a regression analysis. You should instead be concerned about *high leverage* observations and observations with *extreme residuals.* – whuber Sep 09 '21 at 12:21
  • You did not give us any details of your logistic regression. It might well be unaffected by your univariate "outliers". Give us some details ... but make the regression robust at the outset, spline the continuous predictors, ... See https://stats.stackexchange.com/questions/169348/how-should-i-check-the-assumption-of-linearity-to-the-logit-for-the-continuous-i – kjetil b halvorsen Sep 09 '21 at 12:44
  • Hi all, thanks for the responses. I am trying to model churn using 18 categorical variables (such as gender, age category, whether the customer has phone or internet service, etc.) and 2 numeric variables (total charges and monthly charges). When I check the mean and median of my continuous variables, and their plots, I see that both of them are highly skewed and have outliers (50-60 out of about 7000 observations). According to some online sources, outliers might create issues and need to be handled before the logistic regression model is built. That's why I wanted to fix the outlier issue. – newbie-data-student Sep 09 '21 at 12:50
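
The comments above point to leverage and residuals rather than univariate extremes. A rough sketch of how those diagnostics could be pulled from a fitted logistic regression follows. It assumes simulated data: the data frame `sim`, its columns, and the coefficients used to generate `Churn` are all made up for illustration. The cut-offs (twice the average leverage, absolute standardized residual above 2) are common rules of thumb, not fixed thresholds.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the churn data (hypothetical columns and effects)
sim <- data.frame(
  MonthlyCharges = runif(n, 18, 119),
  Contract       = factor(sample(c("Monthly", "Yearly"), n, replace = TRUE))
)
sim$Churn <- rbinom(n, 1, plogis(-2 + 0.03 * sim$MonthlyCharges))

# Logistic regression of churn on one numeric and one categorical predictor
fit <- glm(Churn ~ MonthlyCharges + Contract, data = sim, family = binomial)

lev  <- hatvalues(fit)        # leverage of each observation
res  <- rstandard(fit)        # standardized deviance residuals
cook <- cooks.distance(fit)   # influence of each observation on the fit

# Rule-of-thumb flags: leverage above 2 * p / n, or |residual| above 2
p    <- length(coef(fit))
flag <- which(lev > 2 * p / n | abs(res) > 2)

# Inspect the flagged rows alongside their diagnostics
head(cbind(sim[flag, ], leverage = lev[flag], residual = res[flag]))
```

Rows flagged this way are candidates for a closer look (for example, checking for data-entry errors), not automatic deletions.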
