I am working on telecom data for churn modelling. My data set has 18 categorical and 2 numeric variables (total charges and monthly charges). After handling the missing values, I checked for outliers. The boxplot and the boxplot statistics show that I have quite a lot of outliers. Please see the output below:
> data=dataset_churn
> boxplot.stats(data$TotalCharges)
$stats
[1] 18.800 403.775 1411.900 3867.800 8684.800
$n
[1] 7043
$conf
[1] 1346.683 1477.117
$out
[1] 676695 707635 15669 16195 51125 53764 116405 47331 10726 395315 13986 203305 490485 20488 535645 216505 58694 71113 703065 630085 30555 788725 275785
[24] 60299 18131 309465 747585 469065 18605 486935 367315 144365 9187 20127 154035 227185 318295 296405 23754 439125 614585 584465 60815 68975 11086 266675
[47] 40264 31716 561775 16926 10685 75443
> boxplot.stats(data$MonthlyCharges)
$stats
[1] 18.25 35.80 70.50 90.15 118.75
$n
[1] 7043
$conf
[1] 69.47676 71.52324
$out
[1] 2015 9635 10845 2525 2025 192 7385 443 1051 9385 356 943 644 8315 8215 8485 786 8985 854 10895 1005 8545 4965 10955 8335 8545 2635
[28] 693 6565 243 1079 948 9475 8405 7465 5515 1053 257 6995 1033 209 207 698 8505 10435 9125 5395 998 4455 10645 8985 1965 203 416
[55] 5505 2035 354 686 443 557 988 744 3545 10495
My question is: I am struggling to decide how to handle these outliers. I tried sqrt() and log() transformations, but neither of them worked. So I thought that removing all the outliers, or replacing them with the median of the data, might be an option. (But none of the code I wrote or found worked; I always get the error below.) Would you recommend deleting or replacing the data?
Or do you have any other recommendation for me?
Also, I tried to remove the outliers with the code below, but it does not work and gives me the following error:
> outliers <- boxplot(data$TotalCharges, plot=FALSE)$out
> data$TotalCharges[which(data$TotalCharges %in% outliers),]
Error in data$TotalCharges[which(data$TotalCharges %in% outliers), ] :
incorrect number of dimensions
> data$TotalCharges = data$TotalCharges[-which(data$TotalCharges %in% outliers),]
Error in data $ TotalCharges[-which(data $ TotalCharges %in% outliers), :
incorrect number of dimensions
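For reference, here is a minimal sketch of the indexing behind this error. It assumes the cause is the trailing comma: data$TotalCharges is a plain vector, so it takes a single subscript, while [rows, ] is only valid on the data frame itself. The toy data frame and the names is_outlier, data_trimmed and data_imputed are made up for illustration and are not from the real data set.

```r
# Toy data frame standing in for dataset_churn (values are invented)
data <- data.frame(
  customer     = 1:8,
  TotalCharges = c(10, 20, 30, 40, 50, 60, 70, 5000)
)

# boxplot.stats() returns the points beyond the whiskers in $out
outliers <- boxplot.stats(data$TotalCharges)$out

# A vector has one dimension: index it with one subscript and no comma.
# A logical mask also avoids the pitfall of -which(), which selects
# nothing at all when there are no outliers.
is_outlier <- data$TotalCharges %in% outliers
flagged    <- data$TotalCharges[is_outlier]

# Option 1: drop whole rows. Here the comma is needed, because the
# object being indexed is the two-dimensional data frame.
data_trimmed <- data[!is_outlier, ]

# Option 2: keep the rows and replace flagged values with the median
# of the values that were not flagged.
data_imputed <- data
data_imputed$TotalCharges[is_outlier] <- median(data$TotalCharges[!is_outlier])
```

Whether dropping or replacing is appropriate is a separate question from the syntax. The sketch only shows that both can be written without triggering the dimension error.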
I am quite new to data analysis and am struggling a lot with my very first assignment. I would really appreciate any help you might provide!