1

I'm making a logistic regression model but am unsure about whether it is right or not to do the following:

I'm trying to predict whether a person will buy a high-cost hotel, defined as hotel_spend > 250. I know columns such as flight_spend and vehicle_spend are acceptable inputs to the model, but I'm unsure whether I can use total_spend, since it contains hotel_spend, which is used to create the target. The last row highlights this: hotel_spend == total_spend, and both exceed 250. My head tells me I shouldn't do this, as I would be using hotel_spend to predict whether they will spend a certain amount on a hotel.

I'm looking for advice on whether this is acceptable. I don't think it should be done, but I'd like other opinions.

flight_spend  hotel_spend  vehicle_spend  total_spend  hotel_spend_high_spend_label
          20           49             33          102                             0
           0           59              0           59                             0
          65          100             40          205                             0
         150          250             50          450                             1
           0          300              0          300                             1
rolando2
Fungie

2 Answers

2

You shouldn't use any variables that you won't know at the time of prediction. You won't know hotel_spend when you make predictions (it is what you're predicting), and since total_spend = flight_spend + hotel_spend + vehicle_spend, you won't know total_spend either. In general, don't use a variable that depends on the value of what you're predicting. I wrote a blog post that may help elaborate on the issue: https://content.nexosis.com/blog/ensuring-data-is-ready-for-machine-learning-algorithms
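To make the leakage concrete, here is a minimal sketch using the five sample rows from the question. Because total_spend is the sum of the three spend columns, hotel_spend (and therefore the label) can be recovered exactly by subtraction, with no model at all. The function name `label_from_leaky_features` is hypothetical, and the sketch assumes a cutoff of `>= 250`, because the sample table labels the row with hotel_spend of 250 as high.

```python
# Sample rows from the question:
# (flight_spend, hotel_spend, vehicle_spend, total_spend, hotel_spend_high_spend_label)
rows = [
    (20, 49, 33, 102, 0),
    (0, 59, 0, 59, 0),
    (65, 100, 40, 205, 0),
    (150, 250, 50, 450, 1),
    (0, 300, 0, 300, 1),
]

# Assumption: the sample table labels hotel_spend == 250 as high, so use >= here.
THRESHOLD = 250


def label_from_leaky_features(flight, vehicle, total, threshold=THRESHOLD):
    """Recover the label from the inputs alone (hypothetical helper)."""
    # total_spend = flight_spend + hotel_spend + vehicle_spend, so subtract.
    recovered_hotel = total - flight - vehicle
    return int(recovered_hotel >= threshold)


recovered_labels = [
    label_from_leaky_features(flight, vehicle, total)
    for flight, _hotel, vehicle, total, _label in rows
]
true_labels = [label for *_others, label in rows]

# The "prediction" matches the target on every row without fitting anything.
print(recovered_labels == true_labels)
```

A model given total_spend alongside the other two columns can learn the same subtraction, so its apparent accuracy would say nothing about how it performs when hotel_spend is genuinely unknown.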

Danna
  • The advice not to use variables you won't know at the time of prediction only matters if the purpose is forecasting. If @Fungie's motivation is to understand the drivers of decision-making, there's nothing wrong with using flight_spend and vehicle_spend. – Peter Ellis Feb 06 '18 at 20:27
  • @Fungie specifically said they are using the model for prediction, so the comment is relevant – Danna Feb 06 '18 at 20:28
  • I took that to mean "predict, given the data I've got", not that the ultimate end case is to "predict for a new customer who comes in through the door or onto my website". @Fungie perhaps you could explain the purpose of the model? – Peter Ellis Feb 06 '18 at 20:35
  • @PeterEllis Even if that is what is meant by the question, you still would not know total_spend if you are predicting hotel_spend and so this variable should not be used. If you're saying you know total_spend as well as flight_spend and vehicle_spend, why are you building a model for a variable that is just a linear combination of known values? – Danna Feb 06 '18 at 21:48
  • I want to build a model that gives the probability of `hotel_spend_high_spend_label`. I think using `flight_spend` and `vehicle_spend` is OK, but not `total_spend`, as it contains information about the target. The point is to find people who have similar characteristics to the people who got `hotel_spend_high_spend_label` so that we can send promotions etc. to them. I hope this frames my question better. – Fungie Feb 06 '18 at 22:01
  • @Fungie assuming you will know vehicle_spend and flight_spend for the people you are thinking about sending promotions to, yes, you can build a model with just those. You wouldn't include hotel_spend in the model, and would use just hotel_spend_high_spend_label as the target. – Danna Feb 06 '18 at 22:34
  • @Danna I'd just like to confirm that using `total_spend` would be incorrect? – Fungie Feb 06 '18 at 22:52
  • @Fungie correct, you shouldn't use total_spend – Danna Feb 07 '18 at 17:20
1

Correct, you shouldn't use total_spend because, as you say, that variable in a sense already encapsulates the response variable you are trying to model. However, it is fine to use flight_spend and vehicle_spend if your purpose is to understand decision-making and the relationship between spend types.

BTW, my experience with data like this is that the relationships will be non-linear, so you will need to transform the explanatory variables in some fashion. As you have numerous values of zero, you don't want to do this by taking logarithms. A square root, or more generally a Box-Cox-like transformation, is one relatively straightforward way of dealing with this; an alternative would be a generalized additive model with splines or similar based on flight and vehicle spend.
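As a rough illustration of the transformation point, the sketch below applies a square root and a simple Box-Cox-style power transform to the flight_spend values from the question, and shows that a plain logarithm fails on the zeros. The helper `power_transform` and the choice of `lam=0.5` are assumptions for illustration only, not a recommendation for any particular dataset.

```python
import math


def power_transform(x, lam=0.5):
    """Box-Cox-style power transform (hypothetical helper).

    Defined for x >= 0 when lam > 0, so zero spend is handled without error.
    """
    if lam <= 0:
        raise ValueError("this sketch only covers lam > 0, which tolerates zeros")
    return (x ** lam - 1.0) / lam


# flight_spend values from the sample table, including the zeros.
flight_spend = [20, 0, 65, 150, 0]

sqrt_flight = [math.sqrt(x) for x in flight_spend]
power_flight = [power_transform(x) for x in flight_spend]

# A plain logarithm is undefined at zero, which is why it is a poor fit here.
try:
    math.log(0)
except ValueError:
    log_of_zero_fails = True
else:
    log_of_zero_fails = False

print(sqrt_flight)
print(power_flight)
print(log_of_zero_fails)
```

With `lam=0.5` the power transform is just a shifted and rescaled square root, so both options compress large spends while keeping zero as a valid input.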

Peter Ellis