
My friend and I decided to learn about tuning the hyperparameters of XGBoost using grid search with cross-validation for a regression problem. We decided to tune parameters such as maximum depth, column sampling, and so on. After the grid search finished, we had the best value of each parameter among the candidate values we had decided to check.
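
For concreteness, here is a minimal sketch of that kind of search, assuming scikit-learn and the xgboost Python package; the candidate values below are arbitrary illustrations, not the ones we actually used:

```python
# Minimal grid search sketch: XGBoost regression + cross-validation.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

param_grid = {
    "max_depth": [2, 4, 6, 8],            # tree depth
    "colsample_bytree": [0.5, 0.75, 1.0], # column subsampling
    "reg_alpha": [0.0, 0.1, 1.0],         # L1 regularization
}

search = GridSearchCV(
    XGBRegressor(n_estimators=200, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```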

After practicing grid search, my friend decided to dig a little deeper. He redid his grid search with a twist: instead of trying multiple hyperparameters at once, he grid-searched only one hyperparameter at a time.

He noticed that for every hyperparameter, as he increased its value, the loss would decrease up to a point, after which the loss kept increasing as the value of the hyperparameter increased.

For example, as he increased the maximum depth, the loss decreased (possibly because deeper trees reduce underfitting), but if he went further, the loss increased again (possibly now due to overfitting). Another example is L1 regularization (lasso). If he increased the L1 penalty a little, the loss decreased (possibly because the L1 penalty zeroes out irrelevant features), but if he increased it a little more, the loss increased again (possibly because the L1 penalty became so strong that it also zeroed out relevant features).
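
Here is a sketch of the kind of one-parameter sweep he ran, again assuming scikit-learn and xgboost, with all other hyperparameters held at their defaults; typically the printed losses fall and then rise again, which is the underfit-then-overfit pattern described above:

```python
# Sweep one hyperparameter (max_depth) while keeping everything else fixed.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

for max_depth in [1, 2, 3, 4, 6, 8, 10, 12]:
    model = XGBRegressor(n_estimators=200, max_depth=max_depth, random_state=0)
    mse = -cross_val_score(model, X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    print(f"max_depth={max_depth:2d}  CV MSE={mse:.1f}")
```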

So my friend observed that the loss, viewed as a function of a single hyperparameter, resembles a unimodal function, and he then asked: "If the loss curve is unimodal, why don't we just ternary-search it?"

The idea is that we can reduce the complexity of hyperparameter tuning. If there are $N$ hyperparameters and for each one we want to check $M$ candidate values, grid search with cross-validation costs $O(M^N)$ model evaluations. If instead we ternary-search each hyperparameter one at a time, the cost would be $O(N \log P)$, where $P$ is the size of the search range (the difference between the upper and lower bounds).
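
Here is a minimal sketch of ternary search over a single integer hyperparameter, assuming the cross-validated loss really is strictly unimodal in it; `cv_loss` below is a hypothetical stand-in for an actual cross-validation run:

```python
from functools import lru_cache

def ternary_search(f, lo, hi):
    """Return the integer in [lo, hi] minimizing f, assuming f is strictly
    unimodal on that range. Uses O(log(hi - lo)) evaluations of f."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            hi = m2 - 1   # the minimum cannot lie to the right of m2
        else:
            lo = m1 + 1   # the minimum cannot lie to the left of m1
    return min(range(lo, hi + 1), key=f)

@lru_cache(maxsize=None)   # cache so repeated probes are not recomputed
def cv_loss(max_depth):
    # Hypothetical stand-in: in practice this would run cross-validation
    # with the given max_depth and return the mean validation loss.
    return (max_depth - 7) ** 2

best = ternary_search(cv_loss, 1, 100)
print(best)  # 7, found with far fewer than 100 loss evaluations
```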

I speculated that his algorithm would suffer because it can only try one hyperparameter at a time, but he responded that we can use our knowledge to prioritize which hyperparameter to try first, so it would not suffer too badly. For example, we know we should prioritize maximum depth over the regularization parameters because it has a bigger impact on the model. A sketch of this coordinate-wise procedure is shown below.
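
The following sketch shows what his full procedure might look like: tune each hyperparameter in turn, in a priority order we choose, ternary-searching an index into a sorted list of candidate values. The ordering and candidate lists are assumptions for illustration, and the unimodality of the CV loss is taken on faith:

```python
# Coordinate-wise tuning: ternary-search one hyperparameter at a time.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

def cv_mse(params):
    model = XGBRegressor(n_estimators=200, random_state=0, **params)
    return -cross_val_score(model, X, y,
                            scoring="neg_mean_squared_error", cv=5).mean()

def ternary_search_index(f, lo, hi):
    # Same integer ternary search as above, over list indices.
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            hi = m2 - 1
        else:
            lo = m1 + 1
    return min(range(lo, hi + 1), key=f)

# Higher-impact parameters first, as suggested above.
search_space = [
    ("max_depth", [1, 2, 3, 4, 5, 6, 8, 10, 12]),
    ("reg_alpha", [0.0, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0]),
]

best = {}
for name, candidates in search_space:
    f = lambda i: cv_mse({**best, name: candidates[i]})
    i = ternary_search_index(f, 0, len(candidates) - 1)
    best[name] = candidates[i]   # fix this parameter, move to the next
print(best)
```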

Is my friend's idea actually workable?

1 Answer

No, it is not guaranteed to work, but yes, this is what people often do. It is a bad idea because by optimizing one dimension at a time you ignore the fact that the dimensions can interact with each other.

Imagine that you are considering buying a car. You go to a web page that lists cars and first sort them by price. You pick the cheapest cars, and then among those you pick the one with the best kilometers-per-liter ratio. Does this strategy guarantee finding the car with the best combination of features? Surely not. It does if you have a limited budget and can afford only the cheapest car, but otherwise it does not give you a globally optimal solution. It can be, and possibly is, true that if you pay a little more for the car, you can find one with a better kilometers-per-liter ratio and spend less in the long run.
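
To make the same point numerically, here is a toy illustration with a made-up quadratic loss that has an interaction term (not a real model): a single pass of one-dimension-at-a-time minimization lands away from the joint optimum.

```python
def loss(d, r):
    # Minimum is at d=4, r=2 with loss 0; the cross term couples d and r.
    return (d - 4) ** 2 + (r - 2) ** 2 + 1.8 * (d - 4) * (r - 2)

grid = [x / 100 for x in range(0, 1001)]  # candidate values 0.00 .. 10.00

r = 0.0                                  # start with r fixed
d = min(grid, key=lambda d: loss(d, r))  # optimize d alone -> d = 5.8
r = min(grid, key=lambda r: loss(d, r))  # then r alone     -> r = 0.38
print(d, r, loss(d, r))                  # loss ~0.62, not the true 0.0
```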

If you want to speed things up, it is better to use random search: it is faster, and in many cases it is more efficient than grid search.
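
For example, a sketch using scikit-learn's RandomizedSearchCV with xgboost; the distributions and evaluation budget below are arbitrary choices:

```python
# Random search: sample 30 hyperparameter combinations from distributions.
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

param_distributions = {
    "max_depth": randint(1, 12),            # integers 1..11
    "colsample_bytree": uniform(0.5, 0.5),  # uniform on [0.5, 1.0]
    "reg_alpha": uniform(0.0, 10.0),        # uniform on [0.0, 10.0]
}

search = RandomizedSearchCV(
    XGBRegressor(n_estimators=200, random_state=0),
    param_distributions,
    n_iter=30,            # fixed evaluation budget
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```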

Finally, as I said in the beginning, it is often the case that you first do a coarse initial grid search, then fix some parameters and explore only the chosen ones further. This is not a pretty solution, but sometimes it is the only feasible one.
