
I have been trying, with no luck, to find the correct algorithm for the following two scenarios, and I can't seem to get it right.

First scenario

Every day I get data like the following:

+---------+----------+----------+----------+----------+---------+
| day     | keyword1 | keyword2 | keyword3 | keyword4 | success |
|         | clicks   | clicks   | clicks   | clicks   |         |
+---------+----------+----------+----------+----------+---------+
| day1    | 10       | 3        | 5        | 9        | 76      |
| ...     | ...      | ...      | ...      | ...      | ...     |
+---------+----------+----------+----------+----------+---------+

Success is a value that measures how well that day's clicks performed across different metrics (conversions, etc.).

Is there any algorithm I can use to assign a weight to each keyword in relation to the success across various days?

I thought of association rules and decision trees, but I can't quite see how those would help me.

Second scenario

... which is pretty similar, where the data has the following structure:

+------------+---------+---------+---------+---------+
| importance | value 1 | value 2 | value 3 | value 4 |
+------------+---------+---------+---------+---------+
|          1 |      18 |      21 |      35 |      25 |
|          2 |      93 |      36 |      11 |      56 |
|          3 |      34 |      26 |      47 |      47 |
|          1 |      19 |      20 |      10 |      23 |
|          1 |      17 |      20 |       3 |      25 |
+------------+---------+---------+---------+---------+

In this case, what I am trying to do is understand how the different values affect the importance value.

In the example above you can easily see that for importance=1, value 1, value 2, and value 4 are "close" to one another, while value 3 is not.


1 Answer


For the first scenario, any decent decision-tree approach should work fine. A couple of mature options are gbm or randomForest in R. Each can be trained in "regression" mode with a continuous response, and both then offer feature importance scores. Be sure to read the documentation and use the permutation-based importance scores for the most reasonable results. randomForest is likely the easiest; you can get reasonable answers without much tuning. When training, be sure to set importance=TRUE, and when calling importance(), look at the first metric returned.
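Here's a minimal sketch of what that looks like, assuming your daily data is in a data frame called `clicks` with the keyword click counts and a numeric `success` column (the data frame and column names are just placeholders for your own):

```r
# Assumes a data frame `clicks` with columns keyword1..keyword4 and success
library(randomForest)

set.seed(42)  # permutation importance has some randomness; fix the seed
rf <- randomForest(success ~ keyword1 + keyword2 + keyword3 + keyword4,
                   data = clicks,
                   importance = TRUE)  # compute permutation-based importance

# type = 1 returns the first metric (%IncMSE), the permutation-based
# importance score for a regression forest
importance(rf, type = 1)
```

The higher a keyword's %IncMSE, the more shuffling that keyword's values hurts the forest's predictions of `success`, which is roughly the per-keyword "weight" you're after.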

If you're more focused on predictive accuracy, even randomForest will need some tuning.
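If you do get to that point, the randomForest package ships with a tuneRF() helper that searches over mtry (the number of candidate variables tried at each split), which is the main knob worth turning. A rough sketch, again using the hypothetical `clicks` data frame from above:

```r
library(randomForest)

x <- clicks[, c("keyword1", "keyword2", "keyword3", "keyword4")]
y <- clicks$success

# Step mtry up/down by a factor of 2, keep going while OOB error improves
# by at least 5%; doBest = TRUE returns the forest fit with the best mtry
tuned <- tuneRF(x, y, stepFactor = 2, improve = 0.05, doBest = TRUE)
```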

I think the same type of analysis could help with your second problem, but I'm not as sure of the details there.
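To make that concrete, the second table could be handled the same way, treating importance as the response (again, `obs` and the column names below are placeholders):

```r
library(randomForest)

# Assumes a data frame `obs` with columns importance and value1..value4
rf2 <- randomForest(importance ~ value1 + value2 + value3 + value4,
                    data = obs,
                    importance = TRUE)
importance(rf2, type = 1)
```

If importance is really an ordered category rather than a continuous quantity, you could instead fit a classification forest by converting it with `factor(importance)` first.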

  • Thanks a lot Shea, I was also starting to analyze the possibility of using an SVN; do you think it's a good idea? I will add cross-validation to check the performance of both anyway. Thanks a lot for your time :) – Yak Jul 31 '12 at 15:10
  • Did you mean an SVM? I do not think there is an easy way to get a measure of feature importance out of an SVM. There are likely some wrappers to do permutation-based importance scores, however. There are many hyperparameters to tune in an SVM though, so expect to spend some time getting reasonable answers. – Shea Parkes Jul 31 '12 at 15:46
  • Oops, yeah, I meant SVM. Thanks again, I will begin testing with Random Forest. I am concerned I will end up with lots of trees but I will run the tests. Thanks for all your time – Yak Jul 31 '12 at 16:08
  • The default in R's `randomForest` is 500 trees. As long as you have less than 100k observations, that's likely a reasonable number to get importance measures. You could cut that back to ~200 and probably still get something useful. – Shea Parkes Jul 31 '12 at 17:15