I have data on how long a person views a webpage. The durations vary widely and, in the context where I gathered them, are heavily right-skewed: people mostly spent a short time on a page but occasionally spent a very long time viewing it. I want to discretize the durations (in seconds) into short, medium and long, but I don't know how to do this when the data is skewed.

Initially, I just used tertiles, but the result seemed off. Tertiles force equal membership in each group, and I'm not sure that is appropriate given the skew. Any ideas on a better way of discretizing the values?
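For reference, the tertile approach can be sketched as follows. The durations here are synthetic (drawn from a lognormal distribution as a stand-in for right-skewed viewing times, which is an assumption and not the actual data), and the variable names are illustrative only. The point of the sketch is that the three groups have nearly equal counts while the "long" group spans a far wider range of seconds than the "short" one.

```python
import random
import statistics
from bisect import bisect_right

random.seed(0)
# Synthetic stand-in for viewing durations in seconds (assumed lognormal, right-skewed)
durations = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]

# Tertile cut points: the two values that split the data into three equal-sized groups
cuts = statistics.quantiles(durations, n=3)

# Map each duration to 0 (short), 1 (medium) or 2 (long)
labels = [bisect_right(cuts, d) for d in durations]

# Group sizes are (almost) equal by construction
counts = [labels.count(k) for k in range(3)]

# Width of each group in seconds; the last one is much wider because of the skew
widths = [cuts[0] - min(durations), cuts[1] - cuts[0], max(durations) - cuts[1]]

print("counts:", counts)
print("widths (s):", [round(w, 1) for w in widths])
```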

EDIT: I want to categorize the data because I plan to use it for reinforcement learning. Using the raw numerical values would enlarge the search space, so I thought of categorizing them.

Paul
  • 2
    Have you considered transforming the data first, maybe taking the log of the data to make it more 'normal'? – Eric Peterson Jun 08 '13 at 02:09
  • I tried taking the log of the data, and it does look more "normal", but how do I go about getting the groupings? – Paul Jun 08 '13 at 02:14
  • 3
    Why would you want groupings in the first place? It rarely makes sense to categorize continuous data; it may help you to read my answer here: [how-to-choose-between-anova-and-ancova-in-a-designed-experiment](http://stats.stackexchange.com/questions/24077//24080#24080), especially below "update". I readily acknowledge that the context differs from yours, but the idea is that discretizing data isn't generally a good thing to do. – gung - Reinstate Monica Jun 08 '13 at 02:49
  • 3
    I agree with you @gung. It's my fault for not including it in the description, but the reason I want to discretize the data is that I'd like to use it as input for a reinforcement learning algorithm. Using continuous or numerical data would produce a very large state space, so the algorithm could need more time and more examples to converge. – Paul Jun 08 '13 at 03:12
  • 1
    Please add extra information as an edit to the post, not only in comments. Not everybody reads comments! – kjetil b halvorsen Mar 22 '21 at 15:27
  • How about Gaussian mixture modelling? – corey979 Mar 22 '21 at 19:40
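The comments above suggest working on the log scale and, in the last one, Gaussian mixture modelling. The following is a minimal sketch of the log-scale idea only: it swaps in plain 1-D k-means with three clusters as a simpler stand-in for a mixture model (it is not a Gaussian mixture fit), then converts the cluster boundaries back to seconds. The data is synthetic (assumed lognormal), and `kmeans_1d`, `thresholds` and `categorize` are hypothetical names introduced for illustration.

```python
import math
import random

def kmeans_1d(values, k=3, iters=100):
    """Plain Lloyd's algorithm on a list of floats; returns the sorted centers."""
    s = sorted(values)
    # Initialise the centers at evenly spaced quantiles of the data
    centers = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty
        new = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sorted(centers)

random.seed(0)
# Synthetic stand-in for viewing durations in seconds (assumed lognormal)
durations = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]

# The log transform requires strictly positive durations
log_d = [math.log(d) for d in durations]
centers = kmeans_1d(log_d, k=3)

# Boundaries are midpoints between neighbouring centers, mapped back to seconds
thresholds = [math.exp((a + b) / 2) for a, b in zip(centers, centers[1:])]

def categorize(seconds):
    """Label a duration using the thresholds learned above."""
    if seconds < thresholds[0]:
        return "short"
    if seconds < thresholds[1]:
        return "medium"
    return "long"

print("thresholds (s):", [round(t, 1) for t in thresholds])
```

Unlike tertiles, the group sizes here are not forced to be equal; whether three clusters on the log scale are meaningful for a given data set would still need to be checked against the data.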

0 Answers