I have an interesting data set for which I want to answer the question:
Question
How do I rank consumption of a key resource from my data set? I am not interested in prediction, but in measuring relative consumption of a resource. Ideally I would get a coefficient for each variable and use that coefficient to show a consumption value, so a result might be:
For a given time interval (e.g. 60 minutes ago):
V1: 80
V2: 40
V3: 20
Here V1, with a value of 80, consumed twice as much Y as V2.
About the data
My data set is organized in time intervals (let's say 1 second):
- Variables: V1, V2, ..., Vn, where n can range from 3 to 10,000. Vs are intermittently present for microseconds in a given second. Some Vs might be in the 10s, some might be in the 10,000s. In a given 60-second period, I might only see a few Vs. I will see all n Vs if I watch for the entire time.
- Y: Y is a resource which intuitively is consumed by each V.
Observations with regressions
I've run a lot of regressions and played with the data, but they are giving me weird answers.
- If I look at a smaller number of rows (either random or consecutive), I get very good regression models: R² around 90% and good error terms.
- It seems the magic number of rows is somewhere between 60 and 300. With anything over 300, the results are very noisy and lose predictive value.
- If I look at the entire data set, my model gets really noisy and R² drops pretty substantially.
- Each V's time in milliseconds has very little relation to Y, so it can't be used as a proxy.
- I frequently get a few Vs that have negative coefficients. These are not usable for the result, so I would generally hide them from the report. It's likely they are being affected by a queue of Vs which
Local regression
- It seems to me that I could run local regressions on a fixed sample size, say 60-second intervals, and then combine the coefficients from each of these regressions.
- Does this make sense? I researched local regression and the technique seems applicable, but it does not appear to care about the coefficient values.
- If it does make sense, how would I combine the coefficient values? I'm thinking a simple SUM would be best.
- Are there other statistical methods that better accomplish this goal?
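To make the idea above concrete, here is a minimal sketch of what I mean by running one regression per fixed-size window and then summing the coefficients. Everything in it is an assumption for illustration: the function name `windowed_coefficients` is hypothetical, `X` is taken to be a 0/1 indicator of whether each V was present in a given one-second row, the model has no intercept, and the data is synthetic rather than my real data. It uses plain least squares from NumPy, which does not stop coefficients from going negative.

```python
import numpy as np

def windowed_coefficients(X, y, window=60):
    """Fit one least-squares regression per consecutive block of `window` rows.

    X: (rows, n_vars) array, assumed here to be a 0/1 presence indicator per V.
    y: (rows,) array, the amount of Y consumed in each row.
    Returns an array of shape (n_windows, n_vars).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    coefs = []
    # Trailing rows that do not fill a complete window are dropped.
    for start in range(0, len(y) - window + 1, window):
        rows = slice(start, start + window)
        # lstsq returns the minimum-norm solution, so a V that never appears
        # in this window (an all-zero column) gets a coefficient of 0.
        beta, *_ = np.linalg.lstsq(X[rows], y[rows], rcond=None)
        coefs.append(beta)
    return np.array(coefs)

# Synthetic demo data (made up for illustration, not from my data set).
rng = np.random.default_rng(0)
true_rates = np.array([4.0, 2.0, 1.0])            # assumed per-V cost in Y
X = (rng.random((600, 3)) < 0.3).astype(float)    # each V present ~30% of rows
y = X @ true_rates + 0.01 * rng.standard_normal(600)

coefs = windowed_coefficients(X, y, window=60)    # one row per 60-second window
summed = coefs.sum(axis=0)                        # the "simple SUM" combination
scores = 100 * summed / summed.max()              # rescale so the top V is 100
ranking = np.argsort(-summed)                     # indices, highest consumer first

for i in ranking:
    print(f"V{i + 1}: {scores[i]:.0f}")
```

Note that a plain sum grows with the number of windows, so in this sketch only the ratios between Vs carry meaning, which is why the values are rescaled before reporting.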