Combining many small regressions to answer the question "what was most related?"

I have an interesting data set for which I want to answer the question:

Question

How do I rank consumption of a key resource from my data set? I am not interested in prediction, but in measuring the relative consumption of a resource. Ideally I would get a coefficient for each variable and use that coefficient to show a value of consumption, so a result might be:

For a given time interval (e.g. 60 minutes ago):

V1: 80

V2: 40

V3: 20

Here V1, with 80, consumed twice as much Y as V2.

About the data

My data set is organized in time intervals (let's say 1 second):

Variables: V1, V2, ... Vn, where n can range from 3 to 10,000. Each V is present only intermittently, for microseconds within a given second. Some Vs might be in the 10s, some might be in the 10,000s. In a given 60-second period I might see only a few Vs, but I will see all n Vs if I watch for the entire time.
Y: a resource which, intuitively, is consumed by each V.

Observations with regressions

I've run a lot of regressions and played with the data, but I'm getting weird answers.

If I look at a smaller number of rows (either random or consecutive), I get very good regression models: R² around 90% and good error terms.
The magic number of rows seems to be somewhere between 60 and 300. With anything over 300, the results are very noisy and lose predictive value.
If I look at the entire data set, my model gets really noisy and R² drops substantially.
Each V's time in milliseconds has very little relation to Y, so it can't be used as a proxy.
I frequently get a few Vs that have negative coefficients. These are not usable for the result, so I would generally hide them from the report. It's likely they are being affected by a queue of Vs which

Local regression

It seems to me that I could run local regressions on a fixed sample size, say 60-second intervals, and then combine the coefficients from each of these regressions.
Does this make sense? I researched local regression, and the techniques seem to fit, but they don't care about the coefficient values.
If it does make sense, how would I combine the coefficient values? I'm thinking a simple sum would be best.
Are there other statistical methods that better accomplish this goal?
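
A minimal sketch of the windowed idea described above, in Python with NumPy. It assumes the data is laid out as a matrix X with one row per interval and one column per V (for example a presence indicator or a count), and a vector y holding the Y consumed in each interval. The function names, the no-intercept model, and the synthetic data are all assumptions made for illustration, not something taken from the data described here. Rather than summing raw coefficients, the sketch multiplies each window's coefficient by that V's activity in the window before summing, so a V that was barely present in a window contributes little to its total.

```python
import numpy as np

def windowed_coefficients(X, y, window=60):
    """Fit least squares (no intercept) on consecutive, non-overlapping windows.

    Returns (coefs, activity), each of shape (n_windows, n_vars):
    coefs[w, j] is the coefficient of V_j in window w, and
    activity[w, j] is the column sum of V_j over that window.
    """
    coefs, activity = [], []
    for start in range(0, len(y) - window + 1, window):
        Xw = X[start:start + window]
        yw = y[start:start + window]
        # lstsq returns the minimum-norm solution, so a V that never
        # appears in this window (all-zero column) gets coefficient 0.
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        coefs.append(beta)
        activity.append(Xw.sum(axis=0))
    return np.array(coefs), np.array(activity)

def attributed_consumption(coefs, activity):
    """Sum coefficient * activity over windows: estimated Y consumed per V."""
    return (coefs * activity).sum(axis=0)

# Synthetic, noise-free example (hypothetical numbers).
rng = np.random.default_rng(0)
rows, n_vars = 180, 3
# Each V is active in roughly 30% of the intervals.
X = rng.random((rows, n_vars)) * (rng.random((rows, n_vars)) < 0.3)
true_beta = np.array([80.0, 40.0, 20.0])
y = X @ true_beta

coefs, activity = windowed_coefficients(X, y, window=60)
totals = attributed_consumption(coefs, activity)
ranking = np.argsort(totals)[::-1]  # column indices, largest consumer first
```

The plain sum asked about would be `coefs.sum(axis=0)`. Whether that or the activity-weighted total gives the more meaningful ranking depends on what the coefficients are taken to represent.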

score 1 · Answer 1 · answered Dec 21 '14 at 02:52

If this is data over time, you might consider models that gradually "forget" old data, such as exponentially weighted models; more sophisticated models exist which can account for seasonal and calendar effects.

On the other hand, it also suggests that there might be correlation over time, which you should check for.
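
As a rough illustration of both points, here is a sketch in Python with NumPy of a least-squares fit whose row weights decay exponentially with age, followed by a lag-1 autocorrelation check on the residuals. The half-life parameterization, the function names, and the synthetic data (coefficients that change halfway through) are assumptions for the example, not a specific recommended model.

```python
import numpy as np

def exp_weighted_fit(X, y, half_life=60.0):
    """Weighted least squares where a row's weight halves every `half_life` rows.

    The newest row (last in X) has weight 1. Scaling rows by sqrt(weight)
    turns the weighted problem into an ordinary least-squares problem.
    """
    n = len(y)
    age = np.arange(n - 1, -1, -1)   # 0 for the newest row
    weights = 0.5 ** (age / half_life)
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def lag1_autocorrelation(residuals):
    """Sample lag-1 autocorrelation; values far from 0 hint at correlation over time."""
    r = residuals - residuals.mean()
    return float((r[:-1] * r[1:]).sum() / (r * r).sum())

# Synthetic example (hypothetical numbers): the coefficients flip halfway through.
rng = np.random.default_rng(1)
X = rng.random((600, 3))
beta_old = np.array([80.0, 40.0, 20.0])
beta_new = np.array([20.0, 40.0, 80.0])
y = np.concatenate([X[:300] @ beta_old, X[300:] @ beta_new])

beta_recent = exp_weighted_fit(X, y, half_life=20.0)  # dominated by recent rows
residuals = y - X @ beta_recent
rho = lag1_autocorrelation(residuals)
```

With a short half-life the fit tracks the most recent regime, which is the "forgetting" behavior; a longer half-life moves it toward an ordinary fit over all rows.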

score 0 · Answer 2 · answered Aug 29 '13 at 19:31

If I look at an entire data set, my model gets really noisy and R2 drops pretty substantially.

How useful are regression statistics for rugged data?

Wavelets seem to fit more with how my visual processing system works, but may not be appropriate for your use case:

https://en.wikipedia.org/wiki/Wavelet
https://en.wikipedia.org/wiki/Wavelet_series
https://en.wikipedia.org/wiki/Multiresolution_analysis

Wes Turner

Does your visual processing system have variables that are constantly dynamic (i.e. some present, some not during any given time interval)? I did some reading on wavelets and their applicability to time series data, but this data doesn't really seem to exhibit those types of wave patterns. Maybe I don't understand them. I'll read more tonight. Yes, you are right about how problematic regression is for noisy data sets. The solution can be descriptive, so I'm not so concerned about the entire model's predictive accuracy. – zlooop Aug 29 '13 at 20:59
That sounds more like a liquid state machine: https://en.wikipedia.org/wiki/Liquid_state_machine – Wes Turner Aug 30 '13 at 00:12
Some more research has led me to https://en.wikipedia.org/wiki/Stone-Weierstrass_theorem – zlooop Aug 30 '13 at 15:50
