
I have a time series of 70,000 data points. I want to separate out the samples of this time series that have very large values compared with the rest of the series. How can I choose a threshold to separate them?

What I am thinking of for a threshold: if the SD of a sample is s1, and the mean and SD of the whole time series are M and S, then a sample with s1 > M + S has larger values. Is this correct? Is it logical? Any suggestions?

Tim
user76816
  • What you are asking here is how to detect outliers. You can find multiple answers to such questions on this site: http://stats.stackexchange.com/questions/121071/can-we-use-leave-one-out-mean-and-standard-deviation-to-reveal-the-outliers or http://stats.stackexchange.com/questions/129274/outlier-detection-on-skewed-distributions or http://stats.stackexchange.com/questions/37865/is-there-a-simple-way-of-detecting-outliers/37876, so I would recommend starting with them. – Tim May 11 '15 at 11:24
  • Outliers are the specific values in a sample that cross a certain threshold, but in my case I want to consider a whole sample from my population. Is my threshold criterion, s1 > M + S, logical? – user76816 May 11 '15 at 11:33
  • You can take literally any subset of your sample; it is a matter of your decision, and there is nothing logical or illogical in it. However, check the answers in the links provided. You'll learn, for example, that the mean is very sensitive to outliers, so a more robust statistic could be more appropriate here. – Tim May 11 '15 at 11:37
  • You seem to be thinking of a threshold when (SD of sample) > (mean + SD of series). Do you really mean that? If so, it would at best select samples, not individual values. I suspect that you really are asking about value > mean + SD, which is at best an arbitrary threshold which will usually select a large number of values. – Nick Cox May 11 '15 at 11:41
  • Yes, I am asking about a threshold where (SD of sample) > (mean + SD of series). I am not sure, but does this mean that the values in this sample are much more widely spread than most of the data in the population? – user76816 May 11 '15 at 11:47
  • It doesn't make much sense to talk about outlier detection without having a particular purpose in mind. What is your purpose in doing this? – kjetil b halvorsen May 11 '15 at 13:20
  • My aim is to remove unexpected sudden jerks from the data. – user76816 May 11 '15 at 19:59
  • It's (now) not clear (at all to me) what you mean by sample. Please give examples of what you mean. They can be very small examples so long as they clarify what you want. – Nick Cox May 12 '15 at 15:10
  • For example, I have a time series $y=[1, 2, 3, 21, 23, 22, 1, 3, 4]$ with $M=\operatorname{mean}(y)$ and $SD=\operatorname{std}(y)$. Now I can divide $y$ into equal segments (samples), for example $y_1=[1, 2, 3]$, $y_2=[21, 23, 22]$, $y_3=[1, 3, 4]$, and calculate the SD of each of these samples. Is it logical to set a threshold so that when (SD of sample) > (mean(y) + SD(y)), that sample is selected because it may contain very high values compared with the other values in the time series $y$? – user76816 May 12 '15 at 21:16
  • Short answer is No. The SD of $y_2$ is exactly the same as for $y_1$, but its mean is quite different. If you want a criterion for high values it would have to be in terms of means, not SDs. But without a tighter specification for the process, any criterion will be quite arbitrary. – Nick Cox May 13 '15 at 00:23
  • But this is just an arbitrary example. When I tested this on EEG time series data, it gave quite good results. Is this criterion logical? If not, @NickCox, can you please suggest from your experience some statistical criterion for separating intervals with high peaks? – user76816 May 13 '15 at 21:51
  • Why not give us realistic examples then? – Nick Cox May 13 '15 at 23:51
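
The toy example from the comments can be checked numerically. The following is only a minimal sketch in Python, using the standard-library `statistics` module and the sample SD. The variable names (`seg_len`, `flag_by_sd`, `flag_by_mean`) are illustrative, and reusing M + S as the cutoff for segment means is an assumption made for comparison, not a recommended threshold.

```python
from statistics import mean, stdev

# Toy series from the comments above.
y = [1, 2, 3, 21, 23, 22, 1, 3, 4]
seg_len = 3  # assumed segment length, matching the example

# Split y into consecutive, non-overlapping segments y_1, y_2, y_3.
segments = [y[i:i + seg_len] for i in range(0, len(y), seg_len)]

M, S = mean(y), stdev(y)  # mean and sample SD of the whole series
threshold = M + S         # about 8.89 + 9.89 = 18.78

# Rule from the question: flag a segment if its SD exceeds M + S.
flag_by_sd = [stdev(seg) > threshold for seg in segments]

# Comparison in terms of means (arbitrary cutoff, for illustration only).
flag_by_mean = [mean(seg) > threshold for seg in segments]

print([round(stdev(seg), 2) for seg in segments])  # [1.0, 1.0, 1.53]
print(flag_by_sd)    # [False, False, False]
print(flag_by_mean)  # [False, True, False]
```

On this series the SD rule flags nothing, because $y_1$ and $y_2$ have the same SD even though $y_2$ holds the large values, while the comparison of means picks out $y_2$. Whether either rule behaves sensibly on real data such as EEG recordings depends on the data and on the purpose, as the comments point out.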

0 Answers