
I have a dataset that looks as follows:

userid  week1   week2   week3   week4   week5   week6   week7
1234    39724   34377   34377   38990   38298   39129   40500
2345    35960   39368   39368   39368   60732   37390   38836
3456    804     938     938     938     804     0       974
4567    5296    4872    4872    4872    4176    0       0

Here each row is the weekly electricity consumption of one user. From this I need to find the users whose consumption suddenly becomes abnormal, like users 3456 and 4567, i.e. I need to classify them as outliers/anomalies. I came across algorithms such as one-class SVM for novelty detection, but as you can see, the dataset itself already contains anomalous data. So before applying one-class SVM I need to remove the possibly anomalous data. Is there a good algorithm that will identify these as outliers?

Note: I also have a dataset that shows daily consumption instead of weekly. Sorry for the way I have presented my data; I am new to this place.
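To make "abnormal" concrete, here is a minimal sketch of one possible rule. It is only an assumption for illustration, not an established method: a week is flagged when it deviates from that user's own median by more than a fraction `rel_tol` of the median. The function name `flag_weeks` and the value `rel_tol=0.5` are hypothetical choices.

```python
from statistics import median

# Weekly consumption per user, copied from the table above.
weekly = {
    1234: [39724, 34377, 34377, 38990, 38298, 39129, 40500],
    2345: [35960, 39368, 39368, 39368, 60732, 37390, 38836],
    3456: [804, 938, 938, 938, 804, 0, 974],
    4567: [5296, 4872, 4872, 4872, 4176, 0, 0],
}


def flag_weeks(values, rel_tol=0.5):
    """Return 1-based week numbers that deviate from the user's median
    by more than rel_tol * median (rel_tol is an illustrative choice)."""
    med = median(values)
    if med == 0:
        # Relative deviation is undefined here; flag every nonzero week.
        return [i for i, v in enumerate(values, start=1) if v != 0]
    return [
        i for i, v in enumerate(values, start=1)
        if abs(v - med) > rel_tol * med
    ]


# Keep only the users that have at least one flagged week.
flagged = {}
for user, values in weekly.items():
    weeks = flag_weeks(values)
    if weeks:
        flagged[user] = weeks

print(flagged)  # {2345: [5], 3456: [6], 4567: [6, 7]}
```

With this rule the zero readings of users 3456 and 4567 are flagged, and so is the week-5 spike of user 2345, so the result depends directly on how "abnormal" is defined and on the chosen tolerance.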

  • Welcome to CrossValidated. The short answer is "No" unless you can give a more precise definition of "outlier". However, this has been discussed here before, see [Outlier detection](http://stats.stackexchange.com/questions/tagged/outliers). Look through those answers, and, if you have additional questions, come on back with them. – Peter Flom Feb 07 '15 at 13:11
  • Well, if you can't call it an outlier, we can treat it as anomalous behavior, such as what happens with userid 3456 and 4567, where the 6th and 7th week readings are not the expected values. I want to clean my dataset of these values. Is there any method to do this? Please advise... – user3146895 Feb 07 '15 at 15:16
  • The answer is the same; you need to be more precise in what you mean by "outlier" or "anomalous", then you can write a rule. – Peter Flom Feb 07 '15 at 15:27
  • The short answer is 'Yes!': of course there are algorithms for automatic outlier detection, and they do not need the outliers to be defined in any way besides being observations that in some sense don't follow the pattern of the majority of the data. What you ask for is a very routine procedure in data analysis. In fact, there are whole books on the subject, as well as many answers to this type of question here. Have you tried the top voted answers [here](http://stats.stackexchange.com/questions/213/what-is-the-best-way-to-identify-outliers-in-multivariate-data?answertab=votes#tab-top)? – user603 Feb 07 '15 at 15:41

0 Answers