Cluster daily profiles of energy consumption

Question

I have a dataset consisting of half-hourly energy consumption figures for a few hundred office buildings. I am currently trying to build a model that clusters daily profiles into 3 groups:

Daily profile for working day.
Daily profile for non-working day.
Daily profile for non-operational day; the building might become inactive and its consumption falls to near-zero values.

Distinguishing between different kinds of working-day profiles, as can be observed in the second graph (in 2015 the office was consuming significantly less energy than in 2017), is not my goal. I would like to base this model only on consumption values, without using date information.

I was wondering whether an algorithm like kNN would be sufficient for this task or whether I should look into other options and, more importantly, what metrics should be used to describe a daily profile so that I can tell how similar two profiles are. One idea I had was to generate a line plot for each day and then cluster the pictures (I am pretty sure it is the least efficient way to do it, but I guess it could work). Another thought was to normalize the values for each day to the range 0-1 and put them into a few bins (as when creating a histogram), then use these bins to compare profiles to each other and cluster them into groups.

Question: What would be the most efficient and simple approach to tackle this problem? (I'd rather have a model that is slightly less accurate but easier to explain than the other way around.)

P.S. I work with Python, I have no knowledge of R.

I did, but I haven't found comprehensive answer that would fit all my needs. — dylan_fan, Jun 27 '18 at 14:26
Here are some: https://stats.stackexchange.com/questions/3238/time-series-clustering-in-r https://stats.stackexchange.com/questions/3331/is-it-possible-to-do-time-series-clustering-based-on-curve-shape https://stats.stackexchange.com/questions/131281/dynamic-time-warping-clustering https://stats.stackexchange.com/questions/9342/is-it-ok-to-use-manhattan-distance-with-wards-inter-cluster-linkage-in-hierarch — kjetil b halvorsen, Jun 29 '18 at 10:55

score 2 · Accepted Answer · answered Jun 27 '18 at 14:24

For time series clustering you have basically two different approaches:

Use of a suitable distance metric. You compare each pair of time series with some distance measure and cluster them based on those distances. A low value means two time series are "close". See, for example, Dynamic Time Warping (computationally heavy).
Use of feature extraction. A second approach is to extract some features from each time series, use them as a vector of information, and feed that to a clustering algorithm. More precisely:
- First you start with k time series, each of length n.
- For each time series, you extract m features (mean, variance, number of "jumps" over a threshold, and whatever other features you find appropriate).
- You form a feature matrix M of dimension k x m (basically you've performed dimensionality reduction on the k time series).
- Now you can use any clustering method (k-means, hierarchical clustering, SOM, etc.) on this matrix M.
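The feature-extraction steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the daily profiles are synthetic stand-ins (not data from the question), the three features (mean, standard deviation, daily range) are arbitrary choices, and `extract_features` is a hypothetical helper. Clustering uses scikit-learn's `KMeans` on standardized features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_slots = 48          # half-hourly readings per day
days_per_group = 30   # synthetic days generated for each kind of day

# Synthetic stand-in data (an assumption, not taken from the question).
office_hours = np.zeros(n_slots)
office_hours[16:36] = 80.0  # extra load between 08:00 and 18:00
working = 20.0 + office_hours + rng.normal(0, 3, (days_per_group, n_slots))
non_working = 20.0 + rng.normal(0, 2, (days_per_group, n_slots))
non_operational = np.abs(rng.normal(0, 0.3, (days_per_group, n_slots)))

# k time series of length n, stacked as rows.
profiles = np.vstack([working, non_working, non_operational])
true_group = np.repeat([0, 1, 2], days_per_group)


def extract_features(day_profiles):
    """Reduce each daily profile (one row) to m = 3 summary features."""
    return np.column_stack([
        day_profiles.mean(axis=1),                            # average level
        day_profiles.std(axis=1),                             # variability
        day_profiles.max(axis=1) - day_profiles.min(axis=1),  # daily range
    ])


M = extract_features(profiles)                 # feature matrix, shape (k, m)
M_scaled = StandardScaler().fit_transform(M)   # put features on one scale
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(M_scaled)

# Show how the days of each synthetic group are spread over the clusters.
for g, name in enumerate(["working", "non-working", "non-operational"]):
    print(name, np.bincount(labels[true_group == g], minlength=3))
```

Because these features depend on absolute consumption, real data covering buildings of very different sizes would probably need per-building scaling first; that step is left out of the sketch.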

Here are a couple of examples of this approach (a lot less computationally expensive):

Example 1 Example 2
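For comparison, the metric-based approach needs a pairwise distance between profiles. Below is a minimal, unoptimized sketch of the classic Dynamic Time Warping distance in plain Python, using the absolute difference as the local cost. The function name `dtw_distance` and the toy series are illustrative assumptions; a dedicated library would normally be preferable for real data.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Fills an (n+1) x (m+1) table of cumulative alignment costs, so time and
    memory are O(n * m) for every pair of series compared.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Toy profiles: the same bump, shifted by one step, and a flat line.
bump_late = [0, 0, 1, 2, 1, 0, 0]
bump_early = [0, 1, 2, 1, 0, 0, 0]
flat = [0, 0, 0, 0, 0, 0, 0]

print(dtw_distance(bump_late, bump_early))  # shifted copies align exactly
print(dtw_distance(bump_late, flat))        # the bump cannot be warped away
```

A matrix of such pairwise distances can then be passed to a clustering method that accepts precomputed distances, such as hierarchical clustering. The quadratic cost per pair is what makes this route heavy for many days and buildings.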
