I have a matrix of data (215 rows, 618 columns) holding xy positional data from a square surface. Most of the values are 0 and very few are 1. When I plot the data I see that the 1's form 2 small clusters. I'd like to use a clustering technique to colour the clusters automatically and to find out how many 1's (cells) make up each cluster. Can I use k-means or DBSCAN for this? The examples I've seen answered seem to use xy numeric data (if that makes sense), not xy positional data with only 1's and 0's.

Any help would be appreciated. Paul.

PaulB.

2 Answers

The simple way to do this is to treat each position with a value of 1 as an observation at that point, then cluster those observations with something like k-means.

For example, a $4\times4$ grid,

$\begin{array}{c|cccc} x \backslash y & 1 & 2 & 3 & 4 \\ \hline 1 & 1 & 1 & 0 & 0\\ 2 & 0 & 1 & 0 & 0\\ 3 & 0 & 0 & 0 & 1\\ 4 & 0 & 0 & 1 & 1\\ \end{array}$

could be treated as a set of observations by their coordinates,

$\begin{array}{cc} x & y \\ \hline 1 & 1 \\ 1 & 2 \\ 2 & 2 \\ 3 & 4 \\ 4 & 3 \\ 4 & 4 \\ \end{array}$.
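A minimal sketch of this idea in R, using the $4\times4$ grid from the example. The matrix name `m`, the choice of two centres, and the seed are assumptions made for illustration; `which(..., arr.ind = TRUE)` and `kmeans` are base R.

```r
# The 4x4 example grid (rows = x, columns = y)
m <- matrix(c(1, 1, 0, 0,
              0, 1, 0, 0,
              0, 0, 0, 1,
              0, 0, 1, 1),
            nrow = 4, byrow = TRUE)

# Coordinates of every cell equal to 1: one row per observation
coords <- which(m == 1, arr.ind = TRUE)
colnames(coords) <- c("x", "y")
storage.mode(coords) <- "double"  # make the indices numeric for kmeans

# k-means with 2 centres (assumes the number of clusters is known in advance)
set.seed(1)
fit <- kmeans(coords, centers = 2, nstart = 10)

# Number of cells in each cluster
cluster_sizes <- table(fit$cluster)
print(cluster_sizes)

# Colour the points by cluster membership
plot(coords, col = fit$cluster, pch = 19)
```

The same steps should carry over to the full 215 x 618 matrix. Note that k-means needs the number of clusters up front, which here is taken from what the plot suggests.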

Jonathan Lisic

You should transform your data from the current image-like representation (values stored at x/y positions of a matrix) into a data.frame that has x, y, and value/target columns:

# some dummy data
myData <- data.frame(expand.grid(x=1:20, y=1:20))
myData$target <- ifelse(randu[,1] < 0.8, 0, 1)
# this is what your data could look like (first rows only)
head(myData)
#   x y target
# 1 1 1      0
# 2 2 1      0
# 3 3 1      1
# 4 4 1      0
# 5 5 1      0
# 6 6 1      0
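For a real 0/1 matrix rather than dummy data, the same long format could be built with base R's `row()` and `col()`. This is only a sketch: the matrix `m`, its size, and the names `longData` and `ones` are made up for illustration.

```r
# Hypothetical 0/1 matrix standing in for the 215 x 618 data
m <- matrix(0, nrow = 5, ncol = 6)
m[1:2, 1:2] <- 1   # first block of ones
m[4:5, 5:6] <- 1   # second block of ones

# One row per cell: x = row index, y = column index, target = cell value
longData <- data.frame(x = as.vector(row(m)),
                       y = as.vector(col(m)),
                       target = as.vector(m))

# Keep only the cells that are 1; these are the points one would cluster
ones <- longData[longData$target == 1, c("x", "y")]
nrow(ones)
```

Subsetting to `target == 1` leaves a plain table of x/y positions, which is the kind of input most clustering functions expect.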

From here you could apply further approaches or visualize your data directly. The two sample plots below are just a starting point for further investigation; I would recommend looking at e.g. this answer for more ways:

# classic levelplot
library(lattice)
levelplot(x = target ~ x*y, myData, col.regions=c(0,1))

Levelplot

# scatterplot with alpha
library(scales)
plot(x = myData$x, y = myData$y, pch=19, col= alpha(myData$target+1, 0.5), cex=5)

Scatterplot with alpha

One more thing: you seem to have a target variable in your data (the 0 or 1 values). Note that clustering is usually unsupervised, i.e. applied to data without a target variable. Techniques similar to e.g. Nearest Centroid Classification might serve your purpose better.

geekoverdose