Related to my previous question, I have a dataset of 2D points, each with an associated label that can take 6 different values. As suggested in the answers to my other question, this can be modeled as a marked point process (or as 6 different point processes), which allows standard tools to be applied to study this dataset.
I would like to take the approach I suggested in my first question and apply PCA to this dataset, to see whether the different types of points are correlated (i.e. do some types always occur together?). Here's how I want to do it:
- Split my 2D space into a grid
- For each cell of that grid, count the number of points of each type. For one cell, this gives me a point in $\mathbb{R}^6 : x_i = (N_1(A_i), N_2(A_i), N_3(A_i), N_4(A_i), N_5(A_i), N_6(A_i))$, where $N_k(A_i)$ is the number of points of the $k^{th}$ point process (corresponding to points of type $k$) in the cell $A_i$
- Combine all the $x_i$ into a matrix $X \in \mathbb{R}^{6 \times M}$, where $M$ is the number of cells, and apply PCA to this matrix.
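To make the steps above concrete, here is a minimal sketch using NumPy. It assumes a regular grid, labels coded as integers 0 to 5, and synthetic uniform data with unequal type frequencies; the function names `grid_counts` and `pca` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def grid_counts(points, labels, n_types=6, bins=10):
    """Count points of each type in every cell of a regular bins x bins grid.

    Returns X of shape (n_types, M) with M = bins * bins, i.e. one column x_i per cell.
    """
    # Use the bounding box of all points so every type shares the same cells.
    extent = [[points[:, 0].min(), points[:, 0].max()],
              [points[:, 1].min(), points[:, 1].max()]]
    rows = []
    for k in range(n_types):
        sel = points[labels == k]
        h, _, _ = np.histogram2d(sel[:, 0], sel[:, 1], bins=bins, range=extent)
        rows.append(h.ravel())
    return np.vstack(rows)

def pca(X):
    """PCA of the columns of X via the eigen-decomposition of the covariance matrix."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center each type's counts
    cov = Xc @ Xc.T / (X.shape[1] - 1)          # (n_types, n_types) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # sort by decreasing variance
    return eigvals[order], eigvecs[:, order]

# Synthetic example (assumption): 2000 points, types with unequal intensities.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(2000, 2))
labels = rng.choice(6, size=2000, p=[0.5, 0.3, 0.1, 0.05, 0.03, 0.02])

X = grid_counts(points, labels, bins=10)
eigvals, eigvecs = pca(X)
```

Here `X.mean(axis=1)` gives the average count per cell for each type, which makes the imbalance between types directly visible.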
My question is the following: how do I build the grid? In other words, how do I discretize this dataset?
The difficulty is that the intensities of the processes are not equal: some types appear much more often than others. If I just use a regular grid (all cells have the same area), the resulting points will have one or two components that dominate the others.
I was thinking of building my grid such that each cell contains at most $N$ points, thus bounding the norm of the data points, but I don't think this will solve my "balance" problem.
Any suggestions or pointers to the literature are appreciated.
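One possible way to build such cells, sketched below, is a k-d tree style recursive median split that stops once a cell holds at most $N$ points. This is only an illustration under my own assumptions: the function name `kd_cells`, the value $N = 50$, and the synthetic data are hypothetical, and other partitioning schemes (e.g. a quadtree) would work as well.

```python
import numpy as np

def kd_cells(points, max_points, idx=None, depth=0):
    """Recursively split the point set at the median, alternating x and y,
    until every cell holds at most max_points points.

    Returns a list of index arrays, one per cell.
    """
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) <= max_points:
        return [idx]
    axis = depth % 2                                   # alternate split axis
    order = idx[np.argsort(points[idx, axis], kind="stable")]
    half = len(order) // 2                             # split by rank at the median
    return (kd_cells(points, max_points, order[:half], depth + 1)
            + kd_cells(points, max_points, order[half:], depth + 1))

# Synthetic example (assumption): same kind of imbalanced data as in the question.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(2000, 2))
labels = rng.choice(6, size=2000, p=[0.5, 0.3, 0.1, 0.05, 0.03, 0.02])

cells = kd_cells(points, max_points=50)
# One column of per-type counts for each cell: shape (6, M).
X_adaptive = np.stack([np.bincount(labels[c], minlength=6) for c in cells], axis=1)
```

Each column of `X_adaptive` sums to at most $N$, so the norm is bounded, but the proportions between types within a cell are unchanged, which is why I doubt this fixes the imbalance.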