Displaying distributions of data with a high number of features

Question

I have data from two different groups. From each group I have, say, 100 samples, and each sample has 20 features.

I want to display the distribution of each of the two datasets. Now I have some very basic questions:

If I want to fit a distribution to the data of each group, do I fit a distribution to each sample of that group, or do I average over samples and then fit the distribution?
If I make a histogram of any one sample using all of its features, the histogram looks roughly normal. However, I assume that in 20-dimensional space the underlying distribution of the data would have to be a 20-dimensional Gaussian. So what does the distribution in the histogram represent, and which distribution is relevant for classification, say with a classifier that assumes normality?

score 0 · Answer 1 · answered Sep 02 '17 at 14:44

One method for visualizing such data is a parallel coordinates plot. You can color the lines by group to tell the groups apart and see whether they form consistent clusters.

Other than that, you might benefit from reducing the dimensionality with PCA and then making a scatterplot of the leading components.
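As a rough illustration of the PCA suggestion, here is a minimal sketch. It assumes synthetic Gaussian data shaped like the data in the question (two groups of 100 samples with 20 features each); the names `pca_project` and `plot_projection` are made up for this example, and PCA is computed with a plain NumPy SVD on the pooled data rather than with any particular library routine.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return X_centered @ Vt[:n_components].T, explained[:n_components]

# Assumption: synthetic stand-in for the real data (2 groups x 100 samples x 20 features).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 20))
group_b = rng.normal(loc=1.0, scale=1.0, size=(100, 20))
X = np.vstack([group_a, group_b])
labels = np.array([0] * 100 + [1] * 100)

# Fit PCA on both groups together so they share the same axes.
Z, explained = pca_project(X)

def plot_projection(Z, labels):
    """Scatterplot of the first two components, colored by group."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    for g, color in [(0, "tab:blue"), (1, "tab:orange")]:
        mask = labels == g
        plt.scatter(Z[mask, 0], Z[mask, 1], c=color, label=f"group {g}", alpha=0.7)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```

Calling `plot_projection(Z, labels)` draws the scatterplot. Keep in mind that two components only show part of the 20-dimensional structure, so `explained` is worth checking before reading much into the picture.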

Jakub Bartczuk

Thanks for the link, it's useful to know, although in my case I find the plot a bit hard to interpret for my data. Leaving the problem of dimensionality aside, would one normally look at a particular feature and, for that feature, use the data from all samples? I am confused about how far this helps, since I assume that classification is based on the joint distribution of the features. And if I plot all the features from all the samples and the distribution is (roughly) normal, does that imply that the joint distribution of my data is normal? – Sep 02 '17 at 15:04
"Leaving the problem of dimensionality aside..." - what do you mean? You didn't say anything about the classification. If you want to understand how something is classified, you need to fix the classifier first, because different classifiers use different methods. The data need not be jointly normal even if the marginal distributions are normal; see the answer to https://stats.stackexchange.com/questions/30159/is-it-possible-to-have-a-pair-of-gaussian-random-variables-for-which-the-joint-d – Jakub Bartczuk Sep 04 '17 at 07:47
I did mention classification above. My goal is to understand what the classifier's results are based on. I am aware that the joint distribution is not necessarily normal even if the marginals are, but since visualizing the joint distribution is not straightforward (or not possible?), I am trying to first get a feel for the data by looking at particular dimensions. – Sep 04 '17 at 11:03
You mentioned classes, but you didn't mention which classifier you use. – Jakub Bartczuk Sep 04 '17 at 11:07
SVM and LDA. – Sep 04 '17 at 12:32
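
The point made in the comments, that normal marginals do not imply a normal joint distribution, can be checked numerically. The sketch below uses a standard textbook construction (a standard normal variable multiplied by an independent random sign); it is an illustration on simulated data, not a statement about the data in the question, and the helper `excess_kurtosis` is defined here only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.standard_normal(n)
sign = rng.choice([-1.0, 1.0], size=n)  # independent random sign
y = sign * x                            # by symmetry, y is also standard normal

def excess_kurtosis(v):
    """Sample excess kurtosis; close to 0 for normally distributed data."""
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

# Each marginal passes a simple normality sanity check.
print("var(y) =", y.var(), " excess kurtosis(y) =", excess_kurtosis(y))

# If (x, y) were jointly normal, x + y would be normal as well.
# Instead x + y is exactly 0 whenever sign == -1, about half of the time.
frac_zero = np.mean(x + y == 0.0)
print("fraction of samples with x + y == 0:", frac_zero)
```

So a roughly normal histogram per feature is consistent with a jointly normal distribution, but it cannot confirm one.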

score 0 · Answer 2 · answered Sep 03 '17 at 20:17

Think of a lower-dimensional embedding that preserves characteristics of the higher-dimensional data, such as the distances or neighborhoods between samples. Sammon's mapping and t-SNE would be valuable approaches.

https://github.com/tompollard/sammon

https://lvdmaaten.github.io/tsne/
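
For a quick experiment with t-SNE, a minimal sketch follows. It assumes scikit-learn is installed and uses its `TSNE` class instead of the implementations linked above. The data is a synthetic stand-in for the two groups in the question, and the perplexity value is just a starting point to tune.

```python
import numpy as np
from sklearn.manifold import TSNE

# Assumption: synthetic stand-in for the real data (2 groups x 100 samples x 20 features).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 20)),
    rng.normal(loc=1.0, scale=1.0, size=(100, 20)),
])
labels = np.repeat([0, 1], 100)

# Embed all samples together; perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)  # one 2-D point per sample, ready for a scatterplot colored by `labels`
```

Note that t-SNE mainly preserves local neighborhoods, so distances between well-separated clusters in the plot should not be read quantitatively.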

knk
