Difference between two multi-variable datasets

Question

I have several activities that a person performs and, for each activity, the corresponding three-dimensional readings ($x$, $y$, $z$ coordinates) from two devices.

I need to find the difference between the readings from the two devices to tell which particular activity has the highest difference in reading between the devices.

Dataset1: The readings corresponding to one device https://i.stack.imgur.com/FtBUl.png

Dataset2: The readings corresponding to the other device https://i.stack.imgur.com/0WLmk.png

I am looking for a systematic way to approach this problem. I have found references to measures of difference between distributions, such as KL divergence, and to tests such as ANOVA. Could someone explain how to approach such a problem methodically?

Reading your first comment to my answer below, and reexamining your data, there are several things that are not clear to me: (1) Is the information in Dataset1 from Device 1 and the information in Dataset2 from Device 2? (2) Do row numbers in Dataset1 correspond to row numbers in Dataset2? In other words, is row 0 in Dataset1 matched with row 0 in Dataset2 (e.g., does row 3 in both datasets correspond to a unique subject number 3 who was measured on both devices, and row 4 to a different subject also measured on both devices)? — StatsStudent, Aug 22 '20 at 03:11
(continued) (3) Is the timestamp relevant to your analysis? In other words, is it possible that the time the experiment was conducted could have some bearing on the measurements you made, or is this irrelevant or unlikely? — StatsStudent, Aug 22 '20 at 03:12
@StatsStudent Yes, the two datasets contain the readings from the two devices. No, there is no correspondence between the rows; we have a set of rows that belong to a subject but no matching between individual rows. No, the timestamp is irrelevant. I am looking to comment on the difference between the readings in the two datasets as a whole, i.e., all the rows in the first dataset form a multidimensional distribution of points (a 3-D cluster), and I want some metric for how different it is from the second dataset. — rohan pandey, Aug 22 '20 at 09:40
OK. Thanks for the clarification. Given the updated information you gave me, I think your best bet is, in fact, to pursue the independent-samples Hotelling's $T^2$ multivariate test. I've updated my answer below with additional details for you. — StatsStudent, Aug 22 '20 at 23:03
@StatsStudent Thank you for your answer. I also want to know if doing a KS test on each axis and then taking a mean of the KS value at each of the axes is the right approach or not? — rohan pandey, Aug 24 '20 at 03:49
You don't really want to do that as you'll be doing multiple testing which will suffer from Type I error inflation. Also, if you found my answer acceptable, please accept it by clicking the green check mark, rather than upvoting. This helps encourage other users on the site to answer questions. Thanks! — StatsStudent, Aug 24 '20 at 04:04
@StatsStudent I have accepted the answer! I do not, however, fully follow the reasoning behind Type I error inflation. Say I calculate the Wasserstein distance for each axis and take the mean of the distances obtained across the axes. — rohan pandey, Aug 24 '20 at 11:26
Thanks. I'm not sure it's clear to me what you are proposing. It might be a good idea to open another question with your proposed procedure for using the Wasserstein distance. I was under the impression you were proposing computing means on each axis and testing each of them individually. By doing so, you increase the chance of finding a significant result above whatever you set your initial $\alpha$ level to. — StatsStudent, Aug 24 '20 at 13:12
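To illustrate the inflation mentioned in the last comment, here is a sketch under assumptions that are not stated in the thread: $m = 3$ separate tests (one per axis), each run at $\alpha = 0.05$, with the tests treated as independent. Readings on the three axes of one device are typically correlated, so the independence figure is only a rough guide; the bound $m\alpha$ holds regardless of dependence.

```latex
\begin{eqnarray*}
P(\text{at least one false rejection}) & = & 1-(1-\alpha)^{m}\\
 & = & 1-(0.95)^{3}\\
 & \approx & 0.143\;\le\; m\alpha = 0.15
\end{eqnarray*}
```

Under these assumptions, three per-axis tests at the 5% level would falsely flag a difference roughly 14% of the time, which is the reason a single multivariate test is preferred.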

StatsStudent · Accepted Answer · 2020-08-22T23:17:49.010

Are you looking for a way to test whether there is a difference in the mean readings between the two devices en masse? If so, I'd recommend a simple multivariate test called the paired Hotelling's $T^2$ test. You can see how this test works here: https://online.stat.psu.edu/stat505/lesson/7/7.1/7.1.8.

Alternatively, assuming your data meets the necessary assumptions, you could simply analyze this data using a simple linear mixed effects model where each subject is treated as a random effect and the device is a fixed effect (this is essentially a special case of a mixed effects ANOVA, so your intuition was pointing you in the right direction). Additional information about this approach can be found here: https://web.stanford.edu/class/psych252/section/Mixed_models_tutorial.html

Another approach that might be appropriate in this case is to fit a Generalized Estimating Equations model, which is similar to a linear mixed effects model, but the interpretation is slightly different. You can find more information about this approach here: https://online.stat.psu.edu/stat504/node/180/. See my previous answer to the question "Conditional vs. Marginal models" to understand the difference between this method and the linear mixed effects model.

Best of luck to you!

Update Based on Additional Information from Comments

Thanks for providing some clarifying details in the comments. Based on what you've now told me, I think your best bet is to carry out a two-sample Hotelling's $T^2$ test. The idea is to obtain the averages of your $x$, $y$, and $z$ variables in each dataset separately, giving one mean vector for dataset1 and one for dataset2, and then to test whether those two mean vectors are equal. In R, with the raw readings stored in data frames X_1 and X_2, this can be done as follows:

# Install the package if you don't already have it
# install.packages("rrcov")
library(rrcov)

# Readings from the first device (x1, y1, z1) and the second device (x2, y2, z2)
x1 <- c(-0.364761, -0.879730, 2.001495, 0.450623, -2.164352)
y1 <- c(8.793503, 9.768784, 11.109070, 12.651642, 13.928436)
z1 <- c(1.055084, 1.016998, 2.619156, 0.184555, -4.422485)
x2 <- c(7.091625, 4.972757, 3.253720, 2.801216, 3.770868)
y2 <- c(-0.591667, -0.158317, -0.191835, -0.155922, -1.051354)
z2 <- c(8.195502, 6.696732, 6.107758, 5.997625, 7.731027)

# Create the two datasets, one row per reading
X_1 <- data.frame(x1, y1, z1)
X_2 <- data.frame(x2, y2, z2)

# Carry out the two-sample Hotelling's T^2 test
T2.test(x = X_1, y = X_2)

Which returns the results:

    Two-sample Hotelling test

data:  X_1 and X_2
T2 = 305.627, F = 76.407, df1 = 3, df2 = 6, p-value = 3.596e-05
alternative hypothesis: true difference in mean vectors is not equal to (0,0,0)
sample estimates:
                     x1        y1        z1
mean x-vector -0.191345 11.250287 0.0906616
mean y-vector  4.378037 -0.429819 6.9457288

Since the $p$-value of this test is quite small (p-value = 3.596e-05), there is sufficient evidence to reject the null hypothesis of: \begin{eqnarray*} H_{0}:\boldsymbol{\mu_{1}} & = & \boldsymbol{\mu_{2}} \end{eqnarray*}

or equivalently, if we denote the means of $x$, $y$, and $z$ from the $i$th dataset as $\mu_{ix}$, $\mu_{iy}$, and $\mu_{iz}$ respectively, for $i=1,2$, the null hypothesis:

\begin{eqnarray*} H_{0}:\begin{pmatrix}\mu_{1x}\\ \mu_{1y}\\ \mu_{1z} \end{pmatrix} & = & \begin{pmatrix}\mu_{2x}\\ \mu_{2y}\\ \mu_{2z} \end{pmatrix} \end{eqnarray*}

and to conclude that, based on these data, the two devices give different mean readings.
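For readers who want to check the output by hand, here is a sketch of the standard two-sample statistic with a pooled covariance matrix. That T2.test uses exactly this pooled form is an assumption here, although the reported numbers are consistent with it. The symbols $n_1$, $n_2$ (sample sizes), $p$ (number of variables), $\bar{\mathbf{x}}_i$ (sample mean vectors), and $\mathbf{S}_i$ (sample covariance matrices) are notation introduced for this note.

```latex
\begin{eqnarray*}
\mathbf{S}_{\text{pooled}} & = & \frac{(n_{1}-1)\mathbf{S}_{1}+(n_{2}-1)\mathbf{S}_{2}}{n_{1}+n_{2}-2}\\
T^{2} & = & \frac{n_{1}n_{2}}{n_{1}+n_{2}}\,(\bar{\mathbf{x}}_{1}-\bar{\mathbf{x}}_{2})^{\top}\mathbf{S}_{\text{pooled}}^{-1}(\bar{\mathbf{x}}_{1}-\bar{\mathbf{x}}_{2})\\
F & = & \frac{n_{1}+n_{2}-p-1}{p\,(n_{1}+n_{2}-2)}\,T^{2}\;\sim\;F_{p,\;n_{1}+n_{2}-p-1}\quad\text{under }H_{0}
\end{eqnarray*}
```

With $n_1 = n_2 = 5$ and $p = 3$, the scaling factor is $6/24 = 0.25$, so $F = 0.25 \times 305.627 \approx 76.407$ on $(3, 6)$ degrees of freedom, which matches the reported F, df1, and df2.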

I hope this helps!

Thank you for your answer! The intent of the question is to identify the particular activities for which the two devices give completely different readings, and to give a relative ordering among these activities. How about taking the centroid of each device's readings on the x, y, z axes and just calculating the Euclidean distance between the two centroids for each activity? I am fairly new to this and would love to understand a systematic approach. — rohan pandey, Aug 21 '20 at 05:12
