This question is similar to this unanswered question.

I have several data sets, each corresponding to a different location. Every row of each data set contains a measurement of the same quantity taken at a particular time (though assume there is no autocorrelation). In other words, row i of every data set holds the measurement taken at the same time, with each data set covering a different location. It is safe to say that values on the same row will be correlated across data sets. Each row has a group associated with it (0 or 1, and the same across data sets). I would like to construct a test to see whether there is a statistically significant difference between the means of the two groups.

In all I have around 2000 observations in each dataset. The group split is not even - approximately 5%/95%. There are 8 datasets in total.

How can I go about constructing a test using all data?

Tim Hargreaves
  • Please provide more details about the nature of your data. Are there only those 2 columns in your data sets? Are the 2 groups the same in all of your data sets? Are the values all on the same scale, representing the same thing, in all of the data sets? Is the same row always in the same group? The correlation of any one row among data sets suggests that this could be modeled as repeated measures along rows, but the details provided so far are too sketchy to know. Also, please say how many rows there are, as the best way to proceed might depend on that. – EdM Dec 23 '19 at 17:50
  • Apologies, I was trying to keep it as simple as possible to save wasting people's time but you're right, that information is important. I will edit now – Tim Hargreaves Dec 23 '19 at 17:52

1 Answer

If there is no autocorrelation within each data set, then this could be handled by a mixed model. As the nature of each row is evidently the same in all the data sets, the 2000 rows would be included as random effects (e.g., the row number could serve the same function as a subject ID in other applications of mixed models) and the group and data set (location) would be treated as fixed effects.

It might be simplest to reformat your 8 separate data sets (one per location) into one large data frame. Within each data set, add a column for rowID and another column with an identifier of the location from which that data set was taken, giving you four data columns. Then just combine them all into a 16000-row by 4-column data frame.
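As a rough illustration of that reshaping step, here is a minimal pandas sketch. The column names `value` and `group`, the location labels, the helper `stack_locations`, and the simulated numbers are all assumptions made for the example; none of them come from the question.

```python
import numpy as np
import pandas as pd


def stack_locations(datasets):
    """Stack per-location frames into one long data frame.

    `datasets` maps a location label to a DataFrame with columns
    `value` and `group`, with rows in the same time order everywhere.
    """
    frames = []
    for location, df in datasets.items():
        out = df.reset_index(drop=True).copy()
        out["rowID"] = np.arange(len(out))  # same rowID = same time point
        out["location"] = location          # which data set the row came from
        frames.append(out)
    stacked = pd.concat(frames, ignore_index=True)
    return stacked[["rowID", "location", "group", "value"]]


# Simulated stand-in data: 8 locations, 2000 rows, roughly 5% of rows in group 1.
rng = np.random.default_rng(0)
n_rows, n_locations = 2000, 8
group = (rng.random(n_rows) < 0.05).astype(int)  # shared across locations
datasets = {
    f"loc{k}": pd.DataFrame({"value": rng.normal(size=n_rows), "group": group})
    for k in range(n_locations)
}

long_df = stack_locations(datasets)
print(long_df.shape)  # (16000, 4)
```

The key point is that `rowID` is assigned from row position, so it only lines up correctly if every data set is sorted by time in the same way before stacking.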

A simple mixed regression model would then be:

value ~ group + location + (1|rowID)

where the (1|rowID) term means that each rowID will be associated with a potentially different intercept on the value scale. This model assumes that the value for any individual observation for a rowID is the sum of its intercept and additive contributions from the group and location of the observation, plus a random error, and that there are no interactions beyond these strictly additive effects among group, location, or rowID in terms of determining the value.
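Written out, the additive model described above can be expressed as follows. The symbols are my own notation rather than anything from the question, and the normal distributions are the usual linear mixed model assumptions:

$$
y_{ij} = \beta_0 + \beta_G\, g_i + \lambda_j + u_i + \varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

Here $i$ indexes the rowID and $j$ the location, $g_i \in \{0, 1\}$ is the group of row $i$, $\lambda_j$ is the fixed effect of location $j$ (zero for the reference location), $u_i$ is the random intercept for row $i$, and $\varepsilon_{ij}$ is the residual error. The shared $u_i$ is what induces correlation among the values on the same row across locations, and $\beta_G$ is the difference between the group means that the test targets.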

As with any regression you might need to transform the values to meet linearity and equal-error-variance assumptions, or you might need to allow for interactions between group and location (that is, the effect of group might depend on the location, and thus vice-versa). In principle you could allow for random effects other than just in terms of the intercept associated with each rowID; see this page for some examples.

Your estimate of the effect of group will then be the regression coefficient estimated for group (if there is no interaction specified between group and location). There can be some issues with properly determining p-values for fixed-effect coefficients in mixed models, but my sense is that with all rowIDs represented once in each location and each rowID having the same group assignment in all cases that shouldn't be a problem. See this page and its links for discussion.
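For a concrete feel, here is a hedged sketch of fitting this kind of random-intercept model in Python with statsmodels. Its `mixedlm` does not accept the `(1|rowID)` term inside the formula; the random intercept is instead requested through the `groups` argument. The data are simulated, and the sample sizes, effect sizes, and variable names are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data (all numbers here are made up for illustration).
rng = np.random.default_rng(1)
n_rows, n_locations = 200, 4
true_group_effect = 2.0

group = np.zeros(n_rows, dtype=int)
group[:20] = 1                                       # unbalanced: 10% in group 1
row_effect = rng.normal(scale=1.0, size=n_rows)      # shared across locations

frames = []
for k in range(n_locations):
    value = (true_group_effect * group + 0.3 * k + row_effect
             + rng.normal(scale=0.5, size=n_rows))   # independent noise
    frames.append(pd.DataFrame({
        "rowID": np.arange(n_rows),
        "location": f"loc{k}",
        "group": group,
        "value": value,
    }))
long_df = pd.concat(frames, ignore_index=True)

# Fixed effects for group and location; random intercept per rowID via `groups`.
model = smf.mixedlm("value ~ group + C(location)", data=long_df,
                    groups=long_df["rowID"])
fit = model.fit()

group_estimate = fit.params["group"]   # estimated difference between groups
group_pvalue = fit.pvalues["group"]    # Wald test p-value for that coefficient
print(group_estimate, group_pvalue)
```

The p-value reported here is a Wald test, which is one of several approaches to inference on fixed effects in mixed models. To let the group effect differ by location, the formula could be changed to `value ~ group * C(location)`, in which case the single `group` coefficient no longer summarizes the overall group difference.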

EdM