I apologize in advance for my stats illiteracy; I'm not a stats guru by any stretch, but I am trying to learn. To start, I'll introduce what my data set looks like, then what I'd like to accomplish.

I am working with geological data (Euclidean vectors) that are numeric (float). I have two variables, which I'll call $x$ and $y$. In this sample, $x$ ranges from 0 to 200 and $y$ from 2.4 to 2.9. My sample size is approximately $n = 1000$.

If I understand $t$-tests correctly, they are useful for testing whether a difference in means between two groups is statistically significant. In my dataset, for example, I could divide the sample into two groups based on the value of $x$ (e.g., group 1 could be 0–100 and group 2 would then be 100–200) and then test whether those groups show meaningful variation with respect to variable $y$.

With that in mind, here is my question that I was hoping this community could help me answer:

I want to divide the dataset into two continuous groups based on the value of $x$ (e.g., one group might be 0–75, the other might be 75–200). In doing so, though, I'd like to find the value of $x$ at which the difference in means of the two groups (based on a $t$-test) is most significant. Is there an efficient way to run many $t$-tests at once to search for that value of $x$? The only other constraint I'd want to introduce is a minimum sample size for each group (say, 100). Or is an ad hoc approach the only way to do this?
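For what it's worth, the brute-force search described above is straightforward to sketch: sort by $x$, slide a cut point across the sorted sample, and compute a Welch $t$ statistic at each admissible split. The data below are synthetic stand-ins (the variable names and ranges mimic the post, nothing more), and — as the comments that follow make clear — the $p$-value attached to the winning split would not be valid.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 200, 1000)                       # synthetic stand-in for x
y = 2.65 + 0.0005 * x + rng.normal(0, 0.05, 1000)   # synthetic stand-in for y

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

min_group = 100                         # minimum sample size per group
order = np.argsort(x)
xs, ys = x[order], y[order]

best_t, best_cut = 0.0, None
for i in range(min_group, len(ys) - min_group + 1):
    t = welch_t(ys[:i], ys[i:])         # split: first i points vs the rest
    if abs(t) > abs(best_t):
        best_t, best_cut = t, xs[i]

print(best_cut, best_t)
```

`scipy.stats.ttest_ind(..., equal_var=False)` could replace the hand-rolled `welch_t`; the loop is $O(n)$ tests, so this is cheap even for $n = 1000$.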

If you could point me in the right direction or even help me with the correct stats terminology (to make it easier to search the web), I would greatly appreciate it. Thanks.

dagrha
  • "*I'd like to find the value of x where the difference in means of the two groups ... is most significant.*" -- to what purpose? The p-value will be meaningless as a result, so why would minimizing it be useful? What is the ultimate point? What questions are you trying to answer? What is it you want to be able to say about your data? – Glen_b Sep 21 '14 at 03:09
  • For one, I think it would just be useful to know whether, for example, the 0-75 & 75-200 groups are more different from each other than, say, the 0-125 & 125-200 groups are. The ultimate point is to transform $y$ (in terms of $x$) to a binary with the most statistically significant justification. – dagrha Sep 21 '14 at 03:46
  • Except by maximizing the statistic you no longer have justification for claiming significance. The p-values calculated by the usual means are wrong. If there are already groups in your data you can meaningfully ask questions about the relative size of differences, but you shouldn't dichotomize data that are continuous - you're *losing* ability to tell them apart (and biasing estimates to boot). Best to go back to the most basic questions you want to answer from the data and *ask how to do that*, rather than invent a procedure that won't do what you hope and ask how to implement it. – Glen_b Sep 21 '14 at 04:12
  • So in terms of transforming to a binary, are you implying that _arbitrarily_ dichotomizing the dataset would be _equally good_ (or equally bad) to doing it with a series of $t$-tests? – dagrha Sep 21 '14 at 04:21
  • No, I didn't say that. (Equally good by what criterion?) ... among other things, I was saying that any dichotomizing as a way of finding out what you seem to want to find out wouldn't be a good idea. – Glen_b Sep 21 '14 at 05:57
  • Imagine a continuous distribution that is bimodal; the bimodality would suggest that the sample could be composed of elements from two different populations. If someone asked you to dichotomize that dataset, how would you go about figuring out the best value at which to "slice" the dataset? In other words, above that "slice", any element is _most likely_ to come from the upper population, and below which the element is _most likely_ to come from the lower population. – dagrha Sep 21 '14 at 16:58
  • There are numerous methods for this already, for example, unlabelled classification/clustering methods, or if you want to model it as a mixture, there are suitable methods for that. If that's your problem, ask a question about dealing with that. – Glen_b Sep 22 '14 at 06:23
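To make the mixture-model route mentioned in the last comment concrete, here is a minimal pure-NumPy EM fit of a two-component 1-D Gaussian mixture to a synthetic bimodal sample (the component parameters are invented for illustration). A library implementation such as scikit-learn's `GaussianMixture` would do this more robustly; the resulting component memberships give a principled "slice" point.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(2.45, 0.03, 400),
                       rng.normal(2.75, 0.04, 600)])   # synthetic bimodal sample

mu = np.array([data.min(), data.max()])     # crude but workable initialization
sigma = np.array([data.std(), data.std()])
pi = np.array([0.5, 0.5])

for _ in range(200):                        # EM iterations
    # E-step: responsibility of each component for each point
    dens = (pi / (sigma * np.sqrt(2 * np.pi)) *
            np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: reweight means, variances, and mixing proportions
    nk = resp.sum(axis=0)
    pi = nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk)

# each point belongs to whichever component claims higher responsibility
print(sorted(mu))
```

The crossover of the two fitted densities, rather than an arbitrary cut or a maximized $t$ statistic, is then the natural boundary between the two populations.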

1 Answer

Why divide the group in two? What substantive question are you trying to answer by doing that?

Perhaps your substantive question might better be answered if you instead examine the bivariate distribution of $x$ and $y$ using scatter plot methods (say, like whuber's blurred/gamma-corrected scatter plot here), or looking at the marginal bivariate distributions using a scatter plot smoothing regression as in my answer to the same question. That way you get to understand the behavior of $x$ relative to $y$, without forcing an artificial and meaningless division of $y$.
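As a rough stand-in for the scatter plot smoothing suggested here (a proper lowess or spline smoother, e.g. statsmodels' `lowess`, would be the real tool), binned conditional means already expose the trend of $y$ in $x$ without dichotomizing anything. The data below are synthetic and only illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 200, 1000)
y = 2.9 - 0.002 * x + rng.normal(0, 0.05, 1000)   # weak noisy trend

bins = np.linspace(0, 200, 21)                    # 20 equal-width bins in x
idx = np.digitize(x, bins) - 1
centers = 0.5 * (bins[:-1] + bins[1:])
means = np.array([y[idx == k].mean() for k in range(20)])

# `means` vs `centers` traces the conditional mean of y given x
print(np.round(means, 3))
```

Plotting `centers` against `means` (on top of the raw scatter) shows the $x$–$y$ relationship as a curve, retaining all the information a forced two-group split would throw away.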

Alexis
  • You make a fair point about the "why" of this. In brief, $x$ is a measure of natural radioactivity of a rock sample; $y$ is the density of the rock. The hypothesis would be that rocks that have more natural radioactivity (claystones with Th, K) would be less dense than rocks with less natural radioactivity (e.g. carbonates) in this dataset. A cross-plot of $x$ vs. $y$ is very ambiguous: essentially scattershot with a weak relationship. – dagrha Sep 20 '14 at 22:30
  • But a $t$-test based on an aggregation of values into two categories (high and low, say) would hemorrhage statistical power... far worse than treating the data as they actually are: (bounded) continuous measures. A smoothing would provide you with a more nuanced measure of relationship, and have greater statistical power to inform inference. – Alexis Sep 20 '14 at 23:36
  • For what it's worth, these variables are actually Euclidean vectors (not bounded); in my original question I was just stating the range of this particular sample. Sorry-- I should have clarified! I agree that the graphical tools are powerful and elucidating. I'm still curious about how one might do it only mathematically. I guess I could try writing a script to accomplish it. – dagrha Sep 21 '14 at 03:49
  • @dagrha nonparametric regressions *are* mathematical. You could also base a parametric representation on what such a nonparametric regression would tell you. – Alexis Sep 21 '14 at 05:42