I have been a machine learning hobbyist for a while now and am attempting to formalise my understanding of the statistical foundations of data analysis and machine learning. My reading so far has led me to think the following approach is valid.
Say I have some data representing house sales: $n$ samples of a dependent variable $Y$, the Sale Price of the house, plus a number of independent variables describing features of the house or the transaction. For the purposes of this question I'll consider just one: $X_0$, "MSZoning", which classifies the transaction and is therefore a categorical variable.
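For concreteness, here is roughly how the frame is set up (a minimal sketch; the train.csv file name is an assumption on my part, though the column names match the Kaggle House Prices dataset):

import pandas as pd

# Assumed file name; MSZoning and SalePrice match the Kaggle
# "House Prices" (Ames) training data
df = pd.read_csv('train.csv')

# Y is SalePrice; X_0 is the categorical MSZoning column
print(df[['MSZoning', 'SalePrice']].head())
print(df.MSZoning.value_counts())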
I want to determine whether $X_0$ and $Y$ are associated to a statistically significant degree. My choices are to perform either an ANOVA or a Kruskal-Wallis test. My reading has led me to pick ANOVA unless the data fails the assumptions that make that test valid, so, going down the list of assumptions, I check for a normal distribution within each group by plotting the (log-transformed) distributions:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as ss
import matplotlib.pyplot as plt
# A dataframe is instantiated here

# Overlay the log-transformed SalePrice distribution of each MSZoning group
# (note sns.distplot is deprecated in newer seaborn releases;
# sns.histplot(..., kde=True) is its replacement)
fig, ax = plt.subplots(figsize=(24, 12))
for val in df.MSZoning.unique():
    sns.distplot(np.log1p(df.loc[df.MSZoning == val].SalePrice), ax=ax)
plt.show()
I'm making a judgement call at this point that those distributions are too far from normal for the ANOVA's assumptions to hold, and so I'm going to do a Kruskal-Wallis test. My first question is:
Is there a better way to assess whether the distributions for each group are close enough to normal for the ANOVA assumptions to hold than a simple judgement call?
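For what it's worth, the most formal alternative I'm aware of is an explicit normality test such as Shapiro-Wilk, run per group (a sketch below, reusing the df from above); my hesitation is that with groups this large even trivial departures from normality come back "significant":

import numpy as np
import scipy.stats as ss

# Sketch: Shapiro-Wilk test of normality on each group's log prices.
# A small p-value is evidence against normality, but with large
# samples the test flags even mild departures.
for val in df.MSZoning.unique():
    prices = np.log1p(df.loc[df.MSZoning == val, 'SalePrice'])
    W, p = ss.shapiro(prices)
    print(val, 'W =', round(W, 4), 'p =', p)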
I then ran the Kruskal-Wallis test using scipy:
# One array of SalePrice values per MSZoning group, unpacked into kruskal
a = [df.loc[df.MSZoning == c].SalePrice for c in df.MSZoning.unique()]
H, p = ss.kruskal(*a)
print('H-statistic: ', H, ' p-value: ', p)
And out pops the output: H-statistic: 270.0701971937021, p-value: 3.0807239995999556e-57. That's very strongly significant, so I then moved on to post-hoc analysis. SciPy seems to be lacking in this regard, so I used scikit-posthocs and chose the Conover-Iman test (purely on the basis of Cross Validated answers):
import scikit_posthocs as sp

sp.posthoc_conover(df, val_col='SalePrice', group_col='MSZoning', p_adjust='holm')
And the output is a table showing the pairwise p-values:
So, based on that, my conclusion would be that there is an association between $X_0$ and $Y$ that is significant at the 5% level (and likely at far stricter levels). The C (all) - FV, C (all) - RL, FV - RH, FV - RL, FV - RM, RH - RL, and RL - RM pairs all showed differences significant at the 5% level, while the remaining pairs did not show a statistically significant difference.
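Rather than reading the pairs off the table by eye, I also pulled them out programmatically (a sketch; it relies on posthoc_conover returning its p-values as a pandas DataFrame indexed by group, and the 0.05 cutoff is my choice):

import scikit_posthocs as sp

pvals = sp.posthoc_conover(df, val_col='SalePrice', group_col='MSZoning',
                           p_adjust='holm')

# Walk the upper triangle of the p-value matrix and report each
# pair falling below the 5% threshold
for i, g1 in enumerate(pvals.index):
    for g2 in pvals.columns[i + 1:]:
        if pvals.loc[g1, g2] < 0.05:
            print(g1, '-', g2, ': p =', pvals.loc[g1, g2])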
My second question is: have I made a mistake in my methodology, or is that conclusion reasonable?