
I’m writing some code (JavaScript) to compare benchmark results. I’m using the Welch T-test because the variance and/or sample size between benchmarks is most likely different. The critical value is pulled from a T-distribution table at 95% confidence (two-sided).

The Welch formula is pretty straightforward, but I am fuzzy on interpreting a significant result. I am not sure if the critical value should be divided by 2 or not. Help clearing that up is appreciated. Also, should I be rounding the degrees of freedom, df, to look up the critical value, or would Math.ceil or Math.floor be more appropriate?
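For reference, these are the Welch statistic and the Welch–Satterthwaite degrees of freedom the code below is meant to implement ($\bar{x}$ is the sample mean, $s^2$ the sample variance, and $n$ the sample size of each benchmark):

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}, \qquad \nu \approx \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{\left(s_A^2/n_A\right)^2}{n_A - 1} + \dfrac{\left(s_B^2/n_B\right)^2}{n_B - 1}}$$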

  /**
   * Determines if the benchmark's hertz is higher than another.
   * @member Benchmark
   * @param {Object} other The benchmark to compare.
   * @returns {Number} Returns `1` if higher, `-1` if lower, and `0` if indeterminate.
   */
  function compare(other) {
    // use welch t-test
    // http://frank.mtsu.edu/~dkfuller/notes302/welcht.pdf
    // http://www.public.iastate.edu/~alicia/stat328/Regression%20inference-part2.pdf
    var a = this.stats,
        b = other.stats,
        pow = Math.pow,
        // squared standard error of each mean: variance / sample size
        bitA = a.variance / a.size,
        bitB = b.variance / b.size,
        // Welch–Satterthwaite degrees of freedom
        df = pow(bitA + bitB, 2) / (pow(bitA, 2) / (a.size - 1) + pow(bitB, 2) / (b.size - 1)),
        // Welch t-statistic
        t = (a.mean - b.mean) / Math.sqrt(bitA + bitB),
        c = getCriticalValue(Math.round(df));

    // check if t-statistic is significant
    return Math.abs(t) > c / 2 ? (t > 0 ? 1 : -1) : 0;
  }

Update: Thanks for all the replies so far! My colleague [posted some more info here](http://stats.stackexchange.com/questions/5913/interpreting-two-sided-two-sample-welch-t-test/5961#5961), in case that affects the advice.

Mathias Bynens
4 Answers


(1a) You don't need the Welch test to cope with different sample sizes. That's automatically handled by the Student t-test.

(1b) If you think there's a real chance the variances in the two populations are strongly different, then you are assuming a priori that the two populations differ. It might not be a difference of location--that's what a t-test evaluates--but it's an important difference nonetheless. Don't paper it over by adopting a test that ignores this difference! (Differences in variance often arise where one sample is "contaminated" with a few extreme results, simultaneously shifting the location and increasing the variance. The large variance can make the shift in location difficult to detect (no matter how great it is) in a small to medium-sized sample, because the increase in variance is roughly proportional to the squared change in location. This form of "contamination" occurs, for instance, when only a fraction of an experimental group responds to the treatment.) Therefore you should consider a more appropriate test, such as a slippage test. Even better would be a less automated graphical approach using exploratory data analysis techniques.

(2) Use a two-sided test when a change of average in either direction (greater or lesser) is possible. Otherwise, when you are testing only for an increase or decrease in average, use a one-sided test.
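In code terms, a minimal sketch of the two decision rules, assuming a hypothetical `tQuantile(p, df)` that returns the `p`-quantile of Student's t distribution (the question's `getCriticalValue` currently takes only `df`):

  // Sketch: deciding significance at the 95% level for the two kinds of test.
  // tQuantile(p, df) is a hypothetical quantile function of Student's t
  // (a table lookup or a library call); t and df come from the Welch formulas.
  function isSignificant(t, df, tQuantile, twoSided) {
    if (twoSided) {
      // a change in either direction counts: compare |t| to the 0.975 quantile
      return Math.abs(t) > tQuantile(0.975, df);
    }
    // testing only for an increase: compare t to the 0.95 quantile
    return t > tQuantile(0.95, df);
  }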

(3) Rounding would be incorrect and you shouldn't have to do it: most algorithms for computing t distributions don't care whether the DoF is an integer. Rounding is not a big deal, but if you're using a t-test in the first place, you're concerned about small sample sizes (for otherwise the simpler z-test will work fine) and even small changes in DoF can matter a little.

whuber
  • Thanks for all your wonderful advice; it’s much appreciated! My colleague [posted some more info here](http://stats.stackexchange.com/questions/5913/interpreting-two-sided-two-sample-welch-t-test/5961#5961); does this affect anything? – Mathias Bynens Jan 04 '11 at 13:20

Dividing by 2 is for p-values. If you compare critical values, the division by 2 is not necessary. The function getCriticalValue should be the quantile function of Student's t distribution, so it should take two values: the probability and the degrees of freedom. If you want a two-sided hypothesis test, as your code indicates, then you need the 0.975 quantile.

For the rounding, since the degrees of freedom are positive, Math.round looks good.
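Applied to the question's compare function, whose getCriticalValue already returns two-tailed 95% critical values, the check simply drops the division by 2; a minimal sketch of the corrected return:

  // c is already the two-sided 95% critical value (the 0.975 quantile),
  // so compare |t| to it directly, with no division by 2.
  var c = getCriticalValue(Math.round(df));
  return Math.abs(t) > c ? (t > 0 ? 1 : -1) : 0;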

mpiktas
  • +1 For noticing the strange treatment of critical values in the OP's code! – whuber Jan 03 '11 at 17:59
  • Thanks for all your wonderful advice; it’s much appreciated! My colleague [posted some more info here](http://stats.stackexchange.com/questions/5913/interpreting-two-sided-two-sample-welch-t-test/5961#5961); does this affect anything? – Mathias Bynens Jan 04 '11 at 13:20
  • @Mathias Bynens, if you are really going the way of rounding degrees of freedom, the only fix is to drop the division by 2. – mpiktas Jan 04 '11 at 13:25

It's not absolutely necessary to round the degrees of freedom to an integer. Student's t-distribution can be defined for all positive real values of this parameter. Restricting it to a positive integer may make the critical value easier to calculate though, depending on how you're doing that. And it will make very little difference in practice with any reasonable sample sizes.
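If the critical value comes from a table keyed by integer degrees of freedom (like the T_DISTRIBUTION object posted in another answer below), one option is to interpolate linearly between the two nearest entries instead of rounding; a rough sketch, assuming df falls within the table's range:

  // Sketch: linearly interpolate a tabulated critical value for a non-integer df.
  // table maps integer degrees of freedom to two-tailed 95% critical values.
  function criticalValue(table, df) {
    var lower = Math.floor(df),
        upper = Math.ceil(df),
        weight = df - lower;
    return lower === upper
      ? table[lower]
      : table[lower] * (1 - weight) + table[upper] * weight;
  }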

onestop
  • Thanks for all your wonderful advice; it’s much appreciated! My colleague [posted some more info here](http://stats.stackexchange.com/questions/5913/interpreting-two-sided-two-sample-welch-t-test/5961#5961); does this affect anything? – Mathias Bynens Jan 04 '11 at 13:22
  • A sample size of 5 is a bit low. Can you increase the minimum a bit, say to 10 or 20? It's not just a question of rounding the d.f. The test is more sensitive to non-normality with small samples, and strictly speaking, your rule of sampling until the margin of error is below some threshold should affect the critical values of the test. – onestop Jan 04 '11 at 14:33
  • @onestop: Could you please explain a bit further? Especially the part about affecting the critical values of the test. A larger sample changes the degrees of freedom, since `df = sample size - 1`, but beyond that I’m afraid I don’t really understand what you mean. Thanks in advance! – Mathias Bynens Jan 04 '11 at 17:01
  • The property that distinguishes the t-distribution from the normal is that it takes into account the sample size. See for example http://www.math.unb.ca/~knight/utility/t-table.htm - note how the critical values are substantially larger at small degrees of freedom. – sesqu Jan 04 '11 at 17:55
  • Oh, that's not actually relevant. The question was about your sampling rule affecting the test. The issue there is that the t-statistics you are comparing to are built with the assumption that consecutive measurements are independent, but yours are not. Since you stop after the margin is sufficiently low, you are stopping all cases where the margin dips below the threshold but would soon come back up, and therefore you slightly overstate significances. – sesqu Jan 04 '11 at 19:05
  • @sesqu One of the reasons for stopping it after it reaches 1% is so tests don’t run forever or all run for 8 seconds. Would it be better to just stop after some time (like 8 seconds) with no percent threshold, or lower the threshold to below 1%? We also use 1% as a cutoff for calibrations that are used to counter the cost of some testing overhead, and taking it down to 1% can take some time and 750+ sample sizes. Is there a way to factor in the cutoff to counter any negative effects on significance? – Mathias Bynens Jan 05 '11 at 18:04
  • Yes. The easy way would be to stop at 8 seconds every time. To counter the effects, you would divide each p by 0.95^(i-1), depending on what you're really doing. – sesqu Jan 05 '11 at 20:01

I'm working with the OP on the benchmarking project and wanted to thank you all for clearing some things up. Also I wanted to provide a bit more information in case that affects the advice.

The sample size ranges from 5 to 700+ (as many as can be completed in 8 seconds, or until the margin of error is at or below 1%). The critical values are pulled from an object for simplicity (because other calculations determine the degrees of freedom as sample size minus 1).
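For clarity, the sampling loop is roughly the following (a simplified sketch with hypothetical helpers, not the actual benchmark code): keep sampling until 8 seconds have elapsed or the relative margin of error drops to 1% or below.

  // Simplified sketch of the stopping rule described above.
  // runBenchmarkOnce() and computeStats() are hypothetical helpers:
  // one timed run (ops/sec) and mean/SEM of the sample, respectively.
  var MAX_TIME_MS = 8000,
      MAX_RELATIVE_MOE = 0.01,
      start = Date.now(),
      sample = [];

  while (Date.now() - start < MAX_TIME_MS) {
    sample.push(runBenchmarkOnce());
    var stats = computeStats(sample),
        // margin of error: critical value (df = size - 1) times the standard error
        moe = getCriticalValue(sample.length - 1) * stats.sem;
    if (sample.length >= 5 && moe / stats.mean <= MAX_RELATIVE_MOE) {
      break;
    }
  }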

  /**
   * T-Distribution two-tailed critical values for 95% confidence
   * http://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm
   */
  var T_DISTRIBUTION = {
    '1':  12.706, '2':  4.303, '3':  3.182, '4':  2.776 /* , ... */
  };

Update

I checked, and the differences between the variances seem rather high.

Variances:
  • 4,474,400,141.236059
  • 3,032,977,106.8208385
  • 226,854,226,665.14194
  • 24,612,581.169126578

We are testing the operations per second of various code snippets (slower snippets give lower ops/sec, faster ones give higher ops/sec).

Also, we used to simply compare the overlap between each mean ± margin of error, but it was suggested that a t-test is better because it can establish statistical significance.

  • Please delete the t-table: it is pointless. It would suffice to say you are using the upper 97.5 percentile for your critical region. Of more import is a description of how the testing works and some examples of how the timings are distributed. These variances differ so greatly that it makes no sense to test for equality of means. – whuber Jan 04 '11 at 16:06
  • @whuber, The t-table is used to calc the margin of error of each benchmark result (each benchmark has a sample of runs). Because of such a difference in variances, would a simple comparison of mean ± margin of error suffice? You can see a sample test at http://jsperf.com/benchmark-js-test-page. Testing is done by determining the min time to run a test (in most cases the smallest unit is 1ms, so the time needed for a percentage uncertainty of 1% is 50ms ((1 / 2) / 0.01)); a test is executed repeatedly until the time passed is >= 50ms, the time is clocked, and the executions per second are computed. – John-David Dalton Jan 04 '11 at 17:11
  • @John-David You might as well show us a multiplication table to illustrate the multiplying you do :-). The sample test page is well done. You could enhance it slightly by indicating how many iterations each result is based on. You might also screen out obvious outliers. What I'm still unsure of is what you are comparing to what! If you're comparing my ops/sec to someone else's for the same benchmark, then you are probably best off using *logarithms* of the timing for your comparison (see the sketch after this comment thread). That should equalize the variances, eliminating that complication. – whuber Jan 04 '11 at 17:25
  • I believe they are comparing different approaches, not different users (though it is not clear if users are aggregated). That could justify different variances, as one algorithm might be amortized or somesuch. However, since runtimes may not be normal, I would recommend a brief look at some data to see if, for example, an Erlang distribution offset from below and censored from above would be a better fit than the Normal. Also, the test stopping condition isn't independent of the data, which should ideally be addressed. – sesqu Jan 04 '11 at 17:47
  • @whuber I posted the table to show we are not using a formula to resolve the critical value but rather a simple table and to show you/others the values in the table in case you/others took issue with them or caught something. – John-David Dalton Jan 04 '11 at 17:57
  • @whuber, "You might also screen out obvious outliers" > So one pass compute standard deviation (SD) and things, remove outliers, then a second pass recompute SD and margin of error, and things? – John-David Dalton Jan 04 '11 at 18:00
  • @whuber, "What I'm still unsure of is what you are comparing to what!" > In the example *(I linked to earlier)* when you click "Run Tests", and it finishes, it will rank each test ("Sort ascending", "Sort descending", "Don’t sort at all", and so on) based on their own mean ops/sec *(so it's comparing different tests to each other)*. The Browserscope results show the results of other browsers users ran the benchmarks in. – John-David Dalton Jan 04 '11 at 18:01
  • @John-David Thank you for explaining. I was confused because the heading for the Browserscope says "Results" and it is the only tabular summary. I now understand you are (a) expressing standard errors as percentages and (b) expressing mean speeds as percentages of the max. Both indicate you are already thinking in terms of *relative* (*i.e.*, proportional) comparisons, which strongly suggests basing the analysis on logs. (A difference in logarithms translates to a proportional difference in the original units.) – whuber Jan 04 '11 at 18:32
  • @John-David As far as the t-table goes, it's just laying out the results of 100 calculations, exactly like a multiplication table would lay out 100 multiplications. It doesn't matter *how* you get the t-values; what matters is *what they represent.* You are using them to construct two-sided 95% confidence intervals around the mean; that's all we (or anyone else) need to know. – whuber Jan 04 '11 at 18:34
  • @whuber, Thanks for your advice. Would you point me to some resources, or specific formulas for comparing logarithms, I can use/learn about? – John-David Dalton Jan 04 '11 at 18:37
  • @John-David I'm sure the Web is full of such resources. I posted some materials relevant to statistics, for post-graduate professionals who may have forgotten all about logarithms, at http://www.quantdec.com/envstats/notes/class_03/gm_etc.htm . – whuber Jan 04 '11 at 20:36
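Following up on whuber's suggestion to compare logarithms of the timings, here is a minimal sketch (an illustration, with hypothetical samplesA/samplesB arrays of ops/sec measurements) of producing log-scale stats that can be fed into the Welch test from the question; a difference of log means then corresponds to a proportional speed difference:

  // Sketch: compute mean and variance of log(ops/sec) for one benchmark's sample.
  // Feed the results into the Welch t-test in place of the raw stats.
  function logStats(samples) {
    var logs = samples.map(Math.log),
        mean = logs.reduce(function (sum, x) { return sum + x; }, 0) / logs.length,
        variance = logs.reduce(function (sum, x) {
          return sum + Math.pow(x - mean, 2);
        }, 0) / (logs.length - 1);
    return { mean: mean, variance: variance, size: logs.length };
  }

  // exp(statsA.mean - statsB.mean) estimates the ratio of the two speeds
  var statsA = logStats(samplesA),
      statsB = logStats(samplesB);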