I would like to do the following:

  • Train a classifier on a certain dataset
  • Test the classifier on a certain test set
  • Compute the test error and standard deviation
  • Compute a 95% confidence interval for the true error

I have a training set X_train with 1000 training examples, each having 15 features. The true labels are in Y_train. I also have a test set X_test of 1000 examples, with the true labels in Y_test.

So far, I have come up with the following code:

import math

import numpy as np
from sklearn.ensemble import RandomForestClassifier

scores = np.zeros(1000)

clf = RandomForestClassifier(criterion='entropy')
clf.fit(X_train, Y_train)
for i in range(1000):
    # score one test example at a time; reshape to the 2-D array sklearn expects
    score = clf.score(X_test[i].reshape(1, -1), [Y_test[i]])
    scores[i] = score

The above code fits the model on the training set and then calls the clf.score method on every test example separately. Consequently, the scores array is a binary array: each call tests a single example, which is either classified correctly (1) or incorrectly (0).
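
As an aside, the same binary scores could also be computed in one vectorized call. This is only a sketch, assuming the standard scikit-learn predict API:

# Predict all test labels at once and compare them to the truth;
# the elementwise comparison yields the same 0/1 scores as the loop.
scores = (clf.predict(X_test) == Y_test).astype(float)

Next, I compute the test error and standard deviation like this: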

ctr = 0
for i in scores:
    if i == 0:
        ctr += 1

test_error = ctr/1000.0
std = scores.std()
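
Since scores contains only 0s and 1s, the counting loop collapses to a one-liner. A compact equivalent (using ddof=1 for the sample standard deviation is my choice; NumPy defaults to ddof=0):

n = len(scores)
test_error = 1.0 - scores.mean()   # fraction of misclassified examples
std = scores.std(ddof=1)           # sample standard deviation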

I assume the distribution of the scores is approximately normal since I have 1000 test examples. Then, I compute the 95% confidence interval for the true error like this:

med = test_error
low = test_error - 1.645 * math.sqrt(std)
high = test_error + 1.645 * math.sqrt(std)

My question is: is this a correct way of computing the test error and the 95% confidence interval?

JNevens

1 Answer

The 95% confidence interval is incorrect. I think the formula you are looking for is $testError \pm t_{d.f.,\frac{\alpha}{2}} \cdot StdErr$

The first term, $t_{d.f.,\frac{\alpha}{2}}$, is the t-value for a specific $d.f. = degreesOfFreedom$ and $\alpha$. The degrees of freedom follow from the size of the test set and are simply $n-1=999$. $\alpha$ is the significance level you are looking for, in this case $\alpha=0.05$. However, since you are computing a two-sided 95% confidence interval, the t-value is 1.96, because there will be 2.5% on each side; 1.645 is the one-sided t-value. When you look up the t-value, make sure to check whether the table reports one-sided or two-sided values.

Also, we use the t-table instead of a z-table because we are estimating our variance from our data instead of using a known variance. For small data sizes (and small d.f.), notice how the t-value is larger than the corresponding z-value. And as the data size increases (and d.f. increases), we gain more confidence in our variance estimate and the t-value approaches the z-value. Since this test set is so large, the t-value can simply be replaced by the z-value.
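
To see this convergence numerically, here is a quick sketch using scipy.stats (SciPy is an extra dependency, not something the question itself uses):

from scipy.stats import norm, t

alpha = 0.05
for df in (5, 30, 100, 999):
    # two-sided critical value: the (1 - alpha/2) quantile of the t distribution
    print(df, t.ppf(1 - alpha / 2, df))
# the z-value the t-values approach, approximately 1.96
print("z:", norm.ppf(1 - alpha / 2))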

Finally, the standard error is incorrect. The square root is taken of the variance, $std=\sqrt{var}$, not of the standard deviation again. Most importantly, the confidence interval of the mean error relies on the Central Limit Theorem: as the test set gets larger, you should be more confident that the mean error is correct, and the interval will get smaller. $StdErr=\frac{std}{\sqrt{n}}$

In this case, the specific equation is

$testError \pm 1.96 \cdot \frac{std}{\sqrt{1000}}$
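
In code, reusing the scores array from the question, this becomes (a sketch; I take the sample standard deviation with ddof=1, which barely matters at n = 1000):

import math

n = len(scores)                               # 1000 test examples
test_error = 1.0 - scores.mean()
std_err = scores.std(ddof=1) / math.sqrt(n)   # CLT-based standard error

low = test_error - 1.96 * std_err
high = test_error + 1.96 * std_err
print(f"95% CI for the true error: [{low:.4f}, {high:.4f}]")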

Eric Farng