
I'm trying to classify 3000+ documents based on hand coding of a sample of these documents (approximately 160), which I drew using sampling code in R. The documents will be classified into two categories, coded as a dummy variable (0 or 1).

While studying supervised learning, one part that confused me was whether the test-set documents need to be classified (labeled) prior to running the analysis.

I have classifications for the 160 hand-coded documents, but not for the test set, which contains approximately 2840 documents.

Since I don't have prior classifications for those documents, I won't be able to evaluate the classification (I won't be able to derive false negatives, false positives, etc.). Is this acceptable?

To rephrase: I'm trying to classify the documents in the test set using an algorithm derived from the hand-coded classifications in the training set.

user2298759

2 Answers


Yes, normally all our samples are already labeled, so we can evaluate the performance of the classifier by looking at how accurately it predicts the labels of an independent set of test data. If we can convince ourselves that the classifier is highly accurate, then perhaps in future we could use it to automatically label new data, because we're confident that those labels will be correct almost all of the time. However, if you're trying to establish your classifier's performance, I'm afraid you're going to need already-labeled data for that.

So my advice would be to either label some more of those documents, or split the 160 documents you already have into training and test sets (perhaps in a cross-validation procedure, sketched below, so that you can make optimal use of what little data you have). Ideally, though, you want as much labeled data as possible to train on, so if there's any chance you could get labels for all documents, that could really improve your outcomes. On the other hand, if the point is not to test your classifier but to put it to use in automatically labeling your remaining documents, then obviously you'll have to work with the labels you have (and, for testing the classifier first, split your labeled data as I suggested initially).
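For concreteness, here is a minimal sketch in R of the cross-validation idea, assuming the 160 hand-coded documents sit in a data frame `labeled` with a 0/1 outcome column `label` and numeric feature columns (these names are placeholders, not anything from the question):

    # k-fold cross-validation on the hand-coded documents (sketch).
    # Assumes a data frame `labeled` with a 0/1 column `label` and
    # numeric feature columns; both are placeholder names.
    set.seed(42)
    k <- 10
    folds <- sample(rep(1:k, length.out = nrow(labeled)))

    preds <- numeric(nrow(labeled))
    for (i in 1:k) {
      train <- labeled[folds != i, ]
      test  <- labeled[folds == i, ]
      fit   <- glm(label ~ ., data = train, family = binomial)
      preds[folds == i] <- predict(fit, newdata = test, type = "response")
    }

    # `preds` now holds a held-out predicted probability for every
    # labeled document, which you can compare against `labeled$label`.

With 160 documents and potentially many text features, a plain glm may struggle to converge; a penalized model (e.g. glmnet) is a common substitute, but the cross-validation logic stays the same.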

Ruben van Bergen
  • Yes, the purpose of my work is your last point: I want to use it to automatically label the remaining documents to save human coding time. But I have one question about then testing the classifier: what would it imply if the classifier performs poorly in this case, where the training-set labels are human generated? – user2298759 May 07 '17 at 13:41
  • @user2298759 why would it imply that? – shadowtalker May 07 '17 at 14:04
  • @ssdecontrol I don't know; I haven't tried it, so I'm just asking "what if". – user2298759 May 07 '17 at 14:32
  • If the classifier does a bad job at classifying your labeled examples (i.e. if the labels it generates are often incorrect, more often than is acceptable to you), then that means you probably can't trust the classifier to do well (enough) on your remaining documents. In that case you'd need to do something to improve your classifier, e.g. give it more examples to train on, more features to use, or a different algorithm. – Ruben van Bergen May 07 '17 at 15:25
  • @RubenvanBergen Thanks for the advice, I will keep that in mind. – user2298759 May 08 '17 at 11:39

This looks like a semi-supervised learning problem: a small hand-labeled sample plus a large unlabeled set. Searching for that tag at this site will turn up a list of similar posts. Maybe you should estimate a logistic regression (or some other direct probability model) on the labeled data, and then use that model to predict probabilities for the unlabeled data. You can then add those predicted probabilities to the data; there is no need to do a hard classification (or it can wait until it is really needed).
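A minimal sketch of that workflow in R, assuming data frames `labeled` (with a 0/1 column `label`) and `unlabeled` that share the same feature columns (all placeholder names):

    # Fit a direct probability model on the labeled sample.
    fit <- glm(label ~ ., data = labeled, family = binomial)

    # Predict probabilities (not hard 0/1 classes) for the unlabeled set
    # and attach them to the data.
    unlabeled$prob <- predict(fit, newdata = unlabeled, type = "response")

    # A hard classification, if one is ever needed, can be derived later,
    # e.g. unlabeled$class <- as.integer(unlabeled$prob > 0.5)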

Some commenters ask about accuracy. You should avoid that measure, since it is not a proper scoring rule; use a proper scoring rule in its place. See Using proper scoring rule to determine class membership from logistic regression and this sorted Google search for ideas.
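For instance, the Brier score and the logarithmic score are two standard proper scoring rules. A minimal R sketch, applied to held-out predicted probabilities such as the cross-validated `preds` from the sketch in the other answer:

    # Two proper scoring rules; lower is better for both.
    # `y` is the 0/1 label vector, `p` the predicted probabilities.
    brier   <- function(y, p) mean((p - y)^2)
    logloss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

    brier(labeled$label, preds)
    logloss(labeled$label, preds)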

kjetil b halvorsen