Evaluation of Method to Classify Large Number of Unlabeled Data

Question

I have a small labeled training set of around 300 examples with 50 features each. I also have a large unlabeled dataset of around 30,000 examples with 50 features each. What is the best way to find the labels for the second dataset?

My current approach is:

1. Train a linear classifier as well as possible on the labeled data.
2. Run KNN on the unlabeled data, 50 examples at a time, and move the examples closest to the training examples into the labeled set.
3. Retrain the linear classifier on the enlarged training set.

And so on.

What are the labels like? Binary, multiclass, continuous...? – Alex L, Jul 01 '18 at 23:29
@AlexL The class labels are numeric, for example 1 for the first class, 5 for the fifth, etc. – Adam, Jul 01 '18 at 23:30
What kind of linear classifier are you using and how well is it performing? – Alex L, Jul 01 '18 at 23:37
@AlexL I use SVMs, with around 70% accuracy. This is a methodology I came up with myself. I am asking whether it can be trusted, and what the best method for this situation is. – Adam, Jul 01 '18 at 23:46

score 3 · Answer 1 · edited Jul 02 '18 at 19:20


There is a developing subfield of machine learning called active learning, in which you know the labels for a few examples and use them to suggest which unlabeled examples would be most useful to have labels for. You then label those examples by hand and retrain the classifier. It is a special form of semi-supervised learning.

So how do you do that? There is a Python library (assuming you are doing everything in Python) called libact, and its source code is on GitHub. It claims to work well with scikit-learn models.

In particular, libact models can be easily obtained by interfacing with the models in scikit-learn.
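To make the query step concrete, here is a minimal sketch of one common active learning strategy, uncertainty (margin) sampling. It does not use libact or reproduce its API. To stay dependency-free it swaps in a tiny nearest-centroid classifier for the SVM, and the names `fit_centroids`, `margin`, `query_indices`, `active_learning`, and `oracle` are all hypothetical. The `oracle` function stands in for the human who labels the requested examples, and the toy data is made up.

```python
import math
import random


def fit_centroids(X, y):
    """Fit the stand-in classifier: one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def margin(centroids, x):
    """Confidence proxy: distance gap between the two nearest class centroids."""
    nearest, runner_up = sorted(math.dist(x, c) for c in centroids.values())[:2]
    return runner_up - nearest


def query_indices(centroids, pool, k):
    """Indices of the k pool points the classifier is least sure about."""
    return sorted(range(len(pool)), key=lambda i: margin(centroids, pool[i]))[:k]


def active_learning(X_lab, y_lab, pool, oracle, k=5, rounds=3):
    """Repeatedly ask `oracle` (the annotator) to label the k least certain points."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        centroids = fit_centroids(X_lab, y_lab)
        picked = query_indices(centroids, pool, k)
        for i in picked:
            X_lab.append(pool[i])
            y_lab.append(oracle(pool[i]))  # a person would supply this label
        chosen = set(picked)
        pool = [x for i, x in enumerate(pool) if i not in chosen]
    return X_lab, y_lab, pool


# Toy data (made up): two 2-D blobs, with a simple rule standing in for the human.
rng = random.Random(0)


def blob(cx, cy, n):
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]


def oracle(x):
    return 1 if x[0] + x[1] < 4 else 2


X_lab = blob(0, 0, 5) + blob(4, 4, 5)
y_lab = [1] * 5 + [2] * 5
pool = blob(0, 0, 100) + blob(4, 4, 100)

X_new, y_new, remaining = active_learning(X_lab, y_lab, pool, oracle)
print(len(X_new), len(remaining))
```

With a real classifier, the margin would typically come from predicted class probabilities or decision-function values rather than centroid distances.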

P.S. A friend of mine says it works well only on Ubuntu, so keep that in mind.

Hope this helps.

answered Jul 02 '18 at 04:00 by tenshi · edited Jul 02 '18 at 19:20 by Aaron

FYI, that's not what active learning is. Active learning is when the algorithm is able to ask the human to label additional examples that it thinks would be most helpful for improving the classifier. – Aaron, Jul 02 '18 at 04:14
Yes, that's why I suggested it. Basically, he can stick only with semi-supervised algorithms, where human intervention is required. Thanks for the downvote :( – tenshi, Jul 02 '18 at 04:38
Edited your answer so I can upvote it – Aaron, Jul 02 '18 at 18:39

score 2 · Answer 2 · answered Jul 02 '18 at 18:42

The procedure you are using resembles a technique called self-training. It normally works like this:

1. Train the classifier on your labeled data.
2. Use it to predict the labels on all the unlabeled data.
3. Assume that the classifier is correct for the unlabeled examples for which it is most confident and add them to the labeled training data.
4. Go back to step 1.
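The loop above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a small nearest-centroid classifier stands in for the SVM so the code runs without dependencies, "confidence" is taken to be the distance gap between the two nearest class centroids, and the batch size of 50 mirrors the question. The names `fit_centroids`, `predict`, and `self_train` are hypothetical, and the toy data is made up.

```python
import math
import random


def fit_centroids(X, y):
    """Fit the stand-in classifier: one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, x):
    """Return (label, confidence); confidence is the gap between the two nearest centroids."""
    ranked = sorted((math.dist(x, c), label) for label, c in centroids.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]


def self_train(X_lab, y_lab, X_unlab, batch=50, rounds=10):
    """Self-training: repeatedly pseudo-label the most confident unlabeled points."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(rounds):
        if not pool:
            break
        centroids = fit_centroids(X_lab, y_lab)                 # step 1: train
        scored = [(predict(centroids, x), x) for x in pool]     # step 2: predict all
        scored.sort(key=lambda item: item[0][1], reverse=True)  # most confident first
        for (label, _), x in scored[:batch]:                    # step 3: trust the top batch
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for _, x in scored[batch:]]                   # step 4: repeat with the rest
    return fit_centroids(X_lab, y_lab), pool


# Toy data (made up): two well-separated 2-D blobs with class labels 1 and 2.
rng = random.Random(0)


def blob(cx, cy, n):
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]


X_lab = blob(0, 0, 5) + blob(5, 5, 5)
y_lab = [1] * 5 + [2] * 5
X_unlab = blob(0, 0, 100) + blob(5, 5, 100)
true_unlab = [1] * 100 + [2] * 100  # kept only to check the result

model, leftover = self_train(X_lab, y_lab, X_unlab)
hits = sum(predict(model, x)[0] == t for x, t in zip(X_unlab, true_unlab))
print(f"pseudo-label accuracy: {hits / len(X_unlab):.2f}, left unlabeled: {len(leftover)}")
```

On clean, well-separated data like this the pseudo-labels are almost always right. When the base classifier is weaker, wrong pseudo-labels are fed back into training and can reinforce themselves, so it helps to hold out some truly labeled examples to monitor accuracy between rounds.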
